Mail Archives: djgpp/2000/04/17/14:43:00
Alexei A. Frounze wrote:
>Dieter Buerssner wrote:
>> Alexei A. Frounze wrote:
>> >Not really. The inner loop in my tmapper cannot be written in pure C.
>> >Believe me.
>>
>> This is not true.
>
>Okay, interpolate U and V over a group of pixels, and don't forget &0xFF
>to be sure U and V don't exceed the 0...255 range (the span() function
>does this). I doubt your C code will be as fast as my ASM. Tell me what
>you've got when you're done.
I referred to the T_Map() function you posted to this group. That one
can clearly (and quite efficiently) be written in C. I didn't look at
span().
[Most of the stupid bet deleted. Seriously, I wouldn't have held you to
the bet, because I already knew the results.]
>> I'll get rid of all your inline assembly in T_Map. I will be allowed
>> to add one single line (say, less than 50 characters from __asm__
>> up to the closing ')') of inline assembly to your source. I bet
>> the plain C code will perform about the same as your inline
>> code. I win if my code is no more than 2 FPS slower than your code,
>> or faster (the executable you sent reports 70 FPS here).
>
>How many such lines are there, in your opinion? :)
I don't understand this question.
To elaborate, and to make this on-topic again: some of the code Alexei
posted uses just "normal" floating point math. He coded almost all of
this in inline assembly. I replaced it with the equivalent C code, which
mostly was already there. Some minor modifications were something
like
/* a=d/c; b=e/c; */ /* This was already there in comments */
replaced by
#if USEC
f=1.0/c; a=d*f; b=e*f;
#else
__asm__ /* ... */
#endif
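(Spelled out as a stand-alone sketch, with names of my own choosing:
the point of that change is simply to trade two fdiv instructions for
one fdiv plus two fmul's, and fdiv is far more expensive on the FPU.)

/* Sketch only -- not from Alexei's source. The straightforward
   a = d/c; b = e/c; costs two fdiv's; computing the reciprocal once
   costs one fdiv plus two much cheaper fmul's. */
static void project(double d, double e, double c, double *a, double *b)
{
  double f = 1.0/c;
  *a = d*f;
  *b = e*f;
}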
This is the same optimization Alexei made in his inline assembler.
After this I recompiled, and the speed went up from 70 FPS to 72 FPS.
This, I think, proves that gcc is capable of producing quite efficient
floating point code. Of course, Alexei's code would have won if he
had replaced
__asm__ volatile("fldl (%0)\n ...\nfstpl (%0)" : : "r" (&dbl));
(Alexei, you got rid of the "g", but I think "memory" is needed here
in the clobber list. I'm not totally certain, though.)
with
__asm__ volatile("fldl %0\n ...\nfstpl %0" : "=m" (dbl) : "0" (dbl));
This would give gcc more chances to optimize. It uses fewer registers
and also needs fewer instructions. I have not tried this, but even
then I think the C code would not come out much less efficient than
the inline assembly.
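To make the difference concrete, here is a minimal sketch of the two
forms (the dummy operation and the names are mine, not from the
tmapper). In the first, gcc only sees a pointer in a register, so it
has to assume the worst about memory; in the second, it can place the
operand itself:

/* Form 1: pointer in a register; the asm reads and writes through it,
   so "memory" has to be clobbered and a register is tied up. */
static void sqrt_ptr(double *dbl)
{
  __asm__ volatile("fldl (%0)\n\t"
                   "fsqrt\n\t"
                   "fstpl (%0)"
                   : : "r" (dbl) : "memory");
}

/* Form 2: let gcc pick the memory operand itself ("+m" is the same
   idea as "=m" plus a matching "0" input); no pointer register and
   no "memory" clobber, so gcc keeps more freedom to optimize. */
static void sqrt_mem(double *dbl)
{
  __asm__ volatile("fldl %0\n\t"
                   "fsqrt\n\t"
                   "fstpl %0"
                   : "+m" (*dbl));
}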
Where gcc produces considerably less efficient code is when you have
int i;
double a, b;
i = (int)(a*b);
Here, gcc always needs to save and restore the FPU control word,
because C requires truncation toward zero while the FPU's default
rounding mode is round-to-nearest, and there are a few occurrences of
this type in Alexei's code. (I don't blame gcc here; I think it is
almost impossible for a compiler to do much better.)
I replaced the above code with
/* Can be #ifdefed and replaced by
#define to_int(x) ((int)(x))
for non-gcc or non-i386 targets, to make it portable. Comments on
other or more efficient methods for double -> int conversion are
welcome. */
__inline__ static int to_int(double x)
{
  int r;
  __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x)); /* "t" is for st0 */
  return r;
}
...
#if USEC2
i = to_int(a*b);
#else
__asm__ /* ... */
#endif
This is essentially what Alexei's inline code does. (I have not
bothered to look up whether the fistl instruction rounds or chops, so
this may not be exactly the same as the C code.)
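(A quick way to find out: I believe fistl follows the FPU rounding
control, which defaults to round-to-nearest, while the cast always
chops toward zero. A stand-alone test, repeating the helper so it
compiles on its own:)

#include <stdio.h>

__inline__ static int to_int(double x)
{
  int r;
  __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x));
  return r;
}

int main(void)
{
  /* With the default control word this should print "3 2":
     fistl rounds 2.7 to the nearest integer, the cast chops it. */
  double x = 2.7;
  printf("%d %d\n", to_int(x), (int)x);
  return 0;
}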
While the to_int function is not optimal (gcc will have to emit one
superfluous fstp instruction, compared to using fistpl), it is still
quite a bit more efficient than the plain C code. With these
modifications, I got rid of all the other inline assembly. I got
70 FPS, the same as the original (either the self-compiled sources or
the executable Alexei sent to me).
Alexei's code will "cache" some values on the FPU stack, which gcc is
not able to see (with the switches I used). Nevertheless, even here,
with the help of only one line of inline assembly, it produces
comparable results. Again, it would lose if all those dereferences and
address-of operations were omitted. It should be clear that the
compiler won't reach the efficiency of hand-optimized assembler code.
Whether the relatively small difference here is worth all the
trouble, ...
One last comment on the T_Map function: the C version actually got
quite a bit slower (5 FPS, IIRC) when compiled with -O2 or -O3,
compared to -O only. The assembler version, not surprisingly, was not
affected.
There was one bug in the other part of the sources that may be of
general interest.
[All the context omitted (Alexei, it's in your linev)]
int c; /* only low byte used */
__asm__ volatile("movb %0, al" : : "g" (c));
This actually compiled with -O2, but produced an error from gas with
-O. It should be clear why: when gcc decides that c will live in
memory or in a/b/c/dx, it works; when c ends up in (say) esi, it
won't. So this is a nice example of why "but it works!" doesn't buy
you too much.
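For completeness, here is a sketch (my naming, not from the patched
source) of how the operand can be constrained so that gas can always
encode the instruction:

/* "q" restricts the operand to eax/ebx/ecx/edx, all of which have a
   low-byte form; the %b1 modifier prints the byte name (%al, %bl, ...)
   of operand 1, and register names get the proper %% escape. */
static unsigned char low_byte(int c)
{
  unsigned char r;
  __asm__("movb %b1, %0" : "=q" (r) : "q" (c));
  return r;
}

Of course, c & 0xFF in plain C does the same thing and lets gcc choose
the register; the asm is only there to mirror the original.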
Alexei, I have made some fun of you. I hope I have made up for it with
this post, which actually took longer to write than the coding did. I
will send you the modified source by email. Hopefully the post was of
general interest.
--
Regards, Dieter