Mail Archives: djgpp/2000/04/17/18:03:34
Dieter Buerssner wrote:
> I referred to the T_Map() function you posted to this group. This
> can clearly be (quite efficiently) written in C. I didn't look at
> span().
Then you didn't have to say "This is not true.":
------------------8<----------------------
...
>Not really. The inner loop in my tmapper cannot be written in pure C.
>Believe me.
This is not true.
...
------------------8<----------------------
???
> >> I get rid of all your inline assembly in T_Map. I will be allowed
> >> to add one single line (say less than 50 characters from __asm__
> >> up to the closing ')' ) of inline assembly to your source. I bet,
> >> the plain C code will perform about the same, as your inline
> >> code. I win, when my code is no more than 2 FPS slower, or faster, than
> >> your code (The executable you sent reports 70 FPS here).
> >
> >How many such lines are there, in your opinion? :)
>
> I don't understand this question.
I thought you might have found something beyond the simple (int)(x) replacement,
so I asked whether there are many such lines. :)
> To elaborate, and make this on-topic again. Some of the code Alexei
> posted uses just "normal" floating point math. He coded almost all
> of this inline. I replaced this by the equivalent C-Code, that
> mostly was already there. Some minor modifications were something
> like
> /* a=d/c; b=e/c; */ /* This was already there in comments */
> replaced by
>
> #if USEC
> f=1.0/c; a=d*f; b=e*f;
> #else
> __asm__ /* ... */
> #endif
Btw, don't forget that this occurs in only one place; the other similar spots
are already written in C this way. So it's not a big deal. :)
> The same optimization Alexei has made in his inline assembler.
Yup.
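For readers following along, the reciprocal trick quoted above can be sketched
in plain C roughly like this. The names scale_div and scale_recip are made up
for the illustration and are not from the tmapper sources. Note that the two
versions may differ in the last bit whenever 1.0/c is not exactly representable.

```c
/* Illustration only: scale_div and scale_recip are hypothetical names. */

/* Straightforward version: two divisions. */
static void scale_div(double c, double d, double e, double *a, double *b)
{
    *a = d / c;
    *b = e / c;
}

/* One division plus two multiplications, as in the USEC branch above. */
static void scale_recip(double c, double d, double e, double *a, double *b)
{
    double f = 1.0 / c;   /* computed once, reused twice */
    *a = d * f;
    *b = e * f;
}
```

Division is typically much more expensive than multiplication on the FPU, which
is why trading one of the two divisions for two multiplications can pay off.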
> After this I recompiled, and the speed went up from 70 FPS to 72 FPS.
Oh man, 2/70 = 2.9% :))
> This, I think, proves that gcc is capable of producing quite efficient
> floating point code.
Surely it proves. But this is a bit strange, though. :)
> Of course, Alexei's code would have won, if he
> had replaced
>
> __asm__ volatile("fldl (%0)\n ...\nfstpl (%0)" : : "r" (&dbl));
Well, I didn't know that (int)(x) is slower. Btw, I need to take a look at the
.S file. I've not seen how this "round" is made yet.
> (Alexei, you got rid of the "g", but I think, here "memory"
> is needed in the clobber-list. I'm not totally certain, though.)
>
> with
>
> __asm__ volatile("fldl %0\n ...\nfstpl %0" : "=m" (dbl) : "0" (dbl));
>
> This would give gcc more chances to optimize. It uses
> less registers, and also needs less instructions. I have not tried
> this, but even then I think, the C code would not produce much
> less efficient code, than the inline assembly.
I cleaned up all my source today, before your post. :) There are "memory"
clobbers everywhere now, and some other stuff is improved too.
> Where gcc produces considerably less efficient code, is when you have
>
> int i;
> double a, b;
>
> i = (int)(a*b);
>
> Here, gcc always needs to save and restore the FPU control word, and
> there are a few occurrences of this type in Alexei's code. (I don't
> blame gcc here, I think it is almost impossible to do much better
> for a compiler.)
Stupid thing. It doesn't have to save/load the state of FPU. I think it's needed
only for such things as ceil() and floor(). (int)(x) should be w/o save/restore.
>
> I replaced the above code with
>
> /* can be #ifdefed and replaced by
>
> #define to_int(x) ((int)(x))
>
> for non gcc and i386, to make it even portable. Comments for other
> or more efficient methods to do double -> int conversions are welcome. */
Joking? :))
> __inline__ static int to_int(double x)
> {
> int r;
> __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x)); /* "t" is for st0 */
> return r;
> }
>
> ...
> #if USEC2
> i = to_int(a*b);
> #else
> __asm__ /* ... */
> #endif
>
> This is essentially, what the inline code of Alexei does. (I have
> not bothered to look up, whether the fistl instruction rounds,
> or chops, so this may be not the same as the C-code.)
Sure, FIST(P)L. :)
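The #ifdef idea from the quoted comment could look roughly like the sketch
below. It is only an illustration: the use of fistpl (store and pop) together
with the "st" clobber is an assumption of this sketch, not Dieter's posted code,
and the fallback branch is just the plain cast he mentions.

```c
/* Sketch only: not the code from the thread. */
#if defined(__GNUC__) && defined(__i386__)
static __inline__ int to_int(double x)
{
    int r;
    /* fistpl stores st(0) as an integer and pops it, so "st" is listed
       as clobbered. The result is rounded according to the current FPU
       rounding mode (round to nearest by default). */
    __asm__ volatile ("fistpl %0" : "=m" (r) : "t" (x) : "st");
    return r;
}
#else
/* Portable fallback: a plain C cast, which chops toward zero. */
static int to_int(double x)
{
    return (int)x;
}
#endif
```

For values that are not whole numbers the two branches can disagree: with the
default rounding mode the FIST version turns 2.7 into 3, while the cast gives 2.
Code that relies on chopping has to keep the cast.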
> While the to_int function is not optimal (gcc will have to code
> one superfluous fstp instruction, compared to fistpl), it is still
> quite a bit more efficient than C code. With these modifications,
> I got rid of all the other inline assembly. I got 70 FPS, the
> same as the original (either the self compiled sources, or the
> executable Alexei sent to me).
>
> Alexei's code will "cache" some values on the FPU stack, which
> gcc is not able to see (with the switches I used). Nevertheless,
> even here, with the help of only one line of inline assembly,
> it produces comparable results. Again, it would lose, when all
> those references and address-of operations would be omitted.
> It should be clear, that the compiler won't reach the efficiency
> of hand optimized assembler code. Whether the relatively small
> difference here is worth all the trouble, ...
Don't forget that my code didn't compile with either -O or -O2 back then. That
makes a difference. Note this.
> One last comment, on the T_Map function. The C-code version actually
> got quite a bit slower (5 FPS, IIRC), when compiled with -O2 or -O3,
> compared to -O only. The assembler version, not surprisingly, was
> not affected.
>
> There was one bug in the other part of the sources, that may be of
> general interest.
>
> [All the context omitted (Alexei, it's in your linev)]
> int c; /* only low byte used */
> __asm__ volatile("movb %0, al" : : "g" (c));
>
> This actually compiled with -O2, but got an error with -O by gas.
> It should be clear why - when gcc decides that c will live in memory
> or in a/b/c/dx, it will work, when it is in (say) esi, it won't.
> So, this is a nice example, why "but it works", doesn't buy you
> too much.
I cleaned this up in the morning.
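One way to avoid this kind of constraint problem is to ask gcc for a register
that has a byte-sized part. A minimal sketch, assuming gcc-style inline assembly
on x86; the helper name low_byte is hypothetical and is not from linev:

```c
/* Hypothetical helper, for illustration only. */
static unsigned char low_byte(int c)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned char out;
    /* "q" restricts an operand to a register with a byte-sized part
       (a/b/c/d on i386), so c can never end up in esi or edi there.
       %b1 prints the byte name of operand 1, e.g. %bl. */
    __asm__ ("movb %b1, %0" : "=q" (out) : "q" (c));
    return out;
#else
    /* Fallback for other compilers and targets. */
    return (unsigned char)(c & 0xFF);
#endif
}
```

Because the destination is declared as an output operand instead of being
hard-coded, gcc knows which register is written and no clobber list is needed.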
> Alexei, I have made some fun. I hope I have made up for it, by this
> post, that took actually longer to write, than the coding.
> I will send you the modified source by email. The post hopefully
> was of general interest.
Well, let me say a few words in conclusion. ;)
1. You simply proved that GCC has an optimizer efficient enough. Okay, I agree.
Your code, which runs 2 FPS faster for you, runs the same as before for me. I
don't think that counts as faster than mine (just 2.9%).
So, we have a good optimizer and you proved this. Great. I'm glad. This means I
can throw away a lot of inline ASM now.
2. If I had known that (int)(x) is slow, and if I had had a proper manual on
inline ASM, I would have achieved the same with fewer problems.
3. Dieter, I hope you won't try to convert span() to plain C. :)
This replacement isn't even nearly as fast:
--------------8<----------------
while (n--) {
    *scr++ = *(texture+((v1>>8)&0xFF00)+((u1>>16)&0xFF));
    u1 += du;
    v1 += dv;
}
--------------8<----------------
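To make the snippet above easier to try out, here it is wrapped in a
self-contained function. The function name span_c and all parameter types are
assumptions made for this sketch; only the loop body is taken from the snippet.
The masks suggest that u1 and v1 are 16.16 fixed-point coordinates into a
256x256 byte texture stored row by row, and the sketch assumes exactly that.

```c
#include <stdint.h>

/* Illustration only: span_c and its parameter types are assumptions. */
static void span_c(uint8_t *scr, const uint8_t *texture,
                   uint32_t u1, uint32_t v1,
                   uint32_t du, uint32_t dv, int n)
{
    while (n-- > 0) {
        /* (v1 >> 8) & 0xFF00 is the integer part of v times 256 (row offset);
           (u1 >> 16) & 0xFF is the integer part of u (column). */
        *scr++ = *(texture + ((v1 >> 8) & 0xFF00) + ((u1 >> 16) & 0xFF));
        u1 += du;   /* step across the texture in 16.16 fixed point */
        v1 += dv;
    }
}
```

Unsigned 32-bit types are used here so that the shifts and the wrap-around of
u1 and v1 are well defined in C.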
Anyway, thank you. And btw, thanks to myself. If I hadn't written efficient C code
between /* */ :), Dieter would never have proven that GCC has a good optimizer,
because he doesn't know the tmapping algorithm (do you?).
Seems that this is a story that can teach everyone (me=best example). :))
I think this thread is almost closed. Just one short question is left (I mean
span() :).
thanks.
Alexei A. Frounze
-----------------------------------------
Homepage: http://alexfru.chat.ru
Mirror: http://members.xoom.com/alexfru