Mail Archives: djgpp/2000/04/17/18:03:34

From: "Alexei A. Frounze" <alex DOT fru AT mtu-net DOT ru>
Newsgroups: comp.os.msdos.djgpp
Subject: Re: inefficiency of GCC output code & -O problem
Date: Tue, 18 Apr 2000 00:47:20 +0400
Organization: MTU-Intel ISP
Lines: 214
Message-ID: <38FB7858.41B090DB@mtu-net.ru>
References: <Pine DOT LNX DOT 4 DOT 10 DOT 10004161837540 DOT 1138-100000 AT darkstar DOT grendel DOT net> <38F9D717 DOT 9438A3F6 AT mtu-net DOT ru> <8df84a DOT 3vvqu6v DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de> <38FB4094 DOT DE7B5F4C AT mtu-net DOT ru> <8dfum2 DOT 3vvqu6v DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de>
NNTP-Posting-Host: ppp97-207.dialup.mtu-net.ru
Mime-Version: 1.0
X-Trace: gavrilo.mtu.ru 956007958 75509 212.188.97.207 (17 Apr 2000 21:45:58 GMT)
X-Complaints-To: usenet-abuse AT mtu DOT ru
NNTP-Posting-Date: 17 Apr 2000 21:45:58 GMT
Cc: buers AT gmx DOT de, eliz AT is DOT elta DOT co DOT il
X-Mailer: Mozilla 4.72 [en] (Win95; I)
X-Accept-Language: en,ru
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

Dieter Buerssner wrote:
> I referred to the T_Map() function you posted to this group. This
> can clearly be written (quite efficiently) in C. I didn't look at
> span().

Then you didn't have to say "This is not true.":
------------------8<----------------------
...
>Not really. The inner loop in my tmapper can not be written in pure C. 
>Believe me. 

This is not true.
...
------------------8<----------------------

???

> >> I get rid of all your inline assembly in T_Map. I will be allowed
> >> to add one single line (say less than 50 characters from __asm__
> >> upto the closing ')' ) of inline assembly to your source. I bet,
> >> the plain C code will perform about the same, as your inline
> >> code. I win, when my code is no more than 2 FPS slower, or faster, than
> >> your code (The executable you sent reports 70 FPS here).
> >
> >How many such lines are there, in your opinion? :)
> 
> I don't understand this question.

I thought you might find something beyond the simple (int)(x) replacement, so I
asked whether there are many such lines. :)


> To elaborate, and make this on-topic again. Some of the code Alexei
> posted uses just "normal" floating point math. He coded almost all
> of this inline. I replaced this by the equivalent C-Code, that
> mostly was already there. Some minor modifications were something
> like
>   /* a=d/c; b=e/c; */ /* This was already there in comments */
> replaced by
> 
> #if USEC
>   f=1.0/c; a=d*f; b=e*f;
> #else
>   __asm__ /* ... */
> #endif

Btw, don't forget that this is so only in one place while other similar things
are written in C this way. So, it's not a serious thing. :)
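
For readers following along, the reciprocal trick quoted above can be sketched in
plain C like this. The function names are hypothetical (they are not from the
tmapper source); only the a=d/c; b=e/c versus f=1.0/c form comes from the post.

```c
/* Hypothetical helpers illustrating the quoted optimization.
 * Two divisions are replaced by one division and two multiplies. */

/* Straightforward form: a = d/c; b = e/c; */
static void scale_by_division(double c, double d, double e,
                              double *a, double *b)
{
    *a = d / c;
    *b = e / c;
}

/* Reciprocal form: f = 1.0/c; a = d*f; b = e*f; */
static void scale_by_reciprocal(double c, double d, double e,
                                double *a, double *b)
{
    double f = 1.0 / c;   /* the only division */
    *a = d * f;
    *b = e * f;
}
```

The two forms are not guaranteed to agree in the last bit for every input, since
1.0/c is rounded before the multiplies, so this is a speed/accuracy trade-off.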


> The same optimization Alexei has made in his inline assembler.

Yup.

> After this I recompiled, and the speed went up from 70 FPS to 72 FPS.

Oh man, 2/70 = 2.9% :))

> This, I think, proves, that gcc is capable to produce quite efficient
> floating point code. 

Surely it proves. But this is a bit strange, though. :)

> Of course, Alexei's code would have won, if he
> had replaced
> 
>   __asm__ volatile("fldl (%0)\n ...\nfstpl (%0)" : : "r" (&dbl));

Well, I didn't know that (int)(x) is slower. Btw, I need to take a look at the
.S file. I've not seen how this "round" is made yet.


> (Alexei, you got rid of the "g", but I think, here "memory"
> is needed in the clobber-list. I'm not totally certain, though.)
> 
> with
> 
>   __asm__ volatile("fldl %0\n ...\nfstpl %0" : "=m" (dbl) : "0" (dbl));
> 
> This would give gcc more chances to optimize. It uses
> less registers, and also needs less instructions. I have not tried
> this, but even then I think, the C code would not produce much
> less efficient code, than the inline assembly.

I cleaned up all my source today before your post. :) There are "memory" words
everywhere now. And some other stuff is also improved.

> Where gcc produces considerably less efficient code, is when you have
> 
>   int i;
>   double a, b;
> 
>   i = (int)(a*b);
> 
> Here, gcc always needs to save and restore the FPU control word, and
> there are a few occurrences of this type in Alexei's code. (I don't
> blame gcc here, I think it is almost impossible to do much better
> for a compiler.)

Stupid thing. It shouldn't have to save/load the state of the FPU. I think that's
needed only for such things as ceil() and floor(). (int)(x) should be w/o
save/restore.

> 
> I replaced the above code with
> 
> /* can be #ifdefed and replaced by
> 
> #define to_int(x) ((int)(x))
> 
> for non gcc and i386, to make it even portable. Comments for other
> or more efficient methods to do double -> int conversions are welcome. */

Joking? :))

> __inline__ static int to_int(double x)
> {
>   int r;
>   __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x)); /* "t" is for st0 */
>   return r;
> }
> 
> ...
> #if USEC2
>   i = to_int(a*b);
> #else
>   __asm__ /* ... */
> #endif
> 
> This is essentially, what the inline code of Alexei does. (I have
> not bothered to look up, whether the fistl instruction rounds,
> or chops, so this may be not the same as the C-code.)

Sure, FIST(P)L. :)
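
The rounds-or-chops question matters because a C cast truncates toward zero,
while a store through the FPU rounds according to the current rounding mode
(to nearest by default). Below is a minimal portable sketch of the difference.
It does not use FIST at all; it swaps in the well-known "magic number" trick
instead, and it assumes IEEE 754 doubles, the default round-to-nearest mode,
|x| < 2^31, and a two's complement int. The names to_int_trunc and
to_int_nearest are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* What the C cast does: truncate toward zero. */
static int to_int_trunc(double x)
{
    return (int)x;
}

/* Round-to-nearest without a cast (assumes IEEE 754 doubles and the
 * default rounding mode; valid only for |x| < 2^31). */
static int to_int_nearest(double x)
{
    const double magic = 6755399441055744.0;   /* 2^52 + 2^51 */
    /* volatile forces the sum to be stored as a 64-bit double, so the
     * addition itself rounds x to an integer. */
    volatile double biased = x + magic;
    double tmp = biased;
    uint64_t bits;

    memcpy(&bits, &tmp, sizeof bits);           /* read the raw bit pattern */
    /* The low 32 mantissa bits now hold the rounded value. */
    return (int)(int32_t)(uint32_t)(bits & 0xFFFFFFFFu);
}
```

For example, 2.7 gives 2 with the cast and 3 with the rounding version, and
-2.7 gives -2 and -3 respectively. Ties go to the even neighbour (2.5 gives 2).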

> While the to_int function is not optimal (gcc will have to code
> one superfluous fstp instruction, compared to fistpl), it is still
> quite a bit more efficient than C code. With these modifications,
> I got rid of all the other inline assembly. I got 70 FPS, the
> same as the original (either the self compiled sources, or the
> executable Alexei sent to me).
> 
> Alexei's code will "cache" some values on the FPU stack, which
> gcc is not able to see (with the switches I used). Nevertheless,
> even here, with the help of only one line of inline assembly,
> it produces comparable results. Again, it would lose, when all
> those references and address-of operations would be omitted.
> It should be clear, that the compiler won't reach the efficiency
> of hand-optimized assembler code. Whether the relatively small
> difference here is worth all the trouble, ...

Don't forget that my code didn't compile with either -O or -O2 at the time. That
makes a difference; note this.

> One last comment, on the T_Map function. The C-code version actually
> got quite a bit slower (5 FPS, IIRC), when compiled with -O2 or -O3,
> compared to -O only. The assembler version, not surprisingly, was
> not affected.
> 
> There was one bug in the other part of the sources, that may be of
> general interest.
> 
> [All the context omitted (Alexei, it's in your linev)]
>   int c; /* only low byte used */
>   __asm__ volatile("movb %0, al" : : "g" (c));
> 
> This actually compiled with -O2, but got an error with -O by gas.
> It should be clear why - when gcc decides that c will live in memory
> or in a/b/c/dx, it will work, when it is in (say) esi, it won't.
> So, this is a nice example, why "but it works", doesn't buy you
> too much.

I cleaned up this in the morning.
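
A minimal sketch of one way to avoid the constraint problem described above,
assuming GCC-style inline assembly on x86. This is not the actual fix from
linev; the function name low_byte is hypothetical, and it returns the byte
instead of loading al so that the result can be checked. The "q" constraint
restricts the operand to registers that have a byte form, and the %b modifier
prints that byte name.

```c
/* Hypothetical example: fetch the low byte of c through a byte move.
 * "q" only allows registers with an addressable low byte, so the
 * compiler can never pick something like esi here, at any -O level. */
static unsigned char low_byte(int c)
{
    unsigned char r;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("movb %b1, %0" : "=q" (r) : "q" (c));
#else
    r = (unsigned char)c;   /* portable fallback for other targets */
#endif
    return r;
}
```

With "g" instead of "q" the code would still assemble whenever the register
allocator happens to pick a suitable location, which is why it worked with one
optimization level and failed with another.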

> Alexei, I have made some fun. I hope I have made up for it, by this
> post, that took actually longer to write, than the coding.
> I will send you the modified source by email. The post hopefully
> was of general interest.

Well, let me say a few words in conclusion. ;)

1. You simply proved that GCC has an optimizer efficient enough. Okay, I agree.
Your code, which runs 2 FPS faster for you, runs the same as before for me. I
don't think that counts as faster than mine (it's just 2.9%).
So, we have a good optimizer and you proved this. Great. I'm glad. This means I
can throw away a lot of inline ASM now.

2. If I had known that (int)(x) is slow, and if I had had a proper manual on
inline ASM, I would have achieved the same with fewer problems.

3. Dieter, I hope you won't try to convert span() to plain C. :)
This replacement isn't nearly as fast:
--------------8<----------------
      while (n--) {
        *scr++ = *(texture+((v1>>8)&0xFF00)+((u1>>16)&0xFF));
        u1 += du;
        v1 += dv;
      };
--------------8<----------------
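
For anyone who wants to try the loop above in isolation, here is a
self-contained sketch of it. The function name span_c and its parameter list
are hypothetical. Judging by the masks, it assumes a 256x256 texture of 8-bit
texels and 16.16 fixed-point u/v coordinates; treat that as an assumption, not
as a description of the real span().

```c
#include <stdint.h>

/* Hypothetical plain-C span filler built around the loop above.
 * u1, v1: 16.16 fixed-point texture coordinates of the first pixel.
 * du, dv: 16.16 fixed-point steps per pixel.
 * texture: assumed to be 256x256 bytes, one row every 256 bytes. */
static void span_c(unsigned char *scr, const unsigned char *texture,
                   int n, uint32_t u1, uint32_t v1,
                   int32_t du, int32_t dv)
{
    while (n-- > 0) {
        /* (v1 >> 8) & 0xFF00 is the integer row times 256,
         * (u1 >> 16) & 0xFF  is the integer column. */
        *scr++ = *(texture + ((v1 >> 8) & 0xFF00) + ((u1 >> 16) & 0xFF));
        u1 += (uint32_t)du;
        v1 += (uint32_t)dv;
    }
}
```

Stepping u by 0x10000 moves one texel to the right per pixel, and stepping by
0x8000 repeats each texel twice.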

Anyway, thank you. And btw, thanks to myself. If I hadn't written efficient C code
between /* */ :), Dieter would never have proved that GCC has a good optimizer,
because he doesn't know the tmapping algorithm (do you?). 

Seems that this is a story that can teach everyone (me=best example). :))

I think this thread is almost closed. Just one short question is left (I mean
span() :).

thanks.
Alexei A. Frounze
-----------------------------------------
Homepage: http://alexfru.chat.ru
Mirror:   http://members.xoom.com/alexfru
