Mail Archives: djgpp/2000/04/17/14:43:00

From: buers AT gmx DOT de (Dieter Buerssner)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: inefficiency of GCC output code & -O problem
Date: 17 Apr 2000 19:17:53 GMT
Lines: 147
Message-ID: <>
References: <Pine DOT LNX DOT 4 DOT 10 DOT 10004161837540 DOT 1138-100000 AT darkstar DOT grendel DOT net> <38F9D717 DOT 9438A3F6 AT mtu-net DOT ru> <8df84a DOT 3vvqu6v DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de> <38FB4094 DOT DE7B5F4C AT mtu-net DOT ru>
NNTP-Posting-Host: (
Mime-Version: 1.0
X-Trace: 955999073 8147136 (16 [17104])
X-Posting-Agent: Hamster/
User-Agent: Xnews/03.02.04
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

Alexei A. Frounze wrote:

>Dieter Buerssner wrote:
>> Alexei A. Frounze wrote:

>> >Not really. The inner loop in my tmapper can not be written in pure C.
>> >Believe me.
>> This is not true.
>Okay, interpolate U and V over a group of pixels, and don't forget &0xFF
>to be sure U and V don't exceed the 0...255 range (the span() function
>does this). I doubt your C code will be as fast as my ASM. Tell me what
>you've got when you're done.

I was referring to the T_Map() function you posted to this group. This
can clearly be written in C, and quite efficiently. I didn't look at 

[Most of the stupid bet deleted. Seriously, I wouldn't have held you to
the bet, because I already knew the results.]

>> I get rid of all your inline assembly in T_Map. I will be allowed
>> to add one single line (say less than 50 characters from __asm__
>> up to the closing ')' ) of inline assembly to your source. I bet,
>> the plain C code will perform about the same, as your inline
>> code. I win, when my code is no more than 2 FPS slower, or faster, than
>> your code (The executable you sent reports 70 FPS here).
>How many such lines are there, in your opinion? :)

I don't understand this question.

To elaborate, and to make this on-topic again: some of the code Alexei
posted uses just "normal" floating point math, and he coded almost all
of it as inline assembly. I replaced this with the equivalent C code,
most of which was already there. Only minor modifications were needed,
such as
  /* a=d/c; b=e/c; */ /* This was already there in comments */
being replaced by

#if USEC
  f=1.0/c; a=d*f; b=e*f;
#else
  __asm__ /* ... */
#endif
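
As a minimal sketch of this transformation (the function name and its
signature are hypothetical and not taken from Alexei's sources), the
two divisions by c become one division and two multiplications. The
results can differ from true division in the last bit, because each
value is now rounded twice.

```c
/* Hypothetical helper, not from the original sources: scales d and e
   by 1/c with one division and two multiplications instead of two
   divisions. */
static void scale_by_reciprocal(double c, double d, double e,
                                double *a, double *b)
{
    double f = 1.0 / c;   /* the single remaining division */
    *a = d * f;           /* replaces a = d / c */
    *b = e * f;           /* replaces b = e / c */
}
```

Called as scale_by_reciprocal(4.0, 10.0, 6.0, &a, &b), this stores 2.5
in a and 1.5 in b; these inputs were picked so that every intermediate
value is exactly representable.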

Alexei made the same optimization in his inline assembler.
After this change I recompiled, and the speed went up from 70 FPS to
72 FPS. This, I think, shows that gcc is capable of producing quite
efficient floating point code. Of course, Alexei's code would have won
if he had replaced

  __asm__ volatile("fldl (%0)\n ...\nfstpl (%0)" : : "r" (&dbl));

(Alexei, you got rid of the "g", but I think "memory" is needed
in the clobber list here. I'm not totally certain, though.)

by

  __asm__ volatile("fldl %0\n ...\nfstpl %0" : "=m" (dbl) : "0" (dbl));

This would give gcc more chances to optimize. It uses
fewer registers and also needs fewer instructions. I have not tried
this, but even then I think the C code would not be much
less efficient than the inline assembly.
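
The difference between the two operand styles can be sketched with a
simpler instruction than the FPU code above. Everything below is an
illustration under assumptions: the function names are hypothetical,
an integer increment stands in for the fldl/fstpl sequence, and "+m"
(a read-write memory operand) is used in place of the "=m"/"0" pair.
GCC's documentation does say that an asm which writes memory not
listed among its operands should name "memory" as a clobber, which
matches the suspicion voiced above.

```c
/* Hypothetical helpers for illustration; the asm is used only with
   gcc-compatible compilers on x86, with a plain C fallback elsewhere. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_X86_GNU_ASM 1
#else
#define HAVE_X86_GNU_ASM 0
#endif

/* Address passed in a register: the compiler cannot see which memory
   is written, so "memory" is listed as a clobber. */
static void inc_via_pointer(int *p)
{
#if HAVE_X86_GNU_ASM
    __asm__ volatile ("incl (%0)" : : "r" (p) : "memory", "cc");
#else
    ++*p;
#endif
}

/* Object passed as a read-write memory operand: the compiler knows
   exactly what is modified and chooses the addressing mode itself,
   so no register is tied up holding the address. */
static void inc_via_mem_operand(int *p)
{
#if HAVE_X86_GNU_ASM
    __asm__ ("incl %0" : "+m" (*p) : : "cc");
#else
    ++*p;
#endif
}
```

Both functions increment the pointed-to int by one; the second form
simply tells the compiler more about what the asm does.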

Where gcc does produce considerably less efficient code is when you have

  int i;
  double a, b;

  i = (int)(a*b);

Here, gcc always needs to save and restore the FPU control word, and
there are a few occurrences of this type in Alexei's code. (I don't
blame gcc here; I think it is almost impossible for a compiler to do
much better.)

I replaced the above code with

/* can be #ifdefed and replaced by

#define to_int(x) ((int)(x)) 

when not compiling with gcc for i386, to keep the code portable.
Comments on other or more efficient methods to do double -> int
conversions are welcome. */

__inline__ static int to_int(double x)
{
  int r;
  __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x)); /* "t" is for st0 */
  return r;
}

#if USEC2
  i = to_int(a*b);
#else
  __asm__ /* ... */
#endif

This is essentially what Alexei's inline code does. (I have
not bothered to look up whether the fistl instruction rounds
or chops, so this may not give the same result as the C code.)
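
On the open question above: the x86 FIST/FISTP instructions convert
using the rounding mode held in the FPU control word, which defaults to
round-to-nearest, whereas a C cast always discards the fraction. That
is also why gcc has to switch the control word around a cast. The
following is a portable sketch of the two behaviours, not the code from
the post. The function names are hypothetical, and round_to_int relies
on a well-known trick that assumes IEEE 754 doubles, the default
rounding mode, and no -ffast-math.

```c
/* Hypothetical helper: rounds to the nearest integer (ties go to the
   even neighbour) in the default rounding mode. Adding 1.5 * 2^52
   moves x into a range where doubles are spaced exactly 1 apart, so
   the addition itself performs the rounding. Intended for values that
   fit in an int. */
static int round_to_int(double x)
{
    const double magic = 6755399441055744.0;  /* 1.5 * 2^52 */
    volatile double t = x + magic;  /* volatile forces a store as double */
    return (int)(t - magic);        /* difference is already an integer */
}

/* What a plain C cast does: discard the fraction (round toward zero). */
static int chop_to_int(double x)
{
    return (int)x;
}
```

For 2.7 the first function gives 3 and the second gives 2, so a
conversion based on fistl and one based on a cast can differ by one.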

While the to_int function is not optimal (gcc will have to emit
one superfluous fstp instruction, compared to using fistpl), it is
still quite a bit more efficient than the plain C cast. With these
modifications, I got rid of all the other inline assembly. I got
70 FPS, the same as the original (either the self-compiled sources or
the executable Alexei sent to me).

Alexei's code "caches" some values on the FPU stack, an opportunity
gcc is not able to see (with the switches I used). Nevertheless,
even here, with the help of only one line of inline assembly,
it produces comparable results. Again, it would lose if all
those references and address-of operations were omitted from the
inline assembly. It should be clear that the compiler won't reach the
efficiency of hand-optimized assembler code. Whether the relatively
small difference here is worth all the trouble, ...

One last comment on the T_Map function. The C version actually
got quite a bit slower (5 FPS, IIRC) when compiled with -O2 or -O3,
compared to -O only. The assembler version, not surprisingly, was
not affected.

There was one bug in another part of the sources that may be of
general interest.

[All the context omitted (Alexei, it's in your linev)]
  int c; /* only low byte used */
  __asm__ volatile("movb %0, al" : : "g" (c));

This actually compiled with -O2, but with -O gas reported an error.
It should be clear why: when gcc decides that c will live in memory
or in a/b/c/dx, it works; when c ends up in (say) esi, it doesn't,
because esi has no byte-sized part. So this is a nice example of why
"but it works" doesn't buy you too much.
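
One way to avoid this class of error is to tell gcc that the operand
must live in a register that has a byte-sized name. The sketch below is
an assumption-laden illustration rather than a patch for the original
line: the function name is hypothetical, and instead of hard-coding al
it lets the compiler pick the destination register.

```c
/* Hypothetical helper: copies the low byte of c through a byte register. */
static unsigned char low_byte(int c)
{
    unsigned char out;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "q" limits both operands to registers with a byte-sized name
       (a, b, c, d on i386); %b1 prints the byte name of operand 1. */
    __asm__ ("movb %b1, %0" : "=q" (out) : "q" (c));
#else
    out = (unsigned char)c;   /* plain C fallback on other targets */
#endif
    return out;
}
```

With the "q" constraint the asm assembles regardless of the
optimization level, because gcc can no longer place c in esi or in
memory for this operand.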

Alexei, I have poked some fun at you. I hope I have made up for it
with this post, which actually took longer to write than the coding.
I will send you the modified source by email. Hopefully the post
was of general interest.

Regards, Dieter
