Mail Archives: djgpp/2000/04/26/07:13:18

From: "Alexei A. Frounze" <alex DOT fru AT mtu-net DOT ru>
Newsgroups: comp.os.msdos.djgpp
Subject: THE CONCLUSION
Date: Wed, 26 Apr 2000 15:25:44 +0400
Organization: MTU-Intel ISP
Lines: 133
Message-ID: <3906D238.888D65F7@mtu-net.ru>
References: <38F20E7A DOT 3330E9A4 AT mtu-net DOT ru>
NNTP-Posting-Host: ppp96-251.dialup.mtu-net.ru
Mime-Version: 1.0
X-Trace: gavrilo.mtu.ru 956748369 55904 212.188.96.251 (26 Apr 2000 11:26:09 GMT)
X-Complaints-To: usenet-abuse AT mtu DOT ru
NNTP-Posting-Date: 26 Apr 2000 11:26:09 GMT
Cc: buers AT gmx DOT de
X-Mailer: Mozilla 4.72 [en] (Win95; I)
X-Accept-Language: en,ru
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

Hello guys!

I beg your pardon for the delay. I had no Internet access for a couple of days,
and I was thinking about the conclusion and the tests we have run.

Well, it's time to tell you where we stand on the problem I brought up some
time ago.

I had a nice 3D engine developed with GCC (2.95.2) and some assembly (both
inline and an external routine). The program compiled and worked properly until
I tried to optimize it with GCC's -O2 switch.

GCC started to complain that my inline assembly was faulty or, as some of you
said, buggy. Either way, GCC and AS no longer accepted my code. That was the
actual problem I brought up.
Btw, some time before that I had looked at GCC's output (GCC had been invoked
without any -O switches), and I concluded that inline assembly was needed,
since GCC generated a lot of redundant code that should have been optimized
away. I simply didn't know that GCC really does output slow code when no
command-line switch tells it to optimize.

Btw, I made one funny mistake. I used SAR (the assembly instruction) for an
arithmetic shift right instead of >>. I did so because I didn't know that C
generates different instructions for shifting a signed number and for shifting
an unsigned one. ;)
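
A minimal sketch of that difference, assuming GCC on x86. The function names
are made up for illustration. The C standard leaves >> on a negative signed
value implementation-defined; GCC documents it as sign-extending.

```c
#include <stdint.h>

/* Signed operand: GCC sign-extends, which on x86 typically compiles to SAR. */
static int32_t shift_signed(int32_t x, unsigned n)
{
    return x >> n;
}

/* Unsigned operand: zero bits are shifted in from the left, which on x86
   typically compiles to SHR. */
static uint32_t shift_unsigned(uint32_t x, unsigned n)
{
    return x >> n;
}
```

With GCC, shift_signed(-16, 2) gives -4, while shift_unsigned(0xFFFFFFF0u, 2)
gives 0x3FFFFFFC, so the operand type alone selects the kind of shift.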

So I had a lot of incorrect inline assembly code at the beginning and didn't
know what to do, since I had written it relying on a manual about GCC inline
ASM. It seems that article was either incorrect or quite outdated. As far as I
know, inline assembly has changed a bit in newer versions of GCC. I have an old
program with a lot of inline ASM that was written in 1997. It no longer
compiles with the current GCC w/o patching the source code.

To make things a bit clearer... I've always used the "g" constraint for passing
parameters to inline assembly blocks. Now AFAIK that's wrong. "g" may be used
for eax, ebx, ecx, edx or a variable in memory. So if I want to pass some
parameters to the block, I must take into account that I can't use "g" if there
are not enough spare registers, and that I can't use the esi and edi registers.
Just eax, ebx, ecx and edx plus memory references.
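
For illustration, here is a hypothetical sketch (not code from the engine) of
an inline asm block with explicit constraints. For reference, the GCC manual
describes "g" as allowing any general register, a memory operand or an
immediate integer, "r" as any general register, and "q" on 32-bit x86 as one of
the a, b, c, d registers. Spelling out "r" keeps the template away from operand
combinations the instruction cannot encode. The name add_ints is made up, and
the plain C branch is there only so the block builds on non-x86 targets.

```c
/* Hypothetical helper, not from the original engine: returns a + b. */
static int add_ints(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r": a is both read and written and must live in a register.
       "r":  b must be in a register too, so the ADD never ends up
             with two memory operands.
       "cc": ADD modifies the flags register. */
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b;   /* portable fallback on other targets */
#endif
}
```

Because the constraints state exactly what the instruction accepts, the block
should assemble the same way with or without -O2.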

It was a bit of a shock, because the code had compiled normally before I tried
the -O2 switch. So I posted a message titled "insufficiency of GCC output code
and the -O problem".

I think now you know what really happened.

Then some people in the NG told me that my inline assembly code was causing all
the problems and that this was not a bug in the compiler. I still doubt that
GCC behaves well here. It should either compile my inline assembly normally
regardless of the optimization switches, or fail with the same error messages
regardless of those switches. That is still an open, unanswered question.

Then Dieter and some of you suggested that I rewrite my inline ASM with
something other than just the "g" constraint.

Dieter also wanted to know whether my inline ASM was needed at all, i.e. what
would happen if I used plain C and GCC's optimization switch.
I said that my inline ASM greatly improved performance here (*greatly* because
I had compared my inline ASM with plain C source compiled w/o any optimization
switches. I also thought that GCC had no efficient optimizer... But that was
the past.). He also asked for some numerical results of the comparison.
So we started our bet. :))

Dieter was very lucky (and so was I) because I had left the plain C version of
my main functions commented out between /* */. Each comment block was followed
by its inline assembly replacement. If those comment blocks hadn't been there,
we wouldn't have had anything serious to talk about. :)

Then Dieter sent me some results of his test and I ran some tests on my
computer. The plain C version ran faster (in percentage terms) on Dieter's
computer, while the version with inline ASM ran faster on mine. I think that's
due to the different CPUs. They work differently, so we got different results.

I'm not going into the actual parts of the code and the tricks I used to
increase the total performance of my engine. Some of them are really good (a
replacement for ceil() and a parallel division that works faster for me. Btw,
GCC can also generate such tricky code.). Anyway, our initial results differed.

Then Dieter implemented the innermost loop of the texture mapper in plain C
with the loop unrolled, just like in my external ASM implementation of the same
inner loop. He also _inline_d that function and replaced the (int) cast with an
inline equivalent. He got the engine running faster than before.
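
Dieter's actual code is not shown here, so the following is only a rough sketch
of what an unrolled inner loop in plain C can look like. The 64x64 texture, the
16.16 fixed-point coordinates and all names are assumptions made for the
example, not details of the engine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical texture layout: 64x64 texels, one byte each. */
#define TEX_SIZE 64
#define TEX_MASK (TEX_SIZE - 1)

/* u and v are 16.16 fixed point; the integer parts select the texel and
   the mask wraps the coordinates around the texture. */
static inline uint8_t fetch_texel(const uint8_t *tex, uint32_t u, uint32_t v)
{
    return tex[((v >> 16) & TEX_MASK) * TEX_SIZE + ((u >> 16) & TEX_MASK)];
}

/* Reference version: one pixel per iteration. */
static void draw_span_simple(uint8_t *dst, const uint8_t *tex, size_t n,
                             uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = fetch_texel(tex, u, v);
        u += du;
        v += dv;
    }
}

/* Unrolled version: four pixels per iteration, then the leftover pixels. */
static void draw_span_unrolled(uint8_t *dst, const uint8_t *tex, size_t n,
                               uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = fetch_texel(tex, u, v); u += du; v += dv;
        dst[i + 1] = fetch_texel(tex, u, v); u += du; v += dv;
        dst[i + 2] = fetch_texel(tex, u, v); u += du; v += dv;
        dst[i + 3] = fetch_texel(tex, u, v); u += du; v += dv;
    }
    for (; i < n; i++) {
        dst[i] = fetch_texel(tex, u, v);
        u += du;
        v += dv;
    }
}
```

Both functions write the same pixels; the unrolled one just spends fewer
iterations on loop overhead. Whether that pays off depends on the CPU and on
what the optimizer already does at -O2.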

After that I improved my code a little (replaced SAR with >> and eliminated
some redundant code from my inner loop). Then I compared my program, which has
a lot of ASM (both inline and one external subroutine, the inner loop), with
Dieter's plain C implementation. I was surprised... Dieter's version ran almost
as fast as mine. Just *a bit* slower.

So we showed that GCC has a very good optimizer. If you want to make your
program faster, you don't really need to put a lot of assembly code into the
source. IMHO that's great!!! :)

I'm not sure I should post the test tables with FPS values and details about
which parts of the code were C and which were ASM. Posting those tables would
mean that I'd need to explain in full detail what exactly Dieter and I were
working on.

So, anyone may learn from this. 

Btw, I recently changed the implementation of my original texture mapping
algorithm and gained some extra FPS. That means your code's performance depends
greatly on the actual algorithm implementation, since neither the compiler nor
the optimizer can figure out your algorithm and improve it the way they
optimize code. :)

About two days ago I generated an .S file from Dieter's C version of the
tmapper. Then I manually replaced Dieter's inner loop with the code from my
external ASM implementation of the same inner loop. And it became faster once
more. :)

So actually, my assembly code is not that bad. :)

Well, some concluding words should go here now.

What we have now:
- fixed inline assembly
- yet another pretty efficient optimizing compiler :)
- faster 3d engine
- some real experience we all can learn from

Thanks to Dieter, everyone else and me for coming up with such a problem.

Dieter, what do you think of this conclusion? Want to correct something, or is
everything alright in the above text?

Thanks,
Later
- Alexei A. Frounze
-----------------------------------------
Homepage: http://alexfru.chat.ru
Mirror:   http://members.xoom.com/alexfru
