From: "Alexei A. Frounze"
Newsgroups: comp.os.msdos.djgpp
Subject: THE CONCLUSION
Date: Wed, 26 Apr 2000 15:25:44 +0400
Organization: MTU-Intel ISP
Message-ID: <3906D238.888D65F7@mtu-net.ru>
References: <38F20E7A DOT 3330E9A4 AT mtu-net DOT ru>
Cc: buers AT gmx DOT de
Reply-To: djgpp AT delorie DOT com

Hello guys!

I beg your pardon for the delay. I had no Internet access for a couple of days, and I was thinking over the conclusion and the tests we ran. Well, it's time to tell you where we stand on the problem I brought up some time ago.

I had a nice 3D engine developed with GCC (2.95.2) and some assembly (both inline and an external routine). The program compiled and worked properly until I tried to optimize it with GCC's -O2 switch. GCC started complaining that my inline assembly was faulty or, as some of you put it, buggy. It makes no difference what you call it: my code was simply no longer accepted by GCC and AS. That was the actual problem I came here with.

Btw, some time before that I had a look at GCC's output code (GCC invoked without any -O switches). I concluded that inline assembly was needed, since GCC generated a lot of redundant code that ought to be optimized away. I didn't know then that GCC really does output slow code when no command-line switch tells it to optimize.

Btw, I made one funny mistake. I used SAR (the assembly instruction) for an arithmetic shift right instead of >>.
I used it because I didn't know that C generates different instructions for shifting a signed number and for shifting an unsigned integer. ;) So I had a lot of incorrect inline assembly code at the beginning and didn't know what to do, since I had written my inline code relying on a manual about GCC inline ASM. It seems that article was either incorrect or rather outdated. As far as I know, inline assembly has changed a bit in newer versions of GCC. I have an old program with a lot of inline ASM, written in 1997. It no longer compiles with the current GCC without patching the source code.

To make things a bit clearer... I had always used the "g" constraint for passing parameters to inline assembly blocks. Now, AFAIK, that's wrong. "g" may be used for eax, ebx, ecx, edx or a variable in memory. So if I want to pass some parameters to the block, I must take into account that I can't use "g" if there are not enough spare registers, and that I can't use the esi and edi registers: just eax, ebx, ecx and edx plus memory references. It was a bit of a shock to discover this, because the code had compiled normally before I tried the -O2 switch. So I posted a message titled "insufficiency of GCC output code and the -O problem". I guess now you know what really happened.

Then some people in the NG told me that my inline assembly code was causing all the problems and that it was not a bug in the compiler. I still doubt that GCC behaves well here. It should either compile my inline assembly normally regardless of the optimization switches, or fail with the same error messages regardless of those switches. That is still an open question without an answer.

Then Dieter and some of you suggested that I rewrite my inline ASM with something other than just "g". Dieter was also interested in whether my inline ASM was needed at all, i.e. what would happen if I used plain C and GCC's optimization switch.
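
To illustrate the two technical points above (the shift and the constraints), here is a minimal sketch in C. It is not code from the engine: the function names are made up for the example, and the assembly path assumes GCC-style extended asm on an x86 target, with a plain C fallback elsewhere. It shows that >> already becomes a logical shift for unsigned operands and, with GCC, an arithmetic one for signed operands, and how explicit "r" and "c" constraints can be used instead of leaving the choice to "g".

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift: for unsigned operands, >> fills with zeros (SHR on x86). */
uint32_t shift_unsigned(uint32_t x, unsigned n)
{
    assert(n < 32);     /* shifting by the full width or more is undefined */
    return x >> n;
}

/* Right-shifting a negative signed value is implementation-defined in C;
   GCC documents it as sign-extending, which is what SAR does on x86. */
int32_t shift_signed(int32_t x, unsigned n)
{
    assert(n < 32);
    return x >> n;
}

/* The same arithmetic shift as inline assembly with explicit constraints.
   Hypothetical helper for illustration only. */
int32_t shift_signed_asm(int32_t x, unsigned n)
{
    assert(n < 32);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__("sarl %%cl, %0"
            : "+r"(x)   /* read-write operand in any general register */
            : "c"(n)    /* SAR takes a variable count only in CL (ecx) */
            : "cc");    /* SAR modifies the flags */
    return x;
#else
    return x >> n;      /* fallback for non-x86 targets (assumption) */
#endif
}
```

With these, shift_signed(-16, 2) and shift_signed_asm(-16, 2) should both give -4 under GCC, while shift_unsigned applied to the same bit pattern gives a large positive number. So the hand-written SAR buys nothing over >> on a signed int.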
I said that my inline ASM greatly improved the performance (*greatly* because I had compared my inline ASM with the plain C source compiled without any optimization switches. I also thought that GCC had no efficient optimizer... But that was the past.). He also asked for some numerical results of the comparison. So our bet began. :)) Dieter was very lucky (and me too) because I had left the plain C version of my main functions commented out between /* */. Each comment block was followed by its inline assembly replacement. Without those comment blocks we wouldn't have had anything serious to talk about. :)

Then Dieter sent me the results of his test and I ran some tests on my computer. The plain C version was faster (in percentage terms) on Dieter's computer, while the version with inline ASM was faster on mine. I think that's due to the different CPUs: they work differently, so we got different results. I'm not talking here about the actual parts of the code and the tricks I used to increase the overall performance of my engine. Some of them are really good (a replacement for ceil() and a parallel division that runs faster for me. Btw, GCC can also generate such tricky code.). Anyway, our initial results differed.

Then Dieter implemented the innermost loop of the texture mapper in plain C with the loop unrolled, just like in my external ASM implementation of the same inner loop. He also made that function inline and replaced the (int) cast with an inline equivalent. He got the engine running faster than before. After that I improved my code a little (replaced SHR with >> and eliminated some redundant code from my inner loop). And then I compared my program, which has a lot of ASM (both inline and one external subroutine, the inner loop), with Dieter's plain C implementation. I was surprised... Dieter's version ran almost as fast as mine. Just *a bit* slower. Thus we proved that GCC has a very good optimizer.
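
For readers who have not seen one, an unrolled plain C inner loop of the kind described above might look roughly like the sketch below. This is not Dieter's code or mine: the 16.16 fixed-point coordinates, the 64x64 wrapping texture, and all names (TEX_SIZE, fetch_texel, draw_span_simple, draw_span_unrolled) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TEX_SIZE 64u   /* assumed: square texture, side is a power of two */

/* Fetch one texel for 16.16 fixed-point coordinates (u, v).
   The mask wraps the coordinates around the texture. */
static inline uint8_t fetch_texel(const uint8_t *tex, uint32_t u, uint32_t v)
{
    uint32_t x = (u >> 16) & (TEX_SIZE - 1u);
    uint32_t y = (v >> 16) & (TEX_SIZE - 1u);
    return tex[y * TEX_SIZE + x];
}

/* Straightforward version: one pixel per iteration. */
void draw_span_simple(uint8_t *dst, size_t n, const uint8_t *tex,
                      uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    assert(dst != NULL && tex != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = fetch_texel(tex, u, v);
        u += du;
        v += dv;
    }
}

/* Unrolled version: four pixels per iteration, then the leftovers. */
void draw_span_unrolled(uint8_t *dst, size_t n, const uint8_t *tex,
                        uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    assert(dst != NULL && tex != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = fetch_texel(tex, u, v);  u += du;  v += dv;
        dst[i + 1] = fetch_texel(tex, u, v);  u += du;  v += dv;
        dst[i + 2] = fetch_texel(tex, u, v);  u += du;  v += dv;
        dst[i + 3] = fetch_texel(tex, u, v);  u += du;  v += dv;
    }
    for (; i < n; i++) {                      /* remaining 0..3 pixels */
        dst[i] = fetch_texel(tex, u, v);
        u += du;
        v += dv;
    }
}
```

Both functions fill the span with the same pixels; the unrolled one just does fewer loop tests and branches per pixel. Whether that pays off depends on the CPU and on what the optimizer already does at -O2, so it needs to be measured.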
So if you want to make your program faster, you don't really need to put a lot of assembly code into the source. IMHO that's great!!! :)

I'm not sure whether I should post the test tables with FPS values and details about which parts of the code were C and which were ASM. Posting those tables means I would need to explain in full detail what exactly Dieter and I were working on. That way, anyone could learn from it.

Btw, recently I changed the implementation of my original texture mapping algorithm and gained some extra FPS. That means your code's performance depends greatly on the actual algorithm implementation, since neither the compiler nor the optimizer can figure out your algorithm and improve it the way it optimizes code. :)

About two days ago I generated an .S file from Dieter's C version of the tmapper. Then I manually replaced Dieter's inner loop with the code from my external ASM implementation of the same inner loop. And it became faster once more. :) So actually, my assembly code is not too bad. :)

Well, some concluding words should go here. What we have now:
- fixed inline assembly
- yet another pretty efficient optimizing compiler :)
- a faster 3D engine
- some real experience we can all learn from

Thanks to Dieter, everyone else and me for coming up with such a problem.

Dieter, what do you think of this conclusion? Want to correct something, or is everything in the text above all right?

Thanks,
Later
-
Alexei A. Frounze
-----------------------------------------
Homepage: http://alexfru.chat.ru
Mirror: http://members.xoom.com/alexfru