Mail Archives: djgpp/2000/04/19/15:31:58

From: "Alexei A. Frounze" <alex DOT fru AT mtu-net DOT ru>
Newsgroups: comp.os.msdos.djgpp
Subject: Re: inefficiency of GCC output code & -O problem
Date: Wed, 19 Apr 2000 19:52:00 +0400
Organization: MTU-Intel ISP
Lines: 182
Message-ID: <38FDD620.89ADB579@mtu-net.ru>
References: <Pine DOT LNX DOT 4 DOT 10 DOT 10004180455310 DOT 1540-100000 AT darkstar DOT grendel DOT net> <38FBB719 DOT 3915C530 AT mtu-net DOT ru> <8dgvat DOT 3vvqu6v DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de> <38FC0F43 DOT 87E209B3 AT mtu-net DOT ru> <8dib4a DOT 3vvqvqr DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de>
NNTP-Posting-Host: ppp103-64.dialup.mtu-net.ru
Mime-Version: 1.0
X-Trace: gavrilo.mtu.ru 956163940 29012 212.188.103.64 (19 Apr 2000 17:05:40 GMT)
X-Complaints-To: usenet-abuse AT mtu DOT ru
NNTP-Posting-Date: 19 Apr 2000 17:05:40 GMT
Cc: buers AT gmx DOT de
X-Mailer: Mozilla 4.72 [en] (Win95; I)
X-Accept-Language: en,ru
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

Dieter Buerssner wrote:
> 
> [This thread should be dead by now, but I really cannot leave some things
> uncorrected]

You're welcome. I have nothing against it, really.

> [In the same reply Alexei has written]
> \begin{quote}
>         You've forgotten (in fact, Dieter hasn't mentioned) the
>         FIDIVRL instruction executed in parallel with the span() function.
>         This is a real trick that makes a difference. Even Dieter didn't
>         change it and left this piece of my inline ASM as is.
> \end{quote}
> 
> I did change this. And I mentioned everything. I especially
> mentioned that for one test, I changed part of the inline assembly
> to C code. (I did this in places where it seemed to me that
> the inline assembly would not have much impact on the performance.)
> I also mentioned that for the other test, I got rid of all your inline
> assembly (and added one new line of inline assembly). So, the
> quotes are just plain wrong.

You said in the mail:
------------8<------------
If you 

#define USEC 0
#define USEC2 0

you'll get your original code. With

#define USEC 1
#define USEC2 0

you'll get the code, that is actually faster with my computer, than
your original. I have to compile it with -O and not -O2, to get it
faster. All the other modules, I have let compiled with -O2.

With

#define USEC 1
#define USEC2 1

you'll get the version with one line of inline assembly. Again I used
-O only. If you like, you can tell me your results.
------------8<------------
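
A minimal sketch of how two such switches could pick a code path at compile
time. Only the USEC and USEC2 names come from the quoted mail; the enum, the
function name and the exact layout are made up for illustration, and the real
source may be organized differently.

```c
#include <assert.h>

#define USEC  1   /* 0: original inline ASM, 1: C replacement           */
#define USEC2 0   /* with USEC=1 only: 1 selects the "one ASM line" path */

/* Hypothetical labels for the three variants described in the mail. */
enum variant {
    VARIANT_ORIGINAL_ASM,   /* USEC=0, USEC2=0 */
    VARIANT_PARTLY_C,       /* USEC=1, USEC2=0 */
    VARIANT_C_ONE_ASM_LINE  /* USEC=1, USEC2=1 */
};

/* Report which variant the preprocessor selected for this build. */
static enum variant selected_variant(void)
{
#if !USEC
    return VARIANT_ORIGINAL_ASM;
#elif !USEC2
    return VARIANT_PARTLY_C;
#else
    return VARIANT_C_ONE_ASM_LINE;
#endif
}
```

Because the choice is made by the preprocessor, each variant has to be rebuilt
(here with -O, as in the mail) before its FPS can be compared.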

I'm not talking about USEC=USEC2=0. Let's talk about the others.

USEC=1, USEC2=0, which is faster for you, is only as fast as my original here.
And USEC2=0 replaces neither the paired FIDIVRL nor some of the other inline ASM
blocks.

USEC=1, USEC2=1 is slower here.

Btw, wait a minute...

I have a suggestion. Please tell me the FPS for all your USEC* switches in the
following situation. Move the player forward until he bumps into the wall made
of wood. There the 3d engine renders only one polygon and thus FPS is at its
maximum. Tell me what you get in this situation with the whole range of
#define'd switches =0,=1,=whatever. It's very important.


> >It's not wrong, since I don't get your results with (USEC=USEC2=1 and -O
> >switch). I get it *slower*. And I have no idea what's up.
> 
> Don't you see that these sentences say something totally
> different from the quotes? I never stated that you will be
> able to reproduce my numbers. "It's not wrong, since ..."
> doesn't make any sense.

Of course I didn't say that I must get the same values. I just don't see
your code running faster here than before. :)


> Alexei, reread the thread. I think I have always tried to write
> exactly what I have done. Your statements make me look like a liar.

I didn't want to call you a liar at all. I just mentioned that the case
#define USEC 1
#define USEC2 0
that works faster for you still contains inline ASM and has FIDIVRL executed in
parallel with span(). That has nothing to do with calling you a liar. Really.

> They are often out of context. I have reported the numbers exactly
> like I have told you in my post about this stupid bet. Without
> any of your inline assembly, I got exactly the same performance
> here. I have no doubt, that you might measure something different.

What? The minimum FPS values are read from the 1st frame (i.e. the player just
stands at the original point the whole time). The maximum value is read when the
player has moved forward to the wall made of wood, where FPS is highest because
of the number of visible polygons (only 1 poly: the wood wall).
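
The MIN/MAX bookkeeping described above can be sketched roughly like this. It
is only an illustration: the struct and function names are hypothetical, and
the frame time is assumed to come from whatever timer the engine already uses.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: remembers the lowest and highest FPS seen so far. */
struct fps_stats {
    double min_fps;
    double max_fps;
    unsigned long frames;
};

static void fps_stats_init(struct fps_stats *s)
{
    s->min_fps = DBL_MAX;  /* any real sample will be lower */
    s->max_fps = 0.0;      /* any real sample will be higher */
    s->frames  = 0;
}

/* Record one frame that took frame_seconds to render; returns its FPS. */
static double fps_stats_add(struct fps_stats *s, double frame_seconds)
{
    double fps;

    assert(frame_seconds > 0.0);
    fps = 1.0 / frame_seconds;
    if (fps < s->min_fps) s->min_fps = fps;
    if (fps > s->max_fps) s->max_fps = fps;
    s->frames++;
    return fps;
}
```

Feeding it the first frame at the starting point and then the frames in front
of the wood wall would give the two numbers asked for in this thread.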

> I don't call you a liar.

Neither do I.


> It really doesn't surprise me, that the
> results are highly machine dependent. But from looking at the
> asm output (I use fsdb after compiling with -g, it shows nicely
> C source and asm together, but there exist other means), it seems to me,
> that there shouldn't be a big difference at all for T_Map() with
> and without inline assembly (besides the rounding to int, which
> I coded by one inline function). I explained, that you use the
> FPU stack efficiently. Some of this advantage, you lost by all
> those references. Count the FPU instructions in the .s output,
> and you will see, that the C version will need as many
> fmul/fdiv etc. instructions. It will need quite a few fxch instructions,
> that you don't need. It will need to discard the top of the floating
> point stack a few times, where you don't need it. These things can be
> very CPU dependent. The C code will avoid many address calculations,
> to make up for it.

Btw, FPU instructions may overlap with other FPU instructions, as well as
execute in parallel with CPU (integer) ones. So there are two ways of
optimizing:
1. making the code as short as possible, eliminating redundant reloads and
reusing previous results;
2. making instructions overlap and execute in parallel.
There is no *serious* difference between the two ways, though (at least on my
computer).

> 
> Also, if you think that pairing of the fidivr with span is really
> important, 

Yes, it's important on my computer. If I use plain C instead of inline ASM with
fidivrl, FPS drops.

> you *might* be able to get it with the C code as well.

I doubt it. GCC generates code so that everything is removed from the FPU stack
before a new function call; otherwise that function could not rely on the state
of the stack. So I doubt that GCC would permit such a thing. In that case the
result of the division is written to the IIZ variable before span() is called.
The division takes about 33...39 cycles, so we wait that long between the actual
FIDIV and the FSTP, before span() even starts. If we move the FSTP after span(),
there is no such wait, since we don't request the result of the division right
away. This is the main trick. I don't know why it doesn't work for you while it
works for me. Perhaps you should try your C replacement for this inline assembly
in the case with only one visible polygon.
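
The trick can be sketched roughly as below. This is not the actual engine
code: recip_start(), recip_finish(), span_stub() and draw_one_span() are
made-up names, and span_stub() only stands in for the real span(). The sketch
assumes GCC-style inline ASM on x86 and that nothing else touches the FPU stack
between the two ASM statements, which GCC does not formally guarantee. On other
targets it falls back to plain C and the overlap is lost.

```c
#include <assert.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Start 1/z on the x87 FPU and leave the quotient on the FPU stack. */
static void recip_start(float z)
{
    static const int one = 1;
    __asm__ volatile ("flds %0\n\t"
                      "fidivrl %1"      /* ST(0) = 1 / ST(0) */
                      : : "m"(z), "m"(one));
}

/* Pop the quotient; only here do we have to wait for the divider. */
static float recip_finish(void)
{
    float r;
    __asm__ volatile ("fstps %0" : "=m"(r));
    return r;
}
#else
/* Portable fallback (assumption: no overlap, just the same result). */
static float recip_pending;
static void  recip_start(float z) { recip_pending = z; }
static float recip_finish(void)   { return 1.0f / recip_pending; }
#endif

/* Stand-in for span(): integer-only work that can run during the divide. */
static int span_stub(const unsigned char *src, unsigned char *dst, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        dst[i] = src[i];
        sum += src[i];
    }
    return sum;
}

/* Issue the divide, do the span, and only then store the result (IIZ). */
static float draw_one_span(float z, const unsigned char *src,
                           unsigned char *dst, int n)
{
    float iiz;

    assert(z != 0.0f);
    recip_start(z);           /* FIDIVRL issued, result not requested yet */
    span_stub(src, dst, n);   /* CPU work overlaps the 33...39 cycles     */
    iiz = recip_finish();     /* FSTP moved after the span                */
    return iiz;
}
```

Whether the overlap actually pays off depends on the processor, which is
exactly what the one-polygon FPS test is meant to show.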

> I delayed that part of the C code till after the span, because
> it was just a very little bit faster here. The C code is still
> there in comments. Gcc will not use fidivr, it will use
> fdivr instead. Obviously gcc decided to trade an inverse
> division by an integer (compile time constant) for an
> inverse division by a floating-point constant.

Try for the case when only *one wood wall* is visible.

> You might have optimized your code exactly for your processor.

I didn't try to optimize for my processor. I just made it as fast as I could
without regard to Cyrix 6x86 and so on. Btw, I'll try your suggestions on a
486dx2-66. I bet it won't be faster at all, but much slower. :)

> The numbers I have written are true, they are for the first
> screen of your program. I have not bothered, to find any MIN/MAX,
> but playing around a little bit, I can essentially see no difference
> between the C code and the inline assembly.

You should bother. I may invent a more efficient algorithm (not to be confused
with the actual implementation), and so having a higher FPS rate for the
simplest situation is important. I have mentioned this case at least twice above.
Just put the player in the position where only 1 polygon is visible and note the
FPS for all your "#define" switches. Tell me what you get in this situation.

bye.
Alexei A. Frounze
-----------------------------------------
Homepage: http://alexfru.chat.ru
Mirror:   http://members.xoom.com/alexfru

