X-Authentication-Warning: sal.physics.ucsb.edu: dwhysong owned process doing -bs
Date: Mon, 10 May 1999 21:45:45 -0700 (PDT)
From: David Whysong <dwhysong AT physics DOT ucsb DOT edu>
To: Marc Lehmann <pcg AT goof DOT com>
cc: pgcc AT delorie DOT com
Subject: Re: Optimization question
In-Reply-To: <19990511004113.N22062@cerebro.laendle>
Message-ID: <Pine.LNX.4.04.9905101557370.15643-100000@sal.physics.ucsb.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Reply-To: pgcc AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

On Tue, 11 May 1999, Marc Lehmann wrote:

>Have you benchmarked it? fmuls are not really slower than fadds, for
>example.  However, if you can manage to get a (sensibly sized ;) example
>of where cse fails I'd be happy to look into it!

I'm not sure what constitutes "sensibly sized" but here goes. :)

Here is the original example function:

fourvector Asq={u[0]*u[0],u[1]*u[1],u[2]*u[2],u[3]*u[3]};
fourvector Acu={Asq[0]*u[0],Asq[1]*u[1],Asq[2]*u[2],Asq[3]*u[3]};

*Q[0][0]=(((Asq[0] + Asq[1] + Asq[2] + Asq[3])* (Acu[0]*(Pv1*v01 +
Px1*x01) - Acu[1]*(Pv2*v01 + Px2*x01) - A3*(Asq[2]*Pv3*v01 -
Asq[3]*Pv3*v01 + 2*A3*A4*Pv3*v02 + Asq[2]*Px3*x01 - Asq[3]*Px3*x01 +
2*A3*A4*Px3*x02) - Asq[1]*(A3*Pv3*v01 - 2*A4*Pv2*v03 + A3*Px3*x01 -
2*A4*Px2*x03) + A2*(-(Asq[2]*Pv2*v01) + Asq[3]*Pv2*v01 - 2*A3*A4*Pv2*v02 +
2*A3*A4*Pv3*v03 - Asq[2]*Px2*x01 + Asq[3]*Px2*x01 - 2*A3*A4*Px2*x02 +
2*A3*A4*Px3*x03) + Asq[0]*(A2*(Pv2*v01 + 2*Pv1*v02 + Px2*x01 + 2*Px1*x02)
+ A3*(Pv3*v01 + 2*Pv1*v03 + Px3*x01 + 2*Px1*x03)) + A1*(-(Asq[2]*Pv1*v01)
+ Asq[3]*Pv1*v01 - 2*A3*A4*Pv1*v02 + 2*Asq[2]*Pv3*v03 - Asq[2]*Px1*x01 +
Asq[3]*Px1*x01 - 2*A3*A4*Px1*x02 - Asq[1]*(Pv1*v01 - 2*Pv2*v02 + Px1*x01 -
2*Px2*x02) + 2*Asq[2]*Px3*x03 + 2*A2*(A3*Pv3*v02 + A4*Pv1*v03 + A3*Pv2*v03
+ A3*Px3*x02 + A4*Px1*x03 + A3*Px2*x03))))/2);

Doing common subexpression elimination by hand, compiling to assembly, and
using "grep f??? csetest.s | wc" gives the following instruction counts:

Instruction             Original        Hand optimized
------------------------------------------------------
fmul/fmuls/fmulp        94              72
faddl                   40              42
fsubp/fsubrp            18              18
fdiv                    0               0
fxch                    85              52
movl                    22              22
fld/flds                80              87
fstps                   32              39

Benchmarking shows a small but significant difference:

[dwhysong AT eleanor q]$ ./csetest
Time for 10 million calls: 8.490000 seconds
[dwhysong AT eleanor q]$ ./nocsetest
Time for 10 million calls: 9.670000 seconds

The C source and assembly files are available if anyone is interested.

Hand optimizing the full 3300 lines of code has so far resulted in a
factor of two speed increase, and I'm not finished yet.

I suspect the problem is a lack of registers, assuming (from what you said
before) that the compiler will only do CSE if there is a spare register. I
would have thought the compiler would use temporary variables on the stack
if no registers are available, but then again I don't write compilers.

Dave

David Whysong                                       dwhysong AT physics DOT ucsb DOT edu
Astrophysics graduate student         University of California, Santa Barbara
My public PGP keys are on my web page - http://www.physics.ucsb.edu/~dwhysong
DSS PGP Key 0x903F5BD6  :  FE78 91FE 4508 106F 7C88  1706 B792 6995 903F 5BD6
D-H PGP key 0x5DAB0F91  :  BC33 0F36 FCCD E72C 441F  663A 72ED 7FB7 5DAB 0F91