Mail Archives: pgcc/1999/05/11/00:48:09
On Tue, 11 May 1999, Marc Lehmann wrote:
>Have you benchmarked it? fmuls are not really slower than fadds, for
>example. However, if you can manage to get a (sensibly sized ;) example
>of where cse fails I'd be happy to look into it!
I'm not sure what constitutes "sensibly sized" but here goes. :)
Here is the original example function:
fourvector Asq={u[0]*u[0],u[1]*u[1],u[2]*u[2],u[3]*u[3]};
fourvector Acu={Asq[0]*u[0],Asq[1]*u[1],Asq[2]*u[2],Asq[3]*u[3]};
*Q[0][0]=(((Asq[0] + Asq[1] + Asq[2] + Asq[3])* (Acu[0]*(Pv1*v01 +
Px1*x01) - Acu[1]*(Pv2*v01 + Px2*x01) - A3*(Asq[2]*Pv3*v01 -
Asq[3]*Pv3*v01 + 2*A3*A4*Pv3*v02 + Asq[2]*Px3*x01 - Asq[3]*Px3*x01 +
2*A3*A4*Px3*x02) - Asq[1]*(A3*Pv3*v01 - 2*A4*Pv2*v03 + A3*Px3*x01 -
2*A4*Px2*x03) + A2*(-(Asq[2]*Pv2*v01) + Asq[3]*Pv2*v01 - 2*A3*A4*Pv2*v02 +
2*A3*A4*Pv3*v03 - Asq[2]*Px2*x01 + Asq[3]*Px2*x01 - 2*A3*A4*Px2*x02 +
2*A3*A4*Px3*x03) + Asq[0]*(A2*(Pv2*v01 + 2*Pv1*v02 + Px2*x01 + 2*Px1*x02)
+ A3*(Pv3*v01 + 2*Pv1*v03 + Px3*x01 + 2*Px1*x03)) + A1*(-(Asq[2]*Pv1*v01)
+ Asq[3]*Pv1*v01 - 2*A3*A4*Pv1*v02 + 2*Asq[2]*Pv3*v03 - Asq[2]*Px1*x01 +
Asq[3]*Px1*x01 - 2*A3*A4*Px1*x02 - Asq[1]*(Pv1*v01 - 2*Pv2*v02 + Px1*x01 -
2*Px2*x02) + 2*Asq[2]*Px3*x03 + 2*A2*(A3*Pv3*v02 + A4*Pv1*v03 + A3*Pv2*v03
+ A3*Px3*x02 + A4*Px1*x03 + A3*Px2*x03))))/2);
Doing common subexpression elimination by hand, compiling to assembly, and
using "grep f??? csetest.s | wc" gives the following instruction counts:
Instruction Original Hand optimized
------------------------------------------------------
fmul/fmuls/fmulp 94 72
faddl 40 42
fsubp/fsubrp 18 18
fdiv 0 0
fxch 85 52
movl 22 22
fld/flds 80 87
fstps 32 39
Benchmarking shows a small but significant difference:
[dwhysong AT eleanor q]$ ./csetest
Time for 10 million calls: 8.490000 seconds
[dwhysong AT eleanor q]$ ./nocsetest
Time for 10 million calls: 9.670000 seconds
The C source and assembly files are available if anyone is interested.
Hand optimizing the full 3300 lines of code has so far resulted in a
factor of two speed increase, and I'm not finished yet.
I suspect the problem is a lack of registers, assuming (from what you said
before) that the compiler will only do CSE if there is a spare register. I
would have thought the compiler would use temporary variables on the stack
if no registers are available, but then again I don't write compilers.
Dave
David Whysong dwhysong AT physics DOT ucsb DOT edu
Astrophysics graduate student University of California, Santa Barbara
My public PGP keys are on my web page - http://www.physics.ucsb.edu/~dwhysong
DSS PGP Key 0x903F5BD6 : FE78 91FE 4508 106F 7C88 1706 B792 6995 903F 5BD6
D-H PGP key 0x5DAB0F91 : BC33 0F36 FCCD E72C 441F 663A 72ED 7FB7 5DAB 0F91
- Raw text -