X-Authentication-Warning: sal.physics.ucsb.edu: dwhysong owned process doing -bs Date: Mon, 10 May 1999 21:45:45 -0700 (PDT) From: David Whysong To: Marc Lehmann cc: pgcc AT delorie DOT com Subject: Re: Optimization question In-Reply-To: <19990511004113.N22062@cerebro.laendle> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Reply-To: pgcc AT delorie DOT com X-Mailing-List: pgcc AT delorie DOT com X-Unsubscribes-To: listserv AT delorie DOT com Precedence: bulk On Tue, 11 May 1999, Marc Lehmann wrote: >Have you benchmarked it? fmuls are not really slower than fadds, for >example. However, if you can manage to get a (sensibly sized ;) example >of where cse fails I'd be happy to look into it! I'm not sure what constitutes "sensibly sized" but here goes. :) Here is the original example function: fourvector Asq={u[0]*u[0],u[1]*u[1],u[2]*u[2],u[3]*u[3]}; fourvector Acu={Asq[0]*u[0],Asq[1]*u[1],Asq[2]*u[2],Asq[3]*u[3]}; *Q[0][0]=(((Asq[0] + Asq[1] + Asq[2] + Asq[3])* (Acu[0]*(Pv1*v01 + Px1*x01) - Acu[1]*(Pv2*v01 + Px2*x01) - A3*(Asq[2]*Pv3*v01 - Asq[3]*Pv3*v01 + 2*A3*A4*Pv3*v02 + Asq[2]*Px3*x01 - Asq[3]*Px3*x01 + 2*A3*A4*Px3*x02) - Asq[1]*(A3*Pv3*v01 - 2*A4*Pv2*v03 + A3*Px3*x01 - 2*A4*Px2*x03) + A2*(-(Asq[2]*Pv2*v01) + Asq[3]*Pv2*v01 - 2*A3*A4*Pv2*v02 + 2*A3*A4*Pv3*v03 - Asq[2]*Px2*x01 + Asq[3]*Px2*x01 - 2*A3*A4*Px2*x02 + 2*A3*A4*Px3*x03) + Asq[0]*(A2*(Pv2*v01 + 2*Pv1*v02 + Px2*x01 + 2*Px1*x02) + A3*(Pv3*v01 + 2*Pv1*v03 + Px3*x01 + 2*Px1*x03)) + A1*(-(Asq[2]*Pv1*v01) + Asq[3]*Pv1*v01 - 2*A3*A4*Pv1*v02 + 2*Asq[2]*Pv3*v03 - Asq[2]*Px1*x01 + Asq[3]*Px1*x01 - 2*A3*A4*Px1*x02 - Asq[1]*(Pv1*v01 - 2*Pv2*v02 + Px1*x01 - 2*Px2*x02) + 2*Asq[2]*Px3*x03 + 2*A2*(A3*Pv3*v02 + A4*Pv1*v03 + A3*Pv2*v03 + A3*Px3*x02 + A4*Px1*x03 + A3*Px2*x03))))/2); Doing common subexpression elimination by hand, compiling to assembly, and using "grep f??? csetest.s | wc" gives the following instruction counts: Instruction Original Hand optimized ------------------------------------------------------ fmul/fmuls/fmulp 94 72 faddl 40 42 fsubp/fsubrp 18 18 fdiv 0 0 fxch 85 52 movl 22 22 fld/flds 80 87 fstps 32 39 Benchmarking shows a small but significant difference: [dwhysong AT eleanor q]$ ./csetest Time for 10 million calls: 8.490000 seconds [dwhysong AT eleanor q]$ ./nocsetest Time for 10 million calls: 9.670000 seconds The C source and assembly files are available if anyone is interested. Hand optimizing the full 3300 lines of code has so far resulted in a factor of two speed increase, and I'm not finished yet. I suspect the problem is a lack of registers, assuming (from what you said before) that the compiler will only do CSE if there is a spare register. I would have thought the compiler would use temporary variables on the stack if no registers are available, but then again I don't write compilers. Dave David Whysong dwhysong AT physics DOT ucsb DOT edu Astrophysics graduate student University of California, Santa Barbara My public PGP keys are on my web page - http://www.physics.ucsb.edu/~dwhysong DSS PGP Key 0x903F5BD6 : FE78 91FE 4508 106F 7C88 1706 B792 6995 903F 5BD6 D-H PGP key 0x5DAB0F91 : BC33 0F36 FCCD E72C 441F 663A 72ED 7FB7 5DAB 0F91