X-Authentication-Warning: sal.physics.ucsb.edu: dwhysong owned process doing -bs
Date: Thu, 13 May 1999 18:19:43 -0700 (PDT)
From: David Whysong <dwhysong AT physics DOT ucsb DOT edu>
To: Marc Lehmann <pcg AT goof DOT com>
cc: pgcc AT delorie DOT com
Subject: Common subexpression elimination
In-Reply-To: <19990511004113.N22062@cerebro.laendle>
Message-ID: <Pine.LNX.4.10.9905131509050.23462-100000@sal.physics.ucsb.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Reply-To: pgcc AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

On Tue, 11 May 1999, Marc Lehmann wrote:

>Have you benchmarked it? fmuls are not really slower than fadds, for
>example.  However, if you can manage to get a (sensibly sized ;) example
>of where cse fails I'd be happy to look into it!

Did you see my previous example? If it was too large, here is a small one.

Here's a simple test function:

inline double work(double A1, double A2, double A3, double A4)
{ return(A1*A2*A3 + A1*A3*A4 + A2*A3*A4); }

There are 8 arithmetic operations, 6 multiplies and two adds.

Now here is the same function, with manual subexpression elimination:

inline double fast(double A1, double A2, double A3, double A4)
{
        register double tmp=A3*A4;
        return( A1*(A2*A3+tmp) + A2*tmp);
}

There are only 5 arithmetic operations, 3 multiplies and 2 adds.

Take a look at the assembly produced by pgcc-1.1.3:

/local/bin/gcc -O6 -Wall -mcpu=pentiumpro -march=pentiumpro -mpentiumpro
-ffast-math -mno-ieee-fp -fschedule-insns -fstrength-reduce
-fomit-frame-pointer -malign-double -mstack-align-double -funroll-loops
-funroll-all-loops test.c -S -o test.s


.globl work
        .type    work,@function
work:
        subl $4,%esp
        fldl 8(%esp)
        fld %st(0)
        fldl 24(%esp)
        fldl 16(%esp)
        fxch %st(3)
        fmul %st(1),%st
        fxch %st(2)
        fmul %st(3),%st
        fldl 32(%esp)
        fxch %st(4)
        fmul %st(2),%st
        fxch %st(3)
        fmul %st(4),%st
        fxch %st(1)
        fmulp %st,%st(2)
        fxch %st(2)
        fmulp %st,%st(3)
        faddp %st,%st(1)
        faddp %st,%st(1)
        popl %ecx
        ret
.Lfe2:
        .size    work,.Lfe2-work
        .align 16
.globl fast
        .type    fast,@function
fast:
        subl $4,%esp
        fldl 16(%esp)
        fldl 24(%esp)
        fld %st(1)
        fld %st(1)
        fxch %st(1)
        fmulp %st,%st(2)
        fmull 32(%esp)
        fadd %st,%st(1)
        fmulp %st,%st(2)
        fmull 8(%esp)
        faddp %st,%st(1)
        popl %ecx
        ret
.Lfe3:
        .size    fast,.Lfe3-fast
        .ident  "GCC: (GNU) pgcc-2.91.66 19990314 (egcs-1.1.2 release)"


Function "work" has 6 multiplies and 2 adds, while functin "fast" has 4
multiplies and 2 adds (I'm not sure why it has 4 fmul? operations isntead
of 3). Benchmarking shows almost no difference in speed, but the functions
are so small that I doubt I could measure any difference reliably anyway.

As far as I can tell, the compiler does no common subexpression
elimination at all.

Dave

David Whysong                                       dwhysong AT physics DOT ucsb DOT edu
Astrophysics graduate student         University of California, Santa Barbara
My public PGP keys are on my web page - http://www.physics.ucsb.edu/~dwhysong
DSS PGP Key 0x903F5BD6  :  FE78 91FE 4508 106F 7C88  1706 B792 6995 903F 5BD6
D-H PGP key 0x5DAB0F91  :  BC33 0F36 FCCD E72C 441F  663A 72ED 7FB7 5DAB 0F91