Message-Id: <m10p76W-00020nC@chkw386.ch.pwr.wroc.pl>
Date: Wed, 2 Jun 99 09:14 
From: strasbur AT chkw386 DOT ch DOT pwr DOT wroc DOT pl (Krzysztof Strasburger)
To: pgcc AT delorie DOT com
Subject: Pgcc 1.1.3 - bad performance on P6
Reply-To: pgcc AT delorie DOT com

Hi all!
Yesterday I did some manual optimizations of one of my programs
(inlining two or three very small functions that are called millions of
times from one place only, and unrolling some loops). Before these changes
the program ran at almost identical speed whether compiled with pgcc 1.1.3
or gcc 2.7.2.3. After the optimization the program is faster, but the
comparison of pgcc with gcc 2.7.2.3 is surprising. Here is a short summary.
All compilations used the following parameters:
-ffast-math -malign-jumps=0 -malign-loops=0 -malign-functions=0
plus -mstack-align-double -march=pentiumpro -mcpu=pentiumpro for pgcc.
The calculations were performed on a Pentium Pro 166 MHz. The program
is FPU intensive.
1. gcc 2.7.2.3 -O2 -m486; t=81.97s
2. pgcc 1.1.3 -O6 -fno-runtime-lift-stores; t=89.69s
3. pgcc 1.1.3 -O4; t=86.91s
4. pgcc 1.1.3 -O2; t=87.02s
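Relative to the gcc 2.7.2.3 run, the pgcc builds above are roughly 9.4%
(-O6), 6.0% (-O4) and 6.2% (-O2) slower. A minimal sketch of that arithmetic
in C follows; the function names are hypothetical, and only the timings are
taken from the list above.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which time t is slower (positive) or faster (negative)
   than the reference time t_ref. */
double slowdown_percent(double t_ref, double t)
{
    assert(t_ref > 0.0);
    return (t / t_ref - 1.0) * 100.0;
}

/* Print the Pentium Pro timings from the list above relative to the
   gcc 2.7.2.3 -O2 -m486 run. */
void print_ppro_summary(void)
{
    const double t_gcc = 81.97;   /* gcc 2.7.2.3 -O2 -m486 */
    const char *label[] = {
        "pgcc -O6 -fno-runtime-lift-stores", "pgcc -O4", "pgcc -O2"
    };
    const double t[] = { 89.69, 86.91, 87.02 };
    int k;

    for (k = 0; k < 3; k++)
        printf("%-36s %+5.1f%%\n", label[k], slowdown_percent(t_gcc, t[k]));
}
```

Calling print_ppro_summary() from a main() prints one line per pgcc build.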
Let us assume that gcc 2.7.2.3 produced optimal code for P6 by accident.
The obvious remark is that the code produced by pgcc for P6 is suboptimal,
but why do the higher optimization levels hurt performance instead of
improving it?
I tried disabling -fomit-frame-pointer, but -O6 -fno-omit-frame-pointer
gives the same code as -O4 (BTW, -O3 and -O4 gave the same code on P6), so
the optimizations enabled by -O5 and -O6 seem to require -fomit-frame-pointer
to work.
After this bad experience on P6 I tried a Pentium 133. Here the situation is
a bit better. Of course -march and -mcpu were changed to pentium.
1. gcc 2.7.2.3 -O2 -m486; t=117.38s
2. pgcc 1.1.3 -O2; t=115.90s
3. pgcc 1.1.3 -O3; t=117.15s
4. pgcc 1.1.3 -O4; t=115.88s
5. pgcc 1.1.3 -O6 -fno-runtime-lift-stores; t=117.20s
The result is almost independent of the optimization level and at least not
worse than for the good old gcc 2.7.2.3.
I attach a simple program which demonstrates the deterioration of performance
with -O5/6. It should be called with the number of steps (= number of calls
of the function) as its argument. With 1000000 steps on a Pentium 133 it gives:
1. gcc 2.7.2.3 -O2 t=4.56s (3.11s on PPro)
2. pgcc 1.1.3 -O6 t=4.88s (3.60s on PPro)
3. pgcc 1.1.3 -O5 t=4.89s (not tested on PPro)
4. pgcc 1.1.3 -O4 t=4.15s (2.81s on PPro)
Krzysztof
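
The message does not say how the times were measured. One possible way,
sketched here under the assumption that CPU time from the standard clock()
function is good enough, is to wrap the code under test in a small timing
helper. The names cpu_seconds and workload are hypothetical, and workload is
only a stand-in for the real computation.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in workload: one floating-point division per step.
   Replace it with the code that should be timed. */
double workload(long steps)
{
    double sum = 0.0;
    long i;

    for (i = 0; i < steps; i++)
        sum += 1.0 / (double)(i + 1);
    return sum;
}

/* Run fn(steps) once and return the CPU time it used, in seconds.
   The value returned by fn is stored in *result so that the compiler
   cannot discard the call as dead code. */
double cpu_seconds(double (*fn)(long), long steps, double *result)
{
    clock_t start, stop;

    assert(fn != 0 && result != 0);
    start = clock();
    *result = fn(steps);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The resolution of clock() is coarse on many systems, so runs of at least a
few seconds, as in the figures above, give more trustworthy numbers.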

#include <stdio.h>
#include <stdlib.h>	/* atoi */
#include <math.h>
/* This probably looks a bit strange for the C programmer, but it is
   modified f2c translated FORTRAN code. */
double gausil(alfa1, c1, alfa2, c2, alfa, c)
double *alfa1, *c1, *alfa2, *c2, *alfa, *c;
{
    double ret_val;
    double x, aa1, aa2;

    *alfa = *alfa1 + *alfa2;
    if (*alfa == 0.) {
	goto L10;
    }
    aa1 = *alfa1 / *alfa;
    aa2 = *alfa2 / *alfa;
    c[0] = aa1 * c1[0] + aa2 * c2[0];
    c[1] = aa1 * c1[1] + aa2 * c2[1];
    c[2] = aa1 * c1[2] + aa2 * c2[2];
    x = c1[0] - c2[0];
    ret_val = x * x;
    x = c1[1] - c2[1];
    ret_val += x * x;
    x = c1[2] - c2[2];
    ret_val = exp(-aa1 * *alfa2 * (ret_val + x * x));
    return ret_val;
L10:
    c[0] = 0.;
    c[1] = 0.;
    c[2] = 0.;
    ret_val = 1.;
    return ret_val;
}

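int main(int argc, char**argv) {
double c1[3],c2[3],c3[3],alfa1,alfa2,alfa3,sum;
int i,j,n;
/* the number of steps is a required argument */
if (argc < 2) {
fprintf(stderr,"usage: %s steps\n",argv[0]);
return 1;
}
n = atoi(argv[1]);
printf("steps = %d\n",n);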
sum=0.;
for (i=0;i<n;i++) {
/* some stupid substitutions */
alfa1=1./(double)(i+1);
alfa2=alfa1+1.;
for (j=0;j<3;j++){
c1[j]=1./(double)(i+j+1);
c2[j]=c1[j]+1.;
}
sum = sum + gausil(&alfa1,c1,&alfa2,c2,&alfa3,c3);
}
printf("sum=%f\n",sum);
return 0;
}
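
For illustration, the kind of manual optimization described at the top of the
message (inlining a tiny function by hand and unrolling a short loop) might
look like the sketch below. It is not taken from the original program: the
names sq, dist2_loop and dist2_unrolled are hypothetical, and the unrolled
form merely mirrors the way gausil above computes its squared distance.

```c
#include <assert.h>

/* Hypothetical tiny helper of the kind worth inlining by hand. */
static double sq(double x)
{
    return x * x;
}

/* Straightforward version: a 3-step loop that calls the helper. */
double dist2_loop(const double *a, const double *b)
{
    double s = 0.0;
    int j;

    assert(a != 0 && b != 0);
    for (j = 0; j < 3; j++)
        s += sq(a[j] - b[j]);
    return s;
}

/* Hand-optimized version: the helper is inlined and the loop unrolled. */
double dist2_unrolled(const double *a, const double *b)
{
    double x, s;

    assert(a != 0 && b != 0);
    x = a[0] - b[0];
    s = x * x;
    x = a[1] - b[1];
    s += x * x;
    x = a[2] - b[2];
    s += x * x;
    return s;
}
```

Both functions return the same value; whether the hand-optimized form is
actually faster depends on the compiler and the optimization level, which is
the point of the measurements above.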