Message-Id:
Date: Wed, 2 Jun 99 09:14
From: strasbur AT chkw386 DOT ch DOT pwr DOT wroc DOT pl (Krzysztof Strasburger)
To: pgcc AT delorie DOT com
Subject: Pgcc 1.1.3 - bad performance on P6
Reply-To: pgcc AT delorie DOT com

Hi all!

Yesterday I did some manual optimizations of one of my programs (inlining two or three very small functions that are called millions of times from one place only, and unrolling some loops). Before these changes, the program ran at almost identical speed whether it was compiled with pgcc 1.1.3 or gcc 2.7.2.3. After the changes the program is faster, but the comparison of pgcc with gcc 2.7.2.3 is surprising. Here is a short summary.

All compilations used the following parameters:
-ffast-math -malign-jumps=0 -malign-loops=0 -malign-functions=0
and, for pgcc, additionally
-mstack-align-double -march=pentiumpro -mcpu=pentiumpro
The calculations were performed on a Pentium Pro 166 MHz. The program is FPU intensive.

1. gcc 2.7.2.3 -O2 -m486; t=81.97s
2. pgcc 1.1.3 -O6 -fno-runtime-lift-stores; t=89.69s
3. pgcc 1.1.3 -O4; t=86.91s
4. pgcc 1.1.3 -O2; t=87.02s

Let us assume that gcc 2.7.2.3 happened to produce optimal code for the P6. The obvious conclusion is that the code produced by pgcc for the P6 is suboptimal, but why do the higher optimization levels hurt performance instead of improving it? I tried to disable -fomit-frame-pointer, but -O4 gives the same code as -O6 -fno-omit-frame-pointer (BTW, -O3 and -O4 gave the same code on P6), so the optimizations enabled by -O5 and -O6 seem to require -fomit-frame-pointer to work.

After this bad experience on the P6 I tried a Pentium 133. Here the situation is a bit better. Of course -march and -mcpu were changed to pentium.

1. gcc 2.7.2.3 -O2 -m486; t=117.38s
2. pgcc 1.1.3 -O2; t=115.90s
3. pgcc 1.1.3 -O3; t=117.15s
4. pgcc 1.1.3 -O4; t=115.88s
5. pgcc 1.1.3 -O6 -fno-runtime-lift-stores; t=117.20s

The result is almost independent of the optimization level and at least no worse than with good old gcc 2.7.2.3.

I attach a simple program which demonstrates the deterioration of performance with -O5/6. It should be called with the number of steps (= calls of the function). With 1000000 steps on a Pentium 133 it gives:

1. gcc 2.7.2.3 -O2 t=4.56s (3.11s on PPro)
2. pgcc 1.1.3 -O6 t=4.88s (3.60s on PPro)
3. pgcc 1.1.3 -O5 t=4.89s (not tested on PPro)
4. pgcc 1.1.3 -O4 t=4.15s (2.81s on PPro)

Krzysztof

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* This probably looks a bit strange for the C programmer, but it is
   modified f2c translated FORTRAN code. */

double gausil(alfa1, c1, alfa2, c2, alfa, c)
double *alfa1, *c1, *alfa2, *c2, *alfa, *c;
{
    double ret_val;
    double x, aa1, aa2;

    *alfa = *alfa1 + *alfa2;
    if (*alfa == 0.) {
        goto L10;
    }
    aa1 = *alfa1 / *alfa;
    aa2 = *alfa2 / *alfa;
    c[0] = aa1 * c1[0] + aa2 * c2[0];
    c[1] = aa1 * c1[1] + aa2 * c2[1];
    c[2] = aa1 * c1[2] + aa2 * c2[2];
    x = c1[0] - c2[0];
    ret_val = x * x;
    x = c1[1] - c2[1];
    ret_val += x * x;
    x = c1[2] - c2[2];
    ret_val = exp(-aa1 * *alfa2 * (ret_val + x * x));
    return ret_val;
L10:
    c[0] = 0.;
    c[1] = 0.;
    c[2] = 0.;
    ret_val = 1.;
    return ret_val;
}

int main(int argc, char **argv)
{
    double c1[3], c2[3], c3[3], alfa1, alfa2, alfa3, sum;
    int i, j, n;

    n = atoi(argv[1]);
    printf("steps = %d\n", n);
    sum = 0.;
    for (i=0;i