Date: Tue, 2 Jun 98 13:25
From: strasbur AT chkw386 DOT ch DOT pwr DOT wroc DOT pl (Krzysztof Strasburger)
To: beastium-list AT Desk DOT nl
Subject: Executable sizes and performance
Sender: Marc Lehmann

Here is the comparison of executable sizes and execution times of the old
version of GAMESS (General Atomic and Molecular Electronic Structure System)
on Pentium 166 MMX (Linux 2.0.34-pre12, libc 5.4.38 etc). I tried different
compilers from the gcc/pgcc family and different options. I am focused on
compilation options giving rather small executables, so -O4 and higher have
not been tested.

The program is very FPU intensive, written in FORTRAN. Everything has been
translated to C using "f2c -a". All executables are linked with the same
libgcc.a (from gcc 2.7.2.3). Only user times are given in the table below.

What has been tested?

1. Gcc version 2.7.2.3.
   -m386 -O2 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math

2. Old pgcc (gcc 2.7.2 based).
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer

3. Pgcc 1.0.2 without haifa.
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions

4. Pgcc 1.0.2 with haifa.
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions

5. Gcc version 2.7.2.3 (data aligning disabled).
   -m386 -O2 -malign-jumps=0 -malign-loops=0 -malign-functions=0 -ffast-math

6. Pgcc 1.0.2 without haifa (code aligning enabled).
   -mpentium -O3 -malign-jumps=2 -malign-loops=2 -malign-functions=2
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions

7. Pgcc 1.0.2 without haifa (strength-reduce disabled).
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions -fno-strength-reduce

Variant                  1        2        3        4        5        6        7
Executable size    2425144  2437764  2510988  2521100  2394536  2529196  2476332
(bytes)

User times (s); the fastest variant for each test is given in parentheses

test 1 (7)           66.04    72.10    66.96    67.49    86.20    66.95    65.85
test 2 (4)          481.31   453.52   450.19   447.64   574.45   458.02   455.63
test 3 (7)           34.53    34.17    31.23    31.62    45.05    31.33    30.90
test 4 (6)           37.26    38.27    36.30    36.77    51.02    36.11    37.54
test 5 (3)          445.65   445.30   402.96   408.32   555.43   404.08   417.13
test 6 (3)           24.27    24.48    22.83    23.20    29.60    22.85    22.93
test 7 (4)          312.41   323.19   294.18   294.15   406.39   296.46   302.80
sum 1-7            1401.47  1391.03  1304.65  1309.19  1748.14  1315.80  1332.78
rank                   (6)      (5)      (1)      (2)      (7)      (3)      (4)

Some general conclusions (valid only for FPU intensive, f2c translated code,
of course):

1. Old gcc (2.7.2.3) gives smaller executables, even when "double precision"
   variables are aligned (-malign-double).

2. This data alignment is critical for the efficiency of the code (as pointed
   out in the pgcc FAQ): column 5, compiled without -malign-double, is the
   slowest in every test. On the other hand, programs which are not CPU
   intensive could be compiled with the column 5 options plus
   -fno-strength-reduce (with gcc 2.7.2.3, of course). You will get small
   executables, smaller than with egcs/pgcc.

3. It is not true that the old pgcc (2.7.2 based) gives faster code than the
   new one. In some cases it can even be slower than the code produced by
   gcc 2.7.2.3 with -malign-double.

4. The haifa scheduler does not give more efficient code (compare columns 3
   and 4).

5. Code alignment is meaningless: compare columns 3 and 6. Maybe the Big
   Theory of Programming says "it is the most important thing for
   performance". I got only code bloat.

6. Strength reduction gives code bloat, but the performance is slightly
   better (column 3 has it enabled, column 7 uses -fno-strength-reduce).

7. There is no single optimal set of compilation options for all code.
   There are execution paths where other variants run faster than 3 (which
   is the best overall).

I will make a similar comparison for a program which doesn't use the FPU.
Bzip2 seems to be a good candidate.

Krzysztof
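
For anyone who wants to check the arithmetic in the table, the column sums,
the overall ranking and the per-test fastest variants can be recomputed from
the per-test user times. The following is only a minimal Python sketch: the
names TIMES, column_sums, ranking and fastest_variant are made up for this
illustration and were not part of the original measurements. The numbers are
copied from the table.

```python
# User times in seconds, copied from the table; list positions are variants 1..7.
TIMES = {
    "test 1": [66.04, 72.10, 66.96, 67.49, 86.20, 66.95, 65.85],
    "test 2": [481.31, 453.52, 450.19, 447.64, 574.45, 458.02, 455.63],
    "test 3": [34.53, 34.17, 31.23, 31.62, 45.05, 31.33, 30.90],
    "test 4": [37.26, 38.27, 36.30, 36.77, 51.02, 36.11, 37.54],
    "test 5": [445.65, 445.30, 402.96, 408.32, 555.43, 404.08, 417.13],
    "test 6": [24.27, 24.48, 22.83, 23.20, 29.60, 22.85, 22.93],
    "test 7": [312.41, 323.19, 294.18, 294.15, 406.39, 296.46, 302.80],
}


def column_sums(times):
    """Total user time per variant over all tests, rounded to 0.01 s."""
    return [round(sum(column), 2) for column in zip(*times.values())]


def ranking(sums):
    """Rank of each variant in column order (1 = smallest total time)."""
    order = sorted(range(len(sums)), key=lambda i: sums[i])
    ranks = [0] * len(sums)
    for place, index in enumerate(order, start=1):
        ranks[index] = place
    return ranks


def fastest_variant(times):
    """Number (1..7) of the variant with the smallest time in each test."""
    return {name: row.index(min(row)) + 1 for name, row in times.items()}


sums = column_sums(TIMES)
print("sums:   ", sums)
print("ranks:  ", ranking(sums))
print("fastest:", fastest_variant(TIMES))

# Cost of dropping -malign-double with gcc 2.7.2.3: variant 5 vs. variant 1.
slowdown_percent = 100.0 * (sums[4] / sums[0] - 1.0)
print("variant 5 vs. 1: %.1f%% slower" % slowdown_percent)
```

The last line puts a number on conclusion 2: with the same compiler and
otherwise identical options, leaving out -malign-double costs roughly a
quarter of the total run time on these seven tests.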