Message-Id: <m0ygr45-000203C@chkw386.ch.pwr.wroc.pl>
Date: Tue, 2 Jun 98 13:25 
From: strasbur AT chkw386 DOT ch DOT pwr DOT wroc DOT pl (Krzysztof Strasburger)
To: beastium-list AT Desk DOT nl
Subject: Executable sizes and performance
Sender: Marc Lehmann <pcg AT goof DOT com>


Here is a comparison of executable sizes and execution times for an old
version of GAMESS (General Atomic and Molecular Electronic Structure System)
on a Pentium 166 MMX (Linux 2.0.34-pre12, libc 5.4.38, etc.). I tried different
compilers from the gcc/pgcc family with different options. I focused on
compilation options that give rather small executables, so -O4 and higher
have not been tested.
The program is very FPU intensive and written in FORTRAN. Everything has been
translated to C using "f2c -a". All executables are linked with the same
libgcc.a (from gcc 2.7.2.3). Only user times are given in the table below.

What has been tested?

1. Gcc version 2.7.2.3.
   -m386 -O2 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math
2. Old pgcc (gcc 2.7.2 based).
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
3. Pgcc 1.0.2 without haifa.
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions
4. Pgcc 1.0.2 with haifa.
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions
5. Gcc version 2.7.2.3 (data aligning disabled).
   -m386 -O2 -malign-jumps=0 -malign-loops=0 -malign-functions=0 -ffast-math
6. Pgcc 1.0.2 without haifa (code aligning enabled).
   -mpentium -O3 -malign-jumps=2 -malign-loops=2 -malign-functions=2
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions
7. Pgcc 1.0.2 without haifa (strength-reduce disabled).
   -mpentium -O3 -malign-jumps=0 -malign-loops=0 -malign-functions=0
   -malign-double -ffast-math -fno-inline-functions -fno-omit-frame-pointer
   -fno-exceptions -fno-strength-reduce
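
To see at a glance which options separate two variants, the flag lists above can be compared as sets. This is only a small sketch: the flags are copied from the list, while the dictionary layout and the helper name flag_diff are made up for illustration. Note that variants 3 and 4 use identical flags and differ only in the compiler build (without/with haifa).

```python
# Flags copied from the list above; names and layout are illustrative only.
ALIGN0 = "-malign-jumps=0 -malign-loops=0 -malign-functions=0"
ALIGN2 = "-malign-jumps=2 -malign-loops=2 -malign-functions=2"
PGCC_COMMON = ("-malign-double -ffast-math -fno-inline-functions "
               "-fno-omit-frame-pointer")

VARIANTS = {
    1: f"-m386 -O2 {ALIGN0} -malign-double -ffast-math",
    2: f"-mpentium -O3 {ALIGN0} {PGCC_COMMON}",
    3: f"-mpentium -O3 {ALIGN0} {PGCC_COMMON} -fno-exceptions",
    4: f"-mpentium -O3 {ALIGN0} {PGCC_COMMON} -fno-exceptions",
    5: f"-m386 -O2 {ALIGN0} -ffast-math",
    6: f"-mpentium -O3 {ALIGN2} {PGCC_COMMON} -fno-exceptions",
    7: f"-mpentium -O3 {ALIGN0} {PGCC_COMMON} -fno-exceptions "
       "-fno-strength-reduce",
}


def flag_diff(a, b):
    """Return (flags only in variant a, flags only in variant b), sorted."""
    fa, fb = set(VARIANTS[a].split()), set(VARIANTS[b].split())
    return sorted(fa - fb), sorted(fb - fa)


if __name__ == "__main__":
    for a, b in [(1, 5), (3, 4), (3, 6), (3, 7)]:
        only_a, only_b = flag_diff(a, b)
        print(f"{a} vs {b}: only in {a}: {only_a}; only in {b}: {only_b}")
```

The pairs printed here are the ones compared in the conclusions: 1 vs 5 (data alignment), 3 vs 4 (haifa), 3 vs 6 (code alignment), and 3 vs 7 (strength-reduce).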

Variant               1        2        3        4        5        6        7

Executable size (bytes)
                2425144  2437764  2510988  2521100  2394536  2529196  2476332

Execution times (s); the fastest variant for each test is in parentheses
 test 1 (7)       66.04    72.10    66.96    67.49    86.20    66.95    65.85
 test 2 (4)      481.31   453.52   450.19   447.64   574.45   458.02   455.63
 test 3 (7)       34.53    34.17    31.23    31.62    45.05    31.33    30.90
 test 4 (6)       37.26    38.27    36.30    36.77    51.02    36.11    37.54
 test 5 (3)      445.65   445.30   402.96   408.32   555.43   404.08   417.13
 test 6 (3)       24.27    24.48    22.83    23.20    29.60    22.85    22.93
 test 7 (4)      312.41   323.19   294.18   294.15   406.39   296.46   302.80
 sum 1-7        1401.47  1391.03  1304.65  1309.19  1748.14  1315.80  1332.78
 rank of sum        (6)      (5)      (1)      (2)      (7)      (3)      (4)
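
The derived entries of the table (fastest variant per test, column sums, and the ranking of the sums) can be rechecked with a few lines of Python. This is a minimal sketch: the timings are copied from the table, and the function names are hypothetical helpers, not part of any benchmark tool.

```python
# Timing data copied from the table above (user seconds, variants 1..7).
TIMES = {
    "test 1": [66.04, 72.10, 66.96, 67.49, 86.20, 66.95, 65.85],
    "test 2": [481.31, 453.52, 450.19, 447.64, 574.45, 458.02, 455.63],
    "test 3": [34.53, 34.17, 31.23, 31.62, 45.05, 31.33, 30.90],
    "test 4": [37.26, 38.27, 36.30, 36.77, 51.02, 36.11, 37.54],
    "test 5": [445.65, 445.30, 402.96, 408.32, 555.43, 404.08, 417.13],
    "test 6": [24.27, 24.48, 22.83, 23.20, 29.60, 22.85, 22.93],
    "test 7": [312.41, 323.19, 294.18, 294.15, 406.39, 296.46, 302.80],
}


def fastest_variant(row):
    """1-based number of the variant with the smallest time in one row."""
    return row.index(min(row)) + 1


def column_sums(times):
    """Total time per variant, rounded to the table's two decimals."""
    return [round(sum(col), 2) for col in zip(*times.values())]


def ranks(values):
    """1-based rank of each value (1 = smallest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


if __name__ == "__main__":
    for name, row in TIMES.items():
        print(name, "fastest variant:", fastest_variant(row))
    sums = column_sums(TIMES)
    print("sums: ", sums)
    print("ranks:", ranks(sums))
```

Run on the data above, this reproduces the "sum 1-7" row and the ranking in the last row of the table.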

Some general conclusions (valid only for FPU intensive, f2c translated
code, of course):
1. Old gcc (2.7.2.3) gives smaller executables than any pgcc variant, even
   with -malign-double alignment of "double precision" variables.
2. This data alignment is critical for the efficiency of the code (as pointed
   out in the pgcc FAQ). Look at column 5: without -malign-double the total
   time is about 25% longer than in column 1. On the other hand, programs
   which are not CPU intensive could be compiled with the column 5 options
   plus -fno-strength-reduce (with gcc 2.7.2.3, of course). You will get
   small executables, smaller than with egcs/pgcc.
3. It is not true that the old pgcc (2.7.2 based) gives faster code than the
   new one. In some cases (tests 1, 4, 6 and 7) it is even slower than the
   code produced by gcc 2.7.2.3 with -malign-double.
4. The haifa scheduler does not give more efficient code (columns 3 and 4).
5. Code alignment makes no measurable difference here. Compare columns 3
   and 6. Maybe the Big Theory of Programming says "it is the most important
   thing for performance". I got only code bloat.
6. The -fstrength-reduce optimization gives code bloat, but the performance
   is slightly better (column 3 is without and column 7 with
   -fno-strength-reduce).
7. There is no single optimal set of compilation options for all code.
   There are execution paths where other variants run faster than variant 3
   (which is the best overall).
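
The size and speed trade-offs behind conclusions 2, 4, 5 and 6 can be put into percentages. The following is a rough sketch using only the sizes and summed times from the table; the helper percent_change and the labels are hypothetical names chosen for illustration.

```python
# Values copied from the table: executable size (bytes), summed user time (s).
SIZE = {1: 2425144, 2: 2437764, 3: 2510988, 4: 2521100,
        5: 2394536, 6: 2529196, 7: 2476332}
TOTAL = {1: 1401.47, 2: 1391.03, 3: 1304.65, 4: 1309.19,
         5: 1748.14, 6: 1315.80, 7: 1332.78}


def percent_change(table, variant, baseline):
    """Change of `variant` relative to `baseline`, in percent."""
    return 100.0 * (table[variant] - table[baseline]) / table[baseline]


# (variant, baseline, what differs) pairs taken from the conclusions above.
COMPARISONS = [
    (5, 1, "no -malign-double"),
    (4, 3, "haifa scheduler"),
    (6, 3, "code alignment enabled"),
    (7, 3, "-fno-strength-reduce"),
]

if __name__ == "__main__":
    for variant, baseline, label in COMPARISONS:
        d_size = percent_change(SIZE, variant, baseline)
        d_time = percent_change(TOTAL, variant, baseline)
        print(f"{variant} vs {baseline} ({label}): "
              f"size {d_size:+.2f}%, time {d_time:+.2f}%")
```

With these numbers, dropping -malign-double costs roughly a quarter more run time for a size saving of a little over 1%, while the haifa, code alignment, and strength-reduce differences all stay within a few percent in both size and time.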

I will make a similar comparison for a program which doesn't use the FPU.
Bzip2 seems to be a good candidate.

Krzysztof