Q: I switched to v2 and my programs now run slower than when
compiled with v1.x...
Q: I timed a test program and it seems that GCC 2.8.1 produces
slower executables than GCC 2.7.2.1 did, which in turn were slower than
those produced by DJGPP v1.x. Why are we giving up so much speed as we
get newer versions?
Q: I installed Binutils 2.8.1, and my programs are now much slower
than when they are linked with Binutils 2.7!
A: In general, newer versions of GCC generate tighter, faster code than older versions. Comparison between different versions of GCC shows that they all optimize reasonably well, but it takes a different combination of optimization-related options to achieve the greatest speed in each compiler version. The default optimization options can also change; for example, -fforce-mem is switched on by -O2 in 2.7.2.1; it wasn't before. GCC offers a plethora of optimization options which might make your code faster or slower (see the GCC docs for a complete list); the best way to find the correct combination for a given program is to profile and experiment. Here are some tips:
Try compiling with -O2 -mpentium -fomit-frame-pointer -ffast-math. (For PGCC and GCC version 2.95 and later, use -O6 instead of -O2.)
Compile with -S (see getting assembly listing), and examine the machine code.
Try the -fforce-addr option. This option helps a lot if a couple of pointers are used heavily within a single loop. If there are a lot of memory references, try adding -fno-force-mem, to prevent GCC from repeatedly copying variables from memory into registers.
-fomit-frame-pointer might make things worse, since it uses stack-relative addresses, which have a longer encoding and could therefore overflow the CPU cache. So try with and without this switch.
Try the -mpreferred-stack-boundary=2 compiler option. This causes the compiler to relax its stack-alignment requirements, which otherwise need a lot of sub esp,xx instructions. The default stack alignment is 16 bytes, unless overridden by -mpreferred-stack-boundary. The argument to this option is the power of 2 used for alignment, so 2 means 4-byte alignment; if your code uses double and long double variables, an argument of 3 (8-byte alignment) might be a better choice.
Experiment with the alignment of loops (-malign-loops), jumps (-malign-jumps), and function entry points (-malign-functions). Alignment changes can have especially profound effects when programs are run on AMD's K6 CPU, since these CPUs suffer significant slowdown for code aligned on 4-byte boundaries.
Try -funroll-loops and -funroll-all-loops, and profile the effect.
Try disabling strength reduction with -fno-strength-reduce. In some cases where GCC is in dire need of registers, this could be a substantial win, since strength reduction typically results in using additional registers to replace multiplication with addition.
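To make the register cost of strength reduction concrete, here is a hand-written sketch of the transformation. The function names and the strided-sum task are made up for illustration, and the second function only approximates what the optimizer does internally; the actual code GCC emits depends on the version and options, so inspect the assembly to see what happens in your program.

```cpp
#include <cstddef>

// Hypothetical example: sum every `stride`-th element of an array.
// As written, each index is computed with a multiplication.
long sum_with_multiply(const int *data, std::size_t count, std::size_t stride)
{
    long total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += data[i * stride];      // one multiplication per iteration
    return total;
}

// Roughly the shape of the loop after strength reduction: the
// multiplication is replaced by a running offset that grows by `stride`
// on every pass. The extra variable `offset` needs a register of its
// own, which is the cost the text describes.
long sum_with_addition(const int *data, std::size_t count, std::size_t stride)
{
    long total = 0;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += data[offset];
        offset += stride;               // addition instead of multiplication
    }
    return total;
}
```

Both functions return the same result; the difference is only in how many values must be kept live inside the loop, which matters most on a register-starved CPU such as the x86.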
I'm told that the PGCC version of GCC has bugs in its optimizer which show when you use level 7 or higher. Until that is solved in some future version, you are advised to stick to -O6. Some programs actually run faster when compiled with -O2 or -O3, even when compiled with PGCC, so you might try that as well. Several users reported that PGCC v2.95.1 tends to crash a lot during compilation, especially with the -O5, -O6 and -mpentium options. (In general, PGCC version 2.95 is deemed buggy; you are advised not to use it.)
Programs which manipulate multi-dimensional arrays inside their innermost loops can sometimes gain speed by switching from dynamically allocated arrays to static ones. This can speed up code because the size of a static array is known to GCC at compile time, which allows it to avoid dedicating a CPU register to computing offsets. This register is then available for general-purpose use.
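The difference between the two kinds of arrays can be sketched as follows. The dimensions, the fill pattern, and the function names are hypothetical and chosen only for illustration. In the static version the row length is a compile-time constant, so the offset of each element can be computed with constants; in the dynamic version the row length `cols` is a run-time value that has to be kept somewhere (ideally a register) for the whole inner loop.

```cpp
#include <cstddef>

// Hypothetical dimensions, chosen only for illustration.
const std::size_t ROWS = 4;
const std::size_t COLS = 8;

// Static array: both dimensions are known at compile time.
static double static_grid[ROWS][COLS];

// Fill and sum the static array. The compiler knows COLS, so the
// address of static_grid[i][j] involves only constants plus i and j.
double demo_static()
{
    double total = 0.0;
    for (std::size_t i = 0; i < ROWS; ++i)
        for (std::size_t j = 0; j < COLS; ++j) {
            static_grid[i][j] = 10.0 * i + j;
            total += static_grid[i][j];
        }
    return total;
}

// Fill and sum a dynamically allocated array of the same shape.
// Here `cols` is only known at run time, so it must stay live
// throughout the loop in order to compute i * cols + j.
double demo_dynamic(std::size_t rows, std::size_t cols)
{
    double *grid = new double[rows * cols];
    double total = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) {
            grid[i * cols + j] = 10.0 * i + j;
            total += grid[i * cols + j];
        }
    delete[] grid;
    return total;
}
```

Whether the static form is actually faster in a given program is something to measure; the sketch only shows where the extra run-time value comes from.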
Another problem, related to C++ programs which manipulate arrays, happens when you fail to qualify the methods used for array manipulation as inline. Each method or function that wasn't declared inline will normally not be inlined by GCC, and will incur the overhead of a function call at run time.
However, inlining only helps with small functions/methods; large inlined functions will overflow the CPU cache and typically slow down the code instead of speeding it up.
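As a minimal sketch of what qualifying array-manipulation methods as inline looks like, consider the small class below. The class `Vec3` and its members are hypothetical. Note that `inline` is a request rather than a command: the compiler may decline it, and GCC normally inlines only when optimization is enabled.

```cpp
#include <cstddef>

// Hypothetical fixed-size vector class, used only for illustration.
class Vec3 {
public:
    Vec3(double x, double y, double z)
    {
        v_[0] = x;
        v_[1] = y;
        v_[2] = z;
    }

    // Defined inside the class body, so it is implicitly inline.
    double get(std::size_t i) const { return v_[i]; }

    // Declared here, defined below; the definition carries `inline`.
    double dot(const Vec3 &other) const;

private:
    double v_[3];
};

// A method defined outside the class body needs the explicit `inline`
// keyword to become a candidate for inlining at its call sites.
inline double Vec3::dot(const Vec3 &other) const
{
    return v_[0] * other.v_[0]
         + v_[1] * other.v_[1]
         + v_[2] * other.v_[2];
}
```

Small accessors like `get` and `dot` are the typical beneficiaries; for larger methods, the caveat about code size applies.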
If your CPU is AMD's K6, try upgrading to GCC 2.96 or later and use the -mcpu=k6 switch. I'm told that K6-specific optimizations are much better in these versions of GCC.
A bug in the startup code distributed with DJGPP versions before v2.02 can also be a reason for the slow-down. The problem is that the runtime stack of DJGPP programs was not guaranteed to be properly aligned. This usually only shows up on Windows (since CWSDPMI aligns the stack on its own), and even then only sometimes. But it has been reported that switching to Binutils 2.8.1 sometimes causes such a slow-down, and switching to PGCC can reveal this problem as well. In some cases, restarting Windows would cause programs to run at normal speed again. If you experience such problems too often, upgrade to v2.02.
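If you suspect a misaligned stack, one rough way to check is to look at the address of a local variable. The helpers below are a hypothetical diagnostic sketch, not part of DJGPP; they assume a compiler that provides `<cstdint>` (on a very old GCC you would need to cast the pointer to `unsigned long` instead). A single probe proves little, since the compiler is free to place locals wherever it likes, so treat the result as a hint.

```cpp
#include <cstddef>
#include <cstdint>

// Return true if address `p` is a multiple of `boundary` bytes.
bool is_aligned(const volatile void *p, std::size_t boundary)
{
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}

// Report whether a local double in this stack frame happens to sit on
// an 8-byte boundary. `volatile` keeps the variable in memory so that
// it really has a stack address.
bool local_double_is_aligned()
{
    volatile double probe = 0.0;
    return is_aligned(&probe, 8);
}
```

Calling `local_double_is_aligned()` from several places in a slow program and printing the results can show whether the slow runs coincide with misaligned doubles.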