Date: Tue, 18 Jan 2000 10:20:46 +0200 (IST)
From: Eli Zaretskii <eliz AT is DOT elta DOT co DOT il>
X-Sender: eliz AT is
To: Dieter Buerssner <buers AT gmx DOT de>
cc: djgpp AT delorie DOT com
Subject: Re: gcc optimization (Was: Executable size: limit to acceptability?)
In-Reply-To: <85vmml$23rse$1@fu-berlin.de>
Message-ID: <Pine.SUN.3.91.1000118102026.3041R-100000@is>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by delorie.com id DAA00447
Reply-To: djgpp AT delorie DOT com
Errors-To: dj-admin AT delorie DOT com
X-Mailing-List: djgpp AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk


On 17 Jan 2000, Dieter Buerssner wrote:

> Btw. I always use -O. For my programs this almost always produces
> the fastest code. And I tried this with various hardware and 
> with FPU intensive code as well as with code that doesn't 
> need the FPU. I started with djgpp on a 386SX with 16 MHz.
> I thought this may be due to my ancient hardware. I upgraded to
> a 486 with 66 MHz and used the -m486. Often the -m486 would result
> in slower code and -O2 almost always produced slower code than -O.
> Recently I upgraded to AMD K6-2 266 MHz (running at 333 MHz).
> The same thing. Almost all my programs run fastest when compiled
> with -O (and -fomit-frame-pointer -ffast-math).
> 
> Also I upgraded gcc. From (I believe) 1.39 up to 2.9.2 now.
> The fastest code seems to be produced by 2.7.3. Even when I
> compile with -march=k6 or -march=586 with 2.9.2, it won't produce 
> faster code than 2.7.3 in the examples I tested [1].
> 
> So, am I stupid or has anybody got similar experiences?

Some quantitative data about the relative speed of -O and -O2 would
probably get our feet on the ground when discussing this.  Without the
numbers, we are just waving hands here.
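
One way to get such numbers is a small timing harness built twice
from the same source, once with -O and once with -O2.  The sketch
below is only an illustration: workload() is a hypothetical stand-in
for the code you actually care about, and time_workload() is a name
made up for this example.  It measures CPU time with the standard
clock() function.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical workload: replace with the code you want to measure. */
static double workload(unsigned long n)
{
    double sum = 0.0;
    unsigned long i;

    for (i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* Run workload() reps times; return the elapsed CPU time in seconds.
   The accumulated result is handed back through *result so that the
   compiler cannot discard the loop as dead code. */
double time_workload(unsigned long n, int reps, double *result)
{
    clock_t start;
    double acc = 0.0;
    int i;

    assert(reps > 0 && result != NULL);

    start = clock();
    for (i = 0; i < reps; i++)
        acc += workload(n);
    *result = acc;

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call time_workload() from main(), print the returned seconds, and
compare the figures from the two builds.  Choose n and reps large
enough that a run lasts several seconds, since clock() can be quite
coarse.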

Having said that, here's what I know about this (see also section 14.2
in the FAQ):

-O and -O2 usually produce very similar code, with a slight advantage
for -O2.  But, for any specific code, it's quite possible that the
combination of optimization options enabled by -O is a larger win
than the combination enabled by -O2.  In particular, your code might
be overflowing the CPU cache under -O2, but not under -O.

Most of the strange effects like what you describe are due to
alignment problems.  The causes for these problems are distributed
between the compiler, Binutils, and the library in a complex way that
changes depending on the versions you are using.  Short summary:

  * The library wasn't aligning assembly functions and labels until
    v2.03.  Library functions written in C are not aligned optimally
    due to problems with GCC versions before 2.9x (the v2.03 library
    was compiled with GCC 2.8.1).

  * GCC and Binutils were configured inconsistently as far as the
    meaning of the .align directive is concerned.  This caused code
    and data to be misaligned.

  * GCC 2.9x finally gets the alignment right (and also produces the
    right .align directives that avoid cache misses on a Pentium).

  * Binutils 2.8.1 and even 2.9.1 (for which there's no official port
    yet) still align subsections on 4-byte boundaries, which can
    easily cause significant run-time penalties in code that branches
    and calls functions a lot.  The next version of Binutils will
    correct that.

  * All versions of GCC before 2.9x were misaligning the stack, in
    particular if the program used the double data type.

So, to get rid of the alignment problems at this time, you need to do
the following (in the order listed):

  - build Binutils with a patch that bumps up the subsection
    alignment;
  - rebuild libc.a with GCC 2.95.2 and the patched Binutils;
  - recompile and relink your program with GCC 2.95.2 and the patched
    Binutils.

I suspect that step 2 above will require changing the compiler
switches used for the library build (some of them define alignment).
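
If you want to see whether a given build still suffers from these
problems, you can look at the addresses at run time.  The following
is a rough sketch, not a complete test: is_aligned() and
stack_double_is_aligned() are names invented for this example, and
the cast from a pointer to unsigned long assumes that the low-order
bits of the address survive the conversion, which holds with GCC on
the x86.

```c
#include <assert.h>
#include <stdio.h>

/* Return nonzero if p lies on a boundary-byte boundary (power of two). */
int is_aligned(const void *p, unsigned long boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((unsigned long)p & (boundary - 1)) == 0;
}

/* Return nonzero if a local double landed on an 8-byte boundary
   (more precisely, on a sizeof(double) boundary) in this call.
   volatile keeps the variable in memory instead of a register. */
int stack_double_is_aligned(void)
{
    volatile double d = 0.0;

    return is_aligned((const void *)&d, (unsigned long)sizeof(double));
}

/* Print the result for the current call depth. */
void report_stack_alignment(void)
{
    printf("local double aligned: %s\n",
           stack_double_is_aligned() ? "yes" : "no");
}
```

A single call proves little, because the answer depends on how the
stack happened to be aligned by the callers.  Calling
report_stack_alignment() from several call depths gives a better
picture.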