Date: Tue, 18 Jan 2000 10:20:46 +0200 (IST)
From: Eli Zaretskii
X-Sender: eliz AT is
To: Dieter Buerssner
cc: djgpp AT delorie DOT com
Subject: Re: gcc optimization (Was: Executable size: limit to acceptability?)
In-Reply-To: <85vmml$23rse$1@fu-berlin.de>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by delorie.com id DAA00447
Reply-To: djgpp AT delorie DOT com
Errors-To: dj-admin AT delorie DOT com
X-Mailing-List: djgpp AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

On 17 Jan 2000, Dieter Buerssner wrote:

> Btw. I always use -O. For my programs this always almost produces
> the fastest code. And I tried this with various hardware and
> with FPU intensive code as well as with code that doesn´t
> need the FPU. I started with djgpp on a 386SX with 16 MHz.
> I thought this may be due to my ancient hardware. I upgraded to
> an 486 with 66 MHz and used the -m486. Often the -m486 would result
> in slower code and -O2 almost always produced slower code then -O.
> Recently I upgraded to AMD K6-2 266 MHz (running at 333 MHz).
> The same thing. Almost all my programs run fastest when compiled
> with -O (and -fomit-frame-pointer -ffast-math).
>
> Also I upgraded gcc. From (I believe) 1.39 upto 2.9.2 now.
> The fastest code seems to be produced by 2.7.3. Even when I
> compile with -march=k6 or -march=586 with 2.9.2, it won´t produce
> faster code then 2.7.3 in the examples I tested [1].
>
> So, am I stupid or has anybody got similar experiences?

Some quantitative data about the relative speed of -O and -O2 would
probably get our feet on the ground when discussing this. Without the
numbers, we are just waving hands here.

Having said that, here's what I know about this (see also section 14.2
in the FAQ):

-O and -O2 usually produce very similar code, with a slight advantage
for -O2.
But, for any specific code, it's quite possible that the combination of
optimization options defined by -O is a larger win than the combination
defined by -O2. In particular, your code might be overflowing the CPU
cache under -O2, but not under -O.

Most of the strange effects like what you describe are due to alignment
problems. The causes for these problems are distributed between the
compiler, Binutils, and the library in a complex way that changes
depending on the versions you are using. Short summary:

  * The library wasn't aligning assembly functions and labels until
    v2.03. Library functions written in C are not aligned optimally
    due to problems with GCC versions before 2.9x (the v2.03 library
    was compiled with GCC 2.8.1).

  * GCC and Binutils were configured inconsistently as far as the
    meaning of the .align directive is concerned. This caused code and
    data to be misaligned.

  * GCC 2.9x finally gets the alignment right (and also produces the
    right .align directives that avoid cache misses on a Pentium).

  * Binutils 2.8.1 and even 2.9.1 (for which there's no official port
    yet) still align subsections on 4-byte boundaries, which can
    easily cause significant run-time penalties in code that branches
    and calls functions a lot. The next version of Binutils will
    correct that.

  * All versions of GCC before 2.9x were misaligning the stack, in
    particular if the program used the double float data type.

So, to get rid of the alignment problems at this time, you need to (in
the order mentioned):

  - build Binutils with a patch that bumps up the subsection alignment;

  - rebuild libc.a with GCC 2.95.2 and the patched Binutils;

  - recompile and relink your program with GCC 2.95.2 and the patched
    Binutils.

I suspect that the second step above will require changing the
compiler switches used for the library build (some of them define
alignment).