Sender: reimer AT schwan DOT e-technik DOT tu-ilmenau DOT de
Message-ID: <36FC6B66.D36CD9CE@e-technik.tu-ilmenau.de>
Date: Sat, 27 Mar 1999 06:23:50 +0100
From: Wolfgang Reimer
Organization: Technical Univ. of Ilmenau
X-Mailer: Mozilla 4.5 [en] (X11; I; Linux 2.1.125 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: pgcc AT delorie DOT com
Subject: [Fwd: Re: Aligning stack variables [8-byte operands]]
Content-Type: multipart/mixed; boundary="------------FABEDD86F58C5281C28DCE19"
Reply-To: pgcc AT delorie DOT com

This is a multi-part message in MIME format.
--------------FABEDD86F58C5281C28DCE19
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

FYI

-------- Original Message --------
Subject: Re: Aligning stack variables [8-byte operands]
Date: Sat, 27 Mar 1999 06:19:22 +0100
From: Wolfgang Reimer
Organization: Technical Univ. of Ilmenau
To: John Wehle
CC: holzloehner AT umbc DOT edu, Bernd Schmidt, VP Developers, fftw AT theory DOT lcs DOT mit DOT edu
References: <199903270013 DOT TAA00443 AT jwlab DOT FEITH DOT COM>

John Wehle wrote:
>
> > I have tried compiling the FFTW package (a package from MIT for fast
> > Fourier transforms) with a custom option provided by the software
> > authors, that allowed to also align stack variables on 8-byte boundaries
> > -- the speed improvement was dramatic. I have been wondering for a long
> > time why it is so hard to align stack variables, when an option such as
> > -malign-double already can align globals?
>
> Actually egcs always aligns constants, globals, statics for the x86 (assuming
> the object file format supports it) so it isn't necessary to use -malign-double
> for this purpose. -malign-double changes the alignment rules for doubles
> which can improve the performance though it breaks binary compatibility.
>
> The problem of aligning stack variables on 64 bit (or 128 bit) boundary is
> that there is no clear approach which is a guaranteed win.
> The alignment can be accomplished by:
>
>   1) Always keeping the stack aligned.
>
>      a) Either by always preallocating the stack frame (including outgoing
>         argument space) in multiples of 64 bits (or 128 bits). This means
>         that call arguments aren't pushed onto the stack, instead they are
>         moved to the stack.
>
>         Negatives: Move is bigger opcode than push.
>
>         Pluses: Don't require adjustment at each call site.
>
>      b) Or by adjusting the stack at each call site.
>
>         Negatives: More instructions.
>
>         Pluses: Can use push.
>
>   2) Use a register to align the stack when necessary.
>
>      Negatives: Burns a register.
>
>      Pluses: Less instructions.
>
> Keep in mind that integer code doesn't benefit from more alignment and
> that it's desirable to avoid impacting integer performance. As with
> many things it is a question of tradeoffs.
>
[ lines deleted ]
>
> -- John

Hi John,

I created a small test program which shows impressively how important
8-byte stack alignment of 8-byte operands is to Intel Pentium floating
point performance. The speedup of aligned versus unaligned code is not
just some ten percent (as is usual for other Pentium-specific
optimizations). On my dual PentiumPro/200 Linux box, properly aligned
code runs about 150% faster (a speed ratio of about 2.5!) than misaligned
code (I mean the alignment of doubles on the stack). And this is true not
only for my specially designed test program but also for the FFTW
(http://theory.lcs.mit.edu/~fftw) code, which is our most essential
application when doing split-step simulations of optical fibers.

So for me (and other number crunching guys) the alignment of 8-byte
operands (doubles) on the stack would improve Intel P6 performance much
more than any other sophisticated Pentium optimization strategy. That's
why it should be "Number ONE" on the TODO list of the egcs compiler
development group, IMHO.

With pgcc-2.91.60 19981201 (egcs-1.1.1 release) and previous versions
there is a compiler flag "-mstack-align-double" but unfortunately it
does not work properly. Sometimes all doubles are aligned properly and
sometimes all doubles are misaligned.

I attached my small test program, a tiny makefile, and the log file of
the output from the test run on my computer. The program can be built
and run with "make test". Additionally, I built a statically linked
binary (gzipped about 45k) which should run under all types of ELF Linux
systems. If somebody is interested, just let me know and I will send it.

Best regards,
--
Wolfgang Reimer (Dr.-Ing.)
T U I -- Technical University of Ilmenau, GERMANY, Thuringia
Address: TU Ilmenau, FEI/IKM, PF 100565, 98684 Ilmenau, GERMANY
http://ikmcip1.e-technik.tu-ilmenau.de
Phone: +49-3677-69-2619
mailto:reimer AT e-technik DOT tu-ilmenau DOT de
Fax  : +49-3677-69-1195

--------------FABEDD86F58C5281C28DCE19
Content-Type: text/plain; charset=us-ascii; name="stackalign.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="stackalign.c"

/* This small test program shows how strongly the run time on
 * Intel P5 and P6 depends on correct 8-byte alignment of 8-byte operands
 * (doubles) on the stack. On my P6 the speed ratio (misaligned/aligned)
 * is up to 2.5!
 *
 * When compiling with -O2 or higher you must compile this code with
 *   -fno-inline-functions (otherwise the compiler will expand
 *                          the function test_loop() inline) and
 *   -ffloat-store (otherwise the compiler will use registers only
 *                  instead of accessing the stack).
 *
 * Author: Wolfgang M. Reimer, mailto:reimer AT e-technik DOT tu-ilmenau DOT de
 *         Idea of stack alignment manipulation and checking is
 *         stolen from FFTW code (http://theory.lcs.mit.edu/~fftw)
 * Date  : 99/03/27
 */

#include <stdio.h>      /* printf, fflush */
#include <sys/time.h>   /* gettimeofday, struct timeval */

#define LOOPS 10000000L

/* Floating point work on eleven stack doubles; reports their alignment. */
double test_loop(int *aligned)
{
  double a=1, b=2, c=2, d=1, e=0.5, f=0.5, g=1, h=2, i=2, j=1, k=0.5;
  int z;

  /* check double alignment */
  *aligned = ((((long) &k) & 0x7) == 0);

  for (z = 0; z < LOOPS; z++) {
    a *= k; b *= a; c *= b; d *= c; e *= d; f *= e;
    g *= f; h *= g; i *= h; j *= i; k *= j;
  }
  return k;
}

/* Same stack frame and loop overhead as test_loop(), but no arithmetic. */
double empty_loop(int *aligned)
{
  double a=1, b=2, c=2, d=1, e=0.5, f=0.5, g=1, h=2, i=2, j=1, k=0.5;
  int z;

  /* check double alignment */
  *aligned = ((((long) &k) & 0x7) == 0);

  for (z = 0; z < LOOPS; z++) {
    /* empty loop */
  }
  return k;
}

/* Returns t1 - t2 in seconds. */
double time_diff(struct timeval t1, struct timeval t2)
{
  struct timeval diff;

  diff.tv_sec = t1.tv_sec - t2.tv_sec;
  diff.tv_usec = t1.tv_usec - t2.tv_usec;
  /* normalize */
  while (diff.tv_usec < 0) {
    diff.tv_usec += 1000000L;
    diff.tv_sec -= 1;
  }
  return diff.tv_usec * 1e-6 + diff.tv_sec;
}

/* Stores the net run time (test loop minus empty loop) in timex and the
 * alignment flag in alignedx. Uses t1, t2, d, time, a and alignment
 * from the enclosing scope. */
#define GET_TIME(timex,alignedx) \
  printf("Running ... "); fflush(stdout); \
 \
  /* time the test loop */ \
  gettimeofday(&t1, 0); \
  d = test_loop(&(alignedx)); \
  gettimeofday(&t2, 0); \
  time = time_diff(t2, t1); \
 \
  /* time the empty loop */ \
  gettimeofday(&t1, 0); \
  d = empty_loop(&a); \
  gettimeofday(&t2, 0); \
  (timex) = time - time_diff(t2, t1); \
  printf("with %s aligned doubles, run time was %g seconds.\n", \
         alignment[(alignedx)], (timex)); \
  if ((alignedx) != a) \
    printf("Alignment between test and empty loop differs!\n");

int main(void)
{
  struct timeval t1, t2;
  double d, time, time1, time2, ratio;
  int aligned1, aligned2, a;
  char alignment[2][7] = {{" oddly"}, {"evenly"}};

  /* hack to align stack oddly */
  if (!(((long) (__builtin_alloca(0))) & 0x7)) __builtin_alloca(4);
  GET_TIME(time1, aligned1);

  /* hack to align stack evenly */
  if (((long) (__builtin_alloca(0))) & 0x7) __builtin_alloca(4);
  GET_TIME(time2, aligned2);

  /* always report misaligned time divided by aligned time */
  if (aligned1) ratio = time2 / time1;
  else ratio = time1 / time2;
  printf("The speed ratio (odd/even) is %g!\n", ratio);
  return 0;
}

--------------FABEDD86F58C5281C28DCE19
Content-Type: text/plain; charset=us-ascii; name="makefile"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="makefile"

# Author: Wolfgang M. Reimer, mailto:reimer AT e-technik DOT tu-ilmenau DOT de
# Date  : 99/03/27
# Run "make test" to build and run the alignment test

#OPTIMIZER = -O6 -mcpu=pentiumpro -malign-double -fomit-frame-pointer
OPTIMIZER = -O0 -mcpu=pentiumpro -malign-double -fomit-frame-pointer
TARGET = stackalign
CC = gcc
CFLAGS = -Wall -Wno-unused $(OPTIMIZER) -fno-inline-functions -ffloat-store

all: $(TARGET)

test: $(TARGET)
	@echo
	@echo "********************* System info **************************"
	uname -a
	@echo
	@echo "********************** $(CC) Version ***************************"
	$(CC) -v
	@echo
	@echo "***************** Running $(TARGET) ************************"
	./$(TARGET)

clean:
	$(RM) -rf $(TARGET).s $(TARGET).o

distclean: clean
	$(RM) -rf $(TARGET)

--------------FABEDD86F58C5281C28DCE19
Content-Type: text/plain; charset=us-ascii; name="stackalign.log"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="stackalign.log"

[reimer AT schwan stackalign]$ make test
gcc -Wall -Wno-unused -O0 -mcpu=pentiumpro -malign-double -fomit-frame-pointer -fno-inline-functions -ffloat-store stackalign.c -o stackalign

********************* System info **************************
uname -a
Linux schwan.e-technik.tu-ilmenau.de 2.1.125 #1 SMP Fri Nov 6 20:46:08 CET 1998 i686 unknown

********************** gcc Version ***************************
gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66/specs
gcc version egcs-2.91.66 19990314 (egcs-1.1.2 release)

***************** Running stackalign ************************
./stackalign
Running ... with oddly aligned doubles, run time was 14.2632 seconds.
Running ... with evenly aligned doubles, run time was 5.35753 seconds.
The speed ratio (odd/even) is 2.66227!

--------------FABEDD86F58C5281C28DCE19--