From: jesse AT lenny DOT dseg DOT ti DOT com (Jesse Bennett) Newsgroups: comp.os.msdos.djgpp Subject: Re: Netlib code [was Re: flops...] Date: 2 Mar 1997 22:31:49 GMT Organization: Texas Instruments Lines: 178 Message-ID: <5fcv4l$ag6$1@superb.csc.ti.com> References: Reply-To: jbennett AT ti DOT com (Jesse Bennett) NNTP-Posting-Host: lenny.dseg.ti.com Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii To: djgpp AT delorie DOT com DJ-Gateway: from newsgroup comp.os.msdos.djgpp [Posted and mailed] This is getting somewhat off-topic in an already off-topic thread. I must also admit that I am not currently using DJ's gcc port (I do most of my code development using Linux w/gcc) so this is probably not even appropriate for this newsgroup. OTOH, the issues discussed here are relevant to those using DJGPP since they are gcc/g77 specific. Respondents may want to consider an email reply if they do not consider the subject matter applicable to the newsgroup. So, to continue the discussion concerning numerical algorithm performance with gcc/g77 ... In article , Dave Love writes: >>>>>> "Jesse" == Jesse W Bennett writes: > > Jesse> void gemm( int m, int n, int k, double **a, double **b, double **c ) > Jesse> { > > Jesse> /* C = AB + C */ > > Jesse> int i, j, l; > Jesse> double temp; > > Jesse> for( i=0; i Jesse> for( l=0; l Jesse> { > Jesse> temp = a[i][l]; > Jesse> for( j=0; j Jesse> c[i][j] += temp * b[l][j]; > Jesse> } > Jesse> } > > Jesse> compiled with gcc -O2 -S gemm.c > > Jesse> The generated assembly for the inner loop is: > > Jesse> L13: > Jesse> movl (%edi),%edx > Jesse> movl (%esi),%eax > Jesse> fld %st(0) > Jesse> fmull (%eax,%ecx,8) > Jesse> faddl (%edx,%ecx,8) > Jesse> fstpl (%edx,%ecx,8) > Jesse> incl %ecx > Jesse> cmpl %ecx,12(%ebp) > Jesse> jg L13 > > Jesse> It is not clear to me why the edx and eax registers are being reloaded > Jesse> each iteration. > > I can't show DJGPP G77 o/p at present, but assume the generated code > would be the same as this. (On 586 and especially on ppro, the speed > will actually be determined by how your double words happen to get > aligned, sigh.) I had not given this much thought, but this is potentially a serious issue from a performance standpoint. Are there any workarounds for the 64 bit alignment problem on 32 bit machines? I can think of ways to insure the proper alignment using C/malloc but I would be interested in what, if anything, is being considered by the gcc/g77 developers. > $ cat a.f > subroutine gemm(m, n, k, a, b, c) > integer i,m,n,k,l,j > double precision a(n,m), b(n,m), c(n,m) > do i=1,m ! poor for illustration only > do l=1,k > do j=1,n > c(j,i) = c(j,i) + a(l,i)*b(j,l) > end do > end do > end do > end > $ g77 -S -O2 -v a.f > g77 version 0.5.19.1 > gcc -S -O2 -v -xf77 a.f > Reading specs from /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/specs > gcc version 2.7.2.1.f.1 > /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/f771 a.f -fset-g77-defaults -qu > iet -dumpbase a.f -O2 -version -fversion -o a.s > GNU F77 version 2.7.2.1.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.1.f > .1. > GNU Fortran Front End version 0.5.19.1 compiled: Feb 1 1997 19:51:03 > $ more +/L13 a.s > > ...skipping > addl 24(%ebp),%eax > .align 4 > .L13: > movl -24(%ebp),%edi > fldl (%edi) > fmull (%eax) > faddl (%edx) > fstpl (%edx) > addl $8,%eax > addl $8,%edx > decl %ecx > jns .L13 > .L8: The above assembler looks strikingly like what I get with the C pointer-based implementation. It is also a good example of two deficiencies I see consistently with the GNU compilers: 1. Unnecessary memory references. The movl -24(%ebp),%edi opcode should have been moved out of the inner loop since %edi is not modified inside the loop. 2. The pointer aliasing feature/problem (depending on your perspective). In the C language it is a "feature" so the programmer must create temporary variables to avoid the "problem". In FORTRAN it should not be a problem, but I suppose the use of the gcc backend makes this obvious optimization difficult. I understand that this optimization will be included in the next G77 release (0.5.20). Does that mean GCC users will have a -noalias optimization switch? I get a much better implementation with the following modification. radteam:~$ cat a.f subroutine gemm(m, n, k, a, b, c) integer i,m,n,k,l,j double precision a(n,m), b(n,m), c(n,m), temp do i=1,m ! poor for illustration only do l=1,k temp = a(l,i) do j=1,n c(j,i) = c(j,i) + temp*b(j,l) end do end do end do end radteam:~$ g77 -S -O2 -v a.f gcc -S -O2 -v -xf77 a.f Reading specs from /usr/lib/gcc-lib/i486-linux/2.7.2/specs gcc version 2.7.2.f.1 /usr/lib/gcc-lib/i486-linux/2.7.2/f771 a.f -fset-g77-defaults -quiet -dumpbase a.f -O2 -version -fversion -o a.s GNU F77 version 2.7.2.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.f.1. GNU Fortran Front End version 0.5.18 compiled: Apr 8 1996 14:27:32 radteam:~$ more +/L13 a.s ...skipping addl 24(%ebp),%eax .align 4 .L13: fld %st(0) fmull (%eax) faddl (%edx) fstpl (%edx) addl $8,%eax addl $8,%edx decl %ecx jns .L13 .L17: Note that the innermost loop now has only 3 memory references instead of the 5 in the original implementation. It is also virtually identical to the code generated by my pointer-based C implementation (the only difference is the decrement instr. is replaced by a compare). Since the performance of software on modern CPU's is dominated by the number (and pattern) of memory references it is worthwhile to take whatever steps are necessary to minimize and localize these references. BTW, I also found your comment in the source code (poor for illustration only) interesting. Short of "hacks" such as loop unrolling and more sophisticated methods for localizing memory references, e.g. blocking (tiling), what sort of improvements would you suggest? Best Regards, Jesse