From: jesse AT lenny DOT dseg DOT ti DOT com (Jesse Bennett)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Netlib code [was Re: flops...]
Date: 2 Mar 1997 22:31:49 GMT
Organization: Texas Instruments
Lines: 178
Message-ID: <5fcv4l$ag6$1@superb.csc.ti.com>
References: <Pine DOT LNX DOT 3 DOT 91 DOT 970226105830 DOT 29585A-100000 AT lenny DOT dseg DOT ti DOT com>
    <rzqvi7d6j83 DOT fsf AT djlvig DOT dl DOT ac DOT uk>
Reply-To: jbennett AT ti DOT com (Jesse Bennett)
NNTP-Posting-Host: lenny.dseg.ti.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

[Posted and mailed]

This is getting somewhat off-topic in an already off-topic thread.  I
must also admit that I am not currently using DJ's gcc port (I do most
of my code development using Linux w/gcc) so this is probably not even
appropriate for this newsgroup.  OTOH, the issues discussed here are
relevant to those using DJGPP since they are gcc/g77 specific.
Respondents may want to consider an email reply if they do not
consider the subject matter applicable to the newsgroup.

So, to continue the discussion concerning numerical algorithm
performance with gcc/g77 ...

In article <rzqvi7d6j83 DOT fsf AT djlvig DOT dl DOT ac DOT uk>,
	Dave Love <d DOT love AT dl DOT ac DOT uk> writes:
>>>>>> "Jesse" == Jesse W Bennett <jesse AT lenny DOT dseg DOT ti DOT com> writes:
> 
>  Jesse> void gemm( int m, int n, int k, double **a, double **b, double **c )
>  Jesse> {
> 
>  Jesse> /* C = AB + C */
> 
>  Jesse>    int i, j, l;
>  Jesse>    double temp;
> 
>  Jesse>    for( i=0; i<m; i++ )
>  Jesse>       for( l=0; l<k; l++ )
>  Jesse>       {
>  Jesse>          temp = a[i][l];
>  Jesse> 	 for( j=0; j<n; j++ )
>  Jesse> 	    c[i][j] += temp * b[l][j];
>  Jesse>       }     
>  Jesse> }
> 
>  Jesse> compiled with gcc -O2 -S gemm.c
> 
>  Jesse> The generated assembly for the inner loop is:
> 
>  Jesse> L13:
>  Jesse>         movl (%edi),%edx
>  Jesse>         movl (%esi),%eax
>  Jesse>         fld %st(0)
>  Jesse>         fmull (%eax,%ecx,8)
>  Jesse>         faddl (%edx,%ecx,8)
>  Jesse>         fstpl (%edx,%ecx,8)
>  Jesse>         incl %ecx
>  Jesse>         cmpl %ecx,12(%ebp)
>  Jesse>         jg L13
> 
>  Jesse> It is not clear to me why the edx and eax registers are being reloaded 
>  Jesse> each iteration.  
> 
> I can't show DJGPP G77 o/p at present, but assume the generated code
> would be the same as this.  (On 586 and especially on ppro, the speed
> will actually be determined by how your double words happen to get
> aligned, sigh.)

I had not given this much thought, but this is potentially a serious
issue from a performance standpoint.  Are there any workarounds for
the 64 bit alignment problem on 32 bit machines?  I can think of ways
to insure the proper alignment using C/malloc but I would be
interested in what, if anything, is being considered by the gcc/g77
developers.

> $ cat a.f
>       subroutine gemm(m, n, k, a, b, c)
>       integer i,m,n,k,l,j
>       double precision a(n,m),  b(n,m),  c(n,m)
>       do i=1,m     ! poor for illustration only
>         do l=1,k
>           do j=1,n
>             c(j,i) = c(j,i) + a(l,i)*b(j,l)
>           end do
>         end do
>       end do
>       end
> $ g77 -S -O2 -v a.f
> g77 version 0.5.19.1
>  gcc -S -O2 -v -xf77 a.f
> Reading specs from /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/specs
> gcc version 2.7.2.1.f.1
>  /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/f771 a.f -fset-g77-defaults -qu
> iet -dumpbase a.f -O2 -version -fversion -o a.s
> GNU F77 version 2.7.2.1.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.1.f
> .1.
> GNU Fortran Front End version 0.5.19.1 compiled: Feb  1 1997 19:51:03
> $ more +/L13 a.s
> 
> ...skipping
>         addl 24(%ebp),%eax
>         .align 4
> .L13:
>         movl -24(%ebp),%edi
>         fldl (%edi)
>         fmull (%eax)
>         faddl (%edx)
>         fstpl (%edx)
>         addl $8,%eax
>         addl $8,%edx
>         decl %ecx
>         jns .L13
> .L8:

The above assembler looks strikingly like what I get with the C
pointer-based implementation.  It is also a good example of two
deficiencies I see consistently with the GNU compilers:

1. Unnecessary memory references.  The

          movl -24(%ebp),%edi

   opcode should have been moved out of the inner loop since %edi is
   not modified inside the loop.

2. The pointer aliasing feature/problem (depending on your
   perspective).  In the C language it is a "feature" so the
   programmer must create temporary variables to avoid the "problem".
   In FORTRAN it should not be a problem, but I suppose the use of the
   gcc backend makes this obvious optimization difficult.  I
   understand that this optimization will be included in the next G77
   release (0.5.20).  Does that mean GCC users will have a -noalias
   optimization switch?

I get a much better implementation with the following modification.

radteam:~$ cat a.f
      subroutine gemm(m, n, k, a, b, c)
      integer i,m,n,k,l,j
      double precision a(n,m),  b(n,m),  c(n,m), temp
      do i=1,m     ! poor for illustration only
        do l=1,k
          temp = a(l,i)
          do j=1,n
            c(j,i) = c(j,i) + temp*b(j,l)
          end do
        end do
      end do
      end
radteam:~$ g77 -S -O2 -v a.f
 gcc -S -O2 -v -xf77 a.f
Reading specs from /usr/lib/gcc-lib/i486-linux/2.7.2/specs
gcc version 2.7.2.f.1
 /usr/lib/gcc-lib/i486-linux/2.7.2/f771 a.f -fset-g77-defaults -quiet -dumpbase a.f -O2 -version -fversion -o a.s
GNU F77 version 2.7.2.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.f.1.
GNU Fortran Front End version 0.5.18 compiled: Apr  8 1996 14:27:32
radteam:~$ more +/L13 a.s

...skipping
        addl 24(%ebp),%eax
        .align 4
.L13:
        fld %st(0)
        fmull (%eax)
        faddl (%edx)
        fstpl (%edx)
        addl $8,%eax
        addl $8,%edx
        decl %ecx
        jns .L13
.L17:

Note that the innermost loop now has only 3 memory references instead
of the 5 in the original implementation.  It is also virtually
identical to the code generated by my pointer-based C implementation
(the only difference is the decrement instr. is replaced by a
compare).  Since the performance of software on modern CPU's is
dominated by the number (and pattern) of memory references it is
worthwhile to take whatever steps are necessary to minimize and
localize these references.

BTW, I also found your comment in the source code (poor for
illustration only) interesting.  Short of "hacks" such as loop
unrolling and more sophisticated methods for localizing memory
references, e.g. blocking (tiling), what sort of improvements would
you suggest?

Best Regards,
Jesse