Mail Archives: djgpp/1997/03/02/18:41:17
[Posted and mailed]
This is getting somewhat off-topic in an already off-topic thread. I
must also admit that I am not currently using DJ's gcc port (I do most
of my code development using Linux w/gcc) so this is probably not even
appropriate for this newsgroup. OTOH, the issues discussed here are
relevant to those using DJGPP since they are gcc/g77 specific.
Respondents may want to consider an email reply if they do not
consider the subject matter applicable to the newsgroup.
So, to continue the discussion concerning numerical algorithm
performance with gcc/g77 ...
In article <rzqvi7d6j83 DOT fsf AT djlvig DOT dl DOT ac DOT uk>,
Dave Love <d DOT love AT dl DOT ac DOT uk> writes:
>>>>>> "Jesse" == Jesse W Bennett <jesse AT lenny DOT dseg DOT ti DOT com> writes:
>
> Jesse> void gemm( int m, int n, int k, double **a, double **b, double **c )
> Jesse> {
>
> Jesse> /* C = AB + C */
>
> Jesse> int i, j, l;
> Jesse> double temp;
>
> Jesse> for( i=0; i<m; i++ )
> Jesse> for( l=0; l<k; l++ )
> Jesse> {
> Jesse> temp = a[i][l];
> Jesse> for( j=0; j<n; j++ )
> Jesse> c[i][j] += temp * b[l][j];
> Jesse> }
> Jesse> }
>
> Jesse> compiled with gcc -O2 -S gemm.c
>
> Jesse> The generated assembly for the inner loop is:
>
> Jesse> L13:
> Jesse> movl (%edi),%edx
> Jesse> movl (%esi),%eax
> Jesse> fld %st(0)
> Jesse> fmull (%eax,%ecx,8)
> Jesse> faddl (%edx,%ecx,8)
> Jesse> fstpl (%edx,%ecx,8)
> Jesse> incl %ecx
> Jesse> cmpl %ecx,12(%ebp)
> Jesse> jg L13
>
> Jesse> It is not clear to me why the edx and eax registers are being reloaded
> Jesse> each iteration.
>
> I can't show DJGPP G77 o/p at present, but assume the generated code
> would be the same as this. (On 586 and especially on ppro, the speed
> will actually be determined by how your double words happen to get
> aligned, sigh.)
I had not given this much thought, but this is potentially a serious
issue from a performance standpoint. Are there any workarounds for
the 64 bit alignment problem on 32 bit machines? I can think of ways
to insure the proper alignment using C/malloc but I would be
interested in what, if anything, is being considered by the gcc/g77
developers.
> $ cat a.f
> subroutine gemm(m, n, k, a, b, c)
> integer i,m,n,k,l,j
> double precision a(n,m), b(n,m), c(n,m)
> do i=1,m ! poor for illustration only
> do l=1,k
> do j=1,n
> c(j,i) = c(j,i) + a(l,i)*b(j,l)
> end do
> end do
> end do
> end
> $ g77 -S -O2 -v a.f
> g77 version 0.5.19.1
> gcc -S -O2 -v -xf77 a.f
> Reading specs from /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/specs
> gcc version 2.7.2.1.f.1
> /usr/lib/gcc-lib/i486-unknown-linux/2.7.2.1.f.1/f771 a.f -fset-g77-defaults -qu
> iet -dumpbase a.f -O2 -version -fversion -o a.s
> GNU F77 version 2.7.2.1.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.1.f
> .1.
> GNU Fortran Front End version 0.5.19.1 compiled: Feb 1 1997 19:51:03
> $ more +/L13 a.s
>
> ...skipping
> addl 24(%ebp),%eax
> .align 4
> .L13:
> movl -24(%ebp),%edi
> fldl (%edi)
> fmull (%eax)
> faddl (%edx)
> fstpl (%edx)
> addl $8,%eax
> addl $8,%edx
> decl %ecx
> jns .L13
> .L8:
The above assembler looks strikingly like what I get with the C
pointer-based implementation. It is also a good example of two
deficiencies I see consistently with the GNU compilers:
1. Unnecessary memory references. The
movl -24(%ebp),%edi
opcode should have been moved out of the inner loop since %edi is
not modified inside the loop.
2. The pointer aliasing feature/problem (depending on your
perspective). In the C language it is a "feature" so the
programmer must create temporary variables to avoid the "problem".
In FORTRAN it should not be a problem, but I suppose the use of the
gcc backend makes this obvious optimization difficult. I
understand that this optimization will be included in the next G77
release (0.5.20). Does that mean GCC users will have a -noalias
optimization switch?
I get a much better implementation with the following modification.
radteam:~$ cat a.f
subroutine gemm(m, n, k, a, b, c)
integer i,m,n,k,l,j
double precision a(n,m), b(n,m), c(n,m), temp
do i=1,m ! poor for illustration only
do l=1,k
temp = a(l,i)
do j=1,n
c(j,i) = c(j,i) + temp*b(j,l)
end do
end do
end do
end
radteam:~$ g77 -S -O2 -v a.f
gcc -S -O2 -v -xf77 a.f
Reading specs from /usr/lib/gcc-lib/i486-linux/2.7.2/specs
gcc version 2.7.2.f.1
/usr/lib/gcc-lib/i486-linux/2.7.2/f771 a.f -fset-g77-defaults -quiet -dumpbase a.f -O2 -version -fversion -o a.s
GNU F77 version 2.7.2.f.1 (i386 Linux/ELF) compiled by GNU C version 2.7.2.f.1.
GNU Fortran Front End version 0.5.18 compiled: Apr 8 1996 14:27:32
radteam:~$ more +/L13 a.s
...skipping
addl 24(%ebp),%eax
.align 4
.L13:
fld %st(0)
fmull (%eax)
faddl (%edx)
fstpl (%edx)
addl $8,%eax
addl $8,%edx
decl %ecx
jns .L13
.L17:
Note that the innermost loop now has only 3 memory references instead
of the 5 in the original implementation. It is also virtually
identical to the code generated by my pointer-based C implementation
(the only difference is the decrement instr. is replaced by a
compare). Since the performance of software on modern CPU's is
dominated by the number (and pattern) of memory references it is
worthwhile to take whatever steps are necessary to minimize and
localize these references.
BTW, I also found your comment in the source code (poor for
illustration only) interesting. Short of "hacks" such as loop
unrolling and more sophisticated methods for localizing memory
references, e.g. blocking (tiling), what sort of improvements would
you suggest?
Best Regards,
Jesse
- Raw text -