Mail Archives: djgpp/1997/02/26/12:39:54
On Sun, 23 Feb 1997, Eli Zaretskii wrote:
> On 20 Feb 1997, Jesse Bennett wrote:
[snip]
> > I have done some benchmarking of the above matrix multiply code on a
> > DEC Alpha using the native DEC compilers. The "pointer based" C
> > implementation was fastest, followed by the F77 code and the "array
> > based" C code. Hence my comments.
>
> Try it with gcc. In most cases, it converts the array-based code to
> pointer-based automatically, as far as I could see, and in many cases it
> does that better than you would.
Hi Eli,
I tried this on a Linux box with gcc 2.6.3 and 2.7.2 and the results were
encouraging, but the pointer based code was still slightly faster. When
I looked at the generated assembly I could see that the array based
implementation was making better use of the x86 CISC instruction set but
the innermost instruction loop appears to have some unnecessary memory
references (I say "appears" because I am not very familiar with the
x86). The test code was:
void gemm( int m, int n, int k, double **a, double **b, double **c )
{
/* C = AB + C */
int i, j, l;
double temp;
for( i=0; i<m; i++ )
for( l=0; l<k; l++ )
{
temp = a[i][l];
for( j=0; j<n; j++ )
c[i][j] += temp * b[l][j];
}
}
compiled with gcc -O2 -S gemm.c
The generated assembly for the inner loop is:
L13:
movl (%edi),%edx
movl (%esi),%eax
fld %st(0)
fmull (%eax,%ecx,8)
faddl (%edx,%ecx,8)
fstpl (%edx,%ecx,8)
incl %ecx
cmpl %ecx,12(%ebp)
jg L13
It is not clear to me why the edx and eax registers are being reloaded
each iteration. Is there something about the addressing mode used by the
fxxx instructions that is modifying these registers? I have also tried
(without success) to eliminate the memory reference for the loop
counter. It seems to me (bear in mind that I know zip about x86
assembly) that a much better implementation would have been:
movl (%edi),%edx
movl (%esi),%eax
movl 12(%ebp),%ebx
L13:
fld %st(0)
fmull (%eax,%ecx,8)
faddl (%edx,%ecx,8)
fstpl (%edx,%ecx,8)
incl %ecx
cmpl %ecx,%ebx
jg L13
requiring only 3 memory references in the inner loop (executed O(n^3)
times) instead of 6.
Am I missing something here?
Best Regards,
Jesse
- Raw text -