From: jesse AT lenny DOT dseg DOT ti DOT com (Jesse Bennett)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Netlib code [was Re: flops...]
Date: 2 Mar 1997 23:08:56 GMT
Organization: Texas Instruments
Lines: 58
Message-ID: <5fd1a8$ag6$2@superb.csc.ti.com>
References: <Pine DOT SUN DOT 3 DOT 91 DOT 970227184200 DOT 2124C-100000 AT is>
Reply-To: jbennett AT ti DOT com (Jesse Bennett)
NNTP-Posting-Host: lenny.dseg.ti.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

In article <Pine DOT SUN DOT 3 DOT 91 DOT 970227184200 DOT 2124C-100000 AT is>,
	Eli Zaretskii <eliz AT is DOT elta DOT co DOT il> writes:
> 
> On Wed, 26 Feb 1997, Jesse W. Bennett wrote:
> 
>> I tried this on a Linux box with gcc 2.6.3 and 2.7.2 and the results were
>> encouraging, but the pointer based code was still slightly faster.
> 
> Did you try to experiment with the various optimization-related
> switches to gcc?  There are a plethora of them, all described in
> section called "Optimize Options" of the gcc on-line docs.  I suggest
> to try those which seem relevant to your inner loops, looking at the
> generated assembly and timing the results, until you find the best
> combination.
> 
>> L13:
>>         movl (%edi),%edx
>>         movl (%esi),%eax
>>         fld %st(0)
>>         fmull (%eax,%ecx,8)
>>         faddl (%edx,%ecx,8)
>>         fstpl (%edx,%ecx,8)
>>         incl %ecx
>>         cmpl %ecx,12(%ebp)
>>         jg L13
>> 
>> It is not clear to me why the edx and eax registers are being reloaded 
>> each iteration.
> 
> Maybe because GCC allows `a' or `b' to be the same as `c' at the
> caller side?  Try declaring `a' and `b' const and see if that helps.

Hi Eli,

I tried this, and it doesn't seem to change the generated assembly.  I
had also hoped the const declaration would make the temp variable
unnecessary (the pointer aliasing issue) but, alas, it made no
difference there either.
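
To make the aliasing issue concrete, here is a minimal sketch of the
kind of inner loop under discussion, next to a variant that copies the
row pointers into locals by hand.  The function names, the double **
layout and the argument order are assumptions for illustration only;
this is not the actual Netlib source.

```c
#include <stddef.h>

/* Hypothetical row update c[i][k] += t * b[j][k], written with indices.
 * The compiler may reload c[i] and b[j] on every pass if it cannot
 * prove that the store into c[i][k] leaves those pointers unchanged. */
void row_axpy_indexed(double **c, double **b, int i, int j, double t, int n)
{
    int k;
    for (k = 0; k < n; k++)
        c[i][k] += t * b[j][k];
}

/* Same arithmetic, but the row pointers are read once into locals
 * before the loop, so no store in the loop body can change them. */
void row_axpy_hoisted(double **c, double **b, int i, int j, double t, int n)
{
    double *ci = c[i];
    const double *bj = b[j];
    int k;
    for (k = 0; k < n; k++)
        ci[k] += t * bj[k];
}
```

Both functions compute the same result; comparing the output of
gcc -S for the two should show whether the reloads of the row pointers
go away in the second form.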

I also tried various optimization levels (-O4, -O6, etc.) and a number
of optimization flags (as you suggested in email), but nothing I could
come up with, other than the pointer-based implementation, had any
effect on the resulting code.
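
The pointer-based version is not shown in this post.  As a rough
sketch of what such a loop might look like, assuming the same kind of
row update as in the quoted assembly (the name row_axpy_ptr and its
signature are hypothetical, and n is assumed to be non-negative):

```c
/* Hypothetical pointer-walking form of c_row[k] += t * b_row[k].
 * Both rows are walked with local pointers, so the loop body has no
 * row pointers to fetch from memory and no index to scale. */
void row_axpy_ptr(double *c_row, const double *b_row, double t, int n)
{
    double *end = c_row + n;   /* one past the last element to update */

    while (c_row < end)
        *c_row++ += t * *b_row++;
}
```

The caller passes the two row addresses directly, for example
row_axpy_ptr(c[i], b[j], t, n).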

This is a very simple function (but also very important in numerical
applications).  Understanding how to coerce GCC into producing
near-optimal code (without obfuscating the source) for the matrix
multiplication problem would be very beneficial to my work, since the
required "tricks" should be widely applicable to my code.  I would
like to hear any further thoughts or ideas on this subject.

I would also like to thank all those who have offered suggestions and
alternative approaches.  They have been helpful.

Best Regards,
Jesse