From: jesse AT lenny DOT dseg DOT ti DOT com (Jesse Bennett)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Netlib code [was Re: flops...]
Date: 2 Mar 1997 23:08:56 GMT
Organization: Texas Instruments
Lines: 58
Message-ID: <5fd1a8$ag6$2@superb.csc.ti.com>
References:
Reply-To: jbennett AT ti DOT com (Jesse Bennett)
NNTP-Posting-Host: lenny.dseg.ti.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

In article , Eli Zaretskii writes:
>
> On Wed, 26 Feb 1997, Jesse W. Bennett wrote:
>
>> I tried this on a Linux box with gcc 2.6.3 and 2.7.2 and the results were
>> encouraging, but the pointer based code was still slightly faster.
>
> Did you try to experiment with the various optimization-related
> switches to gcc? There are a plethora of them, all described in
> section called "Optimize Options" of the gcc on-line docs. I suggest
> to try those which seem relevant to your inner loops, looking at the
> generated assembly and timing the results, until you find the best
> combination.
>
>> L13:
>>         movl (%edi),%edx
>>         movl (%esi),%eax
>>         fld %st(0)
>>         fmull (%eax,%ecx,8)
>>         faddl (%edx,%ecx,8)
>>         fstpl (%edx,%ecx,8)
>>         incl %ecx
>>         cmpl %ecx,12(%ebp)
>>         jg L13
>>
>> It is not clear to me why the edx and eax registers are being reloaded
>> each iteration.
>
> Maybe because GCC allows `a' or `b' to be the same as `c' at the
> caller side? Try declaring `a' and `b' const and see if that helps.

Hi Eli,

I tried this and it doesn't seem to affect the resulting assembler. I
thought that the const declaration might make the temp variable
unnecessary as well (pointer aliasing issue) but, alas, it didn't seem
to have any effect on the generated code. I also tried various
optimization levels (-O4, -O6, etc.) and a number of optimization flags
(as you suggested in email) but nothing I could come up with (except
the pointer-based implementation) had any effect on the resulting code.
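
To make the aliasing point concrete, here is a minimal sketch. The
original source is not shown in this post, so both functions below
(their names, signatures, and the row-pointer matrix layout) are
assumptions reconstructed from the quoted assembly, not the actual
code. One plausible reading is that the store to c[i][j] could, as far
as the compiler can tell, overwrite the row pointers c[i] or b[k]
themselves, so they are reloaded every iteration. Declaring the
arguments const would not change this, since const only forbids writes
through that particular pointer and makes no promise about aliases.
Copying the row pointers into local variables is one common way to
sidestep the problem without resorting to pointer arithmetic:

```c
/* Hypothetical reconstruction: c += a * b for n x n matrices stored as
   arrays of row pointers. Names and layout are assumptions. */
void matmul_indexed(double **a, double **b, double **c, int n)
{
    int i, j, k;

    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++) {
            double temp = a[i][k];
            /* c[i] and b[k] are read from memory inside this loop; a
               compiler that cannot rule out aliasing with the store to
               c[i][j] may reload both on every pass. */
            for (j = 0; j < n; j++)
                c[i][j] += temp * b[k][j];
        }
}

/* Same computation with the row pointers hoisted into locals. The
   locals never have their address taken, so no store through ci can
   change them and they can stay in registers. */
void matmul_hoisted(double **a, double **b, double **c, int n)
{
    int i, j, k;

    for (i = 0; i < n; i++) {
        double *ci = c[i];              /* row i of c, loaded once */
        for (k = 0; k < n; k++) {
            double temp = a[i][k];
            const double *bk = b[k];    /* row k of b, loaded once per k */
            for (j = 0; j < n; j++)
                ci[j] += temp * bk[j];
        }
    }
}
```

Both versions perform the same arithmetic in the same order, so they
should produce identical results. Whether the hoisted form changes the
generated inner loop on a given gcc version would have to be checked
with -S.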

This is a very simple function (but also very important in numerical
applications). Understanding how to coerce GCC into producing near
optimal code (without obfuscating the source) for the matrix
multiplication problem would be very beneficial to my work, since the
required "tricks" should be widely applicable to my code.

I would like to hear any further thoughts or ideas on this subject. I
would also like to thank all those who have offered suggestions and
alternative approaches. They have been helpful.

Best Regards,
Jesse