Date: Thu, 22 Mar 2001 18:53:01 +0200 (EET)
From: Tuukka Toivonen
To: pgcc AT delorie DOT com, agcc-athlonlinux AT lists DOT sourceforge DOT net
Subject: Re: gcc generates bad code
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Reply-To: pgcc AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

On Tue, 20 Mar 2001, Tuukka Toivonen wrote:

> I have a straightforward piece of code that needs to be well
> optimized. Since it's VERY straightforward, I'd suppose gcc not having
> problems with it. However, all versions I tested (egcs 1.1.2, pgcc-2.95.2
> 19991024, AthlonGCC) have the same thing that looks very much like it's
...

I haven't been able to remove the useless memory accesses, but I found
something interesting (and bad) concerning all the compilers mentioned
above.

With -O2 the compiler generates 76% slower code than with -O1. More
specifically, the problem is -fregmove. Whenever I add that after -O,
the code runs *much* slower.

I can't see any significant difference in the generated assembly code.
The instructions are roughly the same, just in different places and with
different registers. I guess it hurts scheduling somehow.

This is on an AMD Athlon 800 MHz. I can provide example source (~40k) on
request.

Here's a short piece of example code, cut from a longer piece. I'm not
sure it is actually a good example, since I haven't timed it separately,
but the rest of the code is very similar. Lines marked with - are the
fast version, lines marked with + are the slow version.

  a=ntt_block_p[3];
  b=ntt_block_p[19];
  ntt_block_p[3]=a+b;
  ntt_block_p[19]=ntt_2n(a-b,22)-ntt_2n(a-b,10);

- 804a7a3:  8b 71 0c             mov    0xc(%ecx),%esi
- 804a7a6:  8b 79 4c             mov    0x4c(%ecx),%edi
- 804a7a9:  8d 0c 37             lea    (%edi,%esi,1),%ecx
- 804a7ac:  89 4d d8             mov    %ecx,0xffffffd8(%ebp)
- 804a7af:  8b 5d f8             mov    0xfffffff8(%ebp),%ebx
- 804a7b2:  89 4b 0c             mov    %ecx,0xc(%ebx)
- 804a7b5:  89 f2                mov    %esi,%edx
- 804a7b7:  29 fa                sub    %edi,%edx
- 804a7b9:  89 d6                mov    %edx,%esi
- 804a7bb:  c1 e6 16             shl    $0x16,%esi
- 804a7be:  81 e6 ff ff ff 00    and    $0xffffff,%esi
- 804a7c4:  89 d0                mov    %edx,%eax
- 804a7c6:  c1 f8 02             sar    $0x2,%eax
- 804a7c9:  29 c6                sub    %eax,%esi
- 804a7cb:  89 d0                mov    %edx,%eax
- 804a7cd:  c1 e0 0a             shl    $0xa,%eax
- 804a7d0:  25 ff ff ff 00       and    $0xffffff,%eax
- 804a7d5:  c1 fa 0e             sar    $0xe,%edx
- 804a7d8:  29 d0                sub    %edx,%eax
- 804a7da:  29 c6                sub    %eax,%esi
- 804a7dc:  89 75 d4             mov    %esi,0xffffffd4(%ebp)
- 804a7df:  89 73 4c             mov    %esi,0x4c(%ebx)

+ 804a88d:  8b 73 0c             mov    0xc(%ebx),%esi
+ 804a890:  8b 7b 4c             mov    0x4c(%ebx),%edi
+ 804a893:  8d 0c 37             lea    (%edi,%esi,1),%ecx
+ 804a896:  89 4d dc             mov    %ecx,0xffffffdc(%ebp)
+ 804a899:  89 4b 0c             mov    %ecx,0xc(%ebx)
+ 804a89c:  89 f2                mov    %esi,%edx
+ 804a89e:  29 fa                sub    %edi,%edx
+ 804a8a0:  89 55 d8             mov    %edx,0xffffffd8(%ebp)
+ 804a8a3:  c1 65 d8 16          shll   $0x16,0xffffffd8(%ebp)
+ 804a8a7:  c6 45 db 00          movb   $0x0,0xffffffdb(%ebp)
+ 804a8ab:  89 d0                mov    %edx,%eax
+ 804a8ad:  c1 f8 02             sar    $0x2,%eax
+ 804a8b0:  29 45 d8             sub    %eax,0xffffffd8(%ebp)
+ 804a8b3:  89 d0                mov    %edx,%eax
+ 804a8b5:  c1 e0 0a             shl    $0xa,%eax
+ 804a8b8:  25 ff ff ff 00       and    $0xffffff,%eax
+ 804a8bd:  c1 fa 0e             sar    $0xe,%edx
+ 804a8c0:  29 d0                sub    %edx,%eax
+ 804a8c2:  29 45 d8             sub    %eax,0xffffffd8(%ebp)
+ 804a8c5:  8b 4d d8             mov    0xffffffd8(%ebp),%ecx
+ 804a8c8:  89 4b 4c             mov    %ecx,0x4c(%ebx)
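
For readers following the listing, here is a minimal C sketch of what the
four source lines appear to compute. The definition of ntt_2n is not given
in this message; the version below is an assumption reconstructed only from
the shl / and / sar / sub sequence in the disassembly (shift left by n,
mask to 24 bits, subtract the value shifted right by 24-n), and the real
macro may differ. The function name butterfly and the int32_t element type
are likewise hypothetical. One visible difference in the + listing is that
the shifted intermediate lives in the stack slot 0xffffffd8(%ebp) and the
24-bit mask is applied there with a byte store (movb $0x0) instead of an
and on a register; whether that accounts for the slowdown is not
established here.

```c
#include <stdint.h>

/* ASSUMPTION: reconstructed from the disassembly, not from the original
 * source. Mirrors:  shl $n ; and $0xffffff ; sar $(24-n) ; sub
 * Intended for 0 < n < 24. */
static int32_t ntt_2n(int32_t x, int n)
{
    /* Low part: x << n truncated to 24 bits. The shift is done on an
     * unsigned value so that overflow is well defined in C. */
    int32_t lo = (int32_t)(((uint32_t)x << n) & 0xffffffu);

    /* High part: the listing uses sar (arithmetic shift). In C, >> on a
     * negative signed value is implementation-defined; gcc on x86 shifts
     * arithmetically. */
    int32_t hi = x >> (24 - n);

    return lo - hi;
}

/* Hypothetical wrapper around the four statements quoted in the mail. */
static void butterfly(int32_t *ntt_block_p)
{
    int32_t a = ntt_block_p[3];
    int32_t b = ntt_block_p[19];

    ntt_block_p[3]  = a + b;
    ntt_block_p[19] = ntt_2n(a - b, 22) - ntt_2n(a - b, 10);
}
```

Under this assumed definition, ntt_2n(4, 22) evaluates to -1, because
4 << 22 is exactly 2^24 and masks to zero while 4 >> 2 is 1.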