Date: Thu, 22 Mar 2001 18:53:01 +0200 (EET)
From: Tuukka Toivonen <tuukkat AT s-inf-pc24 DOT oulu DOT fi>
To: pgcc AT delorie DOT com, agcc-athlonlinux AT lists DOT sourceforge DOT net
Subject: Re: gcc generates bad code
In-Reply-To: <Pine.LNX.4.21.0103201614150.21716-100000@s-inf-pc24.oulu.fi>
Message-ID: <Pine.LNX.4.21.0103221828310.31701-100000@s-inf-pc24.oulu.fi>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Reply-To: pgcc AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

On Tue, 20 Mar 2001, Tuukka Toivonen wrote:

> I have a straightforward piece of code that needs to be well
> optimized. Since it's VERY straightforward, I'd suppose gcc not having
> problems with it. However, all versions I tested (egcs 1.1.2, pgcc-2.95.2
> 19991024, AthlonGCC) have the same thing that looks very much like it's

... I haven't been able to remove the useless memory accesses, but I found
something interesting (and bad) concerning all the compilers mentioned
above.

With -O2 the compiler generates 76% slower code than with -O1. More
specifically, the culprit is -fregmove: whenever I add it after -O, the
code runs *much* slower.
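
A minimal sketch of how such a comparison could be timed. None of this is
taken from the original program: time_kernel(), example_kernel() and the
file names in the comment are hypothetical, and the stand-in kernel only
mimics the add/subtract part of the real code.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not from the original code): call kernel()
   `iterations` times on `block` and return the elapsed CPU time in seconds.
   To compare the two settings, build the same file twice, for example
   (file names are placeholders):
       gcc -O1 -o bench_fast bench.c
       gcc -O1 -fregmove -o bench_slow bench.c */
static double time_kernel(void (*kernel)(uint32_t *), uint32_t *block,
                          long iterations)
{
    assert(iterations >= 0);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        kernel(block);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Stand-in kernel so the sketch is self-contained. Unsigned arithmetic is
   used so that repeated calls wrap around instead of overflowing. */
static void example_kernel(uint32_t *p)
{
    uint32_t a = p[3], b = p[19];
    p[3]  = a + b;
    p[19] = a - b;
}
```

Timing the whole routine over many iterations, rather than a single
fragment, keeps clock() resolution from dominating the result.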

I can't see any significant difference in the generated assembly code. The
instructions are roughly the same, just in different places and using
different registers. I guess it hurts scheduling somehow.

This is on an 800 MHz AMD Athlon. I can provide the example source (~40k)
on request.

Here's a short piece of example code (cut from a longer piece; I'm not
sure it is actually a good example, since I haven't timed it separately,
but the rest of the code is very similar):

(the code marked with - is the fast version, + is the slow version)

a=ntt_block_p[3];
b=ntt_block_p[19];
ntt_block_p[3]=a+b;
ntt_block_p[19]=ntt_2n(a-b,22)-ntt_2n(a-b,10);

- 804a7a3:      8b 71 0c                mov    0xc(%ecx),%esi
- 804a7a6:      8b 79 4c                mov    0x4c(%ecx),%edi
- 804a7a9:      8d 0c 37                lea    (%edi,%esi,1),%ecx
- 804a7ac:      89 4d d8                mov    %ecx,0xffffffd8(%ebp)
- 804a7af:      8b 5d f8                mov    0xfffffff8(%ebp),%ebx
- 804a7b2:      89 4b 0c                mov    %ecx,0xc(%ebx)
- 804a7b5:      89 f2                   mov    %esi,%edx
- 804a7b7:      29 fa                   sub    %edi,%edx
- 804a7b9:      89 d6                   mov    %edx,%esi
- 804a7bb:      c1 e6 16                shl    $0x16,%esi
- 804a7be:      81 e6 ff ff ff 00       and    $0xffffff,%esi
- 804a7c4:      89 d0                   mov    %edx,%eax
- 804a7c6:      c1 f8 02                sar    $0x2,%eax
- 804a7c9:      29 c6                   sub    %eax,%esi
- 804a7cb:      89 d0                   mov    %edx,%eax
- 804a7cd:      c1 e0 0a                shl    $0xa,%eax
- 804a7d0:      25 ff ff ff 00          and    $0xffffff,%eax
- 804a7d5:      c1 fa 0e                sar    $0xe,%edx
- 804a7d8:      29 d0                   sub    %edx,%eax
- 804a7da:      29 c6                   sub    %eax,%esi
- 804a7dc:      89 75 d4                mov    %esi,0xffffffd4(%ebp)
- 804a7df:      89 73 4c                mov    %esi,0x4c(%ebx)

+ 804a88d:      8b 73 0c                mov    0xc(%ebx),%esi
+ 804a890:      8b 7b 4c                mov    0x4c(%ebx),%edi
+ 804a893:      8d 0c 37                lea    (%edi,%esi,1),%ecx
+ 804a896:      89 4d dc                mov    %ecx,0xffffffdc(%ebp)
+ 804a899:      89 4b 0c                mov    %ecx,0xc(%ebx)
+ 804a89c:      89 f2                   mov    %esi,%edx
+ 804a89e:      29 fa                   sub    %edi,%edx
+ 804a8a0:      89 55 d8                mov    %edx,0xffffffd8(%ebp)
+ 804a8a3:      c1 65 d8 16             shll   $0x16,0xffffffd8(%ebp)
+ 804a8a7:      c6 45 db 00             movb   $0x0,0xffffffdb(%ebp)
+ 804a8ab:      89 d0                   mov    %edx,%eax
+ 804a8ad:      c1 f8 02                sar    $0x2,%eax
+ 804a8b0:      29 45 d8                sub    %eax,0xffffffd8(%ebp)
+ 804a8b3:      89 d0                   mov    %edx,%eax
+ 804a8b5:      c1 e0 0a                shl    $0xa,%eax
+ 804a8b8:      25 ff ff ff 00          and    $0xffffff,%eax
+ 804a8bd:      c1 fa 0e                sar    $0xe,%edx
+ 804a8c0:      29 d0                   sub    %edx,%eax
+ 804a8c2:      29 45 d8                sub    %eax,0xffffffd8(%ebp)
+ 804a8c5:      8b 4d d8                mov    0xffffffd8(%ebp),%ecx
+ 804a8c8:      89 4b 4c                mov    %ecx,0x4c(%ebx)
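
For readers without the full source, here is a sketch of what the C side
might look like. The definition of ntt_2n() is not shown in this message;
the version below is only a reconstruction inferred from the shl/and/sar/sub
sequences in the disassembly above (shift left by n, mask to 24 bits,
subtract the value shifted right by 24-n). The function name butterfly_3_19
and the int32_t types are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction of ntt_2n(), inferred only from the
   disassembly; the real definition may differ. For n = 22 and n = 10 it
   matches the instruction sequences
       shl $n ; and $0xffffff ; sar $(24-n) ; sub */
static int32_t ntt_2n(int32_t x, int n)
{
    assert(n > 0 && n < 24);
    /* Shift left in unsigned arithmetic to avoid undefined behaviour for
       negative x; the mask keeps the low 24 bits, as the 'and' does. */
    int32_t lo = (int32_t)(((uint32_t)x << n) & 0xffffffu);
    /* 'sar' is an arithmetic shift. Right-shifting a negative value is
       implementation-defined in C; gcc shifts arithmetically. */
    int32_t hi = x >> (24 - n);
    return lo - hi;
}

/* The statement from the message, for indices 3 and 19. */
static void butterfly_3_19(int32_t *ntt_block_p)
{
    int32_t a = ntt_block_p[3];
    int32_t b = ntt_block_p[19];
    ntt_block_p[3]  = a + b;
    ntt_block_p[19] = ntt_2n(a - b, 22) - ntt_2n(a - b, 10);
}
```

In both listings a-b is computed once into %edx and reused for the two
ntt_2n() expansions, so this sketch does the same work as a common
subexpression.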