Date: Thu, 25 Feb 1999 23:52:32 +0100
To: pgcc AT delorie DOT com
Cc: johnny AT entity DOT netcologne DOT de
Subject: loop unrolling
Message-ID: <19990225235232.C20417@cerebro.laendle>
Mail-Followup-To: pgcc AT delorie DOT com, johnny AT entity DOT netcologne DOT de
References: <199902241423 DOT JAA29290 AT envy DOT delorie DOT com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <199902241423.JAA29290@envy.delorie.com>; from DJ Delorie on Wed, Feb 24, 1999 at 09:23:29AM -0500
X-Operating-System: Linux version 2.2.2 (marc AT cerebro) (gcc driver version pgcc-2.93.04 19990131 (gcc2 ss-980929 experimental) executing gcc version 2.7.2.3)
From: Marc Lehmann <pcg AT goof DOT com>
Reply-To: pgcc AT delorie DOT com

> From: Johnny Teveßen <j DOT tevessen AT gmx DOT de>
>
> double foo (int i, double d) {
>   int j;
>   for (j = 20; j; --j) {
>     i *= i;
>     d *= d;
>   }
>   return d*(double)i;
> }
>
> Now compile this using -funroll-all-loops. It will result in a loop that
> runs twice and has 10 "imull" and 10 "fmul" instructions in it. What
> confused me was the way these got mixed.

Have you made a benchmark? (I haven't). Scheduling is often unintuitive.

It might indeed be the case that gcc's scheduling constants are suboptimal
for some cases. One of the problems is that the normal list scheduler isn't
up to scheduling for superscalar architectures (pentiumpro), while the
scheduling parameters aren't tuned for the haifa scheduler.

> To make a long output short,
> I replaced every imull by '.' and every fmul by '*'.
> Compiled using

That's a very nice technique ;)

> -march=i386: .*.*.*.*.*.*.*.*.*.*
> -march=i486: .*.*.*.*.*.*.*.*.*.*
> -march=i586: ....*.*.**.*.**.*.**
> -march=i686: ******.*...*...*...*
> -march=k6  : *..*.*.*.*.*.*.*.*.*
>
> Especially the pentium (i586) ones look strange to me: At the beginning

Strange, yes, but it doesn't seem to run slower on pentiums (almost every
insn is dependent on each other, and, in addition, the integer multiply
unit is interlocked with the fp multiply unit on pentiums).

> of the loop, the FPU is nearly totally left alone (well, I don't think
> the load-"d"-from-stack still occupies it here). And is the pentiumpro
> (i686) really capable of collecting 6 fp multiplications in its queue?

Yes ;)

> Please don't be angry if I'm totally misunderstanding something, but some
> of the scheduler effects confused me quite a bit for the last days.

No, it's good to get reminded of suboptimal code, but, esp. with scheduling,
benchmarking is better. Good scheduling parameters are much more difficult
to find since they tend to affect everything.

> Then, a little memory-juggling question:
>
> double bar (int i, double d) {
>   return d * (double)i;
> }
>
> Compiled using -O6, on -march={i386,i486,i686,k6} I get the (good) result:
>
> bar:    fildl 4(%esp)
>         fmull 8(%esp)
>         ret
>
> But -march=pentium (the default) gives this:

(the default for your version, as pgcc configures for pentiumpro when it
detects one)

> bar:    movl 4(%esp),%edx
>         pushl %edx
>         fildl (%esp)
>         addl $4,%esp
>         fmull 8(%esp)
>         ret

That's the riscification going on here. There is a pass that corrects these
problems; if you specify -frecombine you should get better code.

The problem is that pgcc seems to generate slower overall code with
-frecombine. Could you make a benchmark with -frecombine and with
-fno-recombine (and -O4 or higher, of course)?
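For such a comparison, a small timing harness along the following lines
might be enough. It is only a sketch: foo and bar are copied from the quoted
mail, while time_calls, report_timings, the inputs and the repetition count
are made up for illustration. i is fixed at 1 because squaring anything
larger 20 times overflows an int, which would make the result meaningless.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* The two functions from the thread. */
double foo(int i, double d)
{
    int j;
    for (j = 20; j; --j) {
        i *= i;
        d *= d;
    }
    return d * (double)i;
}

double bar(int i, double d)
{
    return d * (double)i;
}

/* Hypothetical helper: call fn `reps` times, return the CPU seconds used. */
double time_calls(double (*fn)(int, double), long reps)
{
    /* volatile inputs and result, so the compiler has to reload the
       arguments and keep the result of every call; calling through a
       pointer also makes inlining less likely */
    volatile int i = 1;          /* 1 * 1 never overflows in foo */
    volatile double d = 1.0;
    volatile double sink = 0.0;
    clock_t start;
    long n;

    assert(reps > 0);

    start = clock();
    for (n = 0; n < reps; ++n)
        sink += fn(i, d);

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical driver: print one timing line per function. */
void report_timings(long reps)
{
    printf("foo: %.3f s\n", time_calls(foo, reps));
    printf("bar: %.3f s\n", time_calls(bar, reps));
}
```

Call report_timings from main with a repetition count large enough that
each run takes a second or more (clock() is coarse), build the program once
with -frecombine and once with -fno-recombine at the same -O and -march
settings, and compare the two outputs.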
I cannot try this myself at the moment (no pentium), but I was always a bit
puzzled since the benchmark said: "turn it off" but my eyes, looking at the
resulting code (like here) said: "turn it on"!

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg AT goof DOT com     |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |