Mail Archives: pgcc/1999/02/25/20:02:21.2
> From: Johnny Teveßen <j DOT tevessen AT gmx DOT de>
>
> double foo (int i, double d) {
>   int j;
>   for (j = 20; j; --j) {
>     i *= i;
>     d *= d;
>   }
>   return d*(double)i;
> }
>
> Now compile this using -funroll-all-loops. It will result in a loop that
> runs twice and has 10 "imull" and 10 "fmul" instructions in it. What
> confused me was the way these got mixed.
Have you run a benchmark? (I haven't.) Scheduling is often unintuitive.
It may well be that gcc's scheduling constants are suboptimal
for some cases. One of the problems is that the normal list scheduler
isn't up to scheduling for superscalar architectures (pentiumpro), while
the scheduling parameters aren't tuned for the haifa scheduler.
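A minimal timing sketch for such a benchmark, assuming plain ISO C and
clock() as the timer. The helper name time_foo and the choice of inputs are
hypothetical, not something from the original mail. The inputs 1 and 1.0 are
used because squaring any larger int twenty times overflows.

```c
#include <time.h>

/* The function under discussion, copied from the quoted mail. */
double foo(int i, double d)
{
    int j;
    for (j = 20; j; --j) {
        i *= i;
        d *= d;
    }
    return d * (double)i;
}

/* Hypothetical helper: call foo() `calls` times and return the CPU time
   used, in seconds. The volatile variables keep the compiler from folding
   the calls away or hoisting them out of the timing loop. */
double time_foo(long calls)
{
    volatile int vi = 1;        /* 1 stays 1 under i *= i, so no overflow */
    volatile double vd = 1.0;   /* 1.0 stays 1.0 under d *= d */
    volatile double sink = 0.0; /* keeps the result "used" */
    clock_t start, stop;
    long n;

    start = clock();
    for (n = 0; n < calls; ++n)
        sink = foo(vi, vd);
    stop = clock();
    (void)sink;

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Building this once per -march setting and comparing time_foo() for a large
call count would show whether the different interleavings matter at all.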
> To make a long output short,
> I replaced every imull by '.' and every fmul by '*'. Compiled using
That's a very nice technique ;)
> -march=i386: .*.*.*.*.*.*.*.*.*.*
> -march=i486: .*.*.*.*.*.*.*.*.*.*
> -march=i586: ....*.*.**.*.**.*.**
> -march=i686: ******.*...*...*...*
> -march=k6  : *..*.*.*.*.*.*.*.*.*
>
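The dot-and-star trick can be automated. A rough sketch in C, assuming the
assembler listing is available as one string and that a plain substring
match on "imull" and "fmul" is good enough (it would also match a comment or
label containing those names). The function names classify_insn and
multiply_pattern are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Map one line of assembler output to a pattern character:
   '.' for an integer multiply (imull), '*' for an fp multiply (fmul,
   fmull, ...), 0 for anything else. */
static char classify_insn(const char *line)
{
    if (strstr(line, "imull"))
        return '.';
    if (strstr(line, "fmul"))
        return '*';
    return 0;
}

/* Write the multiply pattern of `asm_text` into `out`, using at most
   out_size - 1 characters plus the terminating NUL. Returns the number
   of pattern characters written. */
size_t multiply_pattern(const char *asm_text, char *out, size_t out_size)
{
    char line[256];
    size_t n = 0;

    if (out_size == 0)
        return 0;

    while (*asm_text != '\0') {
        size_t len = strcspn(asm_text, "\n");  /* length of this line */
        size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;
        char c;

        memcpy(line, asm_text, copy);          /* overlong lines are cut */
        line[copy] = '\0';

        c = classify_insn(line);
        if (c != 0 && n + 1 < out_size)
            out[n++] = c;

        asm_text += len;
        if (*asm_text == '\n')
            ++asm_text;                        /* step over the newline */
    }
    out[n] = '\0';
    return n;
}
```

Fed the text of the unrolled loop from the compiler's assembler output, this
gives one pattern string per -march setting, like the table quoted above.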
> Especially the pentium (i586) ones look strange to me: At the beginning
Strange, yes, but it doesn't seem to run slower on pentiums (almost every
insn depends on an earlier one, and, in addition, the integer multiply
unit is interlocked with the fp multiply unit on pentiums).
> of the loop, the FPU is nearly totally left alone (well, I don't think
> the load-"d"-from-stack still occupies it here). And is the pentiumpro
> (i686) really capable of collecting 6 fp multiplications in its queue?
Yes ;)
> Please don't be angry if I'm totally misunderstanding something, but some
> of the scheduler effects confused me quite a bit for the last days.
No, it's good to be reminded of suboptimal code, but, esp. with
scheduling, benchmarking is better. Good scheduling parameters are much
more difficult to find since they tend to affect everything.
> Then, a little memory-juggling question:
>
> double bar (int i, double d) {
>   return d * (double)i;
> }
>
> Compiled using -O6, on -march={i386,i486,i686,k6} I get the (good) result:
>
> bar: fildl 4(%esp)
>      fmull 8(%esp)
>      ret
>
> But -march=pentium (the default) gives this:
(the default for your version, as pgcc configures for pentiumpro when it
detects one)
> bar: movl 4(%esp),%edx
>      pushl %edx
>      fildl (%esp)
>      addl $4,%esp
>      fmull 8(%esp)
>      ret
That's the riscification going on here. There is a pass that corrects these
problems; if you specify -frecombine you should get better code.
The problem is that pgcc seems to generate slower overall code with
-frecombine. Could you make a benchmark with -frecombine and with
-fno-recombine (and -O4 or higher, of course)? I cannot try this myself
at the moment (no pentium), but I was always a bit puzzled, since the
benchmark said: "turn it off" but my eyes, looking at the resulting code
(like here), said: "turn it on"!
--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / pcg AT goof DOT com |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
|