Mail Archives: pgcc/1999/03/17/07:43:57
> On 16 Mar, Axel Thimm wrote:
> > We are currently trying to see what we can drain maximally from PII for a
> > certain flop intensive application (QCD). Until now folks were using gcc 2.8.1
> > with -O2 -fomit-frame-pointer. I thought I might surprise them with egcs or
> > pgcc, but the performance dropped from 80 to 50 Mflop/s (?)
Thanks for the mails on this subject. To keep my first mail short I was not
specific enough, so here is more information.
The system is an i686-pc-linux-gnulibc1 (dual 400 MHz, but only one CPU is used
in testing; the kernel is 2.0.36). For pgcc I used the precompiled
i586-pc-linux-gnulibc1/pgcc-2.91.57 (from the pgcc WWW pages) and for gcc the
installed (/usr/bin) i486-linux/2.7.2.1. This is all the information I can
currently gather, but I can provide more on request.
We are benchmarking this system to get some numbers for a future project: a
home-made parallel machine, possibly based on x86 and its successor
technology.
The application works on double-precision complex numbers, defined as
"typedef struct {double re;double im;} complex;" (another step to try for
speed is to use __complex__ double, but that is a separate point, which I will
be checking in the very near future).
Based on this "complex", a 3-vector (48 bytes) and a 3x3 matrix of complex (144
bytes) are defined as structures. The application must multiply matrices with
those vectors, where both of them are placed on two cubical grids. The
algorithm, very simplified, looks like this in pseudo code:
result_vector[x] = Sum_over_all_neighbours_of_x Matrix[neighbour]*vector[neighbour].
The sums etc. are mostly done as macros, so code bloat could perhaps also make
a difference.
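A minimal sketch of the data layout and the per-site sum described above. Only
the "complex" typedef is taken from the mail; the names vec3, mat3,
mat_vec_acc, site_update and the neighbour index table nb are hypothetical,
and the real code uses macros rather than functions.

```c
#include <assert.h>
#include <stddef.h>

/* "complex" is the typedef quoted in the mail; vec3 and mat3 are
   hypothetical names for the 3-vector and the 3x3 matrix. */
typedef struct { double re; double im; } complex;
typedef struct { complex c[3]; } vec3;      /* 3 * 16 = 48 bytes  */
typedef struct { complex e[3][3]; } mat3;   /* 9 * 16 = 144 bytes */

/* Check the sizes stated in the mail (holds when double is 8 bytes
   and the structs carry no padding). */
static void check_layout(void)
{
    assert(sizeof(vec3) == 48);
    assert(sizeof(mat3) == 144);
}

/* r += m * v for a single neighbour (complex matrix times vector). */
static void mat_vec_acc(vec3 *r, const mat3 *m, const vec3 *v)
{
    for (int i = 0; i < 3; i++) {
        double re = r->c[i].re;
        double im = r->c[i].im;
        for (int j = 0; j < 3; j++) {
            re += m->e[i][j].re * v->c[j].re - m->e[i][j].im * v->c[j].im;
            im += m->e[i][j].re * v->c[j].im + m->e[i][j].im * v->c[j].re;
        }
        r->c[i].re = re;
        r->c[i].im = im;
    }
}

/* result = sum over the 8 neighbours of Matrix[n] * vector[n].
   nb[] is a hypothetical table holding the neighbours' grid indices. */
static void site_update(vec3 *result, const mat3 *mat, const vec3 *vec,
                        const size_t nb[8])
{
    vec3 acc = {{{0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}}};
    for (int k = 0; k < 8; k++)
        mat_vec_acc(&acc, &mat[nb[k]], &vec[nb[k]]);
    *result = acc;
}
```

Each call to site_update reads 8 matrices and 8 vectors and writes one vector,
which is the traffic pattern the following paragraph is concerned with.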
But the key point in my opinion is the amount of data read and written
in one loop body. There are 8 neighbours, so we have 8*48 + 8*144 = 1536 bytes
(1.5 KB) of input. The loops pass over the grid in a typewriter fashion (also
something that could be improved algorithmically by using a Hilbert curve), so
mostly only one neighbour survives in the cache. (That means that the slowdown
cannot come from cache effects, because we have already left this regime.)
Also, the factor 3 is rather unlucky when it comes to cache lines, which will
be "crossed".
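The arithmetic above can be checked with a small sketch. It assumes a cache
line size of 32 bytes (LINE below), which is not stated in the mail; the
function names are hypothetical. It computes the input bytes per loop body and
counts how many cache lines a 48-byte or 144-byte object touches depending on
where it starts.

```c
#include <assert.h>
#include <stddef.h>

/* LINE = 32 is an assumption about the cache line size; the other
   sizes are the ones given in the mail. */
enum { LINE = 32, VEC_BYTES = 48, MAT_BYTES = 144, NEIGHBOURS = 8 };

/* Input bytes read in one loop body: 8 vectors plus 8 matrices. */
static size_t input_bytes_per_site(void)
{
    return NEIGHBOURS * (VEC_BYTES + MAT_BYTES);
}

/* Number of cache lines touched by an object of `size` bytes that
   starts at byte address `addr`. */
static size_t lines_touched(size_t addr, size_t size)
{
    assert(size > 0);
    size_t first = addr / LINE;
    size_t last = (addr + size - 1) / LINE;
    return last - first + 1;
}

/* Total lines touched, counted per element, for `n` consecutive
   elements of `elem` bytes in an array starting at address `base`. */
static size_t lines_for_array(size_t base, size_t elem, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += lines_touched(base + i * elem, elem);
    return total;
}
```

Under this assumption a 48-byte vector never fits in a single line, and when
the array is only 8-byte aligned every other vector straddles three lines
instead of two, so the alignment of the array base changes how many lines each
access pulls in.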
On Tue, Mar 16, 1999 at 03:40:48PM -0600, Jeffrey Hundstad wrote:
> try
>
> CFLAGS=-O20 -mcpu=i686 -march=i686 -fomit-frame-pointer -ffast-math
> -mstack-align-double -funroll-all-loops
I had already tried most of these (with -O9 rather than -O20); it does not help.
On Tue, Mar 16, 1999 at 11:59:24PM +0100, Marc Lehmann wrote:
> - double alignment. depending on how your program allocates memory for
> doubles, it can, by pure luck, change from optimal to non-optimal.
I tried all alignment options available in pgcc, no luck.
> - cache colouring (or lack thereof). Sometimes moving around data structures
> will alter performance randomly (from run to run). some algorithms are
> highly sensitive to these. Unfortunately, the compiler cannot help here.
> > memory intensive (small ratio of computations per memory accesses) and perhaps
> > this is what makes the difference.
>
> It might. Cache line aliasing can make up to 200% difference in runtime.
What do cache *coloring* and cache line *aliasing* mean?
On Wed, Mar 17, 1999 at 08:07:07AM +0100, Michael Hanke wrote:
> > x86 fp performance is veeery sensitive to environment issues.
> >
> > > memory intensive (small ratio of computations per memory accesses) and perhaps
> > > this is what makes the difference.
> >
> > It might. Cache line aliasing can make up to 200% difference in runtime.
> If memory access is an issue, modern hierarchical memory
> architectures have more influence on the flop rate than a compiler
> can have. An impressive example is the atlas library (a gemm based
> blas implementation) which pushed my machine from 7 mflops to 20
> (double) and 35 (float) mflops, respectively. Other carefully
> designed implementations of numerical standard algorithms lead also
> to speed improvements. So my hint would be to see if you can make use
> of tuned standard libraries. As far as I know, they are available for
> blas, lapack, fftw. Maybe, you can change even your algorithm so that
> they are able to take advantage of them.
Of course, but this is another point to investigate. (We are "only" getting
80 Mflop/s on a 400 MHz machine.) In the current setup I wonder how and why
some of the improvements from gcc to egcs/pgcc have had this side effect for
this class of number crunching computation.
If there is interest on this list, I could ask the original authors whether I
may post the code here.
BTW, another mysterious thing: assuming that this application is indeed
limited by memory access, running this benchmark on both CPUs at once should
show some drop in performance, but it does not (?). Does anyone have a good
explanation for this? (Dual memory port?) I must investigate the board
hardware further, but I was told that it is an Asus board.
Regards, Axel.
--
Axel DOT Thimm AT physik DOT fu-berlin DOT de Axel DOT Thimm AT ifh DOT de