Mail Archives: pgcc/2000/01/29/02:13:06
Jan,
thanks for your reply.
> > In pgcc some basic blocks (loops?) are being aligned.
> > These 16 byte blocks are ifetch blocks.
> > Quoting Agner Fog, "While aligning data is always important,
> > aligning code is not necessary on the PPlain and PMMX."
>
> The alignment (4,,7) is consistent with Intel Optimizing Manual's
> recommendation. Changing this value might require quite extensive testing to
> prove your statement. For Pentium, the alignment 4,,7 seems to be a win
> according to my (simple) tests.
Is there a switch to turn this alignment off so that I could test it?
-mcode-align? Or does this turn off alignment of entry points
as well?
> > In pgcc strings are being aligned to cache lines.
> > But is alignment even necessary for strings?
> It is. Consider memset/memcpy/strlen expanders. These can work
> much better when they know that destination is word size aligned.
I didn't quite understand this. The string alignment now is to a
cache line.
        .file   "ioport.c"
        .version        "01.01"
gcc2_compiled.:
.section        .rodata
.LC0:
        .string "eip: %p\n"
        .align 32
.LC1:
        .string "/home/chris/linux/include/asm/spinlock.h"
Admittedly, a cache line is word aligned as well,
but wouldn't .align 4 suffice to align to a word boundary?
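
To put a number on the listing above, here is a minimal sketch of the padding arithmetic. It assumes .LC0 starts at offset 0 of .rodata, which the listing suggests but does not prove; the helper name `padding` is made up for illustration.

```python
def padding(offset, alignment):
    """Bytes of padding needed to move offset up to the next multiple of alignment."""
    return -offset % alignment

# "eip: %p\n" is 8 characters, and .string adds a terminating NUL byte.
first_string_size = len("eip: %p\n") + 1

# Assumption: .LC0 sits at offset 0, so .LC1 would start right after it.
pad_cache_line = padding(first_string_size, 32)  # cost of .align 32
pad_word = padding(first_string_size, 4)         # cost of .align 4

print("padding with .align 32:", pad_cache_line)
print("padding with .align 4: ", pad_word)
```

Under that assumption the gap in front of .LC1 is 23 bytes with .align 32 and 3 bytes with .align 4.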
>
> I will verify this tomorrow and in case you are correct, I will fix this bug.
>
> (in both gas and gcc).
If possible, could you send me an email telling me what happened?
> > So in summary, I think that functions should be aligned to cache lines
> > and that basic blocks and strings should not be aligned at all.
> Gcc doesn't align every basic block. It uses alignment for tops of loops, where
> alignment to the ifetch block is necessary. A top of loop appearing at the very
> end of an ifetch block may cause stalls in the decoding process IMO.
> The second alignment is done after barriers, where the situation is in many
> respects equivalent to a function entry point.
The .p2align 4,,7 is misleading. It could probably be better read as
.align 8, as the 7 represents a limit of 7 bytes of padding, which gas
usually fills with a do-nothing leal and a nop.
So there are four cases in a 32 byte cache line (offsets 0 and 16 are
already aligned):
offsets  1-8  need 8-15 bytes of padding -- alignment not done
offsets  9-15 need 1-7 bytes of padding  -- alignment to 16
offsets 17-24 need 8-15 bytes of padding -- alignment not done
offsets 25-31 need 1-7 bytes of padding  -- alignment to 32
So half of the time it isn't being aligned anyways. In the second case,
it seems a waste since the icache line will be in the buffer. No point.
In the fourth case, I can see a point, especially if there is a jmp
instruction and no nops will be executed.
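
The cases can be checked mechanically. This is a rough model, not gas itself: it relies only on the documented rule that .p2align skips the alignment when more than the given maximum number of bytes would be needed. The function name `p2align` and the three buckets are made up for illustration.

```python
def p2align(offset, power, max_skip):
    """Model gas '.p2align power,,max_skip': return the offset after the directive."""
    alignment = 1 << power
    pad = -offset % alignment
    if pad > max_skip:
        return offset          # too much padding needed: alignment not done
    return offset + pad

already_aligned, padded, skipped = [], [], []
for off in range(32):          # every starting offset in a 32 byte cache line
    new = p2align(off, 4, 7)
    if off % 16 == 0:
        already_aligned.append(off)
    elif new != off:
        padded.append(off)
    else:
        skipped.append(off)

print("already aligned:", already_aligned)
print("padded to 16/32:", padded)
print("left alone:     ", skipped)
```

In this model 16 of the 32 possible offsets are left alone, which is the "half of the time" above.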
> Aligning to a 16 byte boundary can be quite a good tradeoff between code size
> and cache line fetching efficiency. While a function starting near the end of
> a cache line is catastrophic, a function starting in the middle of it is not
> so bad.
> Again Intel Optimizing Manual recommends this. I believe Intel did some experiments
> before saying so.
16 byte alignment for functions trades memory against cache footprint.
I would strongly prefer cache and I would urge someone to look at this.
In this case, I wouldn't take Intel's word.
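
A small sketch of the tradeoff, assuming a 32 byte cache line and a made-up 32 byte function; `lines_touched` and `average_padding` are hypothetical helpers, and the average assumes the preceding code ends at a uniformly random offset.

```python
def lines_touched(start, size, line_size=32):
    """Number of cache lines covered by a function of 'size' bytes at 'start'."""
    first = start // line_size
    last = (start + size - 1) // line_size
    return last - first + 1

def average_padding(alignment):
    """Mean padding in bytes if the previous code ends at a uniformly random offset."""
    return sum(-off % alignment for off in range(alignment)) / alignment

size = 32  # hypothetical function size, chosen only for illustration
print("start at  0:", lines_touched(0, size), "cache line(s)")
print("start at 16:", lines_touched(16, size), "cache line(s)")
print("average padding, .align 16:", average_padding(16))
print("average padding, .align 32:", average_padding(32))
```

With these numbers the 16 byte start costs one extra cache line for the function, while .align 32 costs 8 more bytes of padding on average.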
To summarize:
  word alignment for strings -- .align 4, not .align 32
  cache line alignment for functions -- .align 32, not .align 4 (egcs)
    or .align 16 (pgcc)
  loop body alignment only in the fourth quarter of a cache line -- not
    .p2align 4,,7 (.p2align probably can't do this)
Chris Sears
cbsears AT ix DOT netcom DOT com