Mail Archives: pgcc/2000/01/29/20:42:36
>
> Is there a switch to turn this alignment off so that I could test it?
> -mcode-align? Or does this turn off alignment of entry points
> as well?
There are switches -malign-jumps and -malign-loops that may do what you want.
>
> > > In pgcc strings are being aligned to cache lines.
> > > But is alignment even necessary for strings?
> > It is. Consider the memset/memcpy/strlen expanders. These can work
> > much better when they know that the destination is word-aligned.
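A rough C sketch of why an expander cares about alignment. The hypothetical `sketch_memset` below cannot assume anything about its destination, so it has to peel off single-byte stores until the pointer reaches a 4-byte boundary before it can store whole words. When the destination is known to be word-aligned, that head loop is never entered. This is an illustration assuming 4-byte words, not GCC's actual memset expander.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration, not GCC's real memset expander.
   Counts the single-byte stores so the cost of misalignment is visible. */
static size_t byte_stores;

static void sketch_memset(unsigned char *dst, unsigned char c, size_t n)
{
    /* Replicate c into every byte of a 32-bit word. */
    uint32_t word = 0x01010101u * c;

    /* Head: single-byte stores until dst is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = c;
        n--;
        byte_stores++;
    }
    /* Body: one 4-byte store per iteration (memcpy sidesteps
       aliasing rules and typically compiles to a plain word store). */
    while (n >= 4) {
        memcpy(dst, &word, sizeof word);
        dst += 4;
        n -= 4;
    }
    /* Tail: whatever is left after the last whole word. */
    while (n > 0) {
        *dst++ = c;
        n--;
        byte_stores++;
    }
}
```

Filling 16 bytes from a word-aligned pointer needs no byte stores at all; starting one byte later costs three byte stores before the word loop can begin.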
>
> I didn't quite understand this. The string alignment now is to a
> cache line.
>
> .file "ioport.c"
> .version "01.01"
> gcc2_compiled.:
> .section .rodata
> .LC0:
> .string "eip: %p\n"
> .align 32
> .LC1:
> .string "/home/chris/linux/include/asm/spinlock.h"
>
> Admittedly, a cache line is word aligned as well,
> but wouldn't .align 4 suffice to align to a word boundary?
Yes. Richard and I were discussing this recently, and we will probably change this bit.
The rationale behind it is to place each string into as few cache lines as possible
(when a string starts near the end of a cache line, it may cross into one extra line).
But this needs some tuning.
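The arithmetic behind that rationale, as a small sketch with a hypothetical helper: count how many cache lines an object touches given its starting offset. The 40-character path in the listing above is 41 bytes with its terminating NUL; it fits in two 32-byte lines when it starts on a line boundary but touches three when it starts at offset 28.

```c
#include <stddef.h>

/* Hypothetical helper: number of cache lines touched by an object of
   len bytes (len > 0) starting at byte offset off, for lines of
   line_size bytes. */
static size_t lines_spanned(size_t off, size_t len, size_t line_size)
{
    size_t first = off / line_size;             /* line holding the first byte */
    size_t last = (off + len - 1) / line_size;  /* line holding the last byte */
    return last - first + 1;
}
```

Offset 28 is a multiple of 4, so `.align 4` alone does not rule out the extra line; the price of 32-byte alignment is up to 31 bytes of padding per string, which is the tradeoff that needs tuning.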
>
> If possible, could you send me an email telling me what happened?
I haven't had time to test it yet (I am preparing for an exam, and I've done another
100KB patch to the function-calling code...). I will do that tomorrow :)
and let you know.
>
> > > So in summary, I think that functions should be aligned to cache lines
> > > and that basic blocks and strings should not be aligned at all.
> > GCC doesn't align every basic block. It uses alignment for the tops of loops,
> > where alignment to the instruction-fetch (ifetch) block is necessary. IMO, a loop
> > top appearing at the very end of an ifetch block may cause stalls in the decoding
> > process. The second alignment is done after barriers (code following an
> > unconditional jump, reachable only by jumping to it), where the situation is in
> > many respects equivalent to a function entry point.
>
> The .p2align 4,,7 is misleading. It could probably be better read
> as .align 8, as the 7 is a limit of 7 bytes of padding, which gas usually
> fills with a do-nothing leal and a nop.
>
> Given that, there are four cases in a 32-byte cache line (ignoring offsets
> 0 and 16, which are already aligned):
>
> offsets 1-8 would need 8-15 bytes to reach 16 -- alignment not done
> offsets 9-15 are padded by 1-7 bytes to 16 -- aligned to 16
> offsets 17-24 would need 8-15 bytes to reach 32 -- alignment not done
> offsets 25-31 are padded by 1-7 bytes to 32 -- aligned to 32
>
> So half of the time it isn't aligned anyway. In the second case it
> seems a waste, since the icache line will already be in the buffer; there is
> no point. In the fourth case I can see a point, especially if it follows a
> jmp instruction, so that no nops will be executed.
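As I read the GNU as documentation, `.p2align 4,,7` requests a 16-byte (2^4) boundary but skips the alignment entirely when it would take more than 7 bytes of padding. The hypothetical helper below models only that rule (not gas's choice of filler instructions) and makes it easy to check the case analysis over all 32 offsets in a 32-byte line. Under this model offsets 9-15 and 25-31 are padded, offsets 0 and 16 need nothing, and the 16 offsets 1-8 and 17-24 are left alone; offset 8 would need exactly 8 bytes, so it is not aligned.

```c
#include <stddef.h>

/* Hypothetical model of the padding emitted by ".p2align log2,,max_skip":
   pad to the next 2^log2 boundary, unless that needs more than max_skip
   bytes, in which case emit no padding at all. */
static size_t p2align_padding(size_t off, unsigned log2, size_t max_skip)
{
    size_t align = (size_t)1 << log2;
    size_t pad = (align - off % align) % align;  /* bytes to next boundary */
    return pad <= max_skip ? pad : 0;
}
```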
The second case brings speedups similar to the fourth case, at least on CPUs
without out-of-order execution. I will do some tests of how this behaves on the Pentium.
But I believe that code starting near the end of a 16-byte prefetch buffer will cause
stalls too.
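A back-of-the-envelope model of the "near the end of the buffer" argument, assuming purely for illustration that the first fetch after a jump delivers one aligned block of a fixed size: the target code present in that first block is the block size minus the target's offset within it. A jump target at offset 15 of a 16-byte block gets a single byte of code from the first fetch, while one at offset 8 still gets half a block. The helper name is hypothetical and the model ignores decoder details.

```c
#include <stddef.h>

/* Hypothetical model: bytes of target code delivered by the first aligned
   fetch block of block_size bytes when execution jumps to address target. */
static size_t first_fetch_useful(size_t target, size_t block_size)
{
    return block_size - target % block_size;
}
```

The same arithmetic with a 32-byte cache line shows why a function starting at offset 16 of a line is much less costly than one starting at offset 30.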
>
> > Aligning to a 16-byte boundary can be quite a good tradeoff between code size
> > and cache-line fetching efficiency. While a function starting near the end of a
> > cache line is catastrophic, a function starting in the middle of one is not
> > so bad.
> > Again, the Intel optimization manual recommends this. I believe Intel did some
> > experiments before saying so.
>
> 16-byte alignment for functions trades memory against cache footprint.
> I would strongly prefer cache and I would urge someone to look at this.
> In this case, I wouldn't take Intel's word.
The problem is that you need to align a function to the largest alignment used
in its code. If you use 32-byte alignment for loops, you need 32-byte alignment
for functions as well. Gas sets the alignment of the whole section based on the
.align directive at the start, and other alignments are done relative to this value.
At least this is my understanding of things; I may be mistaken in this case.
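The arithmetic behind this point, as a sketch: a label placed at a 32-byte-aligned offset inside a section ends up at a 32-byte-aligned address only if the section itself starts on a 32-byte boundary. With a base that is merely 16-byte aligned, every "32-byte aligned" loop top lands at 16 modulo 32. The helper is hypothetical, and how gas and the linker actually choose the section's alignment is not modeled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: is a label at offset off within a section placed
   at address base really aligned to align bytes (align a power of two)? */
static bool label_is_aligned(size_t base, size_t off, size_t align)
{
    return ((base + off) & (align - 1)) == 0;
}
```

For example, with the section at 0x1010 (16-byte aligned) a loop at offset 64 lands at 0x1050, which is 16 modulo 32, so the padding spent on 32-byte loop alignment buys only 16-byte alignment.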
Honza