Mail Archives: pgcc/2000/05/09/08:07:01
> Jan,
>
> seems to be the same with Athlon, at least with this one
> vendor_id : AuthenticAMD
> cpu family : 6
> model : 2
> model name : AMD Athlon(tm) Processor
> stepping : 1
> cpu MHz : 698.660058
>
> here again, I got some speedups when I rearranged the code so that no
> instruction crosses a 16-byte boundary.
OK. I will ask AMD about this issue. Alex from AMD claims that the Athlon doesn't
have such problems. It is quite possible that the speedups are caused by some
accidental change elsewhere...
Honza
>
> Wolfgang
>
> Wolfgang Formann wrote:
> >
> > Jan Hubicka wrote:
> > >
> > > > On Sun, 30 Jan 2000, Marc Lehmann wrote:
> > > >
> > > > > > > 10% is really a lot, inside a loop which takes (about) 25-35 cycles.
> > > > >
> > > > > That's very much. I doubt it really is the three nops, but...
> > > >
> > > > > Well, AFAIK the K6 family (especially the K6-1) is pretty sensitive to
> > > > > splitting insns over a cache line boundary. Such cases slow down the
> > > > > decoding of the instruction. Considering the importance of decoder
> > > > > performance on the K6 and the loop length (only 25-35 cycles, as was said),
> > > > > and assuming some longer insns were split this way, a 10% difference
> > > > > is IMHO possible.
> > > I've measured speedups of more than 10% in a number of loops with a patch adding
> > > .p2align 5,,<opcode+modrm length> before each instruction.
> > > I have made a patch for egcs. It is not in the mainline (I will try to
> > > submit an updated version again soon), but you may find it in the mailing list
> > > archives (July or August).
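
A minimal sketch of what the .p2align 5,,<max> directive mentioned above
does, written in Python purely for illustration. It assumes the GNU as
semantics (align to 2^5 = 32 bytes, but emit no padding at all if more than
<max> bytes would be needed). The function name p2align_padding is
hypothetical and is not taken from the egcs patch.

```python
def p2align_padding(addr: int, power: int, max_skip: int) -> int:
    """Return the number of padding bytes `.p2align power,,max_skip`
    would emit at offset `addr` (assumed GNU as behaviour)."""
    boundary = 1 << power
    # Bytes needed to reach the next multiple of `boundary` (0 if aligned).
    needed = (-addr) % boundary
    # The directive is skipped entirely when too much padding is required.
    return needed if needed <= max_skip else 0


if __name__ == "__main__":
    # An insn at offset 30 whose opcode+modrm take 3 bytes would straddle
    # the 32-byte line, so 2 bytes of padding are emitted.
    print(p2align_padding(30, 5, 3))
    # At offset 20 the boundary is 12 bytes away, so nothing is emitted.
    print(p2align_padding(20, 5, 3))
```

With the maximum skip set to the opcode+modrm length, padding is only
emitted when those bytes would otherwise reach or straddle the next 32-byte
boundary, so most instructions get no padding at all.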
> > >
> > > The penalties are not clear (even to the AMD folks), but they are believed
> > > to be the following:
> > > insn opcode crossing cache line boundary (32 bytes) - 1 cycle + insn becoming vector decoded (minimally 2 cycles + lost parallelism)
> > > insn opcode crossing ifetch buffer (16 bytes) - at least 1 cycle
> > > insn mod/rm byte separated by cache line boundary - 1 cycle + lost parallelism in case insn ought to be scheduled to first decoder
> > > insn mod/rm byte separated by ifetch buffer - lost parallelism in case insn ought to be scheduled to first decoder
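
The four suspected cases listed above can be checked mechanically. The
following Python sketch is only an illustration under assumptions: the
opcode is taken to occupy the first opcode_len bytes of the insn, the mod/rm
byte is taken to follow it directly, and "separated" is read as the boundary
falling exactly between the two. The names crosses and classify are
hypothetical, and the penalties themselves remain unconfirmed.

```python
CACHE_LINE = 32  # K6 cache line size in bytes, as stated in the list
IFETCH = 16      # ifetch buffer size in bytes, as stated in the list


def crosses(start: int, length: int, boundary: int) -> bool:
    """True if the bytes [start, start+length) span a multiple of boundary."""
    if length <= 0:
        return False
    return start // boundary != (start + length - 1) // boundary


def classify(addr: int, opcode_len: int, has_modrm: bool = True) -> list:
    """Return the suspected penalty cases for an insn placed at `addr`."""
    cases = []
    # A 32-byte boundary is also a 16-byte boundary; report the worse case.
    if crosses(addr, opcode_len, CACHE_LINE):
        cases.append("opcode crosses cache line")
    elif crosses(addr, opcode_len, IFETCH):
        cases.append("opcode crosses ifetch buffer")
    if has_modrm:
        modrm_addr = addr + opcode_len  # assumed: mod/rm follows the opcode
        if modrm_addr % CACHE_LINE == 0:
            cases.append("mod/rm separated by cache line")
        elif modrm_addr % IFETCH == 0:
            cases.append("mod/rm separated by ifetch buffer")
    return cases


if __name__ == "__main__":
    for addr in (0, 14, 15, 30, 31):
        print(addr, classify(addr, opcode_len=2))
```

Running such a check over a disassembly listing of a hot loop would show
which insns are candidates for padding or reordering.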
> >
> > This seems to be right, so after hacking one more day, I got another
> > ~10% improvement. Altogether, crypt586.pl has improved from the original
> > 13780 to 18912 crypts/second on my good old K6-I/233 :-)
> >
> > But there is still a large number of question marks!
> > Thanks!
> >
> > >
> > > This is not official. Even AMD's K6 emulator is incorrect in handling these
> > > situations, and probably no one knows how it really works.
> > > Especially the penalties for the first case are extreme. In the other cases, padding
> > > with nops may or may not be worthwhile. Reordering insns or moving the whole loop
> > > body helps in all cases, but it is out of reach of gcc's optimizers.
> > >
> > > Does anyone know how the situation looks for the PPro? I thought that only
> > > the ifetch buffers matter and that they are misaligned (so when a long insn
> > > crosses the end of the current ifetch, the next one starts at the start of
> > > that insn), so the .p2align strategy doesn't work there, or am I mistaken?
> > > >
> > > > > BTW: On my K6-2, I get the best performance when loops and functions are
> > > > > aligned to an 8-byte boundary. But this (as well as the cache line end issues)
> > > > > deserves more testing, so I will do that during the weekend.
> > > >
> > >
> > > I've just restarted my work on the K6 support for egcs (and on cleaning up
> > > the code and looking for common bits with the Athlon, which I need for my contract),
> > > so please keep me informed.
> > >
> > > Honza
> > > > Have a nice day
> > > >
> > > > ------------------------------------------------------------------------------
> > > > Martin Ockajak a.k.a. Mandos <mandos AT hq DOT alert DOT sk> http://hq.alert.sk/~mandos
> > > > "The goal of Computer Science is to build something that will last at
> > > > least until we've finished building it."