Mail Archives: pgcc/2000/02/05/14:09:45

Sender: wolfi AT netsurf213 DOT neuss DOT netsurf DOT de
Message-ID: <389C6000.5B79248@neuss.netsurf.de>
Date: Sat, 05 Feb 2000 18:38:08 +0100
From: Wolfgang Formann <w DOT formann AT netsurf213 DOT neuss DOT netsurf DOT de>
X-Mailer: Mozilla 4.6 [en] (X11; I; Linux 2.2.8 i586)
X-Accept-Language: German, de, en
MIME-Version: 1.0
To: pgcc AT delorie DOT com
Subject: Re: pgcc and egcs alignment -- function, basic block and string
References: <20000130211158 DOT D641 AT cerebro DOT laendle> <Pine DOT LNX DOT 4 DOT 21 DOT 0002022017450 DOT 16833-100000 AT hq DOT alert DOT sk> <20000203131955 DOT D12247 AT atrey DOT karlin DOT mff DOT cuni DOT cz>
Reply-To: pgcc AT delorie DOT com

Jan Hubicka wrote:
> 
> > On Sun, 30 Jan 2000, Marc Lehmann wrote:
> >
> > > > 10% is really a lot, inside a loop, which takes (about) 25 * 35 cycles.
> > >
> > > That's very much. I doubt it really is the three nops, but...
> >
> > Well, AFAIK the K6 family (especially the K6-1) is pretty sensitive to
> > splitting insns across a cache line boundary. Such cases slow down
> > instruction decoding. Considering the importance of decoder
> > performance on the K6 and the loop length (only 25-35 cycles, as was said),
> > and assuming some longer insns were split this way, a 10% difference
> > is IMHO possible.
> I've measured speedups of more than 10% in a number of loops with a patch
> adding .p2align 5,,<opcode+modrm length> before each instruction.
> I have made a patch for egcs. It is not in the mainline (I will retry
> submitting an updated version soon), but you may find it in the mailing list
> archives (July or August).
> 
> The penalties are not clear (even to the AMD folks), but they are believed
> to be the following:
> - insn opcode crossing a cache line boundary (32 bytes): 1 cycle + insn
>   becoming vector decoded (minimally 2 cycles + lost parallelism)
> - insn opcode crossing an ifetch buffer boundary (16 bytes): at least 1 cycle
> - insn mod/rm byte separated by a cache line boundary: 1 cycle + lost
>   parallelism in case the insn ought to be scheduled to the first decoder
> - insn mod/rm byte separated by an ifetch buffer boundary: lost parallelism
>   in case the insn ought to be scheduled to the first decoder
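
For readers who want to experiment with the boundary cases listed above, here is a minimal Python sketch (not part of Jan's patch). It assumes the 32-byte cache line and 16-byte ifetch buffer sizes quoted in the list, and models the documented GNU as behaviour of `.p2align power,,max`, which pads to a 2^power boundary only when that needs at most `max` bytes. The function names `crosses` and `p2align_padding` are made up for this illustration.

```python
CACHE_LINE = 32  # bytes, as quoted in the list above
IFETCH = 16      # bytes, as quoted in the list above


def crosses(offset, length, boundary):
    """Return True if the bytes [offset, offset + length) straddle a
    multiple of `boundary` (hypothetical helper, for illustration only)."""
    if length <= 0:
        return False
    first_block = offset // boundary
    last_block = (offset + length - 1) // boundary
    return first_block != last_block


def p2align_padding(offset, power, max_skip):
    """Number of padding bytes `.p2align power,,max_skip` would emit at
    `offset`: pad to the next 2**power boundary, but only if that takes
    at most `max_skip` bytes; otherwise emit nothing."""
    align = 1 << power
    pad = (-offset) % align
    return pad if pad <= max_skip else 0


if __name__ == "__main__":
    # A 3-byte opcode+modrm starting at offset 30 straddles the 32-byte line,
    # so `.p2align 5,,3` pushes it to the next cache line.
    print(crosses(30, 3, CACHE_LINE), p2align_padding(30, 5, 3))
    # At offset 20 the boundary is 12 bytes away, so nothing is padded.
    print(crosses(20, 3, CACHE_LINE), p2align_padding(20, 5, 3))
```

With the maximum skip set to the opcode+modrm length, padding is only emitted when the boundary is so close that those bytes would otherwise be split (or end exactly on it), which keeps the code-size cost small.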

This seems to be right, so after hacking one more day, I get another ~10%
improvement. Altogether, crypt586.pl is improved from the original
13780 to 18912 crypts/second on my good old K6-I/233 :-)
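
As a quick check of the overall gain, the two throughput figures quoted above can be turned into a percentage. This is only a sketch of the arithmetic; the helper name `improvement_percent` is made up for this illustration.

```python
def improvement_percent(before, after):
    """Relative throughput gain, in percent, going from `before` to `after`."""
    return (after / before - 1.0) * 100.0


if __name__ == "__main__":
    # crypts/second figures quoted in the message above
    print(f"{improvement_percent(13780, 18912):.1f}%")  # prints 37.2%
```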

But there is still a large number of question marks!
Thanks!

> 
> This is not official. Even AMD's K6 emulator is incorrect in handling these
> situations and probably no one knows how it really works.
> Especially the penalties for the first case are extreme. In the other cases
> padding by nops may or may not be worthwhile. Reordering insns/moving the whole
> loop body helps in all cases, but it is out of reach of gcc's optimizers.
> 
> Does anyone know how the situation looks for the PPro? I thought that only
> the ifetch buffers matter and that they are misaligned (so when a long insn
> crosses the end of the current ifetch, the next one starts at the start of
> that insn), so the .p2align strategy doesn't work there, or am I mistaken?
> >
> > BTW: On my K6-2, I get the best performance when loops and functions are
> > aligned to an 8-byte boundary. But this (as well as the cache line end
> > issues) deserves more testing, so I will do so during the weekend.
> >
> 
> I've just restarted my work on the K6 support for egcs (and cleaning up
> the code and looking for common bits with the Athlon that I need for my
> contract), so please keep me informed.
> 
> Honza
> > Have a nice day
> >
> > ------------------------------------------------------------------------------
> > Martin Ockajak a.k.a. Mandos  <mandos AT hq DOT alert DOT sk>  http://hq.alert.sk/~mandos
> > "The goal of Computer Science is to build something that will last at
> > least until we've finished building it."
