From: ak AT stuttgart DOT netsurf DOT de (Andreas Kaiser)
To: beastium-list AT Desk DOT nl
Subject: Re: [performance] newer binutils / pgcc / K6
Date: Wed, 15 Apr 1998 00:08:49 GMT
Organization: Ananke
Message-ID: <3537f9b9.782528@mail1.stuttgart.netsurf.de>

On Tue, 14 Apr 1998 02:30:18 +0200 (CEST), Ronald Wahl wrote:

>further testing I found out that it is a code alignment issue. If I use
>-malign-loops=2 the tests run nearly at the same speed as with the older
>versions of binutils (gas). Some tests are a bit slower but not much
>(--> see my appended nbench results). Other alignments will cause
>slowdowns.

I've encountered the same effect in a program I use for benchmarking. AMD suggests aligning branch targets because, like all other X86 CPUs except for the non-MMX Pentium, the K6 won't fetch across a cache line boundary. Even so, the program was faster without target alignment.

My personal interpretation: The 16-entry branch target instruction cache of the K6 appears to be direct mapped by A2..A5. When many branch targets are aligned to 16 bytes (A2,A3=0), they can use only 4 of the 16 entries.

However, it can also result from some other accidental side effect: When the part of the opcode that is required for instruction length detection is split across two cache lines, the instruction becomes microcoded and thus slow (predecoder problem). Aligning instructions (shifting code around) may accidentally enlarge or reduce such effects in critical parts of a program.

Other stuff which affects many X86 CPUs, including the K6: A few months ago, I looked into plain GCC 2.7.2 (EMX) and found a code/data mix which is prone to systematic cache misses.
Ok, plain GCC is old and never knew about split-cache X86s, so I downloaded PGCC (OS/2), and to my big surprise it was exactly the same. Mixing code and data in the same cache line may lead to a ping-pong effect, where the lines are frequently flushed and reloaded from L2 (for X86 CPUs, the same cache line is *never* included in both I- and D-cache). Especially awful is a switch table immediately following the jump using it. This holds for all X86 CPUs with split caches except for the AMD-K5 (where a D-miss/I-hit data read is handled uncached instead), but the K6 is more affected, because wrt. this aspect its cache line size is 64 bytes, not 32.

This is easily avoided by putting const data into the data section, a separate const section (for non-a.out formats) or at least a separate subsection (for the old a.out format). The effect on Perl is quite noticeable.

Another optimization is worth trying: Avoid [ESI] w/o displacement, because such an instruction becomes microcoded (once again the predecoder gets confused, because this address mode has the same opcode as 16-bit absolute). Avoiding [ESI] and the opcode split mentioned above (insert NOPs or lengthen a prior insn), however, can only be done by the assembler.

Just in case it is still found in PGCC (I didn't check it, but it is regularly used in plain GCC): After partial register operations, like "or $0x01,%ah" instead of "or $0x0100,%eax", the next insn using %eax may get stalled until the parts are recombined. This affects many X86s, especially the P6 (the decoder stalls, so it's a very large penalty) and the K6 (data dependency stall). Even the old 486 stalled for 1 clock. Only the Pentium and K5 don't stall (the ever-astonishing K5 is able to collect the parts of a single operand from 3 different sources w/o penalty ;-).

Gruss,
Andreas
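
The branch target cache interpretation earlier in this message can be checked with a little arithmetic. The sketch below is only a toy model: it assumes, as the message itself merely conjectures, a 16-entry cache direct mapped by address bits A2..A5. The names btc_index and entries_used are made up for illustration, and nothing here is measured on real K6 hardware.

```python
# Toy model of the conjecture above: a 16-entry branch target cache,
# direct mapped by address bits A2..A5. This is an assumption taken from
# the message's "personal interpretation", not documented K6 behaviour.
ENTRIES = 16


def btc_index(addr):
    """Return the entry selected by address bits A2..A5 (hypothetical mapping)."""
    return (addr >> 2) & (ENTRIES - 1)


def entries_used(alignment, span=4096):
    """Count the distinct entries reachable by targets aligned to `alignment` bytes."""
    return len({btc_index(addr) for addr in range(0, span, alignment)})


if __name__ == "__main__":
    # Print how many of the 16 entries each target alignment can reach.
    for alignment in (4, 8, 16, 32):
        print(alignment, entries_used(alignment))
```

Under this model, 4-byte-aligned targets can reach all 16 entries, while 16-byte-aligned targets always have A2 and A3 clear and so land only in entries 0, 4, 8 and 12. That would be consistent with the slowdown seen at larger alignments, although it does not rule out the predecoder explanation.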