Sender: nate AT cartsys DOT com
Message-ID: <362CFA62.BA5CB879@cartsys.com>
Date: Tue, 20 Oct 1998 14:02:26 -0700
From: Nate Eldredge
X-Mailer: Mozilla 4.05 [en] (X11; I; Linux 2.0.35 i486)
MIME-Version: 1.0
To: djgpp AT delorie DOT com, ludvig AT club-internet DOT fr
Subject: Re: superslow simpel rep stosl, why?
References: <362B824C DOT 6A9B AT club-internet DOT fr>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Reply-To: djgpp AT delorie DOT com

Ludvig Larsson wrote:
>
> Eli Zaretskii wrote:
> >
> > On Mon, 19 Oct 1998, Ludvig Larsson wrote:
> >
> > > On my AmdK6-2 300mhz it takes 0.006 sec. which gives about
> > > 100millions of bytes/sec. Quite a bit right!
> > > But should it take 3 clockcykles to clear each byte?
> > > I'm clearing quadwords...
> > >
> > > I'm using asm(rep stosl).
> > >
> > > Is this normal?
> >
> > Why not? On a 486 STOSD is documented to require 5 clocks per move,
> > so it doesn't strike me as terribly wrong to get 3 clocks on K6. Keep
> > in mind that it doesn't just move the dword, it also increments a
> > pointer and decrements a count as it goes.
> >
> But? As I'm clearing d-words, each stosl takes 12 cycles...

Actually, the entire *instruction* is supposed to take 5 cycles. That's
for all 4 bytes, and the `rep' adds a constant overhead of 5 cycles
(this is for a 386, for which I have the book). But obviously it's
taking longer here. So...

* First, are you sure your starting address is aligned on a 4-byte
  boundary? Misaligned accesses often have significant speed penalties.
  You might consider aligning to an even larger boundary (8 or 16
  bytes) if that can be arranged.

* You are clearing normal memory (as opposed to video memory), right?
  Video memory is often uncached, which can slow things down (the
  instruction timings don't account for memory wait states). Sometimes
  chips can be configured to disable caching for some region of memory,
  and it's possible that's what's going on here (though I rather doubt
  it).
  You might also want to examine your BIOS configuration and see if you
  have something strange going on (too many wait-states, etc). But be
  careful; wrong settings can send things very awry.

* On recent chips, the designers have put the most energy into speeding
  up those instructions which are statistically used more often. The
  others often slow down. In some cases, that means string instructions
  can be *slower* than their mundane equivalents. I'm not sure whether
  that's the case here. However, I have sent you under separate cover a
  copy of a `memset' routine used by GNU libc on Pentiums and above,
  which does use mostly regular instructions (long chunk of `mov's,
  followed by a big index register increment). Beware-- it is GPL, so
  either make your program GPL, or don't look at the code too closely.

-- 
Nate Eldredge
nate AT cartsys DOT com
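
A rough sketch of the first and last points above, in portable C:
checking that a pointer sits on a given boundary, and clearing dwords
with ordinary stores rather than a string instruction. The names
is_aligned and clear_dwords are hypothetical and chosen only for
illustration. This is not the GNU libc `memset' mentioned in the
message, and whether it beats `rep stosl' on a particular chip is
something to measure, not assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p lies on a `boundary`-byte boundary.
   Assumes boundary is a power of two (4, 8, 16, ...). */
static int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p & (boundary - 1)) == 0;
}

/* Clear `count` 32-bit words with plain stores, unrolled four at a
   time, followed by one pointer bump per group.
   Hypothetical helper for illustration only. */
static void clear_dwords(uint32_t *dst, size_t count)
{
    assert(is_aligned(dst, 4));   /* misaligned stores may be slow */

    while (count >= 4) {
        dst[0] = 0;
        dst[1] = 0;
        dst[2] = 0;
        dst[3] = 0;
        dst += 4;
        count -= 4;
    }
    while (count-- > 0)           /* leftover 0 to 3 words */
        *dst++ = 0;
}
```

Calling is_aligned(buf, 4) before timing the clear is a cheap way to
rule out the misalignment penalty; passing 8 or 16 instead checks the
stricter boundaries suggested above.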