Sender: nate AT cartsys DOT com
Message-ID: <362CFA62.BA5CB879@cartsys.com>
Date: Tue, 20 Oct 1998 14:02:26 -0700
From: Nate Eldredge <nate AT cartsys DOT com>
X-Mailer: Mozilla 4.05 [en] (X11; I; Linux 2.0.35 i486)
MIME-Version: 1.0
To: djgpp AT delorie DOT com, ludvig AT club-internet DOT fr
Subject: Re: superslow simpel rep stosl, why?
References: <Pine DOT SUN DOT 3 DOT 91 DOT 981019102612 DOT 7874P-100000 AT is> <362B824C DOT 6A9B AT club-internet DOT fr>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Reply-To: djgpp AT delorie DOT com

Ludvig Larsson wrote:
> 
> Eli Zaretskii wrote:
> >
> > On Mon, 19 Oct 1998, Ludvig Larsson wrote:
> >
> > > On my AmdK6-2 300mhz it takes 0.006 sec. which gives about
> > > 100millions of bytes/sec. Quite a bit right!
> > > But should it take 3 clockcykles to clear each byte?
> > > I'm clearing quadwords...
> > >
> > > I'm using asm(rep stosl).
> > >
> > > Is this normal?
> >
> > Why not?  On a 486 STOSD is documented to require 5 clocks per move,
> > so it doesn't strike me as terribly wrong to get 3 clocks on K6.  Keep
> > in mind that it doesn't just move the dword, it also increments a
> > pointer and decrements a count as it goes.
> >
> But? As I'm clearing d-words, each stosl takes 12 cycles...

Actually, the entire *instruction* is supposed to take 5 cycles.  That's
for all 4 bytes, and the `rep' adds a constant overhead of 5 cycles
(this is for a 386, for which I have the book).  But obviously it's
taking longer here.  So...
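
(For reference, here is the arithmetic behind those figures as a rough
sketch.  The 600000-byte buffer size is an assumption inferred from the
"about 100 million bytes/sec" and 0.006 sec quoted above, and the
function names are made up for illustration.)

```c
#include <assert.h>
#include <stddef.h>

/* Measured cost of one 4-byte store, in CPU clocks, given the clock
   rate, the elapsed time and the number of bytes cleared. */
static double cycles_per_dword(double clock_hz, double seconds, size_t bytes)
{
    assert(bytes >= 4 && seconds > 0.0);
    double total_cycles = clock_hz * seconds;
    return total_cycles / (double)(bytes / 4);
}

/* 386 book figure quoted above: 5 clocks per stosl iteration,
   plus a constant 5 clocks for the `rep' prefix. */
static unsigned long book_cycles_386(unsigned long dwords)
{
    return 5UL + 5UL * dwords;
}
```

With a 300 MHz clock, 0.006 sec and the assumed 600000 bytes,
cycles_per_dword() comes out at about 12, against the 5 per iteration
that the 386 figure would suggest.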

* First, are you sure your starting address is aligned on a 4-byte
boundary?  Misaligned accesses often have significant speed penalties.
You might consider aligning to an even larger boundary (8 or 16 bytes)
if that can be arranged.
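
A quick way to check this from C, as a minimal sketch: the helper names
below are hypothetical, and <stdint.h> is assumed to be available (an
older compiler may need `unsigned long' in place of uintptr_t).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p lies on an `align'-byte boundary.
   `align' must be a power of two (4, 8, 16, ...). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Round p up to the next `align'-byte boundary; returns p unchanged
   if it is already aligned.  The caller must make sure the buffer has
   up to align-1 spare bytes for the adjustment. */
static void *align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + align - 1) & ~(uintptr_t)(align - 1));
}
```

If is_aligned(buf, 4) comes back zero for the buffer you hand to
`rep stosl', that alone could explain part of the slowdown.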

* You are clearing normal memory (as opposed to video memory), right? 
Video memory is often uncached, which can slow things down (the
instruction timings don't account for memory wait states).  Sometimes
chips can be configured to disable caching for some region of memory,
and it's possible that's what's going on here (though I rather doubt
it).  You might also want to examine your BIOS configuration and see if
you have something strange going on (too many wait-states, etc).  But be
careful; wrong settings can send things very awry.

* On recent chips, the designers have put the most energy into speeding
up those instructions which are statistically used more often.  The
others often slow down.  In some cases, that means string instructions
can be *slower* than their mundane equivalents.  I'm not sure whether
that's the case here.  However, I have sent you under separate cover a
copy of a `memset' routine used by GNU libc on Pentiums and above, which
uses mostly regular instructions (a long chunk of `mov's, followed by
a big index register increment).  Beware: it is GPL, so either make
your program GPL, or don't look at the code too closely.
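
To illustrate the general idea only, here is a sketch in plain C of a
fill loop built from ordinary stores.  This is NOT the glibc routine;
fill_dwords is a made-up name, and the unroll factor of 8 is an
arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `value' into `count' consecutive 32-bit words starting at dst,
   using plain stores instead of a string instruction.
   dst must be 4-byte aligned. */
static void fill_dwords(uint32_t *dst, uint32_t value, size_t count)
{
    assert(((uintptr_t)dst & 3) == 0);

    /* Main loop: a run of eight stores, then one pointer bump. */
    while (count >= 8) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst[4] = value;
        dst[5] = value;
        dst[6] = value;
        dst[7] = value;
        dst += 8;
        count -= 8;
    }

    /* Tail: the remaining 0 to 7 words, one at a time. */
    while (count > 0) {
        *dst++ = value;
        count--;
    }
}
```

Whether this beats `rep stosl' depends on the chip and on what the
compiler makes of the loop, so time both before settling on one.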
-- 

Nate Eldredge
nate AT cartsys DOT com