Date: Wed, 6 Apr 1994 10:23:18 -0400 (EDT)
From: "Chris Mr. Tangerine Man Tate" <FIXER AT FAXCSL DOT DCRT DOT NIH DOT GOV>
To: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: Re: memxxx() library functions

eliz AT is DOT elta DOT co DOT il wrote:

>  While browsing through the libc.a sources, I noticed that the functions of
>the memxxx family (memcpy(), memset() etc.) use the byte-oriented instructions
>(i.e. rep movsb, rep stosb and the like) rather than the word- or double-word
>oriented variations.  Is this intentional?  Won't the operation be sped-up two-
>or four-fold by using movsd/stosd instructions?

Quite likely.  Note, however, that a near-optimal memcpy() is quite hard
to write.  memset() is rather easier, but the "simple" versions are just
that - much simpler than a highly efficient version.

I wrote an extremely efficient memset() routine for the MC68000; I imagine
that the issues faced in an Intel implementation are similar.  The most
important idea is to move longwords (or more - see below) at a time,
rather than bytes, and to guarantee that memory accesses are longword
aligned.  I don't know what the alignment restrictions are on Intel
processors; on the Motorola ones, longword accesses have to occur at
word boundaries (i.e. even addresses).  But they're *much* faster if
they occur at longword (4-byte) boundaries.
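
A minimal C sketch of the alignment idea, not the libc.a code and not my
68000 routine.  The name memset_aligned is made up for illustration, and
the sketch assumes a 4-byte longword.  It writes single bytes until the
pointer reaches a 4-byte boundary, then whole longwords, then any leftover
bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name, for illustration only. */
void *memset_aligned(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)c;
    /* Replicate the fill byte into all four bytes of a longword. */
    uint32_t word = byte * 0x01010101u;

    /* Head: single bytes until p sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: one aligned 32-bit store per 4 bytes.  Storing through a
       uint32_t pointer sidesteps C's aliasing rules; a real library
       routine would be built with that in mind or written in asm. */
    uint32_t *w = (uint32_t *)p;
    while (n >= 4) {
        *w++ = word;
        n -= 4;
    }

    /* Tail: at most 3 leftover bytes. */
    p = (unsigned char *)w;
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```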

The Motorola chips have a MOVEM instruction that allows several (or all!)
registers to be copied to memory, with autodecrement of the index
register.  It's used for saving/restoring registers.  That, in conjunction
with the 68000's "decrement and branch" (DBcc) loop control instructions,
forms the center of the tight loop.  This let me set 32 bytes in each
iteration of the inner loop in my memset().
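
The closest portable analogue I can sketch in C is an unrolled loop that
issues eight longword stores (32 bytes) per iteration.  This is only an
illustration of the loop shape, not the MOVEM-based code; the function
name fill_words_unrolled is hypothetical, and it assumes the caller has
already aligned the destination and converted the byte count to longwords.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill nwords 32-bit words starting at dst. */
void fill_words_unrolled(uint32_t *dst, uint32_t value, size_t nwords)
{
    size_t blocks = nwords / 8;   /* full 32-byte blocks */
    size_t rest   = nwords % 8;   /* leftover longwords  */

    /* Inner loop: 8 stores = 32 bytes per iteration. */
    while (blocks-- > 0) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst[4] = value;
        dst[5] = value;
        dst[6] = value;
        dst[7] = value;
        dst += 8;
    }

    /* Remainder: fewer than 8 longwords left. */
    while (rest-- > 0)
        *dst++ = value;
}
```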

Don't the Intel chips have some sort of block-move instruction?  Remember
also that DJGPP code is running in protected mode, so you can go ahead and
use the 32-bit forms of everything (can't you?).  I'm no Intel asm guru,
by any stretch, but it seems that you *should* be able to do a lot better
than simple byte-by-byte.

*HOWEVER*, it's my experience that at least on my MC68000 testbeds, you
can't do better than byte-by-byte for small blocks (smaller than 32 bytes).
My memset() implementation tested for that case and dropped through to
a simple byte loop.  For small blocks, the overhead of setting up memory
access alignment outweighs the advantage of moving more than one byte at
a time.
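
The small-block test might look like this in C.  It is a sketch only: the
name memset_small_first and the SMALL_BLOCK constant are made up, the
32-byte cutoff is the figure from my 68000 measurements and would need to
be re-measured on any other processor, and the standard memset() stands in
for the aligned longword path.

```c
#include <stddef.h>
#include <string.h>

/* Assumed cutoff; taken from the 68000 figure, not measured on Intel. */
#define SMALL_BLOCK 32

/* Hypothetical name, for illustration only. */
void *memset_small_first(void *dst, int c, size_t n)
{
    if (n < SMALL_BLOCK) {
        /* Small block: plain byte loop, no alignment setup. */
        unsigned char *p = dst;
        while (n-- > 0)
            *p++ = (unsigned char)c;
        return dst;
    }

    /* Large block: stand-in for the aligned, longword-at-a-time path. */
    return memset(dst, c, n);
}
```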

-- Chris Tate
   fixer AT faxcsl DOT dcrt DOT nih DOT gov