Date: Fri, 8 Apr 1994 17:02:40 -0400 (EDT)
From: "Chris Mr. Tangerine Man Tate" <FIXER AT FAXCSL DOT DCRT DOT NIH DOT GOV>
To: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: memxxx(), Duff's Device, etc.

Clearly, I have too much spare time.  :-)

I ran a quick-and-skanky benchmark comparing the library memxxx() routines,
naive byte-by-byte implementations of them, and Duff's Device (unrolled)
versions of them.

The library version of memcpy() is 60% faster than the Duff's Device
implementation, which is in turn about 17% faster than naive byte-by-byte.
That gives you some idea of how good the library routines are.  :-)
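
For anyone who hasn't seen the two hand-written variants, here is a minimal
sketch of what a naive byte copy and a Duff's Device copy look like.  The
names naive_copy() and duff_copy() are made up for illustration; this is
not the benchmarked code, just the general shape of the technique.

```c
#include <stddef.h>

/* Naive version: one byte moved per loop iteration. */
void *naive_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n-- > 0)
        *d++ = *s++;
    return dst;
}

/*
 * Duff's Device: the loop body is unrolled eight times, and the switch
 * jumps into the middle of the body so that the n % 8 leftover bytes
 * are handled on the first pass.  Hypothetical illustration only.
 */
void *duff_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes;

    if (n == 0)                 /* the do-while below always runs once */
        return dst;

    passes = (n + 7) / 8;       /* number of trips through the loop */
    switch (n % 8) {
    case 0: do { *d++ = *s++;
    case 7:      *d++ = *s++;
    case 6:      *d++ = *s++;
    case 5:      *d++ = *s++;
    case 4:      *d++ = *s++;
    case 3:      *d++ = *s++;
    case 2:      *d++ = *s++;
    case 1:      *d++ = *s++;
            } while (--passes > 0);
    }
    return dst;
}
```

Both functions take the same arguments as memcpy() and, like it, assume
the source and destination do not overlap.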

On a related note, I'm curious about the Intel architecture.  Specifically,
I'd like to know:

a) Does it have odd-address access restrictions?
b) Are accesses on longword (4-byte) boundaries faster than accesses on
   word (2-byte) boundaries?

If the answer to (b) is yes, it would probably be worth writing a somewhat
more complex, optimized version of memset() and memcpy(), assuming the
library versions of those routines currently work by byte accesses.  If
they use some block-move instruction, then give up; it's already optimal.  :-)
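
To make the idea concrete, here is a rough sketch of the kind of "more
complex" memset() I have in mind, assuming aligned 4-byte stores really do
beat byte stores.  The name word_set() is hypothetical, and whether this
wins anything on Intel hardware is exactly the open question above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of a longword-at-a-time memset().  Assumes 32-bit aligned
 * stores are cheaper than four byte stores.  A production version
 * would also have to respect the compiler's aliasing rules.
 */
void *word_set(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    unsigned char byte = (unsigned char)c;
    uint32_t word = byte * 0x01010101u;   /* fill byte in all 4 lanes */
    uint32_t *w;

    /* Head: byte stores until d sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = byte;
        n--;
    }

    /* Body: one aligned 32-bit store per four bytes. */
    w = (uint32_t *)d;
    while (n >= 4) {
        *w++ = word;
        n -= 4;
    }

    /* Tail: at most three bytes are left over. */
    d = (unsigned char *)w;
    while (n-- > 0)
        *d++ = byte;
    return dst;
}
```

The head and tail loops are what make it "somewhat more complex" than the
byte version; only the middle loop benefits from the wider stores.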

-- chris tate
   fixer AT faxcsl DOT dcrt DOT nih DOT gov