Date: Wed, 6 Apr 1994 10:23:18 -0400 (EDT)
From: "Chris Mr. Tangerine Man Tate"
To: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: Re: memxxx() library functions

eliz AT is DOT elta DOT co DOT il wrote:

> While browsing through the libc.a sources, I noticed that the functions
> of the memxxx family (memcpy(), memset() etc.) use the byte-oriented
> instructions (i.e. rep movsb, rep stosb and the like) rather than the
> word- or double-word oriented variations. Is this intentional? Won't the
> operation be sped-up two- or four-fold by using movsd/stosd instructions?

Quite likely. Note, however, that a near-optimal memcpy() is quite hard to
write. memset() is rather easier, but the "simple" versions are just that -
much simpler than a highly efficient version.

I wrote an extremely efficient memset() routine for the MC68000; I imagine
that the issues faced in an Intel implementation are similar. The most
important idea is to move longwords (or more - see below) at a time, rather
than bytes, and to guarantee that memory accesses are longword aligned.

I don't know what the alignment restrictions are on Intel processors; on the
Motorola ones, longword accesses have to occur at word boundaries (i.e. even
addresses). But they're *much* faster if they occur at longword (4-byte)
boundaries.

The Motorola chips have a MOVEM instruction that allows several (or all!)
registers to be copied to memory, with autodecrement of the index register.
It's used for saving/restoring registers. That, in conjunction with the
68000's "decrement and branch if not zero" loop control instructions, forms
the center of the tight loop. This let me set 32 bytes in each iteration of
the inner loop in my memset().

Don't the Intel chips have some sort of block-move instruction? Remember
also that DJGPP code is running in protected mode, so you can go ahead and
use the 32-bit forms of everything (can't you?).

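To make the "align first, then move longwords" idea above concrete, here is a minimal sketch in C rather than assembly. The name memset_aligned is made up for this example and is not the DJGPP libc routine. It assumes a 4-byte longword and a compiler that tolerates storing through a uint32_t pointer into byte storage. It does not unroll the inner loop or use any block-store instruction, so treat it as an outline of the technique, not a tuned implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only (not the libc memset). */
void *memset_aligned(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;

    /* Head: store single bytes until p sits on a 4-byte boundary
       (or the request is used up). */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: replicate the byte into all four lanes of a longword and
       store one aligned longword per iteration.
       Assumption: the uint32_t store into byte storage is accepted
       by the compiler in use. */
    uint32_t pattern = byte * 0x01010101u;
    uint32_t *w = (uint32_t *)p;
    while (n >= 4) {
        *w++ = pattern;
        n -= 4;
    }

    /* Tail: finish the last 0..3 bytes one at a time. */
    p = (unsigned char *)w;
    while (n > 0) {
        *p++ = byte;
        n--;
    }

    return dst;
}
```

A call such as memset_aligned(buf + 3, 0x5A, 41) writes one byte to reach the boundary, then nine aligned longwords, then four trailing bytes. Whether this beats a plain byte loop on a given CPU is something to measure, not assume.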
I'm no Intel asm guru, by any stretch, but it seems that you *should* be
able to do a lot better than simple byte-by-byte.

*HOWEVER*, it's my experience that at least on my MC68000 testbeds, you
can't do better than byte-by-byte for small blocks (smaller than 32 bytes).
My memset() implementation tested for that case, and dropped through to a
simple loop in that case. The overhead of setting up memory access alignment
outweighs the advantage of moving more than one byte at a time for small
blocks.

--
Chris Tate
fixer AT faxcsl DOT dcrt DOT nih DOT gov