Mail Archives: djgpp/1995/03/22/21:00:50
NOTE: if people want to see some good implementations of these
routines, they should check out the inline asm versions in the Linux
headers, e.g. linux/asm/string.h.  They are impressive.
>>>>> "Ryuichiro" == Ryuichiro Araki <raraki AT human DOT waseda DOT ac DOT jp> writes:
>>>>> Mat Hostetter <mat AT ardi DOT com> writes:
>> This is much better, but what if %esi and %edi are not aligned
>> mod 4?  Every single transfer might have an unaligned load and an
>> unaligned store, which is slow.
Ryuichiro> Right, right!!  When I wrote my memcpy() and other gas
Ryuichiro> routines, I added a small piece of code before the movsl
Ryuichiro> to make either %esi or %edi 4-byte aligned as well
Ryuichiro> (possibly much like yours, though I'm not sure my code
Ryuichiro> was efficient enough).  But the difference in performance
Ryuichiro> was not very noticeable unless memcpy() transferred quite
Ryuichiro> a bit of data at once.  On the contrary, I have seen such
Ryuichiro> code hurt when programs mainly transfer small data
Ryuichiro> fragments (say, shorter than 8-10 bytes; this is likely
Ryuichiro> with programs that mainly process short tokens, e.g.
Ryuichiro> compilers, assemblers, etc.).
1) The added overhead of seeing if the size is < 16 bytes and then
doing movsb instead is negligible.  When that happens you can punt
the (also negligible) overhead associated with long moves and cleanup.
2) Many small moves will be constant sized (e.g. sizeof (struct foo)),
and will get inlined by gcc.
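To make 1) and 2) concrete, here is a rough C sketch of the kind of
memcpy under discussion (not anybody's actual code; my_memcpy and the
cutoff of 16 are just placeholder names, and the word type assumes
32-bit longs as on a 386/486):

    #include <stddef.h>

    #define SMALL_CUTOFF 16  /* below this, alignment setup isn't worth it */

    void *my_memcpy(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if (n >= SMALL_CUTOFF) {
            /* Align the destination to 4 bytes (at most 3 byte moves). */
            while (((unsigned long) d & 3) && n) {
                *d++ = *s++;
                n--;
            }
            /* Bulk of the copy as 32-bit words (the movsl part). */
            while (n >= 4) {
                *(unsigned long *) d = *(const unsigned long *) s;
                d += 4;
                s += 4;
                n -= 4;
            }
        }
        /* Small copies and the leftover tail: plain byte loop (movsb). */
        while (n--)
            *d++ = *s++;

        return dst;
    }

Note that only the destination gets aligned here; if the source and
destination differ mod 4, the word loads are still unaligned, which is
the case raised in the quote above -- you can only guarantee alignment
for one of the two pointers.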
Ryuichiro> Somebody suggested that I try 16-byte alignment on the
Ryuichiro> 486/Pentium, since the line size of the internal cache in
Ryuichiro> these processors is 16 bytes, and thus 16-byte alignment
Ryuichiro> might reduce cache misses.  How about this idea, Mat?
I don't see how the cache line size affects memcpy, though, since you
can only transfer 4 bytes at a time (the 68040 lets you copy entire
cache lines at once). It might be to your advantage to prefetch the
next cache line 16 bytes in advance, so that line will be ready by the
time you actually read it. I'm not sure if you can do that on the
x86, but you can do it on some other processors.
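For illustration only, the read-ahead idea might look something like
this in C, assuming a 16-byte line and no real prefetch instruction:
touch one byte of the following line before copying the current one,
so the line fill overlaps with the copy.  Whether this actually wins
on any given chip is another matter.

    #include <stddef.h>

    #define LINE 16

    static volatile unsigned char sink;  /* keeps the dummy loads alive */

    void copy_with_touch(unsigned long *dst, const unsigned long *src,
                         size_t nwords)
    {
        size_t nbytes = nwords * sizeof *src;
        size_t i;

        for (i = 0; i < nwords; i++) {
            size_t off = i * sizeof *src;
            /* At the start of each cache line, touch the next line
               (if there is one) so it starts filling early. */
            if (off % LINE == 0 && off + LINE < nbytes)
                sink = ((const unsigned char *) src)[off + LINE];
            dst[i] = src[i];
        }
    }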
Aligning your data structures on 16-byte boundaries would slightly
improve your cache behavior when referencing those structures (if, for
example, you are referencing a huge array of 16-byte structs in a
tight loop, you might want to align your array mod 16 bytes).
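If malloc doesn't happen to give you that alignment, one way to get it
is to over-allocate and round the pointer up.  This is just a sketch
with made-up names (alloc_aligned16/free_aligned16), and the cast
assumes a pointer fits in an unsigned long:

    #include <stdlib.h>

    void *alloc_aligned16(size_t size)
    {
        /* Over-allocate: room to round up plus room to stash the
           original pointer just below the aligned block. */
        void *raw = malloc(size + 16 + sizeof(void *));
        unsigned long p;

        if (!raw)
            return NULL;
        p = ((unsigned long) raw + sizeof(void *) + 15) & ~15UL;
        ((void **) p)[-1] = raw;   /* remember what to free later */
        return (void *) p;
    }

    void free_aligned16(void *ptr)
    {
        if (ptr)
            free(((void **) ptr)[-1]);
    }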
Also, I'd guess there's a lot of pointless bus activity on machines
with write-through caches during memcpys... I'm not sure if there's a
reasonable way around that.
-Mat