Date: Wed, 22 Mar 95 17:50 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: DJGPP AT sun DOT soe DOT clarkson DOT edu
Subject: Re: A quick way to copy n bytes
References: <199503220842 DOT RAA28714 AT wutc DOT human DOT waseda DOT ac DOT jp>

NOTE: if people want to see some good implementations of these
routines, check out the inline asm versions in the Linux headers,
e.g. linux/asm/string.h.  They are impressive.

>>>>> "Ryuichiro" == Ryuichiro Araki writes:

>>>>> Mat Hostetter writes:
 >> This is much better, but what if %esi and %edi are not aligned
 >> % 4?  Every single transfer might have an unaligned load and an
 >> unaligned store, which is slow.

Ryuichiro> Right, right!!  By adding a small code before movsl, I
Ryuichiro> tried to make either %esi or %edi 4 bytes-aligned, too
Ryuichiro> (possibly the way like yours, but I'm not sure that
Ryuichiro> code was efficient enough) when I wrote my memcpy() and
Ryuichiro> other gas codes.  But the difference in performance was
Ryuichiro> not so remarkable unless memcpy() transfers quite a bit
Ryuichiro> of data at once.  On the contrary, I have experienced
Ryuichiro> the adverse effect of such a code when programs mainly
Ryuichiro> transfer small data fragments (say, shorter than 8 - 10
Ryuichiro> bytes.  This is likely with programs which mainly
Ryuichiro> process short tokens, i.e., compilers, assemblers,
Ryuichiro> etc.)

1) The added overhead of seeing if the size is < 16 bytes and then
   doing movsb instead is negligible.  When that happens you can punt
   the (also negligible) overhead associated with long moves and
   cleanup.

2) Many small moves will be constant sized (e.g. sizeof (struct foo)),
   and will get inlined by gcc.

Ryuichiro> Somebody suggested to me trying 16 bytes alignment on
Ryuichiro> 486/Pentium, since cache line size of the internal
Ryuichiro> cache in these processors is 16 bytes and thus 16 bytes
Ryuichiro> alignment might reduce cache misses.  How about this
Ryuichiro> idea, Mat?

I don't see how the cache line size affects memcpy, though, since you
can only transfer 4 bytes at a time (the 68040 lets you copy entire
cache lines at once).  It might be to your advantage to prefetch the
next cache line 16 bytes in advance, so that line will be ready by
the time you actually read it.  I'm not sure if you can do that on
the x86, but you can do it on some other processors.

Aligning your data structures on 16 byte boundaries would slightly
improve your cache behavior when referencing those structures (if,
for example, you are referencing a huge array of 16 byte structs in a
tight loop you might want to align your array mod 16 bytes).

Also, I'd guess there's a lot of pointless bus activity on machines
with write-through caches during memcpy's... I'm not sure if there's
a reasonable way around that.

-Mat
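
To make the small-size check and the destination-alignment idea from
the discussion above concrete, here is a rough, untested sketch of
such a memcpy, written as GCC inline asm for a flat-model i386.  The
name my_memcpy is just a placeholder; this is a sketch of the
technique being discussed, not the actual DJGPP or Linux routine.

#include <stddef.h>

void *
my_memcpy (void *dst, const void *src, size_t n)
{
  void *ret = dst;
  long d0, d1, d2;              /* dummies marking clobbered registers */

  if (n < 16)
    {
      /* Small copy: a bare movsb loop, skipping the alignment and
         cleanup code whose overhead would dominate at these sizes.  */
      __asm__ __volatile__ ("cld\n\trep\n\tmovsb"
          : "=&D" (d0), "=&S" (d1), "=&c" (d2)
          : "0" ((long) dst), "1" ((long) src), "2" ((long) n)
          : "memory");
      return ret;
    }

  /* Large copy: move a few bytes to get %edi aligned mod 4, do the
     bulk of the work with movsl, then mop up the 0-3 byte tail.  */
  __asm__ __volatile__ (
      "cld\n\t"
      "movl %%edi,%%eax\n\t"
      "negl %%eax\n\t"
      "andl $3,%%eax\n\t"       /* bytes needed to align the destination */
      "subl %%eax,%%edx\n\t"    /* count remaining after the head */
      "movl %%eax,%%ecx\n\t"
      "rep\n\tmovsb\n\t"        /* copy the unaligned head */
      "movl %%edx,%%ecx\n\t"
      "shrl $2,%%ecx\n\t"
      "rep\n\tmovsl\n\t"        /* bulk copy, one long at a time */
      "movl %%edx,%%ecx\n\t"
      "andl $3,%%ecx\n\t"
      "rep\n\tmovsb"            /* copy the leftover tail */
      : "=&D" (d0), "=&S" (d1), "=&d" (d2)
      : "0" ((long) dst), "1" ((long) src), "2" ((long) n)
      : "eax", "ecx", "memory");

  return ret;
}

Note that this only aligns the destination, so only the stores are
guaranteed aligned; if the source and destination are misaligned
relative to each other, the movsl loads will still be unaligned.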