Mail Archives: djgpp/1995/03/22/21:00:50

Date: Wed, 22 Mar 95 17:50 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: DJGPP AT sun DOT soe DOT clarkson DOT edu
Subject: Re: A quick way to copy n bytes
References: <199503220842 DOT RAA28714 AT wutc DOT human DOT waseda DOT ac DOT jp>

NOTE:  if you want to see some good implementations of these routines,
       check out the inline asm versions in the Linux headers,
       e.g. linux/asm/string.h.  They are impressive.


>>>>> "Ryuichiro" == Ryuichiro Araki <raraki AT human DOT waseda DOT ac DOT jp> writes:

>>>>> Mat Hostetter <mat AT ardi DOT com> writes:
    >> This is much better, but what if %esi and %edi are not aligned
    >> %4?  Every single transfer might have an unaligned load and an
    >> unaligned store, which is slow.

    Ryuichiro> Right, right!!  By adding a small code before movsl, I
    Ryuichiro> tried to make either %esi or %edi 4 bytes-aligned, too
    Ryuichiro> (possibly the way like yours, but I'm not sure that
    Ryuichiro> code was efficient enough) when I wrote my memcpy() and
    Ryuichiro> other gas codes.  But the difference in performance was
    Ryuichiro> not so remarkable unless memcpy() transfers quite a bit
    Ryuichiro> of data at once.  On the contrary, I have experienced
    Ryuichiro> the adverse effect of such a code when programs mainly
    Ryuichiro> transfer small data fragments (say, shorter than 8 - 10
    Ryuichiro> bytes.  This is likely with programs which mainly
    Ryuichiro> process short tokens, i.e., compilers, assemblers,
    Ryuichiro> etc.)

1) The added overhead of checking whether the size is < 16 bytes and
   then doing movsb instead is negligible.  When that happens you can
   skip the (also negligible) overhead associated with long moves and
   cleanup.
2) Many small moves will be constant sized (e.g. sizeof (struct foo)),
   and will get inlined by gcc.
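
The two points above can be sketched in portable C rather than gas.
This is only an illustration, not the routine discussed in this thread:
the name copy_bytes and the SMALL_COPY_LIMIT macro are made up here, and
the 16-byte threshold is just the figure mentioned above, to be tuned by
measurement.  The fixed-size 4-byte memcpy calls stand in for movsl and
rely on the compiler inlining constant-sized copies, as in point 2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold: below this, skip all alignment work. */
#define SMALL_COPY_LIMIT 16

/* Sketch of a copy routine: byte loop for short copies, otherwise
   align the destination to 4 bytes and move 32-bit words. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Short copy: plain byte loop, no setup or cleanup. */
    if (n < SMALL_COPY_LIMIT) {
        while (n--)
            *d++ = *s++;
        return dst;
    }

    /* Head: at most 3 bytes until the destination is 4-byte aligned.
       n >= 16 here, so n cannot underflow. */
    while (((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one 32-bit word per iteration.  The stores are aligned;
       the loads may still be unaligned if src and dst differ mod 4. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* constant size: typically one load */
        memcpy(d, &w, sizeof w);   /* constant size: typically one store */
        s += 4;
        d += 4;
        n -= 4;
    }

    /* Tail: the remaining 0 to 3 bytes. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Aligning the destination instead of the source is an arbitrary choice
in this sketch; which side matters more would have to be measured.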

    Ryuichiro> Somebody suggested to me trying 16 bytes alignment on
    Ryuichiro> 486/Pentium, since cache line size of the internal
    Ryuichiro> cache in these processors is 16 bytes and thus 16 bytes
    Ryuichiro> alignment might reduce cache misses.  How about this
    Ryuichiro> idea, Mat?

I don't see how the cache line size affects memcpy, though, since you
can only transfer 4 bytes at a time (the 68040 lets you copy entire
cache lines at once).  It might be to your advantage to prefetch the
next cache line 16 bytes in advance, so that line will be ready by the
time you actually read it.  I'm not sure if you can do that on the
x86, but you can do it on some other processors.
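
For what it's worth, here is what "prefetch one line ahead" could look
like in C.  It assumes a compiler that provides GCC's
__builtin_prefetch, which is not something this message relies on, and
the builtin is only a hint: whether it does anything depends on the
target CPU.  The function name and the CACHE_LINE macro are made up for
the example; 16 is the line size mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes (the 16-byte figure from the text). */
#define CACHE_LINE 16

/* Word-at-a-time copy that hints the next source line before reading
   the current one.  Hypothetical name; a sketch, not a tuned routine. */
void copy_words_prefetch(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    const size_t words_per_line = CACHE_LINE / sizeof(uint32_t);

    for (size_t i = 0; i < nwords; i++) {
        /* Once per line, ask for the line one CACHE_LINE ahead.
           The bounds check keeps the address inside the source array. */
        if (i % words_per_line == 0 && i + words_per_line < nwords)
            __builtin_prefetch(&src[i + words_per_line], 0, 3);

        dst[i] = src[i];
    }
}
```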

Aligning your data structures on 16 byte boundaries would slightly
improve your cache behavior when referencing those structures (if, for
example, you are referencing a huge array of 16 byte structs in a
tight loop you might want to align your array mod 16 bytes).
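
A minimal sketch of that array case, assuming a C11 compiler (alignas
and _Static_assert are C11 features, not anything available in the gcc
discussed here).  The struct, the array and the function names are all
hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte record: exactly one cache line per element
   when the array itself starts on a 16-byte boundary. */
struct sample {
    uint32_t a, b, c, d;
};

_Static_assert(sizeof(struct sample) == 16, "struct sample must be 16 bytes");

/* Align the array mod 16 so that no element straddles two lines. */
static alignas(16) struct sample table[1024];

/* Tight loop over the array: each iteration touches one line. */
uint32_t sum_a(const struct sample *t, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += t[i].a;
    return total;
}

/* Returns nonzero if the array really starts on a 16-byte boundary. */
int table_is_line_aligned(void)
{
    return ((uintptr_t)table % 16) == 0;
}
```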

Also, I'd guess there's a lot of pointless bus activity on machines
with write-through caches during memcpy's...I'm not sure if there's a
reasonable way around that.

-Mat
