Mail Archives: djgpp/1995/03/22/04:57:20
>>>>> Mat Hostetter <mat AT ardi DOT com> writes:
> This is much better, but what if %esi and %edi are not aligned %4?
> Every single transfer might have an unaligned load and an unaligned
> store, which is slow.
Right, right!! When I wrote my memcpy() and other gas routines, I also tried
adding a little code before the movsl to make either %esi or %edi 4-byte
aligned (possibly much the way yours does, though I'm not sure my code was
efficient enough). But the difference in performance was not that remarkable
unless memcpy() transferred quite a bit of data at once. On the contrary, I
saw an adverse effect from such code when programs mainly transfer small
data fragments (say, shorter than 8 - 10 bytes; this is likely with programs
that mainly process short tokens, e.g., compilers, assemblers, etc.),
probably due to the overhead of the alignment prologue. The situation might
be different on 486 and/or Pentium machines, though: I checked the
performance only on an *old* 386 machine long ago, and I can't repeat the
test on newer PCs, since I've deleted the old code :-<
Somebody suggested that I try 16-byte alignment on the 486/Pentium, since
the cache line size of the internal cache is 16 bytes on the 486 (the
Pentium's is 32 bytes) and 16-byte alignment might therefore reduce cache
misses. What do you think of this idea, Mat?
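
To put a number on the prologue overhead being discussed, here is a minimal C sketch (not code from any DJGPP release) that computes how many single-byte moves are needed before a pointer reaches a given alignment. The name bytes_to_align is hypothetical, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes to step forward from p until it
 * is a multiple of align. Assumes align is a nonzero power of two. */
size_t bytes_to_align(const void *p, size_t align)
{
    uintptr_t mask;

    assert(align != 0 && (align & (align - 1)) == 0);
    mask = (uintptr_t)align - 1;

    /* Distance to the next boundary; 0 if p is already aligned. */
    return (size_t)((align - ((uintptr_t)p & mask)) & mask);
}
```

With 4-byte alignment the prologue is at most 3 byte moves, while with 16-byte alignment it can be up to 15, so the threshold below which the alignment step is skipped would presumably have to rise as well.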
> I fixed this in the memcpy and movedata for the current V2 alpha.
> They do movsb's until either %esi or %edi is long-aligned before doing
> movsl's (and hopefully both are aligned then). The code checks for
> small moves right away and just uses movsb for them, skipping the
> alignment overhead.
Cool. I haven't looked into the current V2 alpha yet. I'll examine how
well your code works. Thank you for the valuable information.
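
For readers following along, the scheme Mat describes (check for small moves first, byte moves until aligned, then 4-byte moves) might look roughly like this in portable C. This is only a sketch and not the actual V2 alpha code: the name sketch_memcpy and the SMALL_MOVE threshold of 8 are assumptions, and for simplicity it aligns only the destination rather than testing both pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed threshold: shorter moves skip the alignment step entirely. */
#define SMALL_MOVE 8

/* Sketch of an alignment-aware copy; regions must not overlap. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    if (n >= SMALL_MOVE) {
        /* Byte moves (the movsb part) until the destination is
         * 4-byte aligned. At most 3 bytes, so n cannot underflow. */
        while (((uintptr_t)d & 3) != 0) {
            *d++ = *s++;
            n--;
        }
        /* 4-byte moves (the movsl part). The fixed-size memcpy calls
         * are the portable way to load a word from a source that may
         * still be unaligned. */
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
    }

    /* Tail bytes, or the whole move when it was small. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

When source and destination have the same offset modulo 4, both end up aligned after the byte moves; otherwise only the stores are aligned and every load still straddles a boundary.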
> For what it's worth, I also modified memset to do aligned stosl's when
> possible.
I did, too:-)
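
The same idea applied to memset can be sketched in C as follows. Again this is an illustration under assumptions, not the code either of us wrote: sketch_memset is a hypothetical name, and the 4-byte stores stand in for stosl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a fill that uses aligned 4-byte stores where possible. */
void *sketch_memset(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    unsigned char b = (unsigned char)c;
    /* Replicate the fill byte into all four bytes of a word. */
    uint32_t w = (uint32_t)b * 0x01010101u;

    assert(n == 0 || dst != NULL);

    /* Byte stores (stosb) until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = b;
        n--;
    }
    /* Aligned 4-byte stores (the stosl part). */
    while (n >= 4) {
        memcpy(d, &w, 4);
        d += 4;
        n -= 4;
    }
    /* Remaining tail bytes. */
    while (n > 0) {
        *d++ = b;
        n--;
    }
    return dst;
}
```

Since memset has only one pointer to worry about, the word stores are always aligned once the leading byte stores are done.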
---
raraki(Ryuichiro Araki)
raraki AT human DOT waseda DOT ac DOT jp