Mail Archives: djgpp/1997/11/30/21:31:11
> Other thing is that I tried to write memcpy using 64bit FPU registers as
> someone here suggested. It's about _20% faster_!!
If you know what the src values are and know they won't produce errors
(no bit patterns that decode as signalling NaNs, for instance), you can
speed the code up even more by using the normal FP loads and stores, i.e.:
fldl src
fldl src + 8
fldl src + 16
...
fxch st7, st0
fstpl dest + ...
fstpl dest + ...
etc...
Which is 3 cycles per iteration...
> _LoopPoint:
> fildq (%%eax,%%ecx)
> fistpq (%%ebx,%%ecx)
Have you tried unrolling this more? The fistpq right after the fildq
(IIRC) causes a stall which can be prevented by unrolling out...
fildq src
fildq src + 8
...
fildq src + 56
fxch st7, st0
fistpq dest
fistpq dest + 48
...
fistpq dest + 56
Note: the rest of your code becomes simpler too as you don't have to worry
about adding registers to attain offsets etc...
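For what it's worth, here is a rough sketch of that unrolled fildq/fistpq block written as GCC inline asm inside a C function. The function name fpu_copy64 and the multiple-of-64 restriction are my own assumptions for illustration, not anything from the code quoted above. Note that it skips the fxch altogether and simply stores in reverse order (the last value loaded sits on top of the FPU stack), which is a substitution on my part. On non-x86 targets it just falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes in 64-byte blocks through the x87
   stack. n must be a multiple of 64. fild/fistp round-trip every 64-bit
   pattern exactly because the extended format has a 64-bit mantissa. */
void fpu_copy64(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    assert(n % 64 == 0);
    for (; n != 0; n -= 64, s += 64, d += 64) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        __asm__ volatile(
            /* eight loads back to back, so no store waits on a load */
            "fildq   0(%0)\n\t"
            "fildq   8(%0)\n\t"
            "fildq  16(%0)\n\t"
            "fildq  24(%0)\n\t"
            "fildq  32(%0)\n\t"
            "fildq  40(%0)\n\t"
            "fildq  48(%0)\n\t"
            "fildq  56(%0)\n\t"
            /* st0 now holds src+56, so pop in reverse order */
            "fistpq 56(%1)\n\t"
            "fistpq 48(%1)\n\t"
            "fistpq 40(%1)\n\t"
            "fistpq 32(%1)\n\t"
            "fistpq 24(%1)\n\t"
            "fistpq 16(%1)\n\t"
            "fistpq  8(%1)\n\t"
            "fistpq  0(%1)"
            :
            : "r"(s), "r"(d)
            : "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)",
              "st(6)", "st(7)", "memory");
#else
        memcpy(d, s, 64);   /* portable fallback, not the FPU trick */
#endif
    }
}
```

Since the offsets are constants inside each block, the only pointer arithmetic left is the per-block bump of s and d. Whether it actually beats a plain memcpy depends on the CPU and DPMI host, so time it before relying on it.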
> Interesting thing is that it runs only 10-12% faster with cwsdpmi r3 and r4, but
> pmode (1.2), cwsdpr0 (both r3 and r4) and qdpmi (1.1 from QEMM 8.0) run the
> cpu code faster. The normal memcpy is about the same.
Using proper fld/fstp instructions you can do something like 64 byte moves
in around 24 (I think) cycles (not considering cache misses). I used it to clear
memory buffers (such as floating-point Z-buffers), and the clears were very fast
in SW. It just means keeping a small 64-byte zeroed memory region to fld from,
then fstp-ing to the frame buffer memory location...
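A minimal sketch of that buffer clear in the same GCC inline asm style, assuming the buffer length is a multiple of 64. The names zero_block and fpu_clear64 are made up for the example. Plain fldl/fstpl is safe here because +0.0 is all zero bits and comes back out unchanged. On non-x86 targets it falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 64-byte zeroed region to load from (eight doubles of +0.0). */
static const double zero_block[8] = {0};

/* Hypothetical helper: clear n bytes (n a multiple of 64) by loading eight
   zero doubles onto the x87 stack and popping them into the buffer. */
void fpu_clear64(void *dest, size_t n)
{
    unsigned char *d = dest;

    assert(n % 64 == 0);
    for (; n != 0; n -= 64, d += 64) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        __asm__ volatile(
            "fldl    0(%0)\n\t"
            "fldl    8(%0)\n\t"
            "fldl   16(%0)\n\t"
            "fldl   24(%0)\n\t"
            "fldl   32(%0)\n\t"
            "fldl   40(%0)\n\t"
            "fldl   48(%0)\n\t"
            "fldl   56(%0)\n\t"
            /* every stack slot holds +0.0, so store order is irrelevant */
            "fstpl   0(%1)\n\t"
            "fstpl   8(%1)\n\t"
            "fstpl  16(%1)\n\t"
            "fstpl  24(%1)\n\t"
            "fstpl  32(%1)\n\t"
            "fstpl  40(%1)\n\t"
            "fstpl  48(%1)\n\t"
            "fstpl  56(%1)"
            :
            : "r"(zero_block), "r"(d)
            : "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)",
              "st(6)", "st(7)", "memory");
#else
        memset(d, 0, 64);   /* portable fallback, not the FPU trick */
#endif
    }
}
```

Usage would be something like fpu_clear64(zbuffer, width * height * sizeof(float)), provided that size works out to a multiple of 64.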
Leathal.