From: nikki AT gameboutique DOT co (nikki)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Allegro perspective-correct .. (fpu memcopy)
Date: 4 Mar 1997 19:35:52 GMT
Organization: GameBoutique Ltd.
Lines: 53
Message-ID: <5fhtio$rqm@flex.uunet.pipex.com>
References: <Pine DOT SUN DOT 3 DOT 91 DOT 970304170108 DOT 12717B-100000 AT is>
NNTP-Posting-Host: www.gameboutique.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

>> > 	2) Cause your program to ignore FP exceptions by including the
>> > following somewhere at its beginning:
>> > 
>> > 	#include <signal.h>
>> > 	...
>> > 	signal (SIGFPE, SIG_IGN);
>> 
>> i wasn't aware you could do this actually. does this perhaps mean that you
>> could use a memcopy with fld and fstp and just ignore errors like this? it
>> would be much faster than the fild fistp version obviously...

well, for the benefit of the djgpp community as a whole here's the result.
first the standard fpu memcopy which i use. this is 2 cycles faster than the
fastest i've ever seen anywhere else (the agner fog articles) and is 100%
accurate :

asm volatile ("1:\n\t"
              "fildq (%%esi)\n\t"             // load first qword  1 NP (2,3)
              "fildq 8(%%esi)\n\t"            // load second qword 2 NP (3,4)
              "addl $16,%%esi\n\t"            // update esi        3 uv
              "addl $16,%%edi\n\t"            // update edi        3 uv
              "fistpq -8(%%edi)\n\t"          // save 2nd qword    4 NP (-9)
              "fistpq -16(%%edi)\n\t"         // save 1st qword   10 NP (-15)
              "decl %%ecx\n\t"                // dec ecx          16 uv
              "jnz 1b"                        // (loop)           16  v
             :
             : "S" (scr_buf), "D" (videoptr), "c" (no_to_move)
             : "ecx", "esi", "edi" );

as you can see, the slow part is the fistp, which takes a fat 6 cycles and
doesn't pair (NP) :( but the loop still manages 16 bytes in 16 cycles, with
half the normal write misses and associated cache penalties.
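
for anyone building this with a newer gcc: current versions reject a clobber list that names registers also used as inputs, so the constraints above need restating. below is a rough sketch of the same fild/fistp loop wrapped in a function. fpu_copy16 is a made-up name, the generic "r" registers replace the fixed esi/edi/ecx, and the memcpy fallback for non-x86 targets is an assumption added only so the block compiles everywhere. the cycle counts above were not re-measured for it.

```c
#include <stddef.h>
#include <string.h>

/* Copy `blocks` 16-byte blocks from src to dst through the x87 FPU.
   fpu_copy16 is a hypothetical name, not something from the original post. */
static void fpu_copy16(void *dst, const void *src, size_t blocks)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    if (blocks == 0)
        return;                   /* the loop below would otherwise wrap around */
    __asm__ volatile (
        "1:\n\t"
        "fildq (%1)\n\t"          /* push first qword as a 64-bit integer */
        "fildq 8(%1)\n\t"         /* push second qword */
        "add $16,%1\n\t"          /* advance source pointer */
        "add $16,%0\n\t"          /* advance destination pointer */
        "fistpq -8(%0)\n\t"       /* pop and store second qword */
        "fistpq -16(%0)\n\t"      /* pop and store first qword */
        "dec %2\n\t"              /* one block done */
        "jnz 1b"
        : "+r" (dst), "+r" (src), "+r" (blocks)   /* all three are modified */
        :
        : "memory", "cc", "st", "st(1)");
#else
    memcpy(dst, src, blocks * 16); /* assumed fallback for non-x86 builds */
#endif
}
```

it would be called as fpu_copy16(videoptr, scr_buf, no_to_move). the pointers are passed by value, so the caller's copies are left alone. the copy stays exact because every 64-bit integer fits in the 64-bit mantissa of the fpu's extended format.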

now the fast (and theoretically not so accurate) version i came up with.
replace the fild and fistp with fld and fstp and ignore fp exceptions as eli
described above. the result is an 8 cycle loop - twice as fast in fact.
the disadvantage is that this is a 'lossy' form of moving data about. there
are some sequences of bytes which cause fpu errors, and these show quite
visibly if you're using it to blit to the screen for instance. my suggestion
therefore is to only use this for 24bit screen displays and to +-1 the values
that cause fpu errors so that this never happens. the result is something
that's visually indistinguishable from what you want but twice as fast (and 4
times faster than the rep movs versions). so my question really is - does
anyone know which sequences cause fpu errors so i can avoid them? :) perhaps
leath would know?
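
on the question of which sequences go wrong: going by the ieee 754 double format, the likely culprits are signalling nans, i.e. qwords whose exponent field is all ones, whose fraction is non-zero and whose top fraction bit (bit 51) is clear. with the invalid exception masked, fld sets that bit as it loads the value, so the stored copy differs by one bit. the sketch below assumes that is the only lossy case (denormals raise an exception too, but should come through intact when it is masked). is_snan_qword and nudge_qword are made-up names for illustration, and the nudge is just one possible way of doing the +-1.

```c
#include <stdint.h>

/* Returns 1 if the 8 bytes at p, read as a little-endian IEEE 754 double,
   form a signalling NaN. Hypothetical helper, not from the original post. */
static int is_snan_qword(const unsigned char *p)
{
    uint64_t bits = 0;
    int i;
    for (i = 7; i >= 0; i--)             /* assemble the little-endian qword */
        bits = (bits << 8) | p[i];

    uint64_t exponent = (bits >> 52) & 0x7FF;
    uint64_t fraction = bits & 0x000FFFFFFFFFFFFFULL;
    int quiet = (int)((bits >> 51) & 1); /* top fraction bit marks a quiet NaN */

    return exponent == 0x7FF && fraction != 0 && !quiet;
}

/* Change byte 7 by one (0x7F -> 0x7E, 0xFF -> 0xFE) so the exponent field
   is no longer all ones. Hypothetical helper; other nudges would work too. */
static void nudge_qword(unsigned char *p)
{
    if (is_snan_qword(p))
        p[7] &= 0xFE;
}
```

running nudge_qword over each aligned 8-byte group of the source buffer before the fld/fstp copy would keep these patterns out of the fpu, at the cost of an extra pass over the data.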

regards,
nik


-- 
Graham Tootell           
nikki AT gameboutique DOT com