From: nikki AT gameboutique DOT co (nikki)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Allegro perspective-correct .. (fpu memcopy)
Date: 4 Mar 1997 19:35:52 GMT
Organization: GameBoutique Ltd.
Lines: 53
Message-ID: <5fhtio$rqm@flex.uunet.pipex.com>
References:
NNTP-Posting-Host: www.gameboutique.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

>> > 2) Cause your program to ignore FP exceptions by including the
>> > following somewhere at its beginning:
>> >
>> >    #include <signal.h>
>> >    ...
>> >    signal (SIGFPE, SIG_IGN);
>>
>> i wasn't aware you could do this actually. does this perhaps mean that you
>> could use a memcopy with fld and fstp and just ignore errors like this? it
>> would be much faster than the fild fistp version obviously...

well, for the benefit of the djgpp community as a whole here's the result.

first the standard fpu memcopy which i use. this is 2 cycles faster than the
fastest i've ever seen anywhere else (the agner fog articles) and is 100%
accurate :

  asm volatile ("1:\n\t"
        "fildq (%%esi)\n\t"       // load first qword    1  NP (2,3)
        "fildq 8(%%esi)\n\t"      // load second qword   2  NP (3,4)
        "addl $16,%%esi\n\t"      // update esi          3  uv
        "addl $16,%%edi\n\t"      // update edi          3  uv
        "fistpq -8(%%edi)\n\t"    // save 2nd qword      4  NP (-9)
        "fistpq -16(%%edi)\n\t"   // save 1st qword      10 NP (-15)
        "decl %%ecx\n\t"          // dec ecx             16 uv
        "jnz 1b"                  // (loop)              16 v
        :
        : "S" (scr_buf), "D" (videoptr), "c" (no_to_move)
        : "ecx", "esi", "edi" );

as you can see, the slow part is the fistp which takes a fat 6NP :( but it
still manages 16 bytes in 16 cycles with 1/2 the normal write misses and
associated cache penalties.

now the fast (and theoretically not so accurate) version i came up with.
replace the fild and fistp with fld and fstp and set the flags as eli
described above. the result is an 8 cycle loop - twice as fast in fact. the
disadvantage is that this is a 'lossy' form of moving data about.
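
for reference, the fld/fstp variant described above might look roughly like the sketch below. this is a guess at the shape of the loop, not code from the post: the name fpu_copy_fld is made up, the operands use generic "r" registers instead of the fixed esi/edi/ecx so it also builds as 64-bit code, and there is a plain memcpy fallback for non-x86 targets. the cycle counts are not re-measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n_blocks * 16 bytes from src to dst by
 * pushing each qword through the x87 stack as a double (fld/fstp).
 * Assumption: the FPU exceptions are masked or ignored, and the data
 * contains no bit patterns that the FPU rewrites on load. */
static void fpu_copy_fld(void *dst, const void *src, size_t n_blocks)
{
    assert(n_blocks > 0);  /* dec/jnz would wrap around on a zero count */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile (
        "1:\n\t"
        "fldl  (%1)\n\t"      /* load first qword as a double   */
        "fldl  8(%1)\n\t"     /* load second qword              */
        "add   $16, %1\n\t"   /* advance source pointer         */
        "add   $16, %0\n\t"   /* advance destination pointer    */
        "fstpl -8(%0)\n\t"    /* store second qword (top of stack) */
        "fstpl -16(%0)\n\t"   /* store first qword              */
        "dec   %2\n\t"
        "jnz   1b"
        : "+r" (dst), "+r" (src), "+r" (n_blocks)
        :
        : "memory", "cc", "st", "st(1)"
    );
#else
    /* Not an x86 target: nothing FPU-specific to demonstrate. */
    memcpy(dst, src, n_blocks * 16);
#endif
}
```

the pointers and the count are declared as read-write operands because the loop modifies all three, and the "memory" clobber tells the compiler that the destination buffer changes behind its back.
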
there are some sequences of numbers which cause errors, and these show quite
visibly if you're using a blit to screen for instance.

my suggestion therefore is to only use this for 24bit screen displays and to
nudge the values that cause fpu errors by +-1 so that this never happens. the
result is something that's visually indistinguishable from what you want but
twice as fast (and 4 times faster than the rep movs versions).

so my question really is - does anyone know which sequences cause fpu errors
so i can avoid them? :) perhaps leath would know?

regards,
nik

--
Graham Tootell           nikki AT gameboutique DOT com
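
on the closing question, one hedged pointer. as far as i understand the x87 rules (worth checking against the intel manuals rather than taking on trust), the 64-bit patterns that an fld/fstp round trip as doubles actually alters are the signaling NaNs: exponent bits all ones, fraction nonzero, top fraction bit clear. with the invalid-operation exception masked they should come back with that top fraction bit set. denormals may raise a flag but should survive unchanged. the sketch below checks for such a pattern and applies a one-step nudge; both function names are invented for illustration, and the nudge assumes that taking 1 off the top byte of the qword is an acceptable pixel change.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the 64 bits, read as an IEEE 754 double, encode a
 * signaling NaN: exponent all ones, fraction nonzero, quiet bit clear. */
static int is_snan_pattern(uint64_t bits)
{
    uint64_t exponent  = (bits >> 52) & 0x7FF;
    uint64_t fraction  = bits & 0x000FFFFFFFFFFFFFULL;
    uint64_t quiet_bit = bits & (1ULL << 51);
    return exponent == 0x7FF && fraction != 0 && quiet_bit == 0;
}

/* Hypothetical fix-up: if the qword is a signaling-NaN pattern, take 1
 * off its top byte (0x7F -> 0x7E, 0xFF -> 0xFE) so the exponent field
 * is no longer all ones. Other values are returned untouched. */
static uint64_t nudge_snan_pattern(uint64_t bits)
{
    if (is_snan_pattern(bits)) {
        bits -= 1ULL << 56;  /* top byte is 0x7F or 0xFF here, so no borrow */
        assert(!is_snan_pattern(bits));
    }
    return bits;
}
```

running the source buffer through nudge_snan_pattern once, before the copy, would keep the check out of the inner loop.
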