Date: Tue, 4 Mar 1997 23:24:28 -0500 (EST)
From: Michael Phelps
To: nikki
cc: djgpp AT delorie DOT com
Subject: Re: Allegro perspective-correct .. (fpu memcopy)
In-Reply-To: <5fhtio$rqm@flex.uunet.pipex.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On 4 Mar 1997, nikki wrote:

> >> > 2) Cause your program to ignore FP exceptions by including the
> >> >    following somewhere at its beginning:
> >> >
> >> >      #include <signal.h>
> >> >      ...
> >> >      signal (SIGFPE, SIG_IGN);
> >>
> >> i wasn't aware you could do this actually. does this perhaps mean that you
> >> could use a memcopy with fld and fstp and just ignore errors like this? it
> >> would be much faster than the fild fistp version obviously...
>
> well, for the benefit of the djgpp community as a whole here's the result.
> first the standard fpu memcopy which i use. this is 2 cycles faster than the
> fastest i've ever seen anywhere else (the agner fog articles) and is 100%
> accurate :
>
> asm volatile ("1:\n\t"
>   "fildq (%%esi)\n\t"       // load first qword     1  NP (2,3)
>   "fildq 8(%%esi)\n\t"      // load second qword    2  NP (3,4)
>   "addl $16,%%esi\n\t"      // update esi           3  uv
>   "addl $16,%%edi\n\t"      // update edi           3  uv
>   "fistpq -8(%%edi)\n\t"    // save 2nd qword       4  NP (-9)
>   "fistpq -16(%%edi)\n\t"   // save 1st qword       10 NP (-15)
>   "decl %%ecx\n\t"          // dec ecx              16 uv
>   "jnz 1b"                  // (loop)               16 v
>   :
>   : "S" (scr_buf), "D" (videoptr), "c" (no_to_move)
>   : "ecx", "esi", "edi" );
>
> as you can see, the slow part is the fist which takes a fat 6NP :( but it
> still manages 16 bytes in 16 cycles with 1/2 the normal write misses and
> associated cache penalties.
>
> now the fast (and theoretically not so accurate) version i came up with.
> replace the fild and fist with fld and fst and set the flags as eli
> described above. the result is an 8 cycle loop - twice as fast in fact.
> the disadvantage is that this is a 'lossy' form of moving data about. there
> are some sequences of numbers which cause errors and these show quite visibly
> if you're using a blitz to screen for instance. my suggestion therefore is to
> only use this for 24bit screen displays and to +-1 from the values that cause
> fpu errors so that this never happens. the result is something that's visually
> indistinguishable from what you want but twice as fast. (and 4 times faster
> than the rep stos versions) so my question really is - does anyone know which
> sequences cause fpu errors so i can avoid them? :) perhaps leath would know?
>
> regards,
> nik
>
> --
> Graham Tootell
> nikki AT gameboutique DOT com

Now this is interesting. I have a program, part of which I translated into
extended asm because it was taking way too long on our workstation when
programmed in C. This is basically what it does:

1) subtract one long integer from another
2) perform a negl if the result is negative
3) check to see if the absolute value of the difference is < a given number
4) store result of comparison (0 or 1) in a given array
5) repeat step #1 with next number in sequence
6) when all numbers have been compared with that first number, repeat
   step #1 using the next number and scanning through all the rest, until
   all numbers have been compared with each other

(Actually, this is somewhat simplified, since I have taken into account the
fact that the vector comparison is commutative, and that each number matches
exactly with itself, so the actual number of comparisons I have to do is half
of the above, but it gives the idea.)

Anyway, is there a way to do this faster using one of DJGPP's FPU
instructions? If you would like more information, I will mail you a piece of
the actual code.

---Michael Phelps

morphine AT cs DOT jhu DOT edu

[ASCII-art structural diagram of the morphine molecule]
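
For reference, steps 1 to 6 above can be sketched in plain C roughly as follows. This is only an illustration, not the code from the program discussed: the function name compare_all, the row-major n-by-n output array, and the use of an unsigned difference (to avoid overflow when subtracting two longs) are all assumptions. It uses the commutativity shortcut mentioned in the message, computing each pair once and mirroring the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and layout are assumptions):
 * out[i*n + j] is set to 1 when |v[i] - v[j]| < limit, else 0.
 * Only pairs with j > i are computed; the result is mirrored. */
void compare_all(const long *v, size_t n, long limit, unsigned char *out)
{
    size_t i, j;

    assert(v != NULL && out != NULL);

    for (i = 0; i < n; i++) {
        /* A value differs from itself by 0, which is < limit
         * only when limit is positive. */
        out[i * n + i] = (unsigned char)(limit > 0);

        for (j = i + 1; j < n; j++) {
            unsigned long a = (unsigned long)v[i];
            unsigned long b = (unsigned long)v[j];
            /* Absolute difference without signed overflow:
             * subtract the smaller value from the larger one. */
            unsigned long diff = (v[i] >= v[j]) ? a - b : b - a;
            unsigned char hit =
                (unsigned char)(limit > 0 && diff < (unsigned long)limit);

            out[i * n + j] = hit;   /* step 4: store 0 or 1 */
            out[j * n + i] = hit;   /* comparison is commutative */
        }
    }
}
```

With v = {10, 12, 20, 11} and limit = 2, only the pairs (10, 11) and (12, 11) are flagged, plus the diagonal. Whether an FPU version would beat this integer loop is not something this sketch tries to answer.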