Mail Archives: djgpp/1997/01/20/18:50:19

From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199701202327.JAA04787@solwarra.gbrmpa.gov.au>
Subject: Re: floating point is... fast???
To: gpt20 AT thor DOT cam DOT ac DOT uk (G.P. Tootell)
Date: Tue, 21 Jan 1997 09:27:25 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <5bvjeb$mji@lyra.csx.cam.ac.uk> from "G.P. Tootell" at Jan 20, 97 11:03:39 am

> well i dug the big book of cycles out today. this is what it says..
> 		fdiv		fmul		idiv	imul	div	mul
> 486(7)		8-89		11-27		43/44	42	40	42
> pentium		39-42		1-7		22-46	10/11	17-41	11

Your book disagrees with the information provided by Intel in their
programmers reference manual. Go to http://www.x86.org and get Acrobat
reader, and then the PDF file from Intel (it has a link): 241430_4.pdf.
This has everything you need. I recall 3 cycles for an fmul and 10
cycles for an fimul...
 
> can anyone confirm those values? just on the offchance there's a mistake in my
> book. now it strikes me that rather than do the expensive operation

I would say it's wrong. It gives the impression you can do a simple fmul
in one clock, which isn't true. If you have:
	flds	_x0;
	fmuls	_x1;
	fstps	_result;
Then this will take (1 + 3 + 3) 7 cycles. However, if you have something
like:
	flds	_x0;		// 1
	fmuls	_x1;		// 2 - 4
	flds	_y0;		// 3
	fmuls	_y1;		// 4 - 6
	flds	_z0;		// 5
	fmuls	_z1;		// 6 - 8
	fxch	%st(2);	// free
	faddp	%st(1);	// 7 - 9
	faddp	%st(1);	// 10 - 12
	fstps	_result;	// 12 - 15
Because you are overlapping fmuls in this dot product routine, the fmul
comes out at one cycle apiece... (the fstp normally takes 2 cycles, but
has a 1 cycle latency when using the result of the previous operation).
I haven't seen an fmul take 7 cycles, but that may be what it takes on
80bit ops.
Note: I do fld, fmul etc. with an s, not a d, because I store things as
floats - the conversion time between the 32bit float and the 80bit full
precision is zero, so it makes no difference. You just have less accuracy.

> float a,b,c,d,x,y;
> c=x/b;
> d=y/b;

> a=1/b;
> c=x*a;
> d=y*a;

Definitely - the savings are huge...
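To make the trick quoted above concrete, here is a hedged C sketch (function names are made up for illustration). The first form costs two fdivs (~39 cycles each on a Pentium per the figures above); the second costs one fdiv plus two ~3 cycle fmuls, at the price of a possible last-bit difference because 1/b gets rounded before the multiplies:

```c
/* Straightforward: two divisions by the same b. */
void scale_div(float x, float y, float b, float *c, float *d)
{
    *c = x / b;
    *d = y / b;
}

/* Reciprocal trick: one fdiv, then cheap fmuls. Results may differ
   from scale_div in the last bit, since 1.0f/b is rounded first. */
void scale_recip(float x, float y, float b, float *c, float *d)
{
    float a = 1.0f / b;
    *c = x * a;
    *d = y * a;
}
```

The more values you divide by the same b, the bigger the win.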
 
> which would save a whole load of cycles, particularly on a pentium.
> in fact, if i were doing the operations with signed longs instead...

> signed long a,b,c,d,x,y;
 
> i would be better writing - (and changing a to a float)
 
> a=1.0/b;	(because fdiv is still faster than idiv in most cases)
> c=(float)x*a;
> d=(float)y*a;

You are probably better off using full floating point math until the
very end, where you can store the result as a float. Loading/storing
ints is very expensive and should be avoided where possible. I think
an int->float or float->int conversion takes about 14 cycles each
time...
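As a sketch of that advice (the function name is hypothetical): convert the longs to float once, do everything in float, and only go back to int - if you need to at all - at the very end, so you pay the conversion cost once per value instead of per operation:

```c
/* Convert each long operand to float exactly once, then stay in
   floating point; no int stores in the middle of the calculation. */
float scaled_sum(long x, long y, float b)
{
    float a = 1.0f / b;                  /* one fdiv */
    return (float)x * a + (float)y * a;  /* fmuls + fadd from here on */
}
```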
 
> ie. to change the integers into floating wherever possible to make use of the
> fmul timings, which outstrip every other timing even in worst case!

Yep... :)
 
> so there must be a catch somewhere of course ;)

No, FPU stuff is fast on Intel chips now (Pentium onwards). It took a
while, but they finally caught up to Motorola.

> perhaps the changing from float->int and vice versa takes a lot of time?
> anyone?

Yep... :)  Use floating point - if you're programming for Pentium onwards, anyway.
Remember, if someone says otherwise, just say one word: Quake.

Leathal.
