Message-ID: <329E2E2B.3D02@gbrmpa.gov.au>
Date: Fri, 29 Nov 1996 08:44:00 +0800
From: Leath Muller
Reply-To: leathm AT gbrmpa DOT gov DOT au
Organization: Great Barrier Reef Marine Park Authority
MIME-Version: 1.0
To: Elliott Oti
CC: djgpp AT delorie DOT com
Subject: Re: Optimization
References: <57hg9b$or5 AT kannews DOT ca DOT newbridge DOT com> <329C4CD4 DOT 7474 AT cornell DOT edu> <329C62F6 DOT 23F6 AT stud DOT warande DOT ruu DOT nl>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

> > What makes you say that? I can't see how this would make it faster...
> > more cache misses, and an extra shift to index non-byte sized quantities.
> > Not to mention the fact that there are more byte sized registers.
> I believe in 32-bit protected mode most dword register ops are faster
> than the equivalent 16-bit ones on a 486 and above. Certainly on a P6
> 16-bit instructions are disproportionately slow.
> In any case I haven't seen djgpp generate any optimizations which utilise
> the byte registers; AFAIK it uses them only in straightforward byte ops.

On the Pentium, the following rules decide which type of instructions
to use:

  i)  If you are running your code in 32-bit protected mode, use 32-bit
      and 8-bit data and registers, and avoid 16-bit ones.
  ii) If you're running in 16-bit protected/real mode, avoid 32-bit
      registers.

It's all in the Pentium programmer's manual. Go to http://www.x86.com/
and have a look around there...

> > > did you actually profile your code to see where the bottlenecks are?
> > Yes. I know exactly where I need to improve.
> I have no idea how good your C coding skills are, so don't be offended,
> but careful C code can speed up a sloppy implementation by ~ 100%:
> on the other hand, there are limits.
> Check your algorithm to see what basic operations are being used
> (specifically multiplies, divides, sqrts etc) and check how many
> operations are duplicated in such a way that they can be removed with
> a little recoding -
> e.g   a1 = b1/(x*y);            c = x*y;
>       a2 = b2/(x*y);    ===>    a1 = b1/c; a2 = b2/c etc.
>       a3 = b3/(x*y);
> Simplistic, but you get the point.

Actually, this is even faster if you do:

    c = 1 / (x * y);
    a1 = b1 * c;
    a2 = b2 * c;
    a3 = b3 * c;

A normal double divide takes 39 cycles; a mul takes 3 cycles. Using your
method, you have 3 divides (117 cycles) and one mul (3 cycles), for 120
cycles. Using the second method, you have one divide (39 cycles) and four
muls (12 cycles, counting the x * y), or 51... :)

Leathal.
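
The reciprocal trick above works for any number of values, not just three. Here is a minimal sketch in C; the function name scale_by_reciprocal and its parameters are made up for illustration and are not from the thread. One caveat: b * (1 / d) rounds twice, so it can differ from b / d in the last bit, which is why a compiler will not normally make this rewrite for you unless you relax floating-point rules (e.g. GCC's -ffast-math).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes a[i] = b[i] / (x * y) for n values,
   using one divide in total instead of one divide per element. */
void scale_by_reciprocal(const double *b, double *a, size_t n,
                         double x, double y)
{
    size_t i;
    double d = x * y;   /* shared denominator, computed once */
    double c;

    assert(d != 0.0);   /* the reciprocal is undefined for zero */
    c = 1.0 / d;        /* the only divide in the routine */

    for (i = 0; i < n; i++)
        a[i] = b[i] * c;    /* one multiply per element */
}
```

Calling it with b = {8, 16, 24}, x = 2 and y = 4 gives a = {1, 2, 3}; these happen to be exact because 1/8 is representable in binary floating point.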