Date: Tue, 10 Jan 95 15:42 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: THE MASKED PROGRAMMER
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
References: <0098A422 DOT 5A731270 DOT 3 AT bsa DOT bristol DOT ac DOT uk>

>>>>> "badcoe" == THE MASKED PROGRAMMER writes:

badcoe> So how's it done ? How do I find the real-address of a
badcoe> protected-mode address ?

As DJ pointed out, neither your problem nor your solution has anything
to do with real mode.

badcoe> It's the addition of the offsets to the base address. If
badcoe> the lower part of the base address is 0x0000 then I can
badcoe> simply load the offset into (say) dx (or dh and dl
badcoe> depending whether they're already together) with the two
badcoe> high-bytes already in the top of edx.

I hand-coded the inner loop of one of our time-critical routines in
gcc inline asm, using a smaller version of the same trick. The loop
maps a sequence of input bytes to output bytes through a lookup table.
The loop is unrolled and operates on eight bytes per iteration (so I
can reorder stuff in such a way as to avoid Pentium cache bank
conflicts). For example, %edx holds a pointer to the 256-byte aligned
lookup table, and by replacing %dl with the input byte, %edx becomes a
pointer to the output byte value.

I had to keep a few things in mind when writing this code:

1) Pentium pairability (a complicated issue; unrolling loops and
   hand-scheduling code can be a big win). gcc is totally clueless
   about the Pentium.

2) AGI stalls (don't dereference an address right after you compute
   it; this is even trickier on the superscalar Pentium, where you may
   need to have several instructions between the computation and the
   dereference). gcc doesn't seem to care about these either.

3) Computing an address involving an index register takes an extra
   cycle on the i486.

4) i486 I-cache prefetch stalls (don't do four memory refs in a row,
   and align branch targets % 16 bytes when possible).
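
The %edx/%dl trick described above can be sketched in plain C rather
than inline asm. This is only an illustration, not the routine
mentioned in this message: the names xlat_table, init_table, and
map_bytes are hypothetical, the bitwise-NOT mapping is arbitrary, and
the loop is left rolled and unscheduled for clarity. It assumes a C11
compiler and a flat address space where converting an integer back to
a pointer behaves as expected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table. The 256-byte alignment guarantees that
   the low byte of its address is zero. */
static _Alignas(256) uint8_t xlat_table[256];

/* Hypothetical helper: fill the table with an arbitrary mapping
   (here, each byte maps to its bitwise complement). */
static void init_table(void)
{
    for (int i = 0; i < 256; i++)
        xlat_table[i] = (uint8_t)~i;
}

/* Map n input bytes to output bytes through the table. */
static void map_bytes(const uint8_t *in, uint8_t *out, size_t n)
{
    uintptr_t base = (uintptr_t)xlat_table;

    /* The trick is only valid if the low address byte is zero. */
    assert((base & 0xFF) == 0);

    for (size_t i = 0; i < n; i++) {
        /* C equivalent of loading the input byte into %dl while %edx
           holds the table base: OR replaces the (zero) low byte, so
           no add or index register is needed. */
        const uint8_t *p = (const uint8_t *)(base | in[i]);
        out[i] = *p;
    }
}
```

Whether a compiler turns the OR into a single byte-register load is
not guaranteed; the sketch only shows why the alignment makes the
address computation trivial.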

badcoe> I suspect that it wouldn't be fast enough even so. I'm
badcoe> only following this thread for completeness (my best
badcoe> attempt so far is ~10-fold too slow and that's after
badcoe> trying about 10 diff asm strategies).

If that's the case, then perhaps your overall approach to whatever
problem you are trying to solve needs work? In most cases where I've
seen people spend extensive effort optimizing assembly code, they are
not using the best algorithm. For example, if you are just looping
over your array, then there are vastly better ways to code whatever it
is you are trying to do. Or you may be spending time somewhere other
than where you are devoting your attention.

>> Have you actually looked at the code that "gcc -O3" produces
>> for foo[x][y] when the dimensions are 256x256? It might
>> already be optimal - gcc is a real good optimizer.

badcoe> Hmmm, is it really that good ? I'll be very impressed if
badcoe> it is.

gcc is decent, but it's easy to thump on it with hand-coded assembly,
especially for the Pentium. (Even so, I strongly believe that
programming in assembly is usually inappropriate, often stupid!)
Cygnus plans to improve gcc in the near future, but gcc cannot make
the optimization you suggest, which would require gcc to generate code
that checks whether the lookup table is aligned % 64K. Heck, gcc still
generates word moves from addresses known to be long-aligned, which is
slower, bigger, and less Pentium pairable than the corresponding long
move.

It's more plausible that gcc could make an optimization where it
replaces the low two bytes of a zeroed register to create an _index_
into that array, but it doesn't (not gcc 2.5.8, anyway).

-Mat
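
To make the 64K-alignment point above concrete, here is a rough C
sketch of the run-time check that the pointer form of the trick would
need for a 256x256 byte array. The names alloc_grid, is_64k_aligned,
and elem_ptr are hypothetical, and the sketch assumes C11
aligned_alloc and a flat address space; it is not what gcc or any
particular program does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a 256x256 byte grid on a 64K
   boundary so the low two address bytes are zero. */
static uint8_t *alloc_grid(void)
{
    return aligned_alloc(65536, 65536);
}

/* The check that would have to run before using the trick. */
static int is_64k_aligned(const void *p)
{
    return ((uintptr_t)p & 0xFFFF) == 0;
}

/* Address of foo[x][y] for a 256x256 byte array stored row-major. */
static uint8_t *elem_ptr(uint8_t *foo, uint8_t x, uint8_t y)
{
    if (!is_64k_aligned(foo)) {
        /* General case: ordinary shift-and-add indexing. */
        return foo + ((size_t)x << 8) + y;
    }
    /* Aligned case: x and y simply replace the two zero low bytes,
       like loading %dh and %dl with the base already in %edx. */
    return (uint8_t *)((uintptr_t)foo | ((uintptr_t)x << 8) | y);
}
```

Both branches yield the same address; the aligned branch just avoids
the additions, which is the saving being discussed in this thread.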