Date: Tue, 10 Jan 95 15:42 MST
From: mat AT ardi DOT com (Mat Hostetter)
To: THE MASKED PROGRAMMER <badcoe AT bsa DOT bris DOT ac DOT uk>
Cc: djgpp AT sun DOT soe DOT clarkson DOT edu
References: <0098A422 DOT 5A731270 DOT 3 AT bsa DOT bristol DOT ac DOT uk>

>>>>> "badcoe" == THE MASKED PROGRAMMER <badcoe AT bsa DOT bris DOT ac DOT uk> writes:

    badcoe> So how's it done ?  How do I find the real-address of a
    badcoe> protected-mode address ?

As DJ pointed out, neither your problem nor your solution has anything
to do with real mode.

    badcoe> It's the addition of the offsets to the base address.  If
    badcoe> the lower part of the base address is 0x0000 then I can
    badcoe> simply load the offset into (say) dx (or dh and dl
    badcoe> depending whether they're already together) with the two
    badcoe> high-bytes already in the top of edx.

I hand-coded the inner loop of one of our time-critical routines in
gcc inline asm, using a smaller version of the same trick.  The loop
maps a sequence of input bytes to output bytes through a lookup table.
It is unrolled and operates on eight bytes per iteration (so I can
reorder instructions to avoid Pentium cache bank conflicts).  For
example, %edx holds a pointer to the 256-byte aligned lookup table;
replacing %dl with the input byte turns %edx into a pointer to the
output byte value.  I had to keep a few things in mind when writing
this code:

1) Pentium pairability (a complicated issue; unrolling loops and
   hand-scheduling code can be a big win).  gcc is totally clueless
   about the Pentium.
2) AGI stalls (don't dereference an address right after you compute it;
   this is even trickier on the superscalar Pentium where you may need
   to have several instructions between the computation and the
   dereference).  gcc doesn't seem to care about these either.
3) Computing an address involving an index register takes an extra cycle
   on the i486.
4) i486 I-cache prefetch stalls (don't do four memory refs in a row,
   and align branch targets % 16 bytes when possible).
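
For illustration only, here is a rough C sketch of the %dl trick
described above.  It is not the actual inline asm, and it does no
unrolling or scheduling.  The names lut and map_bytes are made up, and
the sketch assumes a C11 compiler and a flat address space in which
converting between pointers and uintptr_t preserves the low address
bits (true for gcc on the i386, but implementation-defined in general).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table.  It must be 256-byte aligned so that the
   low byte of its address is zero. */
static _Alignas(256) uint8_t lut[256];

/* Map n input bytes to output bytes through lut. */
static void map_bytes(const uint8_t *in, uint8_t *out, size_t n)
{
    uintptr_t base = (uintptr_t)lut;

    /* The trick only works if the low 8 address bits are clear. */
    assert((base & 0xFF) == 0);

    for (size_t i = 0; i < n; i++) {
        /* OR-ing the input byte into the low 8 bits plays the role of
           loading the input byte into %dl while %edx holds the table
           address: the result points at lut[in[i]]. */
        const uint8_t *p = (const uint8_t *)(base | in[i]);
        out[i] = *p;
    }
}
```

In C this buys nothing over writing lut[in[i]]; the point is only to
show why the alignment lets a byte move stand in for an add.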

    badcoe> I suspect that it wouldn't be fast enough even so.  I'm
    badcoe> only following this thread for completeness (my best
    badcoe> attempt so far is ~10-fold too slow and that's after
    badcoe> trying about 10 diff asm strategies).

If that's the case, then perhaps your overall approach to whatever
problem you are trying to solve needs work?  In most cases where I've
seen people spend extensive effort optimizing assembly code they are
not using the best algorithm.  For example, if you are just looping
over your array then there are vastly better ways to code whatever it
is you are trying to do.  Or you may be spending time somewhere other
than where you are devoting your attention.

    >> Have you actually looked at the code that "gcc -O3" produces
    >> for foo[x][y] when the dimensions are 256x256?  It might
    >> already be optimal - gcc is a real good optimizer.

    badcoe> Hmmm, is it really that good ?  I'll be very impressed if
    badcoe> it is.

gcc is decent, but it's easy to beat it with hand-coded assembly,
especially on the Pentium (even so, I strongly believe that programming
in assembly is usually inappropriate, often stupid!).  Cygnus plans to
improve gcc in the near future, but gcc cannot make the optimization
you suggest (which would require gcc to check whether the lookup table
is aligned % 64K).  Heck, gcc still generates word moves from addresses
known to be long-aligned, which is slower, bigger, and less Pentium
pairable than the corresponding long move.  It's more plausible that
gcc could make an optimization where it replaces the low two bytes of
a zeroed register to create an _index_ into that array, but it doesn't
(not gcc 2.5.8, anyway).
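
To make the index version concrete, here is a small C sketch, assuming
a 256x256 array of bytes.  The names foo, index_of and load are
hypothetical.  With x in the second-lowest byte and y in the lowest
byte of an otherwise zero value, the value is exactly the byte offset
of foo[x][y] from the start of the array, so no multiply or add is
needed to form it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256x256 byte array, 64K in total. */
static uint8_t foo[256][256];

static_assert(sizeof foo == 65536, "foo must span exactly 64K");

/* Offset of foo[x][y] from the start of foo: x goes in bits 8..15 and
   y in bits 0..7, which is what writing the two low bytes of a zeroed
   register would produce. */
static size_t index_of(uint8_t x, uint8_t y)
{
    return ((size_t)x << 8) | y;
}

/* Read foo[x][y] through the flat byte index. */
static uint8_t load(uint8_t x, uint8_t y)
{
    const unsigned char *flat = (const unsigned char *)foo;
    return flat[index_of(x, y)];
}
```

The pointer variant needs the array itself to be 64K-aligned; this
index variant does not, which is why it is the more plausible one for
a compiler to do.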

-Mat