Mail Archives: djgpp/1995/01/13/13:55:43
Hi,
I asked a rather confused question about data alignment in protected mode
(PM) and real mode (RM) and got some useful replies. Some people expressed
an interest in the exact situation, so I'm supplying it here. This is the
code that I wanted to make run faster by aligning the 256x256 table on a
64K boundary. It's for applying 'semi-sophisticated' shading to coloured VGA
images in real time (but it's too slow, as I always thought it would be). In
this case the data at %esi (from) contains 16-bit words where the first byte
is the colour and the second selects the shade table to process it with:
#ifndef _TOSCRTRN_INL_
#define _TOSCRTRN_INL_
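static inline void toscrtrn(const void *from, void *to,
                            int length, const unsigned long time,
                            const void *table, const void *shades) {
    /* time and shades are not used by this version. */
    /* length is the loop count: each pass handles four pixels. */
    asm volatile (
        ".align 4, 0x90\n"
        "toscrtrn_%=_4:\n\t"
        "xor %%eax, %%eax\n\t"          /* lodsw only loads ax, so clear the top half */
        "lodsw\n\t"                     /* al = colour, ah = shade table */
        "movb (%%ebx,%%eax), %%dl\n\t"  /* pixel 0 */
        "lodsw\n\t"
        "movb (%%ebx,%%eax), %%dh\n\t"  /* pixel 1 */
        "shl $0x10, %%edx\n\t"
        "lodsw\n\t"
        "movb (%%ebx,%%eax), %%dl\n\t"  /* pixel 2 */
        "lodsw\n\t"
        "movb (%%ebx,%%eax), %%dh\n\t"  /* pixel 3 */
        "movl %%edx, %%eax\n\t"
        "ror $0x10, %%eax\n\t"          /* swap halves so pixel 0 is the low byte */
        "stosl\n\t"                     /* write to screen */
        "loop toscrtrn_%=_4"            /* repeat ecx times */
        : "+c" (length),                /* ecx, esi and edi are changed by the */
          "+S" (from),                  /* loop, so they are read-write operands */
          "+D" (to)                     /* rather than inputs that are also clobbered */
        : "b" (table)
        : "eax", "edx", "cc", "memory");
}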
#endif // _TOSCRTRN_INL_
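
For anyone who would rather not trace the assembler, the loop above can be paraphrased in plain C. This is only a sketch of what the asm appears to compute, not a replacement for it: the name shade_pixels_ref is made up for illustration, and the count is taken in groups of four pixels because that is what ecx counts in the asm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version of the asm loop (not the original code).
 * from:   byte pairs, colour first, then the shade-table number
 * to:     one output byte per pair
 * groups: number of 4-pixel groups, i.e. what ecx holds in the asm
 * table:  256 shade tables of 256 bytes each, laid end to end */
static void shade_pixels_ref(const uint8_t *from, uint8_t *to,
                             size_t groups, const uint8_t *table)
{
    size_t i;
    for (i = 0; i < groups * 4; i++) {
        unsigned colour = from[2 * i];      /* low byte of the word lodsw reads */
        unsigned shade  = from[2 * i + 1];  /* high byte of that word */
        to[i] = table[(shade << 8) | colour];
    }
}
```

A version like this is mainly useful for checking the output of the asm against something easier to read.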
Several people say that it is not, as I originally thought, a matter of the
difference between PM and RM but rather a loader feature which means that it
will not align data beyond about 512K boundaries.
I've now found a way to align the data: declare the array at twice the size
I need and start the table at the 64K boundary that must lie within it
(sandmann AT new-orleans DOT NeoSoft DOT com and I both had this idea). The
speed-up is marginal, though. With 64K alignment the code reads:
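/* The operands are read-write here, so length, from, to and table must be
   plain (non-const) local copies; table is no longer valid afterwards. */
asm volatile (
    "toscrtrn_%=_4:\n\t"
    "lodsw\n\t"                 /* low word of eax = offset, high word = table base */
    "movb (%%eax), %%dl\n\t"    /* This is the operation that I thought would */
                                /* be faster than previously */
    "lodsw\n\t"
    "movb (%%eax), %%dh\n\t"
    "shl $0x10, %%edx\n\t"
    "lodsw\n\t"
    "movb (%%eax), %%dl\n\t"
    "lodsw\n\t"
    "movb (%%eax), %%dh\n\t"
    "xchg %%edx, %%eax\n\t"     /* eax = the four pixels, edx = table address */
    "ror $0x10, %%eax\n\t"
    "stosl\n\t"                 /* write to screen */
    "xchg %%edx, %%eax\n\t"     /* put the table address back in eax */
    "loop toscrtrn_%=_4"        /* repeat ecx times */
    : "+a" (table),             /* low 16 bits are overwritten by lodsw */
      "+c" (length),
      "+S" (from),
      "+D" (to)
    : /* no pure inputs */
    : "edx", "cc", "memory");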
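
The alignment trick described before this code can be sketched in C roughly as follows. The names table_storage, aligned_table and entry_address are invented for the example, and the sketch assumes pointers convert to uintptr_t and back in the usual flat way. The second function shows why the alignment matters: once the low 16 bits of the base are zero, dropping a 16-bit offset into them is the same as adding it, which is what lets lodsw build the address directly in eax.

```c
#include <stdint.h>

#define TABLE_SIZE 0x10000UL   /* 256 tables x 256 entries = 64K */

/* Twice the size needed, so a 64K boundary must lie somewhere inside. */
static uint8_t table_storage[2 * TABLE_SIZE];

/* Return the first 64K-aligned address inside table_storage. */
static uint8_t *aligned_table(void)
{
    uintptr_t p = (uintptr_t) table_storage;
    p = (p + TABLE_SIZE - 1) & ~(uintptr_t) (TABLE_SIZE - 1);
    return (uint8_t *) p;
}

/* For a 64K-aligned base, OR-ing in a 16-bit offset equals adding it. */
static uint8_t *entry_address(uint8_t *base, uint16_t offset)
{
    return (uint8_t *) ((uintptr_t) base | offset);
}
```

The cost is 64K of wasted memory, which is the price of not relying on the loader for the alignment.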
The only (possibly slight) improvement on that loop that I have thought of
is to read two offsets at once with a lodsl, but the word swapping this
would involve would probably outweigh the speed-up.
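
In C terms the lodsl idea would look something like the sketch below. It is not code from the original routine: shade_pixels_pairs is a hypothetical name, and the dword is assembled from bytes so that the example does not depend on the host's byte order. The shift needed to get at the second offset is the 'word swapping' cost mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant: fetch two 16-bit offsets with one 32-bit read
 * (what lodsl would do on the 386), then split them apart. */
static void shade_pixels_pairs(const uint8_t *from, uint8_t *to,
                               size_t pairs, const uint8_t *table)
{
    size_t i;
    for (i = 0; i < pairs; i++) {
        /* Little-endian dword, built byte by byte */
        uint32_t both = ((uint32_t) from[4 * i])
                      | ((uint32_t) from[4 * i + 1] << 8)
                      | ((uint32_t) from[4 * i + 2] << 16)
                      | ((uint32_t) from[4 * i + 3] << 24);
        to[2 * i]     = table[both & 0xFFFFu];  /* first offset: low word */
        to[2 * i + 1] = table[both >> 16];      /* second offset: extra shift */
    }
}
```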
Anyway, I guess I'll just have to re-think the approach and process only
parts of the screen at a time (or just do the whole thing more slowly).
(The current timing is about 0.07 of a second for 320x200 bytes on a 40MHz 386.)
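
To put that timing in perspective, the cycle budget per output byte can be worked out with a few lines of integer arithmetic. The helper name cycles_per_byte_x100 is made up, and the figures assume the timing is roughly 70 ms and the clock exactly 40 MHz.

```c
/* Rough cycle budget implied by a timing, in hundredths of a cycle per
 * byte. millis is the elapsed time in ms, khz the clock in kHz
 * (cycles per ms), bytes the number of bytes written. */
static unsigned long cycles_per_byte_x100(unsigned long millis,
                                          unsigned long khz,
                                          unsigned long bytes)
{
    unsigned long cycles = millis * khz;   /* ms * cycles per ms */
    return cycles * 100UL / bytes;
}
```

Called as cycles_per_byte_x100(70, 40000, 320UL * 200UL) this gives 4375, i.e. about 43.75 cycles per byte written, or roughly 14 full screens per second.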
Thanks to those who helped,
Badders