Mail Archives: djgpp/1997/05/05/20:14:53

From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199705052359.JAA19327@solwarra.gbrmpa.gov.au>
Subject: Re: Alignment
To: wapex AT silesia DOT top DOT pl (Michal)
Date: Tue, 6 May 1997 09:59:41 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <336DCFDB.7C54@silesia.top.pl> from "Michal" at May 5, 97 02:17:31 pm

> You're saying that if I have my FPU in double precision mode and execute
> for example -fmul %st(1)- the FPU is switched into single precision and
> after executing fmul back to double precision? If that is right, why do we
> have different precisions, when all operations are really single?

No. Generally all operands, be they single or double, are converted to
extended (80-bit) precision when loaded, all the math is performed at that
precision, and the results are converted back to single or double when they
are stored - the main reason for this is speed (AFAIK). If you want
increased speed and don't require the increased precision, you can switch
the FPU into a lower precision mode via the precision-control bits of its
control word.
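
A minimal sketch of switching the precision mode, assuming GCC-style inline
asm (which DJGPP supports). The helper and macro names are made up for
illustration; the precision-control field itself lives in bits 8-9 of the
x87 control word.

```c
#include <stdint.h>

/* x87 control word: bits 8-9 are the precision-control (PC) field. */
#define FPU_PC_MASK     0x0300u
#define FPU_PC_SINGLE   0x0000u  /* 24-bit significand */
#define FPU_PC_DOUBLE   0x0200u  /* 53-bit significand */
#define FPU_PC_EXTENDED 0x0300u  /* 64-bit significand */

/* Return cw with its PC field replaced by pc; all other bits untouched. */
uint16_t fpu_cw_with_precision(uint16_t cw, uint16_t pc)
{
    return (uint16_t)((cw & ~FPU_PC_MASK) | (pc & FPU_PC_MASK));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Switch the live FPU to the given precision; returns the old control word. */
uint16_t fpu_set_precision(uint16_t pc)
{
    uint16_t old_cw, new_cw;
    __asm__ __volatile__ ("fnstcw %0" : "=m" (old_cw));
    new_cw = fpu_cw_with_precision(old_cw, pc);
    __asm__ __volatile__ ("fldcw %0" : : "m" (new_cw));
    return old_cw;
}
#endif
```

Calling fpu_set_precision(FPU_PC_SINGLE) before the inner loop and passing
the returned word back in afterwards restores the previous mode, since only
the PC bits of the argument are used.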

> > I also get the impression then that you're texturing 8 pixels, lighting 8
> > pixels, texturing 8 pixels, lighting... etc ... Basically, this is _really_
> > bad for cache coherency - you're better off texturing the complete scanline
> > and then lighting the complete scanline. 

> No I'm doing it at the same time. 

Ouch - run out of registers by any chance?  :)

> I don't think that it would be faster. You would need a buffer to store
> 1/z for every 8 pixels, unless you're dividing it once more. And some
> pixels of the scanline could be out of cache when they are written for
> the second time. The only good side that I see is more registers for
> both texturing and lighting, but you need more instructions: writing
> to the 1/z buffer, a second address calculation, a second loop and stuff
> like that. 

Aaah, you're using a z-buffer - an oversight on my behalf... I have yet to
see an extremely fast z-buffer in software and I don't use a z-buffer; I
use an AET (active edge table). (Well, did - I did this stuff for someone
else and now I use OpenGL.)

> > If you're wondering, I had my perspective correct, sub-pixel accurate,
> > true colour, light-sourced, gouraud shaded engine running at 16 cycles
> > per pixel.

> Mine is drawing about 7.8 million pixels per second writing to the LFB
> (ViRGE) on my P120 in 8bit color. That's 15.5 cycles per pixel, but it's
> with cache misses. I've never calculated it so accurately, but I think it
> would be something like 12-13 clocks per pixel. Your result is quite
> good, I mean clocks/pixel. In 24bit color your inner loop must be slowing
> down dramatically because of cache misses: you have 3 bytes per pixel
> textures and 3 times more memory to address. I think it would be better
> to do it in 16bit color, use 1 byte per pixel textures, and organize your
> palette in such a way that the high byte of all colors (in the palette)
> would be brightness and the low byte the real color value (taken from the
> texture). Adapted to that, my inner loop in theory would have the same
> speed, but it would have more cache misses. I've never coded for 16 bit
> color, never even tried.

At 16 cycles/pixel, I had no cache misses (effectively) - I had a few points
where a wasted cycle in the V pipe occurred, so I used a dummy movl to load
the cache line, effectively eliminating the cache load later on... I also
coded for 32bit modes on the LFB which meant I was using 4 bytes per pixel,
or 8 pixels per 32-byte cache line. I looked at using hi-color modes, but
they are actually slower; it's very fast to do a true-colour lightsource
lookup using the way the x86 registers split into byte sub-registers. I had
the light source level in the ah register, the actual texel colour in al,
added the start of the light source lookup table to eax and moved the
corresponding byte.
Did you follow that? I kinda nearly got lost myself... :)
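
In C terms the lookup is roughly the following. This is only a sketch: the
table layout (256 light levels x 256 texel values, one byte each) and the
linear scaling used to fill it are assumptions for illustration, and the
function names are invented.

```c
#include <stdint.h>

#define LIGHT_LEVELS 256
#define TEXEL_VALUES 256

/* Hypothetical table: light_table[(light << 8) | texel] = shaded value. */
static uint8_t light_table[LIGHT_LEVELS * TEXEL_VALUES];

/* Fill the table with a simple linear scale (assumed, not the real curve). */
void build_light_table(void)
{
    for (unsigned light = 0; light < LIGHT_LEVELS; light++)
        for (unsigned texel = 0; texel < TEXEL_VALUES; texel++)
            light_table[(light << 8) | texel] =
                (uint8_t)((texel * light) / 255u);
}

/* Light level plays the role of ah, texel value the role of al. */
uint8_t shade(uint8_t light, uint8_t texel)
{
    unsigned index = ((unsigned)light << 8) | texel;
    return light_table[index];
}
```

In asm the index costs nothing to build, because writing ah and al already
leaves (light << 8) | texel in the low 16 bits of eax.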

> > With MMX registers, I could get it running in 9 cycles per pixel... which
> > is faster than Quake and looks a whole lot better...

> I don't have MMX, so I can't say, but I don't think it can give such a
> speed up. You can't use MMX and the FPU at the same time, so you would
> have to write a non-FPU inner loop. The whole thing about FPU code
> overlapping with CPU code would be lost. Also MMX has no div instruction
> (as far as I know) so you would have to use the CPU div. Maybe it can be
> done by saving the FPU registers in some buffer, and then loading the MMX
> registers or something like that.
> Inner loop speed is not everything, try to create a whole engine like in
> QUAKE, that's the really difficult task.

Inner speed _is_ everything. 90% of your texture mapping time will be spent
in your innermost loop... Also, because of the way I used an AET (look in
Abrash's Zen of Graphics Programming - brilliant resource) I could perform
all my texture mapping in one pass, move to MMX mode and then post-process
all the pixels performing the lighting, etc. Extremely effective. 
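
To show the shape of that two-pass idea, here is a plain C sketch (no MMX).
Everything in it is an assumption for illustration: the 64x64 one byte per
texel texture, the affine 16.16 fixed-point stepping (not perspective
correct), and the function names.

```c
#include <stddef.h>
#include <stdint.h>

#define TEX_SIZE 64  /* hypothetical 64x64 texture, 1 byte per texel */

/* Pass 1: fetch texels for a whole span, stepping u,v in 16.16 fixed point. */
void texture_span(uint8_t *dst, size_t count, const uint8_t *texture,
                  uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    for (size_t i = 0; i < count; i++) {
        unsigned tu = (u >> 16) & (TEX_SIZE - 1);
        unsigned tv = (v >> 16) & (TEX_SIZE - 1);
        dst[i] = texture[tv * TEX_SIZE + tu];
        u += du;
        v += dv;
    }
}

/* Pass 2: light the span in place. light is 16.16 fixed point and is
   assumed to stay within 0..255 across the span (gouraud interpolation). */
void light_span(uint8_t *span, size_t count, int32_t light, int32_t dlight)
{
    for (size_t i = 0; i < count; i++) {
        unsigned level = (unsigned)(light >> 16) & 0xFFu;
        span[i] = (uint8_t)((span[i] * level) / 255u);
        light += dlight;
    }
}
```

Each loop only has to keep its own interpolants in registers, which is the
point of splitting the work.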

> PS
> What about my first question 'How to align in DJGPP'. My doubles are NOT
> aligned at an 8 byte boundary like they should be.

I think someone posted the correct alignment __attribute__ specifier; check
the archives and it will turn up there...
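
For reference, a minimal sketch using GCC's aligned attribute. This is not
necessarily the exact form that was posted; the variable, struct and helper
names are made up, and it only covers globals and struct members (whether
stack variables end up aligned depends on the compiler version).

```c
#include <stddef.h>
#include <stdint.h>

/* GCC extension: request 8-byte alignment for a variable. */
double aligned_value __attribute__((aligned(8)));

/* The same attribute on a member forces its offset to a multiple of 8. */
struct vertex {
    char   flags;
    double z __attribute__((aligned(8)));
};

/* Returns 1 if p sits on an 8-byte boundary, 0 otherwise. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}
```

Checking is_aligned8(&aligned_value) at startup is a cheap way to confirm
the attribute took effect.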

> Sorry my answer is so late, but I had problems with my internet provider.

We had a public holiday here, so I had a long weekend... :)

Leathal.
