From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199712150148.LAA27759@solwarra.gbrmpa.gov.au>
Subject: Re: Slow MOV opcode
To: flemming DOT larsen AT private DOT dk (Flemming Stig Larsen)
Date: Mon, 15 Dec 1997 11:48:52 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <01bd08e7$dfe77f00$10fbffc2@fsl22.novo.dk> from "Flemming Stig Larsen" at Dec 14, 97 11:28:51 pm
Content-Type: text
Precedence: bulk

> I'm an intermediate programmer in DJGPP protected mode, and I really
> think it's cool to release such a nice C compiler 100% free!!
> Well, I'm trying to do some fast graphics using inline asm.
> But the "movb reg32, mem" opcode seems to take several clock cycles,
> when it should take only one at most!

It will take 1 cycle IFF the memory address being accessed is already
in the L1 cache, since the L1 cache runs as fast as the CPU - which it has
to. It can take longer: on a P2 the L2 cache runs at only half the CPU
speed, so an access that hits in L2 can take 2+ clocks. If you're
accessing SDRAM it will be slower still, and slower again with EDO or
older memory. Just because the manual says the instruction takes 1 clock
doesn't mean the memory, bus etc. can keep up....
 
> I made a simple program to test the speed of an inner loop, and got
> these interesting results (on a Pentium 200 MHz):

> This unusable loop:
>    __asm__ __volatile__ (
>    "1:movb %%al, (%%edi)
>       incl %%edi
>       incl %%eax
>       decl %%ecx
>       jnz 1b"
>       :  : "ecx" (60000), "D" (video_buffer) : "ecx", "eax", "edi" );
 
> seemed to take about 19 - 20 clocks per iteration ! ouch!!

You will get cache misses (where a 32-byte line has to be loaded into your
cache) when you access memory, and your VRAM is most likely very slow,
which would explain the 19-20 clocks per move. Have you tried the same code
writing to system memory? It would most likely be faster...
 
> while this:     
>    __asm__ __volatile__ (
>    "1:incl %%edi
>       incl %%eax
>       decl %%ecx
>       jnz 1b"
>       :  : "ecx" (60000), "D" (video_buffer) : "ecx", "eax", "edi" );

> only took about 2 clocks per iteration !!!
>      (must be the pairing!)

The two incl instructions pair, taking 1 cycle, and the decl/jnz pair takes
another cycle. As you're not accessing any memory (you're doing nothing in
your loop but modifying registers), nothing has to wait on the memory
system and the loop can run at full speed.
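
To put the per-iteration figures in perspective, here is the arithmetic
as a small C sketch. loop_time_us is a name made up for illustration; the
60000 iterations, the 200 MHz clock and the 2 and 20 clock figures are
the ones from the quoted message.

```c
/* Time spent in a loop, in microseconds:
   iterations * clocks per iteration / clock rate in MHz.
   (A clock rate in MHz is the same number as clocks per microsecond.) */
double loop_time_us(unsigned long iterations, double clocks_per_iter,
                    double cpu_mhz)
{
    return (double)iterations * clocks_per_iter / cpu_mhz;
}
```

loop_time_us(60000, 2.0, 200.0) gives 600 microseconds for the
register-only loop, while loop_time_us(60000, 20.0, 200.0) gives 6000
microseconds for the loop that writes to video memory.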

Leathal.