Mail Archives: djgpp/1997/12/14/20:56:59

From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <>
Subject: Re: Slow MOV opcode
To: flemming DOT larsen AT private DOT dk (Flemming Stig Larsen)
Date: Mon, 15 Dec 1997 11:48:52 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <01bd08e7$dfe77f00$> from "Flemming Stig Larsen" at Dec 14, 97 11:28:51 pm

> I'm an intermediate DJGPP protected-mode programmer, and I really
> think it's cool to release such a nice C compiler 100% free!!
> Well, I'm trying to do some fast graphics using inline asm.
> But the "movb reg8, mem" opcode seems to take several clock cycles,
> when it should take only one, max!

It will take 1 cycle IFF the memory address being accessed is already
in the L1 cache, and the L1 cache is as fast as the CPU - which it has
to be. It can take longer, say on a P2, as the L2 cache runs at only half
the CPU speed - thus it can take 2+ clocks on a P2 if the data is in the
L2 cache. If you're accessing SDRAM it will be even slower, and slower
again with EDO or older memory. Just because the manual says it takes
1 clock to access a memory address doesn't mean the memory, bus, etc.
can keep up.

> I made a simple program to test the speed of an inner loop, and got
> these interesting results (on a Pentium 200 MHz):

> This unusable loop:
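>    int n = 60000;            /* illustrative local: loop count, modified by the asm */
>    void *p = video_buffer;   /* illustrative local: store pointer, modified by the asm */
>    __asm__ __volatile__ (
>    "1: movb %%al, (%%edi)\n\t"   /* store the low byte of eax */
>    "   incl %%edi\n\t"
>    "   incl %%eax\n\t"
>    "   decl %%ecx\n\t"
>    "   jnz 1b"
>       : "+c" (n), "+D" (p) : : "eax", "memory", "cc" );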
> seemed to take about 19 - 20 clocks per iteration! Ouch!!

You will have cache misses (where a 32-byte line is loaded into your
cache) when you access memory, and your VRAM is most likely very slow,
which would explain the 19-20 clocks per move. Have you tried the same
code on system memory? It would most likely be faster...
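
One rough way to try that comparison is sketched below. It is plain C
rather than inline asm, the name seconds_per_store is made up for this
example, and clock() is coarse, so the figure is only meaningful with
enough passes to run for a noticeable time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: fill len bytes at buf with an incrementing value,
   `passes` times over, and return the average time per byte store in
   seconds. The buffer is volatile so the stores are not optimised away. */
double seconds_per_store(volatile unsigned char *buf, size_t len, unsigned passes)
{
    clock_t start, stop;
    unsigned p;
    size_t i;
    unsigned char value;

    assert(buf != NULL && len > 0 && passes > 0);

    start = clock();
    for (p = 0; p < passes; p++) {
        value = 0;
        for (i = 0; i < len; i++)
            buf[i] = value++;   /* C counterpart of movb %al,(%edi) / incl */
    }
    stop = clock();

    return ((double)(stop - start) / CLOCKS_PER_SEC)
           / ((double)len * (double)passes);
}
```

Multiplying the result by the clock rate (200e6 for a P200) gives an
approximate clocks-per-store figure. Running it once on an ordinary
60000-byte array and once on a pointer to the video buffer, however you
have that mapped, should show how much of the cost is the memory itself.
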
> while this:     
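>    int n = 60000;            /* illustrative local: loop count, modified by the asm */
>    void *p = video_buffer;   /* illustrative local: pointer, modified by the asm */
>    __asm__ __volatile__ (
>    "1: incl %%edi\n\t"
>    "   incl %%eax\n\t"
>    "   decl %%ecx\n\t"
>    "   jnz 1b"
>       : "+c" (n), "+D" (p) : : "eax", "cc" );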

> only took about 2 clocks per iteration!!!
>      (must be the pairing!)

The two incl instructions pair, taking 1 cycle, and the decl/jnz pair
takes another cycle. As you're not accessing any memory (you're doing
nothing in your loop but modifying registers), the loop never has to
wait on anything and can run at full speed.
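
To put the two measurements in perspective, here is a small sketch of
the arithmetic. The function name loop_time_ms is invented for this
example; the 2 and 20 clock figures are simply the ones quoted above.

```c
#include <assert.h>

/* Hypothetical helper: estimated run time in milliseconds for a loop of
   `iterations` passes costing `clocks_per_iter` clocks each, on a CPU
   running at `mhz` MHz. */
double loop_time_ms(unsigned long iterations, double clocks_per_iter, double mhz)
{
    double total_clocks;

    assert(mhz > 0.0);
    total_clocks = (double)iterations * clocks_per_iter;
    return total_clocks / (mhz * 1000.0);   /* mhz * 1000 = clocks per ms */
}
```

With 60000 iterations at 200 MHz, loop_time_ms(60000UL, 2.0, 200.0)
comes to about 0.6 ms for the register-only loop, while
loop_time_ms(60000UL, 20.0, 200.0) comes to about 6 ms for the loop
that stores to video memory.
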


