From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199712150148.LAA27759@solwarra.gbrmpa.gov.au>
Subject: Re: Slow MOV opcode
To: flemming DOT larsen AT private DOT dk (Flemming Stig Larsen)
Date: Mon, 15 Dec 1997 11:48:52 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <01bd08e7$dfe77f00$10fbffc2@fsl22.novo.dk> from "Flemming Stig Larsen" at Dec 14, 97 11:28:51 pm
Content-Type: text
Precedence: bulk

> Im an intermediate in programming DJGPP protected mode, and I really
> think it's cool to release such a nice C-compiler 100% payfree!!
> Well, I'm trying to do some fast graphics using inline asm.
> But the "movb reg32, mem" upcode seems to take several clock cycles,
> when it has to take only one! max!

It will take 1 cycle if and only if the memory address being accessed
is already in the L1 cache, and the L1 cache is as fast as the CPU,
which it has to be. It can take longer, say on a P2, as the L2 cache
runs at only half the CPU speed, so it can take 2+ clocks on a P2 if
the data is in the L2 cache. If you're accessing SDRAM it will be
slower still, and slower again with EDO or older memory. Just because
the manual says it takes 1 clock to load a memory address doesn't mean
the memory, bus, etc. can....

> I made a simple program to test the speed of a innerloop, and got
> these interesting results (on a Pentium 200 mhz):
> This unuseable loop:
> __asm__ __volatile__ (
> "1:movb %%al, (%%edi)
> incl %%edi
> incl %%eax
> decl %%ecx
> jnz 1b"
> : : "ecx" (60000), "D" (video_buffer) : "ecx", "eax", "edi" );
> seemed to take about 19 - 20 clocks per cycle ! ouch!!

You will have cache misses (where a 32-byte block is loaded into your
cache) when you do a move from memory, and your VRAM is most likely
very slow, which would explain the 19-20 clock move. Have you tried the
same code on moves to system memory? It would most likely be faster...

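One way to try the system-memory comparison suggested above is sketched
here in plain C rather than the original inline asm. The names
fill_bytes and time_system_fill are made up for illustration, and
clock() is far too coarse to count individual cycles, so this only
gives an average over many passes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Plain C stand-in for the quoted loop: store an incrementing byte
   (wrapping at 256, like %al) to each address in turn. The volatile
   qualifier keeps the compiler from optimising the stores away. */
static void fill_bytes(volatile unsigned char *dst, size_t count)
{
    unsigned char value = 0;
    size_t i;

    for (i = 0; i < count; i++)
        dst[i] = value++;
}

/* Hypothetical helper: average seconds per pass when the target is an
   ordinary malloc'd buffer in system memory instead of VRAM.
   Returns -1.0 if the buffer cannot be allocated. */
static double time_system_fill(size_t count, int passes)
{
    unsigned char *buf;
    clock_t start, end;
    int p;

    assert(passes > 0);
    buf = malloc(count);
    if (buf == NULL)
        return -1.0;

    fill_bytes(buf, count);          /* warm-up pass to touch the buffer */

    start = clock();
    for (p = 0; p < passes; p++)
        fill_bytes(buf, count);
    end = clock();

    free(buf);
    return (double)(end - start) / CLOCKS_PER_SEC / passes;
}
```

Calling time_system_fill(60000, 1000) and dividing the result by 60000
gives a rough seconds-per-byte figure to set against the VRAM loop.
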
> while this:
> __asm__ __volatile__ (
> "1:incl %%edi
> incl %%eax
> decl %%ecx
> jnz 1b"
> : : "ecx" (60000), "D" (video_buffer) : "ecx", "eax", "edi" );
> only took about 2 clocks p. cycle !!!
> (must be the pairing!)

The first two incl instructions pair, taking 1 cycle. The decl/jnz also
pair, taking 1 cycle. As you're not accessing any memory (you're doing
nothing in your loop but modifying registers), nothing has to wait on
memory and the loop can run at full speed.

Leathal.
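
To put the quoted figures in perspective, here is a rough conversion
from clocks per iteration to wall time. cycles_to_ms is a hypothetical
helper; the 200 MHz clock, the 60000 iterations and the 2 and 19-20
clock figures all come from the quoted message.

```c
#include <assert.h>

/* Hypothetical helper: milliseconds taken by `iterations` passes of a
   loop costing `cycles_per_iter` clocks on a CPU running at `cpu_mhz`.
   One MHz is 1000 clocks per millisecond. */
static double cycles_to_ms(double cycles_per_iter, long iterations,
                           double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return cycles_per_iter * (double)iterations / (cpu_mhz * 1000.0);
}
```

cycles_to_ms(2.0, 60000, 200.0) works out to 0.6 ms for the
register-only loop, and cycles_to_ms(20.0, 60000, 200.0) to 6 ms for
the loop that writes to VRAM.
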