Mail Archives: djgpp/1998/12/30/09:05:43

From: vcarlos35 AT juno DOT com
To: djgpp AT delorie DOT com
Date: Wed, 30 Dec 1998 09:04:24 EST
Subject: Re: pairable instructions much faster than the string
operations on a Pentium and above ?!
Message-ID: <19981230.090441.5903.0.vcarlos35@juno.com>
References: <368A195D DOT F315167E AT gmx DOT de>
X-Mailer: Juno 1.49
X-Juno-Line-Breaks: 0,2-3,5-38,40-42,44,46-49
Reply-To: djgpp AT delorie DOT com

On Wed, 30 Dec 1998 13:15:25 +0100 Christian Hofrichter
<ChristianHofrichter AT gmx DOT de> writes:
>For a long time I believed that string operations (rep stosl; rep
>movsl) were the fastest way to write to memory blocks, until I heard
>that a Pentium can execute two instructions simultaneously. So I realized
>that there are better methods to move memory blocks!
>
>" rep stosl " : takes 3 clock cycles on a Pentium
>
>
>asm("1:\n\t"
>       "movl   %%eax,(%%ebx)\n\t"    /* store; pairable in U-pipe */
>       "addl   $4,%%ebx\n\t"         /* pairable in V-pipe */
>       "decl   %%ecx\n\t"            /* pairable in U-pipe */
>       "jnz    1b"                   /* pairable in V-pipe */
>       :
>       :"a"(55 /* any value */),"c"((40*1024*1024)>>2),"b"(memory)
>       :"%ecx","%ebx");
>This takes only 2 clock cycles per iteration!
>
>
>To test that, I allocated a buffer of 40 MB. First I used memset; it
>took 690000 microseconds to fill the memory block.
>Then I wrote it in assembler (just to be sure) with stosl and it took
>the same time (how surprising).
>And then I wrote the code above, and now it took only approximately
>426000 microseconds to fill the memory block!
>That is approximately the same ratio as 3 clock cycles to 2 clock
>cycles.
>
>So how about a new optimization switch in djgpp, called pairable
>instructions? After all, it can often double the speed of the
>program. It can also be used to improve graphics performance, can't it?
>

AFAIK, having a compiler automatically pair instructions (especially one
such as gcc, which runs on a wide variety of platforms) would be pretty
much an impossible task. Instruction pairing rules are complicated and
depend to a great extent on the CPU. For example, on a 6th generation
CPU, your code is not optimal because the increased register dependencies
make it difficult for the out-of-order core to extract maximum
parallelism from your code. Additionally, you have to worry about
increased aggregate opcode size and mispredicted branches.
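
If anyone wants to repeat the measurement quoted above on a different
CPU, here is a rough sketch in portable C rather than inline asm. The
names fill_words, time_fill_words and time_memset are made up for this
example, and the timing uses clock(), which is coarse. Also note that an
optimizing compiler may well turn the loop back into a memset call, so
check the generated code before drawing conclusions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: store `value` into `words` 32-bit words.
   Four independent stores per iteration give the CPU (or the
   compiler's scheduler) something to overlap. */
void fill_words(uint32_t *dst, uint32_t value, size_t words)
{
    size_t i = 0;

    assert(dst != NULL);
    for (; i + 4 <= words; i += 4) {
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    for (; i < words; i++)      /* leftover words, if any */
        dst[i] = value;
}

/* Hypothetical helper: CPU seconds used by one fill_words() call. */
double time_fill_words(uint32_t *dst, uint32_t value, size_t words)
{
    clock_t start = clock();

    fill_words(dst, value, words);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: CPU seconds used by one memset() call. */
double time_memset(void *dst, int byte, size_t bytes)
{
    clock_t start = clock();

    memset(dst, byte, bytes);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To mirror the test in the quoted message, malloc a 40 MB buffer, call
time_memset(buf, 55, 40*1024*1024) and time_fill_words(buf, 55,
(40*1024*1024) >> 2), and compare the two results over several runs.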

Karl

