Xref: news2.mv.net comp.os.msdos.djgpp:1884
From: korpela AT albert DOT ssl DOT berkeley DOT edu (Eric J. Korpela)
Newsgroups: comp.os.msdos.djgpp
Subject: Block Moves (Re: ASM code & Random)
Date: 16 Mar 1996 21:50:20 GMT
Organization: Cal Berkeley-- Space Sciences Lab
Lines: 190
Message-ID: <4ifd2s$t3u@agate.berkeley.edu>
References: <1996Mar5 DOT 164831 AT zipi DOT fi DOT upm DOT es> <4i2f8q$51p AT mack DOT rt66 DOT com> <31483E82 DOT 7EEB AT i-link DOT net>
NNTP-Posting-Host: albert.ssl.berkeley.edu
Keywords: GCC pentium optimization
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

In article <31483E82 DOT 7EEB AT i-link DOT net>, Brad Burgan wrote:
>From what I have read, that would not be a good thing. Intel has been designing
>their chips more and more towards the simple operands and have not paid much
>attention to the string operands, in order to make it easier for compilers
>to operate. I will write a test program and run on 386/486/Pent and see if
>MOVS is faster than MOV and J?, but I think that might be a little off topic, so
>if someone could email me where to send this report? It will do test in 16-bit
>and 32-bit.

With all this talk about memory copy speeds, I decided to write up a
little program on my P90 to see which memory copy algorithm was fastest.
(I use EMX under OS/2, but from what I understand this should compile
under DJGPP just fine.)

The results I get are quite surprising, and not at all what I would have
expected given Intel's document on optimization. According to Intel, the
fastest algorithm for moving memory should be the "ld ld st st" method.
(See the code at the end of this message for a look at how it works.)
In fact, using the floating point unit for the transfer was the fastest
method. According to Intel, "Moving a floating point memory to memory
should be done by integer moves instead of doing fld-fstp."
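
One plausible reason the FPU route can win is that fildq/fistpq moves
eight bytes per load/store pair where a plain movl moves four, and the
x87 register holds a 64-bit integer exactly, so nothing is lost on the
way through. The sketch below only illustrates that access pattern in
portable C; it is not part of the benchmark, copy_qwords is a
hypothetical name, and a compiler is free to turn the memcpy calls into
whatever moves it likes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the original post: copy n 32-bit words
   from src to dst, moving 8 bytes per step where possible. */
static void copy_qwords(const int32_t *src, int32_t *dst, size_t n)
{
    /* Odd word count: copy the last word alone so the rest pairs up. */
    if (n & 1) {
        n--;
        dst[n] = src[n];
    }
    /* Walk the remaining pairs from the top of the block downward. */
    for (size_t i = n / 2; i-- > 0; ) {
        uint64_t q;
        memcpy(&q, &src[2 * i], sizeof q);   /* one 8-byte load  */
        memcpy(&dst[2 * i], &q, sizeof q);   /* one 8-byte store */
    }
}
```

Going through memcpy rather than casting to uint64_t * keeps the sketch
free of alignment and aliasing assumptions.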

The other surprise was that there wasn't much difference between 64 bit
aligned and 32 bit aligned block moves. In fact the "ld ld st st" method
was faster for 32 bit aligned blocks.

The results and the code are below. I hope that someone will look at it
to make sure I did it right. I'd also like to see results from a 386,
486, and a Pentium Pro.

------------------------------------------------------------------------------
64K aligned blocks using _tmalloc() (MB/sec)

rep stosl    ld st        ld ld st st  C code       fildq fistpq
----------------------------------------------------------------
32.206119    34.782609    31.796502    26.041667    38.986355
33.500838    33.557047    32.948929    26.773762    41.067762
32.206119    32.467532    31.847134    26.007802    39.062500
33.167496    34.782609    32.679739    26.560425    40.733198
34.782609    32.520325    34.013605    27.137042    42.918455
----------------------------------------------------------------
33.2         33.6         32.6         26.5         40.8

4 byte aligned blocks using malloc() (MB/sec)

rep stosl    ld st        ld ld st st  C code       fildq fistpq
----------------------------------------------------------------
31.695721    33.500838    33.003300    26.350461    37.174721
31.250000    31.055901    32.679739    26.212320    37.243948
34.602076    33.167496    36.166365    27.932961    43.290043
32.840722    33.670034    34.246575    27.210884    39.761431
31.695721    33.500838    33.333333    26.455026    38.314176
----------------------------------------------------------------
32.4         33.0         33.9         26.8         39.2
-------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* copy1: string move ("rep stosl" column) */
inline void copy1(int *p1, int *p2, int n)
{
  asm("
    repnz
    movsl
  " : : "S" (p1), "D" (p2), "c" (n));
}

/* copy2: one load, one store per iteration ("ld st" column) */
inline void copy2(int *p1, int *p2, int n)
{
  asm("
    dec %2
    jl 1f
0:
    movl (%0,%2,4),%%ebx
    movl %%ebx,(%1,%2,4)
    dec %2
    jge 0b
1:
  " : : "r" (p1), "r" (p2), "r" (n) : "ebx");
}

/* copy3: two loads, then two stores per iteration ("ld ld st st" column) */
inline void copy3(int *p1, int *p2, int n)
{
  asm("
    test $1,%2
    jz 0f
    dec %2
    movl (%0,%2,4),%%ebx
    movl %%ebx,(%1,%2,4)
0:
    shrl %2
    dec %2
    jl 2f
1:
    movl (%0,%2,8),%%ebx
    movl 4(%0,%2,8),%%eax
    movl %%ebx,(%1,%2,8)
    movl %%eax,4(%1,%2,8)
    dec %2
    jge 1b
2:
  " : : "S" (p1), "D" (p2), "c" (n) : "ebx","eax");
}

/* copy4: the same two-words-per-iteration loop in plain C ("C code" column) */
inline void copy4(int *p1,int *p2, int n)
{
  register int i   =n;
  register int *pp1 =p1;
  register int *pp2 =p2;

  if (i & 1) {
    i--;
    pp2[i]=pp1[i];
  }
  i=(i>>1);
  while (i) {
    i--;
    pp2[i*2]=pp1[i*2];
    pp2[i*2+1]=pp1[i*2+1];
  }
}

/* copy5: 64-bit moves through the FPU ("fildq fistpq" column) */
inline void copy5(int *p1,int *p2, int n)
{
  asm("
    test $1,%2
    jz 0f
    dec %2
    movl (%0,%2,4),%%ebx
    movl %%ebx,(%1,%2,4)
0:
    shrl %2
    dec %2
    jl 2f
1:
    fildq (%0,%2,8)
    fistpq (%1,%2,8)
    dec %2
    jge 1b
2:
  " : : "S" (p1), "D" (p2), "c" (n) : "ebx");
}

int main(void)
{
  /* malloc() gives the 4 byte aligned case from the second table */
  int *p1=(int *)malloc(65536);
  int *p2=(int *)malloc(65536);
  int n,i,clock0,clock1,clock2,clock3,clock4,clock5;

  printf("%x %x\n",(int)p1,(int)p2);

  /* fill both 64K blocks; n ends up as the word count */
  for (n=0;n<(65536/sizeof(int));n++) p1[n]=p2[n]=n;

  /* each loop copies a 64K block 3200 times, 200 MB in all */
  clock0=clock();
  for (i=0;i<200*1024/64;i++) copy1(p1,p2,n);
  clock1=clock();
  for (i=0;i<200*1024/64;i++) copy2(p2,p1,n);
  clock2=clock();
  for (i=0;i<200*1024/64;i++) copy3(p1,p2,n);
  clock3=clock();
  for (i=0;i<200*1024/64;i++) copy4(p1,p2,n);
  clock4=clock();
  for (i=0;i<200*1024/64;i++) copy5(p1,p2,n);
  clock5=clock();

  printf("%f %f %f %f %f \n",1.0*CLOCKS_PER_SEC/(clock1-clock0)*200,
                             1.0*CLOCKS_PER_SEC/(clock2-clock1)*200,
                             1.0*CLOCKS_PER_SEC/(clock3-clock2)*200,
                             1.0*CLOCKS_PER_SEC/(clock4-clock3)*200,
                             1.0*CLOCKS_PER_SEC/(clock5-clock4)*200);
  return 0;
}

--
Eric Korpela                        |  An object at rest can never be
korpela AT ssl DOT berkeley DOT edu |  stopped.
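
A note on how the figures in the tables come out of the code: each timed
loop runs 200*1024/64 = 3200 copies of a 64K block, which is 200 MB, so
the rate is 200 * CLOCKS_PER_SEC / ticks. The sketch below spells that
arithmetic out. The names BLOCK_BYTES, COPIES, megabytes_moved and
mb_per_sec are hypothetical, and the guard against a zero tick count is
an addition that the posted program does not have.

```c
#include <time.h>

/* Size of one block and number of copies per timed loop, as in main(). */
#define BLOCK_BYTES 65536L
#define COPIES      (200L * 1024 / 64)   /* 3200 copies */

/* Total megabytes moved by one timed loop: 3200 * 64 KB = 200 MB. */
static double megabytes_moved(void)
{
    return (double)BLOCK_BYTES * COPIES / (1024.0 * 1024.0);
}

/* Hypothetical helper: throughput in MB/sec for a loop that took `ticks`
   clock() ticks. Returns 0.0 when the interval is too short to measure. */
static double mb_per_sec(clock_t ticks)
{
    if (ticks <= 0)
        return 0.0;
    return megabytes_moved() * (double)CLOCKS_PER_SEC / (double)ticks;
}
```

For example, a loop that takes five seconds of clock() time works out to
40 MB/sec, which is about where the fildq/fistpq column sits.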