Mail Archives: pgcc/2001/02/20/14:59:09

Date: Tue, 20 Feb 2001 21:58:44 +0200 (EET)
From: Tuukka Toivonen <tuukkat AT s-inf-pc24 DOT oulu DOT fi>
To: Nick Kurshev <nickols_k AT mail DOT ru>
cc: "pgcc AT delorie DOT com" <pgcc AT delorie DOT com>
Subject: Re: Re: Probably pgcc-2.95.2.1 does not optimized propertly?
In-Reply-To: <E14VHDU-00009a-00@mx7.port.ru>
Message-ID: <Pine.LNX.4.21.0102202141440.3407-100000@s-inf-pc24.oulu.fi>
MIME-Version: 1.0
Reply-To: pgcc AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

On Tue, 20 Feb 2001, Nick Kurshev wrote:

> Well, I did my own investigation and results say that you are wrong, please see below:

Just for your amusement, I ran some tests too, using my own timing code
(with rdtsc and rdpmc). The CPU is an 800 MHz AMD Athlon (a cool CPU, but
its documentation is not too good).

This is on Linux 2.4.0, though that doesn't really matter.

The compiler is AthlonGCC, invoked with the arguments
-O3 -fomit-frame-pointer -mathlon -mcpu=athlon -march=athlon 
-malign-functions=4 -funroll-loops -fexpensive-optimizations
-malign-double -fschedule-insns2 -mwide-multiply

> This code tested PADDB instruction
> a) non MMX version of code:
>         "movb	(%2),  %%dl\n"
>         "addb	%%dl,  (%2)\n"

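This could be done on 4 bytes in parallel using the usual 32-bit registers,
but only as long as no byte sum overflows: a carry out of one byte would
spill into the next byte, which PADDB never does.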
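
To illustrate the overflow problem, here is a minimal sketch in C. It is not
code from this thread: the function names are made up for the example, and
the masking trick in add_bytes_swar is a commonly used workaround that I am
assuming is acceptable here, not something the tests above measured.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: add four packed bytes one lane at a time, each wrapping
 * modulo 256 (the per-byte behaviour of PADDB). */
static uint32_t add_bytes_ref(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xffu;
        uint32_t y = (b >> (8 * i)) & 0xffu;
        r |= ((x + y) & 0xffu) << (8 * i);
    }
    return r;
}

/* Plain 32-bit add: only matches the reference when no byte sum
 * exceeds 255, because carries cross into the next byte. */
static uint32_t add_bytes_naive(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Carry-safe variant: add the low 7 bits of every byte (the carry can
 * reach at most bit 7 of the same byte), then fold the original top
 * bits back in with XOR so nothing leaks across a byte boundary. */
static uint32_t add_bytes_swar(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

/* Compare the carry-safe variant with the reference on a few inputs. */
static void swar_selfcheck(void)
{
    static const uint32_t v[] = {
        0x00000000u, 0x01020304u, 0x00ff00ffu, 0x80808080u, 0xffffffffu
    };
    const int n = (int)(sizeof v / sizeof v[0]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            assert(add_bytes_swar(v[i], v[j]) == add_bytes_ref(v[i], v[j]));
}
```

For example, adding 0x00010001 to 0x00ff00ff gives 0x01000100 with the plain
add (the carries land in the neighbouring bytes) but 0x00000000 with the
masked version, which is what four independent byte adds would produce.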

My thoughts:
	xor eax,eax
	mov [var],eax
is smaller (better for the code cache) but needs a free register (worse for
register pressure) than
	mov dword[var],0
so which one is better probably depends on the context.

> P.S.: All tests I did with using of my own project BIEW that can be found at http://biew.sourceforge.net.

I ran my tests using my own ugly code. It is available on request, along
with a patch against Linux 2.4.0 that enables the rdpmc instruction for
user space (the PCE bit in CR4).
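
For readers without that code, the following is a minimal sketch of how
such a cycle measurement can be set up with rdtsc. It is not the code used
for the numbers in this mail: read_tsc, min_ticks, net_ticks and empty_call
are names invented for the example, taking the minimum over several runs is
an assumption, and rdpmc is left out because it needs the kernel patch. On
non-x86 targets the sketch falls back to clock() only so that it still
compiles.

```c
#include <stdint.h>
#include <time.h>

/* Read the CPU time-stamp counter (hypothetical helper). */
static inline uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();   /* placeholder, not cycle-accurate */
#endif
}

typedef void (*bench_fn)(void);

/* Time fn several times and keep the smallest tick count, which
 * filters out interrupts and cache misses (assumed methodology). */
static uint64_t min_ticks(bench_fn fn, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_tsc();
        fn();
        uint64_t t1 = read_tsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Baseline: an empty function, so that only the call overhead and the
 * cost of the measurement itself are counted. */
static void empty_call(void)
{
}

/* Cost of fn with the empty-call baseline subtracted, clamped at 0. */
static uint64_t net_ticks(bench_fn fn, int runs)
{
    uint64_t base = min_ticks(empty_call, runs);
    uint64_t raw = min_ticks(fn, runs);
    return raw > base ? raw - base : 0;
}
```

A call such as net_ticks(some_test, 1000) would then report the cost of a
hypothetical some_test() relative to an empty call. Note that rdtsc is not a
serializing instruction, so very short sequences can overlap with the
measurement; treat single-digit results from a sketch like this as rough.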

And the results:

/* A function must save registers: EBX,ESI,EDI
 * Arguments are passed in: EAX,EDX,ECX
 */

/* Empty call: 5 clocks. This is subtracted from the benchmarks below */
void benchtest(void) {
}


/* 1 clock */
void benchtest(void) {
	asm volatile(
	"movd     %eax, %mm0\n"
	"movd     %mm0, %eax\n"
	);
}

/* 0 clocks. Perfect parallelism! */
void benchtest(void) {
	asm volatile(
	"movl	%edi, %eax\n"
	"movl	%eax, %edi\n"
	);
}

/* 2 clocks */
void benchtest(void) {
	asm volatile(
	"pushl	%edi\n"
	"popl	%edi\n"
	);
}

/* 1 clock */
int x1,x2,x3,x4,x5;
void benchtest(void) {
	x1 = 0;		/* Generates:	movl $0,x1 */
	x2 = 0; 	/*		movl $0,x2 */
	x3 = 0; 	/*		movl $0,x3 */
	x4 = 0; 	/*		movl $0,x4 */
	x5 = 0; 	/*		movl $0,x5 */
}

/* 1 clock, as fast as the version above */
int x1,x2,x3,x4,x5;
void benchtest(void) {
	asm volatile(
	"xorl	%eax, %eax\n"
	"movl	%eax, x1\n"
	"movl	%eax, x2\n"
	"movl	%eax, x3\n"
	"movl	%eax, x4\n"
	"movl	%eax, x5\n"
	);
}
