Mail Archives: pgcc/2001/02/20/13:15:43
Hello!
On Sun, 18 Feb 2001 18:40:04 +0100, Marc Lehmann wrote:
>> But the code produced in the last case will probably execute much faster than the first (and it is smaller).
>
>please benchmark. "probably" is always wrong, especially when quoting out
>of the amd optimization manual. The way to go, usually, is to implement
>the exact opposite of what is in the manual. I am quite fed-up with AMD
>since I invested quite some time in implementing some of their suggestions
>(for AMD-K6) only to find out code gets slower. Especially their mmx unit
>is the biggest joke ever heard :(
>
You claim that the manuals lie.
Well, I did my own investigation and the results say that you are wrong; please see below.
I ran two series of tests:
1) celeron-266 + WinNT 4.0 (which is why the Celeron values are given as ranges)
2) K6-200 + Linux-2.4.1-ac4
The numbers in the tests are relative values: they indicate how many
million (1e+06) loop iterations were performed per second.
They do not indicate the absolute performance of the selected instructions, only
relative performance, because the loop contains those instructions but does not consist of them alone.
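The harness itself is not shown in this message, so the following is only a minimal sketch of how a "million loops per second" figure of this kind could be obtained. The names mloops_per_second, body_under_test and sink are hypothetical, and the loop body is a portable placeholder standing in for the instruction pairs tested below.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the instructions under test.
   The volatile sink keeps the compiler from deleting the loop. */
static volatile unsigned long sink;

static void body_under_test(void)
{
    sink = sink + 1;
}

/* Run `iterations` loops and return millions of loops per second.
   Loop overhead is part of the measurement, so results are only
   comparable between bodies timed with the same harness. */
double mloops_per_second(unsigned long iterations)
{
    assert(iterations > 0);
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        body_under_test();
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0; /* run was too short for the clock resolution */
    return (double)iterations / seconds / 1e6;
}
```

Because the loop counter and branch are timed together with the body, a figure from such a harness says nothing about the absolute cost of one instruction; only the ratio between two bodies is meaningful.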
TEST#1
======
This code tests the performance of the MOV REG, MMREG instruction:
"movd %%edi, %%mm0\n"
"movd %%mm0, %%edi\n"
53.26-53.42 (celeron) 28.58 (k6)
TEST#2
======
This code tests the performance of the MOV REG, REG instruction:
"movl %%edi, %%eax\n"
"movl %%eax, %%edi\n"
53.25-53.40 (celeron) 40.02 (k6)
TEST#3
======
This code tests the performance of the PUSH / POP instruction pair:
"pushl %%edi\n"
"popl %%edi\n"
44.48-44.52 (celeron) 36.38 (k6)
A short summary
============
For celeron:
mov reg, reg is about 19% faster than push / pop
mov reg, reg has the same speed as mov reg, mmreg
For k6:
mov reg, reg is about 11% faster than push / pop
mov reg, reg is about 40% faster than mov reg, mmreg
So MOV REG, MMREG is a very slow operation on the K6 and a fast one on the Celeron (same speed as MOV REG, REG).
It would therefore be better not to use pgcc's MMX optimization on the K6 architecture specifically.
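For readers who want to recompute the percentages from the raw figures, here is a small sketch. It assumes that "X% faster" means the ratio of the two throughputs minus one, and that a range is represented by its midpoint; both conventions, and the function names, are assumptions rather than something stated in this message.

```c
#include <assert.h>

/* How much faster `fast` is than `slow`, in percent,
   when both are throughputs (loops per second). */
double percent_faster(double fast, double slow)
{
    assert(slow > 0.0);
    return (fast / slow - 1.0) * 100.0;
}

/* Midpoint of a measured range such as 53.25-53.40. */
double midpoint(double lo, double hi)
{
    return (lo + hi) / 2.0;
}
```

With this definition, percent_faster(40.02, 28.58) lands very close to the 40% quoted for the K6. The push / pop comparisons come out near 20 (Celeron midpoints) and 10 (K6), so the percentages above may have been rounded or computed slightly differently.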
TEST#4
=======
"movl $0, %%edx\n"
"movl %%edx, _var1\n"
"movl %%edx, _var2\n"
"movl %%edx, _var3\n"
"movl %%edx, _var4\n"
"movl %%edx, _var5\n"
29.66-29.68 (celeron) 25.014 (k6)
TEST#5
=======
"xorl %%edx, %%edx\n"
"movl %%edx, _var1\n"
"movl %%edx, _var2\n"
"movl %%edx, _var3\n"
"movl %%edx, _var4\n"
"movl %%edx, _var5\n"
29.66-29.68 (celeron) 25.013 (k6)
TEST#6
=======
" movl $0, _var1\n"
" movl $0, _var2\n"
" movl $0, _var3\n"
" movl $0, _var4\n"
" movl $0, _var5\n"
26.20-26.71 (celeron) 20.01 (k6)
A short summary
============
Both the Celeron and the K6 perform best with TEST#4 and TEST#5, but with TEST#6
the celeron loses 11% of performance versus TEST#4
the k6 loses 25% of performance versus TEST#4.
TEST#4 and TEST#5 differ by less than 0.1% (which we can ignore).
I also ran some more tests; one of them is below:
TEST#7
=======
This code tests the PADDB instruction
a) non-MMX version of the code:
"movb (%2), %%dl\n"
"addb %%dl, (%2)\n"
"movb 1(%2), %%dl\n"
"addb %%dl, 1(%2)\n"
"movb 2(%2), %%dl\n"
"addb %%dl, 2(%2)\n"
"movb 3(%2), %%dl\n"
"addb %%dl, 3(%2)\n"
"movb 4(%2), %%dl\n"
"addb %%dl, 4(%2)\n"
"movb 5(%2), %%dl\n"
"addb %%dl, 5(%2)\n"
"movb 6(%2), %%dl\n"
"addb %%dl, 6(%2)\n"
"movb 7(%2), %%dl\n"
"addb %%dl, 7(%2)\n"
b) MMX-optimized version of the code:
"movq (%2), %%mm0\n"
"paddb (%2), %%mm0\n"
"movq %%mm0, (%2)\n"
non-MMX version:                14.52-14.83 (celeron)   9.52 (k6)
MMX, misaligned memory access:  8.45-8.61 (celeron)     22.23 (k6)
MMX, 8-byte aligned:            33.25-33.38 (celeron)   28.58 (k6)
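As a portable reference for what both TEST#7 variants compute, the sketch below adds each of 8 bytes to itself with wraparound, which is what PADDB does when the source and destination hold the same data. It also shows one way to tell the "aligned" case from the "misaligned" one. The function names are hypothetical and this is not the code that was timed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each of 8 bytes is added to itself, wrapping modulo 256,
   matching the byte-wise (non-saturating) behaviour of PADDB. */
void add_bytes_to_self(uint8_t *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < 8; i++)
        p[i] = (uint8_t)(p[i] + p[i]);
}

/* Nonzero if p sits on an 8-byte boundary (the "aligned" case). */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}
```

A buffer declared with 8-byte alignment passes is_aligned8, while the same buffer offset by one byte does not; the second case corresponds to the misaligned measurements above.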
GENERAL SUMMARY:
=================
1. The official manuals do not lie (they may contain mistakes and errata, nothing more), and I am still inclined to trust them.
2. The K6 has a very slow MMX unit when it is used only as storage. But MMX technology was introduced as
a facility for vector computing, and AMD has a fast enough MMX unit for that. It is the Celerons and P6 clones that are really
the biggest joke when code has misaligned memory accesses. I think that if (p)gcc is extended with a "vector"
keyword, then pgcc's MMX optimization will be useful for AMD too.
3. Even if a processor is found for which TEST#6 is the best solution, the code from TEST#5 is still
best for size optimization, yet it (see TEST#5) is not implemented in pgcc-2.95.2.1. In addition, the code
from TEST#4 is well balanced between "space" and "time" optimization and is probably the best solution
for many programs (on hypothetical processors). Maybe it would be better to enable such optimizations for -O3 and
higher, or to add a new option: -maggressive_optimization.
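The size argument in point 3 can be checked by counting bytes. The sketch below assumes 32-bit code with absolute (disp32) addressing for _var1 to _var5; the encodings in the comments are the standard IA-32 ones for these forms, while the helper name sequence_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Encoded lengths in bytes, assuming 32-bit code and absolute
   (disp32) addressing for the variables. */
enum {
    LEN_MOV_IMM32_REG = 5,  /* movl $0, %edx   : BA imm32           */
    LEN_XOR_REG_REG   = 2,  /* xorl %edx, %edx : 31 D2              */
    LEN_MOV_REG_MEM   = 6,  /* movl %edx, _var : 89 15 disp32       */
    LEN_MOV_IMM32_MEM = 10  /* movl $0, _var   : C7 05 disp32 imm32 */
};

/* Total size of an optional setup instruction followed by
   `stores` identical store instructions. */
size_t sequence_bytes(size_t setup_len, size_t store_len, size_t stores)
{
    assert(stores > 0);
    return setup_len + stores * store_len;
}
```

Under these assumptions TEST#4 is sequence_bytes(LEN_MOV_IMM32_REG, LEN_MOV_REG_MEM, 5), TEST#5 is sequence_bytes(LEN_XOR_REG_REG, LEN_MOV_REG_MEM, 5), and TEST#6 is sequence_bytes(0, LEN_MOV_IMM32_MEM, 5), giving 35, 32 and 50 bytes. Each TEST#6 store is 10 bytes long, whereas mov $0, reg is only 5.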
>> Avoid long instruction length Use x86 instructions that are
>> less than eight bytes in length. An x86 instruction that is longer
>>
>> Clear registers using MOV reg, 0 instead of XOR reg, reg
>
>is mov 0, reg longer than seven bytes?
See TEST#4 and TEST#5.
Best regards! Nick
P.S.: I ran all the tests using my own project BIEW, which can be found at http://biew.sourceforge.net.
If you are interested in them, I can publish or send you a small patch for the program so that you will be able to
repeat all the tests.