Message-ID: <19980223225820.34616@cerebro.laendle>
Date: Mon, 23 Feb 1998 22:58:20 +0100
From: Marc Lehmann
To: pgcc
Subject: simd instructions for gcc

Please try to keep this local to pgcc-list at the moment.

Anyway, anybody have comments? You can find more info under
http://www-personal.umich.edu/~hasdi/mmx.html

-----Forwarded message from Hasdi R Hashim -----

GCC SIMD (Single Instruction Multiple Data) Support
(rough draft v0.1 2/22/1998)
by Hasdi R Hashim (hasdi AT umich DOT NOSPAM DOT edu)

When Intel came out with MMX, very little software took advantage of it,
for several reasons. One is that there were not enough MMX users compared
to non-MMX users to justify committing to an MMX-supported or MMX-only
product. Another is that MMX has to be coded in assembly, which few
hackers are competent enough to do; and even then, coding something in
assembly makes it harder to port to another architecture. So the question
C/C++ programmers have been scratching their heads over for years is:
why isn't there a compiler switch to take advantage of MMX?

Introduction to MMX

MMX is about using a 64-bit word to do operations on 8 8-bit data, 4
16-bit data, or 2 32-bit data at once (in a single cycle). In addition,
it allows you to perform saturation operations in a SINGLE cycle (other
architectures require costly branches and/or obscure code that very few
people can read). MMX falls under the SIMD (single instruction multiple
data) class of instructions. Intel is not the only vendor with SIMD: the
Digital Alpha (asserted by some to be the first), PA-RISC, and UltraSparc
already have varying degrees of SIMD capability.
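To make the saturation point concrete, here is a minimal scalar sketch of the difference between saturating and ordinary addition on bytes; the helper names are mine, not part of MMX or any proposed API. MMX does this for eight bytes at once, while plain C needs the comparison and branch shown here for every element:

```c
#include <stdint.h>

/* Saturating add: the result clamps at 255 instead of wrapping.
 * (Hypothetical helper name, for illustration only.) */
static uint8_t addu8_sat(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;        /* widen so the carry is visible */
    return sum > 255 ? 255 : (uint8_t)sum; /* clamp instead of wrapping */
}

/* Ordinary C addition on uint8_t wraps modulo 256 instead. */
static uint8_t addu8_wrap(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

Saturation is what you want for pixels and audio samples: a bright pixel plus a bright pixel should stay at full brightness, not wrap around to black.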
MMX is harder to work with than the SIMD extensions of other processors
because it reuses a register file originally designed to be
stack-oriented rather than linear: the floating-point registers. The
register file has to operate in two mutually exclusive modes, linear for
MMX and stack for floating point. This difference in mode is transparent
to the operating system and to legacy programs.

Complications with Adding SIMD Support

The lack of C support for MMX has little to do with the complexity of
mixing MMX with floating-point code. In fact, if that were the only
obstacle to supporting MMX in a C compiler like GCC, you would be very,
very lucky. The problem has to do with the language itself. Consider the
following C++ code snippet:

    char *a,*b,*c;
    ....
    for(int i = 0; i < SIZE; ++i) {
        int j = i;
        int k = i;
        a[i] = b[j] + c[k];
    }

In order for the compiler to convert this to...

    char *a,*b,*c;
    ....
    for(int i = 0; i < SIZE; i += 8) {
        int j = i;
        int k = i;
        ((mmx *)a)[i/8] = paddb(((mmx *)b)[j/8], ((mmx *)c)[k/8]);
    }

...the compiler MUST prove that:

* SIZE is a multiple of 8 (okay, that is easy for the compiler to fix)
* i, j, and k increment by 1 simultaneously at every iteration (okay,
  that is easy to detect)
* the pointer differences between a, b, and c are multiples of 8, OR
  the arrays do not overlap each other

The first two are easy to overcome. The third is almost impossible to
prove unless a, b, and c are statically or locally allocated in the same
file the code is taken from. This matters because you must show that
every operation done on an element of an 8-byte vector is independent of
the other elements in the same 8-byte vector. You can thank Dennis
Ritchie for this, for disallowing the noalias keyword in C. In most
practical multimedia applications, if you do want a, b, and c to
overlap, you want them to overlap entirely.
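The overlap hazard can be demonstrated directly. In this sketch (function names are illustrative, not part of any proposal), the serial byte loop feeds earlier results into later iterations when the output overlaps an input, while an 8-at-a-time step, like a SIMD load, reads its whole input group up front and therefore computes something different:

```c
#include <stdint.h>
#include <string.h>

/* Serial version: when a overlaps b, later iterations see the
 * results of earlier ones. */
static void add_serial(uint8_t *a, const uint8_t *b, const uint8_t *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = (uint8_t)(b[i] + c[i]);
}

/* Grouped version: reads 8 inputs before writing any output,
 * as an 8-byte SIMD load/store would. */
static void add_vector8(uint8_t *a, const uint8_t *b, const uint8_t *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        uint8_t tb[8], tc[8];
        memcpy(tb, b + i, 8);   /* load the whole group first */
        memcpy(tc, c + i, 8);
        for (int j = 0; j < 8; j++)
            a[i + j] = (uint8_t)(tb[j] + tc[j]);
    }
}
```

With a = buf+1 and b = buf, the serial loop produces a running chain (2, 3, 4, ...) while the grouped loop does not, which is exactly why the compiler must prove non-overlap before substituting paddb.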
Because of this, rather than writing parallel code and HOPING the
compiler will use MMX or other parallel opcodes, we should be explicit
about it. For example:

    char *a,*b,*c;
    ....
    mm8_t *ma = (mm8_t *)a;
    mm8_t *mb = (mm8_t *)b;
    mm8_t *mc = (mm8_t *)c;
    for(int i = 0; i < SIZE/MM8_SIZE; i++) {
        ma[i] = add8(mb[i],mc[i]);
    }

mm8_t is not necessarily 64-bit, as I will explain later. The compiler
will have to decide how best to optimize this vector operation. It can
assume that a, b and c either do not overlap or overlap in their
entirety.

Why add a SIMD extension?

First of all, this is an innocent extension. At worst, it is nothing
more than an inline function or macro. A compiler with SIMD support will
use this hint to better optimize the code. Most compilers already have
built-in extensions of this kind; for example, GCC has extended asm, the
inline keyword and __attribute__((const)). The only change needed in GCC
to support MMX is adding the 'x' constraint (ehem.. that is not very
true he he :).

Second, you may argue that this extension is architecture specific. Not
so: if the compiler can recognize parallel operations, then on a regular
Sparc or PA-RISC 1.x it can generate code like this (treating ma, mb and
mc as arrays of 32-bit words):

    for(int i = 0; i < SIZE/4; i++) {
        int a = mb[i];
        int b = mc[i];
        int mask  = 0x80808080UL;
        int lsign = a &  mask;
        int left  = a & ~mask;
        int rsign = b &  mask;
        int right = b & ~mask;
        int carry = left + right;
        ma[i] = carry ^ (lsign ^ rsign);
    }

This is twice as fast as regular byte-wise addition (I'll post the
numbers later. TQ). On a 64-bit machine like the Alpha or PA-RISC 2.0, I
expect it to be four times as fast. A programmer can use these
extensions to take advantage of MMX but leave the code portable enough
for other processors.

Third, we have to look at the advancement of processors and their
practical applications.
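The masked-carry trick in that loop can be checked in isolation. Here is a self-contained version for a single 32-bit word (the function name is mine): the top bit of each byte is masked off so that additions of the low 7 bits cannot carry across a byte boundary, and the top bits are then added back in with XOR:

```c
#include <stdint.h>

/* SWAR byte-wise add: adds four bytes packed into one 32-bit word
 * without letting a carry from one byte spill into its neighbour. */
static uint32_t add8x4(uint32_t a, uint32_t b)
{
    uint32_t mask  = 0x80808080UL;          /* top bit of every byte      */
    uint32_t left  = a & ~mask;             /* low 7 bits of each byte    */
    uint32_t right = b & ~mask;
    uint32_t carry = left + right;          /* carries stop at each bit 7 */
    return carry ^ (a & mask) ^ (b & mask); /* add the top bits back in   */
}
```

Each byte wraps modulo 256 independently, matching what four separate byte additions would produce, but in a handful of full-word operations.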
Processors keep getting wider (the 64-bit word is in transition and
128-bit may be in the cards), but in multimedia and embedded
applications most data may never grow beyond 8 or 16 bits. 32 bits is
already more color than the human eye can actually discriminate, and
likewise with audio samples. In an embedded environment, the sampled
data coming from an ADC is usually 8, 12 or 16 bits wide.

In conclusion, this extension is nice to have because it does not
require drastic changes in the compiler, it is not architecture specific
and it is very useful for multimedia and embedded applications.

A list of extensions

Most of them are pretty intuitive. Using pack/unpack requires some
explanation. Since you do not know how wide mm8_t, mm16_t and mm32_t
will be, you rely on pack/unpack to do the packing for you. For example,
say that you want 16-bit precision addition on 8-bit data. Turn this...

    char *a,*b,*c;
    ....
    mm8_t *ma = (mm8_t *)a;
    mm8_t *mb = (mm8_t *)b;
    mm8_t *mc = (mm8_t *)c;
    for(int i = 0; i < SIZE/MM8_SIZE; i++) {
        ma[i] = add8(mb[i],mc[i]);
    }

...into this:

    char *a,*b,*c;
    ....
    for(int i = 0; i < SIZE; i += MM16_SIZE/2) {
        mm16_t ma = unpack8_s16(b + i);
        mm16_t mb = unpack8_s16(c + i);
        pack16_8(a + i, add16(ma,mb));
    }

Notice that each unpacked step consumes half of MM16_SIZE in data
elements. Go figure the rest.
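As a sketch of how pack/unpack could behave, assume for illustration that mm16_t is a 32-bit word holding two 16-bit lanes (the proposal deliberately leaves the width unspecified, and these definitions are mine, not the proposed API). unpack widens bytes into 16-bit lanes so intermediate sums have headroom; pack narrows them back:

```c
#include <stdint.h>

typedef uint32_t mm16_t;   /* assumed width: two 16-bit lanes */

/* Widen two unsigned bytes into the two 16-bit lanes of an mm16_t. */
static mm16_t unpack8_16(const unsigned char *p)
{
    return (mm16_t)p[0] | ((mm16_t)p[1] << 16);
}

/* Lane-wise 16-bit addition; carries are masked so that one lane
 * cannot spill into the other. */
static mm16_t add16(mm16_t a, mm16_t b)
{
    uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return lo | (hi << 16);
}

/* Narrow the two 16-bit lanes back to bytes (truncating; the
 * saturating variants in the table below would clamp instead). */
static void pack16_8(unsigned char *p, mm16_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 16) & 0xFFu);
}
```

The point of the widening step is visible when one lane overflows a byte: the neighbouring lane is unaffected, because the overflow happened in 16-bit headroom rather than in the packed byte.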
LABELS/FUNCTIONS and their descriptions:

mm8_t, mm16_t, mm32_t
    data types suitable for 8-bit, 16-bit and 32-bit operations
    respectively

MM8_SIZE, MM16_SIZE, MM32_SIZE
    size of the data types

mm8_t  add8(mm8_t a, mm8_t b);
mm16_t add16(mm16_t a, mm16_t b);
mm32_t add32(mm32_t a, mm32_t b);
    perform regular addition

mm8_t  adds8(mm8_t a, mm8_t b);
mm16_t adds16(mm16_t a, mm16_t b);
mm32_t adds32(mm32_t a, mm32_t b);
    perform addition with signed saturation

mm8_t  addu8(mm8_t a, mm8_t b);
mm16_t addu16(mm16_t a, mm16_t b);
mm32_t addu32(mm32_t a, mm32_t b);
    perform addition with unsigned saturation

mm8_t  sub8(mm8_t a, mm8_t b);
mm16_t sub16(mm16_t a, mm16_t b);
mm32_t sub32(mm32_t a, mm32_t b);
    perform regular subtraction

mm8_t  subs8(mm8_t a, mm8_t b);
mm16_t subs16(mm16_t a, mm16_t b);
mm32_t subs32(mm32_t a, mm32_t b);
    perform subtraction with signed saturation

mm8_t  subu8(mm8_t a, mm8_t b);
mm16_t subu16(mm16_t a, mm16_t b);
mm32_t subu32(mm32_t a, mm32_t b);
    perform subtraction with unsigned saturation

mm8_t  gt8(mm8_t a, mm8_t b);
mm16_t gt16(mm16_t a, mm16_t b);
mm32_t gt32(mm32_t a, mm32_t b);
mm8_t  lt8(mm8_t a, mm8_t b);
mm16_t lt16(mm16_t a, mm16_t b);
mm32_t lt32(mm32_t a, mm32_t b);
mm8_t  ge8(mm8_t a, mm8_t b);
mm16_t ge16(mm16_t a, mm16_t b);
mm32_t ge32(mm32_t a, mm32_t b);
mm8_t  le8(mm8_t a, mm8_t b);
mm16_t le16(mm16_t a, mm16_t b);
mm32_t le32(mm32_t a, mm32_t b);
mm8_t  eq8(mm8_t a, mm8_t b);
mm16_t eq16(mm16_t a, mm16_t b);
mm32_t eq32(mm32_t a, mm32_t b);
mm8_t  ne8(mm8_t a, mm8_t b);
mm16_t ne16(mm16_t a, mm16_t b);
mm32_t ne32(mm32_t a, mm32_t b);
    do n-bit comparison; each element is set to 0xffff... if true,
    0 if false

mm8_t  mul8(mm8_t a, mm8_t b);
mm16_t mul16(mm16_t a, mm16_t b);
mm32_t mul32(mm32_t a, mm32_t b);
    perform multiplication

mm8_t  sra8(mm8_t a, int b);
mm16_t sra16(mm16_t a, int b);
mm32_t sra32(mm32_t a, int b);
    perform arithmetic shift to the right

mm8_t  srl8(mm8_t a, int b);
mm16_t srl16(mm16_t a, int b);
mm32_t srl32(mm32_t a, int b);
    perform logical shift to the right

mm8_t  sll8(mm8_t a, int b);
mm16_t sll16(mm16_t a, int b);
mm32_t sll32(mm32_t a, int b);
    perform logical shift to the left

mm16_t unpack8_u16(unsigned char *);
mm32_t unpack8_u32(unsigned char *);
mm32_t unpack16_u32(unsigned short *);
    unpack unsigned values

mm16_t unpack8_s16(unsigned char *);
mm32_t unpack8_s32(unsigned char *);
mm32_t unpack16_s32(unsigned short *);
    unpack signed values

void pack16_8(unsigned char *,mm16_t);
void pack32_8(unsigned char *,mm32_t);
void pack32_16(unsigned short *,mm32_t);
    pack values

void pack16_s8(unsigned char *,mm16_t);
void pack32_s8(unsigned char *,mm32_t);
void pack32_s16(unsigned short *,mm32_t);
    pack values with signed saturation

void pack16_u8(unsigned char *,mm16_t);
void pack32_u8(unsigned char *,mm32_t);
void pack32_u16(unsigned short *,mm32_t);
    pack values with unsigned saturation

-----End of forwarded message-----

      -----==-                                            |
      ----==--      _                                     |
      ---==---(_)__  __ ____  __      Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /      pcg AT goof DOT com      |e|
      -=====/_/_//_/\_,_/ /_/\_\                        --+
    The choice of a GNU generation                        |
                                                          |