Mail Archives: pgcc/1998/02/23/22:00:02

X-pop3-spooler: POP3MAIL 2.1.0 b 3 961213 -bs-
Delivered-To: pcg AT goof DOT com
Message-ID: <19980223225820.34616@cerebro.laendle>
Date: Mon, 23 Feb 1998 22:58:20 +0100
From: Marc Lehmann <pcg AT goof DOT com>
To: pgcc <pgcc-list AT Desk DOT nl>
Subject: simd instructions for gcc
Mime-Version: 1.0
X-Mailer: Mutt 0.88
X-Operating-System: Linux version 2.1.85 (root AT cerebro) (gcc version pgcc-2.91.06 980129 (gcc-2.8.0 release))
Status: RO
Lines: 288

Please try to keep this local to pgcc-list at the moment.

Anyway, does anybody have comments? You can find more info under
http://www-personal.umich.edu/~hasdi/mmx.html

-----Forwarded message from Hasdi R Hashim <hasdi AT umich DOT edu>-----

GCC SIMD (Single Instruction Multiple Data) Support
(rough draft v0.1 2/22/1998)
by Hasdi R Hashim ( hasdi AT umich DOT NOSPAM DOT edu)

When Intel came out with MMX, very little software took advantage of it. The
reasons vary. One is that there are not enough MMX users compared to non-MMX
users to justify committing to an MMX-capable or MMX-only product. Second,
MMX has to be coded in assembly, which very few hackers are competent enough
to do. Even then, coding something in assembly makes it harder to port to
another architecture. So the question C/C++ programmers have been scratching
their heads over is: why isn't there a compiler switch to take advantage of
MMX?


Introduction to MMX

MMX is about using a 64-bit word to do operations on eight 8-bit, four
16-bit, or two 32-bit data elements at once (in a single cycle). In addition
to that, it allows you to perform saturating operations in a SINGLE cycle
(other architectures require costly branches and/or obscure code that very
few people can read). MMX falls under the SIMD (single instruction, multiple
data) class of instructions. Intel's are not the only processors with SIMD.
Digital Alpha (asserted by some to be the first), PA-RISC, and UltraSPARC
are some other processors that already have varying degrees of SIMD
capability. MMX is harder to work with than the others because it uses a
register file originally designed to be stack-oriented rather than linearly
addressed. The register file had to be modified to work under two mutually
exclusive modes of operation: linear for MMX and stack for floating point.
This difference in mode is transparent to the operating system and other
legacy programs.
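
To make the saturation point concrete, here is a rough sketch of what a
processor without SIMD has to do for a single unsigned saturating add over
eight bytes. The helper names addu8_lane and addu8_x8 are made up for this
illustration and are not part of the proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating add of one 8-bit lane: clamp to 255 instead of
   wrapping around.  The comparison is the branch the text refers to. */
static uint8_t addu8_lane(uint8_t x, uint8_t y)
{
    unsigned sum = (unsigned)x + y;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Scalar stand-in for one 64-bit saturating add: eight lanes, handled one
   at a time.  (Hypothetical helper, used only for this illustration.) */
static void addu8_x8(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8])
{
    for (size_t lane = 0; lane < 8; ++lane)
        dst[lane] = addu8_lane(a[lane], b[lane]);
}
```

That is eight compares and eight adds for what the text describes as a
single MMX operation.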


Complications with Adding SIMD Support

The lack of C support for MMX has little to do with the complexity of mixing
MMX with floating-point code. In fact, if that were the only obstacle to
supporting MMX in a C compiler like GCC, you would be very, very lucky. The
problem has to do with the language itself. Consider the following C++ code
snippet:

    char *a,*b,*c;
    ....
    for(int i = 0; i < SIZE; ++i) {
        int j = i;
        int k = i;
        a[i] = b[j] + c[k];
    }

In order for the compiler to convert this to...

    char *a,*b,*c;
    ....
    for(int i = 0; i < SIZE; i += 8) {
        int j = i;
        int k = i;
        ((mmx *)a)[i/8] = paddb(((mmx *)b)[j/8], ((mmx *)c)[k/8]);
    }

....the compiler MUST prove that:

   * SIZE is a multiple of 8 (okay, that is easy for the compiler to fix)
   * i, j, and k increment by 1 simultaneously at every iteration (okay,
     that is easy to detect)
   * The pointer differences between a, b, and c are multiples of 8, OR
     the arrays do not overlap each other

The first two are easy to overcome. The third one is almost impossible to
prove, unless a, b, and c are statically or locally allocated in the same
file the code is taken from. This is important because you must show that
every operation done on an element of an 8-byte vector is independent of the
other elements of the same 8-byte vector.  You can thank Dennis Ritchie for
this, for keeping the noalias keyword out of C. In most practical multimedia
applications, if you do want a, b, and c to overlap, you want them to
overlap entirely.
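
The overlap condition is easier to see with a small experiment. The sketch
below compares the element-by-element loop with a loop that works in groups
of eight, the way a vector unit would (load eight, add, store eight). The
function names add_bytewise and add_by_eights are invented for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reference semantics: one element at a time, in order. */
static void add_bytewise(unsigned char *a, const unsigned char *b,
                         const unsigned char *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = (unsigned char)(b[i] + c[i]);
}

/* What an 8-wide vector unit effectively does: load eight elements of b
   and c, add them, then store eight results.  n must be a multiple of 8. */
static void add_by_eights(unsigned char *a, const unsigned char *b,
                          const unsigned char *c, size_t n)
{
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) {
        unsigned char vb[8], vc[8];
        memcpy(vb, b + i, 8);               /* "vector load" */
        memcpy(vc, c + i, 8);
        for (size_t k = 0; k < 8; ++k)
            vb[k] = (unsigned char)(vb[k] + vc[k]);
        memcpy(a + i, vb, 8);               /* "vector store" */
    }
}
```

If a is b + 1, the bytewise loop feeds each result into the next element,
while the grouped loop reads all eight inputs before writing anything, so
the two disagree. If a is b + 8, every group is fully written before it is
read again and both loops produce the same bytes.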

Because of this,  rather than write parallel code and HOPE the compiler will
use MMX or other parallel opcodes, we should be explicit about it.  For
example,

    char *a,*b,*c;
    ....
    mm8_t *ma = (mm8_t *)a;
    mm8_t *mb = (mm8_t *)b;
    mm8_t *mc = (mm8_t *)c;

    for(int i = 0; i < SIZE/MM8_SIZE; i++) {
        ma[i] = add8(mb[i],mc[i]);
    }

mm8_t is not necessarily 64 bits wide, as I will explain later. The compiler
will have to decide how best to optimize this vector operation.  It can
assume that a, b, and c either do not overlap or overlap in their entirety.


Why add SIMD extension?

First of all, this is an innocent extension. At worst, this is nothing more
than an inline function or macro. A compiler with SIMD support will use this
hint to better optimize the code. Most compilers already have built-in
extensions that help you take advantage of SIMD. For example, GCC has
extended asm, the inline keyword and __attribute__((const)).  The only
change needed in GCC to support MMX is adding the 'x' constraint (ahem..
that is not very true, he he :).
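
As a rough idea of the "at worst" case, a compiler without any SIMD support
could get by with a header along these lines. The 8-byte width, the struct
layout, and the add_bytes wrapper are all assumptions made for this sketch;
the draft deliberately leaves the width of mm8_t open.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MM8_SIZE 8                  /* bytes per mm8_t; an assumed width */

typedef struct { uint8_t lane[MM8_SIZE]; } mm8_t;

/* Fallback add8: plain lane-by-lane addition, wrapping modulo 256. */
static inline mm8_t add8(mm8_t a, mm8_t b)
{
    mm8_t r;
    for (int i = 0; i < MM8_SIZE; ++i)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* The explicit loop from the draft, written with memcpy instead of pointer
   casts so that it is valid for any alignment.  Trailing bytes beyond a
   multiple of MM8_SIZE are left untouched. */
static void add_bytes(unsigned char *a, const unsigned char *b,
                      const unsigned char *c, size_t n)
{
    for (size_t i = 0; i + MM8_SIZE <= n; i += MM8_SIZE) {
        mm8_t vb, vc, va;
        memcpy(&vb, b + i, MM8_SIZE);
        memcpy(&vc, c + i, MM8_SIZE);
        va = add8(vb, vc);
        memcpy(a + i, &va, MM8_SIZE);
    }
}
```

Code written against add8 then keeps working everywhere, and only gets
faster where the compiler knows a better way to implement it.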

Second, you may argue that this extension is architecture specific. Not so,
because if the compiler can recognize parallel operations, on a regular
SPARC or PA-RISC 1.x it will generate code like this:

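    /* here ma, mb and mc point to 32-bit unsigned words */
    for(int i = 0; i < SIZE/4; i++) {
        unsigned x = mb[i];
        unsigned y = mc[i];
        unsigned mask = 0x80808080U;    /* top bit of every byte */
        unsigned lsign = x & mask;
        unsigned left = x &~ mask;      /* low 7 bits of every byte */
        unsigned rsign = y & mask;
        unsigned right = y &~ mask;
        unsigned carry = left + right;  /* carries stay inside each byte */

        ma[i] = carry ^ (lsign ^ rsign);
    }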

This is twice as fast as regular byte-wise addition (I'll post the numbers
later. TQ). On a 64-bit machine like Alpha or PA-RISC 2.0, I expect it to be
four times as fast. A programmer can use these extensions to take advantage
of MMX but leave the code portable enough for other processors.
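
For the 64-bit case, the same masking trick simply uses a wider constant.
The following is a sketch only, with add8_swar64 as a made-up name; it says
nothing about how fast the result actually is on any given machine.

```c
#include <stdint.h>

/* Add eight packed 8-bit lanes held in one 64-bit word, wrapping within
   each lane, with no carry leaking into the neighbouring lane. */
static uint64_t add8_swar64(uint64_t x, uint64_t y)
{
    const uint64_t high = UINT64_C(0x8080808080808080); /* top bit per lane */

    /* Add the low 7 bits of every lane; the cleared top bits absorb the
       carries, so nothing crosses a lane boundary. */
    uint64_t low_sum = (x & ~high) + (y & ~high);

    /* Fold the original top bits back in with XOR (addition without carry). */
    return low_sum ^ ((x ^ y) & high);
}
```

For example, the lanes 0xFF + 0x01 and 0x80 + 0x80 both come out as 0x00
without disturbing the lanes next to them.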

Third, we have to look at the advancement in processors and their practical
applications. Processors tend to get wider and wider (the transition to
64-bit words is under way and 128-bit may be in the cards), but in
multimedia and embedded applications most data may never grow beyond 8 or 16
bits. 32 bits is more than enough for all the colors the human eye can
actually discriminate, and likewise for audio samples. In an embedded
environment, the sampled data coming from an ADC is usually 8, 12, or 16
bits wide.

In conclusion, this extension is nice to have because it does not require
drastic changes in the compiler, it is not architecture specific, and it is
very useful for multimedia and embedded applications.


A list of extensions

Most of them are pretty intuitive.  Using pack/unpack requires some
explanation. Since you do not know how wide mm8_t, mm16_t and mm32_t will
be, you rely on pack/unpack to do the packing for you. For example, say that
you want 16-bit precision addition on 8-bit data. Turn this...


   char *a,*b,*c;
   ....
   mm8_t *ma = (mm8_t *)a;
   mm8_t *mb = (mm8_t *)b;
   mm8_t *mc = (mm8_t *)c;

   for(int i = 0; i < SIZE/MM8_SIZE; i++) {
        ma[i] = add8(mb[i],mc[i]);
   }

...to this:

   char *a,*b,*c;
   ....
   for(int i = 0; i < SIZE; i += MM16_SIZE/2) {
        mm16_t ma = unpack8_s16(b + i);
        mm16_t mb = unpack8_s16(c + i);
        pack16_8(a + i, add16(ma,mb));
   }

Notice that each iteration only consumes MM16_SIZE/2 source elements,
because every 8-bit element is widened to 16 bits when it is unpacked. Go
figure the rest.
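
Here is one way the unpack/add/pack sequence could look as plain C on a
machine with no SIMD at all. The 8-byte MM16_SIZE, the struct layout, and
the add_saturated wrapper are assumptions made for this sketch. It ends with
pack16_s8 (signed saturation) rather than pack16_8, since that is where the
extra precision pays off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MM16_SIZE  8                    /* bytes per mm16_t; an assumed width */
#define MM16_LANES (MM16_SIZE / 2)      /* number of 16-bit elements */

typedef struct { uint16_t lane[MM16_LANES]; } mm16_t;

/* Sign-extend MM16_LANES bytes to 16 bits each. */
static mm16_t unpack8_s16(const unsigned char *src)
{
    mm16_t r;
    for (int i = 0; i < MM16_LANES; ++i) {
        int v = src[i] >= 128 ? src[i] - 256 : src[i];
        r.lane[i] = (uint16_t)v;        /* two's-complement bit pattern */
    }
    return r;
}

/* Lane-wise 16-bit addition with wraparound. */
static mm16_t add16(mm16_t a, mm16_t b)
{
    mm16_t r;
    for (int i = 0; i < MM16_LANES; ++i)
        r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    return r;
}

/* Narrow each lane to 8 bits, keeping only the low byte. */
static void pack16_8(unsigned char *dst, mm16_t v)
{
    for (int i = 0; i < MM16_LANES; ++i)
        dst[i] = (unsigned char)(v.lane[i] & 0xffu);
}

/* Narrow each lane to 8 bits, clamping to the signed range [-128, 127]. */
static void pack16_s8(unsigned char *dst, mm16_t v)
{
    for (int i = 0; i < MM16_LANES; ++i) {
        int s = v.lane[i] >= 0x8000u ? (int)v.lane[i] - 0x10000
                                     : (int)v.lane[i];
        if (s > 127)  s = 127;
        if (s < -128) s = -128;
        dst[i] = (unsigned char)s;      /* store as a two's-complement byte */
    }
}

/* The loop from the draft, with a saturating pack at the end. */
static void add_saturated(unsigned char *a, const unsigned char *b,
                          const unsigned char *c, size_t n)
{
    assert(n % MM16_LANES == 0);
    for (size_t i = 0; i < n; i += MM16_LANES) {
        mm16_t mb = unpack8_s16(b + i);
        mm16_t mc = unpack8_s16(c + i);
        pack16_s8(a + i, add16(mb, mc));
    }
}
```

With this, 100 + 100 is computed as 200 in 16 bits and then clamped to 127,
whereas the truncating pack16_8 would store the low byte of 200 unchanged.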

 LABELS/FUNCTIONS                            Description

 mm8_t, mm16_t, mm32_t                       Data types suitable for 8-bit,
                                             16-bit, 32-bit operations
                                             respectively

 MM8_SIZE, MM16_SIZE, MM32_SIZE              Size of the data types

 mm8_t add8(mm8_t a, mm8_t b);               Perform regular addition
 mm16_t add16(mm16_t a, mm16_t b);
 mm32_t add32(mm32_t a, mm32_t b);

 mm8_t adds8(mm8_t a, mm8_t b);              Perform addition with signed
 mm16_t adds16(mm16_t a, mm16_t b);          saturation
 mm32_t adds32(mm32_t a, mm32_t b);

 mm8_t addu8(mm8_t a, mm8_t b);              Perform addition with unsigned
 mm16_t addu16(mm16_t a, mm16_t b);          saturation
 mm32_t addu32(mm32_t a, mm32_t b);

 mm8_t sub8(mm8_t a, mm8_t b);               Perform regular subtraction
 mm16_t sub16(mm16_t a, mm16_t b);
 mm32_t sub32(mm32_t a, mm32_t b);

 mm8_t subs8(mm8_t a, mm8_t b);              Perform subtraction with signed
 mm16_t subs16(mm16_t a, mm16_t b);          saturation
 mm32_t subs32(mm32_t a, mm32_t b);

 mm8_t subu8(mm8_t a, mm8_t b);              Perform subtraction with
 mm16_t subu16(mm16_t a, mm16_t b);          unsigned saturation
 mm32_t subu32(mm32_t a, mm32_t b);

 mm8_t gt8(mm8_t a, mm8_t b);                Do n-bit comparison; each
 mm16_t gt16(mm16_t a, mm16_t b);            element is set to 0xffff... if
 mm32_t gt32(mm32_t a, mm32_t b);            true, 0 if false
 mm8_t lt8(mm8_t a, mm8_t b);
 mm16_t lt16(mm16_t a, mm16_t b);
 mm32_t lt32(mm32_t a, mm32_t b);
 mm8_t ge8(mm8_t a, mm8_t b);
 mm16_t ge16(mm16_t a, mm16_t b);
 mm32_t ge32(mm32_t a, mm32_t b);
 mm8_t le8(mm8_t a, mm8_t b);
 mm16_t le16(mm16_t a, mm16_t b);
 mm32_t le32(mm32_t a, mm32_t b);
 mm8_t eq8(mm8_t a, mm8_t b);
 mm16_t eq16(mm16_t a, mm16_t b);
 mm32_t eq32(mm32_t a, mm32_t b);
 mm8_t ne8(mm8_t a, mm8_t b);
 mm16_t ne16(mm16_t a, mm16_t b);
 mm32_t ne32(mm32_t a, mm32_t b);

 mm8_t mul8(mm8_t a, mm8_t b);               Perform multiplication
 mm16_t mul16(mm16_t a, mm16_t b);
 mm32_t mul32(mm32_t a, mm32_t b);

 mm8_t sra8(mm8_t a, int b);                 Perform arithmetic shift to the
 mm16_t sra16(mm16_t a, int b);              right
 mm32_t sra32(mm32_t a, int b);

 mm8_t srl8(mm8_t a, int b);                 Perform logical shift to the
 mm16_t srl16(mm16_t a, int b);              right
 mm32_t srl32(mm32_t a, int b);

 mm8_t sll8(mm8_t a, int b);                 Perform logical shift to the
 mm16_t sll16(mm16_t a, int b);              left
 mm32_t sll32(mm32_t a, int b);

 mm16_t unpack8_u16(unsigned char *);        Unpack unsigned values
 mm32_t unpack8_u32(unsigned char *);
 mm32_t unpack16_u32(unsigned short *);

 mm16_t unpack8_s16(unsigned char *);        Unpack signed values
 mm32_t unpack8_s32(unsigned char *);
 mm32_t unpack16_s32(unsigned short *);

 void pack16_8(unsigned char *, mm16_t);     Pack values
 void pack32_8(unsigned char *, mm32_t);
 void pack32_16(unsigned short *, mm32_t);

 void pack16_s8(unsigned char *, mm16_t);    Pack values with signed
 void pack32_s8(unsigned char *, mm32_t);    saturation
 void pack32_s16(unsigned short *, mm32_t);

 void pack16_u8(unsigned char *, mm16_t);    Pack values with unsigned
 void pack32_u8(unsigned char *, mm32_t);    saturation
 void pack32_u16(unsigned short *, mm32_t);
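
The all-ones/all-zeros result of the comparison functions is meant to be
used as a mask. As a small sketch of that idea, here is a branch-free
lane-wise maximum. Several things are assumed here and not stated in the
draft: mm8_t is taken to be a 64-bit word, the comparison is unsigned, plain
C bitwise operators are applied to mm8_t, and max8 is a made-up helper.

```c
#include <stdint.h>

typedef uint64_t mm8_t;     /* assumed: eight 8-bit lanes in one 64-bit word */

/* Lane-wise "greater than": each lane becomes 0xff if a > b, else 0x00.
   Unsigned comparison is an assumption; the draft does not specify it. */
static mm8_t gt8(mm8_t a, mm8_t b)
{
    mm8_t r = 0;
    for (int shift = 0; shift < 64; shift += 8) {
        uint8_t x = (uint8_t)(a >> shift);
        uint8_t y = (uint8_t)(b >> shift);
        if (x > y)
            r |= (mm8_t)0xff << shift;
    }
    return r;
}

/* Lane-wise maximum built from the mask: take a where a > b, otherwise b.
   (Hypothetical helper; not in the list above.) */
static mm8_t max8(mm8_t a, mm8_t b)
{
    mm8_t m = gt8(a, b);
    return (a & m) | (b & ~m);
}
```

The loop inside gt8 is only the portable fallback; the point is that the
caller (max8 here) contains no per-element branch at all.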


-----End of forwarded message-----

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg AT goof DOT com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |
