From: "John Carter" <ECE AT dwaf-hri DOT pwv DOT gov DOT za>
Organization:  Dpt Water Affairs & Forestry (IWQS)
To: djgpp AT delorie DOT com
Date:          Thu, 1 Feb 1996 08:16:57 +0200
Subject:       Re: binary representation of floats
Message-ID: <20E61D84C99@dwaf-hri.pwv.gov.za>

Greetings,

>What I want to do is to get the binary representation of a float
>or double (the way these numbers are stored in memory).

From my little Intel i486 programmer's reference manual...

The i486 processor represents real numbers of the form :-
(-1)^s 2^E (b_0 . b_1 b_2 b_3 .. b_(p-1))
where :-
 s is the sign bit and = 0 or 1
 E = any integer between Emin and Emax inclusive
 b_i = 0 or 1
 p = the number of bits of precision.

The i486 stores real numbers in a three field binary format that
resembles scientific or exponential notation. The format consists of
the following fields:
  * The number's significant digits are held in the significand
    field, b_0 . b_1 b_2 b_3 ... b_(p-1)
  * The exponent field e = E + bias locates the binary point within
    the significant digits.
  * A 1 bit sign field.

The i486 usually carries the digits of the significand in normalized
form. That is, except for the value 0, the significand consists of an
integer bit and fraction bits, like so :- 1.fff...f

By normalizing the number, the integer bit is always one, so the i486
does NOT actually store this one in single and double precision
formats. (However, it is physically always present in the extended
format.)
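
To make the hidden integer bit concrete, here is a minimal sketch in C
that rebuilds the value of a single from its raw 32 bits by putting the
"1." back in front of the stored fraction. The function name
value_from_single_bits is my own invention, it assumes a C99
<stdint.h>, and it only handles normalized numbers (no zero, denormals,
infinities or NaNs). It uses the single precision bias of 127.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the Intel manual: rebuild the value of
 * a normalized single precision number from its raw bit pattern. */
double value_from_single_bits(uint32_t bits)
{
    unsigned sign     = (unsigned)(bits >> 31);        /* bit 31        */
    int      e        = (int)((bits >> 23) & 0xFFu);   /* bits 30..23   */
    uint32_t fraction = bits & 0x7FFFFFu;              /* bits 22..0    */
    double   significand;
    int      E;

    /* Assumption: only normalized numbers are handled here. */
    assert(e != 0 && e != 255);

    /* Restore the integer bit that is not stored: 1.fff...f */
    significand = 1.0 + (double)fraction / 8388608.0;  /* 8388608 = 2^23 */

    /* True exponent = biased exponent - bias (127 for single). */
    E = e - 127;
    while (E > 0) { significand *= 2.0; E--; }
    while (E < 0) { significand /= 2.0; E++; }

    return sign ? -significand : significand;
}
```

For example, the pattern 0xC0D00000 has sign 1, biased exponent 129 and
fraction bits 101000..., so it comes out as -1.101 (binary) times 2^2,
which is -6.5.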

In order to simplify the comparing of real numbers the i486 stores
the exponents in a biased form. This means that a constant is added
to the true exponent. The bias varies according to the number format,
and is chosen to force the biased exponent to always be positive. This
allows two real numbers of the same format and sign to be compared as
if they were unsigned binary integers. A number's true exponent is
found by subtracting the bias.
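
A rough sketch of what that buys you, in C. The helper names
float_bits and less_than_by_bits are made up for illustration, and I
am assuming a 32 bit float and a C99 <stdint.h>. The comparison is
only meant for two positive, finite numbers; for two negative numbers
the integer order comes out reversed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 32 bit pattern of a float. */
uint32_t float_bits(float x)
{
    uint32_t bits;

    assert(sizeof(float) == sizeof bits);  /* assumption: 32 bit float */
    memcpy(&bits, &x, sizeof bits);        /* copy the bytes unchanged */
    return bits;
}

/* Compare two floats using only an unsigned integer comparison.
 * Assumption: a and b are both positive and finite (not NaN). */
int less_than_by_bits(float a, float b)
{
    return float_bits(a) < float_bits(b);
}
```

Because the biased exponent sits above the fraction and is never
negative, a bigger exponent always gives a bigger integer, and for
equal exponents the fraction bits decide.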

While a number is in the FPU it is always held in extended precision;
it is converted to single or double only when stored in memory.

Parameter             | Single | Double | Extended
----------------------+--------+--------+---------
Width (bits)          |     32 |     64 |       80
p (bits of precision) |     24 |     53 |       64
Exponent width (bits) |      8 |     11 |       15
Emax                  |   +127 |  +1023 |   +16383
Emin                  |   -126 |  -1022 |   -16382
Exponent bias         |   +127 |  +1023 |   +16383
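
If you want to check these parameters against your own compiler, the
standard <float.h> header carries most of them. A small sketch follows;
the struct and function names are my own. Note that C writes the
significand as 0.1fff... rather than 1.fff..., so its exponent limits
are one higher than the manual's and I subtract 1 to match the table.
The long double numbers depend on the platform, so treat that function
as an assumption to be checked rather than a promise.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical container for one column of the table. */
struct fp_params {
    int p;     /* bits of precision, including the integer bit */
    int emax;  /* largest true exponent                        */
    int emin;  /* smallest true exponent of a normalized value */
};

struct fp_params single_params(void)
{
    struct fp_params r;

    assert(FLT_RADIX == 2);   /* assumption: binary floating point */
    r.p    = FLT_MANT_DIG;
    r.emax = FLT_MAX_EXP - 1; /* C's limits are one higher         */
    r.emin = FLT_MIN_EXP - 1;
    return r;
}

struct fp_params double_params(void)
{
    struct fp_params r;

    r.p    = DBL_MANT_DIG;
    r.emax = DBL_MAX_EXP - 1;
    r.emin = DBL_MIN_EXP - 1;
    return r;
}

/* Platform dependent: only matches the Extended column where
 * long double is the 80 bit Intel format. */
struct fp_params long_double_params(void)
{
    struct fp_params r;

    r.p    = LDBL_MANT_DIG;
    r.emax = LDBL_MAX_EXP - 1;
    r.emin = LDBL_MIN_EXP - 1;
    return r;
}
```

On a machine using the formats above, single_params() should give 24,
127 and -126, and double_params() 53, 1023 and -1022.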

The order in the formats is always sign as most significant bit, (Bit
31 for single / 63 for double...), then biased exponent, then
significand. The least significant bit of the significand is always
bit 0.
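
To get at the bits themselves, something along these lines should do.
It is only a sketch: the names float_fields, split_float and
float_to_binary are mine, it assumes float is the 32 bit single format
described above, and it uses uint32_t from a C99 <stdint.h> (with an
older compiler an unsigned int of the right size would have to stand
in). The same idea works for double with a 64 bit integer, an 11 bit
exponent and a 52 bit fraction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: the three stored fields of a single. */
struct float_fields {
    unsigned sign;        /* bit 31                                 */
    unsigned biased_exp;  /* bits 30..23, e = E + 127               */
    uint32_t fraction;    /* bits 22..0, integer bit is not stored  */
};

/* Split a float into its sign, biased exponent and fraction fields. */
struct float_fields split_float(float x)
{
    uint32_t bits;
    struct float_fields f;

    assert(sizeof(float) == sizeof bits);  /* assumption: 32 bit float */
    memcpy(&bits, &x, sizeof bits);        /* copy the bytes unchanged */

    f.sign       = (unsigned)(bits >> 31);
    f.biased_exp = (unsigned)((bits >> 23) & 0xFFu);
    f.fraction   = bits & 0x7FFFFFu;
    return f;
}

/* Write the 32 bits of x as '0' and '1' characters, most significant
 * bit (the sign) first. out must have room for 33 characters. */
void float_to_binary(float x, char out[33])
{
    uint32_t bits;
    int i;

    assert(sizeof(float) == sizeof bits);
    memcpy(&bits, &x, sizeof bits);

    for (i = 0; i < 32; i++)
        out[i] = ((bits >> (31 - i)) & 1u) ? '1' : '0';
    out[32] = '\0';
}
```

For -6.5f, split_float gives sign 1, biased exponent 129 and fraction
0x500000. Going through memcpy rather than casting a pointer keeps the
compiler from getting clever with aliasing.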

As far as I know, DJGPP maps
float to Intel 32 bit single precision.
double to Intel 64 bit double precision.

Calculations are done on the FPU in 80 bit precision whether you
are dealing with floats or doubles. The results are rounded to the
narrower format whenever they are stored in memory. (You may find that
you get better precision out of your calculations if you optimize with
-O3, as more intermediate values would stay in FPU registers and there
would be fewer such roundings.)

As far as I know, GCC's "long double" type maps to the 80 bit extended
format on Intel, so extended precision numbers can be stored too.

GCC does have a nice long long int, which seems to be a software
emulated 64 bit int. I don't think this uses the Intel FPU 64 bit long
int, but I'm open to correction on that one. Maybe a wee bit of
disassembly will tell us.


John Carter
Institute for Water Quality Studies. Department of Water Affairs.
Internet : ece AT dwaf-hri DOT pwv DOT gov DOT za  Phone    : 27-12-808-0374x194      
Fax : 27-12-808-0338 [Host for Afwater list server]

Founder of the Council for Unnatural Scientists.