Computer Engineering / Computer Science 126

Supplemental Notes

Doug Sapp

Floating Point

A) Introduction

Computers are integer machines. In order to hold a number other than an integer we need to develop a representation. The Institute of Electrical and Electronics Engineers (IEEE) has defined a widely used standard for floating point notation, which we will also use.

As you will soon see, performing arithmetic operations with floating point is computationally intensive.  Performing these operations in software would place a great burden on the CPU.  To alleviate the burden, the operations are performed by dedicated hardware called floating point units (FPUs) or math coprocessors.

                     32-bit Unsigned Integer    32-bit IEEE Floating Point
Range                0 - FFFFFFFF               +/- 3.40282346638x10^38
Accuracy             Dead on (exact)            within 1.19209289551x10^-7
Fractional Part?     No                         Yes
Negative Numbers?    No                         Yes
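
The limits in the right-hand column can be checked directly from C's <float.h>; the short sketch below is only a sanity check of the quoted values (it assumes unsigned int is 32 bits wide on your machine).

/* Sanity check of the single precision limits quoted in the table above. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("largest float      : %.11e\n", FLT_MAX);      /* ~3.40282346639e+38 */
    printf("relative precision : %.11e\n", FLT_EPSILON);  /* ~1.19209289551e-07 */
    printf("largest 32-bit uint: %X\n", 0xFFFFFFFFu);     /* FFFFFFFF           */
    return 0;
}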

B) Floating point numbers resemble scientific notation

Scientific notation: s * m x 10^e where

  • s is the sign
  • m is the mantissa
  • e is the exponent
  • 10 is the base

Example: Convert the number -1234.5678 to scientific notation.

  • s = (-1)
  • m = (1.2345678)
  • e =  (3)
  • answer = -1.2345678x10^3

C) Normalized numbers

Just as in scientific notation, floating point mantissas must be normalized.   Normalized form means that there is exactly one non-zero digit to the left of the decimal point and the rest remain to the right of it.  To normalize a number we shift the decimal point and adjust the exponent to compensate.

To normalize -123.45678x10^1 all we have to do is shift the decimal point two places to the left.  Each time we move the decimal point one place to the left we increase the exponent by 1, which gives -1.2345678x10^3.  The same thing applies to the denormalized number -0.0012345678x10^6, except this time we move the decimal point to the right and decrease the exponent by 1 for each shift, again giving -1.2345678x10^3.


D) Floating Point Notation

We will focus on the IEEE 754 32-bit single precision floating point standard.   Once you understand this notation you can adapt it to the other floating point formats.

Floating point numbers are nothing more than binary numbers in a certain predefined form.  The form consists of three main parts:

  • Sign - 0 is positive, 1 is negative (1 bit)
  • Exponent - gives the floating point number range (8 bits)
  • Mantissa - gives the floating point number accuracy (23 bits)

     BYTE 1     |      BYTE 2     |      BYTE 3     |      BYTE 4
S E E E E E E E | E M M M M M M M | M M M M M M M M | M M M M M M M M

The exponent is stored with a bias of 127.  This enables it to represent very small numbers along with very large numbers.  When decoding the exponent you need to subtract 127 to get the actual exponent.  If your exponent field is 10000000b (128), subtract 127 to get the actual exponent of 1.  If your exponent field is 01111100b (124), subtract 127 to get the actual exponent of -3.  There are some special cases, which we will learn about later, that limit the exponent to a range of -126 to 127.  Since we are dealing in binary the base is 2, unlike scientific notation which has base 10.

The mantissa is 23 bits long with a hidden bit at the beginning.  Remember what we said about normalizing numbers?  The mantissa must have a 1 at the beginning.  If it requires a 1 every time, why not assume it is always there and free up an extra bit for the mantissa?  By doing this we effectively extend the mantissa to 24 bits - the assumed 1 plus the 23 stored bits.  There are some special cases where the hidden bit is a 0, which we will cover later.
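
A minimal C sketch of this layout, assuming unsigned int is 32 bits wide: it copies a float's bytes into an integer, masks out the three fields, removes the 127 bias and restores the hidden bit.

/* Sketch: split a 32-bit float into sign, exponent and mantissa fields.
   Assumes unsigned int is 32 bits wide. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = -1234.5678f;
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);                  /* reinterpret the 4 bytes */

    unsigned int sign     = bits >> 31;              /* 1 bit                   */
    unsigned int exponent = (bits >> 23) & 0xFF;     /* 8 bits, biased by 127   */
    unsigned int mantissa = bits & 0x7FFFFF;         /* 23 stored bits          */

    printf("sign = %u, biased exponent = %u, actual exponent = %d\n",
           sign, exponent, (int)exponent - 127);
    printf("24-bit mantissa (hidden bit restored) = %06X\n", mantissa | 0x800000);
    return 0;
}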


E) Conversions

Example: #E0781CF8h

     BYTE 1     |      BYTE 2     |      BYTE 3     |      BYTE 4
1 1 1 0 0 0 0 0 | 0 1 1 1 1 0 0 0 | 0 0 0 1 1 1 0 0 | 1 1 1 1 1 0 0 0

The value is (-1)^s * m x 2^e where:

  • Sign = 1 is negative
  • Exponent = (192 - 127) = 65
  • Mantissa = 1.11110000   00111001  11110000  (remember the hidden bit)
  • Answer = -1.11110000001110011111000x2^65   (binary mantissa)
  • Answer = -1.F039F0x2^65  (hex mantissa)
  • Answer = -1.93838405609x2^65  (decimal mantissa)
  • Answer = -7.15137491985x10^19  (decimal answer in scientific notation; checked in the sketch below)
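
The decoding can be verified in C. This is only a sketch (again assuming a 32-bit unsigned int): it rebuilds (-1)^s * m x 2^e with ldexp and compares the result against the value obtained by reinterpreting the same 32 bits as a float.

/* Sketch: decode the bit pattern E0781CF8 by hand and compare against
   what the FPU sees when the same 32 bits are treated as a float. */
#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    unsigned int bits = 0xE0781CF8u;

    int          sign     = bits >> 31;
    int          exponent = (int)((bits >> 23) & 0xFF) - 127;  /* remove the bias    */
    unsigned int mantissa = (bits & 0x7FFFFF) | 0x800000;      /* restore hidden bit */

    /* value = (-1)^s * (mantissa / 2^23) * 2^exponent */
    double value = ldexp((double)mantissa / 0x800000, exponent);
    if (sign) value = -value;

    float f;
    memcpy(&f, &bits, sizeof f);

    printf("by hand: %.11e\n", value);   /* about -7.15137491985e+19 */
    printf("by FPU : %.11e\n", f);
    return 0;
}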

Example: 1234.5678d

  • Convert to binary: 10011010010.1001000101011011b
  • Normalize (shift the decimal point left and calculate the exponent)
  • Normalized: 1.00110100101001000101011011b (with exponent 10)
  • Sign = 0 (positive)
  • Exponent = (10 + 127) = 137 = 10001001b
  • Mantissa = 1.00110100   10100100  01010110  (some bits get discarded)

     BYTE 1     |      BYTE 2     |      BYTE 3     |      BYTE 4
0 1 0 0 0 1 0 0 | 1 0 0 1 1 0 1 0 | 0 1 0 1 0 0 1 0 | 0 0 1 0 1 0 1 1

In hex: #449A522Bh
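
The same result can be checked by letting a C compiler do the encoding. This is only a sketch (assuming a 32-bit unsigned int); note that the compiler rounds the discarded bits to nearest rather than simply dropping them, which happens to give the same pattern for this value.

/* Sketch: encode 1234.5678 as a single precision float and dump the bits.
   Expected output: #449A522Bh */
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1234.5678f;
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the float's 4 bytes into an integer */
    printf("#%08Xh\n", bits);
    return 0;
}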


F) Special Cases

Zeros

+0.0 and -0.0 are represented by an exponent of zero and a mantissa of zero.  The hidden bit is assumed to be 0 when the exponent is zero.

00000000 | 00000000 | 00000000 | 00000000 = +0.0
10000000 | 00000000 | 00000000 | 00000000 = -0.0

Infinities

+infinity and -infinity are represented by having an exponent of all ones and a mantissa of zero.

01111111 | 10000000 | 00000000 | 00000000 = +infinity
11111111 | 10000000 | 00000000 | 00000000 = -infinity

NaNs - (Not a Number)  e.g. 0/0

NaNs are represented by having an exponent of all ones and a mantissa of non-zero.

01111111 | 11000000 | 00000000 | 00000000 = NaN
11111111 | 11111000 | 00010000 | 00011000 = NaN

Denormals

Denormals are numbers that are not normalized.  The exponent is all zeros and the mantissa is non-zero.  The hidden bit is assumed to be 0 since the exponent is zero.   Denormals represent numbers that are very close to zero.

00000000 | 01000000 | 00000000 | 00000000 = +5.87747x10^-39
10000000 | 00000011 | 11000000 | 10000001 = -3.445639x10^-40

As you can see, this is why the exponent of a normalized number can only go from -126 to 127.  An exponent field of all zeros is reserved for denormals and zero.   An exponent field of all ones (128 after removing the bias) is reserved for infinities and NaNs.
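
These rules translate directly into code. The sketch below uses a hypothetical helper called classify() (not a standard library routine, and again assuming a 32-bit unsigned int) that looks only at the exponent and mantissa fields.

/* Sketch: classify a 32-bit pattern using only the special-case rules above.
   classify() is a hypothetical helper, not a standard library routine. */
#include <stdio.h>

static const char *classify(unsigned int bits)
{
    unsigned int exponent = (bits >> 23) & 0xFF;   /* 8-bit exponent field  */
    unsigned int mantissa = bits & 0x7FFFFF;       /* 23-bit mantissa field */

    if (exponent == 0)
        return mantissa == 0 ? "zero" : "denormal";
    if (exponent == 0xFF)
        return mantissa == 0 ? "infinity" : "NaN";
    return "normalized";
}

int main(void)
{
    printf("%s\n", classify(0x00000000u));  /* +0.0                  */
    printf("%s\n", classify(0x80000000u));  /* -0.0                  */
    printf("%s\n", classify(0x7F800000u));  /* +infinity             */
    printf("%s\n", classify(0x7FC00000u));  /* NaN                   */
    printf("%s\n", classify(0x00400000u));  /* denormal from above   */
    printf("%s\n", classify(0xE0781CF8u));  /* normalized (section E) */
    return 0;
}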


G) Other Notations

IEEE Double Precision uses 64 bits: 1 for the sign, 11 for the exponent and 52 for the mantissa.

IEEE Extended Precision uses 80 bits: 1 for the sign, 15 for the exponent and 64 for the mantissa (with no hidden bit).


H) Representation Problems

Arithmetic Overflow

Numbers that are too big to be represented in floating point notation.  9.9999x10^500

Arithmetic Underflow

Numbers that are too small to be represented in floating point notation. 1x10^-500

Cancellation Error

Adding a very small number to a very large number results in no change to the larger.
     1x10^20 + 1x10^-20 = 1x10^20
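
A short C check of this effect (ordinary single precision arithmetic, nothing more is assumed):

/* Sketch: the tiny addend is completely absorbed - the sum compares
   equal to the large number in single precision. */
#include <stdio.h>

int main(void)
{
    float big   = 1e20f;
    float small = 1e-20f;
    float sum   = big + small;

    printf("sum == big ? %s\n", (sum == big) ? "yes" : "no");  /* prints yes */
    return 0;
}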


I) Addition in Floating Point

  1. Binary shift the smaller number's mantissa right until its exponent matches the larger number's exponent
  2. Perform a 24-bit addition on the mantissas
  3. Add the carry (if any) to the exponent and renormalize the mantissa if necessary

Example: #7271A05Fh + #702B847Ch (4.785905E+30 + 2.123284E+29)

1) #702B847Ch is smaller so shift mantissa right until exponent matches larger number (larger exponent is 101d)

mantissa = AB847C, exponent = 97d   (mantissa of #702B847Ch with hidden bit!)
mantissa = 55C23E, exponent = 98d   (shift right)
mantissa = 2AE11F, exponent = 99d    (shift right)
mantissa = 15708F, exponent = 100d   (shift right)
mantissa = 0AB847, exponent = 101d   (shift right and now exponents match)

2) Perform 24 bit addition on the two mantissas.

  71A05F
+ 0AB847
  7C58A6  (no carry)

3) No carry, so the exponent stays the same and the larger number's hidden bit (left off in the addition above) remains 1.  Combine the sign, exponent and mantissa back together to form the new floating point answer.  #727C58A6h = 4.998233E+30
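
The three steps can be traced in C on the same two patterns. This is only a sketch of the case worked above - both operands positive and normalized, shifted-out bits truncated - not a complete adder.

/* Sketch: the three addition steps applied to the example patterns.
   Handles only the case worked above: both operands positive and
   normalized, second operand smaller, discarded bits truncated. */
#include <stdio.h>

int main(void)
{
    unsigned int a = 0x7271A05Fu;                        /* 4.785905E+30 */
    unsigned int b = 0x702B847Cu;                        /* 2.123284E+29 */

    unsigned int ea = (a >> 23) & 0xFF;                  /* biased exponents */
    unsigned int eb = (b >> 23) & 0xFF;
    unsigned int ma = (a & 0x7FFFFF) | 0x800000;         /* 24-bit mantissas */
    unsigned int mb = (b & 0x7FFFFF) | 0x800000;         /* with hidden bits */

    /* 1) shift the smaller mantissa right until the exponents match */
    while (eb < ea) { mb >>= 1; eb++; }

    /* 2) 24-bit addition of the mantissas */
    unsigned int sum = ma + mb;

    /* 3) if the sum carried into bit 24, shift right and bump the exponent */
    if (sum & 0x1000000) { sum >>= 1; ea++; }

    unsigned int result = (ea << 23) | (sum & 0x7FFFFF); /* sign bit is 0 here */
    printf("#%08Xh\n", result);                          /* prints #727C58A6h  */
    return 0;
}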


J) Multiplication in Floating Point

  1. Add the exponents without bias and then re-bias
  2. Perform a 24-bit multiply of the mantissas
  3. Normalize the result
  4. Keep only 23 bits and combine them with the exponent to get the answer

Example: #42421010h * #44003311h (48.51569 * 512.7979)

  • Exponents are 5d and 9d so the new exponent is 5 + 9 = 14  (141 with the bias added back)
  • Multiply mantissas = C21010 * 803311 = 612EBE164110   (48 bit answer)
  • Normalize result: (shift 1 time to the left) = C25D7C2C8220
  • Keep the 24 most significant bits.  C25D7C  (23 + hidden bit)
  • Hide the MSB => 425D7C  (now it's 23 bits)

#46C25D7Ch = 24878.7421875
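
And the four steps in C, again only a sketch for this particular case (positive operands, the 48-bit product held in a 64-bit integer, extra bits truncated as in the text):

/* Sketch: the four multiplication steps applied to the example patterns.
   Positive operands only; the 48-bit mantissa product is held in a
   64-bit integer and the extra bits are truncated, as in the text. */
#include <stdio.h>

int main(void)
{
    unsigned int a = 0x42421010u;                            /* 48.51569 */
    unsigned int b = 0x44003311u;                            /* 512.7979 */

    int ea = (int)((a >> 23) & 0xFF) - 127;                  /* unbiased exponent: 5 */
    int eb = (int)((b >> 23) & 0xFF) - 127;                  /* unbiased exponent: 9 */
    unsigned long long ma = (a & 0x7FFFFF) | 0x800000;       /* mantissa C21010 */
    unsigned long long mb = (b & 0x7FFFFF) | 0x800000;       /* mantissa 803311 */

    /* 1) add the exponents, then re-bias */
    int e = ea + eb + 127;                                   /* 141 */

    /* 2) 24-bit x 24-bit multiply gives a 48-bit product */
    unsigned long long product = ma * mb;                    /* 612EBE164110 */

    /* 3) normalize: if bit 47 is clear (as here) shift left once;
          otherwise the leading 1 is already in place and the exponent
          goes up by one */
    if ((product & 0x800000000000ULL) == 0) product <<= 1;
    else                                    e++;

    /* 4) keep the top 24 bits, drop the hidden bit, combine with the exponent */
    unsigned int mantissa = (unsigned int)(product >> 24) & 0x7FFFFF;
    unsigned int result   = ((unsigned int)e << 23) | mantissa;

    printf("#%08Xh\n", result);                              /* prints #46C25D7Ch */
    return 0;
}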

This document is (c) 1998 Doug Sapp.
The contents may not be reproduced without consent of the author.
All statements of fact made in the document are true to the best of the author's knowledge and if incorrect reflect errors that are solely his own.