A binary floating-point number is represented with a fraction (F) and an exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision. In 32-bit single-precision floating-point representation:
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 8 bits represent the exponent (E).
The remaining 23 bits represent the fraction (F).
Normalized Form
Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000B = 1 + 1×2^-2 + 1×2^-3 = 1.375D.
The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative numbers. In this example with S=1, this is a negative number, i.e., -1.375D.
In normalized form, the actual exponent is E-127 (the so-called excess-127 or bias-127 scheme). This is because we need to represent both positive and negative exponents. With an 8-bit E ranging from 0 to 255, the excess-127 scheme provides actual exponents from -127 to 128. In this example, E-127 = 129-127 = 2D.
Hence, the number represented is -1.375×2^2 = -5.5D.
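As a quick cross-check in Java (the hexadecimal constant below is simply the 32-bit pattern above regrouped into hex digits):
// 1 10000001 01100000000000000000000 written in hexadecimal is 0xC0B00000
System.out.println(Float.intBitsToFloat(0xC0B00000));   // prints -5.5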
De-Normalized Form
Normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself of this!
De-normalized form was devised to represent zero and other very small numbers.
For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction, and the actual exponent is always -126. Hence, the number zero can be represented with E=0 and F=0 (because 0.0 × 2^-126 = 0).
We can also represent very small positive and negative numbers in de-normalized form with E = 0. For example, if S = 1, E = 0, and F = 011 0000 0000 0000 0000 0000, the actual fraction is 0.011B = 1×2^-2 + 1×2^-3 = 0.375D. Since S = 1, it is a negative number. With E = 0, the actual exponent is -126. Hence the number is -0.375 × 2^-126 ≈ -4.4 × 10^-39, which is an extremely small negative number (close to zero).
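Again, this can be cross-checked with the JDK (the hexadecimal constant is just the bit pattern above):
// 1 00000000 01100000000000000000000 written in hexadecimal is 0x80300000
System.out.println(Float.intBitsToFloat(0x80300000));   // prints a value close to -4.4E-39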
Summary
In summary, the value (N) is calculated as follows:
For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127). These numbers are in the normalized form: the fraction has an implicit leading 1, and the actual exponent is E-127.
For E = 0, N = (-1)^S × 0.F × 2^(-126). These numbers are in the de-normalized form: the fraction has an implicit leading 0, and the actual exponent is fixed at -126.
For E = 255, the bit patterns represent special values such as ±INF (infinity) and NaN (not a number), covered in the Special Values section below.
Example 1: Suppose that IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000.
Sign bit S = 0 ⇒ positive number
E = 1000 0000B = 128D (in normalized form)
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D
The number is +1.75 × 2^(128-127) = +3.5D
Example 2: Suppose that IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000.
Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D
Example 3: Suppose that IEEE-754 32-bit floating-point representation pattern is 1 01111110 000 0000 0000 0000 0000 0001.
Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 + 2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625D
Example 4: (De-Normalized Form): Suppose that IEEE-754 32-bit floating-point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001.
Sign bit S = 1 ⇒ negative number
E = 0 (in de-normalized form)
Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) = 1×2^-23
The number is -2^-23 × 2^(-126) = -2^(-149) ≈ -1.4×10^-45
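The four examples above can be reproduced directly with the JDK method mentioned in the Java notes below; each hexadecimal constant is the corresponding 32-bit pattern written in hex:
System.out.println(Float.intBitsToFloat(0x40600000));   // Example 1: 3.5
System.out.println(Float.intBitsToFloat(0xBF400000));   // Example 2: -0.75
System.out.println(Float.intBitsToFloat(0xBF000001));   // Example 3: -0.50000006
System.out.println(Float.intBitsToFloat(0x80000001));   // Example 4: -1.4E-45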
The representable range of 32-bit floating-point numbers is summarized below.
For normalized numbers:
Largest positive number: S=0, E=1111 1110 (254), F=111 1111 1111 1111 1111 1111, i.e., (2 - 2^-23) × 2^127 ≈ 3.4 × 10^38.
Smallest positive number: S=0, E=0000 0001 (1), F=000 0000 0000 0000 0000 0000, i.e., 1.0 × 2^-126 ≈ 1.2 × 10^-38.
The negative range is the same as above, but with S=1.
For de-normalized numbers:
Largest positive number: S=0, E=0, F=111 1111 1111 1111 1111 1111, i.e., (1 - 2^-23) × 2^-126 ≈ 1.2 × 10^-38.
Smallest positive number: S=0, E=0, F=000 0000 0000 0000 0000 0001, i.e., 2^-149 ≈ 1.4 × 10^-45.
The negative range is the same as above, but with S=1.
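These extremes are also available in the JDK as named constants; the short sketch below (purely illustrative) prints each one together with its raw bit pattern:
System.out.println(Float.MAX_VALUE + " = 0x" + Integer.toHexString(Float.floatToIntBits(Float.MAX_VALUE)));     // 3.4028235E38 = 0x7f7fffff
System.out.println(Float.MIN_NORMAL + " = 0x" + Integer.toHexString(Float.floatToIntBits(Float.MIN_NORMAL)));   // 1.17549435E-38 = 0x800000
System.out.println(Float.MIN_VALUE + " = 0x" + Integer.toHexString(Float.floatToIntBits(Float.MIN_VALUE)));     // 1.4E-45 = 0x1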
Notes For Java Users
You can use the JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or a double-precision 64-bit double with a specific bit pattern, and print their values. For example,
System.out.println(Float.intBitsToFloat(0x7fffff));              // S=0, E=0, F=all 1s: the largest de-normalized float
System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));  // decodes the given 64-bit pattern as a double
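Going the other way, the decoding rules summarized earlier can be coded up directly. The method below is only an illustrative sketch (the name decodeSingle is ours, not a JDK API); for the patterns used in this article it agrees with Float.intBitsToFloat.
// Illustrative sketch: decode a 32-bit pattern using the formulas above.
static double decodeSingle(int bits) {
    int s = (bits >>> 31) & 0x1;                 // sign bit
    int e = (bits >>> 23) & 0xFF;                // 8-bit exponent field
    int f = bits & 0x7FFFFF;                     // 23-bit fraction field
    double sign = (s == 0) ? 1.0 : -1.0;
    if (e == 255) {                              // special values: infinity and NaN
        return (f == 0) ? sign * Double.POSITIVE_INFINITY : Double.NaN;
    }
    double fraction = f / (double) (1 << 23);    // the 23 fraction bits read as 0.F
    if (e == 0) {                                // de-normalized: 0.F × 2^-126
        return sign * fraction * Math.pow(2, -126);
    }
    return sign * (1.0 + fraction) * Math.pow(2, e - 127);   // normalized: 1.F × 2^(E-127)
}
For instance, decodeSingle(0xC0B00000) evaluates to -5.5, matching the worked example above.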
The value (N) is calculated as follows, depending on whether the number is in normalized or denormalized form:
Normalized Floating-Point Numbers
In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^-23, 1.001011B×2^11. For a binary number, the leading bit is always 1 and need not be represented explicitly, which saves 1 bit of storage.
In IEEE 754's normalized form, E is neither all 0s nor all 1s (i.e., 1 ≤ E ≤ 254 for single precision), the fraction takes an implicit leading 1 (1.F), and the value is N = (-1)^S × 1.F × 2^(E-127).
Take note that an n-bit pattern has a finite number of combinations (= 2^n) and can therefore represent only finitely many distinct numbers. It is not possible to represent the infinitely many numbers on the real axis (even a small range, say 0.0 to 1.0, contains infinitely many numbers). Hence, not all floating-point values can be represented exactly; the closest representable approximation is used instead, which leads to loss of accuracy.
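As a small illustration (using the exact-value constructor of java.math.BigDecimal), the decimal value 0.1 has no finite binary representation, so what is actually stored is the nearest representable double:
// Print the exact value of the double closest to 0.1
System.out.println(new java.math.BigDecimal(0.1));
// prints 0.1000000000000000055511151231257827021181583404541015625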
The minimum and maximum positive normalized single-precision numbers are 1.0 × 2^-126 ≈ 1.18 × 10^-38 and (2 - 2^-23) × 2^127 ≈ 3.4 × 10^38, respectively; the same magnitudes with S=1 give the negative range.
Denormalized Floating-Point Numbers
If E = 0 but the fraction is non-zero, then the value is in denormalized form. A leading bit of 0 is assumed for the fraction and the actual exponent is fixed at -126, i.e., N = (-1)^S × 0.F × 2^(-126).
Denormalized form can represent very small numbers close to zero, as well as zero itself, which cannot be represented in normalized form.
The minimum and maximum positive denormalized single-precision numbers are 2^-149 ≈ 1.4 × 10^-45 and (1 - 2^-23) × 2^-126 ≈ 1.18 × 10^-38, respectively.
Special Values
Zero: E=0 and F=0. Both +0 (S=0) and -0 (S=1) exist.
Infinity: E=255 (all 1s) and F=0, with the sign given by S (+INF and -INF).
NaN (not a number): E=255 (all 1s) and F≠0.
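These special patterns can also be inspected with the same JDK method used earlier (the hexadecimal constants below are simply patterns with an all-ones exponent field):
System.out.println(Float.intBitsToFloat(0x7F800000));   // E=255, F=0, S=0: Infinity
System.out.println(Float.intBitsToFloat(0xFF800000));   // E=255, F=0, S=1: -Infinity
System.out.println(Float.intBitsToFloat(0x7FC00000));   // E=255, F≠0: NaN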