
Number Representations & Computer Arithmetic (Fixed & Floating point)

Floating Point Representation

This section explains the 32-bit IEEE-754 single-precision floating point representation used by most computers. The representation encodes a wide range of real numbers in a fixed 32-bit pattern using three fields: sign, exponent and fraction (mantissa). The stored bits and an implicit leading bit together form the significand (also called mantissa or fraction value).

Structure of a 32-bit (single-precision) floating point word

  • Sign - 1 bit (most significant bit). Value 0 means positive, value 1 means negative.
  • Exponent - 8 bits. Encoded with a bias. The bias for an exponent field of k bits is 2^(k-1) - 1. For k = 8, bias = 2^7 - 1 = 127.
  • Fraction (stored mantissa) - 23 bits. These bits store the fractional part of the significand. For normalized numbers an implicit leading 1 is assumed so the effective significand has 24 bits (1.fraction). For subnormal numbers the implicit leading 1 is 0 and the significand is 0.fraction.

Reconstructing the numeric value from the bit fields

Let S be the sign bit, e the unsigned integer value of the exponent field, and f the fractional value represented by the 23 fraction bits (f = b1/2 + b2/4 + b3/8 + ... where b1,b2,... are fraction bits). The value represented is:

For normalized numbers (1 ≤ e ≤ 254):

(-1)^S × (1 + f) × 2^(e - bias)

For subnormal numbers (e = 0 and f ≠ 0):

(-1)^S × (0 + f) × 2^(1 - bias)

Special encodings (e = 0 or e = 255):

  • If e = 255 and fraction = 0: value is signed infinity (±∞ depending on S).
  • If e = 255 and fraction ≠ 0: value is NaN (Not a Number).
  • If e = 0 and fraction = 0: value is signed zero (+0 or -0).
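The reconstruction rules above can be sketched in Python. The function name is illustrative, and the constants 127 (bias) and 255 (maximum exponent field) follow the single-precision layout described in this section:

```python
def decode_ieee754_single(bits):
    """Decode a 32-bit integer holding an IEEE-754 single-precision
    pattern into a Python float, following the rules above."""
    S = (bits >> 31) & 0x1            # sign bit
    e = (bits >> 23) & 0xFF           # 8-bit exponent field
    frac = bits & 0x7FFFFF            # 23 stored fraction bits
    f = frac / 2**23                  # fractional value of the significand
    sign = -1.0 if S else 1.0
    if e == 255:                      # special encodings
        if frac == 0:
            return sign * float('inf')
        return float('nan')
    if e == 0:                        # subnormal (or signed zero): no implicit 1
        return sign * f * 2.0**(1 - 127)
    return sign * (1.0 + f) * 2.0**(e - 127)  # normalized: implicit leading 1
```

For example, `decode_ieee754_single(0b11000001110100000000000000000000)` reproduces the -26 worked out in Example 1 below.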

Example 1: Binary (32 bits) → Decimal

Given IEEE-754 word: 11000001110100000000000000000000

Stepwise reconstruction (each line is one step of reasoning):

Sign bit S = 1, so the number is negative.

Exponent bits = 10000011 (the 8 bits after the sign).

Exponent field value e = 10000011₂ = 131₁₀.

Bias = 127, so exponent E = e - bias = 131 - 127 = 4.

Fraction bits (23 bits) = 10100000000000000000000₂.

Fractional value f = 1×(1/2) + 0×(1/4) + 1×(1/8) + 0×(1/16) + ... = 0.5 + 0 + 0.125 = 0.625.

Significand = 1 + f = 1.625 (implicit leading 1 for normalized numbers).

Numeric value = (-1)^1 × 1.625 × 2^4.

Evaluate 2^4 = 16 and 1.625 × 16 = 26.

Final value = -26.
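The stepwise result can be cross-checked with Python's standard struct module, which reinterprets the raw 32-bit pattern as a hardware single-precision float:

```python
import struct

# Reinterpret the 32-bit pattern from Example 1 as a single-precision float:
# pack the integer as a big-endian 32-bit word, unpack it as a float.
bits = 0b11000001110100000000000000000000   # 0xC1D00000
value = struct.unpack('>f', struct.pack('>I', bits))[0]
print(value)   # -26.0
```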

Conversion procedure: Decimal → IEEE-754 (single precision)

General steps to convert a real decimal number to 32-bit IEEE-754 single precision:

  1. Determine the sign bit: 0 if the number is ≥ 0, 1 if it is negative.
  2. Work with the absolute value of the number and convert it to binary (integer and fractional parts separately).
  3. Normalize the binary number so it is in the form 1.xxxxx × 2E for a non-zero normalized value. Count E (the exponent) accordingly.
  4. Compute the biased exponent e = E + bias (bias = 127 for single precision). If the biased exponent fits in 1..254, write it as the 8-bit exponent field.
  5. Take the fractional part after the leading 1 (the bits after the binary point in 1.xxxxx) and fill or truncate it to 23 bits to form the fraction field. Apply the chosen rounding mode if truncation is required (default IEEE rounding is round to nearest, ties to even).
  6. Handle special cases: if the true exponent E is too small to be represented as a normal number, produce a subnormal encoding if possible; if E is too large, produce ±∞ or raise overflow according to implementation.
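One way to sketch the overall procedure in Python is to let the standard struct module produce the bit pattern (it applies the default round-to-nearest, ties-to-even conversion) and then split out the three fields; the helper name is illustrative:

```python
import struct

def to_ieee754_fields(x):
    """Convert x to IEEE-754 single precision (rounding handled by
    struct) and split the 32-bit pattern into sign, exponent, fraction."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31            # step 1: sign bit
    exponent = (bits >> 23) & 0xFF   # step 4: biased exponent field
    fraction = bits & 0x7FFFFF   # step 5: 23-bit stored fraction
    return sign, exponent, fraction
```

For instance, `to_ieee754_fields(-17.0)` yields sign 1, exponent field 131, and fraction 0001 followed by 19 zeros, matching Example 2 below.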

Example 2: Convert -17 to 32-bit IEEE-754

Target value: -17

Stepwise conversion (each line is one step of reasoning):

Sign bit S = 1 because the number is negative.

Absolute value 17 in binary is 10001₂.

Normalize: 10001₂ = 1.0001₂ × 2^4, so E = 4.

Bias = 127, so biased exponent e = E + bias = 4 + 127 = 131.

Exponent field (8 bits) = 131₁₀ = 10000011₂.

The fractional part after the leading 1 is 0001; append zeros to fill 23 bits: 00010000000000000000000.

Putting fields together: sign = 1, exponent = 10000011, fraction = 00010000000000000000000.

Final 32-bit IEEE-754 representation: 1 10000011 00010000000000000000000.
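As a cross-check, packing -17 with Python's struct module reproduces the same 32-bit pattern:

```python
import struct

# Pack -17.0 as single precision and show the raw 32-bit pattern,
# split into sign (1 bit), exponent (8 bits), fraction (23 bits).
bits = struct.unpack('>I', struct.pack('>f', -17.0))[0]
pattern = format(bits, '032b')
print(pattern[0], pattern[1:9], pattern[9:])
# 1 10000011 00010000000000000000000
```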

Important notes, limits and special values

  • Hidden (implicit) bit: For normalized numbers the leading bit of the significand is implicitly 1 and is not stored; that is why the stored 23 bits plus implicit 1 give 24 significant bits of precision.
  • Precision: Single precision provides about 24 bits of binary precision ≈ 7 decimal digits of precision.
  • Range: Normalized exponent E ranges from -126 to +127 (that is, e = 1 to 254). The largest finite single-precision value is approximately (2 - 2^-23) × 2^127 ≈ 3.4028235 × 10^38. The smallest positive normalized value is 2^-126 ≈ 1.17549435 × 10^-38. Subnormal numbers allow smaller magnitudes down to 2^-149 ≈ 1.40129846 × 10^-45.
  • Subnormal (denormal) numbers: When the exponent field e = 0 and the fraction ≠ 0, the number is subnormal and the significand does not have an implicit leading 1. Subnormals fill the gap between zero and the smallest normalized number, at reduced precision.
  • Zeros, infinities and NaNs: Exponent e = 0 and fraction = 0 encodes ±0. Exponent e = 255 and fraction = 0 encodes ±∞. Exponent e = 255 and fraction ≠ 0 encodes NaN (signalling or quiet NaN depending on fraction bits).
  • Rounding and exceptions: When the exact value cannot be represented in 23 fraction bits the value is rounded. The default IEEE mode is round to nearest, ties to even. Overflow, underflow, inexact and invalid operations are handled according to IEEE-754 rules and may set floating-point status flags in hardware or software.
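The limits and special values listed above can be probed with a small Python sketch; the helper name f32 is illustrative:

```python
import struct

def f32(bits):
    # Reinterpret a 32-bit integer pattern as a single-precision float.
    return struct.unpack('>f', struct.pack('>I', bits))[0]

largest_normal   = f32(0x7F7FFFFF)   # e = 254, fraction all ones
smallest_normal  = f32(0x00800000)   # e = 1, fraction zero: 2^-126
smallest_subnorm = f32(0x00000001)   # e = 0, lowest fraction bit: 2^-149
pos_inf          = f32(0x7F800000)   # e = 255, fraction zero
a_nan            = f32(0x7FC00000)   # e = 255, fraction nonzero
```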

Common pitfalls for students

  • Confusing the number of stored fraction bits (23) with the effective significant bits (24 including the implicit leading 1 for normalized numbers).
  • Forgetting that e = 0 encodes a subnormal (no implicit leading 1) and that e = 255 encodes a special value (±∞ or NaN).
  • When converting decimal fractions to binary, repeating binary fractions occur frequently; determine enough bits and then round according to IEEE rules rather than truncating without care.

Summary: A 32-bit IEEE-754 floating point number encodes sign, biased exponent and a stored fraction. The implicit leading 1 for normalized values gives 24 bits of significand precision. Always handle the special exponent patterns e = 0 and e = 255 separately. Follow the normalization, biasing and rounding rules when converting between decimal and binary floating point representations.

The document Number Representations & Computer Arithmetic (Fixed & Floating point) is a part of the Computer Science Engineering (CSE) Course Digital Logic.

FAQs on Number Representations & Computer Arithmetic (Fixed & Floating point)

1. What are number representations in computer arithmetic?
Ans. Number representations in computer arithmetic refer to the different ways in which numbers are stored and manipulated in a computer system. This includes fixed-point and floating-point representations. Fixed-point representations use a fixed number of bits to represent both the integer and fractional parts of a number, while floating-point representations use a dynamic range of bits to represent numbers with a varying magnitude and precision.
2. What is the difference between fixed-point and floating-point representations?
Ans. The main difference between fixed-point and floating-point representations is how they handle the radix point (the binary analogue of the decimal point). In fixed-point representations, the radix point is fixed at a specific position, typically dividing the bits into integer and fractional parts. Floating-point representations, on the other hand, allow the radix point to "float" and adjust its position based on the magnitude of the number, providing a wider range for representing numbers.
3. How does a computer perform arithmetic operations on fixed-point numbers?
Ans. Arithmetic operations on fixed-point numbers are performed using the same techniques as integer arithmetic. The computer interprets the fixed-point numbers as integers and applies the appropriate arithmetic operations, such as addition, subtraction, multiplication, and division. However, care must be taken to handle overflow or underflow situations, where the result of an operation exceeds the range of the fixed-point representation.
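As an illustration of the integer-arithmetic view described above, here is a minimal Q16.16 fixed-point sketch in Python; the format choice and helper names are assumptions for the example:

```python
FRAC_BITS = 16            # Q16.16: 16 integer bits, 16 fraction bits
SCALE = 1 << FRAC_BITS    # 65536

def to_fixed(x):
    # Scale a real number into the fixed-point integer representation.
    return int(round(x * SCALE))

def fixed_mul(a, b):
    # The raw product carries SCALE twice; shift right once to rescale.
    return (a * b) >> FRAC_BITS

def to_float(a):
    return a / SCALE

a = to_fixed(3.5)
b = to_fixed(2.25)
print(to_float(fixed_mul(a, b)))   # 7.875
```

Addition and subtraction need no rescaling, which is why fixed-point arithmetic reduces to plain integer arithmetic plus careful overflow handling.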
4. What are the advantages of using floating-point representations in computer arithmetic?
Ans. Floating-point representations offer several advantages in computer arithmetic. Firstly, they provide a wider range of representable numbers, allowing for both very small and very large values. Secondly, they offer higher precision, allowing for more accurate calculations with decimal numbers. Additionally, floating-point representations also support special values like infinity and NaN (Not a Number), which can be useful in certain scientific or engineering applications.
5. What are the limitations of floating-point representations in computer arithmetic?
Ans. While floating-point representations offer increased range and precision, they also have certain limitations. One limitation is the loss of precision when performing operations on numbers with significantly different magnitudes, known as the "floating-point rounding error." Another limitation is the inability to represent some decimal numbers exactly due to the finite number of bits available. These limitations can lead to inaccuracies in certain calculations, requiring careful consideration and implementation of algorithms to minimize errors.
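Python's floats are double precision rather than single, but the same representation limit shows up directly: 0.1 and 0.2 are repeating fractions in binary, so their sum is not exactly 0.3 in any binary floating-point format.

```python
import math

total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False

# Compare with a tolerance instead of exact equality.
print(math.isclose(total, 0.3))   # True
```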