My quest to implement IEEE754 floating point numbers

GitHub repository.

So we had to implement IEEE754 as an assignment. We were told by Павел Скаков that only one student was able to finish this assignment with 100% test completion. So I was up for the challenge. High testosterone levels apparently make good programmers. (I am not joking. There's an actual correlation. Being an engineer is generally a manly trait. Can't argue with biology. Though, autistic people of any gender can be good at programming; but I digress)

In this quest to implement IEEE754 numbers, a small library called Berkeley TestFloat was really helpful. Basically, finding good tests is half the task done.

You probably already know the basics of IEEE754 floating point format, so I won't bore you with another detailed explanation. Still, here's a quick refresher. A 32-bit single precision floating point number can be stored in a C struct like this:


    struct Float32 {
        uint32_t s : 1; // sign bit
        uint32_t e : 8; // exponent
        uint32_t m : 23; // mantissa
    }
        
(bitfields allow using a set number of bits for every field)
Single-precision bit pattern Value
0 < e < 255 (-1)s × 2e-127 × 1.m (normal numbers)
e = 0; m ≠ 0 (at least one bit in m is nonzero) (-1)s × 2-126 × 0.m (subnormal numbers)
s = 0; e = 255; m = 0 (all bits in m are zero) +INF (positive infinity)
s = 1; e = 255; m = 0 (all bits in m are zero) -INF (negative infinity)
e = 255; m ≠ 0 (at least one bit in m is nonzero NaN (Not-a-Number)

Rounding

Weirdly, we have to discuss rounding first, because this is going to define the operations that we do. I've implemented four rounding modes:

  1. To nearest, ties to even:
    Round to closest. If directly at the middle, round to even. This can be implemented by maintaining two boolean variables which I call first_rounded_bit and sticky_bit. first_rounded_bit is the most significant bit that got rounded off and sticky_bit is 0 if and only if all bits after first rounded bit are zero (like bitwise OR of all of them). This works like this:
    Example 1: ...01|110101
    "|" denotes the place we want to round off, so we want to get something that ends with ...01 + rounding. 1 right after "|" is the first_rounded_bit. After that, there are some non-zero bits, so sticky_bit = 1. Here, we can see that, when sticky_bit == 1 and first_rounded_bit == 1, we round up. Example 2: ...01|000000
    Here, first_rounded_bit = 1 and sticky_bit = 0, which means we are exactly in between ...10 and ...00. Think of it as rounding (...).5 to nearest even integer. (For example, you can read on why round() in Python does this) Here, we just check if ...01 is even and round up (add 1) if it is. In this case ...01 & 1 == 1, so it's odd.
    Example 3: ...01|01110 first_rounded_bit == 0 sticky_bit == 1
    Example 4: ...01|00000 first_rounded_bit == 0 sticky_bit == 0
    You can see how we don't need to do anything in these cases (that is, round down, discard the rounded bits).
  2. Towards zero:
    The easiest rounding mode. Just don't do anything and discard the rounded bits.
  3. Towards positive infinity:
    We round down in every case, but one: if first_rounded_bit == 1 and sticky_bit == 1 and the number is positive, that is, we actually round UP, not as an absolute value.
  4. Towards negative infinity:
    Same as 3, but only for negative numbers.

High-level idea for every operation

  • Addition:

    We need to shift the mantissa of one number to match the exponent of another. (m << 1 effectively doubles the number, so we can subtract 1 from e to preserve the value; same with m >> 1) We'll shift the bigger (as in absolute value) number's mantissa up to match the smaller exponent, so we can extract rounding information from last bits later (the other way around would do, too, and it is even more popular, it seems).

    The difference between exponents can be pretty big (more than 28 for float32). That would mean that shifting may exceed 64 bits that are available as normal C/C++ types. Luckily, we don't actually need to shift that much. Think about it. If we were to shift a small number (difference of exponents greater than 23) we would exceed the precision bits of mantissa. So we can assume that the result is the max operand and just round the result accordingly. Considering that the first rounded bit can influence rounding, it's best to change this condition to (a.e - b.e > 24) and handle other cases normally.

    For subnormal numbers, we perform normalization. Basically, shift mantissa up until there is 1 in the implicit 0's place while changing exponent accordingly. After that we can insert the implicit one (for normal numbers) and perform integer addition. When we set bit 24 (for float32) we get a number of form 1.(last 23 bits) which is a number in range [1, 2).

    The result of addition may be greater than or equal to 2 (indicated by 25-th bit being set for float32).

  • Subtraction:

    By subtraction, I mean addition of numbers of different sign. Actual (-) operation can be implemented trivially as adding negated second operand, i. e. a - b = a + (-b).

    In this case we need to perform integer subtraction instead of addition. The (a.e - b.e > 24) became (a.e - b.e > 25) in my implementation, because of the edge case where a.e - b.e == 25 and the first rounded bit becomes 0 after borrowing for subtraction. An example of subtraction in such case can be found in the big code comment in add() definition.

  • Multiplication:

    Normalize subnormals, add in implicit 1.
    Here, the obvious idea is to add the exponents as you would do in algebra and then multiply the mantissas:
    24-bit number × 24-bit number
    The resulting number is up to 48 bits, so we need to normalize it.

  • Division:

    Similar to multiplication, but we need to perform a floating point division on mantissas. This can be done by shifting first operand 24 bits right (for float32) and doing integer division.

    // (24-25 bits).(24 '0' bits) / 1.(23 bits)
    // if am >= bm then result is 1.(24 bits)
    uint64_t resm = dividend / divisor;
    uint64_t rem = dividend % divisor;
    bool first_rounded_bit = resm & 1;
    bool sticky_bit = rem != 0;
    resm >>= 1;
    This way we get the first 24 bits of the fractional part. The remainder allows us to get the information about the rest of the fractional part needed for rounding.

After all the described operations we need to normalize the number in case the result is ≥ 2.0 or < 1.

Note on the multiplicative operations: if the result is subnormal, i.e., biased exponent value < 0, then we need to first make it into proper subnormal form (shift mantissa so that unbiased exponent is -126) and only then round the result. I do it the following way:

if (res_exp > 0)
{
    round(&resm, rounding, first_rounded_bit, sticky_bit, sign);
    // if rounding resulted in resm == 2.0
    if (resm >> (MBITS + 1))
    {
        resm >>= 1;
        res_exp++;
    }
}
else
{
    resm <<= 1; // shift left to make space for sticky bit
    res_exp--;
    sticky_bit |= first_rounded_bit;
    resm |= sticky_bit;	   // save sticky bit as last bit
                                          // which is going to be shifted anyway
}

I did not discuss special cases like NaN or Inf, because they are quite intuitive and can be easily understood in the code. I'd like to point out only one particular class of them. And it's zeros. The thing is, when you add zeros of different sign, the result depends on the rounding. The result is +0 except for rounding towards negative infinity when it's -0. And you also have this:

-0 + x = x
+0 + x = x, x ≠ -0
+0 + -0 = +0

So you kind of have to 'if' it:

    if (b.is_zero())
    {
        if (rounding == ROUNDING_TOWARD_NEG_INFINITY
            && a.is_zero() && a.s != b.s)
        {
            return F { 1, 0, 0 };
        }
        if (a.is_zero() && a.s && !b.s)
            return b;
        return a;
    }