Floating-point addition

Floating-point addition

Floating point numbers format
The binary coding of floating-point real numbers is composed of three fields. The sign S (1 bit), the exponent E (8 bits) and the mantissa M, or significand (23 bits).
The number value is (-1)^S * 2^{(E - 127)} * (1 + M / 8388608 ) . However if E = 0, the number value is (-1)^S * 2^(-126) * ( M / 8388608 ) and if E = 255, the value is infinite.

Addition and subtraction

Since floating-point numbers are coded as "sign/absolute-value", reversing the sign-bit inverses the sign. Consequently the same operator performs equally well addition or subtraction according to the two operand's signs.
Addition/subtraction of floating-point numbers S = A + B is more complex than multiplication or division of floating-point numbers.
Floating-point addition progresses in 4 steps:

Mantissa alignment if A and B exponents are different,
Addition/subtraction of the aligned mantissas,
Renormalization of the mantissas sum S if not already normalized,
Rounding of sum S' .

The alignment step keeps a "guard bit" and a "round bit" and yields a "sticky bit" for the rounding step. The "sticky bit" is the "or" of all the bits discarded during the alignment shift.

Adder/ subtractor
A floating-point adder is made up of the following blocks:
Block 1: outputs the larger of the two exponents (8 bits), outputs the absolute value of the exponents difference (5 bits), reveals the "hidden bit" of both mantissas.
Block 2: output at left the smaller operand mantissa (23 bits), output at right the larger operand mantissa (23 bits). The "implicit bit" is added, totaling 24 bits.
Shifter 1: shifts to the right the smaller operand mantissa, appends the "guard bit", "round bit" and "sticky bit", totaling 27 bits.
Complementer: on request, does the logic complement for a subtraction.
Adder 1: adds the two inputs and the carry in. Outputs the sum and a carry out totaling 28 bits, 2 of them before the decimal point, 5 bits serve for the rounding, of whom 2 dropped.
Zero-leading-counter: the ZLC output is the number of leading '0'. If inhibited outputs 1.
Shifter 2: shifts to the left ( ZLC – 2 ) positions (i.e. from 2 to the right up to 23 to the left). The fist bit is dropped ( implicit '1' ).
Adder 2: subtracts ( ZLC – 1 ) from the larger of the two exponents.

This adder/subtractor implements faithfully the IEEE 754 standard. Nevertheless some blocks that would clutter up the figure are not included, namely the rounding logic , the exceptions (infinite, NaN, denormal, zero) and the "flags".

Alignment shifter
Floating-point numbers addition requires integer additions/subtractions, parametrised shifts (to the right for alignment, to the left for renormalization) and a counting of the result leading zeroes . Addition/subtraction can be completed with delay log₂(n). The parametrised shift's delay, below, is log₂(n) as well. A circuit of "or" and "and" gates computes the "sticky" bit.

Zero leading counter ( ZLC )
A binary tree counts up by dichotomy the number of '0' in the most significant positions. If the size of the sub-strings is a power of two, then there is no need for adders but multiplexers can be used instead. Actually only the size of the left substring has to be a power of two. The substring at right must simply be shorter than or of the same size as the left substring.

Cell of zero leading counter
This cell combines the number of leading '0' of two 16-bit strings to obtain the number of leading '0' of the concatenation of the two strings (32 bits).
. if X < 16 then S = X else S = 16 + Y

Zero leading prediction
From the mantissas A and B, one can construct in constant time a string P with the same number of leading zeroes, but for at most one, as the difference D = A – B without having to wait for the subtraction completion. When fed to a ZLC, this string predicts the number of positions required by the shifter. If the result of the shift still exhibits a leading zero, then a shift of one more position is necessary to normalize the result. Otherwise the shifted value is already normalized.

The prediction is valid if A is normalized and B less than or equal to A. This is the always the case in a significand subtraction. The leading zeroes result from a carry string 'P'* 'G' 'K'* , made up with a number (possibly null) of 'P' followed by a unique 'G' followed by a number (possibly null) of 'K' . The predictor cell outputs a '0' for every pair of symbols in: 'P' 'P' ; 'P' 'G' ; 'G' 'K' et 'K' 'K' and outputs a '1' for every other pair.
This predictor does not take into account the carry propagation, which may lead to an error of at most one position in the predicted bitstring. Since only one bit in 'P'* 'G' 'K'* might be incorrectly predicted, the error is tolerable.

Zero leading prediction adjustment
The prediction is incorrect only if the carry string starts with 'P'* 'G' 'K'* 'P' 'P'* 'K'. The following circuit output 'Y' whenever the prediction is incorrect, therefore too small by one.

Z indicates a string 'K'* 'P'*
Q indicates a string 'P'* 'G' 'K'* 'P'* (including only one 'G')
N indicates a string starting with 'P'* 'K'
Y indicates a string starting with 'P'* 'G' 'K'* 'P' 'P'* 'K', i.e. Q followed by N.
U indicates any other string.

Predictor cell
The leading '0' prediction cell output a '1' at the end of the string 'P'* 'G' 'K'* and '0' inside the string (and don't care neither inside nor at the end).

Absolute value of the difference
The absolute value of the exponents difference is needed to control Shifter 1. With four slight modifications, Sklansky's adder with a late carry-in returns S = ½ A – B ½.

Rounding to the nearest
In a floating point addition, the adder output S is first normalized by shift and then rounded up by adding 0 or 1 ulp, to give the result mantissa. The 28-bit S is labeled "adder out" in the floating point adder above. Check whether you are acquainted with the normalization and rounding.

The vertical arrow permutes the order of the shift and addition. The shift to the left of more than 1 position that never leads to an addition of 1 ulp is disregarded. With a pre-shift one position to the left in case of subtractions, we come down to the following cases :

shift to the right 0 or 1 position (S or S/2)
addition of 0 or 1 ulp (with carry propagation) (S or S/2 or S + 1 or S/2 + 1)

Speculative rounding
To avoid the extra delay due to the carry propagation when adding 1 ulp for the rounding, a three-output adder precalculates : S, S' = S + 1 and S" = S + 2. From this three outputs, all the possible final results can be obtained with a mere extra shift right: S, S/2, S' = S + 1, S"/2 = S/2 + 1.