REGULAR, AREA-TIME EFFICIENT CARRY-LOOKAHEAD ADDERS

Tin-Fock Ngai * - Mary Jane Irwin **

* Electrical Engineering, Stanford University, CA
** Computer Science, Penn State University, PA

Abstract

For fast binary addition, a carry-lookahead (CLA) design is the obvious choice [OnAt83, Baj83]. However, the direct implementation of a CLA adder in VLSI faces some undesirable limitations. Either the design lacks regularity, thus increasing the design and implementation costs, or the interconnection wires are too long, thus causing area-time inefficiencies. Together, these factors limit the size of addition for which the CLA approach remains advantageous. Brent and Kung solved the regularity problem by reformulating the carry chain computation [BrKu82]. They showed that an n-bit addition can be performed in time $O(\log n)$, using area $O(n \log n)$ with maximum interconnection wire length $O(n)$. In this paper, we give an alternative $\log n$ stage design which is nearly optimum with respect to regularity, area-time efficiency, and maximum interconnection wire length.

The Carry-lookahead Scheme

Let $a_{n-1}a_{n-2} \ldots a_0$ and $b_{n-1}b_{n-2} \ldots b_0$ be two n-bit binary numbers with a sum of $s_{n-1} \ldots s_0$. The carry-lookahead scheme computes the $s_i$'s by

$$c_{i+1} = g_i + p_i c_i$$
$$s_i = a_i \oplus b_i \oplus c_i$$

for $i = 0, 1, \ldots, n-1$

where

$g_i = a_i b_i$ (carry generate)

$p_i = a_i + b_i$ (carry propagate)

$+$ denotes logical or

$xy$ denotes $x$ and $y$

$\oplus$ denotes exclusive or
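As a minimal sketch of the recurrence above (a bit-level software model, not the circuit discussed in this paper), the following Python function evaluates the generate, propagate, carry and sum bits one position at a time. The function name `cla_recurrence_add` and the carry-in parameter `c0` are illustrative assumptions.

```python
def cla_recurrence_add(a: int, b: int, n: int, c0: int = 0) -> tuple:
    """Add two n-bit numbers using c_{i+1} = g_i + p_i c_i.

    Returns (sum modulo 2**n, carry out).
    """
    c = c0
    s = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        g = ai & bi                  # carry generate
        p = ai | bi                  # carry propagate (inclusive or)
        s |= (ai ^ bi ^ c) << i      # s_i = a_i xor b_i xor c_i
        c = g | (p & c)              # c_{i+1} = g_i + p_i c_i
    return s, c
```

For example, `cla_recurrence_add(15, 1, 4)` yields a zero sum with a carry out of one.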

It is easy to show that

$$c_{m+1} = g_m + \sum_{j=0}^{m-1} \left( \prod_{i=j+1}^{m} p_i \right) g_j + \left( \prod_{i=0}^{m} p_i \right) c_0$$
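The expansion can be checked mechanically. The sketch below compares the expanded sum-of-products form of $c_{m+1}$, including the carry-in term, against the bit-by-bit recurrence for every 4-bit operand pair. The helper names are hypothetical and the code is only a consistency check, not a gate-level description.

```python
from itertools import product


def carry_unrolled(g, p, c0, m):
    """c_{m+1} from the expanded sum-of-products form."""
    total = g[m]
    for j in range(m):
        term = g[j]
        for i in range(j + 1, m + 1):
            term &= p[i]             # g_j propagated through p_{j+1} .. p_m
        total |= term
    tail = c0
    for i in range(m + 1):
        tail &= p[i]                 # carry-in propagated through p_0 .. p_m
    return total | tail


def carry_ripple(g, p, c0, m):
    """c_{m+1} from the recurrence c_{i+1} = g_i + p_i c_i."""
    c = c0
    for i in range(m + 1):
        c = g[i] | (p[i] & c)
    return c


def check_expansion(n: int = 4) -> bool:
    """Exhaustively compare both forms for all n-bit operands and carry-in."""
    for bits in product((0, 1), repeat=2 * n + 1):
        a, b, c0 = bits[:n], bits[n:2 * n], bits[-1]
        g = [x & y for x, y in zip(a, b)]
        p = [x | y for x, y in zip(a, b)]
        for m in range(n):
            if carry_unrolled(g, p, c0, m) != carry_ripple(g, p, c0, m):
                return False
    return True
```

The number of product terms grows with $m$, which is the fan-in problem that motivates blocking.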

For large $m$, the above carry computation is difficult to implement due to the practical limitations on fan-in and fan-out. In order to reduce the complexity, it is common practice to group carries into blocks [WaFl82]. The block carry scheme is

$$c'_{i+1} = g'_i + p'_i c'_i$$
$$c_{ri} = c'_i$$

$$c_{ri+m+1} = g_{ri+m} + \sum_{j=0}^{m-1} \left( \prod_{k=j+1}^{m} p_{ri+k} \right) g_{ri+j} + \left( \prod_{k=0}^{m} p_{ri+k} \right) c'_i$$

for $0 \leq m \leq r - 2$

where

$g'_i = g_{ri+r-1} + \sum_{j=0}^{r-2} \left( \prod_{k=j+1}^{r-1} p_{ri+k} \right) g_{ri+j}$ (block carry generate)

$p'_i = \prod_{j=0}^{r-1} p_{ri+j}$ (block carry propagate)

$r$ is the blocking factor
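A one-level version of the block carry scheme can be sketched as follows. The function names are hypothetical, the word length is assumed to be a multiple of $r$, and the block and internal carries are folded iteratively rather than written as flat sum-of-products terms, which gives the same values.

```python
def block_signals(g, p, r):
    """Block carry generate g'_i and propagate p'_i for groups of r bits."""
    gb, pb = [], []
    for base in range(0, len(g), r):
        gen, prop = 0, 1
        for j in range(base, base + r):      # fold from the low bit upward
            gen = g[j] | (p[j] & gen)
            prop &= p[j]
        gb.append(gen)
        pb.append(prop)
    return gb, pb


def blocked_carries(g, p, r, c0=0):
    """Return [c_0, ..., c_n] using one level of blocking with factor r."""
    n = len(g)
    gb, pb = block_signals(g, p, r)
    cb = [c0]                                # block carries c'_i
    for gi, pi in zip(gb, pb):
        cb.append(gi | (pi & cb[-1]))        # c'_{i+1} = g'_i + p'_i c'_i
    carries = [0] * (n + 1)
    for i in range(n // r):
        c = cb[i]
        carries[r * i] = c                   # c_{ri} = c'_i
        for m in range(r - 1):               # internal carries, 0 <= m <= r-2
            c = g[r * i + m] | (p[r * i + m] & c)
            carries[r * i + m + 1] = c
    carries[n] = cb[-1]                      # carry out of the last block
    return carries
```

Applying `block_signals` again to its own output gives the next level of the iteration.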

The same technique can be applied iteratively to compute the block carries. This scheme is illustrated in Figure 1 for a 27-bit CLA adder using a blocking factor of three in three levels.
Because we are using NMOS technology as our model, all complex gates use complementary logic. Thus, it is desirable to modify the above scheme to reflect the use of a negative logic system. Assume that at the k-th level of the iteration, the block carry computation is

\[ c^k_{i+1} = g^k_i + p^k_i c^k_i \]

We can reformulate this computation into negative logic as

\[ c^k_{i+1} = \overline{G^k_i + P^k_i \, \overline{c^k_i}} , \]

where

\[ G^k_i = \overline{(g^k_i + p^k_i)} \]
\[ P^k_i = \overline{g^k_i} \]

\( \overline{x} \) denotes the complement of \( x \) (not \( x \))

Then for the next level of iteration, we have

\[ c^{k+1}_{i+1} = g^{k+1}_i + p^{k+1}_i c^{k+1}_i \]
\[ c^k_{ri} = \overline{c^{k+1}_i} \]

\[ c^k_{ri+m+1} = \overline{ G^k_{ri+m} + \sum_{j=0}^{m-1} \left( \prod_{l=j+1}^{m} P^k_{ri+l} \right) G^k_{ri+j} + \left( \prod_{l=0}^{m} P^k_{ri+l} \right) c^{k+1}_i } \]

for \( 0 \leq m \leq r - 2 \)

where

\[ g^{k+1}_i = G^k_{ri+r-1} + \sum_{j=0}^{r-2} \left( \prod_{l=j+1}^{r-1} P^k_{ri+l} \right) G^k_{ri+j} \]

\[ p^{k+1}_i = \prod_{j=0}^{r-1} P^k_{ri+j} \]

With this reformulation, all block carries, block carry generates and block carry propagates can be directly implemented with complex gates.
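The identity behind the reformulation can be verified exhaustively. The sketch below assumes the complemented signals are $G = \overline{(g + p)}$ and $P = \overline{g}$, and checks that the negative logic form reproduces $g + pc$ for every input combination. The function names are illustrative.

```python
from itertools import product


def positive_carry(g: int, p: int, c: int) -> int:
    """Positive logic: c_next = g + p c."""
    return g | (p & c)


def negative_carry(g: int, p: int, c: int) -> int:
    """Negative logic: c_next = not(G + P * not(c))."""
    G = 1 - (g | p)          # G = complement of (g + p)
    P = 1 - g                # P = complement of g
    c_bar = 1 - c
    return 1 - (G | (P & c_bar))


def forms_agree() -> bool:
    """True if both forms agree on all eight (g, p, c) combinations."""
    return all(
        positive_carry(g, p, c) == negative_carry(g, p, c)
        for g, p, c in product((0, 1), repeat=3)
    )
```

The check does not rely on $g \leq p$, so it also covers higher levels where a block generate may be set while the block propagate is not.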

**Adder Design**

We found that a blocking factor of three for the first two levels of the iteration and four for all higher levels resulted in a more area efficient VLSI NMOS design as will be demonstrated in the next section. With this scheme there are four types of basic units to be designed (only three of which are depicted in Figure 1). These are the primitive unit (P), the 3-input block carry unit (BC3), the 4-input block carry unit (BC4), and the block carry generation unit (BG). NMOS layouts of these units are shown in Figure 2.

**Primitive unit (P)** This basic unit includes four subunits: input/output, block propagate and block generate, carry generation, and summation. We postpone the discussion of the input/output subunit. Here we assume that the input/output subunit is of low complexity and uses constant area. In the block propagate and block generate subunit, block propagate \( p^1 \) and block generate \( g^1 \) are formed directly from the operands \( a_2, a_1, a_0 \) and \( b_2, b_1, b_0 \). (We restrict our attention to the least significant unit for notational simplicity without loss of generality.)

\[ g^1 = \overline{(g_2 + p_2)(g_2 + g_1 + p_1)(g_2 + g_1 + g_0 + p_0)} \]
\[ = \overline{(a_2 + b_2)(a_2 b_2 + a_1 + b_1)(a_2 b_2 + a_1 b_1 + a_0 + b_0)} \]
\[ p^1 = \overline{(g_2 + g_1 + g_0)} = \overline{(a_2 b_2 + a_1 b_1 + a_0 b_0)} \]

The carry generation subunit computes all the carry bits internal to the primitive unit from the block carry \( c^1 \) which is input to the unit.

\[ c_0 = \overline{c^1} \]
\[ c_1 = \overline{G_0 + P_0 c^1} \]
\[ c_2 = \overline{G_1 + P_1 G_0 + P_1 P_0 c^1} \]

where
\[ G_0 = \overline{(g_0 + p_0)} = \overline{(a_0 + b_0)} \]
\[ G_1 = \overline{(g_1 + p_1)} = \overline{(a_1 + b_1)} \]
\[ P_0 = \overline{g_0} = \overline{(a_0 b_0)} \]
\[ P_1 = \overline{g_1} = \overline{(a_1 b_1)} \]

Figure 2a. The Primitive Unit Layout

Figure 2b. The BC3 Layout

Figure 2c. The BG Layout

The summation subunit computes the sum bits from the internal carries and the operand bits.
\[ s_i = a_i \oplus b_i \oplus c_i, \quad i = 0, 1, 2 \]

The corresponding NMOS complex gate layout for the primitive unit is shown in Figure 2a.
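A behavioural model of the least significant primitive unit can be sketched as below. It assumes the complemented signals $G_i = \overline{(g_i + p_i)}$ and $P_i = \overline{g_i}$, a complemented block carry input `c1`, and product-of-sums expressions for $g^1$ and $p^1$ with a single inversion each. The function name and argument layout are hypothetical.

```python
def inv(x: int) -> int:
    """Logical complement of a single bit."""
    return 1 - x


def primitive_unit(a, b, c1):
    """Model one 3-bit primitive unit.

    a, b: operand bits as [bit0, bit1, bit2].
    c1:   block carry into the unit in complemented form (not c_0).
    Returns (sum bits [s0, s1, s2], g1, p1).
    """
    g = [x & y for x, y in zip(a, b)]
    p = [x | y for x, y in zip(a, b)]
    G = [inv(gi | pi) for gi, pi in zip(g, p)]
    P = [inv(gi) for gi in g]
    # Block generate and propagate passed to the next higher level.
    g1 = inv((g[2] | p[2]) & (g[2] | g[1] | p[1]) & (g[2] | g[1] | g[0] | p[0]))
    p1 = inv(g[2] | g[1] | g[0])
    # Internal carries in positive form.
    c = [
        inv(c1),
        inv(G[0] | (P[0] & c1)),
        inv(G[1] | (P[1] & G[0]) | (P[1] & P[0] & c1)),
    ]
    s = [a[i] ^ b[i] ^ c[i] for i in range(3)]
    return s, g1, p1
```

In this model the complemented carry out of the unit is `g1 | (p1 & c1)`, which is the form the next level consumes.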

**Block carry units (BC3 and BC4)** These two basic units form the block carry generates and block carry propagates from either three (BC3) or four (BC4) inputs which will serve as inputs to the next higher level and also compute the block carries which will serve as inputs to the subsequent lower level. Implementation of the block carry generates and propagates using complex gates is straightforward. For example in BC4
\[ g^{k+1} = \overline{ \left( g^k_3 + p^k_3 \right) \left( g^k_3 + g^k_2 + p^k_2 \right) \left( g^k_3 + g^k_2 + g^k_1 + p^k_1 \right) \left( g^k_3 + g^k_2 + g^k_1 + g^k_0 + p^k_0 \right) } \]
\[ p^{k+1} = \overline{(g^k_3 + g^k_2 + g^k_1 + g^k_0)} \]

The block carries are computed only when the block carry generated by the next higher level becomes available.
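\[ c^k_0 = \overline{c^{k+1}} \]
\[ c^k_1 = \overline{G^k_0 + P^k_0 c^{k+1}} \]
\[ c^k_2 = \overline{\tilde{G}_1 + \tilde{P}_1 c^{k+1}} \]
\[ c^k_3 = \overline{\tilde{G}_2 + \tilde{P}_2 c^{k+1}} \quad \dagger \]
where
\[ \tilde{G}_1 = G^k_1 + P^k_1 G^k_0 \]
\[ \tilde{P}_1 = P^k_1 P^k_0 \]
\[ \tilde{G}_2 = G^k_2 + P^k_2 \left( G^k_1 + P^k_1 G^k_0 \right) \quad \dagger \]
\[ \tilde{P}_2 = P^k_2 P^k_1 P^k_0 \quad \dagger \]

\[ \dagger \ \text{in BC4 unit only} \]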

Figure 2b shows the corresponding NMOS complex gate layout for the BC3 unit.

**Block carry generation unit (BG)** This basic unit forms block carries at the highest level of iteration. If \( K \) is the total number of iteration levels, then

\[ c^K_0 = \begin{cases} \overline{c_0} & \text{if } K \text{ is odd} \\ c_0 & \text{if } K \text{ is even} \end{cases} \]

and
\[ c^K_1 = g^K_0 + p^K_0 c^K_0 \]
\[ c^K_2 = g^K_1 + p^K_1 \left( g^K_0 + p^K_0 c^K_0 \right) \]
\[ c^K_3 = (g^K_2 + p^K_2 g^K_1) + (p^K_2 p^K_1) \left( g^K_0 + p^K_0 c^K_0 \right) \quad \dagger \]

\[ \dagger \ \text{as needed} \]

Figure 2c shows the corresponding NMOS complex gate layout for the BG unit.
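The interplay of the primitive, block carry and BG units can be illustrated with a bit-level software model. The sketch below is written under stated assumptions and is not a description of the layout: `factors` lists the blocking factor of each level from the bottom up, the blocks remaining at the top are handled by a BG-style recurrence, the word length is assumed to be divisible accordingly, and block signals are folded iteratively instead of being evaluated as single complex gates. All names are hypothetical.

```python
def inv(x: int) -> int:
    """Logical complement of a single bit."""
    return 1 - x


def negative_logic_add(a, b, n, factors, c0=0):
    """Add two n-bit integers with alternating-polarity block carries.

    Returns (sum modulo 2**n, carry out).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    levels = []
    for r in factors:                          # upward pass
        G = [inv(gi | pi) for gi, pi in zip(g, p)]
        P = [inv(gi) for gi in g]
        levels.append((G, P, r))
        g, p = [], []
        for base in range(0, len(G), r):
            gen, prop = 0, 1
            for j in range(base, base + r):
                gen = G[j] | (P[j] & gen)      # block generate of next level
                prop &= P[j]                   # block propagate of next level
            g.append(gen)
            p.append(prop)
    K = len(factors)
    top = [inv(c0) if K % 2 else c0]           # BG unit: polarity depends on K
    for gi, pi in zip(g, p):
        top.append(gi | (pi & top[-1]))
    carry_out = inv(top[-1]) if K % 2 else top[-1]
    carries = top[:-1]
    for G, P, r in reversed(levels):           # downward pass
        lower = []
        for i, c_in in enumerate(carries):
            acc = c_in                         # complement of the lower carry
            lower.append(inv(acc))
            for m in range(r - 1):
                acc = G[r * i + m] | (P[r * i + m] & acc)
                lower.append(inv(acc))
        carries = lower
    s = 0
    for i in range(n):
        s |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return s, carry_out
```

For instance, `negative_logic_add(a, b, 27, [3, 3])` models a 27-bit addition with two block levels of factor three and three blocks at the top.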

**Area and Time Measures**

Figure 3 is an NMOS layout for a 27-bit adder using the scheme derived above. This layout has been fit in area \( O(n) \) by using a recursive technique for embedding tree-like structures in the plane [Ui84]. As can be seen in the floorplan schematic in Figure 4a, a 27-bit adder can be packed most densely by using a blocking factor of three at each of the levels. For larger adders, a blocking factor of three for the first two levels and four thereafter leads to the most efficient layout as seen in Figures 4b and 4c. The flexibility that comes from being able to vary the blocking factor at various levels can be used to make optimum use of space. Note that all interconnections between the basic units involve only nearest neighbors with a maximum wire length of \( O(\sqrt{n}) \).

Brent and Kung’s adder [BrKu82] can also be embedded in \( O(n) \) area with a maximum wire length of \( O(\sqrt{n}) \), but the lack of flexibility in the blocking factor (only two) results in a much less densely packed layout with considerably more area dedicated to interconnect.

Obviously, any \( \log_2 n \) stage adder can add two n-bit numbers in time \( O(\log n) \). Another motivation for using negative logic was to speed up the addition even more. At each level, we need to form the carry generate and carry propagate terms and to form the block carries out after receiving the block carry in. While a positive logic approach would require four NMOS gate delays to accomplish these operations, the theoretical lower bound for the negative logic approach should be two gate delays. SPICE simulations have indicated that three gate delays may be a more realistic measure for our complex gates. Table 1 gives a comparison of the computation time for a serial scheme, Brent and Kung’s scheme, and our scheme for various n. \( T \) is the average delay of a simple NAND/NOR gate measured at the inverter voltage level. The delay of an exclusive or gate is assumed to be 2T.

<table>
<thead>
<tr>
<th>n</th>
<th>Serial</th>
<th>B&amp;K</th>
<th>N&amp;I</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>18</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>9</td>
<td>20</td>
<td></td>
<td>8</td>
</tr>
<tr>
<td>16</td>
<td>34</td>
<td>19</td>
<td></td>
</tr>
<tr>
<td>27</td>
<td>56</td>
<td></td>
<td>12</td>
</tr>
<tr>
<td>64</td>
<td>130</td>
<td>27</td>
<td></td>
</tr>
<tr>
<td>108</td>
<td>218</td>
<td></td>
<td>16</td>
</tr>
<tr>
<td>256</td>
<td>514</td>
<td>35</td>
<td></td>
</tr>
<tr>
<td>432</td>
<td>866</td>
<td></td>
<td>20</td>
</tr>
</tbody>
</table>

For large n, our design uses time \( 2T \times \log n \) as compared to \( 2T \times 2\log n \) for Brent and Kung’s design.
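The asymptotic comparison can be read off the tabulated delays. The sketch below assumes that the entries for n = 8, 16, 64, 256 belong to Brent and Kung's scheme and those for n = 9, 27, 108, 432 to our scheme, and that delays are in units of \( T \). The closed forms `2n + 2` and `4 log2(n) + 3` are fits observed from the table values, not formulas stated in the text.

```python
import math

# Delays copied from Table 1 (assumed to be in units of T).
SERIAL = {8: 18, 9: 20, 16: 34, 27: 56, 64: 130, 108: 218, 256: 514, 432: 866}
BRENT_KUNG = {8: 15, 16: 19, 64: 27, 256: 35}
NGAI_IRWIN = {9: 8, 27: 12, 108: 16, 432: 20}


def serial_fit(n: int) -> int:
    """Observed fit for the serial column."""
    return 2 * n + 2


def brent_kung_fit(n: int) -> int:
    """Observed fit for the Brent and Kung entries (n a power of two)."""
    return 4 * int(math.log2(n)) + 3


def delay_per_log2(n: int, delay: int) -> float:
    """Delay divided by log2(n), to compare against the stated leading terms."""
    return delay / math.log2(n)


def ratios(table: dict) -> list:
    """Delay / log2(n) for each entry, ordered by n."""
    return [delay_per_log2(n, table[n]) for n in sorted(table)]
```

Under these assumptions, `ratios(BRENT_KUNG)` decreases toward 4 and `ratios(NGAI_IRWIN)` decreases toward 2 as n grows, which matches the factor of two between the two leading terms.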

**Input-Output Consideration**

In the previous section, our design was based on the assumption that external inputs and outputs are locally available at the primitive units. When more realistic input and output constraints are considered, we face the following well-known dilemma: one can pack \( n \) processing elements densely in an area of \( O(n) \) with the smallest linear dimension (such as in a circle or in a regular polygon), but if all inputs and outputs are only available on the boundary of the area, how can the \( O(n) \) I/O wires be routed across the boundary of length \( O(\sqrt{n}) \)? In our CLA design, there are two obvious possible solutions.

Figure 3. Layout of a 27-bit CLA Adder

The first solution is to relax the area efficiency requirement and let all I/O wires be routed across the boundary as shown in Figure 5a. The area is then \( O(n^2) \). As \( n \) increases, the portion of the area used for interconnection increases significantly. For very large \( n \), this solution spends most of its area on input/output wires and most of its time on input and output along the I/O wires. Area and time complexities become \( O(n^2) \) and \( O(n) \), respectively. Although such asymptotic behavior is undesirable, this solution may be appealing when \( n \) is not too large. It is estimated that for \( n \) in the thousands, the total area is only several times that of the previous 'generic' adder and the total time (including the input/output time) is less than twice that of the generic adder. Furthermore, although the length of the I/O wires is \( O(n) \), the maximum size of addition allowed is still large because the constant of proportionality is small.

The other solution relaxes the time efficiency requirement by assuming that inputs and outputs are time multiplexed. In this case the number of I/O wires can be reduced to match the length of the boundary. However, the input and output will take time \( O(\sqrt{n}) \). The data lines, common address lines, clock and control lines must be provided as shown in Figure 5b. In each primitive unit, the input-output subunit has registers and a decoding circuit. With the appropriate address on the bus, data may be loaded into or read out of the registers. Asymptotically, this solution uses area proportional to \( n \log n \). In practice, however, for an \( n \) as large as \( 10^5 \), the additional I/O wire area is still no more than that of the generic adder. This solution will require \( \sqrt{n} \) clocks for each input (output), with the minimum clock period allowed proportional to \( \sqrt{n} \). The total I/O time is then proportional to \( n \), which is much longer than the addition time. Therefore, this solution would not be desirable for additions alone. By expanding the functionality of the primitive unit, for example to perform an accumulation function, the design may become appealing. For example, multiplication can then be done in time proportional to \( n \log n \). This outweighs the I/O time and results in a multiplier of time \( O(n \log n) \). This is only desirable in cases where the processing time is much longer than the input/output time. Furthermore, the input-output wire reduction also conforms to the limited I/O pin restriction for VLSI chips.

The input-output difficulties discussed above are due to the planar restriction on VLSI processing. Recently, a considerable amount of research effort has been put into developing 3-D VLSI processing and 3-D microassembly techniques [GrNE84]. Our generic adder would be ideal for these emerging technologies.

Conclusion
We have shown that using the conventional iterative CLA scheme, nearly optimal CLA adders can be obtained. The layout of the adder is highly regular. Neglecting input-output restrictions, the addition of two n-bit numbers can be performed in time $O(\log n)$, using area $O(n)$ with maximum interconnection wire length $O(\sqrt{n})$. When input-output is considered, two alternative designs are evaluated. Even for practically large n, it is shown that reasonably good area and time efficiencies can be achieved.

Acknowledgements
This work is supported in part by the Army Research Office under Contract DAAG29-83-K-0126. Thanks go to Shishpal Rawat for work on the CIF layout plots.

References