DESIGN OF AN ARITHMETIC ELEMENT FOR SERIAL PROCESSING IN AN ITERATIVE STRUCTURE

Jawahri N. Goyal
Department of Computer Science
University of Illinois Urbana, Illinois 61801

Abstract

This paper describes the arithmetic and logic design of the digit processing logic of an arithmetic element. The arithmetic element is used in an iteratively structured arithmetic unit in which arithmetic processing takes place serially, on a digit by digit basis, with the most significant digit first. Starting from the arithmetic specification of the digit processing logic, the arithmetic design (namely, the choice of number system, number representation and the digit algorithm) is developed. The algebraic and logic designs of the logic necessary to execute the digit algorithm, and their implications for LSI implementation, are discussed.

1. Introduction

The advent of large-scale integration (LSI) technology and the demand for high speed digital computer systems have posed a new challenge to the designers of digital systems. Efficient use of LSI capabilities requires that digital systems be modular, that the modules consist of as few different types as possible and that the interconnection structure between the modules be uniform. At the module level, the module itself should have a high gate to pin ratio and a regular and repetitive internal logical structure. Further, the module should be as locally autonomous as possible, communicating with only a few of its neighboring modules for information. This avoids a high pin count, the necessity of large on-chip drive capability and the consequent high power dissipation.

In this paper, we consider some aspects of the design of a basic arithmetic processor and its implications for LSI technology implementation. A basic arithmetic processor performs addition/subtraction, multiplication and division of two binary operands. The basic arithmetic microinstruction of an arithmetic unit which performs multiplication and division by alternately doing one addition/subtraction and one shift can be characterized by the transfer function

$$A' = A + (m \cdot \phi) \tag{1}$$

where \(A', A\) and \(\phi\) are consistently represented numbers and \(m\) is the multiplier or quotient digit.
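The transfer function above can be exercised with a minimal sketch, assuming ordinary Python integers for \(A\) and \(\phi\) and a Horner-style order in which the accumulator is shifted before each addition. The function name `msdf_multiply` and its parameters are hypothetical and are not taken from the paper.

```python
def msdf_multiply(multiplier_digits, phi, radix):
    """Form (multiplier * phi) by repeated shift and add, multiplier digits MSD first.

    multiplier_digits: digits m of the multiplier, most significant first;
                       each may be negative, with |m| <= radix - 1.
    phi:               the multiplicand, here an ordinary Python integer.
    """
    acc = 0  # the accumulator A
    for m in multiplier_digits:
        acc = acc * radix      # shift the accumulator one digit position
        acc = acc + m * phi    # A' = A + (m * phi)
    return acc


# Multiplier digits (1, -2, 3) in radix 4 stand for 1*16 - 2*4 + 3 = 11.
print(msdf_multiply([1, -2, 3], 7, 4))  # 11 * 7
```

Each loop iteration corresponds to one shift and one application of the transfer function; the digit-serial hardware described in the following sections distributes this work over a cascade of modules instead of using one wide accumulator.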

Atrubin [1], Goyal [2], and Pisterzi [3] describe algorithms to process a sequence of microinstructions in a modular, iteratively structured network of identical finite state machines. This structure very elegantly meets the constraints of LSI technology. In such a structure, the results are obtained on a digit by digit basis. In Atrubin's method, the processing begins with the least significant digits of the operands and correspondingly the least significant digits of the result are available first. On the other hand, in Pisterzi's model, the processing begins with the most significant digits of the operands and the most significant digits of the result are obtained first. The most-significant-digit-first (MSDF) approach has the advantages of early overflow detection, normalization concurrent with processing and early termination of processing as soon as enough significant digits of the result have been obtained. Moreover, the quotient generation and operand normalization processes inherently require the examination of the most significant digits first. Finally, the MSDF algorithm in an iterative structure has the potential, in a suitable environment, of achieving overlap (pipelining) of successive instructions.

This paper presents the results of a study into the arithmetic and logic design of the digit processing logic of the finite state machine. The radix chosen for use in the digit processing logic is a design parameter. The main consideration in the design was its suitability for LSI implementation.

To put things in perspective, a brief description of the organization and operation of the iterative structure is given in section 2. Section 3 presents the major content of this paper. It describes the arithmetic design of the digit processing logic of the finite state machine in terms of number system, number representation and the digit algorithm. The algebraic design of the logic necessary to implement the digit algorithm is explained next. We further give a brief description of the logic design of the digit processing logic and its implications on the pin requirements for various values of radix. Finally alternatives are suggested for reducing the number of pins.

2. Arithmetic Unit Organization

The arithmetic unit organization considered in this paper was first proposed by Pisterzi [3]. Structurally, it consists of a linear cascade of identical arithmetic elements called Digit Processing Modules (DPMs) and a Global Control Unit (GCU). The GCU receives an arithmetic instruction (e.g., ADD, MULTIPLY, etc.) from an external source and converts it into a sequence of elementary microinstructions to be executed by the linear cascade of DPMs. However, the GCU communicates only with the most significant DPM, and the microinstructions flow serially (in a pipelined manner) from the most significant DPM (closest to the GCU) to the least significant DPM, instead of being broadcast to all modules simultaneously. Figure 1 shows the schematic organization of such an arithmetic unit. Each DPM retains the values of one digit of each of the active operands in its registers; collectively, the DPMs contain the mantissas of all active operands and do the processing on them. The DPMs have the capability of performing microinstructions which will (when performed by all DPMs) form sums, perform shifts, do inter-register transfers, etc. Because the quotient generation and operand normalization processes require the examination of the most significant digits, the operands are placed in the DPMs so that the digits of each of the operands are available to the microinstructions in order of decreasing significance. Thus the most significant digits of the operands are placed in the DPM which communicates with the GCU.

Each DPM performs the same sequence of microinstructions. A given microinstruction is not executed by all DPMs in synchronization, but rather must be executed by them in sequence (i.e., first by DPM\(_1\), then DPM\(_2\), ...). As soon as all the DPMs which contain information required by DPM\(_i\) to perform microinstruction \( j+1 \) (referred to as \( \mu_{j+1} \)) have executed \( \mu_j \) and have sent the required information to DPM\(_i\), \( \mu_{j+1} \) may be performed by DPM\(_i\). The microinstructions must have regular data requirements so that as an additional DPM executes \( \mu_j \), one more DPM may execute \( \mu_{j+1} \).

Clearly, the DPM registers do not contain entire operands as long as any of the DPMs are actively executing microinstructions. Each DPM contains the digits from the results of the last microinstruction executed. A more formal and detailed explanation of the operation of a DPM is given in the appendix.

All the microinstructions executed by the DPMs are so designed that the processing begins with the most significant digits of the operands and proceeds to those of decreasing significance. For example, consider the multiplication algorithm, which is right directed. It is implemented as a repeated sequence of shift-left-multiplier, multiply-and-add, and shift-left-accumulator microinstructions, in that order. During the multiplication process, the cascade network of DPMs behaves as what Dadda and Ferrari [4] call a "column-wise" operating multiplier: the product digits of a column of the product-matrix are generated within different DPMs and summed along the cascade network with other digits of the same column and with carries (transfers) of the preceding columns.

For a detailed description of the implementation of various arithmetic algorithms and a complete description of the different elementary microinstructions executed by the DPMs, see Goyal [5].

3. Design of the Digit Processing Module

From the description of a typical DPM, it is evident that the operations must be completely independent of the significance of the digits retained by a DPM. All DPMs may then be identical. A DPM is a finite state, complex logic module with control logic and combinational digit processing logic.

To achieve a compromise between serial processing and the desired arithmetic speed, an arithmetic step is carried out in a higher radix, \( r = 2^k \), such that \( k \) bits of the result are obtained at each step. \( k \) is chosen as a design parameter. The major consideration in the arithmetic and logic design has been the desirability of its implementation in LSI technology.

3.1 Arithmetic Design

The basic digit level arithmetical function performed by a DPM can be specified as

\[
a'_i = a_i + m_j \delta_i \tag{2}
\]

where \( a'_i \), \( a_i \), and \( \delta_i \) are consistently represented digits in the active registers of the DPM and \( m_j \) is a multiplier or quotient digit such that

\[
|m_j| \leq r-1, \ r \text{ being the radix.}
\]

3.1.1 Choice of Number System

Let \( \alpha_j \) denote the number of DPMs from which a given DPM requires information in order to execute the microinstruction \( \mu_j \). Because of the iterative structure, it is evident that for efficient operation of the unit, \( \alpha_j \) should satisfy two constraints:

a) The microinstructions should have regular data requirements independent of the significance of the digits retained by the DPM. That is, \( \alpha_j \) should be the same for each DPM.

b) The value of \( \alpha_j \) should be as small as possible because the execution rate of a given microinstruction is inversely proportional to \( \alpha_j \).

In a conventional weighted number system, the carry or borrow into any digital position is a function of all the digits to the right of it. The value of \( \alpha_j \) is therefore a function of the significance of the digit itself. Hence a conventional number system is unsuitable. A redundant number system, which gives a bounded value of \( \alpha_j \), is essential.

3.1.2 Choice of Number Representation and Redundancy

The major factors influencing the choice of redundant number representation and the amount of redundancy in the number system are the following:

a) the ease of conversion from conventional number representation to redundant number representation,

b) its compatibility with the widely employed conventional binary number system,

c) normalization of operands to widely employed conventional binary number system,

d) LSI technology constraints, namely

i) minimization of the number of types of cells (in the arithmetic and logic sense) required for higher radix \( (r=2^k) \),

ii) implementation of the digit processing logic, and

iii) minimization of the number of input and output pins.

In this study, signed digit (SD) representation with maximal redundancy is chosen because it meets most of the requirements. That is, for radix \( 2^k \), the digit set \( \{\overline{2^k-1}, \ldots, \overline{1}, 0, 1, \ldots, 2^k-1\} \) is used. The overbar designates negative digit values. Implications of maximal redundancy, "pseudo normal" forms, and the use of the "pack-add" algorithm are discussed in Avizienis [6] and Goyal [5].
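To illustrate why a maximally redundant signed-digit set bounds the transfer dependence, the following is a minimal sketch of a two-step signed-digit addition rule of the kind discussed by Avizienis. It is offered as an illustration under stated assumptions, not as the digit algorithm of this paper: the threshold \(r-1\) for generating a transfer, the restriction to \(k \geq 2\), and the names `sd_add` and `sd_value` are all assumptions introduced here.

```python
def sd_add(x, y, k):
    """Add two radix-2**k signed-digit numbers given MSD first.

    Digits lie in [-(r-1), r-1] with r = 2**k (maximal redundancy), k >= 2.
    Returns len(x) + 1 digits, MSD first, in the same digit set.
    """
    if k < 2:
        raise ValueError("this two-step rule needs radix >= 4")
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    r = 1 << k
    a = r - 1
    transfers, interim = [], []
    for xi, yi in zip(x, y):
        z = xi + yi
        t = 1 if z >= a else -1 if z <= -a else 0
        transfers.append(t)        # goes to the next more significant position
        interim.append(z - r * t)  # magnitude is at most r - 2
    # Position i absorbs only the transfer produced at position i + 1.
    incoming = transfers[1:] + [0]
    return [transfers[0]] + [w + t for w, t in zip(interim, incoming)]


def sd_value(digits, k):
    """Integer value of an MSD-first signed-digit list."""
    value = 0
    for d in digits:
        value = value * (1 << k) + d
    return value


x, y = [3, -3, 2, 3], [3, 3, -1, 3]
s = sd_add(x, y, 2)
print(s, sd_value(s, 2) == sd_value(x, 2) + sd_value(y, 2))
```

In this sketch every sum digit depends on its own position and on one less significant neighbor only, regardless of the operand length, which is the bounded-dependence property the choice of number system is meant to secure.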

Two modes of representation for the signed digit are used, depending on the area of application:

a) Redundant Binary Mode - Each digit, radix \( 2^k \), is represented by \( k \) redundant binary digits. Each redundant binary digit is chosen from the digit set \( \{\overline{1}, 0, 1\} \).

b) Sign-Magnitude (SM) Mode - Each digit, radix \( 2^k \), is represented by a single sign bit and a magnitude represented by \( k \) bits.
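The two modes can be sketched for a single radix-\(2^k\) signed digit as follows. The helper names are hypothetical. Note that a redundant binary representation of a digit is not unique; the sketch picks the one in which every nonzero redundant binary digit carries the sign of the whole digit.

```python
def to_redundant_binary(d, k):
    """k redundant binary digits (MSD first), each in {-1, 0, 1}.

    Assumption of this sketch: all nonzero digits share the sign of d.
    """
    sign = -1 if d < 0 else 1
    mag = abs(d)
    return [sign * ((mag >> i) & 1) for i in reversed(range(k))]


def to_sign_magnitude(d, k):
    """(sign bit, k magnitude bits MSD first); sign bit 1 means negative."""
    mag = abs(d)
    return (1 if d < 0 else 0, [(mag >> i) & 1 for i in reversed(range(k))])


print(to_redundant_binary(-5, 3))  # digits of weight 4, 2, 1
print(to_sign_magnitude(-5, 3))
```

The sign-magnitude mode needs \(k+1\) bits per digit, whereas the redundant binary mode needs \(k\) three-valued digits.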

3.1.3 Digit Algorithm

Figure 2 shows the functional representation of the digit processing logic of a DPM. Essentially, it consists of a digit multiplier and an adder. Efficient use of LSI technology in the implementation of radix-\( 2^K \) digit processing logic dictates that it should be made up of identical logical cells. One approach that achieves this goal is to design the logic as a one-dimensional linear cascade of \( K \) stages of radix-2 arithmetic processing structures.

Since the arithmetical design of the adder is influenced by the arithmetical and logical organization of the multiplier, the design of the digit multiplier is discussed next followed by that of the adder.

3.1.3.1 Design of the Digit Multiplier

The digit multiplier forms the product of the multiplicand and multiplier digit. In order to illustrate the algorithm, let the two digits to be multiplied be denoted by \( X \) and \( Y \). In the most general way, they can be represented as

\[
X = \sum_{i=0}^{K-1} x_i 2^i, \quad x_i \in \{\overline{1},0,1\}
\]

\[
Y = \sum_{j=0}^{K-1} y_j 2^j, \quad y_j \in \{\overline{1},0,1\}
\]

The product \( XY \) is formed by a \( K \times K \) square array of redundant binary product cells. Each cell forms the product of two redundant binary digits \( x_i \) and \( y_j \), and its output product digit \( p_{ij} = x_i y_j \) is in the digit set \( \{\overline{1},0,1\} \). (In actual practice, the product-matrix generator forms the product of two radix-\( 2^K \) signed digits encoded in SM mode.)

The product may be viewed in terms of the sum of the \( p_{ij} \) terms of the same weight in the product-matrix.

\[
X \times Y = \sum_{i=0}^{2K-2} \left( \sum_{j=0}^{K-1} p_{i-j,\,j} \right) 2^i
\]

where \( p_{i-j,\,j} = x_{i-j}\, y_j \) and \( x_{i-j} = 0 \) for \( i-j < 0 \) and \( i-j > K-1 \).

Defining

\[
S_i = \sum_{j=0}^{K-1} p_{i-j,\,j}
\]

we have

\[
X \times Y = \sum_{i=0}^{2K-2} S_i 2^i \tag{3}
\]

where \( S_i \) is the sum of the entries in the \( i \)th column of the product-matrix. The number of product-matrix elements in the \( i \)th column is given by

\[
N_i = \begin{cases} 
i+1 & 0 \leq i \leq K-1 \\
(2K-1)-i & K \leq i \leq 2K-2
\end{cases}
\]

The number of product elements, \( p_{ij} \), is maximum in the column of weight \( 2^{K-1} \) and is equal to \( K \). The number of product elements in the other columns decreases uniformly by one on either side of this column, as shown in Figure 4.
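The column structure of the product matrix can be checked with a short sketch, assuming the digits of \(X\) and \(Y\) are supplied least significant first as Python lists over \(\{-1, 0, 1\}\). The function name `column_sums` is hypothetical.

```python
def column_sums(x, y):
    """Column sums S_i and element counts N_i of the K x K product matrix.

    x, y: lists of K redundant binary digits in {-1, 0, 1}, LSB first.
    Element x[i] * y[j] has weight 2**(i + j) and so belongs to column i + j.
    """
    K = len(x)
    S = [0] * (2 * K - 1)
    N = [0] * (2 * K - 1)
    for i in range(K):
        for j in range(K):
            S[i + j] += x[i] * y[j]
            N[i + j] += 1
    return S, N


x = [1, 0, -1]   # 1 - 4 = -3
y = [-1, 1, 1]   # -1 + 2 + 4 = 5
S, N = column_sums(x, y)
print(S, N, sum(s * 2**i for i, s in enumerate(S)))
```

The counts rise by one per column up to the column of weight \(2^{K-1}\) and then fall by one per column, and the weighted column sums reproduce the product.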

Equation (3) shows that a linear cascade of \( 2K-1 \) multi-input redundant binary adders (MIRBA), one for each column of the product matrix, is needed to generate the radix-\( 2^K \) digit product. Note that the number of inputs to each stage of the linear cascade of adders is different and is given by \( N_i \).

The columns of weight \( 2^i \) \( (K \leq i \leq 2K-2) \) of the product matrix can be considered as forming a carry or collective product transfer (CPT) to the next more significant radix-\( 2^K \) digital position (see Figure 4). These CPT columns have weights \( 2^{i-K} \) with respect to the more significant digital position. When these CPT columns are added in the appropriate (that is, the same weight) MIRBA of the more significant digital position, all the stages of the linear cascade of MIRBAs become identical, each stage having \( K \) inputs. This is shown in Figure 4.

Equation (2) shows that it is necessary to add one radix-\( 2^K \) digit \( a_i \) to the output of the digit multiplier. So, a MIRBA capable of adding \( (K+1) \) redundant binary \( (\overline{1},0,1) \) inputs is required.
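The input count per adder stage can be verified with a small sketch. It assumes the column counts \(N_i = i+1\) for \(0 \leq i \leq K-1\) and \(N_i = (2K-1)-i\) for \(K \leq i \leq 2K-2\), folds the CPT column of weight \(2^i\) into stage \(i-K\), and adds one input for the accumulator digit. The function name is hypothetical.

```python
def mirba_input_counts(K):
    """Inputs seen by each of the K adder stages of one digit position."""
    N = [i + 1 if i <= K - 1 else (2 * K - 1) - i for i in range(2 * K - 1)]
    own = N[:K]            # columns 0 .. K-1 generated at this digit position
    cpt = N[K:] + [0]      # CPT columns K .. 2K-2 land on stages 0 .. K-2
    return [own[s] + cpt[s] + 1 for s in range(K)]   # + 1 accumulator input


for K in (2, 3, 4, 5):
    print(K, mirba_input_counts(K))
```

Every stage ends up with \(K+1\) inputs, and the most significant stage is the only one that receives no CPT element.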

3.1.3.2 Design of Multi-input Redundant Binary Adder (MIRBA)

A MIRBA is a limited carry-borrow propagation adder which accepts several redundant binary inputs (digit set \( \{\overline{1},0,1\} \)) and produces one redundant binary output (with appropriate adder "Transfers" for more significant adjacent adder stages).

Let us define a new parameter \( \alpha_j^b \). The redundant binary output of any MIRBA is dependent on the "Transfers" (the composite term for carry/borrow) input to that MIRBA. In a redundant number system, the "Transfers" are functions of the "primary" inputs (inputs other than Transfer inputs) to only a limited number of adjacent less significant MIRBAs. \( \alpha_j^b \) denotes the number of such adjacent MIRBAs whose "primary" inputs determine the output of a given MIRBA.

The radix-\( 2^K \) digit processing logic in, say, DPM\(_i\) consists of a \( K \)-stage linear cascade of \( (K+1) \)-input MIRBAs. Except for the most significant MIRBA in the \( K \)-stage cascade, the inputs to the MIRBAs in DPM\(_i\) are functions of radix-\( 2^K \) operand digits in DPM\(_i\) and in the adjacent less significant DPM (the accumulator digit, the multiplier digit \( m_j \) and the multiplicand digits of the two modules). (See Figure 9.) Thus \( \alpha_j \) is related to \( \alpha_j^b \) by equation (4).

\[
\alpha_j = \left\lceil \frac{\alpha_j^b - 1}{K} \right\rceil + 1 \tag{4}
\]

Three different approaches for the arithmetic (algebraic) design of a MIRBA were considered.

3.1.3.2.1 Rohatsch's Technique [7]

This technique is an explicit transformation technique which converts a given input digit set into the required output digit set by a series of Simple Transformations. Using this approach, we find that for a \( (K+1) \) input MIRBA, a four level (3 stages of Simple Transformations) redundant binary structure is necessary for \( K > 2 \). It can be shown [5] that a four level (in the arithmetic sense) redundant binary structure can be designed to accept as many as \( 51 \) redundant binary inputs and produce one redundant binary output by higher order Simple Transformations.

However, the logic design of the bottom (nearest to the input digit set) two levels is highly complicated for \( K \geq 5 \), that is, radix 32 and above, if they are to be implemented in two or three logic levels. In practice, the technique is to break down the bottom level structure into equivalent simpler structures, frequently at the cost of increasing the number of levels. Moreover, such a structure is not suitable for LSI implementation because no elementary logic cell, other than the primitive NAND and NOR, can be repetitively used in a uniformly interconnected pattern for the implementation of a (K+1) input MIRBA.

Table 1 shows the values of $\alpha_j^b$ and $\alpha_j$ for various values of $K$ for a (K+1) input MIRBA.

3.1.3.2.2 $\log_2$-sum Tree Technique

A conceptually simple approach is to realize the (K+1) input MIRBA by a $\log_2$-sum tree structure of two input redundant binary adders (RBA-2). The logic design of such an RBA-2 was studied in detail by Borovec [8], and we shall interchangeably use the term Borovec Unit (BU) for RBA-2 in the following discussion. Figure 5 shows one such BU consisting of a cascade combination of a symmetric subtractor [9] (SS), a symmetric adder [9] (SA) and a D-element. The D-element decomposes a redundant binary input into positive (0,1) and negative (0,$\overline{1}$) binary outputs.

For a (K+1) input MIRBA, the tree structure has $t$ levels such that

$$t = \lceil \log_2(K+1) \rceil$$

and the number of BUs required is $K$. Figure 6 shows the $\log_2$-sum tree structure for a six input MIRBA.

In this configuration,

$$\alpha_j^b = 2t = 2\lceil \log_2(K+1) \rceil$$

and

$$\alpha_j = \left\lceil \frac{2\lceil \log_2(K+1) \rceil - 1}{K} \right\rceil + 1.$$

The value of $\alpha_j$ for various values of $K$ is tabulated in Table 1. From the table we find that for $K=2$ and $K=4$, that is, radices 4 and 16, the value of $\alpha_j$ is 3, and that it is 2 for all other values of $K$. Since the minimum value of $\alpha_j$ is desirable, a different arrangement of BUs, described in the third approach given next, can be used to achieve $\alpha_j = 2$.
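The dependence parameters of the sum tree can be tabulated with a short sketch. It rests on an assumption that the extracted text does not settle: the bracket around the quotient \((2t-1)/K\) is read as a ceiling. Under that reading the radices 4 and 16 come out one larger than the other radices. The function name is hypothetical.

```python
def log_sum_tree_parameters(K):
    """Return (t, MIRBA-level dependence, DPM-level dependence) for a (K+1)-input tree.

    Assumption: the DPM-level value is ceil((2t - 1) / K) + 1.
    """
    t = K.bit_length()                    # equals ceil(log2(K + 1)) for K >= 1
    alpha_b = 2 * t                       # number of adjacent MIRBAs involved
    alpha = -(-(alpha_b - 1) // K) + 1    # ceiling division, then + 1
    return t, alpha_b, alpha


for K in range(2, 9):
    print(K, log_sum_tree_parameters(K))
```

The tree depth grows only logarithmically in \(K\), so for larger \(K\) the transfers stay within one neighboring module under this reading.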

3.1.3.2.3 Tree-structure using RBA-3s and RBA-2s

In this configuration, 3-input redundant binary adders (RBA-3) and RBA-2s are connected in a tree structure.

An RBA-3 consists of two BUs, a D-element and a C-element arranged as shown in Figure 7. The C-element composes two binary inputs (0,1; 0,$\overline{1}$) into one redundant binary ($\overline{1}$,0,1) output. The lower BU, in combination with the C-element and the D-element, acts as a redundant binary (3,2) counter. The upper BU forms the sum of the sum-output of the lower BU and the "Transfer" output of the lower BU of the adjacent less significant RBA-3.

For the design of a (K+1) input MIRBA, RBA-3s are used whenever they can be fully utilized, that is, whenever three inputs are available for addition, and RBA-2s are used when only two inputs are to be added at any level of the tree structure. (An exception occurs for $K=3$, where the $\log_2$-sum tree technique is necessary.) Figure 8 shows a 6 input MIRBA using this configuration.

The number of BUs required in this technique is $K$ for a (K+1)-input MIRBA. The number of BU levels is also $\lceil \log_2(K+1) \rceil$. Table 1 shows the values of $\alpha_j^b$ and $\alpha_j$ for various values of $K$. It shows that $\alpha_j$ is 2 for all values of $K$ except $K=3$.

The tree structure configurations described in 3.1.3.2.2 and 3.1.3.2.3 have the following advantages compared to Rohatsch's technique:

a) It is more general and has the same configuration for any value of $K$.

b) It makes use of only one kind of cell, that is, Borovec Unit for the implementation of MIRBA.

c) The various BUs are uniformly and regularly interconnected.

Because of b) and c) above, this implementation meets the LSI constraints of structure regularity and minimum part number type.

3.2 Logic Design

The major consideration in the logic design is the choice of an encoding for the radix-\( 2^K \) Signed Digit. The encoding of the digit has implications for the logic complexity of the digit processing logic and also for the pin complexity (total number of pins) of the DPM.

As suggested in section 3.1.2, two modes of representation for a radix-\( 2^K \) Signed Digit are used:

a) Sign-Magnitude (SM) mode

b) Redundant Binary mode

In SM mode, a radix-\( 2^K \) Signed Digit is encoded as a (K+1)-tuple binary logic vector. This requires (K+1) binary storage elements. Conversion of an input operand in the conventional number system (binary and sign-magnitude) to the Signed Digit form necessary for the DPMs is trivial. It is carried out by just attaching the sign of the conventional number to each group of K bits. For these reasons, SM mode encoding for the radix-\( 2^K \) Signed Digit is used for internal storage of the operand and result digits and for transfer of these digits between DPMs.
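The conversion from a conventional sign-magnitude operand can be sketched as follows, assuming a fixed number of digits and a Python integer as the conventional operand. The function names and the tuple layout `(sign bit, magnitude)` are hypothetical.

```python
def to_signed_digits(value, k, n_digits):
    """Split a conventional integer into n_digits radix-2**k SM-mode digits, MSD first.

    Each digit is (sign bit, magnitude); the operand sign is attached to every
    group of k magnitude bits.
    """
    mag = abs(value)
    if mag >= 1 << (k * n_digits):
        raise ValueError("operand does not fit in the given number of digits")
    sign = 1 if value < 0 else 0
    mask = (1 << k) - 1
    return [(sign, (mag >> (pos * k)) & mask) for pos in reversed(range(n_digits))]


def signed_digits_value(digits, k):
    """Integer value of an MSD-first list of (sign bit, magnitude) digits."""
    total = 0
    for sign, mag in digits:
        total = total * (1 << k) + (-mag if sign else mag)
    return total


d = to_signed_digits(-27, 2, 3)
print(d, signed_digits_value(d, 2))
```

No arithmetic is needed for the conversion itself: the magnitude bits are regrouped and the single operand sign is copied into each digit.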

Redundant binary mode is used for the arithmetic processing, specifically for the multi-input redundant binary adders in the DPM. Each redundant binary digit requires 2 bits for representation, and thus a radix-\( 2^K \) Signed Digit is represented by a logic vector of length \( 2K \). Borovec [8] studied in detail the logic design of two input redundant binary adders (RBA-2) for all the nine distinct ways, under permutation and negation, of assigning three values ($\overline{1}$,0,1) to four states of two binary variables.

Out of these nine distinct formats, the sign-magnitude format (referred to as SM format) is found to be the most suitable. In this format, a redundant binary digit $\ell_i$ in ($\overline{1}$,0,1) is represented by two binary variables in (0,1), written here as a sign $\ell_i^s$ and a magnitude $\ell_i^m$, and the SM encoding is given by

$$\ell_i = (-1)^{\ell_i^s}\, \ell_i^m.$$

Although Borovec's study shows that this format does not give minimal logic complexity for a BU, this still seems to be the best compromise for the following reasons:

i) Conversion from SM mode to SM format is trivial and involves appending the single sign bit to each of the \( K \) magnitude bits.

ii) Negation of a radix-\( 2^K \) digit for subtraction requires only complementation of each of the sign bits.

iii) The design of a one-dimensional iterative encoder to convert the redundant binary output of the MIRBAs to SM mode is simplest for the SM format encoding.

iv) In the product-matrix generator, the logic for the generation of each product digit of the matrix consists of a single AND gate and an Exclusive-OR gate. (In practice, however, both the multiplier and multiplicand radix-\( 2^K \) digits are in SM mode and the product-matrix logic consists of \( K^2 \) AND gates and only one Exclusive-OR gate for the sign.)
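The product cell for SM format digits can be sketched in a few lines, assuming each redundant binary digit is held as a pair of bits (sign, magnitude) with value \((-1)^{\text{sign}} \cdot \text{magnitude}\). The function names are hypothetical.

```python
def rb_product_cell(x_sign, x_mag, y_sign, y_mag):
    """Product of two redundant binary digits in SM format.

    Returns (sign bit, magnitude bit): one Exclusive-OR and one AND.
    """
    return x_sign ^ y_sign, x_mag & y_mag


def rb_value(sign, mag):
    """Value in {-1, 0, 1} of an SM format redundant binary digit."""
    return -mag if sign else mag


print(rb_product_cell(1, 1, 0, 1))  # (-1) * (+1)
```

When all redundant binary digits of an operand share one sign bit, the Exclusive-OR is needed only once per pair of radix-\(2^K\) digits, which is the saving noted in the parenthetical remark above.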

Let \( \lambda_1, \lambda_2 \) and \( m_1, m_1 \) denote the redundant binary inputs and \( d_1, d_1 \) denote the output of a BU. Let \( t_1^-, t_1^+ \) and \( t_1^- \) be the input and output "Transfers." The internal logic of a BU is given by the set of Boolean equations (5).

\[
\begin{align*}
\lambda^+_1 = (-1)^{d_1} & \lambda^-_1 \\
m^+_1 = (-1)^{d_1} m^-_1 \\
d^+_1 = t_1^- + m_1 + t_1^+ \\
e_1 = t_1 \\
t_1^- = \lambda^+_1 \lambda^-_1 \\
t_1^+ = m_1 \lambda_1 \\
t_1^-_1 = (t_1^- + m_1) t_1^- + e_1 m_1 + e_1 \\
t_1^+_1 = (t_1^+ + m_1) t_1^+ + e_1 m_1 + e_1 \\
\end{align*}
\]

If, however, the inputs and outputs are encoded in Transfer Format \( (\lambda^+_1 = \lambda^-_1 = \lambda_1, \text{etc.}) \), then a BU can be implemented by a cascade of two ordinary binary full adders—each acting as a \( (3:2) \) counter [5]. This leads to a simpler logic cell for the LSI implementation of MIRBA at the cost of larger logic delay in the generation of MIRBA output.

3.2.1 Pin Complexity

Figure 9 shows the schematic diagram of the radix-\( 2^K \) digit processing logic in a DPM. It consists of one product-matrix generator, \( K \) \((K+1)\)-input multi-input redundant binary adders (MIRBA) and an encoder which converts the redundant binary result output of the MIRBAs to SM mode for storage or inter-DPM communication. \( \text{AT}_i \) (\( \text{AT}_{i-1} \)) represents the adder "Transfers" into (out of) the least significant (most significant) MIRBA of DPM\(_i\), from (into) the most significant (least significant) MIRBA of the adjacent DPM, DPM\(_{i+1}\) (DPM\(_{i-1}\)). \( \text{CPT}_i \) (\( \text{CPT}_{i-1} \)) is the "collective product transfer" from (into) the product-matrix generator of the adjacent DPM, DPM\(_{i+1}\) (DPM\(_{i-1}\)).

The number of pins required for the digit processing logic is the sum of the number of pins required for \( \text{AT}_i \), \( \text{AT}_{i-1} \), \( \text{CPT}_i \), \( \text{CPT}_{i-1} \), for one multiplier digit and for the SM encoded output of the MIRBAs. The tree structure configuration of a \((K+1)\)-input MIRBA uses \( K \) BUs, and each BU requires two pins for the Transfers \((0,1;\ 0,\overline{1})\). Thus \( 2K \) pins are needed for \( \text{AT}_i \) or \( \text{AT}_{i-1} \). \( \text{CPT}_i \) (\( \text{CPT}_{i-1} \)) requires one pin for the sign information (assuming both the multiplier and multiplicand digits are SM encoded) and \( (K-1) + (K-2) + \cdots + 1 = K(K-1)/2 \) pins for magnitude information. Therefore, the total number of pins required by the digit processing logic, excluding control, is given by

\[
P = P_{\text{AT}_i} + P_{\text{AT}_{i-1}} + P_{\text{CPT}_i} + P_{\text{CPT}_{i-1}} + P_m + P_{\text{EAO}}
\]

\[
\begin{aligned}
P &= 2K + 2K + 2\left( \frac{K(K-1)}{2} + 1 \right) + (K+1) + (K+1) \\
  &= K^2 + 5K + 4
\end{aligned}
\]

where

\[
\begin{aligned}
P_{\text{AT}_i} &= \text{number of pins required for } \text{AT}_i, \text{ etc.} \\
P_m &= \text{number of pins required for the multiplier digit} \\
P_{\text{EAO}} &= \text{number of pins required for the SM encoded output of the MIRBAs.}
\end{aligned}
\]

Table 2 shows the values of \( P \) for various values of \( K \). The value of \( P \) for \( K \geq 5 \) is impractical.
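The pin total can be recomputed from its parts with a short sketch. The breakdown used here (2K pins per transfer direction, \(K(K-1)/2 + 1\) pins per CPT direction, \(K+1\) pins each for the multiplier digit and the encoded output) is a reading of the text that is consistent with the closed form \(K^2 + 5K + 4\); the function name is hypothetical, and the printed values are computed from the formula, not copied from Table 2.

```python
def pins_direct(K):
    """Pins of the combinational digit processing logic without any encoding."""
    at = 2 * K                      # one positive and one negative transfer per BU
    cpt = K * (K - 1) // 2 + 1      # magnitude pins of the CPT columns plus one sign pin
    multiplier_digit = K + 1        # SM mode: sign bit and K magnitude bits
    encoded_output = K + 1          # SM mode result digit
    return 2 * at + 2 * cpt + multiplier_digit + encoded_output


for K in range(1, 9):
    print(K, pins_direct(K))
```

The quadratic term comes entirely from the collective product transfer, which is why the text next looks at ways of avoiding its direct transmission.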

However, the number of pins required for \( \text{AT}_i \) (\( \text{AT}_{i-1} \)) can be reduced if the \( K \) "positive transfers" \((0,1)\) and the \( K \) "negative transfers" \((0,\overline{1})\) of the \((K+1)\)-input MIRBA are encoded. They can each be encoded as \( \lceil \log_2(K+1) \rceil \) outputs. The "Transfer encoder" can be implemented using \( \leq K \) conventional binary full adders arranged in \( \lceil \log_2(K+1) \rceil \) levels [10]. The corresponding "Transfer decoder" in the adjacent DPM is simply a fan-out network. In this network, the encoded Transfer of weight \( W \) is fanned out to \( W \) "Transfer" inputs (of the same sign as the encoded Transfer) of the least significant MIRBA of the adjacent DPM. Similarly, the number of pins required for \( \text{CPT}_i \) (\( \text{CPT}_{i-1} \)) can also be reduced from \( K(K-1)/2 + 1 \) to only \((K+1)\) if \( \text{CPT}_i \) is generated in DPM\(_i\) instead of in DPM\(_{i+1}\). Note that \( \text{CPT}_i \) is a function of the multiplicand digit in DPM\(_{i+1}\) and the multiplier digit, the latter being the same in both DPM\(_{i+1}\) and DPM\(_i\). Thus DPM\(_i\) needs to know only the multiplicand digit in DPM\(_{i+1}\) to generate \( \text{CPT}_i \).

This requires only \((K+1)\) pins. Let us call this method of transmitting \( \text{CPT}_i \) to DPM\(_i\) "Indirect Generation" (IG) for lack of a better term. Let \( P_{\text{TEG}} \) denote the total number of pins required when the Transfer Encoder (TE) and Indirect Generation (IG) of \( \text{CPT}_i \) methods are used. Then

\[
P_{\text{TEG}} = 2\lceil \log_2(K+1) \rceil + 2\lceil \log_2(K+1) \rceil + (K+1) + (K+1) + (K+1) + (K+1) = 4\lceil \log_2(K+1) \rceil + 4(K+1)
\]
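The transfer encoding and the resulting pin count can be sketched as follows. The sketch models the encoder as a count of active transfer lines and the decoder as a fan-out of each code bit to as many unit-weight lines as its weight; it does not model how those lines are mapped onto the \(K\) physical transfer inputs of the neighboring adder. The pin formula assumes the reading \(4\lceil \log_2(K+1) \rceil + 4(K+1)\). All function names are hypothetical.

```python
def code_width(K):
    """ceil(log2(K + 1)) for K >= 1, computed exactly with integer arithmetic."""
    return K.bit_length()


def encode_transfers(lines):
    """Replace K one-bit transfer lines by the binary count of active lines, LSB first."""
    count = sum(lines)
    return [(count >> w) & 1 for w in range(code_width(len(lines)))]


def decode_transfers(code):
    """Fan each code bit of weight W = 2**w out to W unit-weight lines."""
    unit_lines = []
    for w, bit in enumerate(code):
        unit_lines.extend([bit] * (1 << w))
    return unit_lines


def pins_te_ig(K):
    """Pins with encoded transfers and indirect generation of the CPT."""
    at = 2 * code_width(K)   # encoded positive and negative transfers, one direction
    cpt = K + 1              # the multiplicand digit is passed instead of the CPT
    return 2 * at + 2 * cpt + (K + 1) + (K + 1)


code = encode_transfers([1, 0, 1, 1, 0])
print(code, sum(decode_transfers(code)), pins_te_ig(5))
```

Only the number of active transfers matters to the receiving adder, so the count carries the same arithmetic information as the individual lines.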

The value of \( P_{\text{TEG}} \) is tabulated in Table 2 for various values of \( K \). The table shows quite a reduction in the pin complexity, although it is achieved at the expense of introducing an additional type of cell (full adders) in the LSI implementation of the DPM.

The number of pins can be still further reduced if the multiplier digit has a redundancy ratio \( \rho \) such that \( \rho < 2/3 \). Such a multiplier digit can be recoded in Nonadjacent Format (that is, the recoded digit has only nonadjacent nonzero redundant binary digits). This reduces the maximum number of nonzero redundant binary inputs to a MIRBA from \( (K+1) \) to \( \left\lceil \frac{K}{2} \right\rceil + 1 \). The number of Borovec Units required in the realization of the MIRBA is correspondingly reduced to \( \left\lceil \frac{K}{2} \right\rceil \). Hence the number of pins when the multiplier digit redundancy is reduced to \( \rho < 2/3 \) is given by

\[
P_{\text{TEG}} = 4 \left\lceil \log_2 \left( \left\lceil \frac{K}{2} \right\rceil + 1 \right) \right\rceil + 4(K+1), \qquad \rho < \frac{2}{3}
\]

This is tabulated in Table 2 and shows that only a minor reduction occurs.
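The effect of nonadjacent recoding can be checked with a small sketch. It uses the standard nonadjacent-form recoding of an integer and assumes, as an interpretation of the redundancy condition, that the multiplier digit magnitude is at most \(\lfloor 2^{K+1}/3 \rfloor\), which is the largest value a \(K\)-digit nonadjacent form can hold. The pin formula assumes ceilings around \(K/2\) and around the logarithm. Function names are hypothetical.

```python
def naf(m):
    """Nonadjacent form of the integer m, LSB first, digits in {-1, 0, 1}."""
    digits = []
    while m != 0:
        if m % 2:
            d = 2 - (m % 4)   # +1 if m = 1 (mod 4), -1 if m = 3 (mod 4)
            m -= d
        else:
            d = 0
        digits.append(d)
        m //= 2
    return digits


def pins_te_ig_naf(K):
    """Pins with transfer encoding, indirect CPT generation and a recoded multiplier."""
    half = (K + 1) // 2                           # ceil(K / 2)
    return 4 * half.bit_length() + 4 * (K + 1)    # bit_length gives ceil(log2(half + 1))


print(naf(7), pins_te_ig_naf(4), pins_te_ig_naf(5))
```

Comparing these values with the unrecoded count \(4\lceil \log_2(K+1) \rceil + 4(K+1)\) shows a saving of only a few pins, in line with the remark that the reduction is minor.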

Table 2 shows the pins required for only the combinational part of the digit processing logic of the arithmetic element or DPM. In order to perform the arithmetic algorithm serially on a digit by digit basis, some amount of local sequential control is necessary, although such control is very simple. The local control complexity is independent of the radix. The choice of the radix for implementation of a DPM on a single LSI chip would depend on the maximum allowable number of pads (pins) on the chip. A trade-off must also be made between the cost of the combinational part (arithmetic cost) and the cost of the sequential control, both in terms of their logic complexity and their pin complexity, when choosing a radix or value of \( K \). Details about the local control required for a DPM can be found in Goyal [5].

Although the design presented in the present paper was specifically developed for serial processing in an iterative structure, the same arithmetic element can also be used in a purely combinational, parallel two-dimensional array arrangement for arithmetic processing. It can also be used in a bus structured and associative processor configuration with proper sequential local control.

4. Summary

The design of an arithmetic element for use in a linear, iteratively structured arithmetic unit, in which arithmetic processing takes place serially, on a digit by digit basis and with the most significant digit first, is presented. The main consideration in the design was the desirability of the arithmetic element's implementation (as a single module) in LSI technology. The design of the digit processing logic is described at two levels. Starting from the arithmetic specification of the digit processing logic, the arithmetic design (namely, the choice of number system, number representation and the digit algorithm) is developed, with the radix as the design parameter. Then a brief description of the logic design of the digit processing logic which implements the digit algorithm is given. The implications of the logic design for the LSI implementation, namely for the logical complexity, part number type and pin requirements, are discussed.

5. Acknowledgments

The author is grateful to his advisor, Professor James E. Robertson, for the invaluable guidance, useful suggestions and encouragement throughout this work. The author would also like to express his appreciation to his colleague Ms. Mary J. Irwin for many suggestions that improved the readability of this paper, to Mrs. June Wingler for typing the paper and to Mr. Mark Goebel for the figures.

References


Appendix

Operation of a DM

The following description has been adapted from Pisteri [3] with slight modifications.

The processing performed by the DMs can be described by the following:

\[
\begin{aligned}
\overline{x}_3 &= \tau_z(j-1,k-1,j'k') \\
\overline{x}_2 &= \tau_z(j-1,k-1,j') \\
\overline{x}_1 &= \tau_z(j-1,k-1) \\
\overline{x}_0 &= \tau_z(j-1)
\end{aligned}
\tag{A.1}
\]

and

\[
\begin{aligned}
\overline{x}_1 &= \tau_z(j-1,k-1,j'k') \\
\overline{x}_2 &= \tau_z(j-1,k-1,j') \\
\overline{x}_3 &= \tau_z(j-1,k-1) \\
\overline{x}_4 &= \tau_z(j-1)
\end{aligned}
\tag{A.2}
\]

where

\( \overline{x}_i \) is the operand information contained in the \( i \)-th DM immediately following the execution of \( \mu_j \). It consists of the \( i \)-th digit of each of the active operands.

\( v_j \) is the function employed to obtain the new operand set and is dependent on the microinstruction to be performed.

\( F_k \) is a "modifier" value which \( \mathrm{DM}_k \) transmits to \( \mathrm{DM}_{k+1} \) with the microinstruction to be performed next.

\( e_j \) is the function which each DM performs to determine \( g_j \).

\( a_j \) is the number of DMs which must cooperate with the right neighbor of the DM performing \( u_j \) in order to generate the necessary \( g_j \).

\( j \) is the value which \( \mathrm{DM}_k \) transmits to the DM executing \( u_j \).

\( f_j \) is the function \( \mathrm{DM}_k \) employs to determine \( g_j \).

The operation of a typical DM, \( \mathrm{DM}_i \) say, is as follows. It begins in a state in which it is receptive to information defining the next microinstruction to be performed. \( \mathrm{DM}_i \) receives this information and the value of \( j \) from its left neighbor \( \mathrm{DM}_{i-1} \). Then \( \mathrm{DM}_i \) determines \( g_j \). The function \( f_j \) can be implemented either totally in parallel in \( \mathrm{DM}_i \), from inputs of the active operand digits in \( \mathrm{DM}_{i+1} \) through \( \mathrm{DM}_{i+a_j} \), or in a time sequential fashion. The former approach takes many more pins. A detailed description of the two approaches can be found in [5].

\( \mathrm{DM}_i \) also determines \( F_k \) by performing equation (A.3) and transmits this value and the identity of \( u_j \) to \( \mathrm{DM}_{i+1} \).

Some time later, \( \mathrm{DM}_i \) receives a signal from \( \mathrm{DM}_{i-1} \) indicating that \( \mathrm{DM}_{i-1} \) has executed \( u_j \). \( \mathrm{DM}_i \) then executes \( u_j \) (the necessary \( g_j \) and \( a_j \) being ready by now). At this time \( \mathrm{DM}_i \) transmits a signal to \( \mathrm{DM}_{i+1} \) which indicates that \( \mathrm{DM}_{i+1} \) may execute \( u_j \).

When \( \mathrm{DM}_i \) receives an acknowledgment from \( \mathrm{DM}_{i+1} \), it goes into a state where it is receptive to information concerning \( u_{j+1} \). The sequence above then repeats.
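The signalling sequence described above can be sketched as a small event-driven simulation. This is an illustrative model only, not the control logic of [5]: it tracks the order of signals and leaves out the arithmetic (the functions that produce \( g_j \) and \( F_k \)). The state names, the message names `instr`, `go` and `ack`, and the controller that feeds the leftmost DM are all hypothetical. The sketch also assumes that the "may execute" signal travels from each DM to its right neighbor, that each DM acknowledges its left neighbor once it has executed, and that the next microinstruction is issued only after the chain is idle.

```python
from collections import deque

RECEPTIVE, WAIT_GO, WAIT_ACK = "receptive", "wait_go", "wait_ack"


def run_chain(n_dms, n_micro):
    """Simulate the signalling order for n_micro microinstructions.

    Returns (executed, state): executed lists (dm_index, j) pairs in the
    order the DMs execute u_j; state is the final state of every DM.
    """
    if n_dms < 1:
        raise ValueError("need at least one DM")
    state = [RECEPTIVE] * n_dms
    executed = []
    for j in range(n_micro):
        # Hypothetical controller: hands u_j to the leftmost DM and then
        # tells it that it may execute.
        queue = deque([("instr", 0), ("go", 0)])
        while queue:
            kind, i = queue.popleft()
            last = i == n_dms - 1
            if kind == "instr":
                assert state[i] == RECEPTIVE
                # DM_i would determine g_j and F_k here (not modelled).
                state[i] = WAIT_GO
                if not last:
                    queue.append(("instr", i + 1))  # pass u_j to the right
            elif kind == "go":
                assert state[i] == WAIT_GO
                executed.append((i, j))  # DM_i executes u_j
                if i > 0:
                    queue.append(("ack", i - 1))  # acknowledge left neighbor
                if last:
                    state[i] = RECEPTIVE  # assumption: no right neighbor to wait for
                else:
                    state[i] = WAIT_ACK
                    queue.append(("go", i + 1))  # right neighbor may execute
            else:  # "ack" from the right neighbor
                assert state[i] == WAIT_ACK
                state[i] = RECEPTIVE
    return executed, state


if __name__ == "__main__":
    order, final = run_chain(n_dms=3, n_micro=2)
    print(order)
    print(final)
```

In this model each microinstruction is executed by the DMs in left-to-right order, which matches processing with the most significant digit first, and every DM is back in the receptive state before the next microinstruction is issued.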

Table 2

<table>
<thead>
<tr>
<th>radix</th>
<th>\( k \)</th>
<th>\( s \)</th>
<th>\( \rho \) \( s^{1/2} \)</th>
<th>\( \rho \) \( s^{1/2} \) \( \leq 2/3 \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>2</td>
<td>18</td>
<td>20</td>
<td>16</td>
</tr>
<tr>
<td>8</td>
<td>3</td>
<td>28</td>
<td>24</td>
<td>24</td>
</tr>
<tr>
<td>16</td>
<td>4</td>
<td>34</td>
<td>32</td>
<td>28</td>
</tr>
<tr>
<td>32</td>
<td>5</td>
<td>36</td>
<td>36</td>
<td>36</td>
</tr>
<tr>
<td>64</td>
<td>6</td>
<td>40</td>
<td>40</td>
<td>40</td>
</tr>
<tr>
<td>128</td>
<td>7</td>
<td>44</td>
<td>44</td>
<td>44</td>
</tr>
<tr>
<td>256</td>
<td>8</td>
<td>52</td>
<td>52</td>
<td>52</td>
</tr>
</tbody>
</table>

* \( \rho \) is the redundancy ratio for the multiplier digit.

Table 1

Values of \( a_j \) and \( g_j \) for Various \((k+1)\) Input MINA Configurations

<table>
<thead>
<tr>
<th>\( r \)</th>
<th>\( k \)</th>
<th>\( a_j \)</th>
<th>\( g_j \)</th>
<th>\( a_j \)</th>
<th>\( g_j \)</th>
<th>\( a_j \)</th>
<th>\( g_j \)</th>
<th>\( a_j \)</th>
<th>\( g_j \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>2</td>
<td>3</td>
<td>2</td>
<td>4</td>
<td>3</td>
<td>4</td>
<td>3</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>8</td>
<td>3</td>
<td>4</td>
<td>2</td>
<td>4</td>
<td>2</td>
<td>5</td>
<td>3</td>
<td>5</td>
<td>2</td>
</tr>
<tr>
<td>16</td>
<td>4</td>
<td>4</td>
<td>2</td>
<td>5</td>
<td>3</td>
<td>5</td>
<td>2</td>
<td>5</td>
<td>2</td>
</tr>
<tr>
<td>32</td>
<td>5</td>
<td>4</td>
<td>2</td>
<td>5</td>
<td>2</td>
<td>5</td>
<td>2</td>
<td>5</td>
<td>2</td>
</tr>
<tr>
<td>64</td>
<td>6</td>
<td>5</td>
<td>2</td>
<td>5</td>
<td>2</td>
<td>6</td>
<td>2</td>
<td>6</td>
<td>2</td>
</tr>
<tr>
<td>128</td>
<td>7</td>
<td>5</td>
<td>2</td>
<td>5</td>
<td>2</td>
<td>6</td>
<td>2</td>
<td>6</td>
<td>2</td>
</tr>
<tr>
<td>256</td>
<td>8</td>
<td>5</td>
<td>2</td>
<td>5</td>
<td>2</td>
<td>6</td>
<td>2</td>
<td>6</td>
<td>2</td>
</tr>
</tbody>
</table>
Figure 1. Organization of the linear, iteratively structured arithmetic unit

Figure 2. Positional representation of digit processing logic

Figure 3. Product matrix

Figure 4. Illustration of overlapping product matrices and "collective product transfer"

Figure 5. Reversed shift/de shift redundant binary adder (RSH-2)