# θ(logN) ARCHITECTURES FOR RNS ARITHMETIC **DECODING\***

K. M. Elleithy, M. A. Bayoumi, and K. P. Lee

The Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70504, U.S.A.

#### ABSTRACT

Decoding in Residue Number System (RNS) based architectures can be a bottleneck. A high-speed and flexible modulo decoder is an essential computational element for maintaining the advantages of RNS. In this paper, a fast and flexible modulo decoder based on the Chinese Remainder Theorem (CRT) is presented. It decodes a set of residues into its equivalent representation in either the unsigned magnitude or the 2's complement binary number system. Two different architectures are analyzed: the first is based on Carry Save Adders (CSA), while the other is based on a modified structure of Carry Save Adders (MCSA). Both architectures are modular and are built from simple cells, which leads to an efficient VLSI implementation. The proposed decoder is fast; it has a time complexity of $\theta(\log N)$.

#### 1. Introduction

Recently, RNS has received increased attention due to its ability to support high-speed concurrent arithmetic [1-3]. Applications such as the fast Fourier transform, digital filtering, and image processing exploit the efficiency of RNS addition and multiplication; they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of several special purpose processors such as digital signal processors [4]. Since special purpose processors are attached to general purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the speed gain in RNS operations, the residue-to-binary conversion can be a bottleneck.
It is mainly carried out employing the Chinese Remainder Theorem (CRT) [5,6].

Several implementations of the residue decoder have been reported [7-12]. In [12], the proposed residue decoders are based on biased addition and take advantage of the fast addition speed of CSAs [13]; however, the conversion output is not in 2's complement form. The implementation in [11] requires that one of the moduli be a power of two; therefore, it may be limited in application. The residue decoders in [7,8] are based on three moduli of the form $(2^n-1, 2^n, 2^n+1)$. Because of the restriction on the number and choice of moduli, they are limited in application. In [10], the residue decoder is based on the base extension technique and uses modular look-up tables in its implementation. Since two moduli are fed into a look-up table, the moduli must not be large for the implementation to be feasible. In addition, it does not support conversion from residues to the 2's complement binary number system. Although look-up tables are used in this scheme, its time complexity is $\theta(N^2)$. In [14,15], the scheme used has a time complexity of $\theta((\log N)^2)$. In [9], a scheme of $\theta(\log NP)$ (where P is the number of bits) is used to support only the unsigned magnitude binary number system.

In this paper, a $\theta(\log N)$ residue decoder capable of decoding a set of residues to its equivalent representation in the unsigned magnitude or the 2's complement binary number system is introduced. Two different architectures, using CSAs (based on [16]) and MCSAs [17], are implemented. In the following section, the RNS theory is reviewed. Section 3 discusses how this fast and flexible residue decoder can be implemented. Section 4 evaluates the speed performance of this residue decoder.

### 2. Residue Number System

In RNS, an integer, X, can be represented by an N-tuple of residues,

$$X = (r_1, r_2, \dots, r_N)$$

where $r_i = |X|_{m_i}$, with respect to a set of N moduli $\{m_1, m_2, \dots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,

$$GCD(m_i, m_j) = 1 \quad \text{for } i \neq j.$$

It can then be shown that there is a unique representation for each number in the range $0 \le X < \prod_{i=1}^N m_i = M$, where N is the number of moduli.

An arithmetic operation on two integers A and B is equivalent to the same operation on their residue representations, that is,

$$\left| A \cdot B \right|_{M} = \left( \left| \left|A\right|_{m_1} \cdot \left|B\right|_{m_1} \right|_{m_1}, \left| \left|A\right|_{m_2} \cdot \left|B\right|_{m_2} \right|_{m_2}, \dots, \left| \left|A\right|_{m_N} \cdot \left|B\right|_{m_N} \right|_{m_N} \right)$$

where '$\cdot$' can be addition, subtraction, or multiplication. Therefore, it is desirable to convert binary arithmetic on large integers into residue arithmetic on small residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.

For applications in digital signal processing, it is helpful to define a dynamic range for the RNS with positive and negative integers.
The dynamic range is defined as $\left[-\frac{M-1}{2}, \frac{M-1}{2}\right]$ for M odd and as $\left[-\frac{M}{2}, \frac{M}{2} - 1\right]$ for M even, or more specifically, for M odd,

$$X = \begin{cases} Z & \text{if } Z \le \frac{M-1}{2} \\ Z - M & \text{if } Z > \frac{M-1}{2} \end{cases}$$

and for M even,

$$X = \begin{cases} Z & \text{if } Z < \frac{M}{2} \\ Z - M & \text{if } Z \ge \frac{M}{2} \end{cases}$$

where Z is an integer within the legitimate range, $0 \le Z < M$. Any integer, X, within the dynamic range can be represented by N residues.

<sup>\*</sup>This work was supported in part by NSF grant No. MIP-8809811.

The conversion from RNS to the weighted binary number system is done by using the CRT, which states that

$$\left|X\right|_{M} = \left|\sum_{j=1}^{N} \hat{m}_{j} \left|\frac{r_{j}}{\hat{m}_{j}}\right|_{m_{j}}\right|_{M} \tag{1}$$

where

$$M = \prod_{j=1}^{N} m_{j}, \quad \hat{m}_{j} = \frac{M}{m_{j}}$$

and $\left|\frac{r_j}{\hat{m}_j}\right|_{m_j}$ denotes $r_j$ multiplied by the multiplicative inverse of $\hat{m}_j$ modulo $m_j$. Although the CRT provides a direct, fast, and simple conversion formula, the lack of a large and fast modulo M adder has held back this approach.

#### 3. The Residue Decoder

The residue decoder based on the CRT can be implemented by a modulo M adder tree. The modulo M adders at each level are used to correct the partial sum so that it is within the legitimate range. Since a modulo M adder is very slow, such an implementation may degrade the overall speed of an RNS processor. In addition, the CRT only converts residues to their binary representation in the legitimate range, not in the dynamic range. Therefore, conversion to the 2's complement binary number system requires a final correction.
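As an illustrative aside, the CRT of eq. (1) and the dynamic-range mapping of Section 2 can be sketched in a few lines of Python. This is a minimal software sketch under stated assumptions, not the hardware decoder: the function names `crt_decode` and `to_signed` and the moduli {3, 5, 7} are chosen here purely for illustration.

```python
from math import prod

def crt_decode(residues, moduli):
    """Return Z in [0, M) with Z mod m_j == r_j, following eq. (1)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m                  # m_hat_j = M / m_j
        inv = pow(m_hat, -1, m)         # multiplicative inverse of m_hat_j modulo m_j
        total += m_hat * ((r * inv) % m)
    return total % M

def to_signed(Z, M):
    """Map Z from the legitimate range [0, M) into the dynamic range."""
    if M % 2 == 1:
        return Z if Z <= (M - 1) // 2 else Z - M
    return Z if Z < M // 2 else Z - M

moduli = [3, 5, 7]                      # illustrative moduli, pairwise relatively prime
M = prod(moduli)                        # M = 105
x = -17
residues = [x % m for m in moduli]      # Python's % yields non-negative residues
Z = crt_decode(residues, moduli)
print(Z, to_signed(Z, M))               # 88 -17
```

The final `% M` in `crt_decode` is the large modulo M reduction that the stages described below are designed to avoid performing with a slow modulo M adder.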
In order to implement a high-speed residue decoder that can perform conversion to both the unsigned magnitude and the 2's complement binary number systems, the following solutions are proposed:

- The number of modulo M adders or binary adders should be reduced to a minimum.
- CSAs or MCSAs can be used wherever multi-operand addition is required, because of their high addition speed.
- Correction can be performed only at the last stage, where it supports conversion to both the unsigned magnitude and the 2's complement binary number systems.

For ease of design, the residue decoder is partitioned into 4 stages as shown in Figure 1. The inputs to the residue decoder are the residues and a control line, C, which determines whether the output is in the unsigned magnitude or the 2's complement number system.

Figure 1. Block Diagram of the Residue Decoder

# 3.1. Partial Sum Generator

The inputs to this stage are the N residues. The main function of this stage is to compute the partial sums, $t_i$'s, where

$$t_i = \left| \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i} \right|_{M}$$

Since $m_i$ is usually small, the value of $t_i$ can be obtained by accessing a look-up table with a small address space. Hence, $r_i$ serves as the ROM address input, and $t_i$ is obtained from the ROM output.

In most cases, it is better to reduce the number of partial sums, $t_i$'s, in order to reduce the complexity of the subsequent stages and hence increase the speed of the residue decoder as a whole. Since a modulus $m_j$ can be represented by a $\left\lceil \log_2 m_j \right\rceil$-bit binary number, the jth residue can be written as

$$r_j = \sum_{k=0}^{\left\lceil \log_2 m_j \right\rceil - 1} 2^k \, b_k^j$$

where $b_k^j \in \{0, 1\}$. By substituting $r_j$ in eq. (1), we can rewrite the CRT as follows:

$$\left| X \right|_{M} = \left| \sum_{j=1}^{N} \left| \sum_{k=0}^{\left\lceil \log_{2} m_{j} \right\rceil - 1} \hat{m}_{j} \left| \frac{1}{\hat{m}_{j}} \right|_{m_{j}} 2^{k} b_{k}^{j} \right|_{M} \right|_{M} \tag{2}$$

Hence, if we have a set of 8 moduli $\{2,3,5,7,11,13,17,23\}$ with residues $\{r_1,r_2,r_3,r_4,r_5,r_6,r_7,r_8\}$, respectively, only 4 ROMs with 7-bit address inputs are needed to implement this level, and a modulo M summation of 4 operands instead of 8 is needed, where

$$\begin{split}
t_1 &= \left| \frac{M}{2} \left| \frac{r_1}{M/2} \right|_2 + \frac{M}{3} \left| \frac{r_2}{M/3} \right|_3 + \frac{M}{5} \left| \frac{r_3}{M/5} \right|_5 \right|_M \\
t_2 &= \left| \frac{M}{7} \left| \frac{r_4}{M/7} \right|_7 + \frac{M}{11} \left| \frac{r_5}{M/11} \right|_{11} \right|_M \\
t_3 &= \left| \frac{M}{13} \left| \frac{r_6}{M/13} \right|_{13} + \sum_{k=0}^{2} \frac{M}{17} \left| \frac{1}{M/17} \right|_{17} 2^k b_k^7 \right|_M \\
t_4 &= \left| \sum_{k=3}^{4} \frac{M}{17} \left| \frac{1}{M/17} \right|_{17} 2^k b_k^7 + \frac{M}{23} \left| \frac{r_8}{M/23} \right|_{23} \right|_M
\end{split}$$

### 3.2. Partial Sum Adder

By far, the modulo M summation of the partial sums, $t_i$'s, poses the biggest challenge to the implementation of the residue decoder because of the slow computational speed of a modulo M adder. This stage can be implemented using two different approaches.

### §3.2.1. Implementation using CSA

A multilevel CSA tree, consisting of CSAs and a carry propagate adder (CPA) [13], is used to reduce the A partial sums, $t_i$'s, to a sum, S; N-2 CSAs are needed when there is one partial sum per residue (A = N). Let l be the number of levels in a CSA tree, and $\theta(l)$ be the maximum number of operands that can be processed with an l-level CSA tree. We can compute $\theta$ by the recursive formula provided by Avizienis [18].
$$\theta(l) = \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor * 3 + \theta(l-1) \bmod 2 \quad \text{for } l = 2, 3, \dots, \text{ and initially } \theta(1) = 3 \tag{3}$$

A CSA tree for adding 6 operands is shown in Figure 2. The CPA is an (m-1)-bit two-level carry lookahead adder (CLA) [13], where

$$m = \left\lceil \log_2(MA) \right\rceil$$

Hence, the output S is an m-bit number that is passed to the next stage. The complexity of the scheme is determined by Theorem 1.

Figure 2. An Example for Partial Sum Adder

**Theorem 1:** The addition of N numbers using CSAs can be performed in $\theta(\log N)$ steps.

Proof: The number of levels in a CSA tree is determined by:

$$\theta(l) = \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor * 3 + \theta(l-1) \bmod 2$$

To determine the number of levels required to add N numbers, let us consider the following two cases:

(i) $\theta(l-1)$ is even, then:

$$\left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{1}{2}\theta(l-1) \tag{4}$$

$$\theta(l-1) \bmod 2 = 0 \tag{5}$$

Substituting (4) and (5) in (3), we have:

$$\theta(l) = \frac{3}{2}\theta(l-1) \tag{6}$$

Since $\theta(1) = 3$, we can substitute in (6) to get successive values for $\theta(l)$ as follows:

$$\theta(2) = \left(\frac{3}{2}\right) * 3$$

$$\theta(3) = \left(\frac{3}{2}\right)^{2} * 3$$

$$\theta(4) = \left(\frac{3}{2}\right)^{3} * 3$$

$$\theta(5) = \left(\frac{3}{2}\right)^{4} * 3$$

$$\theta(l) = \left(\frac{3}{2}\right)^{l-1} * 3 = \left(\frac{3}{2}\right)^{l} * 2$$

$\theta(l)$ represents the number of operands that can be added using a CSA tree that has l levels.
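For concreteness, recurrence (3) can also be evaluated numerically. The sketch below is an illustration only and not part of the proof: the function names `max_operands` and `levels_needed` are assumptions made here. It tabulates the exact values of $\theta(l)$ and finds the smallest number of CSA levels that can handle a given operand count.

```python
def max_operands(levels):
    """theta(l) from recurrence (3): operands an l-level CSA tree can reduce."""
    theta = 3                                   # theta(1) = 3
    for _ in range(levels - 1):
        theta = (theta // 2) * 3 + theta % 2
    return theta

def levels_needed(n_operands):
    """Smallest l with theta(l) >= n_operands (assumes n_operands >= 3)."""
    level = 1
    while max_operands(level) < n_operands:
        level += 1
    return level

print([max_operands(l) for l in range(1, 8)])   # [3, 4, 6, 9, 13, 19, 28]
print(levels_needed(6))                         # 3
```

The tabulated values grow by a factor close to 3/2 per level, which is the behavior the two cases of the proof bound analytically.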
Suppose that the number of operands is N; then:

$$N = \left(\frac{3}{2}\right)^{l} * 2$$

Taking the logarithm of both sides, we have:

$$\log N = l * \log \frac{3}{2} + \log 2$$

Then:

$$l = \frac{\log N - \log 2}{\log \frac{3}{2}}$$

We can find constants $C_1 > 0$, $C_2 > 0$, and $N_0 \ge 0$, such that for all $N \ge N_0$ the following is true:

$$C_1 \log N \le \frac{\log N - \log 2}{\log \frac{3}{2}} \le C_2 \log N \tag{7}$$

Then

$$C_1 \log N \le l \le C_2 \log N \quad \forall N \ge N_0 \tag{8}$$

Possible values for $C_1$, $C_2$, and $N_0$ are 1, 2, and 6 (with logarithms to base 2). Equation (8) means that $l = \theta(\log N)$.

(ii) $\theta(l-1)$ is odd, then:

$$3 * \left\lfloor \frac{\theta(l-1)}{2} \right\rfloor = \frac{3}{2}\theta(l-1) - 1.5 \tag{9}$$

$$\theta(l-1) \bmod 2 = 1 \tag{10}$$

Substituting (9) and (10) in (3), we get:

$$\theta(l) = \frac{3}{2}\theta(l-1) - \frac{1}{2} \tag{11}$$

Since $\theta(1) = 3$, we can substitute in (11) to get successive values for $\theta(l)$ as follows:

$$\theta(2) = \left(\frac{3}{2}\right) * 3 - \frac{1}{2}$$

$$\theta(3) = \left(\frac{3}{2}\right)^{2} * 3 - \left(\frac{3}{2}\right) * \frac{1}{2} - \frac{1}{2}$$

$$\theta(4) = \left(\frac{3}{2}\right)^{3} * 3 - \left(\frac{3}{2}\right)^{2} * \frac{1}{2} - \left(\frac{3}{2}\right) * \frac{1}{2} - \frac{1}{2}$$

$$\theta(l) = \left(\frac{3}{2}\right)^{l-1} * 3 - \left(\left(\frac{3}{2}\right)^{l-2} + \left(\frac{3}{2}\right)^{l-3} + \dots + 1\right) * 0.5$$

$$= \left(\frac{3}{2}\right)^{l} * 2 - \frac{\left(\frac{3}{2}\right)^{l-1} - 1}{\frac{3}{2} - 1} * 0.5$$

$$= \frac{4}{3}\left(\frac{3}{2}\right)^{l} + 1$$

Suppose that the number of operands is N; then:

$$N = \frac{4}{3}\left(\frac{3}{2}\right)^{l} + 1$$

Using the same analytical method used for the case of even $\theta(l-1)$, we can find constants $C_1 > 0$, $C_2 > 0$, and $N_0 \ge 0$, such that for all $N \ge N_0$ the following is true:

$$C_1 \log N \le \frac{\log \frac{3(N-1)}{4}}{\log \frac{3}{2}} \le C_2 \log N$$

From the previous analysis of both cases (i) and (ii), N numbers can be added using CSAs in $\theta(\log N)$ steps. $\square$

### §3.2.2. Implementation using MCSA

The MCSA is based on the idea of representing a number as a Carry and a Sum, as in a CSA. It can be used for the modulo addition of two numbers to obtain a scheme whose speed is constant and does not depend on the number of bits. A CSA works by not completing the addition at a given stage but postponing it to the final stage; in the intermediate stages, numbers are represented as a Sum and a Carry to avoid the complete addition process.

The MCSA is used to add two numbers A and B in modulo m. Figure 3.a shows that A is represented as a pair of numbers $(A_S, A_C)$, B is represented as $(B_S, B_C)$, and the output C is represented as $(C_S, C_C)$. Each number is represented as a group of Sum bits and Carry bits. The representation of A by $A_S$ and $A_C$ is not unique. The condition that needs to be satisfied is:

$$\left| A_S + A_C \right|_{m} = \left| A \right|_{m}$$

One possible representation is:

$$A_S = |A|_m \qquad A_C = 0$$

Figure 3.a. A Modified CSA (MCSA).

We need to add four numbers $(A_S, A_C, B_S, B_C)$, which requires two CSA steps. After the addition, we need to detect whether a correction of $-m$ or $2*(-m)$ is required to adjust the result. The adjusting process takes at most three steps.
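The Sum/Carry representation described above can be illustrated with a short Python sketch that uses integers as bit vectors. It is a simplified illustration under stated assumptions: the names `csa` and `add4_carry_save` are hypothetical, the operands are unbounded Python integers rather than n-bit words, and the modulo-m correction steps of the MCSA are deliberately omitted.

```python
def csa(x, y, z):
    """One carry-save step: three operands in, a (sum, carry) pair out."""
    s = x ^ y ^ z                                # bitwise sum, carries not propagated
    c = ((x & y) | (x & z) | (y & z)) << 1       # carries, shifted to their weight
    return s, c

def add4_carry_save(a_s, a_c, b_s, b_c):
    """Reduce four operands to a (sum, carry) pair using two CSA steps."""
    t1, t2 = csa(a_s, a_c, b_s)
    return csa(t1, t2, b_c)

s, c = csa(13, 7, 9)
print(s, c, s + c)                               # 3 26 29
```

The pair always satisfies `s + c == x + y + z`, so the full carry-propagating addition can be postponed to a single final adder.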
The proposed algorithm for modulo m addition of two numbers can be described as follows:

#### Algorithm modulo\_add (A, B, Result)

Input: Two variables A and B in modulo m. A is represented as $A_S$ and $A_C$; B is represented as $B_S$ and $B_C$. All variables are n bits.

Output: Variable Result, represented as $Result_C$ and $Result_S$. The relation between A, B, and Result is $Result = \left| A + B \right|_{m}$.

```
Procedure:
begin
  { Step 1 }
  Do in parallel
  begin
    Call Sum(temp1, A_S, A_C, B_S)
    Call Carry(temp2, A_S, A_C, B_S)
  end
  { Step 2 }
  Do in parallel
  begin
    Call Sum(temp3, temp1, temp2, B_C)
    Call Carry(temp4, temp1, temp2, B_C)
  end
  { Step 3: first correction, if needed }
  Case (temp2[n+1] + temp4[n+1]) of
    0: Do in parallel
       begin
         Result_S := temp3
         Result_C := temp4
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, (2^n - m))
         Call Carry(temp6, temp3, temp4, (2^n - m))
       end
    2: Do in parallel
       begin
         Call Sum(temp5, temp3, temp4, 2*(2^n - m))
         Call Carry(temp6, temp3, temp4, 2*(2^n - m))
       end
  end case
  { Step 4: second correction, if needed }
  Case (temp6[n+1]) of
    0: Do in parallel
       begin
         Result_S := temp5
         Result_C := temp6
       end
       exit
    1: Do in parallel
       begin
         Call Sum(temp7, temp5, temp6, (2^n - m))
         Call Carry(temp8, temp5, temp6, (2^n - m))
       end
  end case
  { Step 5: third correction, if needed }
  Case (temp8[n+1]) of
    0: Do in parallel
       begin
         Result_S := temp7
         Result_C := temp8
       end
    1: begin
         Do in parallel
         begin
           Call Sum(temp9, temp7, temp8, (2^n - m))
           Call Carry(temp10, temp7, temp8, (2^n - m))
         end
         Do in parallel
         begin
           Result_S := temp9
           Result_C := temp10
         end
       end
  end case
end.

{ Sum bit: exclusive OR of the three input bits }
Sum (A, B, C, D)
begin
  Do in parallel (1 ≤ i ≤ n)
    A[i] := B[i] ⊕ C[i] ⊕ D[i]
end

{ Carry bit: majority of the three input bits, shifted one position left }
Carry (A, B, C, D)
begin
  A[1] := 0
  Do in parallel (1 ≤ i ≤ n)
    A[i+1] := (B[i] ∧ C[i]) ∨ (B[i] ∧ D[i]) ∨ (C[i] ∧ D[i])
end
```

An implementation of the algorithm is shown in Figure 3.b.

Figure 3.b. Different Stages of the MCSA.

**Theorem 2:** The modulo adder scheme for adding two n-bit numbers in modulo m has an asymptotic time complexity of $\theta(1)$.

Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits n is used to prove the theorem.

- [1] Basis step: for n = 0, no numbers are added, and in this case the required number of steps is zero.
- [2] Induction hypothesis: assume for a fixed arbitrary n≥0 that the maximum number of steps is five.
- [3] Induction step: for numbers with n+1 bits, let $\eta = temp_2[n+1] + temp_4[n+2]$. Then we have the following cases:

(a) $\eta=0$: the carry propagation stops at bit n, and the addition ends after at most five steps according to the induction hypothesis.

(b) $\eta=1$: the correction is $2^{n+1}-m$ in step 3. Since $m>2^n$, $2^{n+1}-m<2^n$, which means that $(2^{n+1}-m)[n]=0$. In the worst case, $temp_3[n+1]$ and $temp_4[n+2]$ are both equal to one. This means that $temp_5[n+1] = 0$ and $temp_6[n+2] = 1$, and then $temp_8[n+2] = 0$. In this case the correction is done in two steps (step 3 and step 4).

(c) $\eta=2$: the correction is $2*(2^{n+1}-m)$ in step 3. In the worst case, $temp_3[n+1]$, $temp_4[n+2]$, and the corresponding bit of $2*(2^{n+1}-m)$ are all equal to one. Then $temp_5[n+1]=1$, $temp_6[n+1]=1$, and $(2^{n+1}-m)[n]=0$. At step 4, $temp_7[n+1]=0$ and $temp_8[n+2]=1$. At step 5, $temp_9[n+1]=1$ and $temp_{10}[n+2]=0$.
In this case the correction is done in three steps (steps 3-5).

Since the adder has a fixed number of stages that does not depend on the operands' length, it can be used in the implementation of a pipelined multi-operand modulo addition scheme [19].

Example: As an example, the modulo addition of A = 1272 and B = 450 for m = 2050 is shown in Figure 3.c. The representation of A and B is not unique; one valid representation is used in the figure, which shows the detailed modulo addition operation. In step 1 we get $temp_2[13] = 1$, and in step 2 we get $temp_4[13] = 1$, which means that at step 3 we have to add $2(2^n - m)$. At step 3 we get $temp_6[13] = 1$, which means that at step 4 we have to add $2^{n}-m$. At step 4 we get $temp_{8}[13] = 0$, which means that the addition process stops at step 4. The result of step 4 is the final result.

Figure 3.c. A Detailed Example for the Modulo Addition.

**Theorem 3:** Adding n numbers $(y_1, y_2, \ldots, y_n)$ in modulo M is equivalent to:

- [1] Adding $(y_1, y_2)$, ..., $(y_i, y_{i+1})$, ..., and $(y_{n-1}, y_n)$ in modulo M, which gives $y_{12}, \ldots, y_{(n-1)n}$.
- [2] Repeating step [1] on $(y_{12}, y_{34}), \ldots, (y_{(n-3)(n-2)}, y_{(n-1)n})$.
- [3] Repeating step [2] $\lceil \log n \rceil - 2$ times to obtain one final output represented as a sum and carry.
Proof: To add two numbers a and b in modulo M, we have the following cases:

(i) $a < M$ and $b < M$: then $a = |a|_M$ and $b = |b|_M$, so

$$\left| a + b \right|_M = \left| \, |a|_M + |b|_M \right|_M \tag{12}$$

(ii) $a \ge M$ and $b < M$: then $b = |b|_M$ and $a = M + x$, so

$$\left| a + b \right|_{M} = \left| M + x + b \right|_{M} = \left| x + b \right|_{M} \tag{13}$$

Since $x < M$ and $b < M$, then from (12) and (13):

$$\left| a + b \right|_M = \left| \, |a|_M + |b|_M \right|_M$$

(iii) $a < M$ and $b \ge M$: like case (ii).

(iv) $a \ge M$ and $b \ge M$: then $a = M + x$ and $b = M + y$, so

$$\left| a+b \right|_{M} = \left| M+x+M+y \right|_{M} = \left| x+y \right|_{M} = \left| \, |a|_M + |b|_M \right|_{M}$$

From the previous four cases:

$$\left| a+b \right|_{M} = \left| \, |a|_M + |b|_M \right|_{M} \tag{14}$$

Since addition is associative, then:

$$\left| y_1 + y_2 + \dots + y_n \right|_M = \left| (y_1 + \dots + y_{n/2}) + (y_{n/2+1} + \dots + y_n) \right|_{M} = \left| \, \left| y_1 + \dots + y_{n/2} \right|_M + \left| y_{n/2+1} + \dots + y_n \right|_M \right|_M \quad \text{(using (14))}$$

We can further expand this expression using the same method until the right-hand side is expressed only in terms of additions of two operands in modulo M.

Theorem 3 means that adding n numbers in modulo M can be performed using a binary tree consisting of units that are capable of adding only two numbers in modulo M. MCSAs are used as the building blocks to perform the addition process.
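The tree organization implied by Theorem 3 can be sketched in Python as follows. This is a behavioral sketch under stated assumptions: `mod_add` is a hypothetical stand-in for one two-operand modulo M adder (it uses ordinary integer arithmetic rather than the sum/carry form of an MCSA), and `tree_mod_sum` and the sample operands are illustrative only.

```python
def mod_add(a, b, M):
    """Two-operand modulo-M adder (stands in for one MCSA cell)."""
    return (a + b) % M

def tree_mod_sum(values, M):
    """Sum a non-empty list modulo M with a binary tree of two-operand adders.

    Returns the result and the number of tree levels used.
    """
    level = [v % M for v in values]
    depth = 0
    while len(level) > 1:
        nxt = [mod_add(level[i], level[i + 1], M)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])        # an unpaired operand passes through
        level = nxt
        depth += 1
    return level[0], depth

total, depth = tree_mod_sum([88, 91, 45, 12, 100, 7, 63, 30], 105)
print(total, depth)                      # 16 3
```

With 8 operands the tree has 3 levels, in line with the $\lceil \log n \rceil$ levels stated in the theorem; all additions within one level are independent and can proceed in parallel.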
Since the MCSA requires that its inputs be represented in sum and carry form, this form must be enforced at all levels. It is enforced automatically for levels $\geq 2$, because the outputs of the previous levels are already in the correct form. For the first level we have the following:

$$Y_{iS} = y_i, \quad Y_{iC} = 0 \quad \forall\, 1 \le i \le n$$

For the last stage, the output is in the form of a sum and a carry, which is exactly the same form as that produced by CSAs. Figure 3.d shows the binary tree.

Figure 3.d. Partial Sum Adder Implementation Using MCSAs

# 3.3. Range Determinator

This stage consists of three levels, namely ROM, Magnitude Comparator (MC), and Bit Corrector (BC). The major function of this stage is to determine the range of S so that the appropriate value can be subtracted from S to obtain the desired result. In order to accomplish this, S has to be compared with the 2 sets of values shown in Table I. For simplicity, we explain the first set and then the second set.

Since the input to this stage, S, is a large binary number, it is partitioned into groups of adjacent bits. For example, if S is a 24-bit number, we can partition S into three 8-bit groups $G_1$, $G_2$, and $G_3$, where

$$G_1 = S_{7..0}, \quad G_2 = S_{15..8}, \quad \text{and} \quad G_3 = S_{23..16}$$

Since each group is fed into a ROM module as an address input, the number of bits in each group should be small so that small ROMs, which are fast and occupy a small silicon area, can be used to implement this level. However, the number of groups, g, should be kept as small as possible, since the complexity of the MC cells is a function of the number of ROM modules, g. Hence, there is a tradeoff between g and the number of bits in each group.

As shown in Figure 4, the input to the ith ROM module of the first set of ROMs is $G_i$, and the outputs are $B_j^i$'s and $C_k^i$'s.
The function of this ROM module is as follows:

$$B_j^i = \begin{cases} 0 & \text{if } G_i \leq \left(jM-1\right)_i \\ 1 & \text{if } G_i > \left(jM-1\right)_i \end{cases} \quad \text{for } j = 1..A-1$$

$$C_k^i = \begin{cases} 0 & \text{if } G_i \neq \left(kM-1\right)_i \\ 1 & \text{if } G_i = \left(kM-1\right)_i \end{cases} \quad \text{for } k = 2..A-1$$

where $(x)_i$ denotes the ith group of bits of the constant x, partitioned in the same way as S.

| A | First Set | Second Set (M odd) | Second Set (M even) |
|-----|-----------|-----------------------|-----------------------|
| 1 | $M-1$ | $\frac{M-1}{2}$ | $\frac{M}{2}-1$ |
| 2 | $2M-1$ | $\frac{3M-1}{2}$ | $\frac{3M}{2}-1$ |
| ⋮ | ⋮ | ⋮ | ⋮ |
| n-1 | $(n-1)M-1$ | $\frac{(2n-3)M-1}{2}$ | $\frac{(2n-3)M}{2}-1$ |

Table I. Values Compared by Multi-magnitude Comparators

Figure 4. Implementation of the first part of Range Determinator Stage

Clearly, these ROM modules serve as a partial multi-magnitude comparator that compares the input pattern S with the first set of values shown in Table I and produces $g*(2A-3)$ outputs that are fed into the MC level.

The MC level consists of (A-1) MC cells. This level takes its input from the ROM level and performs further comparison, so that a 2-level multi-magnitude comparator is formed. The complexity of an MC cell is a function of the number of ROM modules.
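The two-level comparison described so far, per-group "greater" and "equal" flags from the ROM level combined most significant group first by an MC cell, can be sketched in software. This is a behavioral Python sketch under stated assumptions: the names `split_groups` and `exceeds`, the 4-bit grouping, and M = 105 are illustrative, and a real ROM would hold precomputed flags rather than compare at run time.

```python
def split_groups(value, group_bits, n_groups):
    """Split value into n_groups fields of group_bits bits, least significant first."""
    mask = (1 << group_bits) - 1
    return [(value >> (i * group_bits)) & mask for i in range(n_groups)]

def exceeds(S, threshold, group_bits, n_groups):
    """Return 1 if S > threshold, using only per-group comparisons."""
    G = split_groups(S, group_bits, n_groups)
    T = split_groups(threshold, group_bits, n_groups)
    B = [int(g > t) for g, t in zip(G, T)]      # ROM level: group is greater
    C = [int(g == t) for g, t in zip(G, T)]     # ROM level: group is equal
    result, all_higher_equal = 0, 1
    for i in reversed(range(n_groups)):         # most significant group first
        result |= B[i] & all_higher_equal       # greater here, equal in all higher groups
        all_higher_equal &= C[i]
    return result

M = 105                                         # illustrative modulus product
print([exceeds(S, 2 * M - 1, 4, 3) for S in (209, 210, 300)])   # [0, 1, 1]
```

Comparing against the threshold $lM-1$ with a strict "greater than" is what lets the comparator output indicate $S \ge lM$.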
If we have g ROM modules, then the Boolean equation for the lth MC cell is as follows:

$$MC_{l}^{1} = B_{l}^{g} + B_{l}^{g-1}C_{l}^{g} + B_{l}^{g-2}C_{l}^{g}C_{l}^{g-1} + \dots + B_{l}^{2}C_{l}^{g}C_{l}^{g-1}C_{l}^{g-2} \dots C_{l}^{3} + B_{l}^{1}C_{l}^{g}C_{l}^{g-1} \dots C_{l}^{3}C_{l}^{2} \tag{15}$$

Hence we have

$$MC_l^1 = \begin{cases} 0 & \text{if } S < lM \\ 1 & \text{if } S \ge lM \end{cases} \quad \text{for } l = 1, 2, \dots, A-1 \tag{16}$$

Since S may be larger than several of the values compared, the outputs of several MC cells may be set to 1; therefore, the BC level is used to ensure that only one output line is set to one, which indicates the appropriate range. To do so, A identical $BC^1$ cells are needed, and their common Boolean equation is as follows:

$$BC_m^1 = \overline{MC_m^1 + \overline{MC_{m-1}^1}} \quad \text{for } m = 1, 2, \dots, A$$

where

$$MC_0^1 = 1 \text{ and } MC_A^1 = 0 \tag{17}$$

Hence, the range of S is determined to be $(m-1)M \leq S < mM$ if $BC_m^1 = 1$. Figure 4 shows the implementation of the multi-magnitude comparator that compares S with the first set of values shown in Table I, together with its BC level.

Thus far, the range determination enables the S modulo M operation to be performed as $S-(m-1)M$ if $BC_m^1$ is set to one. Therefore, only residue to unsigned magnitude number system conversion is possible. For residue to 2's complement number system conversion, however, the second set of values shown in Table I has to be compared with S by another multi-magnitude comparator, which is built in the same way as previously explained. Figure 5 shows that the input to the ith ROM module of the second set of ROMs is $G_i$, and the outputs are $D_j^i$'s and $E_k^i$'s.
The function of this ROM module is as follows: for M odd,

$$D_{j}^{i} = \begin{cases} 0 & \text{if } G_{i} \leq \left(\frac{(2j-1)M-1}{2}\right)_{i} \\ 1 & \text{if } G_{i} > \left(\frac{(2j-1)M-1}{2}\right)_{i} \end{cases} \quad \text{for } j = 1..A-1$$

$$E_{k}^{i} = \begin{cases} 0 & \text{if } G_{i} \neq \left(\frac{(2k-1)M-1}{2}\right)_{i} \\ 1 & \text{if } G_{i} = \left(\frac{(2k-1)M-1}{2}\right)_{i} \end{cases} \quad \text{for } k = 2..A-1$$

and for M even,

$$D_{j}^{i} = \begin{cases} 0 & \text{if } G_{i} \leq \left(\frac{(2j-1)M}{2} - 1\right)_{i} \\ 1 & \text{if } G_{i} > \left(\frac{(2j-1)M}{2} - 1\right)_{i} \end{cases} \quad \text{for } j = 1..A-1 \tag{20}$$

$$E_{k}^{i} = \begin{cases} 0 & \text{if } G_{i} \neq \left(\frac{(2k-1)M}{2} - 1\right)_{i} \\ 1 & \text{if } G_{i} = \left(\frac{(2k-1)M}{2} - 1\right)_{i} \end{cases} \quad \text{for } k = 2..A-1 \tag{21}$$

The MC level of this part is exactly the same as previously proposed, that is, it consists of A $MC^2$ cells, and each $MC^2$ cell has the following Boolean equation:

$$MC_l^{2} = D_l^{g} + D_l^{g-1}E_l^{g} + D_l^{g-2}E_l^{g}E_l^{g-1} + \dots + D_l^{2}E_l^{g}E_l^{g-1}E_l^{g-2} \dots E_l^{3} + D_l^{1}E_l^{g}E_l^{g-1} \dots E_l^{3}E_l^{2}$$

Since a different set of values is compared with S, we have for M odd,

$$MC_l^2 = \begin{cases} 0 & \text{if } S \le \frac{(2l-1)M-1}{2} \\ 1 & \text{if } S > \frac{(2l-1)M-1}{2} \end{cases}$$

and for M even,

$$MC_l^2 = \begin{cases} 0 & \text{if } S \le \frac{(2l-1)M}{2} - 1 \\ 1 & \text{if } S > \frac{(2l-1)M}{2} - 1 \end{cases}$$

The BC level for this part of the design consists of A $BC^2$ cells. Each of these cells has a control line C.
If C is equal to zero, then all the output lines of the BC level will be equal to one, and residue to unsigned magnitude number system conversion will be performed; otherwise, only one of the BC level output lines will be equal to one, and residue to 2's complement number system conversion will be performed. The Boolean equation for a $BC^2$ cell is as follows:

$$BC_m^2 = \overline{C \cdot \overline{\overline{MC_m^2} \cdot MC_{m-1}^2}} \quad \text{for } m = 1, 2, \dots, A$$

where

$$MC_0^2 = 1 \text{ and } MC_A^2 = 0$$

Therefore, for C = 1 and $BC_m^2 = 1$, the range of S is determined to be $\frac{(2m-3)M+1}{2} \leq S < \frac{(2m-1)M+1}{2}$ for M odd, and $\frac{(2m-3)M}{2} \leq S < \frac{(2m-1)M}{2}$ for M even (the lower bound is equal to zero when m = 1). Figure 5 shows the implementation of this part of the design.

Figure 5. Implementation of the Second part of Range Determinator Stage

## 3.4. Final Corrector

This stage consists of A tristate multiplexers and a carry lookahead adder. The $BC^1$ input lines are used to enable one of the tristate multiplexers, while the $BC^2$ input lines are used as the selectors of the multiplexers. If $BC_i^1$ is set, then $(i-1)M \leq S < iM$. The lower bound, $(i-1)M$, is subtracted from S if conversion to the unsigned magnitude number system is desired, or if S is less than $\frac{(2i-1)M+1}{2}$ for M odd or $\frac{(2i-1)M}{2}$ for M even; otherwise, the upper bound, $iM$, is subtracted from S. The implementation of this stage is shown in Figure 6. The CLA adds the 2's complement of the value to be subtracted to S and outputs the desired result.

Figure 6. Implementation of Final Corrector Stage

# 4. Performance Evaluation

- [1] The partial sum generator is implemented using small ROMs. If the number of residues is N and each residue is represented in P bits, then N ROMs are required. Since each ROM stores values bounded by M, the size of each ROM is $2^P * \lceil \log M \rceil$.
The total area required for this stage is $N * 2^P * \lceil \log M \rceil$. Since ROMs have a constant time delay (P is a small number) that does not depend on N, the delay of this stage is $\theta(1)$.
- [2] The partial sum adder is implemented in two different ways:
  (a) Using CSAs: The complexity of the scheme is determined by Theorem 1. Since each CSA has a constant time delay, the total time required to add N numbers in modulo M is $\theta(\log N)$.
  (b) Using MCSAs: The number of levels required to perform the addition of N numbers using a binary tree of MCSAs is $\lceil \log N \rceil$, as shown in Theorem 3. Since the time required at each level is constant (an MCSA has a constant delay), the total time required for this step using MCSAs is $\theta(\log N)$.
- [3] The range determinator consists of three different levels (Figure 4). The first level consists of g ROMs. The second level is the MC cells, which are combinational circuits that can be represented by a two-level switching function. Finally, the last level is a two-stage combinational circuit. The three levels have a constant time delay that does not depend on N; therefore, the range determinator has a time delay of $\theta(1)$.
- [4] The final corrector consists of two stages. The first stage has A tristate multiplexers, which have a constant delay equivalent to two serial NAND gates. The second stage is a CLA, which has a constant delay; for fewer than 64 bits, the delay is equivalent to that of 12 serial NAND gates, as shown in [13]. For a larger number of bits we can still obtain a constant delay CLA. Hence, the final corrector has a delay of $\theta(1)$.

From cases [1]-[4] we see that all stages except the partial sum adder have a constant time delay that does not depend on the number of residues N. Only the second stage requires $\theta(\log N)$ steps.

#### 5. Conclusions

The residue decoder introduced in this paper has a total delay of $\theta(\log N)$. In addition, it has several advantages, as listed below:

1) The design is quite modular and consists of simple cells such as small ROMs and MC cells. This makes the implementation of the whole residue decoder on a single chip possible.
2) It does not have any limitation on the moduli used.
3) It is flexible, since it can convert residues to either the unsigned magnitude or the 2's complement number system, and it is controlled by only one control line, C. This means that it can be applied to a wider range of applications.
4) It is fast compared with most previously proposed schemes, since it has a time complexity of $\theta(\log N)$.
5) It can be easily pipelined without any modifications.

#### Acknowledgement

The authors wish to thank the reviewers for their valuable comments.

#### References

- [1] M. A. Bayoumi, "Digital Filter VLSI Systolic Arrays over Finite Fields for DSP Applications," Proc. of the 6th IEEE Annual Phoenix Conference on Computers and Communications, pp. 194-199, Feb. 1987.
- [2] M. A. Bayoumi, G. A. Jullien, W. C. Miller, "A Look-up Table VLSI Design Methodology for RNS Structures Used in DSP Applications," IEEE Trans. on Circuits and Systems, vol. 34, no. 6, pp. 604-616, June 1987.
- [3] F. J. Taylor, "Residue Arithmetic: A Tutorial with Examples," IEEE Computer Magazine, pp. 50-62, May 1984.
- [4] M. A. Bayoumi, "A High Speed VLSI Complex Digital Signal Processor Based on Quadratic Residue Number System," VLSI Signal Processing II, pp. 200-211, IEEE Press, 1986.
- [5] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology, New York: McGraw-Hill, 1967.
- [6] W. K. Jenkins, "Techniques for Residue to Analog Conversion for Residue Encoded Digital Filters," IEEE Trans. Circuits Syst., vol. CAS-25, pp. 553-562, July 1978.
- [7] S. Andraos and H. Ahmed, "A New Efficient Memoryless Residue to Binary Converter," IEEE Trans. Circuits Syst., vol. 35, pp. 1441-1444, Nov. 1988.
- [8] K. M. Ibrahim and S. N. Saloum, "An Efficient Residue to Binary Converter Design," IEEE Trans. Circuits Syst., vol. 35, pp. 1156-1158, September 1988.
- [9] S. Bandyopadhyay, G. A. Jullien, and A. Sengupta, "A Systolic Array for Fault-Tolerant Digital Signal Processing Using A Residue Number System Approach," Proc. of Intl. Conf. on Systolic Arrays, pp. 577-586, 1988.
- [10] A. P. Shenoy and R. Kumaresan, "Residue to Binary Conversion for RNS Arithmetic Using Only Modular Look-up Tables," IEEE Trans. Circuits Syst., vol. 35, pp. 1158-1162, September 1988.
- [11] T. V. Vu, "Efficient Implementations of the CRT for Sign Detection and Residue Decoding," IEEE Trans. Comp., vol. C-34, pp. 646-651, July 1985.
- [12] C. N. Zhang, B. Shirazi, and D. Y. Y. Yun, "Parallel Designs for Chinese Remainder Conversion," Proc. IEEE 16th Annual Conf. on Parallel Processing, Aug. 1987.
- [13] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1978.
- [14] R. M. Capocelli and R. Giancarlo, "Efficient VLSI Networks for Converting an Integer from Binary System to Residue Number System and Vice Versa," IEEE Trans. Circuits Syst., vol. 35, pp. 1425-1430, Nov. 1988.
- [15] G. Alia and E. Martinelli, "A VLSI Algorithm for Direct and Reverse Conversion from Weighted Binary Number System to Residue Number System," IEEE Trans. Circuits Syst., vol. CAS-31, pp. 1033-1039, 1984.
- [16] K. P. Lee, M. A. Bayoumi and K. M. Elleithy, "A Fast and Flexible Residue Decoder Based on The Chinese Remainder Theorem," The 1989 International Symposium on Circuits and Systems.
- [17] K. M. Elleithy, "On Bit-Parallel Processing for Modulo Arithmetic," VLSI Technical Report TR86-8-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1986.
- [18] A. Avizienis, "A Study of Redundant Number Representations for Parallel Digital Computers," Ph.D. Thesis, Univ. of Illinois, Urbana, Illinois, May 1960.
- [19] K. M.
Elleithy, "On the Bit-Parallel Implementation for the Chinese Remainder Theorem," VLSI Technical Report TR87-8-1, The Center for Advanced Computer Studies, University of Southwestern Louisiana, 1987.
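
As a supplementary illustration, the four stages evaluated in Section 4 can be modeled in software. The following Python sketch is only a behavioral model under stated assumptions, not the hardware design itself: the function names (`partial_sum_terms`, `tree_sum`, `final_correct`, `decode`) are hypothetical, the moduli are assumed to be pairwise relatively prime as the CRT requires, the binary tree models only the $\lceil \log N \rceil$ depth of the adder (not the carry-save internals), and an integer division stands in for the ROM, MC, and BC levels of the range determinator.

```python
from math import prod


def partial_sum_terms(residues, moduli):
    """Model of the partial sum generator: one CRT term per residue.

    Each term is (M/m_j) * |r_j * (M/m_j)^-1|_{m_j}, which is always < M.
    In hardware these values would be read from small ROMs.
    """
    M = prod(moduli)
    terms = []
    for r, m in zip(residues, moduli):
        m_hat = M // m
        inv = pow(m_hat, -1, m)  # modular inverse (Python 3.8+)
        terms.append(m_hat * ((r * inv) % m))
    return terms, M


def tree_sum(values):
    """Add the terms pairwise, level by level, and count the levels.

    Assumption: a plain binary adder tree, used only to show that
    N operands need ceil(log2 N) levels.
    """
    values = list(values)
    levels = 0
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


def final_correct(S, M, twos_complement):
    """Range determination followed by the final correction.

    Finds i with (i-1)M <= S < iM, then subtracts the lower bound
    (unsigned, or signed with S below the midpoint) or the upper bound.
    """
    i = S // M + 1                  # stand-in for the range determinator
    x = S - (i - 1) * M             # subtract the lower bound
    half = (M + 1) // 2             # (M+1)/2 for M odd, M/2 for M even
    if twos_complement and x >= half:
        return S - i * M            # subtract the upper bound instead
    return x


def decode(residues, moduli, twos_complement=False):
    terms, M = partial_sum_terms(residues, moduli)
    S, _levels = tree_sum(terms)
    return final_correct(S, M, twos_complement)


if __name__ == "__main__":
    print(decode((2, 3, 2), (3, 5, 7)))                        # 23
    print(decode((1, 0, 2), (3, 5, 7)))                        # 100
    print(decode((1, 0, 2), (3, 5, 7), twos_complement=True))  # -5
```

With the moduli (3, 5, 7), M = 105, so the residue set (1, 0, 2) decodes to 100 in unsigned magnitude and to -5 when the signed interpretation is selected, mirroring the role of the control line C.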