



| System            | Intro<br>Date | Technology | Class            | Nominal<br>Clock<br>Period<br>(ns) | Nominal<br>Clock<br>Frequency<br>(MHz) |
|-------------------|---------------|------------|------------------|------------------------------------|----------------------------------------|
| Cray-X-MP         | 1982          | MSI ECL    | Vector Processor | 9.5                                | 105.3                                  |
| Cray-1S,-1M       | 1980          | MSI ECL    | Vector Processor | 12.5                               | 80.0                                   |
| CDC Cyber 180/990 | 1985          | ECL        | Mainframe        | 16.0                               | 62.5                                   |
| IBM 3090          | 1986          | ECL        | Mainframe        | 18.5                               | 54.1                                   |
| Amdahl 58         | 1982          | LSI ECL    | Mainframe        | 23.0                               | 43.5                                   |
| IBM 308X          | 1981          | LSI TTL    | Mainframe        | 24.5, 26.0                         | 40.8,38.5                              |
| Univac 1100/90    | 1984          | LSI ECL    | Mainframe        | 30.0                               | 33.3                                   |
| MIPS-X            | 1987          | VLSI CMOS  | Microprocessor   | 50.0                               | 20.0                                   |
| HP-900            | 1982          | VLSI CMOS  | Micro-mainframe  | 55.6                               | 18.0                                   |
| Motorola 68020    | 1985          | VLSI CMOS  | Microprocessor   | 60.0                               | 16.7                                   |
| Bellmac-32A       | 1982          | VLSI CMOS  | Microprocessor   | 125.0                              | 8.0                                    |



























































































































$$E = \int_{t}^{t+T} V_{DD} \cdot i_{V_{DD}}(\tau) \cdot d\tau$$

Energy consumed by CSE during one clock period T, where *t* is chosen to include all relevant transitions: arrival of new data, clock pulse, and output transition.

This energy has four components:

$$E = E_{switching} + E_{short-circuit} + E_{leakage} + E_{static}$$

Switching Energy:

$$E_{switching} = \sum_{i=1}^{N} \alpha_{0-1}(i) \cdot C_i \cdot V_{swing}(i) \cdot V_{DD}$$

- $N$ is the number of nodes
- $C_i$ is the capacitance of node $i$
- $\alpha_{0-1}(i)$ is the probability that a 0-to-1 transition occurs at node $i$
- $V_{swing}(i)$ is the voltage swing of node $i$
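The switching-energy sum can be evaluated directly once per-node values are known. The sketch below is only an illustration: the `Node` class and the three node values are hypothetical and are not taken from any circuit in these slides.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One circuit node (hypothetical container, not from the slides)."""
    alpha_0_1: float   # probability of a 0-to-1 transition per clock cycle
    cap_farads: float  # node capacitance C_i in farads
    v_swing: float     # voltage swing of the node in volts


def switching_energy(nodes, v_dd):
    """Sum alpha_0-1(i) * C_i * V_swing(i) * V_DD over all nodes (joules)."""
    return sum(n.alpha_0_1 * n.cap_farads * n.v_swing * v_dd for n in nodes)


# Illustrative values only: two full-swing nodes and one reduced-swing node.
nodes = [
    Node(alpha_0_1=0.5, cap_farads=2e-15, v_swing=1.8),
    Node(alpha_0_1=0.25, cap_farads=4e-15, v_swing=1.8),
    Node(alpha_0_1=1.0, cap_farads=1e-15, v_swing=0.9),
]
e_sw = switching_energy(nodes, v_dd=1.8)
print(f"E_switching = {e_sw * 1e15:.2f} fJ")
```

A node with a reduced swing contributes proportionally less energy, which is why the swing appears as a separate factor from the supply voltage.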











Energy Breakdown in Clocked-Storage Elements during one of the possible input data transitions

|                         | <i>E</i><sub>0-0</sub>  | <i>E</i><sub>0-1</sub>  | <i>E</i><sub>1-0</sub>  | <i>E</i><sub>1-1</sub>  |
|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|
| <i>E</i><sub>clk</sub>  | Y/N                     | Y                       | Y                       | Y/N                     |
| <i>E</i><sub>int</sub>  | Y/N                     | Y                       | Y                       | Y/N                     |
| <i>E</i><sub>ext</sub>  | N                       | Y/N                     | Y/N                     | N                       |

#### Two cases:

- Storage elements without pre-charge nodes
- Storage elements with pre-charge nodes

Nov. 14, 2003 Digital System Clocking: Oklobdzija, Stojanovic, Markovic, Nedovic

# CSE characterization and Test setup










# Interface with Clock Network and Combinational Logic

- We assumed that the data and clock inputs were supplied by drivers with sufficient drive strength.
- The input clock and data capacitances are important interface parameters for the clock network and logic design.
- The clock network designer and logic designer need to be aware of these capacitances in order to design circuits that drive storage elements.



*Interface with Combinational Logic:*

The relevant parameters to the combinational logic designer are:

- CSE input data slope
- Input data capacitance

The data slope affects performance and energy consumption of both driving logic and storage elements. Clock and data slopes are generally not equal.

*Interface with Clock Network:*

- CSEs are affected by clock skew and clock slope.
- The total load of the clock distribution network is defined by the input capacitance of the clock node and number of CSEs on a chip.
- Increase in *clock slope* results in degradation of the CSE performance: the clock network designer has to know what slopes the CSE can tolerate.
- This is especially important if Flip-Flops are used.
- The clock slope also affects energy consumption of the clock distribution network.
- If larger clock drivers with smaller fanout are used, the clock edges are sharper and the storage element performance is better, at the expense of an increase in energy consumption of the clock network.
- Optimal tradeoff is achieved with minimal energy consumption.

# Interface with Clock Network and Combinational Logic

- To evaluate the total clocking energy per clock cycle in the entire clock subsystem, one needs to add the energy consumed in the clock distribution network.
- The energy consumed in the clock distribution network depends on the total switched capacitance which is determined by the total number of clocked storage elements on a chip and the input capacitance of their clock inputs, the total wiring capacitance, and the total switched capacitance of clock drivers as given by:

$$C_{distrib-net} = N_{FF} \cdot C_{in-Clk,FF} + C_{wire} + C_{sw-buff}$$

The last two terms depend on buffer insertion/placement strategy and should be minimized.
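As a rough illustration of how the three terms add up, the sketch below evaluates the expression for $C_{distrib-net}$. The function name and all numeric values are assumptions chosen for readability, not data from the slides.

```python
def clock_network_capacitance(n_ff, c_in_clk_ff, c_wire, c_sw_buff):
    """C_distrib-net = N_FF * C_in-Clk,FF + C_wire + C_sw-buff (farads)."""
    return n_ff * c_in_clk_ff + c_wire + c_sw_buff


# Hypothetical chip: 10,000 storage elements with 5 fF of clock input each,
# 30 pF of clock wiring and 20 pF of switched buffer capacitance.
c_ff_total = 10_000 * 5e-15
c_total = clock_network_capacitance(10_000, 5e-15, 30e-12, 20e-12)
ff_share = c_ff_total / c_total

print(f"C_distrib-net = {c_total * 1e12:.1f} pF")
print(f"share due to storage-element clock inputs = {ff_share:.0%}")
```

The share printed at the end shows how much of the load is fixed by the storage elements themselves and how much is left for the buffer insertion and placement strategy to reduce.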












## Early Data Arrival Analysis (single clock, FF)

It is commonly misunderstood that the Flip-Flop provides edge-to-edge timing and is thus easier to use, as compared to the Latch based system, because it does not need to be checked for fast paths in the logic (Hold-time violation).

This is not true, and a simple analysis that follows demonstrates that even with the Flip-Flop design the fast paths can represent a hazard and invalidate the system operation.


# Early Data Arrival Analysis (single clock, FF)

If the clock controlling the Flip-Flop releasing the data is skewed so that it arrives early, and the clock controlling the Flip-Flop that receives this data arrives late, a hazard situation exists.

This same hazard situation is present if the data travels through a *fast path* in the logic. A *fast path* is a path that contains very few logic blocks, or none at all.

This hazard is also referred to as *critical race* (or *race-through*).






# Analysis of a System using a Single Latch

- A system using a single Latch is more complex to analyze than a Flip-Flop based one.
- A single Latch is transparent while the clock is active, so the possibility of *race-through* exists.
- This analysis is still much simpler than a general analysis of a system using two Latches (Master-Slave Latch based system).
- Use of a single Latch represents a hazard due to the transparency of the Latch, which introduces a possibility of races in the system.
- Therefore, the conditions for the single-latch based system must account for critical race conditions.

Presence of the CSE delay decreases the "useful time" in the pipeline cycle. Therefore, in spite of the hazards introduced by such design, the additional performance gain may well be worth the risk.







## Late Data Arrival Analysis

This gives a constraint for the clock speed in terms of $P$:

$$P \ge \max\{T_L + T_T + U + D_{CQM} - W,\; D_{DQM}\} + D_{LM}$$

This inequality breaks down into two inequalities:

$$\begin{split} P &\geq D_{LM} + D_{CQM} + T_L + T_T + U - W \\ P_m = P &\geq D_{LM} + D_{DQM} \end{split}$$

This shows the minimal bound for $P_m$, which is the time to traverse the loop:

"Starting from the leading edge of a clock pulse, there must be time, under worst case, before the trailing edge of the clock in the next cycle, for a signal to pass through the Latch and the logic block in time to meet the setup time constraint".

The value of $P = P_m$ determines the highest frequency of the clock.
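A small numeric sketch can make the role of the clock width $W$ in this bound concrete. The function below evaluates the max-form of the constraint; the delay values (in picoseconds) are made up for illustration.

```python
def min_period_single_latch(t_l, t_t, u, d_cqm, d_dqm, d_lm, w):
    """P >= max(T_L + T_T + U + D_CQM - W, D_DQM) + D_LM (single-Latch system)."""
    return max(t_l + t_t + u + d_cqm - w, d_dqm) + d_lm


# Hypothetical worst-case values in picoseconds.
params = dict(t_l=20, t_t=20, u=30, d_cqm=120, d_dqm=100, d_lm=600)

p_narrow = min_period_single_latch(w=50, **params)   # clock-limited
p_match = min_period_single_latch(w=90, **params)    # both terms equal
p_wide = min_period_single_latch(w=150, **params)    # data-path-limited

print(p_narrow, p_match, p_wide)
```

With these numbers, widening the clock pulse beyond the point where both terms of the `max` are equal no longer shortens the period, because the bound $D_{LM} + D_{DQM}$ takes over.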



## Early Signal Arrival Analysis (Single Latch Based System)

The earliest arrival of the clock  $t_{CEL}$  happens when the leading edge of the clock is skewed to arrive at  $-T_L$ . Thus, the condition for preventing race in the system is expressed as:

$$\min\{-T_{L} + D_{CQm}, t_{DEArr} + D_{DQm}\} + D_{Lm} \ge W + T_{T} + H$$

The earliest possible arrival of the clock, plus the clock-to-output delay of the Latch, has to occur earlier in time than the early arrival of the data, thus:

$$-T_L + D_{CQm} + D_{Lm} \geq W + T_T + H$$

which gives us a lower bound on the signal delay in the logic:

$$D_{Lm} > D_{LmB} \geq W + T_T + T_L + H - D_{CQm}$$
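The lower bound on the logic delay lends itself to a simple fast-path check. The following sketch assumes illustrative picosecond values; the function names are hypothetical.

```python
def min_logic_delay_bound(w, t_t, t_l, h, d_cqm_min):
    """D_LmB = W + T_T + T_L + H - D_CQm (single-Latch race bound)."""
    return w + t_t + t_l + h - d_cqm_min


def is_race_free(d_lm_min, bound):
    """A fast path is safe only if its minimum delay strictly exceeds the bound."""
    return d_lm_min > bound


# Hypothetical values in picoseconds.
bound = min_logic_delay_bound(w=90, t_t=20, t_l=20, h=10, d_cqm_min=80)
print("D_LmB =", bound)
print("50 ps fast path safe?", is_race_free(50, bound))
print("75 ps fast path safe?", is_race_free(75, bound))
```

A path that fails the check would typically need padding with extra delay, or a narrower clock pulse $W$.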


The conditions for reliable operation of a system using a single Latch are:

$$\begin{split} P_m = P &\geq D_{LM} + D_{CQM} + T_L + T_T + U - W \\ P &\geq D_{LM} + D_{DQM} \\ D_{Lm} > D_{LmB} &\geq W + T_T + T_L + H - D_{CQm} \end{split}$$

## Early Signal Arrival Analysis (Single Latch Based System)

The maximum useful value of $W$ is obtained when the period $P$ is minimal:

$$W^{opt} = T_L + T_T + U + D_{CQM} - D_{DQM}$$

Substituting the optimal clock width $W^{opt}$, we obtain the values for the maximal speed and the minimal signal delay in the logic which has to be maintained in order to satisfy the conditions for optimal single-latch system clocking:

$$P \ge D_{LM} + D_{DQM}$$

$$D_{LmB} = 2(T_T + T_L) + H + U + D_{CQM} - D_{CQm} - D_{DQM}$$
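The substitution can be checked numerically: the closed form for $D_{LmB}$ should agree with the general race bound $W + T_T + T_L + H - D_{CQm}$ evaluated at $W = W^{opt}$. The values below are hypothetical, in picoseconds.

```python
def optimal_clock_width(t_l, t_t, u, d_cqm_max, d_dqm_max):
    """W_opt = T_L + T_T + U + D_CQM - D_DQM."""
    return t_l + t_t + u + d_cqm_max - d_dqm_max


def min_logic_delay_at_w_opt(t_l, t_t, u, h, d_cqm_max, d_cqm_min, d_dqm_max):
    """D_LmB = 2(T_T + T_L) + H + U + D_CQM - D_CQm - D_DQM."""
    return 2 * (t_t + t_l) + h + u + d_cqm_max - d_cqm_min - d_dqm_max


# Hypothetical values in picoseconds.
t_l, t_t, u, h = 20, 20, 30, 10
d_cqm_max, d_cqm_min, d_dqm_max = 120, 80, 100

w_opt = optimal_clock_width(t_l, t_t, u, d_cqm_max, d_dqm_max)
d_lmb = min_logic_delay_at_w_opt(t_l, t_t, u, h, d_cqm_max, d_cqm_min, d_dqm_max)

# Cross-check against the general bound with W replaced by W_opt.
d_lmb_general = w_opt + t_t + t_l + h - d_cqm_min

print("W_opt =", w_opt, "D_LmB =", d_lmb, "general bound =", d_lmb_general)
```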

In a single Latch system, it is possible to make the clock period $P$ as small as the sum of the delays in the signal path: Latch and critical path delay in the logic. This can be achieved by adjusting the clock width $W$, while taking care of $D_{LmB}$.

# Analysis of a System with two-phase Clock and two Latches in an M-S arrangement



### System using two-phase clock and two latches in M-S arrangement

From the latest signal arrival analysis, several conditions can be derived. First, we need to assure an orderly transfer into  $L_2$  Latch (*Slave*) from the  $L_1$  Latch (*Master*), even if the signal arrived late (in the last possible moment) into the (*Master*)  $L_1$  Latch. This analysis yields the following conditions:

$$W_{2} \ge V + U_{2} - U_{1} + D_{1DQM} + T_{2T} + T_{1L}$$
$$W_{1} + W_{2} \ge V + U_{2} + D_{1CQM} + T_{1T} + T_{2T}$$

These conditions assure timely arrival of the signal into the $L_2$ Latch, thus an orderly $L_1$-$L_2$ transfer (from *Master* to *Slave*).

## System using two-phase clock and two latches in M-S arrangement

The analysis of the latest arrival of the signal into the $L_1$ Latch in the next cycle (critical path analysis) yields the conditions:

$$\begin{split} P &\geq D_{1DQM} + D_{2DQM} + D_{LM} \\ W_1 &\geq -P + D_{1CQM} + D_{2DQM} + U_1 + D_{LM} + T_{1L} + T_{1T} \\ P &\geq -V + D_{2CQM} + U_1 + D_{LM} + T_{1T} + T_{2T} \end{split}$$

These conditions assure timely arrival of the signal that starts on the leading edge of $\phi_1$, traverses the path through $L_2$ and the longest path in the logic, and arrives before the trailing edge of $\phi_1$, in time to be captured.

The last equation shows that the amount of overlap $V$ between the clocks $\phi_1$ and $\phi_2$ allows the system to run at greater speed.
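To see the effect of the overlap $V$ on the period, the three critical-path conditions can be transcribed as stated and combined into a single lower bound on $P$ (the $W_1$ condition is rearranged to isolate $P$). The numbers are hypothetical picosecond values.

```python
def min_period_two_phase(d1dqm, d2dqm, d1cqm, d2cqm, u1, d_lm,
                         t1l, t1t, t2t, w1, v):
    """Largest of the three lower bounds on P for the two-phase M-S system."""
    bound_path = d1dqm + d2dqm + d_lm
    # W1 >= -P + D1CQM + D2DQM + U1 + D_LM + T1L + T1T, solved for P.
    bound_w1 = d1cqm + d2dqm + u1 + d_lm + t1l + t1t - w1
    bound_overlap = -v + d2cqm + u1 + d_lm + t1t + t2t
    return max(bound_path, bound_w1, bound_overlap)


# Hypothetical worst-case values in picoseconds.
common = dict(d1dqm=60, d2dqm=60, d1cqm=80, d2cqm=100, u1=20, d_lm=500,
              t1l=10, t1t=10, t2t=10, w1=250)

p_no_overlap = min_period_two_phase(v=0, **common)
p_overlap = min_period_two_phase(v=30, **common)
print(p_no_overlap, p_overlap)
```

In this example the overlap shortens the period only until the pure data-path bound $D_{1DQM} + D_{2DQM} + D_{LM}$ becomes the limiting term.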

### System using two-phase clock and two latches in M-S arrangement

If we increase $V$ we can tolerate a longer *critical path* $D_{LM}$. However, increasing the clock overlap introduces a possibility of race conditions, thus requiring a *fast path* analysis. High-performance systems are designed with the objective of maximizing performance. Therefore, overlapping of the clocks is commonly employed, leading to the constraint of the minimal signal delay in the logic $D_{LmB}$:

$$D_{Lm} > D_{LmB} = V + H_1 + T_{1T} + T_{2L} - D_{2CQm}$$

The maximal amount of overlap V that can be used is:

$$V_{\max} = T_{1T} + T_{2L} + D_{2CQM} + U_1 - D_{1DQM} - D_{2DQM}$$

For maximal performance, it is possible to adjust the clock overlap $V$ so that the system runs at the maximal frequency.
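The two expressions for $V_{\max}$ and $D_{LmB}$ can be evaluated together to see how much fast-path padding the maximal overlap would demand. The sketch below uses hypothetical picosecond values and hypothetical function names.

```python
def max_overlap(t1t, t2l, d2cqm_max, u1, d1dqm_max, d2dqm_max):
    """V_max = T_1T + T_2L + D_2CQM + U_1 - D_1DQM - D_2DQM."""
    return t1t + t2l + d2cqm_max + u1 - d1dqm_max - d2dqm_max


def min_logic_delay_two_phase(v, h1, t1t, t2l, d2cqm_min):
    """D_LmB = V + H_1 + T_1T + T_2L - D_2CQm."""
    return v + h1 + t1t + t2l - d2cqm_min


# Hypothetical values in picoseconds.
v_max = max_overlap(t1t=10, t2l=10, d2cqm_max=100, u1=20,
                    d1dqm_max=60, d2dqm_max=60)
d_lmb_at_v_max = min_logic_delay_two_phase(v=v_max, h1=15, t1t=10, t2l=10,
                                           d2cqm_min=40)
d_lmb_no_overlap = min_logic_delay_two_phase(v=0, h1=15, t1t=10, t2l=10,
                                             d2cqm_min=40)

print("V_max =", v_max)
print("D_LmB at V_max =", d_lmb_at_v_max, "vs. no overlap =", d_lmb_no_overlap)
```

A negative bound, as in the no-overlap case here, simply means that every path, however fast, satisfies the race condition.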

















# **Example: Clocking the Alpha Processor**

Combining Eq. (4.34) and Eq. (4.39) and rearranging, we obtain a set of bounds for $W_1$ and $P$:

$$W_1 \ge U_1 - U_2 + D_{2DQM} + T_T - T_L + D_{L2M} \qquad (4.40a)$$

$$P \ge U_1 + D_{2CQM} + 2T_T + D_{L2M} \qquad (4.40b)$$

Combining Eq. (4.38a) and Eq. (4.40a) we obtain a third and often the most critical bound for the clock period:

$$P = W_1 + W_2 \ge D_{1DQM} + D_{L1M} + D_{2DQM} + D_{L2M} \qquad (4.41)$$









# Example: Clocking the Alpha Processor

Substituting the values into Eqs. (4.38), (4.40) and (4.41) we obtain:

$$\begin{split} W_2 &\geq 270\,ps\\ P &\geq 320\,ps\\ W_1 &\geq 230\,ps\\ P &\geq 270\,ps \end{split}$$

and the most critical bound for $P$:

$$P = W_1 + W_2 \geq 500\,ps$$

Thus the minimal clock period is $P_{min} = 500\,ps$, and the maximal frequency at which this system can run is $f_{max} = 2\,GHz$.
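The arithmetic behind the final figure is short enough to check in a few lines. The sketch below only reuses the numeric bounds quoted above; the variable names are illustrative.

```python
# Bounds quoted above, in picoseconds.
w1_min_ps = 230
w2_min_ps = 270
direct_period_bounds_ps = [320, 270]

# The third bound on P is the sum of the two minimal pulse widths.
combined_bound_ps = w1_min_ps + w2_min_ps

p_min_ps = max(direct_period_bounds_ps + [combined_bound_ps])
f_max_ghz = 1000.0 / p_min_ps  # 1 / (1 ps) = 1000 GHz

print(f"P_min = {p_min_ps} ps, f_max = {f_max_ghz:.1f} GHz")
```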

Digital system using a single-phase clock and dual-edge triggered storage elements



# Absorbing Clock Uncertainties

- Clocking in high-performance digital systems is most seriously affected by *clock skew* and *clock jitter*.
- We will treat both of them as clock uncertainties.
- Trends:
  - Relative portion of the clock cycle budgeted for clock uncertainty increases.
  - Clock distribution becomes progressively difficult due to:
    - Load mismatch.
    - Process, voltage, and temperature variations.
  - The clock uncertainties occupy an increasing portion of the cycle time; typically 2 FO4.
- The ability to reduce the impact of these uncertainties is one of the most important properties of a high-performance system.




























## Time Borrowing

We define time borrowing as:

- Dynamic time borrowing
  - the essential condition is that there are no "hard" boundaries between stages, i.e. that the storage elements are transparent at the time when data arrives.
  - Occurs in two clocking styles, level sensitive and soft-edge clocking.
- Static time borrowing
  - a technique of intentionally delaying the clock by inserting delay between clock inputs of the clocked storage elements. The clocks are scheduled to arrive so that the slower paths obtain more time to evaluate, taking away the time from faster paths.
  - It can operate with conventional hard-edge Flip-Flops.
  - Also called opportunistic skew scheduling




#### Timing Analysis with Time Borrowing: Late Data Arrival

The arrival time of the input to the subsequent Latch $t_{D,i+1}$ is equal to the sum of the arrival time of the input to the preceding Latch $t_{D,i}$, the Data-to-Output delay of the Latch $D_{DQ,i}$, and the logic delay $D_{LM,i}$ of the current stage.

$$t_{D,i+1} = t_{D,i} + D_{DQ,i} + D_{LM,i}$$

The arrival of input to (2N+1)th Latch (input to (N+1)th stage) is:

(5.9) 
$$t_{D,2N+1} = t_{D,1} + \sum_{i=1}^{2N} (D_{DQ,i} + D_{LM,i})$$

(Assumption: All logic blocks are used in time borrowing)


#### Timing Analysis with Time Borrowing: Late Data Arrival

We assume that after N stages, the pipeline produces data at the same point in the transparency period of clock phase $\Phi_1$ at which the input data was acquired in the first clock cycle. Therefore, $t_{D,2N+1} - t_{D,1}$ is equal to N clock periods P.

(5.10) 
$$t_{D,2N+1} - t_{D,1} = NP$$

Combining Eq. (5.9) and Eq. (5.10), we obtain the requirement for minimum clock period under late data arrival:

(5.11) 
$$P = \frac{1}{N} \sum_{i=1}^{2N} (D_{DQ,i} + D_{LM,i})$$

It shows that the minimum clock cycle time of the pipeline is not determined by the delay of the slowest stage in the pipeline. It is rather the average delay of the logic and Latches through all stages.

Note that Eq. (5.11) is valid only if the data arrive at the Latch during the time it is transparent:

(5.12) 
$$\frac{i-1}{2}P - D_{DQ,i} + D_{CQ,i} < t_{D,i} < \frac{i}{2}P - U_i, \ \forall i \in [1..2N]$$

where it is assumed that the first leading edge of $\Phi_1$ occurs at time zero.
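Equations (5.9) through (5.12) can be tied together in a short numeric sketch: compute the averaged period, propagate the arrival times through the Latches, and check that each arrival falls inside the transparency window. The four-Latch pipeline and all picosecond values below are hypothetical.

```python
def min_period_time_borrowing(d_dq, d_lm):
    """Eq. (5.11): P = (1/N) * sum over 2N Latches of (D_DQ,i + D_LM,i)."""
    assert len(d_dq) == len(d_lm) and len(d_dq) % 2 == 0
    n_stages = len(d_dq) // 2
    return sum(dq + lm for dq, lm in zip(d_dq, d_lm)) / n_stages


def arrival_times(t_d1, d_dq, d_lm):
    """Recurrence t_D,i+1 = t_D,i + D_DQ,i + D_LM,i; returns t_D,1 .. t_D,2N+1."""
    times = [t_d1]
    for dq, lm in zip(d_dq, d_lm):
        times.append(times[-1] + dq + lm)
    return times


def within_transparency(times, period, d_dq, d_cq, u):
    """Eq. (5.12) checked for Latches i = 1 .. 2N."""
    for i, t in enumerate(times[:-1], start=1):
        lower = (i - 1) / 2 * period - d_dq[i - 1] + d_cq[i - 1]
        upper = i / 2 * period - u[i - 1]
        if not (lower < t < upper):
            return False
    return True


# Hypothetical pipeline with N = 2 stages (4 Latches), values in picoseconds.
d_dq = [50, 50, 50, 50]
d_cq = [60, 60, 60, 60]
u = [20, 20, 20, 20]
d_lm = [300, 150, 250, 200]  # unbalanced logic blocks

period = min_period_time_borrowing(d_dq, d_lm)
times = arrival_times(50, d_dq, d_lm)
ok = within_transparency(times, period, d_dq, d_cq, u)
too_early = within_transparency(arrival_times(5, d_dq, d_lm), period, d_dq, d_cq, u)
print(period, times, ok, too_early)
```

In this example the slowest half-cycle (350 ps including the Latch) would force a 700 ps period with hard boundaries, while the averaged period is 550 ps. The second check shows that data arriving before the first Latch becomes transparent falls outside the window required by Eq. (5.12).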











### Opportunistic skew scheduling scheme

#### Advantages:

- It can operate with conventional Flip-Flops.
- It places fewer constraints onto the circuit design, allowing longer critical paths where necessary.

#### Disadvantages:

- It increases the complexity of the clock distribution system.
- It is hard to control the inserted delays over process, supply and temperature variations.
- The analysis of clock skew is also complicated in this asymmetric clock distribution network.

While all these difficulties make this technique impractical on a large-scale level, it is very useful in localized critical paths where every improvement directly increases the system clock rate.











# Effects of clock uncertainties on a time-borrowing system

- Decreasing of the margins for time borrowing.
- The pipeline absorbs the uncertainties for the data that arrives during the transparency period of the Latch.
- The effect of the uncertainties is reduced to an average uncertainty over all stages in the path.



























































# Simulation Techniques

- In high-speed designs, the design and evaluation of CSEs is focused on the elements on the critical path and often implicitly assumes such conditions during performance comparisons.
- There are a lot of CSEs that are placed on non-critical paths with relatively light load. While these CSEs do not directly impact the performance of the processor, careful design of these elements can significantly reduce the energy consumption and alleviate the clock distribution problems.
- The purpose of this presentation is to recommend simulation techniques that designers can use to evaluate the performance of CSEs, depending on the desired application.
- We try to build an understanding of the issues involved in creating a simulation environment for the CSE, so that designers can use it to build their own setups tailored to the specific application.
- There is no universal setup that is good for all CSE applications.



































































### Environment Setup

- In case where CSE can be re-optimized for each particular load, further speedup can be achieved since the effort can be shared between stages rather than resting solely on the output stage. This approach was illustrated in (Heo and Asanovic 2001).
  - Contrary to their conclusions, in case when a general performance of a CSE needs to be assessed, the proper approach is to optimize the CSE for the most important application that determines the performance of the whole system, not the most frequent application.
- In high-speed systems the most important are the elements on the critical path, which is typically moderately-to-heavily loaded due to branching to parallel execution units and wire capacitance. The small number of critical paths in a processor does not decrease their importance since it is their delay that determines the clock rate of the whole system.
- The performance of the large number of the lightly loaded CSEs that are placed off of the critical path is of concern only if it can be traded for energy savings.
- The simulation approach should attempt to resemble the actual data-path environment. The number of logic stages in a CSE and their complexity highly depend on particular circuit implementation, leading to differences in logical effort, parasitic delay and energy consumption. Every CSE structure needs to be optimized to drive the load with best possible effort delay.



## General simulation setup

- The size of the clocked transistors is set to the size needed in order not to compromise the speed of the whole structure.
- A direct tradeoff exists between the CSE delay and clock energy (size of clocked transistors), as some of the clocked transistors are always on the critical path of the CSE. Increase in sizes of clocked transistors on a critical path results in diminishing returns since data input is fixed. Depending on the CSE topology, some structures can trade delay for clocked transistor size more efficiently than others, so we allow this to happen up to a certain extent.
- Our goal is to examine CSEs that are used on a critical path, hence the assumption that the designer might be willing to spend a bit more clock power to achieve better performance. Differences in clock loads ($C_{CLK}$) among devices illustrate potential drawbacks in terms of clock power requirements, and serve as one of the performance metrics. Clock inputs have identical signal slope to that of a FO4 inverter. This can be changed depending on the clock distribution design methodology.


## General simulation setup

- The question on how to compare differential and single-ended structures has always been one of the key issues among the people characterizing and designing the CSEs.
- Differential and single-ended structures should not be compared with each other, due to the overhead that single-ended structures incur to generate the complementary output. We do not require that a single-ended structure generate both true and complementary value at the output.
- The worst-case analysis requires that the CSE generates the output that has the worse data-to-output delay. However, it is also beneficial to measure both D-Q and D-$\overline{Q}$ delay.
- Any imbalance between the two can lead to big delay savings in cases where proper logic polarity manipulation in the stages preceding or following the CSE can change the polarity requirement of the CSE, and hence its data-to-output delay.
- Load model always consists of several inverters in a chain to avoid the error in delay caused by Miller capacitance effects from the fast switching load back to the driver.


## General simulation setup

- The logical effort framework offers analogy between the CSE and a simple logic gate.
  - At light load, logic gate is dominated by its parasitic delay, i.e. self-loading.
  - At high load, effort delay becomes the dominant factor.
- Similarly, at light load, delay of a CSE with large number of stages is entirely dominated by parasitic delay.
- However, at high load, more stages are beneficial in reducing the effort delay, which then dominates over parasitic delay.
- Therefore, the performance of the CSE is best assessed if it is evaluated in a range of output loads of interest for the particular application.
- CSE evaluation can either be performed using some representative critical path load or a set of loads can be used in which case CSE has to be re-optimized for each load setting.
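The tradeoff between parasitic delay and effort delay can be illustrated with the standard logical-effort path model, $D = N \cdot F^{1/N} + N \cdot p$, which is not stated on the slide and is used here as an assumption. The sketch further assumes equal stages with unit logical effort and a parasitic delay of one unit per stage; real CSE stages will differ.

```python
def path_delay(total_effort, n_stages, parasitic_per_stage=1.0):
    """Logical-effort estimate D = N * F**(1/N) + N * p, in units of tau.

    Assumes the path effort F is split equally among N identical stages.
    """
    stage_effort = total_effort ** (1.0 / n_stages)
    return n_stages * stage_effort + n_stages * parasitic_per_stage


def best_stage_count(total_effort, max_stages=6):
    """Stage count (1..max_stages) that minimizes the estimated path delay."""
    return min(range(1, max_stages + 1),
               key=lambda n: path_delay(total_effort, n))


for fanout in (4.0, 16.0, 128.0):
    n_best = best_stage_count(fanout)
    print(f"fanout {fanout:5.0f}: best N = {n_best}, "
          f"delay = {path_delay(fanout, n_best):.2f} tau")
```

Under these assumptions a single stage wins at a fanout of 4, where the added parasitic delay of extra stages is not repaid, while a heavy load favors more stages because the effort delay per stage drops.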








# HLFF sizing example

- The logical effort calculation of the second stage is slightly more complicated because of the keeper inverter pair. A keeper sinks a portion of the current that is sourced by the PMOS transistor to node $Q$, which can be taken into account as negative conductance.
- This negative conductance is accounted for by subtracting the conductance of the NMOS transistor (1) of the shaded keeper inverter from the conductance of the driving PMOS transistor (10/2).
- For the particular load, efforts per stage were calculated to be 4.7 and 4.25, which is near the optimum value of 4, indicating that this example sizing is nearly optimal.
- The sizing in the example above is somewhat simplified: because the short channel stack effect has not been taken into account, the logical effort values for the NMOS transistor stack are somewhat pessimistic.
- Once the logical effort of each stage is known, it can be used to adjust the sizing of each stage as the load is increased or decreased.
- The alternative method is to use one of the automated circuit optimizers; however, it is not recommended as the initial method, simply because it is essential that the designer gets to know the circuit through manual sizing and logical effort estimation.
- This builds intuition about the circuit and the ability to verify optimizer results.



# HLFF Delay (normalized to FO4 inverter delay) vs. Fanout for Different HLFF Cell Sizes

| Load-Size (#stages) | Fanout 4 | Fanout 16 | Fanout 42 | Fanout 64 | Fanout 128 |
|---------------------|----------|-----------|-----------|-----------|------------|
| Small-A (2)         | 1.60     | 2.06      | 3.11      | 4.19      | 7.80       |
| Medium-B (2)        | 1.80     | 2.06      | 2.59      | 3.05      | 4.62       |
| Large-C (2+1)       | 2.27     | 2.44      | 2.74      | 2.96      | 3.56       |

There is only one optimal solution for each load size.
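
The choice of the best HLFF cell size for each load can be read off the table mechanically. The following is a minimal Python sketch, assuming the delay values are simply copied from the HLFF table; the names `HLFF_DELAY` and `best_sizes` are illustrative and not part of the original material.

```python
# Delays copied from the HLFF table (normalized to the FO4 inverter delay).
FANOUTS = [4, 16, 42, 64, 128]
HLFF_DELAY = {
    "Small-A (2)":   [1.60, 2.06, 3.11, 4.19, 7.80],
    "Medium-B (2)":  [1.80, 2.06, 2.59, 3.05, 4.62],
    "Large-C (2+1)": [2.27, 2.44, 2.74, 2.96, 3.56],
}


def best_sizes(delays, fanouts):
    """Map each fanout to (minimum delay, list of cell sizes achieving it)."""
    result = {}
    for i, fanout in enumerate(fanouts):
        d_min = min(row[i] for row in delays.values())
        # Keep every size that reaches the minimum, so ties stay visible.
        winners = [name for name, row in delays.items() if row[i] == d_min]
        result[fanout] = (d_min, winners)
    return result


if __name__ == "__main__":
    for fanout, (delay, winners) in best_sizes(HLFF_DELAY, FANOUTS).items():
        print(f"fanout {fanout}: {delay:.2f} FO4 -> {', '.join(winners)}")
```

With the two-decimal values in the table, Small-A and Medium-B tie at a fanout of 16; at every other tabulated load a single size is fastest.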







# M-SAFF Delay vs. Fanout for Different M-SAFF Cell Sizes

| Optimal Load-Size (#stages) | Fanout 4 | Fanout 16 | Fanout 42 | Fanout 64 | Fanout 128 |
|-----------------------------|----------|-----------|-----------|-----------|------------|
| Small-A (2)                 | 2.33     | 2.60      | 3.11      | 3.53      | 4.70       |
| Medium-B (2)                | 2.35     | 2.59      | 3.01      | 3.34      | 4.24       |
| Large-C (2+1)               | 3.06     | 3.15      | 3.31      | 3.44      | 3.83       |

In this case we also observe the minimum data-to-output (*D-Q*) delay as the output load increases. The performance of three different sizing solutions is given as a function of the electrical fanout, normalized to the delay of the FO4 inverter.

# Modified SAFF

- The sizing is done in a way similar to that described in the HLFF example.
- The logical effort of the input stage is very small, better than that of an inverter, because of the small input capacitance.
- This implies that sizing changes will mostly be located in the output stage, since the input stage can accommodate larger load variations without the need for resizing.
- While it was relatively easy to find different sizes that perform better at certain loads in the case of HLFF, this was not so in the case of M-SAFF.
- The small logical effort of the whole structure enables it to cover a huge range of loads with a single size while achieving relatively good performance.
- This is the case with the structure of size B in the table. Size A is only slightly better than size B, and only for the very light load of FO4; size B then takes the lead all the way up to FO64, after which an additional inverter is needed to prevent excessive delay.
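
The claim that a single M-SAFF size covers a wide load range can be checked against the tabulated delays. The sketch below, written in Python under the assumption that the values are copied from the M-SAFF delay table, computes the percentage delay penalty of always using one size instead of the best size at each load. The names `MSAFF_DELAY` and `single_size_penalty` are hypothetical.

```python
# Delays copied from the M-SAFF table (normalized to the FO4 inverter delay).
FANOUTS = [4, 16, 42, 64, 128]
MSAFF_DELAY = {
    "Small-A (2)":   [2.33, 2.60, 3.11, 3.53, 4.70],
    "Medium-B (2)":  [2.35, 2.59, 3.01, 3.34, 4.24],
    "Large-C (2+1)": [3.06, 3.15, 3.31, 3.44, 3.83],
}


def single_size_penalty(delays, chosen):
    """Percent delay penalty of always using `chosen` vs. the best size per load."""
    penalties = []
    for i, d_chosen in enumerate(delays[chosen]):
        d_best = min(row[i] for row in delays.values())
        penalties.append(100.0 * (d_chosen - d_best) / d_best)
    return penalties


if __name__ == "__main__":
    for fanout, pct in zip(FANOUTS, single_size_penalty(MSAFF_DELAY, "Medium-B (2)")):
        print(f"fanout {fanout}: size B is {pct:.1f}% slower than the best size")
```

For size B the penalty is below 1% at a fanout of 4, zero from 16 through 64, and a little over 10% at 128, where the three-stage size C becomes faster.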





# Automating the Simulations

- The delay vs. load CSE evaluation described in the examples can be automated. Perl is suggested as one of the most convenient scripting languages today. For each CSE, we need to determine:
  - the logical effort of every stage based on its topology (e.g. 2 NAND-like stages and 1 inverter stage would be 4/3, 4/3, 1), or better yet, exact logical-effort values obtained from simulation of the given technology process.
  - the total logical effort of the CSE, which is the product of the logical efforts of all stages. Once the total logical effort is found, the optimal number of stages and the updated stage effort can be calculated.
- With the stage effort and the logical efforts obtained from the topology of the CSE, taking the data input as fixed in size, and assuming that the clock is "on" (i.e. treating the structure as a cascade of logic gates), transistor sizes for every stage can be calculated, progressing from the data input to the final load in the simulation setup.
- When a library of CSEs is created, a pre-simulation should be run for each environment parameter setup (various process corners, supply voltages, etc.) to determine the FO4 inverter slope and set that value as the rise/fall time of the signals that drive data and clock into the CSE.
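
The sizing steps above can be sketched in a few lines. The slides suggest Perl; the example below uses Python instead, purely for illustration. It assumes the standard logical-effort relations (path effort F = G·H, equal effort per stage, no branching, parasitic delay ignored) and a target stage effort of about 4 when estimating the stage count. The function names and the capacitance units are hypothetical and not part of the original flow.

```python
import math


def size_stages(logical_efforts, c_in, c_load):
    """Return (stage effort, input capacitance of every stage).

    logical_efforts: per-stage logical effort, data input first.
    c_in:   fixed input capacitance of the first stage (data input).
    c_load: final load capacitance, in the same units as c_in.
    """
    n = len(logical_efforts)
    g_total = math.prod(logical_efforts)   # total logical effort G
    h_total = c_load / c_in                # electrical fanout H
    f_total = g_total * h_total            # path effort F = G * H (no branching)
    stage_effort = f_total ** (1.0 / n)    # equal effort in every stage
    caps = [c_in]
    for g in logical_efforts[:-1]:
        # Stage effort f = g * C_out / C_in, so C_out = C_in * f / g.
        caps.append(caps[-1] * stage_effort / g)
    return stage_effort, caps


def suggested_stage_count(logical_efforts, c_in, c_load, target_effort=4.0):
    """Estimate the stage count that brings the stage effort near target_effort."""
    f_total = math.prod(logical_efforts) * c_load / c_in
    return max(1, round(math.log(f_total) / math.log(target_effort)))


if __name__ == "__main__":
    # Two NAND-like stages and one inverter, driving a load of 64x the input.
    efforts = [4 / 3, 4 / 3, 1.0]
    f, caps = size_stages(efforts, c_in=1.0, c_load=64.0)
    print(f"stage effort: {f:.2f}")
    print("stage input capacitances:", [round(c, 2) for c in caps])
    print("suggested number of stages:", suggested_stage_count(efforts, 1.0, 64.0))
```

Sizes are expressed as input capacitances relative to the data input; converting them to transistor widths, and replacing the topology-based logical efforts with simulated values, depends on the technology process and is left out of this sketch.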






























































































# Microprocessor Examples

- Clocking for Intel® Microprocessors
  - IA-32 Pentium® Pro
  - First IA-64 Microprocessor
  - Pentium 4
- Sun Microsystems UltraSPARC-III® Clocking
  - Clocking and CSEs
- Alpha® Clocking: A Historical Overview
  - Clocking and CSEs
- IBM® Microprocessors
  - Level-Sensitive Scan Design
  - Examples of CSEs


# Intel<sup>®</sup> Microprocessor Features

|                 | Pentium® II        | Pentium® III       | Pentium® 4         |
|-----------------|--------------------|--------------------|--------------------|
| MPR Issue       | June 1997          | April 2000         | Dec 2001           |
| Clock Speed     | 266 MHz            | 1 GHz              | 2 GHz              |
| Pipeline Stages | 12/14              | 12/14              | 22/24              |
| Transistors     | 7.5M               | 24M                | 42M                |
| Cache (I/D/L2)  | 16K/16K/-          | 16K/16K/256K       | 12K/8K/256K        |
| Die Size        | 203 mm<sup>2</sup> | 106 mm<sup>2</sup> | 217 mm<sup>2</sup> |
| IC Process      | 0.28 µm, 4M        | 0.18 µm, 6M        | 0.18 µm, 6M        |
| Max Power       | 27 W               | 23 W               | 67 W               |

Source: Microprocessor Report Journal


























|                   | UltraSPARC-I®             | UltraSPARC-II®            | UltraSPARC-III®         |
|-------------------|---------------------------|---------------------------|-------------------------|
| Year              | 1995                      | 1997                      | 2000                    |
| Architecture      | SPARC V9, 4-issue         | SPARC V9, 4-issue         | SPARC V9, 4-issue       |
| Die size          | 17.7x17.8 mm<sup>2</sup>  | 12.5x12.5 mm<sup>2</sup>  | 15x15.5 mm<sup>2</sup>  |
| # of transistors  | 5.2M                      | 5.4M                      | 23M                     |
| Clock Frequency   | 167 MHz                   | 330 MHz                   | 1 GHz                   |
| Supply voltage    | 3.3 V                     | 2.5 V                     | 1.6 V                   |
| Process           | 0.5 µm CMOS               | 0.35 µm CMOS              | 0.15 µm CMOS            |
| Metal layers      | 4 (Al)                    | 5 (Al)                    | 7 (Al)                  |
| Power consumption | <30 W                     | <30 W                     | <80 W                   |













|                             | 21064     | 21164     | 21264     | 21364     |
|-----------------------------|-----------|-----------|-----------|-----------|
| # transistors [M]           | 1.68      | 9.3       | 15.2      | 152       |
| Die Size [mm<sup>2</sup>]   | 16.8x13.9 | 18.1x16.5 | 16.7x18.8 | 21.1x18.8 |
| Process                     | 0.75µm    | 0.5µm     | 0.35µm    | 0.18µm    |
| Supply [V]                  | 3.3       | 3.3       | 2.2       | 1.5       |
| Power [W]                   | 30        | 50        | 72        | 125       |
| Clk Freq. [MHz]             | 200       | 300       | 600       | 1200      |
| Gates/Cycle                 | 16        | 14        | 12        | 12        |

















































# Summary

- Intel® Microprocessors
  - Active clock deskewing in Pentium® processors
- Sun Microsystems® Processors
  - Semidynamic flip-flop (one of the fastest single-ended flip-flops today, "soft-edge")
- Alpha® Processors
  - Performance leader in the '90s
  - Incorporating logic into CSEs
- IBM® Processors
  - Design for testability techniques
  - Low-power champion PowerPC 603