## TECHNICAL REPORT: The MS-Processor's Register File Timing and Power Evaluation

| Isidro           | Adrián                | Alex              | Marco Antonio   | Mateo            |
|------------------|-----------------------|-------------------|-----------------|------------------|
| González         | Cristal               | Veindenbaum       | Ramírez         | Valero           |
| DAC – UPC        | BSC – CNS             | ICS – UCI         | CIC – IPN       | BSC – CNS        |
| iglez@ac.upc.edu | adrian.cristal@bsc.es | alexv@ics.uci.edu | mars@cic.ipn.mx | mateo@ac.upc.edu |

## Abstract

Power evaluation is an important issue for new chip-level architecture proposals, because a large amount of power is dissipated as heat and chips have limited heat dissipation capacity. The evaluation presented in this technical report does not use any low-power techniques; the main goal of this work is to establish the upper limit of the power consumption of the Multi-State Processor's register file (RF) design. Power optimization is left to later steps of the design flow. Logic design was performed at transistor level using the SPICE simulator; once the basic structures of the RF took shape, power consumption was analyzed. The technology parameters used in this work come from the Predictive Technology Model (PTM) provided by the Nanoscale Integration and Modeling Group at UC Berkeley [9].

## Introduction

Multiport register files are critical components of modern superscalar processors, but conventional designs consume considerable die area and power as register counts and issue widths grow. Multi-banked register files, which consist of multiple interleaved banks of cells with fewer ports, can be used to reduce area and power. The MSP register file design is described in the following sections.

## 1. MSP Register File Design

The physical register file is a multi-bank structure. In this analysis each bank is associated with exactly one logical register, which keeps the number of read and write ports to a minimum; other combinations are nevertheless possible. Each bank holds 16 entries of 64-bit words and has 1 write port and 1 read port. A 4-way architecture is considered, so the MS-Processor's register file requires 4 global write ports and 8 global read ports.

## 2. Layout of a bank memory

The register file is formed by four banks; the layout of a bank is shown in Figure 1. Each bank contains eight sub-banks, and each sub-bank has 16 entries with 1 local read port and 1 local write port. Each line in the figure represents a 64-bit data bus: four correspond to write ports and eight to read ports.



Figure 1. Layout of a memory bank's multiplexer and demultiplexer structure

## 3. Interconnections

The technology parameters used in this work come from the Predictive Technology Model (PTM) provided by the Nanoscale Integration and Modeling Group at UC Berkeley [9]. The model follows the approach of the International Technology Roadmap for Semiconductors and subdivides the wiring layers into three categories: 1) local lines, for connections within a cell; 2) intermediate lines, for connections between modules; and 3) global lines, for chip-level communication. In addition, it uses two structure types: a) global layer lines, which are coupled above one metal ground plane, and b) local and intermediate layer lines, which are coupled between two metal ground planes.





Figure 2. Wire structures used in PTM: a) global layer lines; b) local and intermediate layer lines

Wires were modelled using a lumped lossy transmission line model. The new process also integrates several copper interconnection layers and uses a "Hi-k" dielectric material that increases the signal speed inside the chip and reduces chip power consumption.

## 4. Clock Frequency

The fan-out-of-4 (FO4) delay<sup>1</sup> metric has been used to estimate CMOS circuit speed because it is independent of the process technology. The FO4 metric, together with the access delay of some structures in the critical path taken as a lower bound, can be used to estimate the processor clock speed across technology generations [4]. In [1] it is suggested to use the computation delay of a highly optimized 64-bit adder (5.5 FO4 for static logic), under the assumption that, to execute two dependent instructions in consecutive cycles, the first instruction must compute its result in a single cycle. Considering the pipeline-latch overhead and the time to bypass the adder result back to the input for the next instruction, the clock period comes to 8 FO4 delays in an aggressive estimate and 16 FO4 delays in a conservative estimate. For high-performance architectures the estimated optimal clock period is approximately 8 FO4 for integer and 6 FO4 for floating-point applications [6]. With these considerations we used the typical FO4 delay and an aggressive clock period of 8 FO4 for our evaluations. Table 1 summarizes the projected clock frequency for each technology.

<sup>1</sup> The FO4 delay is the time an inverter takes to drive four copies of itself. This delay has scaled approximately linearly with technology, and an approximation of the FO4 delay in picoseconds is given by $360 \times L_{d(\mu m)}$ in the typical case, where $L_{d(\mu m)}$ is the minimum transistor gate length in micrometers [5].

Table 1. Technology, FO4 delay and projected high-performance architecture frequency

| Technology (nm) | FO4 delay, 360×L (ps) | FO4 delay, SPICE (ps) | Frequency (GHz) | Power supply (V) |
|-----------------|-----------------------|-----------------------|-----------------|------------------|
| 90              | 32.4                  | 22.5                  | 4.0             | 1.2              |
| 65              | 23.4                  | 17.2                  | 5.0             | 1.0              |
| 45              | 16.2                  | 16.3                  | 8.0             | 0.9              |
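The FO4 rule of thumb and the 8 FO4 clock period can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes that the gate length equals the nominal technology node. The results (roughly 3.9, 5.3 and 7.7 GHz) are close to the rounded frequencies listed in Table 1.

```python
def fo4_delay_ps(gate_length_um: float) -> float:
    """Rule-of-thumb FO4 delay in picoseconds: 360 * L, with L in micrometers."""
    return 360.0 * gate_length_um


def clock_frequency_ghz(fo4_ps: float, fo4_per_cycle: float = 8.0) -> float:
    """Clock frequency in GHz for a clock period of `fo4_per_cycle` FO4 delays."""
    period_ps = fo4_per_cycle * fo4_ps
    return 1000.0 / period_ps  # a period of 1000 ps corresponds to 1 GHz


# Assumption: gate length is taken equal to the node name (90, 65, 45 nm).
for node_nm in (90, 65, 45):
    fo4 = fo4_delay_ps(node_nm / 1000.0)
    freq = clock_frequency_ghz(fo4)
    print(f"{node_nm} nm: FO4 = {fo4:.1f} ps, f = {freq:.2f} GHz")
```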

## 5. Decoder design

The decoder is implemented using dynamic logic. A 512-entry RF requires a 9-bit decoder for each global write port and each global read port, so 12 independent decoders are necessary for this design: 4 for the write ports and 8 for the read ports.



Figure 3. Gates used in decoders



Figure 4. Isolator and word line driver of a decoder and its word line signals

A complete decoder is implemented with four 2-input OR gates (Figure 3a) to decode banks, eight 3-input OR gates per bank (Figure 3b) to decode sub-banks and sixteen 4-input OR gates per sub-bank to decode rows. A final stage is added to each entry to isolate the memory from the decoder and to drive the load of each register entry; this is shown in Figure 4. Figure 3 shows the gates used in the decoders: address transitions occur in the first clock phase (pre-charge) and block selection occurs in the second clock phase (evaluation). The box-enclosed NAND array enables access to any port; one array is used for each memory sub-bank. Figure 12 shows the layout of a bank with the corresponding structure of decoders and NAND enable array, and Figure 4 shows the word line signals for several rows. In each cycle the decoder selects one entry of the register file as long as the bank and sub-bank signals are active; the delay is one clock cycle (half for pre-charge and half for evaluation). The average decoder power consumption is 18mW/cycle for the 90nm process.
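The three decode levels described above (2 bits for the bank, 3 bits for the sub-bank, 4 bits for the row) can be sketched as a simple address split. The bit ordering used here, with the bank in the most significant bits, is an assumption made for illustration; the report does not state how the 9 address bits are assigned.

```python
def split_address(addr: int) -> tuple[int, int, int]:
    """Split a 9-bit register address into (bank, sub_bank, row).

    Assumed layout (not stated in the report): bits [8:7] select one of
    4 banks, bits [6:4] one of 8 sub-banks, bits [3:0] one of 16 rows.
    """
    if not 0 <= addr < 512:
        raise ValueError("address must fit in 9 bits")
    bank = (addr >> 7) & 0b11
    sub_bank = (addr >> 4) & 0b111
    row = addr & 0b1111
    return bank, sub_bank, row


def one_hot(index: int, width: int) -> list[int]:
    """One-hot select vector, as produced by one decode level."""
    return [1 if i == index else 0 for i in range(width)]


bank, sub_bank, row = split_address(339)
print(bank, sub_bank, row)
print(one_hot(bank, 4), one_hot(sub_bank, 8), one_hot(row, 16))
```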

## 6. Global write ports bus multiplexer structure

The write ports use a 64-bit 4:1 multiplexer (MUX). Each MUX is implemented using 64 1-bit 4:1 multiplexers with independent true and complemented selection signals; only one of the ports (A, B, C or D) is active at any time. Figure 5 shows the 1-bit 4:1 MUX design using transmission gates.
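A behavioral sketch of the 64-bit 4:1 write-port multiplexer may help to fix the selection rule. It models only the logic function (one-hot selection among ports A to D), not the transmission-gate circuit; the function name and the error handling are illustrative assumptions.

```python
MASK64 = (1 << 64) - 1  # the data bus is 64 bits wide


def mux4_64(ports, select):
    """Return the 64-bit word of the single active port.

    ports  -- four integers (A, B, C, D), each treated as a 64-bit word
    select -- four select lines; exactly one must be active (one-hot)
    """
    if len(ports) != 4 or len(select) != 4:
        raise ValueError("a 4:1 multiplexer needs four ports and four selects")
    active = [bool(s) for s in select]
    if sum(active) != 1:
        raise ValueError("exactly one select line must be active")
    return ports[active.index(True)] & MASK64


print(hex(mux4_64((0x11, 0x22, 0x33, 0x44), (0, 0, 1, 0))))
```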



Figure 5. Multiplexer used in Global Write Port bus and its waveform

## 7. Global read ports structure

The global read ports use a 64-bit tri-state 1:8 demultiplexer (DEMUX). Similarly, each DEMUX is implemented using 64 1-bit 1:8 demultiplexers<sup>2</sup>, as shown in Figure 1. Figure 6 shows a 1-bit 1:8 demultiplexer. Both multiplexers and demultiplexers use the fork circuit shown in Figure 7 to enable the data outputs; this circuit drives the true and complement enable signals.

<sup>2</sup> A demultiplexer is built as a collection of tri-state inverters with the data inputs tied together and the data outputs left free.



Figure 6. 1-bit 1:8-demultiplexer used with Global Read Port bus and its waveforms



## 8. Fork circuits structures

Figure 7. Fork circuit used in a) multiplexers and b) demultiplexers and its waveform

The inverter strings used to compute the true and complement versions of the selection signals (S[3:0] and /S[3:0]) and the enable signals (E[7:0] and /E[7:0]) are called a fork amplifier. It consists of two strings of inverters that share the same input, as shown in Figure 7. One string contains an odd number of inverters and the other an even number. Logical effort was used to optimize the delay, the number of stages and the CMOS P/N ratios. The load on each true and complement select signal is 128/64 unit-size transistors for the multiplexer and 512/256 unit-size transistors for the demultiplexer.

For both MUX and DEMUX the load driven by each leg of the fork is the same, and the delays of both legs must match so that the true and complement output signals emerge at the same time. Table 2 shows the delay evaluation for MUX and DEMUX for the 90nm, 65nm and 45nm processes; the last column shows the total delay.

Table 2. MUX/DEMUX delay evaluation [ns]

| TECH | Circuit | FORK5-4 | TRI-STATE BUFFER | FORK4-3 | TRANSMISSION GATE | Total delay |
|------|---------|---------|------------------|---------|-------------------|-------------|
| 90nm | DEMUX   | 0.086   | 0.025            | -       | -                 | 0.111       |
| 90nm | MUX     | -       | -                | 0.064   | 0.010             | 0.074       |
| 65nm | DEMUX   | 0.083   | 0.022            | -       | -                 | 0.105       |
| 65nm | MUX     | -       | -                | 0.063   | 0.009             | 0.072       |
| 45nm | DEMUX   | 0.070   | 0.021            | -       | -                 | 0.091       |
| 45nm | MUX     | -       | -                | 0.052   | 0.007             | 0.059       |
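The stage-count choice for a fork leg can be illustrated with the standard logical-effort delay estimate for an inverter chain, $D = N F^{1/N} + N p_{inv}$. The sketch below is a simplification: it treats each leg independently, and the path effort of 256 and the parasitic delay of 1 are hypothetical values chosen for illustration, not figures taken from the report.

```python
def chain_delay(path_effort: float, stages: int, p_inv: float = 1.0) -> float:
    """Delay (in units of tau) of an inverter chain with equal stage efforts.

    Logical-effort estimate: D = N * F**(1/N) + N * p_inv.
    """
    return stages * path_effort ** (1.0 / stages) + stages * p_inv


def best_stage_count(path_effort: float, parity: int, max_stages: int = 12) -> int:
    """Stage count of the given parity (0 = even, 1 = odd) with minimum delay."""
    candidates = [n for n in range(1, max_stages + 1) if n % 2 == parity]
    return min(candidates, key=lambda n: chain_delay(path_effort, n))


# Hypothetical path effort; the report does not give this number.
F = 256.0
odd_leg = best_stage_count(F, parity=1)   # inverting leg
even_leg = best_stage_count(F, parity=0)  # non-inverting leg
print(odd_leg, round(chain_delay(F, odd_leg), 2))
print(even_leg, round(chain_delay(F, even_leg), 2))
```

With these assumed numbers the two legs come out at five and four stages with nearly equal delays, which is the property a fork needs so that the true and complement outputs switch together.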

## 9. Memory Cell

SRAM arrays are used in processor structures to store temporary values; they can be single-ported or multi-ported. A multiport SRAM allows simultaneous read and write operations on different memory words, but writing to the same entry through more than one port is not allowed. Simultaneous read accesses to the same entry may be allowed depending on the application, as in the register alias table (RAT) or the physical register file. The transistors of an SRAM cell can be sized with the help of the CMOS transistor operating-mode equations (linear and saturation), considering that a data bit is stored in the cell and will be changed, and taking some technology values into account. Two cases are considered. In the first, the $V_{DS}$ voltage is small compared with ($V_{GS}-V_{TH}$); this is the linear mode, or linear region, of the CMOS transistor. In the second, the channel is pinched off and the device is said to be saturated, with $V_{DS} > V_{GS}-V_{TH}$.
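The two operating modes can be expressed as a small classifier for an NMOS device. This is a textbook-level sketch under the usual long-channel conditions; the function name is illustrative, and the threshold used in the example is the 90nm $V_{THN}$ value that the report lists for its cell sizing.

```python
def nmos_region(v_gs: float, v_ds: float, v_th: float) -> str:
    """Classify the operating region of an NMOS transistor.

    cutoff      -- V_GS <= V_TH (no channel)
    linear      -- V_DS <  V_GS - V_TH
    saturation  -- V_DS >= V_GS - V_TH (channel pinched off)
    """
    if v_gs <= v_th:
        return "cutoff"
    if v_ds < v_gs - v_th:
        return "linear"
    return "saturation"


V_THN_90NM = 0.3970  # threshold voltage listed for the 90nm process
print(nmos_region(1.2, 0.1, V_THN_90NM))
print(nmos_region(1.2, 1.2, V_THN_90NM))
print(nmos_region(0.3, 0.5, V_THN_90NM))
```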

For the read cycle, assume that a "1" is stored in the cell (q=1) and that both bit lines are pre-charged to $\frac{1}{2} V_{DD}$. Before WL<sub>0</sub> is activated, the voltage level of the bit lines does not show any significant variation. Once it is active, no current flows through M6, while in the other half of the cell M5 and M2 conduct a small current and the voltage level of *nbl*<sub>0</sub> begins to drop slightly. The key design constraint is that the voltage at the *nq* node, $V_{nq}$, must not exceed the threshold voltage $V_T$ of transistor M4, so that M4 remains turned off. This voltage is known as the *ripple* voltage ($V_{nq} \leq V_{T,M4}$).



Figure 8. Multiport SRAM cell and Read cycle waves

Assume that the access transistors M5 and M6 are on. The **bl**<sub>0</sub> voltage remains equal to $\frac{1}{2} V_{DD}$, since no current flows through M6. In the other half of the cell, however, M5 operates in saturation and M2 operates in the linear region (a conduction channel exists), and the voltage level of $nbl_0$ drops slightly. Because the bit line capacitance is large, the decrease in bit line voltage is limited to a few hundred millivolts during the read phase. Equating the saturation drain current of M5 with the linear-mode drain current of M2 gives:

$$k_{n,M5}\left(\left(V_{DD} - V_{ripple} - V_{Tn}\right)V_{DSATn} - \frac{V_{DSATn}^2}{2}\right) = k_{n,M2}\left(\left(V_{DD} - V_{Tn}\right)V_{DSATn} - \frac{V_{DSATn}^2}{2}\right)$$

Table 3. Single-port SRAM cell M2/M5 size ratio for several technology processes

$$\frac{k_{n,M2}}{k_{n,M5}} = \frac{\left(\frac{W}{L}\right)_{M2}}{\left(\frac{W}{L}\right)_{M5}} = \frac{\left(V_{DD} - V_{ripple} - V_{Tn}\right)V_{DSATn} - \frac{V_{DSATn}^2}{2}}{\left(V_{DD} - V_{Tn}\right)V_{DSATn} - \frac{V_{DSATn}^2}{2}}$$

| TECH | VDD  | VTHP/VTHN      | UN       | UP       | VALUE  |
|------|------|----------------|----------|----------|--------|
| 90nm | 1.2V | -0.3390/0.3970 | 5.47 E-2 | 7.11 E-3 | 0.8505 |
| 65nm | 1.0V | -0.3650/0.4230 | 4.91 E-2 | 0.57 E-2 | 0.8276 |
| 45nm | 0.9V | -0.4118/0.4660 | 4.39 E-2 | 0.44 E-2 | 0.7926 |

For the write cycle, start from the state shown in Figure 9. To change the data stored in the SRAM cell, for example by forcing nq to $V_{DD}$ and q to "zero", the voltage of node q must be reduced below the threshold voltage of M2 so that M2 turns off first. When $V_{nq} = V_{Tn}$, transistor **M6** operates in the linear region while **M3** operates in saturation.



Figure 9. Write Cycle of multiport SRAM

In summary, transistor **M3** is forced off during a "zero" write operation. If the condition on the inverter transconductance is satisfied, **M1** is then guaranteed to turn on, changing the stored information. Tables 3 and 4 summarize the transistor size ratios of the SRAM cell.

Table 4. Single-port SRAM M3/M6 size ratio for several technology processes

$$\frac{k_{p,M3}}{k_{n,M6}} = \frac{\left(\frac{W}{L}\right)_{M3}}{\left(\frac{W}{L}\right)_{M6}} < \frac{2\left(V_{DD} - 1.5V_{Tn}\right)V_{Tn}}{\left(V_{DD} + V_{Tp}\right)^{2}}$$

| TECH | V<sub>DD</sub> | V<sub>THP</sub>/V<sub>THN</sub> | UN       | UP       | VALUE  |
|------|----------------|---------------------------------|----------|----------|--------|
| 90nm | 1.2V           | -0.3390/0.3970                  | 5.47 E-2 | 7.11 E-3 | 1.5590 |
| 65nm | 1.0V           | -0.3650/0.4230                  | 4.91 E-2 | 0.57 E-2 | 1.4295 |
| 45nm | 0.9V           | -0.4118/0.4660                  | 4.39 E-2 | 0.44 E-2 | 1.0853 |

The transistors of a multiport SRAM cell can be sized in the same way. If the design supports simultaneous reads, the cell must be able to drive the access transistors of all ports that are active in the same clock cycle. The analysis is straightforward: it uses the same saturation and linear-region transistor expressions together with the number N of read-port access transistors. Tables 5 and 6 summarize the transistor ratios for the different technology processes.

Table 5. Multiport SRAM cell M2/M5 size ratio for several technology processes

$$\frac{k_{n,M2}}{k_{n,M5}} = \frac{\left(\frac{W}{L}\right)_{M2}}{\left(\frac{W}{L}\right)_{M5}} < \frac{N\left(\left(V_{DD} - V_{ripple} - V_{Tn}\right)V_{DSATn} - \frac{V_{DSATn}^2}{2}\right)}{\left(V_{DD} - V_{Tn}\right)V_{DSATn} - \frac{V_{DSATn}^2}{2}}$$

| TECH | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 |
|------|-------|-------|-------|-------|-------|-------|
| 90nm | 0.850 | 1.701 | 2.551 | 3.402 | 4.252 | 5.103 |
| 65nm | 0.827 | 1.655 | 2.482 | 3.310 | 4.138 | 4.965 |
| 45nm | 0.792 | 1.585 | 2.377 | 3.170 | 3.963 | 4.755 |

N is the number of read ports.

Table 5 shows that the **M2/M5** ratio increases linearly with the number of read ports, while the **M3/M6** ratio shown in Table 6 decreases in inverse proportion to it. To optimize area, the probability of simultaneous accesses to the same location must be studied. This probability depends on the application (FIFOs, stacks, queues, register files, payload RAM, etc.); in general it is suggested to use **N** equal to the ceiling of half the number of read ports [3]. Finally, considering that the access transistors are identical, it can be concluded that the ratio of the inverters that form the cell is $W_p/W_n = 1.5$.

Table 6. Multiport SRAM cell M3/M6 size ratio for several technology processes

$$\frac{k_{p,M3}}{k_{n,M6}} = \frac{\left(\frac{W}{L}\right)_{M3}}{\left(\frac{W}{L}\right)_{M6}} < \frac{2\left(V_{DD} - 1.5V_{Tn}\right)V_{Tn}}{N\left(V_{DD} + V_{Tp}\right)^{2}}$$

| TECH | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 |
|------|-------|-------|-------|-------|-------|-------|
| 90nm | 1.559 | 0.779 | 0.519 | 0.389 | 0.311 | 0.259 |
| 65nm | 1.429 | 0.714 | 0.476 | 0.357 | 0.285 | 0.238 |
| 45nm | 1.085 | 0.542 | 0.361 | 0.271 | 0.217 | 0.180 |

N is the number of read ports.
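The multiport entries follow directly from the single-port values of Tables 3 and 4: the read-side ratio is multiplied by N and the write-side ratio is divided by N. The sketch below reproduces that scaling; the helper names are illustrative, and truncation to three decimals is an assumption inferred from the tabulated numbers. `Decimal` is used so that truncation is not disturbed by binary floating-point rounding.

```python
from decimal import Decimal, ROUND_DOWN
from math import ceil

# Single-port ratios taken from Tables 3 (M2/M5) and 4 (M3/M6).
SINGLE_PORT = {
    "90nm": {"M2/M5": Decimal("0.8505"), "M3/M6": Decimal("1.5590")},
    "65nm": {"M2/M5": Decimal("0.8276"), "M3/M6": Decimal("1.4295")},
    "45nm": {"M2/M5": Decimal("0.7926"), "M3/M6": Decimal("1.0853")},
}


def truncate3(value: Decimal) -> Decimal:
    """Truncate (not round) to three decimal places."""
    return value.quantize(Decimal("0.001"), rounding=ROUND_DOWN)


def read_ratio(tech: str, n_ports: int) -> Decimal:
    """Read-side ratio for N simultaneously active read ports (scales with N)."""
    return truncate3(SINGLE_PORT[tech]["M2/M5"] * n_ports)


def write_ratio(tech: str, n_ports: int) -> Decimal:
    """Write-side ratio for N read ports (scales with 1/N)."""
    return truncate3(SINGLE_PORT[tech]["M3/M6"] / n_ports)


def design_ports(read_ports: int) -> int:
    """Suggested N: ceiling of half the number of read ports [3]."""
    return ceil(read_ports / 2)


n = design_ports(8)  # 8 global read ports in the 4-way design
print(n, read_ratio("90nm", n), write_ratio("90nm", n))
```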

Table 7 shows the SRAM cell area; it includes the basic 6T RAM cell area plus the wires for bit-lines and word-lines.

Table 7. Multiport SRAM cell area for several technology processes (width [µm] x height [µm] = area [µm<sup>2</sup>])

| TECH | 1 read port / 1 write port | 8 read ports / 4 write ports |
|------|----------------------------|------------------------------|
| 90nm | 2.34 x 2.32 = 5.4          | 8.34 x 5.32 = 44.36          |
| 65nm | 1.61 x 1.64 = 2.6          | 5.61 x 3.64 = 20.42          |
| 45nm | 1.39 x 1.29 = 1.8          | 5.40 x 3.29 = 17.76          |

## 10. Register File Write and Read Circuits

The models used in our evaluations are shown in Figure 10; the read circuit used in our SRAM model is a simple differential voltage sense amplifier.



Figure 10. a) SRAM write circuit b) differential voltage sense amplifier

Table 8. Write and read circuit delays for several technology processes [ns]

| TECH | Write circuit | Sense amplifier |
|------|---------------|-----------------|
| 90nm | 0.022         | 0.163           |
| 65nm | 0.020         | 0.124           |
| 45nm | 0.018         | 0.096           |

Table 8 shows the delay measured for the write circuit and the sense amplifier. The write circuit is very fast because the bit-lines are tied to VCC at all times, so no pre-charge is necessary. The sense amplifier, on the other hand, needs the pre-charge signal to equalize both bit-lines before the read line RL becomes active. The sense amplifier delay includes the first clock phase, used for pre-charge, plus the sense delay. All designs were evaluated using the same write circuit and sense amplifier.

## 11. Power Evaluation

Power dissipation is one of the most important factors in the evaluation of VLSI designs, and new low-power approaches require accurate power consumption simulation for new technology generations. In the general expression for power, $p = C_T V_{DD}^2 f$, the consumption depends strongly on the circuit capacitance. However, estimating $C_T$ requires not only identifying the state changes in the logic gates but also the effective capacitances of the gate and drain regions along with biasing effects, that is, the analog behaviour of the transistors. To monitor power dissipation in VLSI circuits accurately, we use the SPICE simulator and a sub-circuit that measures average power by taking advantage of the voltage equation of a capacitor.



Figure 11. Power meter

This meter is an independent sub-circuit with a current-controlled current source and a parallel *RC* circuit. The average power in a dissipative element with fixed source voltage $V_{DD}$ over a time interval $\Delta t = t_2 - t_1$ is given by:

$$P_{avg} = \frac{V_{DD}}{t_2 - t_1} \int_{t_1}^{t_2} i_{DD}(t)\,dt \qquad (1)$$

Using a current-controlled current source driven by the supply current ($I_s = I_{DD}$, scaled by the gain $\beta$), the voltage on the capacitor $C_y$ of Figure 11 after the time interval $\Delta t = t_2 - t_1$ is given by:

$$V_{C_y}(t_2) = \frac{\beta}{C_y} \int_{t_1}^{t_2} i_{DD}(t)\,dt + v_o \qquad (2)$$

where $v_o$ is the initial capacitor voltage. By running a transient analysis with SPICE and reading $V_{C_y}$ at the end of the interval, equation (2) gives the average power consumption of the circuit directly if $\beta$ is chosen such that:

$$\beta = \frac{C_y V_{DD}}{t_2 - t_1} \quad \text{or} \quad \beta = C_y V_{DD} f \qquad (3)$$

The time interval $t_2 - t_1$ should span one clock period, or an integer multiple of the period, at the operating frequency of the circuit. Measuring the power dissipation is thus equivalent to measuring the supply current flow during the transient analysis.
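The equivalence between equations (1) and (2) under the choice of $\beta$ in equation (3) can be checked numerically. The sketch below is not the SPICE sub-circuit itself: it integrates a synthetic supply-current waveform with the trapezoid rule, and the waveform shape, peak current, clock period and capacitor value are all hypothetical values chosen for illustration.

```python
import math


def integrate(f, t1, t2, steps=10_000):
    """Trapezoid-rule integral of f over [t1, t2]."""
    h = (t2 - t1) / steps
    total = 0.5 * (f(t1) + f(t2))
    for k in range(1, steps):
        total += f(t1 + k * h)
    return total * h


def average_power(i_dd, v_dd, t1, t2):
    """Equation (1): V_DD / (t2 - t1) times the integral of i_DD."""
    return v_dd / (t2 - t1) * integrate(i_dd, t1, t2)


def meter_voltage(i_dd, beta, c_y, t1, t2, v_o=0.0):
    """Equation (2): capacitor voltage at the end of the interval."""
    return beta / c_y * integrate(i_dd, t1, t2) + v_o


# Hypothetical values, for illustration only.
V_DD = 1.2        # supply voltage [V]
PERIOD = 250e-12  # one clock period [s]
I_PEAK = 10e-3    # peak supply current [A]
C_Y = 1e-12       # meter capacitor [F]


def supply_current(t):
    """Synthetic current pulse: one sin^2 lobe per clock period."""
    return I_PEAK * math.sin(math.pi * t / PERIOD) ** 2


beta = C_Y * V_DD / PERIOD  # equation (3)
p_avg = average_power(supply_current, V_DD, 0.0, PERIOD)
v_cy = meter_voltage(supply_current, beta, C_Y, 0.0, PERIOD)
print(f"P_avg = {p_avg * 1e3:.3f} mW, V_Cy = {v_cy * 1e3:.3f} mV")
```

With an initial voltage $v_o = 0$, the capacitor voltage in volts is numerically equal to the average power in watts, which is why a single voltage reading at the end of the interval is enough.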

## 12. The MS-Processor's Register File Power and Timing evaluation

Figure 13 shows the layout of the MSP register file. Power was evaluated with SPICE using predictive technology models for the 90nm, 65nm and 45nm processes. A complete sub-bank was designed for simulation, and the total access energy was computed with the following equation over a time span equal to the corresponding access delay:

$$TAcc\_Power = Acc\_Energy + (N-1) \times Idle\_Energy$$

where TAcc\_Power is the total access (read/write) average power, Acc\_Energy is the sub-bank access energy, Idle\_Energy is the sub-bank idle-state energy and N is the number of sub-banks.



Figure 12. Layout of a memory bank

Table 9. Sub-bank access power evaluation (write/read/idle) [mW]

| TECH | CPR64X192/4B |      |      | CPR64X192/8B |      |      | MSP-RF 4:1W1:8R 64X512/32B |      |      |
|------|--------------|------|------|--------------|------|------|----------------------------|------|------|
|      | Write        | Read | Idle | Write        | Read | Idle | Write                      | Read | Idle |
| 90nm | 8.50         | 6.00 | 0.60 | 5.00         | 7.80 | 0.40 | 2.00                       | 1.00 | 0.05 |
| 65nm | 2.50         | 2.20 | 0.75 | 1.85         | 1.75 | 0.30 | 0.50                       | 0.55 | 0.05 |
| 45nm | 1.80         | 1.10 | 0.30 | 1.20         | 1.20 | 0.30 | 0.50                       | 0.10 | 0.05 |

Each bank contains 8 sub-banks, and four banks complete the register file layout, as shown in Figure 13. Table 9 shows the power consumption of a sub-bank; it is compared with a 64-bit multi-bank register file with 4 write and 8 read ports, in configurations of 192 entries divided into 4 and 8 banks.

## 13. Sub-bank access delay

Table 10 shows the sub-bank access delay evaluations for the 90nm, 65nm and 45nm processes. The delay was computed from the write circuit, sense amplifier, bit-line and word-line delays. For the MSP register file, only the delays of the MUX 4:1 transmission gates and the DEMUX 1:8 tri-state buffers were added to the write and read accesses, because the fork circuit delay can be overlapped given that the port is known in advance (see Figure 14). The global bus delay was not considered, because it is similar for all designs.

Table 10. Sub-bank access delay evaluation (write/read) [ns]

| TECH | CPR4W8R 64X192/4B |       | CPR4W8R 64X192/8B |       | MSP-RF 4:1W1:8R 64X512/32B |       |
|------|-------------------|-------|-------------------|-------|----------------------------|-------|
|      | Write             | Read  | Write             | Read  | Write                      | Read  |
| 90nm | 0.028             | 0.169 | 0.028             | 0.169 | 0.022                      | 0.163 |
| 65nm | 0.025             | 0.129 | 0.025             | 0.129 | 0.020                      | 0.104 |
| 45nm | 0.021             | 0.099 | 0.021             | 0.099 | 0.018                      | 0.096 |



Figure 13. Layout of MSP's Physical Register File

## 14. Total Energy Consumption

Table 11 does not include the energy consumption of the decoder; for comparison we assumed that all designs use similar address decoders. Energy was evaluated over a time span of one cycle, using the clock period and supply voltage projected for each technology in Section 4 and shown in Table 1.

Table 11. Total access power evaluation (write/read) [mW]

| TECH | CPR4W8R 64X192/4B |      | CPR4W8R 64X192/8B |      | MSP-RF 4:1W1:8R 64X512/32B |      |
|------|-------------------|------|-------------------|------|----------------------------|------|
|      | Write             | Read | Write             | Read | Write                      | Read |
| 90nm | 10.30             | 7.80 | 7.80              | 6.30 | 3.55                       | 2.55 |
| 65nm | 4.75              | 4.50 | 2.75              | 2.65 | 2.05                       | 2.10 |
| 45nm | 3.30              | 2.60 | 2.10              | 2.10 | 2.00                       | 1.65 |
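The relation between the sub-bank figures and the total access figures can be sketched with the access-plus-idle formula of Section 12. The function name is illustrative, and the sub-bank counts (32 for the MSP register file, 4 for the four-bank configuration) are read from the column labels, which is an assumption. With the 90nm sub-bank values this gives 3.55 and 2.55 mW for the MSP register file and 10.30 and 7.80 mW for the four-bank configuration.

```python
def total_access_power(access_mw: float, idle_mw: float, n_sub_banks: int) -> float:
    """One sub-bank is accessed while the other N - 1 sub-banks stay idle."""
    return round(access_mw + (n_sub_banks - 1) * idle_mw, 2)


# 90nm sub-bank values [mW]; sub-bank counts assumed from the column labels.
MSP_RF_90NM = {"write": 2.00, "read": 1.00, "idle": 0.05, "n": 32}
CPR_4B_90NM = {"write": 8.50, "read": 6.00, "idle": 0.60, "n": 4}

for name, cfg in (("MSP-RF", MSP_RF_90NM), ("CPR 4B", CPR_4B_90NM)):
    write = total_access_power(cfg["write"], cfg["idle"], cfg["n"])
    read = total_access_power(cfg["read"], cfg["idle"], cfg["n"])
    print(f"{name}: write = {write:.2f} mW, read = {read:.2f} mW")
```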

## 15. Total Access Delay (FO4)

The access delay was computed using the following schemes for write and read operations; shaded (pink) blocks are serial sub-operations. These schemes do not include the delays of global buses such as the write-back bus or the bypass bus.



Figure 14. Schemas for access delay computation

Table 12 shows the delay in FO4 units for the register file designs. These data were computed using the estimated FO4 delays from Table 1 and the sub-bank access delays from Table 10, assuming that the access time is similar for any sub-bank in any bank.

Table 12. Sub-bank access delay evaluation (write/read) [FO4]

| TECH | CPR4W8R 64X192/4B |      | CPR4W8R 64X192/8B |      | MSP-RF 4:1W1:8R 64X512/32B |      |
|------|-------------------|------|-------------------|------|----------------------------|------|
|      | Write             | Read | Write             | Read | Write                      | Read |
| 90nm | 0.86              | 5.21 | 0.86              | 5.21 | 0.97                       | 5.03 |
| 65nm | 1.06              | 5.51 | 1.06              | 5.51 | 0.85                       | 4.44 |
| 45nm | 1.29              | 6.11 | 1.29              | 6.11 | 1.11                       | 5.92 |
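The conversion from nanoseconds to FO4 units can be sketched as a division by the estimated FO4 delay. The sketch assumes that the 360×L estimate (rather than the SPICE-measured FO4 delay) is the one used for the normalization, which is what most entries suggest; the function name is illustrative.

```python
def delay_in_fo4(delay_ns: float, node_nm: float) -> float:
    """Express a delay in FO4 units, using the 360 * L [um] estimate in ps."""
    fo4_ps = 360.0 * node_nm / 1000.0
    return delay_ns * 1000.0 / fo4_ps


# Read delays [ns] of the MSP register file sub-bank for each process.
for node_nm, read_ns in ((90, 0.163), (65, 0.104), (45, 0.096)):
    print(f"{node_nm} nm: read = {delay_in_fo4(read_ns, node_nm):.2f} FO4")
```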

## 16. Conclusions

The MSP register file reduces area and power because of its distributed architecture. Each sub-bank corresponds to one ISA register, so 1 write port and 1 read port are sufficient to update (write) and read the last value of the register. Multiplexing and demultiplexing do not introduce significant access delay, because the first of the two operations needed to select a port can be overlapped with the decode delay. The access delay is very close to that of traditional full-access multi-bank architectures, but power and area are reduced, as shown in Tables 7 and 11. The overall register file area is dominated by the sub-bank interconnect.

## References

[1] V. Agarwal, M. S. Hrishikesh, S. W. Keckler and D. Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 248.

[2] V. Agarwal, S. W. Keckler and D. Burger, The Effect of Technology Scaling on Microarchitectural Structures, Computer Sciences Technical Report TR-00-02, Texas University, Austin, Texas, 2002.

[3] L. Feipei, C. Ying Lang and C. Shyh Jong, A new design methodology for multiport SRAM cell, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 41 (1994), pp. 677.

[4] P. D. Fischer, Clock cycle estimations for future microprocessor generations, 1997, pp. 61.

[5] R. Ho, K. W. Mai and M. A. Horowitz, The future of wires, IEEE Semiconductor Research Corporation Workshop on Interconnects for Systems on a Chip, 2001.

[6] M. S. Hrishikesh, Burger, D., Jouppi, N. P., Keckler, S. W., Farkas, K. I., and Shivakumar P., The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays, 29th Annual International Conference on Computer Architecture, Anchorage, Alaska, 2002, pp. 14-24.

[7] S. M. Kang, Accurate simulation of power dissipation in VLSI circuits, IEEE Journal of Solid-State Circuits, 21 (1986), pp. 889.

[8] J. A. Rabaey, A. Chandrakasan and B. Nikolic, eds., Digital Integrated Circuits: A Design Perspective, Prentice Hall, Upper Saddle River, New Jersey USA, 2003.

[9] W. Zhao and Y. Cao, New generation of Predictive Technology Model for sub-45nm design exploration, 7th IEEE International Symposium on Quality Electronic Design, San Jose CA. USA, 2006, pp. 585-590.