# COMPARATIVE ANALYSIS OF SERIAL VS PARALLEL LINKS IN NOC

Arkadiy Morgenshtein, Israel Cidon, Avinoam Kolodny, Ran Ginosar Electrical Engineering Department, Technion, Haifa, Israel, e-mail: arkadiy@tx.technion.ac.il

## **ABSTRACT**

An analytical model is employed to characterize and compare serial and parallel communication techniques in NoC interconnect. Simulations based on 130nm and 70nm technology parameters show that a serial link reduces power by up to ×5.5 and area by up to ×17 relative to a 32-bit multi-layer parallel link. A single-layer parallel link dissipates lower power but occupies a larger area. We conclude that long on-chip interconnects could benefit from serial links.

#### I. INTRODUCTION

Large Systems-on-Chip (SoC) can employ packet-switched Networks on-Chip (NoC) [1]. Typically, NoC is based on module connection via a mesh-type network of routers. NoC allows design modularity and high level of abstraction in architectural modeling of the system.

Transportation of data packets in NoC is currently performed using multiple parallel links, which have been shown to be more efficient than buffer-based architectures [2]. However, this technique incurs a high area cost when inter-wire spacing, shielding and repeaters are considered. The area can be minimized when multiple metal layers are employed, but using repeaters increases the required area resources due to via blockage [6] and repeater sizes.

Serial links for NoC data transport have been proposed to overcome the drawbacks of parallel links [3][4][5]. They should not only allow savings in wire area and power dissipation and reduction of signal interference, noise and crosstalk, but also eliminate the need for multiple line drivers and buffers. Thus, serial links may be area-efficient not only at the interconnect level, but also at the circuit level, despite the required addition of a serializer and deserializer.

Additional advantages of serialization include the elimination of skew uncertainty thanks to the removal of multiple signal wires; simpler layout and timing verification; blockage reduction thanks to the reduced number of vias and repeaters; and throughput control by changing the serializer frequency. Potential limitations of serial links, such as increased inter-symbol interference (ISI) between successive signals and the need for high-speed operation, can be addressed by encoding and asynchronous communication protocols.

An analytical study has been conducted to investigate the factors related to serial versus parallel links. We present detailed models of both circuitry and wire components. The techniques are compared based on technology parameters, showing power and area consumption versus length and throughput requirements of the link. Analytical models and simulation results are followed by conclusions and future research directions.

#### II. SERIALIZER STRUCTURE

The transformation of parallel multi-bit signal flow into a serial line and vice-versa requires special units at both ends of the link. The Serializer and De-serializer interface the router/module to the serial link. The serializer converts m-bit parallel data into serial form. It must operate at high speed to compensate for the loss of parallelism. This creates a challenging trade-off between transistor scaling and compact, low-power implementation.

The serializer is based on a switch array, and can be controlled either by a Muller pipeline [7] initiated by system clock pulses for asynchronous protocols, or by a synchronous multiplexer controlled by a fast clock. The advantage of the asynchronous implementation is high-speed operation without generating an m-times faster clock, which would incur high area and power consumption. The serializer can be designed for various lane width scenarios, or as a generic unit with a lane width controller applied to the multiplexer and switch array. In this paper we consider only two cases: a fully parallel link and a single-wire serial link.

#### III. ANALYTICAL MODELS

Both serial and parallel links are modeled according to Figure 1 and parameters are derived using the analytical expressions presented in this Section.

Serial Link - The delay of the serializer is calculated as the sum of gate delays. For a capacitive load, the gate delay is expressed by the Logical Effort method [9]:

$$D_{gate} = \tau \cdot (gh + p) \tag{1}$$

where $\tau$ is a technology-dependent time constant, g is the logical effort (independent of transistor sizes), h is the electrical effort and p is the parasitic delay of the gate. Transistor sizes are increased when delay must be minimized to meet throughput demands.
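As an illustration of (1), the following minimal sketch sums Logical Effort gate delays along a path. The numeric values are assumptions chosen for illustration, not parameters from this paper: `TAU_PS` is a hypothetical time constant, the electrical effort of 4 is arbitrary, and the NAND values (g = 4/3, p = 2) are the usual textbook figures for a 2-input NAND. The six-gate path mirrors the critical-path assumption stated in Section IV.

```python
def gate_delay(tau, g, h, p):
    """Delay of a single gate, D = tau * (g*h + p), as in Eq. (1)."""
    return tau * (g * h + p)


def path_delay(tau, stages):
    """Sum of gate delays along a path; stages is a list of (g, h, p) tuples."""
    return sum(gate_delay(tau, g, h, p) for g, h, p in stages)


# Illustrative values only (assumptions, not taken from the paper).
TAU_PS = 5.0                    # hypothetical technology time constant, in ps
NAND2 = (4.0 / 3.0, 4.0, 2.0)   # (g, h, p): textbook 2-input NAND, h = 4 assumed
critical_path = [NAND2] * 6     # six NAND gates in series

if __name__ == "__main__":
    print(f"path delay: {path_delay(TAU_PS, critical_path):.1f} ps")
```

Increasing transistor sizes lowers the electrical effort h seen by each stage, which is how the sketch would reflect the sizing trade-off mentioned above.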

Link optimization by repeater insertion is performed in three stages.

a) Repeaters and cascade driver are modeled using [10]:

$$k_{rep} = \sqrt{\frac{0.4 \cdot C_{int} R_{int} \cdot L^2}{0.7 \cdot R_{l_{inv}} \left(C_{l_{inv}} + C_{pt}\right)}}, \quad h_{rep} = \sqrt{\frac{C_{int} R_{l_{inv}}}{C_{l_{inv}} R_{int}}}$$

$$k_{cas} = \log \left(\frac{h_{cas} \cdot C_{l_{inv}}}{C_{drv}}\right), \quad h_{cas} = \ell \tag{2}$$

where k are counts and h are scaling factors of devices. $C_{drv}$ is assumed to be the input capacitance of the first repeater, and $C_{int}$ and $R_{int}$ are the wire capacitance and resistance per unit of length.



Figure 1. Serial and Parallel link architectures, related parameters and wire structures.

b) Power is minimized by scaling repeaters while having minimal impact on delay as described in [11][15][19]. Delay is calculated using Logical Effort method [9] for gates and repeaters and using Elmore delay model [20] for interconnect. The delays of the *i*<sup>th</sup> repeater-interconnect segment are [12]:

$$\begin{aligned} D_{gate} &= \tau \cdot \left( g_i \cdot \left( \frac{C_{i+1} + C_{w_i}}{C_i} \right) + p_i \right) \\ D_{interconnect} &= R_{w_i} \cdot \left( 0.5 \cdot C_{w_i} + C_{i+1} \right) \end{aligned} \tag{3}$$

$C_i$ and $C_{i+1}$ are the input capacitances of gates i and i+1 respectively, while $C_{w_i}$ and $R_{w_i}$ are the wire capacitance and resistance of segment i.

c) Throughput-centric optimization is applied to wires and repeaters as in [13]. The throughput-per-unit-area is:

$$T_{A} = \frac{1}{D_{link}(W+S)L} \tag{4}$$

where S is metal spacing and W is wire width. Maximal throughput per unit area is achieved iteratively by calculating optimal wire width using the partial derivative of (4) with respect to W and finding the resulting count and size of the repeaters. The outcome of this third and final stage is employed in the following simulations.
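The quantities used in the three stages above can be sketched as plain functions. This is a minimal sketch under stated assumptions, not the optimization flow itself: the function and argument names are hypothetical, units are left to the caller (they only need to be mutually consistent), and the iterative search over wire width is omitted. The cascade-driver terms of (2) are not included.

```python
import math


def repeater_count(c_int, r_int, length, r_inv, c_inv, c_pt):
    """Repeater count k_rep of Eq. (2); c_int and r_int are per unit length."""
    return math.sqrt(
        0.4 * c_int * r_int * length**2 / (0.7 * r_inv * (c_inv + c_pt))
    )


def repeater_size(c_int, r_int, r_inv, c_inv):
    """Repeater scaling factor h_rep of Eq. (2)."""
    return math.sqrt(c_int * r_inv / (c_inv * r_int))


def segment_delay(tau, g, p, c_in, c_next, c_wire, r_wire):
    """Delay of one repeater-interconnect segment: the sum of the gate
    and Elmore interconnect terms of Eq. (3)."""
    d_gate = tau * (g * (c_next + c_wire) / c_in + p)
    d_interconnect = r_wire * (0.5 * c_wire + c_next)
    return d_gate + d_interconnect


def throughput_per_area(d_link, width, spacing, length):
    """Throughput per unit area T_A of Eq. (4)."""
    return 1.0 / (d_link * (width + spacing) * length)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only; length is in um (1.5 cm).
    k = repeater_count(c_int=0.2e-15, r_int=0.1, length=15000.0,
                       r_inv=10e3, c_inv=1e-15, c_pt=1e-15)
    h = repeater_size(c_int=0.2e-15, r_int=0.1, r_inv=10e3, c_inv=1e-15)
    print(f"k_rep ~ {k:.1f}, h_rep ~ {h:.1f}")
```

In a full flow, `segment_delay` would be summed over all segments to give the link delay passed to `throughput_per_area`, and the wire width would then be adjusted until (4) is maximized.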

The serial wire is placed in an intermediate metal layer to maximize the distance to the neighboring wires and supply lanes. In this way the capacitance of the serial wire is minimized allowing high-speed operation.

Parallel Link - A 32-bit parallel link is employed with full shielding [14][15]. The two upper metal layers are used for power distribution and the remaining layers are fully shielded, leaving four or three effective layers for signal distribution (in 130nm or 70nm, respectively). Wire width and repeater parameters are scaled down from the optimum in order to meet the reduced throughput demands of each wire in the link (relative to the serial wire). This is applied iteratively considering the reduced throughput:

$$T_{parallel} = \frac{T_{serial}}{N_w} \tag{5}$$

where $N_w$ is the number of parallel wires in the link. Thanks to the reduced throughput, parallel links dissipate lower power than the serial ones.
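As a worked instance of (5), assume the 32 signal wires of the 32-bit parallel link and the constant serial throughput of 16Gbps used in Section IV. Each parallel wire then needs to carry only

$$T_{parallel} = \frac{16\,\text{Gbps}}{32} = 0.5\,\text{Gbps}$$

which is why wire width and repeater parameters can be relaxed from the serial-link optimum.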

Two types of parallel link structures are considered in the analysis – a typical high-performance multi-layer structure, where signal and shield wires alternate and adjacent layers are used as either perpendicular signal wires [15] or as ground planes [16], while forming waveguides with minimized crosstalk, noise and impedance; and a low-power structure where all signal wires are located in a single intermediate metal layer to reduce capacitance, similar to the serial wire.

**Power -** Total power dissipation of the link is defined by:

$$P_{link} = P_{SerDes} + P_{drivers} + P_{repeaters} + P_{wires} \tag{6}$$

(parallel links do not include power dissipation of the serializer and de-serializer). Each power factor can be defined as the sum of dynamic and leakage power components using:

$$P_{dyn} = \alpha \cdot C \cdot V_{DD}^2 \cdot f \tag{7}$$

$$P_{leak} = W_{tot} \cdot V_{DD} \cdot I_{off} \tag{8}$$

where  $\alpha$  is the activity factor,  $W_{tot}$  is the total width of the devices and  $I_{off}$  is the off-current per device width [18]. The short-circuit power is relatively minor and can be neglected.

$P_{dyn}$ in parallel wires is calculated for a reduced frequency f according to (5). $P_{leak}$ is estimated using data of [17], where leakage current per device width grows fivefold from $0.01\mu A/\mu m$ in 130nm to $0.05\mu A/\mu m$ in 70nm, and is predicted to continue growing as technology scales further.
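The power model can be sketched as follows. The dynamic term uses the standard CMOS switching-power form, alpha * C * V_DD^2 * f, and the leakage term follows (8) with the off-currents quoted above. All other numbers in the usage example (activity factor, capacitance, supply voltage, total device width) are hypothetical values chosen for illustration, and the function names are assumptions of this sketch.

```python
# Off-current per device width quoted in the text, in uA/um.
I_OFF_UA_PER_UM = {"130nm": 0.01, "70nm": 0.05}


def dynamic_power(alpha, c_farad, vdd, freq_hz):
    """Switching power in watts: alpha * C * VDD^2 * f."""
    return alpha * c_farad * vdd**2 * freq_hz


def leakage_power(w_tot_um, vdd, node):
    """Leakage power in watts per Eq. (8): W_tot * VDD * I_off."""
    return w_tot_um * vdd * I_OFF_UA_PER_UM[node] * 1e-6


def per_wire_frequency(f_serial_hz, n_wires):
    """Reduced operating frequency of each parallel wire, following Eq. (5)."""
    return f_serial_hz / n_wires


def link_power(p_serdes, p_drivers, p_repeaters, p_wires):
    """Total link power per Eq. (6); pass p_serdes=0 for a parallel link."""
    return p_serdes + p_drivers + p_repeaters + p_wires


if __name__ == "__main__":
    # Hypothetical inputs: alpha = 0.5, C = 1 pF, VDD = 1.0 V, W_tot = 1000 um.
    f_wire = per_wire_frequency(16e9, 32)
    print("dynamic, serial wire  :", dynamic_power(0.5, 1e-12, 1.0, 16e9), "W")
    print("dynamic, parallel wire:", dynamic_power(0.5, 1e-12, 1.0, f_wire), "W")
    print("leakage, 130nm        :", leakage_power(1000.0, 1.0, "130nm"), "W")
    print("leakage, 70nm         :", leakage_power(1000.0, 1.0, "70nm"), "W")
```

For the same total device width, the leakage term is five times larger in 70nm than in 130nm, which is the effect the results section attributes to the many repeaters and drivers of the parallel link.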

**Area** – Link area is estimated assuming a factor of  $\times 5$  for average device size relative to its  $W \times L$  gate size. The area of wires including repeaters is the maximum of repeaters area and the vertical projection of the wiring:

$$A_{link} = A_{SerDes} + A_{drivers} + \max(A_{repeaters}, A_{wires}) \tag{9}$$

This method defines the effective blockage of area resources, while accounting for the multi-layer structure of the parallel link.
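A minimal sketch of the area estimate in (9) is given below. The function names are hypothetical, and areas are assumed to be in a single consistent unit (for example square micrometers). The ×5 device factor is the one stated above.

```python
DEVICE_AREA_FACTOR = 5.0  # average device area relative to its W x L gate area


def device_area(gate_width, gate_length, count=1):
    """Estimated layout area of `count` identical devices."""
    return DEVICE_AREA_FACTOR * gate_width * gate_length * count


def link_area(a_serdes, a_drivers, a_repeaters, a_wires):
    """Link area per Eq. (9). Repeaters sit under the wires, so only the
    larger of the two footprints is counted as blocked area."""
    return a_serdes + a_drivers + max(a_repeaters, a_wires)


if __name__ == "__main__":
    # Hypothetical areas, for illustration only.
    print(link_area(a_serdes=10.0, a_drivers=5.0, a_repeaters=3.0, a_wires=8.0))
```

For a parallel link, `a_serdes` would be set to zero and `a_wires` would be the vertical projection of the multi-layer wiring rather than the sum over layers.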

## IV. TEST SETUP AND RESULTS

All link components, related expressions and optimizations were modeled with Matlab. Power and area of the link were computed for 130nm and 70nm technologies and for various wire width factors ($\times 1$ to $\times 10$) versus length (with constant $T_{serial}$=16Gbps) and throughput (with constant L=1.5cm). The Berkeley Predictive Technology Model (BPTM) [21] was used to predict parameters of the 70nm process for both interconnects and devices using BSIM3v3 models. These parameters were combined with estimates of the ITRS [17] and were verified using SPICE [12]. Simulations were conducted for two types of parallel links, multi-layer and single-layer structures. In the serial link, the resulting repeater parameters varied with wire length and width, from 1 to 3 repeaters with scaling factors of 31 to 316.

The 32-bit serializer was assumed to have asynchronous control [7], using a critical path of six NAND gates [8]. The total count of logical gates in the serializer with the asynchronous control was assumed to be 500 (accounting for the increased number of gates in asynchronous circuits). Similar assumptions were made for the deserializer.

Multi-Layer Parallel Link - The number of repeaters varied from 1 to 8, with scaling factors of 9 to 47, depending on wire length and width. As can be seen in Figure 2 and Figure 3, there are "break-even" lengths beyond which the serial link (solid line) dissipates lower power: 400-2000μm in 130nm, compared to 170-600μm in 70nm. The relative benefit in 70nm is more pronounced than in 130nm (up to $\times 5.5$ and $\times 3.7$, respectively) due to the increased leakage current of the repeaters and drivers in the parallel link. Figure 4 shows an area reduction of up to $\times 17$ in the serial link in 70nm $(68000\mu\text{m}^2 \text{ vs. } 4200\mu\text{m}^2)$, with a "break-even" point for the shortest narrow wires because the serializer area dominates.

As can be seen in Figure 5, beyond a certain throughput level, the parallel design in 70nm consumes lower power due to transistor scaling in the serializer for reduced circuit delay. The "break-even" in 70nm is at 40Gbps for minimal width wires. Drastic reduction of the area ratio in Figure 6 at high throughput values is also caused by scaling of the serializer.

Single-Layer Parallel Link – The number of repeaters varied from 1 to 3 with scaling factors of 4 to 224. As is evident in Figure 7, the parallel link consumes lower power thanks to reduced wire capacitance and reduced scaling factors and count of repeaters. However, this arrangement results in extremely high area, leading to ×65 ratio between the parallel and serial links.

## V. SUMMARY AND FUTURE WORK

The comparative analysis of interconnects in NoC revealed significant improvements of up to ×5.5 in power and ×17 in area for serial links as compared to parallel links. The main source of this improvement is the low number of wires and repeaters needed for the serial link. Results obtained for 130nm and 70nm technologies show an increasing ratio of improvement due to higher leakage currents in advanced submicron technologies. Two parallel link structures, multi-layer and single-layer, were used as reference; the single-layer link showed better results in terms of power but was dramatically (×65) larger in area.

Future research may consider various levels of serialization, as well as application of wire-pipelining in order to speed up the serial link and to investigate other potential advantages of the technique.



**Figure 2.** Power in serial and parallel links (130nm, multi-layer)



**Figure 3.** Power in serial and parallel links (70nm, multi-layer)



**Figure 4.** Ratio of area vs. link length (70nm, multi-layer)



**Figure 5.** Ratio of power vs. throughput (70nm, multi-layer)



**Figure 6.** Ratio of area vs. throughput (70nm, multi-layer)



**Figure 7.** Ratio of power vs. link length (70nm, single-layer)

## VI. ACKNOWLEDGEMENTS

We thank Michael Moreinis and Alexander Gnusin for their constructive comments and suggestions.

## REFERENCES

- [1] W.J. Dally, B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", DAC Conference, pp. 684-689, 2001.
- [2] E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, "Cost considerations in Network on Chip", *Integration - the VLSI journal*, 2003.
- [3] I. Saastamoinen, T. Suutari, J. Isoaho, J. Nurmi, "Interconnect IP for gigascale system-on-chip", ECCTD, pp. 116-120, 2001.
- [4] T. Suutari, J. Isoaho and H. Tenhunen, "High-speed Serial Communication With Error Correction Using 0.25 μm CMOS Technology," ISCAS, pp. 618-621, 2001.
- [5] I.B. Dhaou, E. Dubrova, H. Tenhunen, "Power efficient intermodule communication for digit-serial DSP architectures in deepsubmicron technology", *Multiple-Valued Logic*, pp. 61 - 66, 2001.
- [6] P. Gupta, A.B. Kahng, Y. Kim, D. Sylvester, "Investigation of Performance Metrics for Interconnect Stack Architectures", SLIP Conference, pp. 23-29, 2004.
- [7] J. Sparsø, S. Furber, "Principles of Asynchronous Circuit Design - A Systems Perspective", Kluwer Academic Publishers, 2001.
- [8] R.W. Brodersen, M.A. Horowitz, D. Markovic, B. Nikolic, V. Stojanovic, "Methods for True Power Minimization", *IEEE/ACM CAD Conference*, pp. 35-42, 2002.
- [9] I. Sutherland, B. Sproull, D. Harris, "Logical Effort: Designing Fast CMOS Circuits", Morgan Kaufmann Publishers, 1999.
- [10] H.B. Bakoglu, "Circuits, Interconnections and Packaging for VLSI", Addison-Wesley, pp. 194-219, 1990.
- [11] P. Kapur, G. Chandra, K.C. Saraswat, "Power Estimation in Global Interconnects & its Reduction Using a Novel Optimization Methodology", DAC Conference, pp. 461-466, 2002.
- [12] A. Morgenshtein, M. Moreinis, I.A. Wagner, A. Kolodny, "Logic Gates as Repeaters (LGR) for SoC Timing Optimization", *IFIP VLSI-SoC Conference*, pp. 99-104, 2003.
- [13] H. Shah, P. Shiu, B. Bell, M. Aldredge, N. Sopory, J.A. Davis, "Repeater Insertion and Wire Sizing Optimization for Throughput-Centric VLSI Global Interconnects", *ICCAD*, pp. 280-284, 2002.
- [14] J.A. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S.J. Souri, K. Banerjee, K.C. Saraswat, A. Rahman, R. Reif, J.D. Meindl, "Interconnect Limits on Gigascale Integration (GSI) in the 21st Century", *Proc. of IEEE*, v. 89, no. 3, pp. 305-324, 2001.
- [15] Y. Cao, C. Hu, A.B. Kahng, S. Muddu, D. Stroobandt, D. Sylvester, "Effects of Global Interconnect Optimizations on Performance Estimation of Deep Submicron Designs", *IEEE/ACM CAD Conference*, pp. 56-61, 2000.
- [16] R. Venkatesan, J.A. Davis, J.D. Meindl, "Compact Distributed RLC Interconnect Models – Part III: Transient in Single and Coupled Lines with Capacitive Load Termination", *IEEE Trans. on Electron Devices*, vol. 50, no. 4, pp. 1081-1093, 2003.
- [17] ITRS Roadmap, Interconnect and Devices, 2003.
- [18] R. Venkatesan, J.A. Davis, K.A. Bowman, J.D. Meindl, "Minimum Power and Area N-Tier Multilevel Interconnect Architectures Using Optimal Repeater Insertion", ISLPED Conference, pp. 167-172, 2000.
- [19] G.S. Garcea, N.P. van der Meijs, R.H.J.M. Otten, "Analytic Model for Area and Power Constrained Optimal Repeater Insertion", ESSCIRC, pp. 591-594, 2003.
- [20] W. C. Elmore, "The transient response of damped linear networks with particular regard to wide band amplifiers," J. Appl. Phys., vol. 19, no. 1, 1948.
- [21] Berkeley Predictive Technology Model (BPTM), www-device.eecs.berkeley.edu/~ptm/introduction.html.