# An Arbitration Look-Ahead Scheme for Reducing End-to-End Latency in Networks on chip

Kwanho Kim, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo Semiconductor System Laboratory Dept. of Electrical Engineering, KAIST Daejeon, Republic of Korea kkh82@eeinfo.kaist.ac.kr

Abstract— With the increasing complexity of system-on-chip, Networks on Chip (NoC) of multi-hop switching require low end-to-end latency for QoS guarantee. An arbitration lookahead scheme is proposed to reduce the end-to-end packet latency in the NoC. Its packet arbitration at each switch is completed in a few cycle advance of the packet arrival. As a result, a packet bypasses the switch without the latency of input queuing and arbitration. This scheme is analyzed on 4x4 mesh topology. Maximum 65% and average 26% latency reduction are obtained under random traffic. Latency reduction up to 36% is achieved under multimedia traffic.

#### I. INTRODUCTION

As the complexity of systems-on-chips (SoCs) increases together with the interconnection issues of new IC technology generations, Networks on Chip (NoC) replaces bus-based interconnect to meet high performance requirement of SoCs. Recent works have been reported, which satisfy the requirements for future SoCs: high bandwidth, scalability, reliability and low power consumption [1-3]. However, low latency which is an important factor in real time applications has not been considered in detail. A previous work, Æthereal, used time-division-multiplexed circuit switching to guarantee bandwidth and to limit maximum latency [4]. However, the approach was unsuitable for the network with many switching hops due to the overhead of circuit construction and deconstruction and the synchronization between switch arbiters.

End-to-End network latency t<sub>e-e</sub> is given by

$$t_{e-e} = \sum t_{hop} + \sum t_{wire}$$
(1)

where the  $t_{hop}$  and  $t_{wire}$  are latency of a switch and propagation delay of a wire, respectively. The  $t_{wire}$  is intrinsic delay time which is determined by RC time constant, therefore, it cannot be changed. On the contrary, the  $t_{hop}$  varies according to the switching scheme and routing algorithm of the NoC, therefore, it can be reduced using low-latency techniques.

Because the portion of  $t_{hop}$  in the  $t_{e-e}$  is an important factor for the effectiveness of the low-latency schemes, the ratio of  $t_{hop}$  to  $t_{wire}$  is investigated as shown in Table I based on the ITRS 2003 report [5]. The  $t_{hop}$  is assumed to be 4 clock cycles the same as the previous works [2, 3]. Table I shows the  $t_{hop}$  is 10 times higher

Table I. The ratio of  $t_{hop}$  to  $t_{wire}$ 

| year                      | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
|---------------------------|------|------|------|------|------|------|------|------|
| t <sub>hop</sub> *[ps]    | 1296 | 959  | 769  | 590  | 431  | 365  | 324  | 266  |
| t <sub>wire</sub> ** [ps] | 105  | 139  | 182  | 224  | 229  | 288  | 358  | 380  |
| $t_{hop}/t_{wire}$        | 12.4 | 6.9  | 4.23 | 2.64 | 1.88 | 1.27 | 0.9  | 0.7  |

\* t<sub>hop</sub> is the sum of input queuing delay, arbitration latency and switching fabric latency. It is assumed that the network frequency is the same as MPU clock frequency.

\*\* twire is interconnect RC delay for 1mm global line

than that of  $t_{wire}$  in year of 2003, and the  $t_{hop}$  will remain comparable to the  $t_{wire}$  until year of 2010. This implies that the  $t_{e-e}$  can be effectively reduced by decreasing the  $t_{hop}$ .

In this paper, we propose an arbitration look-ahead scheme (ALOAS) to reduce the value of  $t_{hop}$ . In the ALOAS, routing information of a packet is broadcasted via sideband channel in advance of the packet injection into the network. Using the routing information, each switch pre-establishes the corresponding path before the packet arrives at the switch. As a result, the packet bypasses the input queuing and arbitration stages and passes through the pre-established channel.

This paper is organized as follows: in section 2 the ALOAS is proposed and diagnosed analytically. In section 3 the performance of ALOAS is analyzed under various traffic conditions on 4x4 mesh network. Finally conclusion will be made in section 4.

#### II. ARBITRATION LOOK-AHEAD SCHEME

#### A. Operational Description

In NoCs, deterministic routing scheme including source routing is widely used [2] because it is a cost-effective scheme for network and transport layer design. In this scheme, when a packet is injected into the network, the packet routing path is already determined and the packet is routed through arbitration at each switch hop. However, arbitration at each switching hop is unnecessary and inefficient if there is no output conflict. In other words, the  $t_{hop}$  can be reduced by avoiding the arbitration process in multiple switching hops. The proposed scheme here is the



Figure 1. Proposed Arbitration Look-Ahead Scheme in 4x4 Mesh NoC

low-latency scheme that can be applied under this condition.

In this part, the operation of the ALOAS is explained in a 4x4 mesh NoC architecture. Fig. 1 shows the 4x4 mesh NoC with the ALOAS. The NoC is divided into 4 local areas and each area has a local arbiter (LA). All the switches in a local area are connected to the LA. When a switch receives a packet from a processing unit (PU), the switch extracts routing information (RI) from the packet and transfers the RI to its LA. Based on the RI, the LA controls the local switches to pre-establish the routing path. The RI is also broadcasted to other LAs through the LA-to-LA links for end-to-end routing path to be established.

Fig. 2 illustrates more detailed operations of a LA and a switch. LA consists of routing transfer and path formation units. They control the local switches using the RI from a switch and the broadcasted RI from other LAs, respectively. The RI transferred from a switch to a LA is called RI packet. The RI packet contains the 2b source address to indicate the source PU among the four local PUs, and 4b destination address for the destination PU among the 16 PUs of the NoC. In routing transfer unit, the transferred RI packets are stored in small FIFO and one of them is selected by arbiter. Then, the selected RI packet is used as an input to the lookup table (LUT) where grant (grant1) signals are stored for routing path formation within local area. It is also broadcasted to the path formation unit of other LAs. In path formation unit, decoder uses the broadcasted RI packet to generate grant signals that will be given to local switches. When this unit receives the multiple RI packets, arbiter decides a final grant (grant2) signal among decoded grant signals to prevent contention. LA grant is then selected between grant1 signal in routing transfer unit and grant2 signal in path formation unit, and is sent to the corresponding switch. The switch connects its input port to the output port to set up routing path using LA grant transferred from LA. After all of the processes are finished successfully without contention, whole packet routing path forms one circuit for packets to bypass.

In the ALOAS, when the *i*th switch fails to establish the path due to the pre-established path, the packet is routed with



Figure 2. Block diagram of (a) a Local Arbiter and (b) a Switch

bypassing up to the (*i*-1)th switch and with conventional method after the *i*th switch. In this case, the routing path after the *i*th switch is canceled by disabling VALID signal. A switch generates HOLD signal to stop LA grant transmission from a LA to a switch when the routing path is already formed by other LAs.

It is possible to use one central arbiter for the implementation of ALOAS. However, it increases the control complexity and results in an area overhead due to many long global links compared with the distributed LA solution. Also, multi-level hierarchical LAs can be used to apply ALOAS to more complex NoCs.

#### B. Analytic Latency Analysis

(

In this section, the latency reduction by ALOAS is analyzed analytically in case of no output conflict. In the ALOAS when a packet is injected into the network, a packet passes through the first a few hops without bypassing, in other words, with conventional mode until the routing path is established. Let M be the hop count that a packet passes without bypassing and t<sub>R</sub> be the routing path setup time as shown in Fig. 3. If t<sub>R</sub> is given by the network condition, we can calculate M value to satisfy the following equation.

$$M-1) \cdot \mathbf{t}_{\mathrm{hop}} < \mathbf{t}_{\mathrm{R}} \le M \, \mathbf{t}_{\mathrm{hop}} \tag{2}$$

Estimated latency  $t_R$  is 6.5 clock cycle as shown in Fig. 3. Therefore, *M* in the 4x4 mesh is 2. In this case, the packet passes through the first two hops with conventional mode. After the routing path setup time, the packet passes through the next hops with bypassing thus the packet latency is reduced. When a packet is transferred with hop count lower than 2, it is routed only with conventional mode and the ALOAS is not enabled to prevent the unnecessary operations of LAs.



Figure 3. Packet and RI propagation path till routing path setup time

Let *N* be the total hop count that a packet passes from a source to a destination and  $t_{bp}$  be the latency for bypassing a switch hop. If there is no output conflict, end-to-end packet latency  $t_{e-e}$  is given by the following equations.

Conventional: 
$$t_{e-e} = N t_{hop}$$
 (3)

Proposed: 
$$t_{e-e} = M t_{hop} + (N-M) \cdot t_{bp}$$
 (4)

where  $t_{bp} = t_{SF} + t_{wire}$ 

Latency Reduction : 
$$\Delta t_{e-e} = (N-M) \cdot (t_{hop} - t_{bp})$$
 (5)

If the network frequency,  $t_{wire}$  and  $t_{SF}$  are 1GHz, 0.3ns and 0.1ns, respectively,  $t_{hop}$  and  $t_{bp}$  are given as 4ns and 0.4ns, respectively. For example, when a packet propagates from SW0 to SW15 in the 4x4 mesh, N = 7 and M = 2. In this case, 65% latency reduction is obtained.

If output conflicts exist due to many RI packets transferred from local switches to LA, the analysis above needs to be changed. In this case, the  $t_{BUF}$  in the  $t_R$  increases, which makes *M* larger than 2. That means the value of M varies according to the degree of traffic congestion. Therefore, we first needs to determine the proper hop count threshold,  $N_{th}$ , which will begin to apply the ALOAS according to the traffic pattern and network topology. If a hop count of a packet is higher than  $N_{th}$ , the packet operates with the ALOAS. Otherwise, the packet operates with conventional mode. We will show the optimal  $N_{th}$  for the maximum performance improvement in the next section.

#### III. PERFORMANCE ANALYSIS ON 4X4 MESH

We evaluate the performance of the proposed ALOAS on 4x4 mesh NoC where wormhole routing is used. The NoC has a 5x5 crossbar switch at each node. The latency for bypassing a switch is assumed as 0.5 clock cycle.

## A. Evaluation under random traffic

Fig. 4 shows the average packet latency versus  $N_{th}$  at various offered load. As a result, the minimum latency is obtained when  $N_{th}$  is 5 in the most cases. According to the result, when the ALOAS is overused, *i.e.*  $N_{th} < 5$ , the packet latency increases because the contention of SRI packets also increases.

A comparison of the normalized packet latency between the ALOAS with the optimum  $N_{th}$  and a conventional scheme is shown in Fig. 5. The maximum latency reduction, 65%, is obtained



Figure 4. Average latency versus Nth at various offered load

when there is no output conflict at every switching hop. Even when the network is fully congested, there is no latency overhead. Under random traffic with 20000 packets at varying offered load, the packet latency is reduced by 26% on average.

A normalized latency under uniform traffic according to the offered load is shown in Fig. 6. As the network is lightly loaded, the ALOAS shows better performance improvement because most of long-routing packets succeed to establish a routing path. On the other hand, when network is heavily loaded, the probability to a success of establishing the routing path becomes lower because output conflicts increase at each switch. As a result, 42% and 13% latency reduction is obtained at offered load of 0.1 and 0.5, respectively.



Figure 5. Latency comparison under random traffic



Figure 6. Latency comparison under uniform traffic

## B. Evaluation under multimedia traffic

It is well known that multimedia traffic has bursty nature. In the ALOAS, burst mode operation can be supported as follows. The first packet is routed using ALOAS for routing path formation and following packets pass through the previously formed path without the additional operation of LA. When the final packet of burst packets passes through a switch, a LA grant signal given to the switch is canceled, *i.e.* BPEN = 0.

Because the performance improvement of ALOAS strongly depends on the traffic patterns, the evaluation under realistic traffic is crucial. Therefore we traced the data transaction between a RISC processor and system memory while the processor runs a 3D graphics program with portable 3D graphics library [6]. As a result, we retrieved seven different traces of various 3D scenes.

Real-time 3D graphics application requires huge computing power and corresponding memory bandwidth [6]. In order to satisfy the requirements, multiple IPs for 3D graphics application are mapped on 4x4 mesh as shown in Fig. 7. The seven processing elements (PE) generate the data packets according to the corresponding real traces. The memory is divided into 4 modules to increase the memory bandwidth. The remaining five processing nodes emulate other applications by uniform traffic generators (TGEN).



Figure 7. Mapping for 3D graphics applications on 4x4 mesh

We evaluate the performance of a conventional NoC and a proposed NoC with ALOAS as shown in Fig. 8. We also evaluate the proposed NoC with and without burst mode. The proposed NoC without burst mode performs better than conventional one but performs worse than the proposed NoC with burst mode where the RI packet overhead can be reduced more. As a result, the proposed NoC with burst mode saves 21.1% latency than conventional one

under multimedia traffic condition. The result also shows that when we use RISC processor with cache, the network has better performance because the traffic load decreases. In conclusion the proposed ALOAS with burst mode is effective on multimedia traffic.



Figure. 8 Latency comparison under multimedia traffic

# IV. CONCLUSION

We have presented an arbitration look-ahead scheme for reducing the end-to-end packet latency in multihop NoC. In the ALOAS, each switch pre-establishes the routing path using LA before a packet arrives at the input port of the switch so that the packet can bypass the input queuing and arbitration stage at each hop. The layout result of LA shows that the area of LA is as small as about 10% of the switch area. The simulation results show maximum 65% and average 26% end-to-end latency reduction compared with a conventional NoC. We applied the ALOAS to multimedia-oriented NoC for the performance evaluation under multimedia traffic. As a result, 36% latency reduction was obtained compared with conventional NoC.

#### REFERENCES

- L. Benini, G. De Micheli, "Networks on Chip: A New SoC paradigm," in Computer Magazine, pp 70-78, January 2002
- [2] S.J.Lee, S.J.Song, K.Lee, J.H.Woo, S.E.Kim, B.K.Nam and H.J.Yoo, "An 800MHz Star-Connected On-Chip Network for Application to Systems on a Chip," IEEE International Solid-State Circuits Conference, pp 372-373, Feb 2003
- [3] K.Lee, S.J.Lee, S.E.Kim, H.M.Choi, D.Kim, S.Kim, M.W.Lee and H.J.Yoo, "A 51mW 1.6GHz On-Chip Network for Low-Power Heterogeneous SoC Platform," IEEE International Solid-State Circuits Conference, pp 152-153, Feb 2004
- [4] A.Jantsch and H.Tenhunen, Networks on Chip, Kluwer Academic Publishers, 2003, pp 61-84.
- [5] International Technology Roadmap for Semiconductors (http://public.itrs.net/)
- [6] J.H.Sohn, R.Woo and H.J.Yoo, "Optimization of portable system architecture for real time 3D graphics," IEEE International Symposium on Circuits and Systems, pp769-772, May 2002