# A 231MHz, 2.18mW 32-bit Logarithmic Arithmetic Unit for Fixed-Point 3D Graphics System

Hyejung Kim, Byeong-Gyu Nam, Ju-Ho Sohn and Hoi-Jun Yoo Semiconductor System Laboratory Department of Electrical Engineering and Computer Science, KAIST, Daejeon, Korea seeseah@eeinfo.kaist.ac.kr

*Abstract*—A 32-bit fixed-point logarithmic arithmetic unit is designed for mobile 3D graphics system. The proposed logarithmic arithmetic unit performs division, reciprocal, square-root, reciprocal-square-root and square operations in 2-cycle, and powering operation in 5-cycle. It uses programmable precision for accurate 3D pipeline computation and 8-region piecewise linear approximation model for logarithmic and exponential conversion to reduce the operation error under 0.2%. Its test chip is implemented by 1-poly 6-metal 0.18um CMOS technology with 9k gates. It operates at the maximum frequency of 231MHz and consumes 2.18mW.

## I. INTRODUCTION

Real-time 3D graphics is one of the most attractive applications for mobile systems, which have limited battery energy and small memory capacity. Accordingly, most mobile systems use low-power 32-bit processors such as ARM or MIPS, and fixed-point arithmetic units are used because they consume less power than floating-point units [1]. Division, reciprocal, square-root and powering operations are crucial functions for real-time 3D graphics systems, yet they still require high computing power [2]-[3]. They should therefore be carefully optimized for low power consumption.

Logarithmic and antilogarithmic algorithms have been studied to reduce the latency and simplify these complex operations [4]-[7]. However, their operation error has not been carefully examined for real applications. Although many approximation methods for error reduction have been proposed, their errors are still too large for use in 3D graphics systems [6]. In addition, the precision should be programmable for fixed-point arithmetic to be used in this application.

In this paper, we propose a logarithmic arithmetic unit that reduces both error and hardware complexity and provides variable precision control for fixed-point 3D graphics applications, while improving speed and power consumption. Its reduced area and power consumption make it suitable for mobile systems. We verify the proposed 32-bit logarithmic arithmetic unit by fabricating it in silicon, and describe the detailed measurement results.

TABLE I. OPERATIONS IN LOGARITHMIC NUMBER SYSTEM

| Operation              | Mnemonic | Normal             | Logarithmic    |
|------------------------|----------|--------------------|----------------|
| Multiplication         | MUL      | $z = x \cdot y$    | $X + Y$        |
| Division               | DIV      | $z = x / y$        | $X - Y$        |
| Reciprocal             | RCP      | $z = 1/x$          | $-X$           |
| Square Root            | SQRT     | $z = \sqrt{x}$     | $X/2$          |
| Reciprocal Square Root | RSQ      | $z = 1/\sqrt{x}$   | $-X/2$         |
| Square                 | SQR      | $z = x^2$          | $2X$           |
| Powering               | POW      | $z = x^y$          | $Y + \log_2 X$ |

#### II. 32-BIT LOGARITHMIC ARITHMETIC UNIT

## A. 32-bit Logarithmic Arithmetic Unit

The Logarithmic Arithmetic Unit (LAU) computes complex functions such as multiplication, division and square-root by using only simple addition, subtraction and shift operations. With  $X=\log_2 x$  and  $Y=\log_2 y$ , Table I summarizes the normal operations and their equivalent logarithmic operations.
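The identities in Table I can be checked numerically with a short sketch. It uses exact floating-point logarithms rather than the converters of the LAU, and the function name `lns_eval` is hypothetical; it only illustrates how each operation reduces to an addition, subtraction, negation or shift in the logarithmic domain.

```python
import math

def lns_eval(op, x, y=None):
    """Evaluate one Table I operation by working on base-2 logarithms.

    Illustrative only: exact math.log2 stands in for the hardware converters.
    """
    X = math.log2(x)
    if op == "MUL":
        Z = X + math.log2(y)
    elif op == "DIV":
        Z = X - math.log2(y)
    elif op == "RCP":
        Z = -X
    elif op == "SQRT":
        Z = X / 2          # a 1-bit right shift on a fixed-point logarithm
    elif op == "RSQ":
        Z = -X / 2
    elif op == "SQR":
        Z = 2 * X          # a 1-bit left shift on a fixed-point logarithm
    else:
        raise ValueError(f"unsupported operation: {op}")
    return 2.0 ** Z        # antilogarithm brings the result back


print(lns_eval("DIV", 10.0, 4.0))
print(lns_eval("RSQ", 16.0))
```

The powering operation is left out here because it needs a second logarithm, which the text treats separately.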

Figure 1. Top Architecture of LAU

TABLE II. COMPARISON OF LOGARITHMIC APPROXIMATION ERROR

|                         | Mitchell | SanGregory | Combet | Hall  | Abed  | This work |
|-------------------------|----------|------------|--------|-------|-------|-----------|
| Percent Error Range (%) | 5.361    | 1.192      | 0.452  | 0.907 | 0.307 | 0.050     |

Although the 3D graphics pipeline requires a different precision at each stage for accuracy optimization, a fixed-point number system supports only one fixed precision. We solve this problem by making the precision programmable. The variable precision Qm.n is decided by the value 'n' input by the user, where 'm' is the number of bits representing the integer part and 'n' is the number of bits representing the fractional part [1].
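As a minimal illustration of the Qm.n notation, the sketch below quantizes a real value to a 32-bit word with a programmable number of fractional bits. The helper names, the unsigned interpretation and the relation m = 32 - n are assumptions made for this example; the text does not state how a sign is represented.

```python
WORD_BITS = 32  # operand width of the LAU


def to_fixed(value, n):
    """Quantize a non-negative real value to an unsigned Qm.n word (m = 32 - n, assumed)."""
    raw = round(value * (1 << n))
    if not 0 <= raw < (1 << WORD_BITS):
        raise OverflowError("value does not fit in Qm.n with this n")
    return raw


def from_fixed(raw, n):
    """Interpret a raw word as Qm.n and return the real value."""
    return raw / (1 << n)


# The same value stored with two different precisions.
print(to_fixed(2.5, 16), to_fixed(2.5, 4))
```

A larger n gives finer resolution at the cost of integer range, which is why a per-stage choice of n is useful in the pipeline.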

The top architecture of the proposed LAU is shown in figure 1. The unit is composed of two LOG-CONVs (logarithmic converters) in the first stage, and an SCU (Simple Calculation Unit), which contains an inverter, an ADD/SUB (adder/subtracter) and a BSH (barrel shifter), followed by an EXP-CONV (exponential converter) in the second stage. x and y are the operands of the LAU, n decides the precision of each computation, and op selects the operation to be computed. The LAU is pipelined to achieve a high clock frequency. Since the logarithmic converter takes more time than the exponential converter, the SCU is placed in the second stage to distribute the time budget evenly. The LAU therefore performs the functions given in Table I in 2 cycles, except for the powering operation,  $x^{y}$ . In addition, when a single-operand computation such as square-root is performed, the second logarithmic converter is turned off by clock gating to reduce the power consumption.

Specular lighting is one of the most time-consuming parts of the 3D graphics lighting algorithm because of its powering computation. In general,  $x^{y}$  is computed by long iterations or by using a large look-up table. A unique solution for  $x^{y}$  is developed for the LAU:  $x^{y}$  is computed as  $2^{y \log_2 x}$  with 6 operations in only 5 cycles, as shown in the following steps.

| Step    | Operations                             |
|---------|----------------------------------------|
| step 1. | $X = \log_2 x$                         |
| step 2. | $X' = \log_2 X$ , $Y = \log_2 y$       |
| step 3. | $Z = X' + Y$ , $z = 2^Z$               |
| step 4. | $2^z = 2^{y \cdot \log_2 x} = x^y$     |

The error induced by the powering operation is less than 0.15%. The LAU thus enables fast lighting computation with unnoticeable quality degradation.
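The four steps can be traced with exact floating-point logarithms, as in the sketch below. The function name `pow_via_logs` is hypothetical, and the restriction to x > 1 and y > 0 is an assumption of this example (it keeps both logarithms in step 2 defined); how the hardware treats other operand ranges is not described in the text.

```python
import math


def pow_via_logs(x, y):
    """Compute x**y using two logarithms, one addition and two antilogarithms.

    Assumes x > 1 and y > 0 so that log2(log2(x)) and log2(y) both exist.
    """
    if x <= 1.0 or y <= 0.0:
        raise ValueError("this sketch assumes x > 1 and y > 0")
    X = math.log2(x)        # step 1
    X_prime = math.log2(X)  # step 2: logarithm of the logarithm
    Y = math.log2(y)        # step 2: logarithm of the exponent
    Z = X_prime + Y         # step 3: the product y * log2(x), in the log domain
    z = 2.0 ** Z            # step 3: z = y * log2(x)
    return 2.0 ** z         # step 4: x**y


print(pow_via_logs(3.0, 2.5), 3.0 ** 2.5)
```

The multiplication y·log2(x) is never performed directly; it is replaced by the addition in step 3, which is the point of the sequence.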

#### B. Logarithmic Converter Block

In general, binary logarithm conversion algorithms use piecewise interpolation. Previous work usually used piecewise linear methods with 2 to 4 regions; their error ranges are shown in table-II [6], and the minimum among them is 0.31%. In this study, we divide the fraction part into 8 regions to reduce the error, and use straight-line interpolation in each region, as shown in figure 2. Equations (1)-(3) and table-III show the approximation used in this study. x is the 32-bit input, k is the characteristic of the logarithm, and  $\log_2(1+f)$  is the fractional part, in the range [0, 1).  $\alpha$  and  $\beta$  are the approximation coefficients, which have 7-bit and 10-bit resolution, respectively.







Figure 3. Architecture of Logarithmic Converter

$$x = 2^{k}(1+f), \tag{1}$$

$$\log_2 x = k + \log_2(1+f), \tag{2}$$

where

$$\log_2(1+f) \cong \alpha \cdot f + \beta. \tag{3}$$

The input has variable precision Qm.n. The input range is  $2^{-32} < x < 2^{32}$ , which is equivalent to  $-32.0 < \log_2 x < 32.0$ . Internally, the precision is changed to Q6.26: the 6 MSBs of  $\log_2 x$  hold the characteristic as a 2's complement signed number, and the remaining 26 bits hold the fractional part as an unsigned number.
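The Q6.26 layout can be illustrated by packing a logarithm into a 32-bit word and reading it back. This is only a sketch of the format described above; the function names are hypothetical, and truncation of the fraction to 26 bits is an assumption of the example.

```python
import math

FRAC_BITS = 26                      # unsigned fractional part
FRAC_MASK = (1 << FRAC_BITS) - 1


def pack_q6_26(log_value):
    """Pack a logarithm into 6 bits of signed characteristic and 26 fraction bits."""
    k = math.floor(log_value)       # characteristic, may be negative
    frac = log_value - k            # fractional part in [0, 1)
    if not -32 <= k <= 31:
        raise OverflowError("characteristic does not fit in 6 signed bits")
    frac_raw = min(int(frac * (1 << FRAC_BITS)), FRAC_MASK)  # truncate (assumed)
    return ((k & 0x3F) << FRAC_BITS) | frac_raw


def unpack_q6_26(word):
    """Recover the real logarithm from a Q6.26 word."""
    k = word >> FRAC_BITS
    if k >= 32:                     # undo the 6-bit two's complement wrap
        k -= 64
    return k + (word & FRAC_MASK) / (1 << FRAC_BITS)


# -2.25 is stored as characteristic -3 and fraction 0.75.
print(hex(pack_q6_26(-2.25)), unpack_q6_26(pack_q6_26(-2.25)))
```

Keeping the fraction unsigned means a negative logarithm is represented by a more negative characteristic plus a positive fraction, which matches the split used later in the exponential converter.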

TABLE III. COEFFICIENT OF LOGARITHMIC APPROXIMATION MODEL

| f          | α     | β     | f          | α     | β     |
|------------|-------|-------|------------|-------|-------|
| [0.0, 1/8) | 1.390 | 0.000 | [4/8, 5/8) | 0.922 | 0.120 |
| [1/8, 2/8) | 1.234 | 0.015 | [5/8, 6/8) | 0.859 | 0.163 |
| [2/8, 3/8) | 1.109 | 0.045 | [6/8, 7/8) | 0.797 | 0.210 |
| [3/8, 4/8) | 0.992 | 0.089 | [7/8, 8/8) | 0.742 | 0.258 |
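Equation (3) with the coefficients of table-III can be tried out in floating point as sketched below. The function name `approx_log2` is hypothetical, `math.frexp` stands in for the hardware normalization, and the offset of the [3/8, 4/8) region is taken as 0.089, the value that keeps adjacent segments continuous (an assumption of this sketch). Because the coefficients are used here as three-decimal values, the error of this sketch should not be read as the error of the fabricated converter.

```python
import math

# (alpha, beta) for each eighth of the fraction range, following table-III.
# The beta of segment [3/8, 4/8) is taken as 0.089, the value that keeps the
# neighbouring segments continuous; that choice is an assumption of this sketch.
LOG_COEFFS = [
    (1.390, 0.000), (1.234, 0.015), (1.109, 0.045), (0.992, 0.089),
    (0.922, 0.120), (0.859, 0.163), (0.797, 0.210), (0.742, 0.258),
]


def approx_log2(x):
    """Approximate log2(x) as k + alpha*f + beta, per equations (1)-(3)."""
    if x <= 0.0:
        raise ValueError("log2 is only defined for positive x")
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    k = exponent - 1                     # characteristic
    f = 2.0 * mantissa - 1.0             # x = 2**k * (1 + f), 0 <= f < 1
    alpha, beta = LOG_COEFFS[int(f * 8)] # region selected by the top 3 bits of f
    return k + alpha * f + beta


print(approx_log2(10.0), math.log2(10.0))
```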



Figure 4. Architecture of FApp (Fractional Approximation) Block

Figure 3 shows the proposed architecture of the logarithmic converter block. x is the operand to be converted into a logarithmic number, and n is the precision selection bit. The logarithmic converter block is composed of a CLZ (Count Leading Zero), a BSH (barrel shifter), a CGen (Characteristic Generator) and an FApp (Fractional part Approximation).
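One plausible reading of how the CLZ, BSH and CGen blocks cooperate is sketched below: the leading-zero count locates the leading one, the shifter normalizes the word, and the characteristic follows from the leading-one position and n. The function name `decompose` and the exact bit arithmetic are assumptions of this example, not details taken from figure 3.

```python
def decompose(raw, n):
    """Split a 32-bit unsigned Qm.n word into characteristic k and fraction f.

    The represented value is raw / 2**n = 2**k * (1 + f).
    """
    if not 0 < raw < (1 << 32):
        raise ValueError("raw must be a non-zero 32-bit word")
    clz = 32 - raw.bit_length()                # CLZ: count leading zeros
    k = (31 - clz) - n                         # CGen: leading-one position minus n
    normalized = (raw << clz) & 0xFFFFFFFF     # BSH: move the leading one to bit 31
    f_bits = normalized & 0x7FFFFFFF           # drop the leading one
    return k, f_bits / (1 << 31)               # f in [0, 1)


# 2.5 in Q16.16 is 0x28000: 2.5 = 2**1 * (1 + 0.25)
print(decompose(0x28000, 16))
```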

Since the FApp has the longest delay, optimizing it is important. The detailed architecture of the proposed FApp is shown in figure 4. The FApp generates the approximated fractional value of equation (3). It is built from hardwired shifter, MUX, CSA (Carry Save Adder) and CPA (Carry Propagate Adder) blocks. f is the fractional part,  $\beta$  is the approximation coefficient, and s0-s3 are selection bits derived from the 3 MSBs of f. Although the fractional part has very high resolution, the coefficients  $\alpha$  and  $\beta$  are selected to minimize the number of MUXes and adders, which reduces the latency and power consumption.
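Because the slope has only 7-bit resolution, the product of the slope and f can be formed from a few shifted copies of f that are then summed, with no general multiplier. The sketch below shows that idea in software; the function name and the encoding of the slope as an integer number of 1/128 steps are assumptions of this example, and it does not reproduce the specific shifter and MUX arrangement of figure 4.

```python
def shift_add_multiply(f_raw, alpha_q7):
    """Multiply a fixed-point fraction by a coefficient given in units of 1/128.

    Only shifts and additions are used: one shifted copy of f_raw per set bit
    of the coefficient, followed by a final rescaling shift.
    """
    acc = 0
    bit = 0
    remaining = alpha_q7
    while remaining:
        if remaining & 1:
            acc += f_raw << bit   # a hardwired shift needs no logic in hardware
        remaining >>= 1
        bit += 1
    return acc >> 7               # divide by 128 to undo the coefficient scaling


# A slope of 0.922 is close to 118/128 (rounding assumed for this example).
f_raw = 3 << 24                   # 0.75 with 26 fractional bits
print(shift_add_multiply(f_raw, 118) / (1 << 26))
```

Coefficients with few set bits need fewer partial sums, which is one way a coefficient choice can reduce the adder count.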

The maximum error range of the proposed method is 0.05%, the smallest among the reported methods [6]. The proposed logarithmic converter operates at a maximum frequency of 231MHz, which is equivalent to a cycle time of 48 FO4. The conventional logarithmic converter [6] operates at 55MHz in 0.6um technology, which is equivalent to 61 FO4. The critical path and the maximum error of the logarithmic converter are thus reduced by 21% and 83.9%, respectively.
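The FO4 figures and the two reduction percentages can be reproduced with a few lines of arithmetic. The sketch assumes the common rule of thumb that one FO4 delay is about 0.5 ns per micron of feature size; the text does not state which FO4 model was used, so this is an assumption that happens to match the quoted numbers.

```python
def fo4_delay_ns(feature_um):
    """FO4 inverter delay from the 0.5 ns/um rule of thumb (an assumption here)."""
    return 0.5 * feature_um


def cycle_in_fo4(freq_mhz, feature_um):
    """Clock period expressed in FO4 delays of the given technology."""
    period_ns = 1000.0 / freq_mhz
    return period_ns / fo4_delay_ns(feature_um)


lau_fo4 = cycle_in_fo4(231.0, 0.18)       # proposed converter, 0.18um
ref_fo4 = cycle_in_fo4(55.0, 0.6)         # conventional converter [6], 0.6um
path_reduction = 1.0 - lau_fo4 / ref_fo4  # technology-independent speed gain
error_reduction = 1.0 - 0.05 / 0.31       # 0.31% -> 0.05% maximum error

print(round(lau_fo4), round(ref_fo4))
print(f"{path_reduction:.1%} {error_reduction:.1%}")
```

Expressing the cycle time in FO4 delays removes the effect of the process node, so the 21% figure compares the circuits rather than the technologies.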

## C. Exponential Converter Block

The fixed-point representation of the exponent is composed of an integer part k and a fractional part f, as given in equation (4). The fractional term  $2^{f'}$  can also be approximated by an 8-region piecewise linear method, using the same structure as the logarithmic converter's FApp in figure 4. Equation (5) and table-IV give the approximation and its coefficients.

$$2^{x} = 2^{k+f} = \begin{cases} 2^{k} \cdot 2^{f}, & x \ge 0 \\ 2^{k-1} \cdot 2^{1-|f|}, & x < 0 \end{cases} \;=\; 2^{k'+f'} = 2^{k'} \cdot 2^{f'}, \tag{4}$$

where

$$2^{f'} \cong \alpha \cdot f' + \beta. \tag{5}$$
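The case split of equation (4) can be sketched as follows: for a negative exponent, the integer part is lowered by one so that the remaining fraction is non-negative. The function names are hypothetical, truncation toward zero is assumed for separating k and f, and the exact power `2.0 ** f_prime` stands in for the piecewise linear approximation of equation (5).

```python
import math


def split_exponent(x):
    """Return (k', f') with x = k' + f' and a non-negative f', per equation (4)."""
    k = math.trunc(x)              # integer part, truncated toward zero (assumed)
    f = x - k                      # fractional part, same sign as x
    if x >= 0:
        return k, f
    return k - 1, 1.0 - abs(f)     # borrow one from k to make the fraction positive


def exp2_via_split(x):
    """Compute 2**x as 2**k' * 2**f'; the shift by k' is an exact scaling."""
    k_prime, f_prime = split_exponent(x)
    return math.ldexp(2.0 ** f_prime, k_prime)


print(split_exponent(-2.25), exp2_via_split(-2.25))
```

With this split, a single approximation table over a non-negative fraction serves both signs of x, and the factor 2^k' reduces to a shift.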

TABLE IV. COEFFICIENT OF EXPONENTIAL APPROXIMATION MODEL

| f          | α     | β       | f          | α     | β     |
|------------|-------|---------|------------|-------|-------|
| [0.0, 1/8) | 0.719 | 1.000   | [4/8, 5/8) | 1.031 | 0.902 |
| [1/8, 2/8) | 0.727 | 0.991   | [5/8, 6/8) | 1.117 | 0.844 |
| [2/8, 3/8) | 0.867 | 0.972   | [6/8, 7/8) | 1.211 | 0.773 |
| [3/8, 4/8) | 0.945 | 0.941   | [7/8, 8/8) | 1.320 | 0.288 |

The input precision of the exponential converter is Q6.26, and the output precision is Qm.n, which is decided by n. The maximum error range of the exponential converter is within 0.08%, which is better than that of conventional methods [7].

#### III. EVALUATION RESULT

The proposed LAU is verified in a 3D graphics software environment before chip implementation, and the calculated maximum error ranges of the LAU are shown in table-V. Based on our simulation results, the maximum error is less than 0.21%, which is within the tolerable range for the small-screen images of mobile systems. Figure 5 is the test scene, and the inset box shows a zoomed image for accuracy comparison. The test model consists of 1,700 polygons with lighting and texture mapping. The screen resolution is 512x512 and the texture size is 256x256. Various precisions Qm.n are used to evaluate the scene. Figure 5(a) shows the result of the normal fixed-point calculation, and figure 5(b) shows the result of the LAU. No noticeable difference between the two images can be seen with the naked eye.

## IV. CHIP IMPLEMENTATION AND MEASUREMENT RESULTS

The proposed LAU was implemented in a chip using Dongbu-Anam 1-poly 6-metal 0.18um CMOS technology to test its efficiency. A chip photograph is shown in figure 6. An RDX4 (Radix-4 reciprocal-square-root) unit is also implemented for performance comparison [1]. The gate counts of the LAU and RDX4 are 9k and 44k, respectively. The core size is 1.0mm x 1.0mm and the LAU size is 240um x 600um. The fabricated LAU operates at 231MHz with a 1.8V supply voltage, and its shmoo plot is given in figure 7. The maximum operating frequency of RDX4 is measured as 60MHz. The measured waveforms of the critical path are shown in figure 8. The latency and throughput of the LAU are measured to be 2 cycles and 1 cycle, respectively, while those of RDX4 are 10 cycles and 8 cycles, respectively. By using the LAU in fixed-point arithmetic, the performance is improved by 5 times compared to the complex RDX4 method. The power consumption of the LAU is 2.18mW for one-operand computation and 3.07mW for two-operand computation. The power consumption of RDX4 is 4.29mW, which is 1.97 times that of the LAU. Table VI compares the performance of the LAU with that of RDX4, and table VII summarizes the characteristics of the fabricated LAU chip.

TABLE V. MAXIMUM PERCENT ERROR RANGE OF LAU

Figure 5. Comparison of 3D Graphics Result: (a) Normal fixed-point calculation, (b) LAU calculation

Figure 6. Chip Photograph



Figure 8. Measurement Result of Test Chip

TABLE VI. THE COMPARISON OF LAU WITH RDX4

|                          | RDX4               | LAU               |
|--------------------------|--------------------|-------------------|
| Gate count               | 44k                | 9k                |
| Latency / Throughput     | 10-cycle / 8-cycle | 2-cycle / 1-cycle |
| Max. operating frequency | 60MHz              | 231MHz            |
| Power consumption        | 4.29mW             | 2.18mW            |
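The ratios quoted in the text follow directly from the entries of Table VI, as the short calculation below shows. The dictionary layout and key names are hypothetical; the numbers are those of the table.

```python
# Measured values from Table VI; "interval" is the throughput in cycles per result.
lau = {"latency_cycles": 2, "interval_cycles": 1, "freq_mhz": 231.0, "power_mw": 2.18}
rdx4 = {"latency_cycles": 10, "interval_cycles": 8, "freq_mhz": 60.0, "power_mw": 4.29}

# 5x fewer cycles from operand to result.
latency_ratio = rdx4["latency_cycles"] / lau["latency_cycles"]

# RDX4 draws about 1.97 times the power of the LAU (one-operand case).
power_ratio = rdx4["power_mw"] / lau["power_mw"]

print(latency_ratio, round(power_ratio, 2))
```

Both ratios are in cycles or milliwatts only; they do not yet account for the difference in clock frequency between the two units.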

TABLE VII. CHARACTERISTICS OF THE FABRICATED LAU CHIP

| Process Technology   | 1-poly 6-metal 0.18um CMOS technology                     |
|----------------------|-----------------------------------------------------------|
| Power Supply         | 1.8V                                                      |
| Operating Frequency  | 231MHz                                                    |
| Latency / Throughput | 2-cycle / 1-cycle                                         |
| Power Consumption    | 1-operand : 2.18mW<br>2-operand : 3.07mW                  |
| Gate Counts          | 9k                                                        |
| Area                 | die : 4.0mm x 4.0mm (pad limited)<br>core : 1.0mm x 1.0mm |

#### V. CONCLUSION

A 32-bit LAU is proposed for mobile 3D graphics systems. The arithmetic unit consists of a binary logarithmic converter, an adder, a shifter and a binary exponential converter. It uses 8-region piecewise linear interpolation algorithms and supports variable precision to compute complex functions quickly and accurately. The LAU is implemented in 0.18um CMOS technology with a gate count of 9k. The fabricated LAU runs at 231MHz, and performs multiplication, division, reciprocal, reciprocal-square-root and square operations in only 2 cycles, and the powering operation in 5 cycles. The computation errors are within 0.2%, which is tolerable for the small screens of mobile 3D graphics systems. It consumes 2.18mW for one-operand computation and 3.07mW for two-operand computation. The measured results clearly indicate that the fixed-point logarithmic arithmetic unit is suitable for the datapath of mobile 3D graphics systems.

#### REFERENCES

- [1] Ju-Ho Sohn, Ramchan Woo, Hoi-Jun Yoo, "A Programmable Vertex Shader with Fixed-Point SIMD Datapath for Low Power Wireless Applications", SIGGRAPH/Eurographics Workshop on Graphics Hardware 2004, Vol. 1, pp. 107-114, 2004
- [2] Kanako Yosida, Tadashi Sakamoto and Tomohiro Hase, "A 3D Graphics Library For 32-bit Microprocessors For Embedded Systems", IEEE Trans. on Consumer Electronics, Vol. 44, pp. 1107-1114, Aug. 1998
- [3] J.-A. Pineiro et al., "High-Speed Double-Precision Computation of Reciprocal, Division, Square Root, and Inverse Square Root", IEEE Trans. on Computers, Vol. 51, No. 12, December 2002
- [4] E. I. Chester, J. N. Coleman, "Matrix Engine for Signal Processing Applications using the Logarithmic Number System", Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 315-324, July 2002
- [5] J. N. Coleman et al., "Arithmetic on the European Logarithmic Microprocessor", IEEE Trans. on Computers, Vol. 49, No. 7, July 2000
- [6] Khalid H. Abed, "CMOS VLSI Implementation of a Low-Power Logarithmic Converter", IEEE Trans. on Computers, Vol. 52, No. 11, November 2003
- [7] Khalid H. Abed, Raymond E. Siferd, "VLSI Implementation of a Low-Power Antilogarithmic Converter", IEEE Trans. on Computers, Vol. 52, No. 9, September 2003