

# VLSI DESIGN OF LOW POWER MAC UNIT

<sup>1</sup>Nagaraj krishna naik, <sup>2</sup>Durga Bhavani.A
<sup>1</sup>8<sup>th</sup> sem, Computer science and engineering, BMSIT&M, Bangalore
<sup>2</sup>Assistant Professor, BMSIT&M, Bangalore
Email: <sup>1</sup>nagarajnaikk@gmail.com,<sup>2</sup>Durga842004@gmail.com

Abstract-Digital signal processing (DSP) algorithms typically require a large number of mathematical operations to be performed quickly and repeatedly on a set of data. Many DSP applications have latency constraints: for the system to work, the DSP operation must be completed within a fixed time, and deferred processing is not viable. Hence DSP requires a high-speed, high-throughput and low-power multiplier and accumulator (MAC) unit. This work presents the design and implementation of a MAC unit that saves power using a block enabling technique and gives high throughput. In a MAC unit, data flows from the input register to the output register through multiple blocks such as the adder, multiplier and register, so power is wasted by activity in blocks that are not yet needed. Hence block enabling is used to reduce delay and to save power. First, a one-bit adder, a four-bit multiplier and registers are designed, and their power consumption is verified using the Cadence Virtuoso tool in 180 nm CMOS technology. From these results a one-bit MAC unit is designed and its total power consumption is found. This result is compared with a normal design.

Keywords-low power, MAC, CMOS, block enable, multiplier

### I. INTRODUCTION

In digital signal processing, multiply-accumulate is a common step that computes the product of two numbers and adds that product to an accumulator. The operation itself is often called MAC or a MAC operation. For real-time signal processing, a high-speed, high-throughput multiplier and accumulator (MAC) is always key to achieving high performance. In an era of personal communication, low-power design also becomes a main design consideration, because the battery energy available in portable products limits the power consumption of the system. Therefore the main motivation of this work is to investigate pipelined multiplier/accumulator architectures and circuit design techniques that are suitable for implementing high-throughput signal processing algorithms while achieving low power consumption. A conventional MAC unit consists of a multiplier and an accumulator that contains the sum of the previous consecutive products. The function of the MAC unit is given by the following equation:

$\text{Output} = \sum_i A_i B_i$



Figure 1: Basic structure of MAC

In this work a 1-bit full adder, a register and a multiplier are designed using various design techniques, their power consumption is verified, and a low-power MAC is designed from them.

### II. MULTIPLIER AND ACCUMULATOR UNIT

The MAC is composed of an adder, a multiplier and an accumulator. The inputs to the MAC are fetched from memory and fed to the multiplier block, which performs the multiplication and passes the result to the adder; the adder accumulates the result and stores it in a memory location. This entire process is to be completed in a single clock cycle. Figure 2 shows the architecture of the MAC unit.

Figure 2: architecture of MAC unit



The design consists of an 8-bit modified array multiplier built from full adders and AND gates. The full adder and AND gate are designed using transmission gates (using map-entered logic). It also contains one 8-bit accumulator register to store the output and two 4-bit registers to hold the data fetched from memory. This MAC unit reduces standby power consumption and gives better system performance. The product Ai\*Bi is always fed into the accumulator and then added to the next product Ai\*Bi. This MAC unit can multiply and add to the previous product consecutively up to four times.
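The accumulate behaviour described above can be illustrated with a short behavioural sketch. It is not the authors' circuit: the function name `mac`, the default widths and the wrap-around on overflow are assumptions made only for illustration.

```python
def mac(pairs, operand_bits=4, acc_bits=8):
    """Accumulate the products A_i * B_i of unsigned operand pairs.

    Assumption: the accumulator wraps around when the sum no longer fits
    in acc_bits; the overflow flag records that this happened.
    """
    operand_mask = (1 << operand_bits) - 1
    acc_mask = (1 << acc_bits) - 1
    acc = 0
    overflow = False
    for a, b in pairs:
        # Multiplier stage: operands are truncated to the register width.
        product = (a & operand_mask) * (b & operand_mask)
        # Adder stage: add the new product to the stored sum.
        total = acc + product
        overflow = overflow or total > acc_mask
        # Accumulator stage: store the (possibly truncated) result.
        acc = total & acc_mask
    return acc, overflow


# Four consecutive multiply-accumulate steps: 15 + 14 + 16 + 9
result, overflow = mac([(3, 5), (2, 7), (4, 4), (1, 9)])
print(result, overflow)  # 54 False
```

The overflow flag makes it easy to experiment with how wide the accumulator has to be for a given number of accumulations.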

### Operation: $\text{Output} = \sum_i A_i B_i$

The MAC operation is the basis of many DSP algorithms, notably digital filtering. The term "digital filter" refers to an algorithm by which a digital signal or sequence of numbers is transformed into another sequence of numbers termed the output digital signal. The MAC speed applies to both finite impulse response (FIR) and infinite impulse response (IIR) filters. The complexity of the filter response dictates the number of MAC operations required per sample period. Digital filters involve signals in the digital domain (discrete-time signals) and are used extensively in applications such as digital image processing, pattern recognition and spectral analysis. In general FIR filters are preferred in lower-order solutions. Since they do not employ feedback, they exhibit a naturally bounded response. They are simple to implement and require one RAM location and one coefficient for each order.

For example, the output of an FIR filter is given by

$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$

where x(n) is the input to the filter, h(n) is the impulse response of the filter, N is the number of filter coefficients and y(n) is the output of the filter. The output of an FIR filter is simply a finite-length weighted sum of the present and previous inputs to the filter. Hence, to perform filtering through the above equation, the minimum requirement is to quickly multiply two values and add the result. To make this possible, a fast, dedicated, low-power hardware MAC is mandatory in digital signal processing.
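To make the link between the FIR sum and the MAC operation concrete, the following minimal sketch evaluates the standard convolution form y(n) = Σ h(k) x(n − k) with one multiply-accumulate per coefficient. The function name `fir_filter` and the convention that x(n) = 0 for n < 0 are assumptions for illustration.

```python
def fir_filter(x, h):
    """Compute y(n) = sum over k of h(k) * x(n - k), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator cleared at the start of each output sample
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one MAC operation
        y.append(acc)
    return y


# Three-tap filter with unit coefficients: a running sum of the last 3 inputs
print(fir_filter([1, 2, 3, 4], [1, 1, 1]))  # [1, 3, 6, 9]
```

Each output sample costs one MAC operation per coefficient, which is why the MAC speed bounds the achievable sample rate.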

### III. MODIFIED ARRAY MULTIPLIER

The design starts with an analysis of the elementary algorithm for multiplication by modified array multiplication. Figure 3 shows the algorithm for 4\*4-bit multiplication performed by modified array multiplication.

Figure 3: algorithm used in modified array multiplication.

A = A3A2A1A0; B = B3B2B1B0;

Then

| Row     | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|
| B0      |       |       |       |       | A3B0  | A2B0  | A1B0  | A0B0  |
| B1      |       |       |       | A3B1  | A2B1  | A1B1  | A0B1  |       |
| B2      |       |       | A3B2  | A2B2  | A1B2  | A0B2  |       |       |
| B3      |       | A3B3  | A2B3  | A1B3  | A0B3  |       |       |       |
| Product | P7    | P6    | P5    | P4    | P3    | P2    | P1    | P0    |

The multiplication is completed in four stages. Each stage uses full adders whose inputs are supplied by two-input AND gates. The last stage uses a combination of full adders and half adders.
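The partial-product scheme can be checked with a small bit-level sketch. Each term AiBj is formed with an AND and placed in column i + j, and the rows are then summed. The function names are illustrative, and the plain `sum` stands in for the adder stages rather than modelling them.

```python
def partial_products(a, b, bits=4):
    """Return one integer per row; row j holds A_i AND B_j in column i + j."""
    rows = []
    for j in range(bits):            # one row per multiplier bit B_j
        b_j = (b >> j) & 1
        row = 0
        for i in range(bits):        # one AND gate per multiplicand bit A_i
            a_i = (a >> i) & 1
            row |= (a_i & b_j) << (i + j)
        rows.append(row)
    return rows


def multiply(a, b, bits=4):
    """Sum the shifted partial-product rows to obtain the product."""
    return sum(partial_products(a, b, bits))


# A = 1011 (11), B = 1101 (13)
for row in partial_products(0b1011, 0b1101):
    print(format(row, "08b"))
print(multiply(0b1011, 0b1101))  # 143
```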

### C. BLOCK ENABLING TECHNIQUE

In any MAC unit, data flows from the input register to the output register through multiple stages such as the multiplier stage, the adder stage and the accumulator stage, as shown in Figure 1. Within the multiplier stage there are further multiple stages of addition. During each multiplication and addition, a block in the pipeline need not be on or enabled until the actual data arrives from the previous stage. In the block enabling technique, we first find the delay of each stage. Every block is then enabled only after the expected delay. For the entire duration until its inputs are available, a successive block is disabled, which saves power.

Figure 4: general block diagram of pipelined MAC with block enabling technique.
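As a rough illustration of the enabling schedule, the sketch below derives the time at which each stage would be enabled from the delays of the stages before it. The stage delays are hypothetical values chosen only for the example; they are not measurements from this work, and the function name `enable_times` is likewise an assumption.

```python
def enable_times(stage_delays):
    """Enable time of each stage = summed delay of all earlier stages."""
    times = []
    elapsed = 0.0
    for delay in stage_delays:
        times.append(elapsed)   # stage stays disabled until its inputs arrive
        elapsed += delay
    return times


# Hypothetical stage delays in nanoseconds (illustrative only)
stages = [("multiplier", 4.0), ("adder", 1.5), ("accumulator", 0.5)]
schedule = enable_times([delay for _, delay in stages])
for (name, _), start in zip(stages, schedule):
    print(f"{name}: enabled at {start} ns")
```

In this simple model the time a stage spends disabled is equal to its enable time, so later stages benefit most from the technique.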



## PIPELINED BLOCK ENABLED LOGIC

Figure 5 shows the three stages of the pipelined MAC with block enable logic; each block is enabled depending upon the delay of the individual blocks.

Figure 5: MAC with control logic.



The control logic enables the clock, power and logic pins of the block, thus saving power. Each of the blocks in the MAC unit has an enable signal to save power.

## **B. FULL ADDER DESIGN**

Successive addition in the multiplier unit is achieved using full adders. The sum and carry equations of the full adder are derived from its truth table. Using these equations, CMOS full adders are designed with various gate design techniques. On implementing these designs in the Cadence Virtuoso tool, we found that the adder designed using MUX logic consumes less power than the AOI design. Until the delay of the adder circuit has elapsed, the output of the adder is disabled using block enable to save power. Using this adder, a one-bit MAC is designed together with a one-bit register, an AND gate and control logic to enable the blocks.



Table 1: Full adder design comparison.

| Full adder design | Number of transistors | Power (W) |
|-------------------|-----------------------|-----------|
| AOI logic         | 32                    | 3.58E-6   |
| MUX logic         | 22                    | 0.1459E-9 |
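The paper does not give the gate-level schematic of the MUX-based adder, so the following is only a logic-level sketch of one common way to build a full adder from 2:1 multiplexers. The decomposition and the names `mux2` and `full_adder_mux` are assumptions, and the sketch says nothing about transistor count or power.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d0 when sel is 0 and d1 when sel is 1."""
    return d1 if sel else d0


def full_adder_mux(a, b, cin):
    """Full adder built only from 2:1 multiplexers and inverted inputs."""
    x = mux2(a, b, 1 - b)        # a XOR b
    s = mux2(x, cin, 1 - cin)    # sum = x XOR cin
    cout = mux2(x, a, cin)       # a == b: carry is a; otherwise carry is cin
    return s, cout


# Check against the arithmetic definition for all eight input combinations
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder_mux(a, b, cin)
            assert s + 2 * cout == a + b + cin
            print(a, b, cin, "->", s, cout)
```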

# ACCUMULATOR REGISTER:

Figure 6 shows the one-bit register file cell, which may be represented by a D flip-flop and two gates. Note that in addition to the clock signal, the cell has three inputs and one output: write select, read select and the D input, and the Q output. In this cell the D flip-flop stores the value of the input signal whenever write select is equal to 1. Similarly, whenever the read select signal is equal to 1, the D flip-flop passes its stored value to the output through a tristate buffer.

Figure 6: one bit register cell
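The behaviour of the register cell can be sketched as a small behavioural model. The class and method names are illustrative, and representing the high-impedance state of the tristate buffer as `None` is an assumption of this sketch.

```python
class RegisterCell:
    """One-bit register cell: D flip-flop with write select and tristate read."""

    def __init__(self):
        self.q = 0  # stored bit

    def clock(self, d, write_select):
        """On a clock edge, latch d only when write_select is 1."""
        if write_select:
            self.q = d & 1

    def read(self, read_select):
        """Drive the stored bit when read_select is 1, else high impedance (None)."""
        return self.q if read_select else None


cell = RegisterCell()
cell.clock(d=1, write_select=1)   # value is stored
cell.clock(d=0, write_select=0)   # write disabled: value is held
print(cell.read(read_select=1))   # 1
print(cell.read(read_select=0))   # None
```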



The objective of this work is to find the power consumption of each block and to find the area required. From the experimental results a low-power MAC is designed.

## AND GATE:

The blocks of the MAC are enabled or disabled using AND gates. An AND gate has a delay; the blocks connected to the output of an AND gate are disabled when there is no output from the AND gate and are enabled only when its output is available, thus reducing power. Hence this gate is designed using transmission gates, so that it uses fewer transistors and therefore consumes less power.

Table 2: Comparison of AND gate designs.

| AND design using  | Number of transistors | Power consumed |
|-------------------|-----------------------|----------------|
| NAND and inverter | 6                     | 2.401E-10      |
| Transmission gate | 2                     | 1.319E-11      |

## ONE BIT REGISTER:

The register is one of the basic units of the MAC. Because the register stores data, leakage current is possible, and that affects power dissipation. The clock connected to the register cell is also analyzed for its power consumption. The register cell is enabled with clock gating and the power is measured. We found that the register with enable consumes 4.029E-09 and the register without enable consumes 4.078E-09. Hence the register with enable consumes less power than the register without enable.

## HALF ADDER:

The basic adder is the half adder, which adds two bits and produces a sum and a carry. In the array multiplier the half adder is used to complete the multiplication operation. The inputs to the half adder are the sum and carry outputs of the full adders. The most straightforward approach to designing a half adder is with logic gates. A rather different half adder design uses transmission gates to form the XOR and AND gates. Using transmission gates reduces the number of transistors in the half adder design and thus reduces power.
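At the logic level the half adder reduces to one XOR and one AND, as the short sketch below shows. It models only the Boolean function, not the transmission-gate implementation, and the function name is illustrative.

```python
def half_adder(a, b):
    """Add two bits: sum is a XOR b, carry is a AND b."""
    return a ^ b, a & b


# Print the truth table: a, b -> sum, carry
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(a, b, "->", s, c)
```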

## MULTIPLIER(4×4):

Multiplier design starts with the elementary school algorithm for multiplication. In each step we multiply one digit of the multiplier by the full multiplicand. We add the result, shifted by the proper number of bits, to the partial product. When we run out of multiplier digits, we are done. Binary multiplication of two bits is performed by the AND function.

The elementary school multiplication algorithm suggests a logic and layout structure for the multiplier that is surprisingly well suited to VLSI implementation: the array multiplier. The structure of an array multiplier for unsigned numbers is shown in the figure. As when multiplying by hand, the partial products are formed in rows and accumulated in columns, with partial products shifted by the appropriate amount. Notice that only the last adder in the array has a carry chain. The earlier additions are performed by full adders, which reduce three one-bit inputs to two one-bit outputs. Only in the last stage are all the values accumulated with carries. As a result, relatively simple adders can be used for the early stages, with a faster adder reserved for the last stage.

For the last stage we used a ripple carry adder, which consists of N full adders with the carry output of each full adder connected to the carry input of the next full adder. The advantages of the array multiplier are that it has a regular structure and local interconnect.

Figure 7: 4×4 multiplier unit implemented
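The structure described in this section, rows of full adders that save their carries followed by a single ripple carry adder, can be sketched at the bit level as follows. This is a simplified model under stated assumptions: it places a full adder in every column of every row and does not reproduce the exact adder placement or the half adders of the implemented circuit. All function names are illustrative.

```python
def full_adder(a, b, cin):
    """Reduce three one-bit inputs to a sum bit and a carry bit."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists with chained full adders."""
    out, carry = [], 0
    for x, y in zip(x_bits, y_bits):
        s, carry = full_adder(x, y, carry)
        out.append(s)
    return out  # the final carry is always 0 here because the product fits


def array_multiply(a, b, bits=4):
    """Unsigned multiply using carry-save rows and one final carry chain."""
    width = 2 * bits
    sums = [0] * width      # carry-save sum vector
    carries = [0] * width   # carry-save carry vector
    for j in range(bits):
        # Row j of partial products: A_i AND B_j placed in column i + j
        row = [0] * width
        for i in range(bits):
            row[i + j] = ((a >> i) & 1) & ((b >> j) & 1)
        new_sums, new_carries = [0] * width, [0] * width
        for col in range(width):
            s, c = full_adder(sums[col], carries[col], row[col])
            new_sums[col] = s
            if col + 1 < width:
                new_carries[col + 1] = c  # saved for the next row, not rippled
        sums, carries = new_sums, new_carries
    # Only this last stage propagates carries along a chain
    product_bits = ripple_carry_add(sums, carries)
    return sum(bit << k for k, bit in enumerate(product_bits))


print(array_multiply(0b1011, 0b1101))  # 143
```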



## ONE BIT MAC:

After the multiplier is completed, the full one-bit MAC is designed using two four-bit registers receiving the inputs, a $4 \times 4$ multiplier, a 9-bit accumulator register, an 8-bit ripple carry adder and control logic for enabling. This gives the complete one-bit MAC unit. For real-time applications, however, the one-bit MAC is designed for 8-bit input data, and the design varies accordingly: two 8-bit input registers, an $8 \times 8$ multiplier unit, a 17-bit register, a 17-bit adder and an 18-bit accumulator register.

# **RESULT:**

| Unit       | Total number of transistors | Power consumption |
|------------|-----------------------------|-------------------|
| AND gate   | 2                           | 1.319E-11         |
| Full adder | 22                          | 0.1459E-9         |
| Multiplier | 278                         | 9.969E-3          |
| Register   | 34                          | 0.2804E-9         |

# **CONCLUSION:**

The MAC operation is the basis of DSP operations and is mainly used in digital filter design. The complexity of the filter response dictates the number of MAC operations required per sample period. Digital filters involve signals in the digital domain (discrete-time signals) and are used extensively in applications such as digital image processing, pattern recognition and spectral analysis. A $4 \times 4$ multiplier accumulator (MAC) design is presented in this work. A full adder circuit based on MUX logic is used in this design. Compared with the AOI full adder, the MUX-based full adder consumes less power and uses fewer transistors. Hence the MUX-based full adder is used in the MAC. The basic blocks of the MAC are analyzed separately for their performance. The block enabling technique selects each block one after the other to complete the operation, which reduces power consumption by avoiding unnecessary activity in blocks whose inputs have not yet arrived. The delay of each block is calculated, and the control logic is set to select each block in the MAC after the corresponding delay. The full custom design is carried out for the proposed work and verified using the Cadence Virtuoso tool.

## ACKNOWLEDGEMENT:

This paper is a team effort, and though it is impossible to thank all faculty members individually, we take this opportunity to express our gratitude to them. We express our gratitude and thanks to our guide Prof.Kotresh.Marali, whose valuable suggestions, technical expertise and constant encouragement made it possible to achieve this goal.

With great pleasure we acknowledge a deep sense of gratitude to our beloved head of the department (HOD) Prof.Dr.Vijaya.C for her inspiration, encouragement and extension of facilities of the department.

Lastly, we are also thankful to all teaching and non-teaching staff of the electronics and communication department, who directly or indirectly helped us while working on this paper.
