International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

A Hardware Intensive Approach for Efficient Implementation of Different FIR Filter Design Using Trapezoidal Rule for FPGA Platforms

Rajeshwari N. Sanakal
M.Tech student, Vemana Institute of Technology, VTU Belgaum, Bangalore, Karnataka, India

ABSTRACT Numerical techniques have long been used to compute an approximate solution of a definite integral. The traditional approaches have mostly been software oriented. However, with the current trend moving back towards hardware intensive processing, it is desirable to develop a hardware oriented solution that assesses the performance in terms of some realistic parameters such as speed, power and area. This project aims to exploit the one-to-one correspondence that exists between the Integration algorithms and the general FIR filters. Based on this correspondence a structure is developed that implements the Integration algorithm. Pipelined and parallel structures are developed and their effects on speed and power metrics are studied separately. It is shown that by these architectural modifications the data paths within the structure can be modified and the structure can be operated at higher throughput rates and/or with lower power consumption. Because of their ability to provide a high level of hardware programmability, FPGAs have been used as the implementation platform.

Keywords: FIR structure, Trapezoidal rule, Data broadcast structure, Fine grain pipelining

1. Introduction

In mathematics and engineering applications there sometimes arise situations where it is difficult to find an anti-derivative of an integrand. A frequently used approach for obtaining approximate solutions for such integrals is the numerical integration technique. In its basic form, the definite integral is approximated as:

Substituting n = 1 in equation (1) results in the Trapezoidal rule for numerical integration as:

The generalized Finite Impulse Response (FIR) equation is given by:
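For reference, the two relations being compared are commonly written in textbook form as follows. The symbols $a$, $b$, $f$, $h_k$, $x$, $y$ and $N$ are assumptions introduced here for illustration and may differ from the notation of equations (2) and (3).

```latex
% Trapezoidal rule on the interval [a, b] (standard textbook form)
\int_a^b f(x)\,dx \approx \frac{b-a}{2}\,\bigl[f(a) + f(b)\bigr]

% N-tap FIR filter with coefficients h_k, input x and output y
y(n) = \sum_{k=0}^{N-1} h_k\, x(n-k)
```

Both right-hand sides are weighted sums of samples, which is the property the mapping onto an FIR structure relies on.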

Here the filter coefficients are those needed to generate the necessary filtering response. Equations (2) and (3) suggest that there is a one-to-one correspondence between the Trapezoidal rule and the FIR equation. Thus a hardware structure for numerical integration can be obtained by mapping the Trapezoidal rule onto a generalized FIR structure.

There have been some hardware implementations of numerical integration using radix-2 encoding techniques. However, these solutions have long computational delays and large non-recurring engineering (NRE) costs. With FIR implementations, the complex arithmetic operations are implemented in terms of multiply and add operations only. The multiply operation is the main computational bottleneck in the FIR structure, since it requires a high computation time.

A variety of approaches have been pursued to speed up FIR filter structures. Distributed arithmetic (DA) has been used as an alternative to the conventional Multiply and Accumulate (MAC) operations. This, however, tends to yield only moderately fast structures because, in effect, bandwidth is traded off to save resources. Constant coefficient multiplication is another approach that has been used to design efficient structures; it is based on look-up tables and addition operations. However, the reliance on look-up tables has restricted its use in FIR implementations.

The conventional approaches have mainly focused on achieving a speed-up by modifying the discrete individual components of the FIR structure. It can, however, be shown that a further speed-up can be achieved by introducing architectural modifications at the system level. This involves exploiting the hidden concurrencies within the algorithm to be realized. Pipelining and parallel processing are the approaches used at the system level to exploit these concurrencies. A key issue here is determining the extent of parallelism that can be exploited within an algorithm. Computational complexity, data dependencies, communication bounds, filter lengths and finite arithmetic effects are among the factors that need to be considered before applying these modifications. Another key issue is the availability of a suitable underlying platform that can support hardware intensive processing. Current FPGAs provide hundreds of bit-parallel multipliers that can be used to develop different sorts of architectures. FPGAs are, therefore, being increasingly used for a variety of computationally intensive applications.

IJRIT International Journal of Research in Information Technology, Volume 3, Issue 5, May 2015, Pg.242-250

2. Design methodology & implementation of the proposed work

2.1 The basic FIR architecture:

Based on the one-to-one correspondence that exists between equations (2) and (3), a general FIR structure for the Trapezoidal rule is shown in Fig 2.1.
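The correspondence can be illustrated with a minimal Python sketch. It assumes the composite form of the Trapezoidal rule over equally spaced samples, so that the weights are step/2 at the two end points and step elsewhere; the function names and the test integrand are hypothetical and chosen only for illustration.

```python
import math


def trapezoid_coefficients(num_taps, step):
    """Composite trapezoidal weights: step/2 at both ends, step in between."""
    if num_taps < 2:
        raise ValueError("at least two samples are needed")
    coeffs = [step] * num_taps
    coeffs[0] = coeffs[-1] = step / 2.0
    return coeffs


def fir_output(coeffs, samples):
    """One FIR output value: a sum of coefficient-times-sample products."""
    return sum(c * s for c, s in zip(coeffs, samples))


def integrate(func, a, b, num_taps):
    """Approximate the integral of func over [a, b] with an FIR-style weighted sum."""
    step = (b - a) / (num_taps - 1)
    samples = [func(a + i * step) for i in range(num_taps)]
    return fir_output(trapezoid_coefficients(num_taps, step), samples)


if __name__ == "__main__":
    # Assumed example: integral of sin(x) over [0, pi], whose exact value is 2.
    print(integrate(math.sin, 0.0, math.pi, 8))
```

Because the trapezoidal weights are symmetric, the order in which the samples meet the coefficients does not matter, so the same weights can be loaded directly as filter taps.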

Fig 2.1: FIR structure for trapezoidal rule

The critical path of this structure is limited by one multiplication and N addition operations, where N is the number of taps in the FIR filter. If $T_M$ is the computation time for one multiplication operation and $T_A$ is the computation time for one addition operation, then the critical path computation time $T_C$ for this structure is:

$T_C = T_M + N T_A$

The sampling frequency, or throughput, of the structure is thus approximately given by:

$f_{sampling} = 1 / (T_M + N T_A)$

The critical path, and thus the sampling frequency, is a function of the filter length N: as the filter length increases, the throughput decreases. The critical path of the original structure can be reduced by transposing the structure, as discussed next.
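The dependence of the throughput on the filter length can be checked numerically with a short sketch. The multiplier and adder delays below are hypothetical values chosen for illustration, not measurements from this work.

```python
def direct_form_timing(num_taps, t_mult_ns, t_add_ns):
    """Return (critical path in ns, sampling frequency in MHz) for the basic structure."""
    t_critical_ns = t_mult_ns + num_taps * t_add_ns  # T_C = T_M + N * T_A
    f_sampling_mhz = 1000.0 / t_critical_ns          # 1 / T_C, converted from ns to MHz
    return t_critical_ns, f_sampling_mhz


# Hypothetical delays, not values taken from the paper.
T_MULT_NS = 3.0
T_ADD_NS = 1.0

if __name__ == "__main__":
    for taps in (4, 8, 16):
        t_c, f_s = direct_form_timing(taps, T_MULT_NS, T_ADD_NS)
        print(f"N={taps:2d}: T_C={t_c:.1f} ns, f_sampling={f_s:.1f} MHz")
```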

2.2 Transposed FIR structures:

The transposed structure is obtained by reversing the direction of the data flow on each link and interchanging the input and output nodes, as shown in Fig 2.2.

Fig 2.2: Transposed FIR structure for trapezoidal rule.

The critical path computation time $T_{CT}$ of the transposed structure is given by:

$T_{CT} = T_M + T_A$

$f_{sample} = 1 / (T_M + T_A)$

The critical path, and thus the sampling frequency, of the transposed structure is independent of the filter length and is restricted by the computation time of one addition and one multiplication operation. Since the data is now broadcast to all the multipliers simultaneously, the transposed structure is also known as the Data Broadcast structure. The critical path of this structure can be reduced further by pipelining, as described below.
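A behavioural sketch in Python can make the data broadcast idea concrete. It is a software model under assumed names and example coefficients, not the HDL used in this work: both functions compute the same outputs, but in the transposed form each output needs only one product and one addition per clock cycle.

```python
def direct_form_fir(coeffs, samples):
    """Basic structure: a tapped delay line on the input, then one long adder chain."""
    delay_line = [0.0] * len(coeffs)
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        outputs.append(sum(h * d for h, d in zip(coeffs, delay_line)))
    return outputs


def transposed_fir(coeffs, samples):
    """Data broadcast structure: every multiplier sees the current input sample."""
    num_taps = len(coeffs)
    regs = [0.0] * (num_taps - 1)  # registers sitting between the adders
    outputs = []
    for x in samples:
        products = [h * x for h in coeffs]  # input broadcast to all taps
        carry_in = regs[0] if regs else 0.0
        outputs.append(products[0] + carry_in)  # one multiply and one add
        # Each register latches its tap product plus the upstream register value.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < num_taps - 1 else 0.0)
            for i in range(num_taps - 1)
        ]
    return outputs


if __name__ == "__main__":
    taps = [0.5, 1.0, 1.0, 0.5]  # assumed trapezoidal weights with unit step
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(direct_form_fir(taps, data))
    print(transposed_fir(taps, data))
```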

2.3 Pipelined and Fine grain FIR structure:

Pipelining is done by placing latches along the feed-forward cut-sets of the transposed structure. The pipelined structure is shown in Fig 2.3. Pipelining reduces the critical path to $T_M$ at the expense of one additional clock cycle of latency. For the pipelined structure the critical path is given by:

$T_{Cpip} = T_M$

$f_{sampling} = 1 / T_M$
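The effect of the latches can be sketched behaviourally. The model below assumes that the cut-set is taken across all multiplier outputs, which is one possible feed-forward cut-set and not necessarily the one used in this work; the function names and data are hypothetical. The output sequence is unchanged apart from a delay of one clock cycle.

```python
def fir_reference(coeffs, samples):
    """Plain FIR sum, used only as a reference for comparison."""
    return [
        sum(h * samples[n - k] for k, h in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


def pipelined_transposed_fir(coeffs, samples):
    """Transposed FIR with latches between the multipliers and the adder chain."""
    num_taps = len(coeffs)
    product_latches = [0.0] * num_taps  # pipeline latches on the assumed cut-set
    sum_regs = [0.0] * (num_taps - 1)   # registers of the transposed adder chain
    outputs = []
    for x in samples:
        # Adder stage: works on the products latched in the previous cycle.
        carry_in = sum_regs[0] if sum_regs else 0.0
        outputs.append(product_latches[0] + carry_in)
        sum_regs = [
            product_latches[i + 1]
            + (sum_regs[i + 1] if i + 1 < num_taps - 1 else 0.0)
            for i in range(num_taps - 1)
        ]
        # Multiplier stage: runs in the same cycle and is latched for the next one.
        product_latches = [h * x for h in coeffs]
    return outputs


if __name__ == "__main__":
    taps = [0.5, 1.0, 1.0, 0.5]
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    # One extra zero sample flushes the last result out of the pipeline.
    print(pipelined_transposed_fir(taps, data + [0.0]))
    print(fir_reference(taps, data))
```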

Fig 2.3: Pipelined structure for trapezoidal rule.

The critical path of the pipelined structure is limited by the computation time of the multiplier unit. For large input word lengths, the multiplier computation time can be significant, which limits the throughput of the overall structure. It is thus desirable to break the multiplication unit into two smaller multiplication units. A pipeline register is then introduced between the two smaller units to increase the sampling rate, at the expense of increased latency. This type of structure is known as the fine-grain pipelined structure and is shown in Fig 2.3.1.

Fig 2.3.1: Fine grain pipelined structure for trapezoidal rule.

The critical path for the fine-grain pipelined structure is given by:

$T_{Cpip} = T_M$

$f_{sampling} = 1 / T_M$
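One way to picture the split of the multiplier is sketched below. It assumes unsigned operands and a split of one operand into a low and a high bit field, each handled by a smaller multiplier; how the multiplier is actually partitioned in this work is not stated, so the function name, the 4-bit split and the staging are illustrative assumptions only.

```python
def fine_grain_multiply(a, b, low_bits=4):
    """Multiply two non-negative integers using two smaller partial multiplications."""
    mask = (1 << low_bits) - 1
    b_low, b_high = b & mask, b >> low_bits

    # Stage 1: the first smaller multiplier handles the low bits of b.
    partial_low = a * b_low
    pipeline_register = (partial_low, a, b_high)  # models the latch between stages

    # Stage 2: the second smaller multiplier handles the high bits of b.
    latched_low, latched_a, latched_high = pipeline_register
    partial_high = latched_a * latched_high
    return latched_low + (partial_high << low_bits)


if __name__ == "__main__":
    print(fine_grain_multiply(13, 200))
```

Each stage now contains a narrower multiplication than the original unit, which is the reason a register between the stages can shorten the longest combinational path.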

2.4 Parallel structures:

Pipelining and parallel processing are duals of each other, and if a structure can be pipelined it can also be processed in parallel. The single-input single-output (SISO) system representing the Trapezoidal rule is given by the equation:

To obtain a parallel structure, this SISO system needs to be converted to a multiple-input multiple-output (MIMO) system. For example, the MIMO equations representing the 4-parallel structure of the above equation are:

Note that each delay in the 4-parallel structure is a block delay of four clock cycles. Parallel processing does not change the critical path of the structure; however, since four samples are processed in a single clock cycle, the overall throughput rate is four times the original. The drawback of parallel structures is that a lot of on-chip resources are required because the hardware is duplicated. However, since the implementation is targeted at FPGA devices, ample underlying hardware resources are available, and this logic can be used efficiently so that area is no longer a major concern. For a 4-parallel system, therefore:

$T_{Cpar} = T_{Cpip}$

$f_{par\,sampling} = 4 \times f_{pip\,sampling}$

The corresponding 4-parallel structure for a filter of order 4 is shown in Fig 2.4.


Fig 2.4: 4-parallel structure for trapezoidal rule
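The block processing idea can be sketched in software. The model assumes a generic FIR relation in which the four outputs of block k are y(4k + j) for j = 0 to 3, each a weighted sum of the inputs x(4k + j - i); this generic form, the function names and the data are assumptions for illustration. The inner loop is sequential here, whereas in hardware the four outputs would be produced by duplicated logic in the same clock cycle.

```python
def fir_serial(coeffs, samples):
    """SISO reference: one output per clock cycle."""
    return [
        sum(h * samples[n - i] for i, h in enumerate(coeffs) if n - i >= 0)
        for n in range(len(samples))
    ]


def fir_4_parallel(coeffs, samples):
    """MIMO model: each block clock consumes four inputs and yields four outputs."""
    block = 4
    padded = list(samples) + [0.0] * (-len(samples) % block)  # fill the last block
    outputs = []
    for k in range(len(padded) // block):  # one iteration models one block clock
        base = block * k
        for j in range(block):  # four outputs, concurrent in hardware
            n = base + j
            outputs.append(
                sum(h * padded[n - i] for i, h in enumerate(coeffs) if n - i >= 0)
            )
    return outputs[: len(samples)]


if __name__ == "__main__":
    taps = [0.5, 1.0, 1.0, 0.5]
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(fir_4_parallel(taps, data))
    print(fir_serial(taps, data))
```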

3. Results and analysis

3.1 Synthesis Results

This section gives the device utilization summary generated after synthesizing the various structures using HDL Designer in the Xilinx simulator.

3.1.1 Analysis and synthesis summary of throughput and latency of different 8-tap FIR filter designs for the Trapezoidal rule

Table 3.1.1: Throughput and latency comparisons for different realizations of the trapezoidal rule.

| Structure | Throughput (MHz) | Latency (No. of clock cycles) | Clock period (ns) |
|---|---|---|---|
| Basic FIR structure | 1065.12 | 1 | 15.02 |
| Transposed FIR structure | 4390.72 | 1 | 3.64 |
| Pipelined FIR structure | 4292.96 | 2 | 3.72 |
| Fine grain pipelined FIR structure | 4353.12 | 3 | 3.67 |
| 4-parallel FIR structure | 4549.60 | 0 | 4.08 |


3.1.2 Analysis and synthesis summary of throughput and latency of different 4-tap FIR filter designs for the Trapezoidal rule:

Table 3.1.2: Throughput and latency comparisons for different realizations of the trapezoidal rule.

| Structure | Throughput (MHz) | Latency (No. of clock cycles) | Clock period (ns) |
|---|---|---|---|
| Basic FIR structure | 1057.20 | 1 | 7.56 |
| Transposed FIR structure | 2290.88 | 0 | 3.49 |
| Pipelined FIR structure | 2343.92 | 2 | 3.41 |
| Fine grain pipelined FIR structure | 2379.84 | 3 | 3.36 |
| 4-parallel FIR structure | 2382.48 | 0 | 3.35 |

3.1.3 Analysis and synthesis summary of area of different 8-tap FIR filter designs for the Trapezoidal rule:

Table 3.1.3: Area comparisons for different realizations.

| Structure | No. of LUTs | No. of occupied slices | No. of IOBs |
|---|---|---|---|
| Basic FIR structure | 225 | 126 | 26 |
| Transposed FIR structure | 102 | 62 | 26 |
| Pipelined FIR structure | 102 | 91 | 26 |
| Fine grain pipelined FIR structure | 97 | 141 | 26 |
| 4-parallel FIR structure | 164 | 487 | 98 |

3.1.4 Analysis and synthesis summary of area of different 4-tap FIR filter designs for the Trapezoidal rule:

Table 3.1.4: Area comparisons for different realizations.

| Structure | No. of LUTs | No. of occupied slices | No. of IOBs |
|---|---|---|---|
| Basic FIR structure | 42 | 25 | 14 |
| Transposed FIR structure | 33 | 23 | 14 |
| Pipelined FIR structure | 38 | 29 | 14 |
| Fine grain pipelined FIR structure | 35 | 37 | 14 |
| 4-parallel FIR structure | 107 | 137 | 50 |


3.2 Comparison

This section compares the throughput and area of the different filter designs for different filter orders for the Trapezoidal rule.

3.2.1 Comparison of different filter designs based on throughput.

Table 3.2.1: Comparison based on throughput for different filter designs.

| Structure | 8-tap filter throughput | 4-tap filter throughput |
|---|---|---|
| Basic FIR structure | 1065.12 | 1057.20 |
| Transposed FIR structure | 4390.72 | 2290.88 |
| Pipelined FIR structure | 4292.96 | 2343.92 |
| Fine grain pipelined FIR structure | 4353.12 | 2379.84 |
| Parallel FIR structure | 4549.60 | 2382.48 |

[Chart: throughput (vertical axis, 0 to 5000) against filter order (4 and 8 taps) for the Basic, Transposed, Pipelined, Fine grain pipelined and 4-Parallel structures.]

Fig 3.2.1: Throughput variation for different filter orders.
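The relative gains implied by Table 3.2.1 can be worked out with a few lines of Python. The throughput figures are copied from the table; the dictionary layout and function name are assumptions made for this sketch.

```python
# Throughput values from Table 3.2.1 as (8-tap, 4-tap) pairs.
THROUGHPUT = {
    "Basic": (1065.12, 1057.20),
    "Transposed": (4390.72, 2290.88),
    "Pipelined": (4292.96, 2343.92),
    "Fine grain pipelined": (4353.12, 2379.84),
    "4-Parallel": (4549.60, 2382.48),
}


def speedup_over_basic(table, column):
    """Ratio of each structure's throughput to the basic structure's throughput."""
    base = table["Basic"][column]
    return {name: row[column] / base for name, row in table.items()}


if __name__ == "__main__":
    for label, column in (("8-tap", 0), ("4-tap", 1)):
        for name, ratio in speedup_over_basic(THROUGHPUT, column).items():
            print(f"{label} {name}: {ratio:.2f}x")
```

On these figures every modified structure is roughly four times faster than the basic structure for the 8-tap filter and roughly twice as fast for the 4-tap filter.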

3.2.2 Comparison of different filter designs based on area.

Table 3.2.2: Comparison based on area for different filter designs.

| Structure | 8-tap filter area | 4-tap filter area |
|---|---|---|
| Basic FIR structure | 126 | 25 |
| Transposed FIR structure | 62 | 23 |
| Pipelined FIR structure | 91 | 29 |
| Fine grain pipelined FIR structure | 141 | 37 |
| Parallel FIR structure | 487 | 137 |


[Chart: area (vertical axis, 0 to 600) against filter order (4 and 8 taps) for the Basic, Transposed, Pipelined, Fine grain pipelined and Parallel structures.]

Fig 3.2.2: Area comparison for different filter orders.
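Throughput and area can also be combined into a single figure of merit. The sketch below divides the throughput values of Table 3.2.1 by the area values of Table 3.2.2; this throughput-per-area ratio is an illustrative metric introduced here, not one reported in this work, and the names used are hypothetical.

```python
# (throughput, area) pairs copied from Tables 3.2.1 and 3.2.2.
EIGHT_TAP = {
    "Basic": (1065.12, 126),
    "Transposed": (4390.72, 62),
    "Pipelined": (4292.96, 91),
    "Fine grain pipelined": (4353.12, 141),
    "Parallel": (4549.60, 487),
}
FOUR_TAP = {
    "Basic": (1057.20, 25),
    "Transposed": (2290.88, 23),
    "Pipelined": (2343.92, 29),
    "Fine grain pipelined": (2379.84, 37),
    "Parallel": (2382.48, 137),
}


def throughput_per_area(table):
    """Throughput divided by area for each structure (illustrative metric)."""
    return {name: throughput / area for name, (throughput, area) in table.items()}


if __name__ == "__main__":
    for label, table in (("8-tap", EIGHT_TAP), ("4-tap", FOUR_TAP)):
        for name, value in throughput_per_area(table).items():
            print(f"{label} {name}: {value:.1f}")
```

By this measure the transposed structure makes the most economical use of area for both filter orders, while the parallel structure pays for its highest throughput with the largest area.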

4. Conclusion

This project implemented the trapezoidal rule for numerical integration by mapping the integration algorithm onto FIR structures. The hardware solution presented is based on modifications that can be carried out at the system level. The analysis and experimental results indicate that an improvement in performance is achieved by introducing architectural modifications such as pipelining and parallel processing.
