IEICE TRANS. FUNDAMENTALS, VOL.E88–A, NO.12 DECEMBER 2005

3523

PAPER

Special Section on VLSI Design and CAD Algorithms

A VLSI Array Processing Oriented Fast Fourier Transform Algorithm and Hardware Implementation

Zhenyu LIU†a), Nonmember, Yang SONG††b), Student Member, Takeshi IKENAGA††c), Member, and Satoshi GOTO††d), Fellow

SUMMARY    Many parallel Fast Fourier Transform (FFT) algorithms adopt a multi-stage architecture to increase performance. However, the data permutation between stages consumes considerable memory and processing time. An FFT array-processing mapping algorithm is proposed in this paper to overcome this demerit. In this algorithm, an arbitrary number 2^k of butterfly units (BUs) can be scheduled to work in parallel on n = 2^s data (k = 0, 1, ..., s − 1). Because no inter-stage data transfer is required, memory consumption and system latency are both greatly reduced. Moreover, as the number of BUs increases, not only does the throughput increase linearly, the system latency also decreases linearly. This array-processing-oriented architecture provides a flexible tradeoff between hardware cost and system performance. In theory, the system latency is (s × 2^{s−k}) × tclk and the throughput is n/(s × 2^{s−k} × tclk), where tclk is the system clock period. Based on this mapping algorithm, several 18-bit word-length 1024-point FFT processors implemented in TSMC 0.18 µm CMOS technology are given to demonstrate its scalability and high performance. The core area of the 4-BU design is 2.991 × 1.121 mm² and the clock frequency is 326 MHz under typical conditions (1.8 V, 25 °C). This processor completes a 1024-point FFT calculation in 7.839 µs.
key words: fast Fourier transform (FFT), array processing, Singleton algorithm

1.  Introduction

FFT is widely used in digital signal processing. Related research points out that the FFT algorithm is both computation and communication intensive. As a consequence, much research on FFT algorithms has been conducted. The conventional approach is a software implementation of the FFT algorithm on general-purpose processors [1]–[3]. However, in embedded or mobile applications, an application-specific approach is more feasible. In this approach, parallel and pipelined architectures are widely adopted to increase performance. The cascade FFT algorithm is one useful approach [4], [5]. For an n-point FFT, this algorithm typically requires log2 n processors. To obtain higher performance, high-radix cascade algorithms are applied in some designs, such as radix-4

Manuscript received March 14, 2005. Manuscript revised June 14, 2005. Final manuscript received August 1, 2005.
† The author is with the Kitakyushu Foundation for the Advancement of Industry Science and Technology, Kitakyushu-shi, 808-0135 Japan.
†† The authors are with IPS, Waseda University, Kitakyushu-shi, 808-0135 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
d) E-mail: [email protected]
DOI: 10.1093/ietfec/e88–a.12.3523

[6] and radix-8 [7] cascade algorithms. Each processor computes one stage of the FFT. This architecture has a high system throughput of 1/tclk data per second, where tclk is the system clock period. However, it has some demerits. First, the system processing latency cannot be improved, and this latency increases linearly with the data length n. Second, the cascade FFT lacks scalability: its performance and hardware cost are fixed by the FFT radix and the data length. Consequently, it is hard for designers to trade off hardware cost against system performance. Some designs [8] apply multiple processors within one stage to reduce system latency, but bus contention prevents these processors from reaching full utilization. Another approach decomposes a long FFT into several short FFTs to achieve parallel processing [9]. Based on this algorithm, Shenhav [10] provides a parallel pipelined FFT processor with multiple vector digital signal processing (DSP) processors. Jones and Sorensen [11] implement this algorithm on bus-oriented multiprocessor systems. Ma [12] provides an algorithm that simplifies the address generation of the twiddle factors and reduces the number of twiddle factors to the minimum. The demerit of this family of algorithms still lies in the data transfer between stages. First, dedicated memories are required for the inter-stage data transfer. Second, the I/O operations performed by each processor greatly reduce its performance. Third, bus contention among multiple processors makes the system performance even worse. Many real-time applications, such as radar signal processing, require not only high throughput but also short processing latency. Due to the data transfer delay between successive stages, traditional architectures cannot decrease the processing latency effectively.
There are three ways to overcome this: (i) raise the system clock frequency; (ii) find parallel architectures that widen the data path; (iii) eliminate the inter-stage data transfer. All of these aims can be achieved by adopting an array processing architecture, in which all processors are scheduled to work in parallel at full utilization and all data communication between processing units goes through local connections, free of bus contention. These characteristics make it possible to obtain both wider data paths and higher clock speed. A mapping algorithm is provided in this paper to implement such an array-processing-oriented FFT architecture. The rest of this paper is organized as follows. In Sect. 2,

Copyright © 2005 The Institute of Electronics, Information and Communication Engineers


the FFT mapping algorithm is proposed. Based on this mapping algorithm, a 1024-point 4-BU FFT processor is implemented in Sect. 3. The scalability analysis and performance comparison of this architecture are presented in Sect. 4. Conclusions are given in Sect. 5.

2.  Array-Processing Mapping Algorithm

In 1965 Cooley and Tukey gave a method for computing radix-2 transforms of arbitrarily large size [13]. Based on their work, Singleton [14] provided a method that gives every two-by-two computation a uniform structure. If the input data are originally in reverse binary order, the two-by-two transform in the kth stage is as follows:

    y_j = x_{2j} + x_{2j+1} · exp(−iπ(j ÷ 2^{s−k})/2^{k−1})      (1)
    y_{j+n/2} = x_{2j} − x_{2j+1} · exp(−iπ(j ÷ 2^{s−k})/2^{k−1})    (2)

where k = 1, 2, ..., s; j = 0, 1, ..., n/2 − 1; and ÷ denotes integer division without remainder. It follows from Eq. (1) and Eq. (2) that successive stages have the same structure and differ only in the sequence of twiddle factors. The signal flow graph (SFG) of the n = 8 Singleton algorithm is shown in Fig. 1, where w^l = exp(−i2πl/8), l = 0, 1, 2, 3. Two vectors (i and j) are defined there and will be used in the array mapping algorithm. The Singleton algorithm has the following characteristics: (i) the two-by-two transforms within one stage are independent and can therefore be processed in parallel; (ii) every two-by-two transform has the uniform structure. Obviously, the Singleton algorithm can be implemented directly with the structure in Fig. 1. However, that architecture needs (n/2) × log2 n BUs; when n is large, the hardware cost is unacceptable.
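As a behavioral check of Eqs. (1) and (2), the stage recurrence can be sketched in plain Python (a software model only, not the hardware): each of the s stages applies the identical two-by-two structure, and with bit-reversed input the result emerges in natural order.

```python
import cmath

def singleton_fft(x):
    """Radix-2 Singleton FFT per Eqs. (1)-(2).

    `x` must already be in reverse-binary (bit-reversed) order; every one
    of the s = log2(n) stages then has the identical two-by-two structure,
    and the result comes out in natural order.
    """
    n = len(x)
    s = n.bit_length() - 1
    assert 1 << s == n, "length must be a power of two"
    for k in range(1, s + 1):
        y = [0j] * n
        for j in range(n // 2):
            # twiddle factor exp(-i*pi*(j // 2^(s-k)) / 2^(k-1))
            w = cmath.exp(-1j * cmath.pi * (j >> (s - k)) / (1 << (k - 1)))
            y[j] = x[2 * j] + x[2 * j + 1] * w
            y[j + n // 2] = x[2 * j] - x[2 * j + 1] * w
        x = y
    return x
```

For n = 8 the stage twiddle sequences produced by this loop are [w^0, w^0, w^0, w^0], [w^0, w^0, w^2, w^2] and [w^0, w^1, w^2, w^3], matching the SFG of Fig. 1.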

Using property (ii) of the Singleton algorithm, the 2-tuple array can be reformed into a 1-tuple array structure. Applying the mapping algorithm provided in [15], [16], the projection vector p = [1, 0] and the schedule vector s = [1, 0] are chosen. After mapping, this FFT array architecture needs n/2 BUs and n storage units. The ith BU fetches its operands from the 2ith and (2i + 1)th storage units and stores its results to the ith and (i + n/2)th storage units. For example, for n = 8 the 2-tuple architecture of Fig. 1 is mapped to the 1-tuple structure shown in Fig. 2. The four BUs (BU0–BU3) process the three transform stages. The twiddle factors change from stage to stage: [coef0, coef1, coef2, coef3] equals [w^0, w^0, w^0, w^0], [w^0, w^0, w^2, w^2] and [w^0, w^1, w^2, w^3] in the 1st, 2nd and 3rd stages, respectively. This mapping reduces the number of BUs from (n/2) × log2 n to n/2. But when n is large enough, for instance n ≥ 1024, which is very common in modern digital signal processing, such a hardware cost is still unacceptable. Because every two-by-two transform in Fig. 2 has the uniform structure, a further mapping can be performed along direction j. Before this j-direction mapping, the storage elements must be duplicated and a ping-pong strategy applied to prevent storage contention. This structure is illustrated in Fig. 3. We now have n/2 BUs and 2n (n = 2^s) storage units arranged in two rows. In the first stage, the BUs read source data from row 0 and store results in row 1; in the second stage, row 1 is the source and row 0 the destination; and so on. Starting from this new 1-tuple architecture, we apply the mapping functions below to obtain the target array processing structure, which is composed of 2^k BUs and 2^{k+1} block RAMs (each block RAM holding 2^{s−k} entries).
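The index arithmetic of this mapping (logical BU j to physical BU j ÷ 2^{s−k−1}; logical storage unit i to block RAM i ÷ 2^{s−k} at address REM(i ÷ 2^{s−k}), as the two mapping functions state) can be transcribed directly; a sketch in which the function names are mine:

```python
def map_bu(j, s, k):
    """Mapping function (i): logical BU j -> physical BU index."""
    return j >> (s - k - 1)                        # j // 2**(s-k-1)

def map_storage(i, s, k):
    """Mapping function (ii): logical storage unit i ->
    (block RAM index, address inside that block RAM)."""
    return i >> (s - k), i & ((1 << (s - k)) - 1)  # (i // 2**(s-k), i % 2**(s-k))
```

With n = 8 (s = 3) and k = 1 this reproduces the 2-BU example: logical BUs 0 and 1 collapse onto physical BU0, logical BUs 2 and 3 onto BU1, storage units S0–S3 fill one block RAM and S4–S7 the other.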
The processor array obtained from this mapping is called the logical processor array, to distinguish it from the physical processor array (the target machine).

(i) The jth BU (j = 0, 1, 2, ..., n/2 − 1) in the logical processor array is mapped to the (j ÷ 2^{s−k−1})th BU in the physical processor array;
(ii) The ith storage unit (i = 0, 1, 2, ..., n − 1) in the logical processor array is mapped to the (i ÷ 2^{s−k})th block RAM in the physical processor array, at address REM(i ÷ 2^{s−k}) inside that block RAM, where ÷ denotes integer division without remainder and REM() denotes the remainder of the integer division.

Fig. 1    SFG of 8-point Singleton FFT.

Fig. 2    SFG of 1-tuple 8-point FFT.

Fig. 3    SFG of 1-tuple 8-point FFT with duplicated storage units.

After this mapping, the target physical processor array has the following properties:

(i) The lth BU accesses its source data from the lth block RAM.
Proof: The (l × 2^{s−k−1} + h)th BUs (h = 0, 1, 2, ..., 2^{s−k−1} − 1) in the logical processor array are mapped to the lth BU in the physical processor array. In the logical processor array, the (l × 2^{s−k−1} + h)th BU fetches its source data from the (2 × (l × 2^{s−k−1} + h))th and (2 × (l × 2^{s−k−1} + h) + 1)th storage units. By mapping function (ii), these storage units are mapped to the lth block RAM, at the addresses given by Eq. (3) and Eq. (4), respectively:

    REM((2 × (l × 2^{s−k−1} + h)) ÷ 2^{s−k}) = 2h            (3)
    REM((2 × (l × 2^{s−k−1} + h) + 1) ÷ 2^{s−k}) = 2h + 1    (4)

(ii) Adjacent odd and even BUs in the physical processor array have the same destination block RAMs. If the two adjacent BUs are denoted the 2tth and the (2t + 1)th, their destinations are the tth and the (t + 2^{k−1})th block RAMs.
Proof: The BUs in the logical processor array that are mapped to the 2tth physical BU are numbered

    2t × 2^{s−k−1} + h                                       (5)

so their destination storage units in the logical processor array are

    2t × 2^{s−k−1} + h                                       (6)

and

    2t × 2^{s−k−1} + 2^{s−1} + h                             (7)

Applying mapping function (ii), the storage units in Eq. (6) are mapped to the tth block RAM,

    (2t × 2^{s−k−1} + h) ÷ 2^{s−k} = t                       (8)

and their addresses in the tth block RAM are

    REM((2t × 2^{s−k−1} + h) ÷ 2^{s−k}) = h                  (9)

The storage units in Eq. (7) are mapped to the (t + 2^{k−1})th block RAM,

    (2t × 2^{s−k−1} + 2^{s−1} + h) ÷ 2^{s−k} = t + 2^{k−1}   (10)

and their addresses in that block RAM are

    REM((2t × 2^{s−k−1} + 2^{s−1} + h) ÷ 2^{s−k}) = h        (11)

In the same way, it can be derived that the (2t + 1)th physical BU writes its results to the (2^{s−k−1} + h)th entries of the tth and the (t + 2^{k−1})th block RAMs. In other words, the 2tth BU writes to the first half of the tth and (t + 2^{k−1})th block RAMs, and the (2t + 1)th BU writes to the second half of these RAMs.

The twiddle factors are stored sequentially in ROM. For each two-by-two transform, the twiddle factor address is determined by the source data address and the current transform stage:

    data_addr[s−1 : 1] & ((2^{s−1} − 1) << (s − 1 − current_stage))    (12)

where "<<" denotes the logical left shift, "&" the AND operation, "data_addr" the address of the source datum, and "current_stage" the current processing stage. In our design the data length is 1024 points, so the twiddle factor address logic can be expressed as in Table 1.

Table 1    1024-point twiddle factor address logic.

    current stage   Twiddle Factor Address
    0               {000000000}
    1               {data_addr[9], 00000000}
    2               {data_addr[9:8], 0000000}
    3               {data_addr[9:7], 000000}
    4               {data_addr[9:6], 00000}
    5               {data_addr[9:5], 0000}
    6               {data_addr[9:4], 000}
    7               {data_addr[9:3], 00}
    8               {data_addr[9:2], 0}
    9               {data_addr[9:1]}

To describe the algorithm clearly, a 2-BU 8-point FFT hardware design built with the proposed mapping algorithm is introduced first. For this example n = 2^3 and k = 1, and the mapping algorithm transforms the 1-tuple array of Fig. 3 into the target architecture of Fig. 4. It can be seen from Fig. 4 that: (i) BU0 and BU1 in the logical array are mapped to BU0 in the target machine; (ii) BU2 and BU3 in the logical array are mapped to BU1; (iii) the target machine has four block RAMs, BRAM00, BRAM10, BRAM01 and BRAM11. S00, S10, S20 and S30 in the logical array are mapped to BRAM00; S40, S50, S60 and S70 to BRAM10; S01, S11, S21 and S31 to BRAM01; and S41, S51, S61 and S71 to BRAM11.

Fig. 4    Block diagram of array architecture (n = 8, k = 1).

The corresponding work procedure is as follows. The raw source data are first permuted into binary-reverse order and stored in BRAM00 and BRAM10, as illustrated in Fig. 5. In the first stage, BU0 reads x(0) and x(4) from BRAM00 and then writes x(0) + x(4) and x(0) − x(4) to BRAM01 and BRAM11 in sequence. In parallel, BU1 reads x(1) and x(5) from BRAM10; to avoid contention with BU0, it first computes x(1) − x(5) and writes it to BRAM11, and then computes x(1) + x(5) and writes it to BRAM01. This interleaved storage strategy avoids contention on the write ports of BRAM01 and BRAM11. Next, BU0 processes x(2) and x(6) in BRAM00 while BU1 processes x(3) and x(7) in BRAM10. After the first stage completes, the data in BRAM01 and BRAM11 are as illustrated in Fig. 6. In the second and third stages the same operations are carried out with different butterfly twiddle factors. After the third-stage transform, the result data are stored in BRAM01 and BRAM11 in normal order.

Fig. 5    Binary-reverse order format of source data.

Fig. 6    Data in BRAM01 & BRAM11 after the 1st stage transform.

3.  Hardware Implementation

With the FFT mapping algorithm discussed in Sect. 2, a 1024-point FFT processor with 4 BUs is implemented in TSMC 0.18 µm CMOS technology. The processor uses fixed-point arithmetic, and its input data, output data and twiddle factors are all 18 bits wide. The datapath architecture is shown in Fig. 7. Comparing Fig. 4 and Fig. 7, we can see that the logic complexity along the datapath is unchanged: first, the logic complexity inside each BU is constant; second, the read data from a block RAM to a BU pass through just one 2-to-1 multiplexer, and the same holds for the write data from a BU to a block RAM. Hence the clock frequency is not affected by the BU number. Moreover, because there is no bus contention in this design, the system performance will in theory increase in proportion to the BU number. A quantitative evaluation of the architecture's scalability and of system performance versus BU number is given in Sect. 4.

Fig. 7    Block diagram of array architecture (n = 1024, k = 2).

In order to reduce the block RAM port count, the operands for each two-by-two transform are fetched sequentially from the block RAMs. This approach provides the following advantages: (i) each block RAM needs only one port; (ii) only two multipliers are required per BU, which simplifies the BU's complexity; (iii) the data width of the ROM is also reduced, because the real and imaginary parts of each twiddle factor are likewise fetched sequentially.

Fig. 8    Radix-2 BU architecture.

The architecture of the BU is illustrated in Fig. 8. Under the control of "T1," x_{2j+1} is latched in "Register1" and multiplied by the twiddle factor, while x_{2j} is latched in "Register2" and simply passes through. In this way one complex multiplication is performed in two clock cycles, so only two multipliers are needed: in the first clock period x_{2j+1} is multiplied by the real part (wR) of the twiddle factor, and in the second clock period by the imaginary part (wI). The sequential-to-parallel converter controlled by "T2" combines these partial

Table 3    Design performance.

    Working Conditions           Critical Path
    Worst   (1.62 V, 125 °C)     4.8096 ns
    Typical (1.8 V, 25 °C)       3.0623 ns
    Fast    (1.98 V, 0 °C)       2.2292 ns

Fig. 9    Timing diagram of radix-2 butterfly operation.
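The two-cycle multiplication of Figs. 8 and 9 can be modeled behaviorally. This is a sketch of the arithmetic only (ideal precision is assumed, not the 18-bit fixed-point datapath): in each cycle the two real multipliers share one part of the twiddle factor, and the sequential-to-parallel converter then combines the four partial products.

```python
def two_cycle_cmul(x, w):
    """Behavioral model of the BU's time-multiplexed complex multiplier:
    two real multipliers reused over two clock cycles."""
    # cycle 1: both multipliers use the real part wR of the twiddle factor
    p0, p1 = x.real * w.real, x.imag * w.real
    # cycle 2: both multipliers use the imaginary part wI
    p2, p3 = x.real * w.imag, x.imag * w.imag
    # sequential-to-parallel combination: (xR + i*xI)(wR + i*wI)
    return complex(p0 - p3, p1 + p2)
```

Halving the multiplier count this way costs one extra cycle per butterfly, which the sequential operand fetch hides.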

Table 2    Hardware cost.

    Components      Area (Gate)   Percentage
    SRAMs           208,038       81.2%
    Datapath        47,703        18.6%
    Address Logic   500           0.2%
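The address logic is tiny (500 gates in Table 2) because Eq. (12) is nothing but masking and wiring. A software model of the equation, assuming s = 10 as in Table 1 (the function name is mine):

```python
def twiddle_addr(data_addr, current_stage, s=10):
    """Twiddle-factor ROM address per Eq. (12): keep the top `current_stage`
    bits of data_addr[s-1:1] and zero the rest (all within s-1 bits)."""
    top = (data_addr >> 1) & ((1 << (s - 1)) - 1)    # data_addr[s-1:1]
    mask = (((1 << (s - 1)) - 1) << (s - 1 - current_stage)) & ((1 << (s - 1)) - 1)
    return top & mask
```

Stage 0 always yields address 0 and stage 9 returns data_addr[9:1] unchanged, reproducing the first and last rows of Table 1.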

Fig. 10    Layout of 4-BU FFT processor.

(a) 1024-point source data.    (b) 1024-point FFT result.

products to generate x_{2j+1} × w. The timing diagram is shown in Fig. 9.
Because two adjacent BUs share the same write port, the write operations of the 2tth and the (2t + 1)th BU are scheduled as follows: the 2tth BU first writes its datum to the tth block RAM while, in the same cycle, the (2t + 1)th BU writes its datum to the (t + 2^{k−1})th block RAM; in the next cycle the two BUs swap destinations. In this way two adjacent BUs never write to the same block RAM concurrently. This operation is controlled by flip-flop "T3" in Fig. 8.
After synthesis with Synopsys Design Compiler, the hardware cost is as presented in Table 2. It clearly shows that the control logic, which includes the address generation logic, is very simple in this FFT architecture. After P&R with Synopsys Astro, the core area of this design is 2.991 × 1.121 mm², as shown in Fig. 10. Dual-port SRAMs still occupy about 62.3% of the core area. If area is a critical factor, applying single-port SRAMs would decrease it effectively, at the cost of an extra processing latency of (log2 n) × (pipeline length) clock cycles; in our design n = 1024 and the pipeline length is 8, so the extra latency is 80 cycles.

Fig. 11    Source data and transformed result of 1024-point FFT.

The performance of our design under different working conditions is listed in Table 3. The critical path lies in the single-stage 18-bit multiplier; if a pipelined multiplier were applied, the clock frequency could be increased further. Under typical working conditions, one 1024-point FFT transform takes about 7839 ns. The average power consumption of the 4 BUs is 54.118 mW at 250 MHz. The input data and transformed results are shown in Fig. 11.

4.  Scalability and Performance Analysis

In order to verify the scalability of this architecture, we vary the BU number in implementations of the 1024-point FFT based on the presented mapping algorithm and collect the corresponding timing delay, system throughput, system latency and hardware cost statistics. After P&R, the hardware overhead statistics for four configurations are shown in Table 4 and Fig. 12. From Table 4, we can see that the increase in core area has two causes: first, the datapath area grows almost in direct proportion to the BU number; second, as the BU number increases, the SRAM must be partitioned into more blocks, which consumes more area. The chart in Fig. 12 clearly demonstrates this trend.


Table 5    Performance for different BU number.

    BU   Max Delay (ns)   Throughput (data/sec)   Latency (ns)
    1    4.7928           20,864,630              49,079
    2    4.8689           41,077,039              24,928
    4    4.8096           83,166,999              12,313
    8    4.9108           162,906,247             6,286
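The measured figures above track the theoretical model closely. A quick sanity check, assuming latency = s × 2^{s−k} × tclk and throughput = n/latency, with tclk taken as the measured max delay:

```python
# Table 5 measurements: BUs -> (max delay ns, throughput data/s, latency ns)
table5 = {
    1: (4.7928, 20_864_630, 49_079),
    2: (4.8689, 41_077_039, 24_928),
    4: (4.8096, 83_166_999, 12_313),
    8: (4.9108, 162_906_247, 6_286),
}

n, s = 1024, 10
for bus, (tclk_ns, meas_thr, meas_lat) in table5.items():
    k = bus.bit_length() - 1                      # 2**k BUs
    model_lat = s * (1 << (s - k)) * tclk_ns      # ns
    model_thr = n / (model_lat * 1e-9)            # data/s
    print(f"{bus} BU: latency {model_lat:.0f} ns (meas. {meas_lat}), "
          f"throughput {model_thr:.3e} (meas. {meas_thr:.3e})")
```

Every row of Table 5 agrees with the model to within rounding, which is consistent with the claim that neither bus contention nor BU count affects the critical path.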

Table 4    Hardware overhead with different BU number.

    BU   SRAM (Gate)   Datapath (Gate)   AddrLogic (Gate)   Area (mm²)
    1    137,266       11,988            306                1.93548
    2    160,468       24,140            366                2.37237
    4    208,038       47,703            500                3.35291
    8    298,284       94,082            810                5.18880

Fig. 12    Hardware overhead versus BU number.

Fig. 13    Layout of 8-BU FFT processor.

The timing delay within each BU does not worsen when the core area grows, because the complexity of each BU does not change. Only the nets between a BU and its source and destination memory blocks lengthen, so the delay through those paths increases. But this does not affect the clock frequency: the logic along these nets is very simple and its complexity does not change when more BUs are applied, so the net delay will not become the bottleneck of the system. Moreover, the net delay can be optimized through careful floorplanning when the number of BUs is large. For example, for the 8-BU configuration we could apply the floorplan illustrated in Fig. 13. Unlike the 4-BU floorplan, the eight BUs are aligned in two rows instead of one; in this way the core width is shortened and each BU is placed close to its source and destination block RAM partitions. Under the worst-case condition (1.62 V, 125 °C), the post-P&R performance for different BU numbers is shown in Table 5. Because the maximum timing delay is not affected by the BU number and there is no bus contention among the units, throughput increases linearly and system latency

decreases linearly with the BU number. At the end of this section, we compare the proposed architecture with the architecture of references [9], [10], the radix-2 cascade architecture of reference [5] and the radix-4 cascade architecture of reference [6]. The comparison covers hardware cost, system throughput and system latency. In the parallel n-point FFT algorithm proposed in [10], there are p BUs at each pipeline stage, as well as inter-stage RAMs to keep the FFT computation continuous. Each BU at the first pipeline stage computes n1/p FFTs of n2 points, and each BU at the second pipeline stage computes n2/p FFTs of n1 points, where n = n1 × n2. For the comparison with n = 1024, we choose n1 = 32, n2 = 32 and p = 2, and assume that the external I/O clock frequency equals the chip core clock frequency. In that architecture, the inter-stage memory that decouples the stage buses not only consumes a large chip area but also costs many clock cycles. A radix-2 1024-point FFT has ten transform stages, so in the radix-2 cascade structure the BU number is configured as ten and each BU processes one stage. For the radix-4 1024-point FFT, five radix-4 butterfly units are applied. Note that each radix-4 BU has four integer multipliers, so its hardware overhead is about twice that of a radix-2 BU; one shortcoming of the radix-4 BU is that its multiplier utilization is only 75% [6]. The comparison is shown in Table 6. With 4 BUs, the SRAM volume of our design is one third of that of the algorithm in reference [10]; because there is no inter-stage data transfer, the system latency is reduced by 58.1% and the throughput is improved by 16.8%. The proposed 8-BU architecture has shorter system latency than the radix-2 and radix-4 cascade designs even though its datapath is only 80% of the scale of those counterparts.
The throughput of the 8-BU design is 20% lower than that of the radix-2 and radix-4 cascade designs because fewer BUs are applied. With 16 BUs, the performance of our design clearly surpasses the other counterparts. These results demonstrate that our algorithm has good scalability, high throughput and short system latency.

5.  Conclusion

An FFT mapping algorithm is proposed in this paper to build an efficient FFT array processing architecture. For an n = 2^s point FFT, 2^k (k = 0, 1, ..., s − 1) BUs can be scheduled to work in parallel. If pipeline latency is not taken into account, the corresponding processing latency is (s × 2^{s−k}) × tclk and the throughput is n/(s × 2^{s−k} × tclk), where tclk is the system clock period. The presented architecture offers the following advantages: (i) no inter-stage data transfer is required, so both throughput and system latency are better than in conventional architectures; (ii) the critical path delay does not increase as processing units are added, so system performance increases linearly with the number of BUs; (iii) the architecture lets designers trade off hardware cost against performance flexibly. The mapping method can be extended to FFT algorithms of other radices; moreover, block-floating-point and floating-point FFT processors can also adopt it.

Table 6    Performance comparison.

    Algorithm              Memory Size (bit)   BU Numbers     System Latency (cycle)   Throughput (data/sec)
    Shenhav [10]           225,792             4 (radix-2)    6114                     1024/(3072 × tclk)
    Radix-2 Cascade [5]    110,520             10 (radix-2)   2046                     1024/(1024 × tclk)
    Radix-4 Cascade [6]    110,520             5 (radix-4)    1620                     1024/(1024 × tclk)
    Proposed 4-BU          73,728              4 (radix-2)    2560                     1024/(2560 × tclk)
    Proposed 8-BU          73,728              8 (radix-2)    1280                     1024/(1280 × tclk)
    Proposed 16-BU         73,728              16 (radix-2)   640                      1024/(640 × tclk)

    * tclk denotes the clock period of the system.

Acknowledgments

This work was supported by a fund from the Japanese Ministry of ECSST via the Kitakyushu knowledge-based cluster project.

References

[1] D. Takahashi, "High-performance parallel FFT algorithms for the HITACHI SR8000," Proc. Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, vol.1, pp.192–199, May 2000.
[2] D. Takahashi and Y. Kanada, "High-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers," J. Supercomputing, vol.15, no.2, pp.207–228, Feb. 2000.
[3] K. Tanno, T. Taketa, and S. Horiguchi, "Parallel FFT algorithms using radix 4 butterfly computation on an eight-neighbor processor array," Parallel Comput., vol.21, no.1, pp.121–136, Jan. 1995.
[4] G.D. Bergland and H.W. Hale, "Digital real-time spectral analysis," IEEE Trans. Comput., vol.EC-16, no.2, pp.180–185, April 1967.
[5] G.C. O'Leary, "Non-recursive digital filtering using cascade fast Fourier transformers," IEEE Trans. Audio Electroacoustics, vol.AU-18, no.2, pp.177–183, June 1970.
[6] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. Solid-State Circuits, vol.30, no.3, pp.300–305, March 1995.
[7] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM," IEEE J. Solid-State Circuits, vol.39, no.3, pp.484–493, March 2004.
[8] D.R. Bungard, L. Lau, and T.L. Rorahaugh, "New programmable FFT implementation for radar signal processing," IEEE International Symposium on Circuits and Systems (ISCAS89), vol.II, pp.1323–1327, May 1989.
[9] N.M. Brenner, "Fast Fourier transform of externally stored data," IEEE Trans. Audio Electroacoustics, vol.AU-17, no.2, pp.128–132, June 1969.
[10] R. Shenhav, "The decomposition of long FFT's for high throughput implementation," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP87), pp.1043–1046, April 1987.
[11] D.L. Jones and H.V. Sorensen, "A bus-oriented multiprocessor fast Fourier transform," IEEE Trans. Signal Process., vol.39, no.11, pp.2547–2551, Nov. 1991.
[12] Y.T. Ma, "A VLSI-oriented parallel FFT algorithm," IEEE Trans. Signal Process., vol.44, no.2, pp.445–448, Feb. 1996.
[13] J.W. Cooley and J.W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol.19, pp.297–301, April 1965.
[14] R.C. Singleton, "A method for computing the fast Fourier transform with auxiliary memory and limited high-speed storage," IEEE Trans. Audio Electroacoustics, vol.AU-15, no.2, pp.91–98, June 1967.
[15] S.Y. Kung, "On supercomputing with systolic/wavefront array processors," Proc. IEEE, vol.72, no.7, pp.867–884, July 1984.
[16] P. Lee and Z. Kedem, "Synthesizing linear array algorithms from nested for loop algorithms," IEEE Trans. Comput., vol.37, no.12, pp.1578–1598, Dec. 1988.

Zhenyu Liu received his B.E., M.E. and Ph.D. degrees in electronics engineering from the Beijing Institute of Technology in 1996, 1999 and 2002, respectively. His doctoral research focused on real-time signal processing and related ASIC design. From 2002 to 2004, he was a postdoctoral researcher at Tsinghua University, China, where his research mainly concentrated on embedded CPU architecture. He is currently a researcher at the Kitakyushu Foundation for the Advancement of Industry Science and Technology. His research interests include real-time H.264 encoding algorithms and the associated VLSI architecture.


Yang Song received the B.E. degree in Computer Science from Xi'an Jiaotong University, China in 2001 and the M.E. degree in Computer Science from Tsinghua University, China in 2004. He is currently a Ph.D. candidate at the Graduate School of Information, Production and Systems, Waseda University, Japan. His research interests include video coding and the associated very large scale integration (VLSI) architecture.

Takeshi Ikenaga received his B.E. and M.E. degrees in electrical engineering and the Ph.D. degree in information & computer science from Waseda University, Tokyo, Japan, in 1988, 1990, and 2002, respectively. He joined LSI Laboratories, Nippon Telegraph and Telephone Corporation (NTT) in 1990, where he undertook research on design and test methodologies for high-performance ASICs, a real-time MPEG2 encoder chip set, and a highly parallel LSI & system design for image-understanding processing. He is presently an associate professor in the system LSI field of the Graduate School of Information, Production and Systems, Waseda University. His current interests are application SoCs for image, security and network processing. Dr. Ikenaga is a member of the IPSJ and the IEEE. He received the IEICE Research Encouragement Award in 1992.

Satoshi Goto was born on January 3rd, 1945 in Hiroshima, Japan. He received the B.E. and M.E. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970, respectively, and the Dr. of Engineering degree from the same university in 1981. He is an IEEE Fellow, a Member of the Academy Engineering Society of Japan, and a professor of Waseda University. His research interests include LSI systems and multimedia systems.
