IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.4 APRIL 2006


PAPER

Special Section on Selected Papers from the 18th Workshop on Circuits and Systems in Karuizawa

Scalable VLSI Architecture for Variable Block Size Integer Motion Estimation in H.264/AVC Yang SONG†a) , Student Member, Zhenyu LIU†† , Nonmember, Satoshi GOTO† , Fellow, and Takeshi IKENAGA† , Member

SUMMARY Because of the data correlation in the motion estimation (ME) algorithm of the H.264/AVC reference software, it is difficult to implement an efficient ME hardware architecture. In order to make parallel processing feasible, four modified, hardware-friendly ME workflows are proposed in this paper. Based on these workflows, a scalable full search ME architecture is presented, which has the following characteristics: (1) The sum of absolute differences (SAD) results of 4 × 4 sub-blocks are accumulated and reused to calculate the SADs of bigger sub-blocks. (2) The number of PE groups is configurable. For a search range of M×N pixels, where M is the width and N the height, up to M PE groups can be configured to work in parallel, with a peak processing speed of N×16 clock cycles to fulfill a full search variable block size ME (VBSME). (3) Only conventional single-port SRAM is required, which makes this architecture suitable for standard-cell-based implementation. A design with 8 PE groups has been realized with TSMC 0.18 µm CMOS technology. The core area is 2.13 mm × 1.60 mm and the clock frequency is 228 MHz under typical conditions (1.8 V, 25°C).
key words: variable block size motion estimation (VBSME), H.264/AVC, very large scale integration (VLSI) architecture

1. Introduction

H.264/AVC, the newest video coding standard, was jointly developed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The goals of H.264/AVC are to enhance compression efficiency and to provide a network-friendly video representation for various applications [1]. Compared with previous standards, H.264/AVC adopts many new features, including variable block size motion compensation, quarter-sample-accurate motion compensation, multiple reference picture motion compensation, in-the-loop deblocking filtering and so on [2]. These new features let H.264/AVC outperform previous coding standards by up to 50% over various bit rates and video resolutions [1]. As in other standards, ME is the most computation-intensive part of H.264/AVC. Moreover, up to 7 block sizes and 1/4-pixel resolution motion vectors (MVs) are adopted in H.264/AVC [3], which puts a heavy burden on the processing unit and makes traditional ME architectures incompatible. The 7 kinds of block sizes within one MB are shown in Fig. 1.

Manuscript received June 27, 2005. Manuscript revised October 3, 2005. Final manuscript received November 14, 2005.
† The authors are with the Graduate School of Information, Production and Systems, Waseda University, Kitakyushu-shi, 808-0135 Japan.
†† The author is with Kitakyushu Foundation for the Advancement of Industry Science and Technology, Kitakyushu-shi, 808-0135 Japan.
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e89–a.4.979
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

Fig. 1 Variable block sizes in H.264/AVC.

In the H.264/AVC reference software, the best matching position is determined by a Lagrangian cost, which includes both the residual cost and the MV cost. However, the MV cost introduces data correlation among adjacent sub-partitions and makes parallel processing of all blocks within one MB infeasible. Moreover, the search range center of each block is not fixed, so the overlapped search area data cannot be reused. To eliminate these demerits, four hardware-oriented ME workflows are proposed in this paper, and different video sequences are simulated to clarify their effects on encoding performance. Based on the proposed ME workflows, a scalable full search ME architecture for H.264/AVC is presented, which has the following characteristics: (1) Thanks to the result-reuse methodology, all 41 MVs of the different block sizes in one MB are calculated in one full search ME operation. (2) Both the reference frame and the current MB data are stored in conventional single-port SRAM and no broadcasting signals are required; therefore, the architecture is suitable for standard-cell-based implementation. (3) The number of processing element (PE) groups in this design is configurable. If the search range in the reference frame is M×N, where M is the width and N the height, up to M PE groups can be configured to work in parallel, finishing one VBSME operation in N×16 clock cycles. In fact, any m PE groups can be configured as long as m is a factor of M, and the corresponding full search VBSME time is (M×N×16)/m clock cycles.

The rest of the paper is organized as follows. In Sect. 2, the ME algorithm in H.264/AVC is presented. The four proposed ME workflows and their performance analysis are discussed in Sect. 3. A VLSI architecture for ME in H.264/AVC is proposed in Sect. 4. The silicon design and performance analysis are given in Sect. 5. Finally, we conclude this paper in Sect. 6.

Fig. 2 MVP calculation process in H.264/AVC.

2. ME Algorithm in H.264/AVC Reference Software

The ME algorithm is the most computation-intensive part of the encoding process. In the H.264/AVC reference software, the ME process is conducted in the following steps: 1) Integer ME (IME) is first performed on integer pixels to find the best matching integer position. 2) Fractional ME (FME) is then conducted on the 1/2 and 1/4 pixels around the best integer position found in 1) to enhance the coding efficiency. 3) Mode decision is then performed to decide the MB block mode. In this paper, our modifications concentrate only on the IME process; FME and mode decision are the same as in the H.264/AVC reference software and will not be discussed further. During IME, for a specific block size, the motion cost is determined by both its SAD and its motion vector difference (MVD), as shown in Eq. (1), where λ is the Lagrangian multiplier and R(m − p) is the number of bits to code the MVD. m = (mv_x, mv_y)^T is the current MV and p = (mvp_x, mvp_y)^T is the predicted MV (MVP).

J(m, λ) = SAD(s, c(m)) + λ × R(m − p)

(1)
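As an illustrative sketch (not the reference-software implementation), Eq. (1) can be written out as follows; the rate model `ue_golomb_bits` is a simple stand-in based on signed Exp-Golomb code lengths, and all function names here are our own:

```python
def motion_cost(sad, mv, mvp, lam, bits_for_mvd):
    """Lagrangian motion cost J(m, lambda) = SAD + lambda * R(m - p).

    `bits_for_mvd` maps one MVD component to its code length in bits.
    """
    mvd_x = mv[0] - mvp[0]
    mvd_y = mv[1] - mvp[1]
    rate = bits_for_mvd(mvd_x) + bits_for_mvd(mvd_y)
    return sad + lam * rate

def ue_golomb_bits(v):
    """Code length of a signed value under Exp-Golomb coding (toy rate model)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)  # signed-to-unsigned mapping
    return 2 * (code_num + 1).bit_length() - 1

# Example: SAD of 120, MV (3, -1), MVP (1, 0), lambda = 4.
cost = motion_cost(120, (3, -1), (1, 0), 4, ue_golomb_bits)
```

Setting λ = 0 (or dropping the rate term) reduces this to the SAD-only cost used by some of the workflows discussed below.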

From the above discussion, it can be seen that the MVP is required to calculate the motion cost of a block. In H.264/AVC, the MVP of one block is predicted from its neighboring blocks, as shown in Fig. 2. The rules are as follows: 1) For all block sizes except 16 × 8 and 8 × 16, the MVP of the current block is the median of MV_A, MV_B and MV_C, as shown in Eq. (2).

MVP = median(MV_A, MV_B, MV_C)

(2)

2) For the 8 × 16 block size, the MVPs of the left and right blocks are MV_A and MV_C, respectively. 3) For the 16 × 8 block size, the MVPs of the upper and lower blocks are MV_B and MV_A, respectively.

Fig. 3 Reference frame buffer (MB size is 16 × 16 with a search range of [−16, +15]).

4) For special cases, the MVP calculation is modified accordingly; details can be found in [3]. From the above discussion, we can see that data dependencies exist in the ME process within one MB. For example, the four 4 × 4 blocks within the top-left 8 × 8 block in Fig. 2(a) must be processed in the order E, F, G, H. Block E is processed first because its MVP can be calculated directly from its surrounding sub-blocks, which lie outside the current MB. Block F, however, cannot be processed until the ME process for block E is finished, because MVP_F requires MV_E. The same data dependence exists in the ME process of blocks G and H. Besides the data correlation among blocks within one MB, in H.264/AVC the search range center of each block is determined by its MVP. Because the MVPs of adjacent MBs take various values, the overlapped search area of two adjacent MBs is irregular. For example, as illustrated in Fig. 3, because the overlapped area of two adjacent search windows is irregular and unstable, when the ME process for MB A is finished this overlapped area is difficult to reuse for the ME process of MB B. A simple solution is to refill the entire search area, but this approach greatly increases the memory bandwidth.

3. Proposed IME Workflows

From the IME algorithm discussed in Sect. 2, we can see that two issues make it difficult to design an efficient ME architecture. First, the calculation of the motion cost causes data correlation among sub-blocks within one MB and makes parallel processing infeasible. Second, for each

SONG et al.: SCALABLE VLSI ARCHITECTURE FOR VBSME IN H.264/AVC


Fig. 4 UMVP calculation algorithm (All 41 sub-blocks in one MB share the same UMVP, which is the median of MVA , MVB and MVC ).

Fig. 5 Fixed search range center ([0,0]).

block, the search range center is determined by its MVP, which makes the overlapped search area irregular and unstable, so it cannot easily be reused. To eliminate the data correlation in the IME process, one way, as proposed in [4], is to set the λ parameter in Eq. (1) directly to 0, so that only the SAD cost is taken into account; the data dependency within one MB is thus removed. Another way is to use, rather than the exact MVP of each block, a uniform MVP (UMVP), which is the MVP of the current MB, as shown in Fig. 4. All 41 sub-blocks within one MB share the same UMVP. Because the UMVP is derived from MVs outside the current MB, the data correlation within the MB is eliminated and all blocks can be processed in parallel. To save memory bandwidth, the search range center of each block can be fixed to [0, 0] rather than its MVP, as discussed in [5], so that the overlapped reference frame data of adjacent blocks can be reused. For example, as illustrated in Fig. 5, with a search range of [−16, +15], two adjacent MBs can share 2/3 of the reference frame data, saving considerable memory bandwidth.

Based on these two modifications, four IME workflows for H.264/AVC are proposed, as shown in Table 1. In practice, ALG3 is the modified ME prediction flow proposed in [4]; the others are proposed by us. For the motion cost of a block, two options are provided: (1) only SAD is considered; (2) both SAD and MVD are taken into account, but the MVD is calculated from the MV and the UMVP. Both options eliminate the data correlation and support parallel processing of all sub-blocks within one MB. For the search range center, two options are also available: (1) the center is fixed to [0, 0]; (2) the center is decided by the UMVP.

Table 1 Proposed IME workflows.

Algorithm   Motion Cost   Search Range Center
JM 8.1a     SAD & MVP     MVP
ALG1        SAD           UMVP
ALG2        SAD & UMVP    UMVP
ALG3        SAD           [0,0]
ALG4        SAD & UMVP    [0,0]

Eight video sequences with QCIF, CIF and HDTV720p resolutions are simulated to verify the coding efficiency of the proposed workflows. The test conditions are an I-P-P-P... structure, CAVLC and the Hadamard transform. For the QCIF and CIF sequences, 5 reference frames are used with a search range of [−16, +15]. For the HDTV720p sequence, 2 reference frames are simulated and the search range is [−64, +63]. The R-D curves are shown in Fig. 6. From these experiments we can see that, except for the Stefan sequence, the four proposed ME workflows perform almost identically and the coding loss is less than 1%. We attribute this mainly to the following: (1) During ME, it is the SAD, not the MVD, that dominates the motion cost [4], so omitting the MVD does not bring much performance degradation. (2) With an adequate search range, the exact MVP is very likely to lie within a search range whose center is fixed to [0,0], so using either the UMVP or [0,0] as the search range center does not greatly impact the coding quality. A similar report can be found in [6]. For the Stefan sequence, however, the four proposed workflows perform differently. ALG1 and ALG2, which use the UMVP as the search range center, perform almost the same as JM8.1a, while ALG3 and ALG4, which fix the search range center to [0,0], show a visible coding loss. The reason is that the Stefan sequence has fast motion: with a 32 × 32 search range centered at [0,0], the real MVs fall outside the search area. This problem can be solved in two ways: increase the search range as discussed in [4], which increases computation, or use the UMVP as the search range center, which increases memory bandwidth.
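The UMVP of Fig. 4 is the component-wise median of the three neighboring MVs, shared by all 41 sub-blocks of the current MB. A minimal sketch, assuming integer MV components (function names are ours):

```python
def median3(a, b, c):
    # Median of three scalars: total minus min minus max.
    return a + b + c - min(a, b, c) - max(a, b, c)

def umvp(mv_a, mv_b, mv_c):
    """Uniform MVP per Eq. (2): component-wise median of MV_A, MV_B, MV_C."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))

center = umvp((2, 0), (5, -3), (4, 1))  # one UMVP for the whole MB
```

Because the three inputs come from blocks outside the current MB, this value is available before any sub-block of the MB is searched, which is what removes the intra-MB dependency.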
According to the application requirements, users can trade off hardware cost against memory bandwidth to overcome this problem. In our experiments, ALG1 and ALG2 perform almost identically, as do ALG3 and ALG4. Therefore, ALG1 and ALG3 are preferable, because the UMVP-based MVD calculation is eliminated, reducing hardware cost and power consumption. Between ALG1 and ALG3, if the application has a sufficient search range, ALG3 is preferable because the overlapped search area can be reused; otherwise, ALG1 is preferred because its PSNR loss is negligible.


Fig. 6 The rate and distortion comparison for the proposed workflows.

Finally, some conclusions can be drawn: (1) By adopting the UMVP, we achieve almost the same performance as the original H.264/AVC reference software, while the data correlation within one MB is eliminated, which makes parallel processing feasible. (2) The motion cost of a block can be calculated directly from the SAD, which decreases the computational complexity with only a slight impact on coding efficiency. (3) For frames with a relatively large search

range, the search range center of each block can be fixed directly to [0,0] to save memory bandwidth, with acceptable coding loss. A scalable IME architecture that supports both ALG1 and ALG3 through different reference frame data scheduling approaches is proposed in the following section.


4. IME Hardware Architecture

4.1 ME Architecture Overview

Because ME is the most computation-intensive part of the video coding process, various architectures have been proposed; for example, a 1-D and a 2-D systolic array ME architecture were proposed in [7] and [8], respectively, and reviews of ME architectures can be found in [9]–[12]. However, previous ME architectures cannot easily be adopted for H.264/AVC because they cannot fully support all 7 block modes. Recently, a 1-D architecture [13] and a 2-D architecture [4] were presented, both of which support all the block modes in H.264/AVC. However, these architectures have inherent drawbacks. The major demerit of the 1-D system is its slow processing speed; more PE groups can be applied to overcome this, but at increased hardware cost. Compared with the 1-D architecture, the 2-D system has higher performance, but it also has demerits. First, because the current frame is stored in the processing element (PE) array, broadcasting signals are needed for the reference frame data and for the read-write ports of the PE array, and this long net delay limits the clock frequency. Second, with conventional SRAM the 2-D hardware is not fully utilized because of pipeline bubbles; to achieve full utilization, the reference frame memory must be divided into two partitions and each PE must add a multiplexer to choose its input between them [4]. This scheme increases both the hardware complexity and the scale of the PE array.

4.2 Proposed ME Architecture
For example, if the search range is 32×32 pixels, we can choose 1, 2, 4, 8, 16 or 32 PE groups to perform the full search ME operation and their corresponding sub search ranges are 32 × 32, 16 × 32, 8 × 32, 4 × 32, 2 × 32 and 1 × 32 pixels. In order to illustrate the architecture, a design with 8 PE groups is implemented with a search range of 32 × 32 pixels. So for each PE, the sub-search area is 4 × 32 pixels and processing time is 16×4×32 clock cycles. This architecture is shown in Fig. 7. If the pipeline latency is not taken into account, it costs one PE group 16 clock cycles to calculate all 41 SADs in one search position. For example, at clock cycle 15, PE group0 finishes the SAD computation at search position [0,0], at clock 31, it finishes the operation at search position [0,1] and so on. At clock 511, every PE group reaches the search

Fig. 7 ME architecture with 8 PE groups.

position [0,31] and finishes the search of the first column. From clock 512 to 1023, each PE group performs the search at the second-column positions, and the remaining searches proceed analogously. When a PE group obtains the SAD at a new search position, it compares the new value with the stored one and keeps the minimum SAD and the associated MV. The data flow schedule from clock cycle 0 to clock cycle 511 is shown in Table 2; the flows from clock 512 to 2047 are the same. At cycle 2047, each PE group has finished the search of its own 4 × 32 sub-search area and holds 41 local minimum SADs and their corresponding MVs. From cycle 2048, the selector is activated and chooses the 41 global minimum SADs from the PE groups. To decrease the system critical path delay, every PE group adopts a three-stage pipeline and achieves 100% hardware utilization. In stage 1, 4 × 1 SADs are calculated. In stage 2, 4 × 4, 4 × 8 and 8 × 4 SADs are obtained. In stage 3, 8 × 8, 8 × 16, 16 × 8 and 16 × 16 SADs are generated. In the first stage, 16 pixels of one row of the current MB and one row of the search area are fed in; the partial SADs of four neighboring pixels in the same row are calculated and latched to the second stage, as shown in Fig. 8. The second stage calculates the sixteen 4 × 4, eight 4 × 8 and eight 8 × 4 block SADs of one MB, as shown in Fig. 9; the detailed hardware architecture is shown in Fig. 10. Registers SAD4×4_X (X: 0-15) store the minimum SADs of the sixteen 4 × 4 blocks, SAD4×8_X (X: 0-7) those of the eight 4 × 8 blocks, and SAD8×4_X (X: 0-7) those of the eight 8 × 4 blocks. Stage 2 consists of 4 sub-blocks, namely p_0, p_1, p_2 and p_3. The four sub-blocks have similar data flows, so we describe only the workflow of p_3 to demonstrate the principle of stage 2. In p_3, the register "sum_inter_3" and the adder below it function as an accumulator; every four clock cycles, a new 4 × 4 SAD is generated.
The SAD4×4_X registers are multiplexed to be compared with this SAD value at different times. Specifically, SAD4×4_0 is compared at clock cycle 16i + 3, SAD4×4_4 at 16i + 7, SAD4×4_8 at 16i + 11 and

Table 2 Data flow schedule for clock cycles 0–511. C(i, j) denotes a current MB pixel and R(i, j) a reference pixel; PE group g (g = 0, ..., 7) works at a horizontal offset of 4g columns.

Clock 0:    Σ_{i=0}^{15} |C(i, 0) − R(i+4g, 0)|
Clock 1:    Σ_{i=0}^{15} |C(i, 1) − R(i+4g, 1)|
...
Clock 15:   Σ_{i=0}^{15} |C(i, 15) − R(i+4g, 15)|    (search position [0,0] done)
Clock 16:   Σ_{i=0}^{15} |C(i, 0) − R(i+4g, 1)|
...
Clock 31:   Σ_{i=0}^{15} |C(i, 15) − R(i+4g, 16)|    (search position [0,1] done)
...
Clock 496:  Σ_{i=0}^{15} |C(i, 0) − R(i+4g, 31)|
...
Clock 511:  Σ_{i=0}^{15} |C(i, 15) − R(i+4g, 46)|    (search position [0,31] done)

Fig. 8 PE group pipeline Stage1 architecture.

SAD4×4_12 at 16i + 15. If the new SAD is less than the stored one, the register is updated. As mentioned before, a data reuse methodology is adopted in this architecture. To calculate a 4 × 8 SAD, at clock 16i + 3 the 4 × 4 SAD from the accumulator is stored in the tmp_3 register; at clock 16i + 7, the value saved in tmp_3 is combined with the SAD from the accumulator to obtain the SAD for the 4 × 8 block SAD4×8_0. The same operations are performed at clocks 16i + 11 and 16i + 15 to calculate SAD4×8_4. To calculate an 8 × 4 SAD, the 4 × 4 block SADs from p_3 are combined with the values from p_2. The 8 × 4 block SAD registers SAD8×4_0, SAD8×4_2, SAD8×4_4 and SAD8×4_6 are multiplexed to be compared with this value at clocks 16i + 3, 16i + 7, 16i + 11 and 16i + 15, respectively. To support the data reuse methodology, every four clock cycles the 8 ×

Fig. 9 The 41 SADs within one MB.

4 SADs are also latched down to the third stage for further SAD computation. The third-stage architecture is shown in Fig. 11. The SADs of the 8 × 4 blocks are reused to generate the SADs of the 8 × 8, 16 × 8, 8 × 16 and 16 × 16 blocks, as illustrated in Fig. 9. The data flow in this stage is similar to that of the second stage: at clocks 16i + 7 and 16i + 15, 8 × 8 and 16 × 8 block SADs are obtained; at clock 16i + 15, 8 × 16 and

16 × 16 block SADs are calculated. Because the datapath in this stage is wider than that of stage 2, the critical path in this stage is longer than the one in the second stage. However, the inputs to the third stage are updated only once every 4 clock cycles, so the data paths in this stage are multi-cycle paths. This property benefits the hardware implementation: low-power, small-area, slow cells can be chosen to implement the logic in this stage.

Fig. 10 PE group pipeline Stage2 architecture.

Fig. 11 PE group pipeline Stage3 architecture.
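The result-reuse hierarchy of the three pipeline stages, in which the sixteen 4 × 4 SADs are accumulated into all larger block sizes, can be sketched in software as follows. This is an illustrative model, not the RTL; the accumulation order differs slightly from the actual pipeline, and the function name is ours:

```python
import numpy as np

def all_41_sads(cur, ref):
    """All 41 variable-block-size SADs of one 16x16 MB at one search position.

    The sixteen 4x4 SADs are computed once and accumulated into the larger
    block sizes, mirroring the paper's result-reuse methodology.
    """
    d = np.abs(cur.astype(np.int32) - ref.astype(np.int32))
    # 4x4 grid of 4x4-block SADs: s44[r, c] covers pixel rows 4r..4r+3,
    # pixel columns 4c..4c+3 of the MB.
    s44 = d.reshape(4, 4, 4, 4).sum(axis=(1, 3))
    s48 = s44[0::2] + s44[1::2]         # eight blocks, 4 wide x 8 tall
    s84 = s44[:, 0::2] + s44[:, 1::2]   # eight blocks, 8 wide x 4 tall
    s88 = s48[:, 0::2] + s48[:, 1::2]   # four 8 x 8 blocks
    s816 = s88[0::2] + s88[1::2]        # two blocks, 8 wide x 16 tall
    s168 = s88[:, 0::2] + s88[:, 1::2]  # two blocks, 16 wide x 8 tall
    s1616 = s88.sum()                   # the whole 16 x 16 MB
    return np.concatenate([s44.ravel(), s48.ravel(), s84.ravel(),
                           s88.ravel(), s816.ravel(), s168.ravel(),
                           np.array([s1616])])
```

The count checks out: 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 SADs per search position, each larger SAD formed purely by adding already-computed smaller ones.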

Table 3 Hardware cost for the design.

                     Area (Gates)   Percentage (%)
8 PE Groups          167.56K        66.0
Current MB Buffer    16.66K         6.6
Search Area Buffer   56.64K         22.3
Others               13.01K        5.1
Total Size           253.87K Gates

Table 4 Hardware cost for one PE group.

             Area (Gates)   Percentage (%)
Stage1       5630           27.5
Stage2       11345          55.3
Stage3       3525           17.2
Total Size   20.94K Gates

Fig. 12 Silicon layout.

5. Hardware Implementation and Performance Analysis

5.1 Hardware Implementation

The design with 8 PE groups is implemented with a search

range of 32 × 32 pixels and can handle all block modes in H.264/AVC. The design is written in Verilog and synthesized with Synopsys Design Compiler, for a total gate count of 253.87K, as listed in Table 3. The current MB has 16 × 16 pixels and the search area is 47 × 47 pixels; to simplify the physical implementation, one pixel is padded in both the vertical and horizontal directions, giving a 48 × 48-pixel search area. For one PE group, the corresponding hardware cost is 20.94K gates, as listed in Table 4. The design is placed and routed with Synopsys Astro in TSMC 0.18 µm standard-cell CMOS technology. The core area is 2.13 mm × 1.60 mm, as shown in Fig. 12. Under the typical operating condition (1.8 V, 25°C), the maximum frequency is 228 MHz.

5.2 Scalability and Performance Analysis

In the proposed architecture, one PE group is the smallest configuration granularity, with a hardware cost of 21K gates. According to the frame size and search range, the number of PE groups can be flexibly configured to satisfy different processing requirements. It takes one PE group 16 cycles to perform the VBSME operation on one search position; if the clock period is T_CLK, the corresponding processing time is 16 × T_CLK. For a frame of W × H pixels, a search range of M × N pixels and a frame rate of F frames/second, where W is the frame width, H the frame height, M the search range width and N the search range height, the processing requirement (search positions per second) is given by Eq. (3).


Table 5 PE group number versus video applications.

Format   Frame Size   Search Range   Frame Rate   PE Group Number
CIF      352 × 288    32 × 32        30 Hz        1
VGA      640 × 480    48 × 32        30 Hz        4
4CIF     704 × 576    48 × 32        30 Hz        6
525SD    720 × 480    56 × 48        30 Hz        8

Table 6 Architecture scalability comparison.

                           1-D Array Design [13]      2-D Array Design [4]      This Work
Granularity Scale (PE)     16 PEs                     256 PEs                   1 PE Group (16 PEs)
Granularity Size (Gates)   61K                        91K                       21K
Throughput (cycles/pos)    16 Cycles                  1 Cycle                   16 Cycles
Process Technology         TSMC 0.13 µm               TSMC 0.35 µm              TSMC 0.18 µm
Working Conditions         –                          –                         Typical
Frequency                  294 MHz (Post-Synthesis)   66.67 MHz (Post-Layout)   228 MHz (Post-Layout)

Table 7 Performance comparison.

                                 2-D Array Architecture [4]   This Work
Hardware Granularity             256 PEs                      1 PE Group (16 PEs)
Hardware Cost                    127.7K Gates                 21K Gates
Max Frequency (Post-Synthesis)   202 MHz                      237.5 MHz

(W/16) × (H/16) × M × N × F    search-positions/s        (3)
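The scalability rule of Sect. 4.2 (the PE-group count m must divide the search range width M, with a full search taking (16 × M × N)/m cycles) can be enumerated with a quick sketch (function name is ours):

```python
def configurations(M, N):
    """List (m, sub_search_width, cycles) for every valid PE-group count m.

    m must be a factor of M; each group covers an (M/m) x N sub-range and
    the full search VBSME takes (16 * M * N) / m clock cycles.
    """
    return [(m, M // m, 16 * M * N // m)
            for m in range(1, M + 1) if M % m == 0]

# For the 32 x 32 search range of the 8-PE-group design, m = 8 gives a
# 4 x 32 sub-range and 16 * 4 * 32 = 2048 cycles per full search.
```

This makes explicit the linear throughput/area trade: doubling m halves the search time at the cost of one more 21K-gate PE group per step.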

For real-time VBSME, the number of PE groups must satisfy the inequality

NUM_{PE Group} ≥ ⌈ (W × H)/(16 × 16) × M × N × F × 16 × T_CLK ⌉        (4)

For example, to process QCIF (176 × 144) with a 32 × 32 search range at a 30 Hz frame rate, the required number of PE groups is calculated as follows (T_CLK is 4.39 ns in our design):

NUM_{PE Group} ≥ ⌈ (176 × 144 × 32 × 32 × 30)/(16 × 16) × 16 × T_CLK ⌉ = ⌈0.21⌉ = 1        (5)

Therefore, a single PE group suffices for real-time VBSME. The numbers of PE groups required for some typical applications are listed in Table 5. The configuration granularity of various VBSME architectures is listed in Table 6, where throughput is defined as the number of clock cycles needed to perform VBSME on one search position. For the 2-D design in [4], during its IME process only the SAD is considered and the search range center is fixed to [0,0]; in practice, this is the ALG3 of this paper. For the 1-D design in [13], only the SAD is used in its IME process and the search range center is not discussed. However, because only the hardware cost of the PE array is taken into account, the selection of the search range center affects only the bandwidth between the on-chip SRAM and off-chip memory and does not hinder the fairness of the comparison. Compared with the 1-D architecture [13], our design has the same throughput with a much smaller hardware cost. Compared with the 2-D architecture [4], the area-performance efficiency (throughput per gate) of the 2-D design is higher than

the proposed one. This is because the 2-D design stores only the 4 × 4 partial SADs; the larger SADs are calculated and delivered directly without storage, so many of the partial SAD registers used in our design are eliminated in the 2-D architecture. It is difficult to compare our design with the 2-D architecture directly because they are implemented in different technologies. To make the comparison feasible, we also realized the 2-D design in [4], described it in Verilog and synthesized it with Synopsys Design Compiler in TSMC 0.18 µm technology. Under the worst operating condition (1.62 V, 125°C), the performance comparison is shown in Table 7. It can be seen that when 6 PE groups are configured to work in parallel, we have the same hardware cost as the 2-D design. The corresponding processing capability (search-positions/s) is calculated in Eq. (6).

Processing Capability = Num_{PE Group} × 1/(16 × T_CLK)
                      = 6 × 1/(16 × 4.39 ns)
                      = 85.5M search-positions/s        (6)

Therefore, for video applications whose real-time processing requirement is below 85.5M search positions per second, the proposed design is more efficient; otherwise, the 2-D design is preferable. This processing capability corresponds to a 4CIF (704 × 576) frame at 30 Hz with a 48 × 32 search range. The proposed architecture has the following merits: (1) Compared with the 1-D architecture, our design effectively reduces the hardware cost while providing the same processing capability. (2) Compared with the 2-D design, a higher clock frequency can be achieved. (3) The proposed ME architecture has the smallest hardware granularity, which gives more flexibility to trade performance against hardware cost. According to the processing requirements, the number of PE groups can easily be configured to achieve real-time VBSME without wasting hardware. For instance, to perform real-time VBSME on CIF video at 30 Hz with a 48 × 32 search range, all at a clock frequency of 220 MHz, two 1-D array ME engines [13] or one 2-D array ME engine [4] would be needed, with hardware costs of 122K and 91K gates, respectively. The proposed design also needs two PE groups, but the hardware cost is only 42K gates. Clearly, our design satisfies the requirement with the smallest hardware cost.


6. Conclusions

In this paper, four modified ME workflows for H.264/AVC are proposed. In these workflows, the data correlation in the IME process is eliminated, making parallel VBSME processing of all blocks within one MB feasible; compared with the H.264/AVC reference software, the performance degradation is acceptable. Based on the proposed workflows, a scalable VBSME architecture for H.264/AVC is proposed. While traditional 1-D or 2-D systolic array architectures try to widen the bandwidth of the current frame, we broaden the output bandwidth of the reference frame. For a search range of M × N pixels, any m PE groups (m a factor of M) can be scheduled to perform the VBSME in parallel, with a corresponding processing time of (M × N × 16)/m clock cycles. Through the data reuse methodology, the SADs of 4 × 4 blocks are accumulated and reused to calculate those of larger blocks, which decreases the computation requirement. The architecture has no broadcasting signals and uses conventional SRAM to store the current MB and the reference frame, which makes it suitable for standard-cell-based design; because no broadcasting signals are required, a higher clock frequency can be achieved than with the 2-D architecture. A design with 8 PE groups has been implemented to illustrate the architecture. The core area is 2.13 mm × 1.60 mm and the maximum frequency is 228 MHz under typical conditions (1.8 V, 25°C).

Acknowledgments

This work was supported by a fund from MEXT via the Kitakyushu innovative cluster project.

References

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, “Video coding with H.264/AVC: Tools, performance, and complexity,” IEEE Circuits Syst. Mag., vol.4, no.1, pp.7–28, First Quarter, 2004.
[2] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol.13, no.7, pp.560–576, July 2003.
[3] T. Wiegand, G. Sullivan, and A.
Luthra, “Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC),” May 2003.
[4] Y.W. Huang, T.C. Wang, B.Y. Hsieh, and L.G. Chen, “Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264,” Proc. IEEE Int. Symp. Circuits Syst., vol.II, pp.796–799, 2003.
[5] T.C. Wang, Y.W. Huang, H.C. Fang, and L.G. Chen, “Performance analysis of hardware oriented algorithm modifications in H.264,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol.II, pp.493–496, 2003.
[6] A. Sharifinejad and H. Mehrpour, “A fast full search block matching algorithm using three window search based on the statistical analysis of the motion vectors,” Proc. IEEE Int. Conf. Communications, vol.1, pp.104–108, 2002.
[7] P.M. Kuhn, “Fast MPEG-4 motion estimation: Processor based and flexible VLSI implementations,” J. VLSI Signal Process., vol.23, pp.67–92, Oct. 1999.
[8] C.H. Chou and Y.C. Chen, “A VLSI architecture for real-time and flexible image template matching,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1336–1342, Oct. 1989.
[9] T. Komarek and P. Pirsch, “Array architecture for block matching algorithm,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1301–1308, Oct. 1989.
[10] P. Pirsch, N. Demassieux, and W. Gehrke, “VLSI architecture for video compression—A survey,” Proc. IEEE, vol.83, no.2, pp.220–246, Feb. 1995.
[11] P. Pirsch and H. Stolberg, “VLSI implementation of image and video multimedia processing systems,” IEEE Trans. Circuits Syst. Video Technol., vol.8, no.7, pp.878–891, Nov. 1998.
[12] P.C. Tseng, Y.C. Chang, Y.W. Huang, H.C. Fang, C.T. Huang, and L.G. Chen, “Advances in hardware architectures for image and video coding—A survey,” Proc. IEEE, vol.93, pp.184–197, Feb. 2005.
[13] S.Y. Yap and J.V. McCanny, “A VLSI architecture for variable block size video motion estimation,” IEEE Trans. Circuits Syst. II, Express Briefs, vol.51, no.7, pp.384–389, July 2004.

Yang Song received the B.E. degree in Computer Science from Xi'an Jiaotong University, China, in 2001 and the M.E. degree in Computer Science from Tsinghua University, China, in 2004. He is currently a Ph.D. candidate in the Graduate School of Information, Production and Systems, Waseda University, Japan. His research interests include motion estimation, video coding technology, and the associated VLSI architectures.

Zhenyu Liu received the B.E., M.E., and Ph.D. degrees in electronics engineering from Beijing Institute of Technology in 1996, 1999, and 2002, respectively. His doctoral research focused on real-time signal processing and related ASIC design. From 2002 to 2004, he was a postdoctoral researcher at Tsinghua University, China, where his research mainly concentrated on embedded CPU architecture. He is currently a researcher at the Kitakyushu Foundation for the Advancement of Industry Science and Technology. His research interests include real-time H.264 encoding algorithms and the associated VLSI architectures.

Satoshi Goto was born on January 3, 1945, in Hiroshima, Japan. He received the B.E. and M.E. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970, respectively, and the Dr. Eng. degree from the same university in 1981. He is an IEEE Fellow, a member of the Engineering Academy of Japan, and a professor at Waseda University. His research interests include LSI systems and multimedia systems.


Takeshi Ikenaga received the B.E. and M.E. degrees in electrical engineering and the Ph.D. degree in information and computer science from Waseda University, Tokyo, Japan, in 1988, 1990, and 2002, respectively. He joined LSI Laboratories, Nippon Telegraph and Telephone Corporation (NTT) in 1990, where he has been undertaking research on design and test methodologies for high-performance ASICs, a real-time MPEG-2 encoder chip set, and highly parallel LSI and system design for image-understanding processing. He is presently an associate professor in the system LSI field of the Graduate School of Information, Production and Systems, Waseda University. His current interests are application SoCs for image, security, and network processing. Dr. Ikenaga is a member of the IPSJ and the IEEE. He received the IEICE Research Encouragement Award in 1992.
