Optimization of Propagate Partial SAD and SAD Tree Motion Estimation Hardwired Engine for H.264

Zhenyu Liu, Satoshi Goto and Takeshi Ikenaga
The Graduate School of IPS, Waseda University, [email protected]

Abstract: The variable block size motion estimation algorithm is an efficient approach to reducing temporal redundancy, and it has been adopted by the latest video coding standard, H.264/AVC. The increase in computational complexity caused by the variable block size technique makes a hardwired accelerator essential, especially for real-time applications. In this paper, the authors apply architecture-level and circuit-level approaches to improve the performance of the Propagate Partial SAD and SAD Tree hardwired engines, which outperform their counterparts when the impact of supporting the variable block size technique is considered. Experiments demonstrate that, compared with the original architectures, the proposed approaches save 14.7% and 18.0% of the hardware cost for the Propagate Partial SAD architecture and the SAD Tree architecture, respectively. With TSMC 0.18µm 1P6M CMOS technology, the proposed Propagate Partial SAD architecture attains a 231.6MHz operating frequency at a cost of 84.1k gates. Correspondingly, the execution speed of the optimized SAD Tree architecture is improved to 204.8MHz with a hardware cost of 88.5k gates.

This work was supported by a fund from CREST, JST.

I. INTRODUCTION

Variable block size (VBS) motion estimation (ME) is a powerful technique adopted by the latest international video coding standard, H.264/AVC [1]. Compared with its fixed block size counterpart, variable block size motion estimation obtains more accurate motion vectors and therefore further reduces the temporal redundancy. In H.264/AVC, motion estimation is conducted on different block sizes, including 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8 and 16 × 16, as shown in Fig.1. During motion estimation, all blocks inside one macroblock (MB) are processed and the block mode with the minimum rate-distortion cost is chosen as the final candidate.

The high compression efficiency of variable block size motion estimation comes at the price of intensive computation. In the H.264 encoding process, more than 50% of the computation power is consumed by the motion estimation algorithm. Although many fast motion estimation algorithms have been developed, such as four-step search [2], diamond search [3] and successive elimination [4], the computation burden is so heavy that software motion estimation on a general-purpose CPU is still infeasible for real-time applications, especially with HDTV specifications. For example, with the Parkrun 720P test sequence, a [−64, 64] search range, and 5 reference frames, it took a Dell Precision Workstation 650 (Intel Xeon 3.06GHz processor and 2GB ECC Double Data Rate SDRAM memory at 266MHz) an average of

978-1-4244-2658-4/08/$25.00 ©2008 IEEE

Fig. 1. 41 partitions in the macroblock (panels: mode 4 × 4, mode 8 × 4, mode 4 × 8, mode 8 × 8, mode 16 × 8, mode 8 × 16, mode 16 × 16)

16.9 seconds per frame to execute motion estimation based on the Unsymmetrical-cross Multi-Hexagon-grid Search (UMHexagonS) algorithm [5]. To satisfy the computational throughput requirement, a hardwired motion estimation engine is still essential for real-time encoder designs [6][7].

Full-search-like algorithms are widely used in hardwired motion estimation designs. Compared with the software-oriented fast block matching methods, the full-search-like algorithms adopted by hardware engines have the following advantages: (1) by using the sub-partition SAD reusing scheme, the computational complexity of the variable block size algorithm is reduced almost to the level of the fixed block size one; (2) the process elements (PE) are scheduled to work in parallel and are fully utilized, so the throughput is directly proportional to the number of process elements; (3) the memory access and control logic are regular and simple. These advantages can also be viewed as the principles of hardwired motion estimation engine design.

Many studies have been published in the field of VBSME hardwired engine architecture design. According to the preceding three design principles, the analysis in [8] reveals that the Propagate Partial SAD and SAD Tree architectures outperform their counterparts. These two architectures suit different application specifications. When no parallelism is required, Propagate Partial SAD provides the most efficient datapath, so it is suitable for low-resolution video sequences or a small search range. In contrast, the SAD Tree architecture performs better when a high degree of parallelism must be supported. Some works introduce fast block matching algorithms into the hardware motion estimation engine design, such as [9], in which four-step search is optimized to be friendlier to VLSI implementation. The design in


literature [9] also complies with the aforementioned three design principles, and its computation engine uses the SAD Tree architecture proposed in [8].

Since the Propagate Partial SAD and SAD Tree architectures have played pivotal roles in hardwired engine design for H.264/AVC, research on how to further improve their performance is meaningful. In this paper, the authors enhance the clock speed and hardware efficiency of Propagate Partial SAD and SAD Tree via architecture-level and circuit-level optimizations.

The rest of the paper is organized as follows. Section II describes the optimizations of the Propagate Partial SAD and SAD Tree architectures. A detailed performance analysis of the proposed architectures is given in Section III. Finally, conclusions are drawn in Section IV.

II. CIRCUITS OPTIMIZATIONS

The motion estimation algorithm is processed in three steps. First, the absolute difference is calculated for each pixel of the current macroblock. Second, the sum of absolute differences (SAD) of all pixels in the current macroblock is calculated for every search position. Third, the search candidate with the minimum SAD value is designated as the final motion vector. The procedure is shown in (1) and (2).

$$SAD(m, n) = \sum_{k=0}^{W-1} \sum_{l=0}^{H-1} |R(m+k, n+l) - C(k, l)| \tag{1}$$

$$SAD_{min} = \min(SAD(m, n)), \quad m \in [0, M-1],\ n \in [0, N-1] \tag{2}$$

W and H are the width and the height of the current image block, respectively. M and N are the width and the height of the search window, respectively. C(k, l) denotes the pixel values of the current block and R(m + k, n + l) denotes the pixel values of the reference frame. (m, n) represents the motion vector.

According to the analysis in [8], the Propagate Partial SAD and SAD Tree architectures outperform their counterparts in terms of gate count, hardware utilization, memory bandwidth and system latency. The efficient datapath of Propagate Partial SAD and SAD Tree mainly comes from two features: first, the redundant propagation registers are eliminated; second, both architectures fully capitalize on multi-operand adder trees. A single Propagate Partial SAD engine has a higher clock speed and a lower hardware cost than a SAD Tree engine. However, because the register array that buffers the reference pixels in SAD Tree can be shared by multiple neighboring engines, SAD Tree consumes less hardware than Propagate Partial SAD when parallelism is considered. Therefore, these two designs target different application specifications.

The optimizations of Propagate Partial SAD are discussed in Section II-A. The authors remove the redundant propagation registers for the 4 × 4 partial SADs and improve the circuits of the process element. Compared with the original implementation, these approaches yield a 14.7% hardware cost reduction. The authors further optimize the architecture of the 4 × 4 process element group (PEG). At high clock speeds, an additional gate count reduction of up to 4.8k is achieved via this method. The main drawback of the original SAD Tree architecture is its low execution speed. With the optimized 2-stage pipeline architecture proposed in Section II-B, the maximum clock speed of SAD Tree is improved to 204.8MHz at a hardware cost of merely 88.5k gates.

A. Circuits Optimizations for Propagate Partial SAD Architecture

Fig. 2. Proposed Propagate Partial SAD hardware architecture (16 × 16 process element array, Row 0 to Row 15, fed by 16x1 + 16x1 reference pixels with each pixel broadcast to a 1x16 PE line; components: Processing Element, 4x4 PE Group, Row Accumulator, Shift Register for 4x4 Partial SAD, Shift Register for 8x8 Partial SAD; outputs: the block SADs B4x4_YX, B4x8_YX, B8x4_YX, B8x8_YX, B16x8_Y0, B8x16_0X and B16x16_00)
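As a software point of reference for the hardware discussed below, Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the function names `sad` and `full_search`, the toy pixel data and the row-major indexing `ref[n + l][m + k]` are assumptions made for this example.

```python
def sad(cur, ref, m, n):
    """Eq. (1): SAD of the W x H block `cur` against `ref` at offset (m, n)."""
    height, width = len(cur), len(cur[0])
    return sum(abs(ref[n + l][m + k] - cur[l][k])
               for l in range(height) for k in range(width))


def full_search(cur, ref, M, N):
    """Eq. (2): evaluate all M x N candidates, return (SAD_min, m, n)."""
    return min((sad(cur, ref, m, n), m, n)
               for n in range(N) for m in range(M))


# Hypothetical toy data: an 8 x 8 reference area and a 4 x 4 current block
# copied from offset (m, n) = (3, 2), so that candidate must give SAD = 0.
ref = [[(7 * x + 13 * y * y + x * y) % 256 for x in range(8)] for y in range(8)]
cur = [row[3:7] for row in ref[2:6]]
best_sad, best_m, best_n = full_search(cur, ref, M=5, N=5)
```

The hardwired engines evaluate the same sums, but they compute the per-pixel absolute differences in parallel and reuse the 4 × 4 results for the larger partitions instead of recomputing them.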

1) System Architecture Optimization: The hardware architecture of the authors' Propagate Partial SAD design is shown in Fig. 2. The dataflow of the proposed design is similar to that of the original Propagate Partial SAD architecture. The current pixels are stored in the process element array. Two sets of 16×1 reference pixels, represented by the vertical dashed lines in Fig. 2, are broadcast to the process element array. The architecture is composed of 16 × 16 process elements, which are partitioned into sixteen 4 × 4 process element groups. The process elements in the ith row compute the 16 × 1 distortions of the ith row of pixels in the current macroblock. Each 4 × 4 process element group is in charge of calculating the SAD of the corresponding 4 × 4 block. Note that in any one cycle, the partial SADs in these sixteen rows belong to sixteen vertically adjacent search positions. For example, when the process elements in row 0 are calculating the distortions of candidate [m, n], the distortions derived by the process elements in row 1 belong to the search candidate [m, n − 1]. Consequently, propagation registers are required to synchronize the row partial SADs. For instance, in a 4 × 4 process element group, the distortions generated by the 4 × 1 process elements are summed with the partial SAD propagated from the upper row stage, and the result is then propagated vertically to the next row stage.

At the outputs of process element row 7 (the last row of the upper half), the SADs of the 4×4 blocks in the upper half of the current macroblock, B4×4_YX (Y: 0-1, X: 0-3), are fed into the first-stage adder trees. Based on the SADs of these small blocks, the SADs of the larger blocks, including B4×8_0X (X: 0-3), B8×4_YX (Y: 0-1, X: 0-1) and B8×8_0X (X: 0-1), are derived through the adder trees. In order to calculate the distortions of B16×8_00, B8×16_00, B8×16_01 and B16×16_00, the SADs of B8×8_00 and B8×8_01 are propagated vertically through dedicated delay registers. At the last pipeline stage, they are summed with the SADs of B8×8_10 and B8×8_11 to derive the distortions of B16×8_00, B16×8_10, B8×16_00, B8×16_01 and B16×16_00 via the final-stage adder tree.

In contrast, the original Propagate Partial SAD architecture [8] propagates all 4 × 4 partial SADs through the pipeline, and at the last pipeline stage they are summed to generate the SADs of the larger blocks. Compared with the original design, the proposed architecture propagates only the 8 × 8 partial SADs of the upper half of the current macroblock through the last eight pipeline stages. Consequently, the hardware cost of data propagation is reduced. In detail, the original 4 × 4 SAD delay registers for the upper half in the last eight stages are replaced by the 8 × 8 SAD delay registers indicated in Fig.2. As a result, 560 bits of registers are saved by the authors' architecture.

Another advantage of the proposed design is that the number of operands of the last-stage adder tree is decreased, so the critical path delay is shortened. In the original design, the critical path lies in the adder tree of the last stage. Based on the SADs of the sixteen 4×4 blocks, this adder tree derives the SADs of all other blocks, so the adder tree has sixteen inputs. In the optimized design, eight 4×4 block SADs and two 8×8 block SADs are used as the operands, so the operand number is reduced to ten. In this way, the adder tree circuit is simplified and the critical path delay is reduced.

The hardware cost reduction of the adder tree component achieved by the proposed architecture is illustrated in Fig.3. Note that the adder tree gate count of the proposed architecture is the total cost of the three adder trees shown in Fig.2. When the working clock frequency is equal to or greater than 140MHz, 724 to 1203 gates are saved by the authors' approach.
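The row-skewed dataflow described in this subsection, where row r works on the candidate that row 0 handled r cycles earlier, can be illustrated with a small cycle-level model. The sketch below is a simplification under stated assumptions: it models one column of vertical candidates only, broadcasts a single reference row per cycle rather than two 16×1 sets, and uses the hypothetical names `propagate_partial_sad` and `direct_sads` together with toy 4 × 4 data.

```python
def propagate_partial_sad(cur, ref_rows):
    """Cycle-level model of the vertical partial-SAD propagation.

    cur      : H rows of the current block, one row per PE row
    ref_rows : reference rows, one row broadcast to all PE rows per cycle
    Returns the SADs of vertical candidates 0 .. len(ref_rows) - H.
    """
    H = len(cur)
    regs = [0] * H  # regs[r]: partial SAD registered after row stage r
    out = []
    for t, ref_row in enumerate(ref_rows):
        # Every PE row compares its stored current row with the broadcast row.
        row_sad = [sum(abs(a - b) for a, b in zip(ref_row, cur[r]))
                   for r in range(H)]
        # Row r adds its distortion to what row r-1 registered one cycle ago;
        # both values belong to the same candidate, t - r.
        regs = [row_sad[0]] + [regs[r - 1] + row_sad[r] for r in range(1, H)]
        if t >= H - 1:  # candidate t - (H - 1) has now passed all H rows
            out.append(regs[H - 1])
    return out


def direct_sads(cur, ref_rows):
    """Reference result: each candidate summed directly, without pipelining."""
    H = len(cur)
    return [sum(abs(a - b)
                for r in range(H) for a, b in zip(ref_rows[n + r], cur[r]))
            for n in range(len(ref_rows) - H + 1)]


# Hypothetical toy data: a 4 x 4 current block and 7 reference rows.
cur = [[10, 20, 30, 40], [50, 60, 70, 80],
       [90, 100, 110, 120], [130, 140, 150, 160]]
ref_rows = [[(17 * y + 29 * x) % 256 for x in range(4)] for y in range(7)]
sads = propagate_partial_sad(cur, ref_rows)
```

After the pipeline fills, one complete SAD leaves the last row stage every cycle, which is why only registers, and no extra adders, are needed to align the row results.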

Fig. 3. Comparisons of adder tree hardware cost of the original architecture and the proposed one (gate count in k gates versus frequency in MHz; curves: Original Adder Tree, Proposed Adder Tree). Synthesis conditions: TSMC 1P6M 0.18µm CMOS standard cell; worst-case conditions (1.62V, 125°C)
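The reduction from sixteen to ten final-stage operands can be checked with a short sketch. It assumes that `s[y][x]` holds the SAD of block B4×4_YX and that block names follow the width × height convention used in the text; the function names and the toy values are hypothetical, and the smaller 4×8 and 8×4 outputs are omitted for brevity.

```python
def larger_blocks_two_level(s):
    """Two-level scheme: upper-half 8x8 SADs are formed early and delayed."""
    # First-stage adder tree (upper half): two 8x8 SADs.
    b8_00 = s[0][0] + s[0][1] + s[1][0] + s[1][1]
    b8_01 = s[0][2] + s[0][3] + s[1][2] + s[1][3]
    # b8_00 and b8_01 then travel through the 8x8 delay registers.
    # Final-stage adder tree: eight lower-half 4x4 SADs plus the two
    # delayed 8x8 SADs, i.e. ten operands instead of sixteen.
    b8_10 = s[2][0] + s[2][1] + s[3][0] + s[3][1]
    b8_11 = s[2][2] + s[2][3] + s[3][2] + s[3][3]
    return {
        "B8x8_00": b8_00, "B8x8_01": b8_01,
        "B8x8_10": b8_10, "B8x8_11": b8_11,
        "B16x8_00": b8_00 + b8_01, "B16x8_10": b8_10 + b8_11,
        "B8x16_00": b8_00 + b8_10, "B8x16_01": b8_01 + b8_11,
        "B16x16_00": b8_00 + b8_01 + b8_10 + b8_11,
    }


def larger_blocks_flat(s):
    """Reference: every result summed directly from the sixteen 4x4 SADs."""
    def block(y0, y1, x0, x1):
        return sum(s[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return {
        "B8x8_00": block(0, 2, 0, 2), "B8x8_01": block(0, 2, 2, 4),
        "B8x8_10": block(2, 4, 0, 2), "B8x8_11": block(2, 4, 2, 4),
        "B16x8_00": block(0, 2, 0, 4), "B16x8_10": block(2, 4, 0, 4),
        "B8x16_00": block(0, 4, 0, 2), "B8x16_01": block(0, 4, 2, 4),
        "B16x16_00": block(0, 4, 0, 4),
    }


# Hypothetical 4x4 SAD values 1..16, used only to compare the two schemes.
s = [[4 * y + x + 1 for x in range(4)] for y in range(4)]
two_level = larger_blocks_two_level(s)
```

Both functions return identical values; the difference in hardware is only how many operands reach the last adder tree and how many bits must be delayed to get there.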

2) Circuits Optimization for Process Element: According to literature [10], the absolute difference operation can be expressed as

$$|R - C| = \begin{cases} R + \overline{C} + 1, & R > C \\ \overline{R + \overline{C}}, & R \le C \end{cases} \tag{3}$$

where R represents the reference pixel, C denotes the current pixel, the overline denotes bitwise inversion, and the results are taken on 8 bits. The intuitive hardware implementation of this algorithm is shown in Fig.4. The most significant bit (s8) of the output of the first 8-bit adder is inverted and then XORed bitwise with the remaining bits. s8 is then added to the XOR result to generate the final absolute difference between R and C. As there are 256 process elements, the final adders in these process elements not only consume non-trivial hardware but also increase the critical path delay. One approach is simply to eliminate this adder from the process element implementation. However, this causes an error of up to one in each process element; in the worst case, the accumulated error of all process elements is 256. In the authors' design, as shown in Fig.4, c_yx and abs_yx are both fed into the Row Accumulator. The addition of c_yx and abs_yx is implemented by the Carry-Save Adder (CSA) tree in the Row Accumulator, so the dedicated adder for c_yx and abs_yx in each process element is avoided.

Fig. 4. Circuits optimization of process element: intuitive PE circuits and optimized PE circuits (y, x indicate the process element label)
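The behaviour of the two process element variants can be checked bit-accurately in software. The following sketch is an illustration only: the function names `pe_optimized` and `pe_intuitive` are hypothetical, `c_inv` stands for the bitwise-inverted 8-bit current pixel, and the returned pair corresponds to the abs_yx and c_yx signals named in the text.

```python
def pe_optimized(r, c_inv):
    """Optimized PE: return (abs_yx, c_yx) without the final +1 adder.

    r     : 8-bit reference pixel
    c_inv : 8-bit bitwise-inverted current pixel
    """
    s = r + c_inv                 # 9-bit output of the 8-bit adder
    s8 = s >> 8                   # carry out, 1 exactly when R > C
    low = s & 0xFF                # bits s7..s0
    mask = 0x00 if s8 else 0xFF   # inverted s8 replicated over 8 bits
    return low ^ mask, s8         # abs_yx, c_yx


def pe_intuitive(r, c):
    """Intuitive PE: the same datapath followed by the adder abs_yx + c_yx."""
    abs_yx, c_yx = pe_optimized(r, ~c & 0xFF)
    return abs_yx + c_yx


# Example: R = 200, C = 50 gives abs_yx = 149 and c_yx = 1, i.e. |R - C| = 150.
example = pe_optimized(200, ~50 & 0xFF)
```

Dropping c_yx would underestimate the distortion by exactly one whenever R > C, which is the source of the worst-case accumulated error of 256 mentioned above; deferring the addition to the row accumulator keeps the result exact.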

In theory, |C − R| is equal to |R − C|, but the latter is preferred in the hardware implementation. During motion estimation, the current pixel data are constant, so the paths from the current pixel registers can be constrained as multi-cycle paths under a specific schedule. As shown in Fig.5, cycles 0-15 form the initialization stage of the current macroblock and the broadcasting of reference pixels is scheduled to start at cycle 16; therefore, a two-cycle delay is available as the setup time from the current macroblock pixels to the process element array. The relaxed timing constraints reduce the hardware cost of the current macroblock register file, especially at high clock speeds. Note that the bitwise-inverted current pixel (the complement of C), instead of C itself, is stored in the current macroblock registers, so the inverters for the current pixel in each process element are saved, as depicted in Fig.4. With m-parallelism, a total of m × 256 × 8 inverters are saved by this method.

Fig. 5. Timing diagram for Propagate Partial SAD with multi-cycle path delay

3) 4 × 4 Process Element Group Architecture Optimization: At clock speeds of 100MHz, 120MHz, 140MHz, 160MHz, 180MHz, 200MHz and 220MHz, synthesis results reveal that the sixteen 4 × 4 process element groups account for 63.1-67.2% of the gate count of the whole system. As the 4 × 4 process element groups contribute a significant portion of the hardware cost and this ratio increases with the clock speed, the authors provide an improved 4 × 4 process element group architecture, which brings a significant area reduction, especially at high clock speeds.

The original 4 × 4 process element group architecture is illustrated in Fig.6(a). The Row Accumulator is composed of a Carry-Save Adder and a final Carry-Propagate Adder. The Carry-Save Adder compresses the outputs (c_yx, abs_yx) of the process elements in the same row, together with the partial SAD fed in vertically, into Carry and Sum vectors. The Carry and Sum vectors are then added by the final Carry-Propagate Adder. The improved 4 × 4 process element group architecture is shown in Fig.6(b). Compared with the original design, the final Carry-Propagate Adder is removed from each Row Accumulator, and the Carry and Sum vectors from the Carry-Save Adder are propagated directly through the delay registers between the 4 × 1 process element rows. This approach brings two advantages: first, the hardware cost of the final Carry-Propagate Adder in each row is saved; second, the critical path in each row is significantly shortened, so at high clock speeds the synthesizer can implement the logic in these rows with slower but smaller components. The adverse effect is that the number of inter-stage registers increases by 40 for each 4 × 4 process element group. However, experimental results demonstrate that at high clock speeds the proposed architecture still obtains a significant area saving.
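The effect of removing the per-row Carry-Propagate Adder can be modelled on integers. The sketch below is a value-level illustration under stated assumptions: it uses word-wide 3:2 compression (`csa32`) rather than the bit-level 4:2 compressor cells of the actual design, ignores register bit widths, and the names `compress`, `peg_sad_carry_save` and `peg_sad_original` as well as the toy inputs are hypothetical.

```python
def csa32(a, b, c):
    """3:2 carry-save step on whole words: a + b + c == carry + s."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return carry, s


def compress(operands):
    """Reduce any number of operands to two vectors using 3:2 steps only."""
    ops = list(operands)
    while len(ops) > 2:
        carry, s = csa32(ops.pop(), ops.pop(), ops.pop())
        ops += [carry, s]
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]


def peg_sad_carry_save(abs_rows, c_rows):
    """Improved PEG: rows hand over two vectors; one final addition only."""
    v0, v1 = 0, 0
    for abs_row, c_row in zip(abs_rows, c_rows):
        # The abs_yx and c_yx values of one row are merged with the two
        # vectors registered by the previous row, with no carry propagation.
        v0, v1 = compress([v0, v1, *abs_row, *c_row])
    return v0 + v1  # single carry-propagate addition at the very end


def peg_sad_original(abs_rows, c_rows):
    """Original PEG: a full (carry-propagate) sum is formed in every row."""
    partial = 0
    for abs_row, c_row in zip(abs_rows, c_rows):
        partial = partial + sum(abs_row) + sum(c_row)
    return partial


# Hypothetical PE outputs for one 4x4 group (abs_yx values and c_yx bits).
abs_rows = [[12, 0, 254, 7], [3, 3, 3, 3], [100, 200, 50, 25], [0, 0, 0, 1]]
c_rows = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
sad_4x4 = peg_sad_carry_save(abs_rows, c_rows)
```

Both functions return the same 4 × 4 SAD; the carry-save version trades one extra registered vector per row for the removal of the slow carry chain, which matches the register increase and area saving reported in the text.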
It is well known that a 4:2 compressor based CSA tree has both speed and hardware cost advantages over the traditional 3:2 compressor based design [11]. Because the TSMC 0.18µm CMOS library provides a 4:2 compressor standard cell, the CSA tree in each row of the authors' design is mainly built from 4:2 compressors. Another merit is that a 4:2 compressor based design needs fewer wires than its 3:2 compressor based counterpart, which simplifies routing in the back-end design. The 4:2 compressor based row CSA trees adopted in the proposed Propagate Partial SAD 4×4 process element group are shown in Fig.7. Note that the addition of abs_yx and c_yx is merged into the compression operation of the row CSA.

Fig. 6. Architecture comparisons of 4 × 4 process element group (a) Original Architecture (b) Improved Architecture

Fig. 7. CSA trees of the improved 4×4 process element group of Propagate Partial SAD architecture. (a) CSA in row 0 (b) CSA in row 1 (c) CSA in row 2 (d) CSA in row 3

Hardware cost comparisons of the original process element

group architecture and the proposed one under various timing constraints are illustrated in Fig.8. When the clock speed is equal to or greater than 140MHz, the proposed architecture is more efficient than the original one. At a clock speed of 180MHz, 9.37% of the hardware cost of each 4 × 4 process element group is saved.

Fig. 8. Comparisons of 4 × 4 process element group hardware cost of the original architecture and the proposed one (gate count in k gates versus frequency in MHz; curves: Original 4 × 4 PEG, Proposed 4 × 4 PEG). Synthesis conditions: TSMC 1P6M 0.18µm CMOS standard cell; worst-case conditions (1.62V, 125°C)

B. Circuits Optimizations for SAD Tree Architecture

The original SAD Tree architecture [8] comprises 256 process elements, each of which is in charge of the distortion computation of the corresponding pixel in the current macroblock. In one cycle, the distortions of the 256 pixels at one search candidate are obtained by the process element array. Sixteen 4×4 block SADs are derived by summing the distortions of the corresponding pixels via an adder tree, and the SADs of the larger blocks are then obtained by summing the corresponding 4 × 4 block SADs. No partial SADs are stored in the original architecture, so there is no register overhead for partial SAD buffering. The main problem of the original design is its low operating frequency, caused by its long critical path. Experimental results show that with TSMC 0.18µm 1P6M CMOS technology under worst-case conditions (125°C, 1.62V), the maximum clock speed synthesized by SYNOPSYS Design Compiler is 134MHz, and the hardware cost grows rapidly as the timing constraints tighten.

To improve the working clock speed, the authors propose the 2-stage SAD Tree architecture depicted in Fig.9. In the first stage, the abs_yx and c_yx values of one 4 × 4 process element array are derived and dispatched to the 4:2 compressor based CSA module, which computes the Carry and Sum vectors of the 4 × 4 block SAD. The circuit of the 4:2 compressor based CSA in the first stage is illustrated in Fig.10; it is designed to receive and compress the abs_yx values as well as the c_yx values. The Carry and Sum vectors of all sixteen 4 × 4 blocks are stored in the interstage registers. In the second stage, using the Carry and Sum vectors stored in the interstage registers, the variable block size adder tree calculates the SADs of the 41 blocks. Note that buffering the Carry and Sum vectors of the 4 × 4 blocks instead of their SAD values balances the critical path delays of the two pipeline stages, which in turn improves the operating frequency and saves hardware cost.

Fig. 9. Proposed SAD Tree hardware architecture (stage 1: process elements and 4:2 compressor based CSA trees; registers for Carry and Sum vectors of 4×4 blocks; stage 2: VBS adder tree)

Fig. 10. 4:2 compressor based CSA circuits in proposed 2-stage SAD Tree

III. EXPERIMENTAL RESULTS

The performance of the approaches proposed in Section II for the Propagate Partial SAD and SAD Tree architectures is analyzed in this section. All hardware architectures were implemented in Verilog-HDL and synthesized with SYNOPSYS Design Compiler based on TSMC 0.18µm 1P6M CMOS technology. The worst-case conditions (1.62V, 125°C) were applied during synthesis. The hardware cost discussed in this section consists of the reference pixel buffer, the current macroblock buffer and the datapath for generating the 41 block SADs, which are the same test conditions as those applied in [8].

The performance comparisons of the proposed Propagate Partial SAD architectures were conducted with the following three test cases:
1) The original Propagate Partial SAD system architecture with the process element circuits optimization;
2) The proposed Propagate Partial SAD system architecture with the optimized process elements;
3) The proposed system architecture with the optimized process elements and the improved process element group architecture.

The hardware cost comparisons of these three cases under various timing constraints are depicted in Fig.11. Compared with the original design in literature [8], at 110.8MHz clock speed, the process element optimization brought about a 7.6k gate hardware reduction. Using the proposed system architecture, an additional 4.5k gate reduction was achieved. In total, compared with the original design in [8], 14.7% of the hardware cost was saved by the authors' approaches. At high clock speeds (Frequency ≥ 140MHz), the proposed 4 × 4 process element group architecture outperformed the original counterpart and began to contribute to the performance improvement. For instance, at 180MHz clock speed, an additional 4.8k gate hardware reduction came from the optimized process element group architecture. The maximum clock speed of the authors' Propagate Partial SAD was 231.6MHz at the cost of 84.1k gates.

Fig. 11. Hardware cost comparisons of Propagate Partial SAD architecture (gate count in k gates versus frequency in MHz; curves: Original Design in [8], PEopt, PEopt + Archopt, PEopt + Archopt + PEGopt. PEopt: process element optimization; Archopt: system architecture optimization; PEGopt: 4 × 4 process element group architecture optimization)

Fig. 12. Hardware cost comparisons of SAD Tree architecture (gate count in k gates versus frequency in MHz; curves: Original Design in [8], 1-stage+PEopt, 2-stage+PEopt)

The performance comparisons for the SAD Tree architecture, illustrated in Fig.12, contained the following two test cases:
1) The original SAD Tree architecture with the process element circuits optimization;
2) The proposed two pipeline stage architecture with the optimized process elements.

When Frequency ≤ 60MHz, the original architecture was preferred because it has no interstage register overhead. However, Fig.12 illustrates that the maximum clock speed of the one-stage architecture was 134.1MHz and that its hardware cost rose rapidly with the operating frequency. When a high clock speed (Frequency ≥ 80MHz) was specified, the proposed two pipeline stage architecture was favored. At 110.8MHz clock speed, compared with the design in [8], 16.0k gates were saved by the authors' proposals, that is, an 18.0% hardware cost reduction was achieved. The maximum clock speed of the proposed two-stage SAD Tree design was 204.8MHz at the price of 88.5k gates.

IV. CONCLUSIONS

System architecture level and circuit level optimizations are proposed in this paper to enhance the performance of the Propagate Partial SAD and SAD Tree variable block size motion estimation hardwired engines. The performance of the proposed architectures is verified by synthesis results with the TSMC 0.18µm 1P6M CMOS standard cell library. The Propagate Partial SAD architecture is improved by compressing the shift registers for partial SAD propagation and by optimizing the circuits of the process element and the architecture of the 4 × 4 process element group. Compared with the original design, up to 14.7% hardware reduction is achieved by the authors' approaches, and the maximum clock speed of the proposed Propagate Partial SAD architecture is 231.6MHz. A two pipeline stage architecture and 4:2 compressor based CSA circuits are adopted by the authors to improve SAD Tree. Consequently, 18.0% of the hardware cost is saved compared with the original SAD Tree design and its maximum clock speed is raised to 204.8MHz, which makes it more suitable for computation-intensive applications.

REFERENCES

[1] J. Ostermann, et al., “Video coding with H.264/AVC: Tools, performance, and complexity,” IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7–28, First Quarter 2004.
[2] L.-M. Po and W.-C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 313–317, June 1996.
[3] J. Y. Tham, et al., “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369–377, August 1998.
[4] W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Transactions on Image Processing, vol. 4, no. 1, pp. 105–107, January 1995.
[5] Z.-B. Chen, et al., “Fast integer-pel and fractional-pel motion estimation for H.264/AVC,” Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 264–290, April 2006.
[6] T.-C. Chen, et al., “Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 6, pp. 673–688, June 2006.
[7] Z.-Y. Liu, et al., “A 1.41W H.264/AVC real-time encoder SoC for HDTV1080p,” in 2007 IEEE Symposium on VLSI Circuits, June 2007, pp. 12–13.
[8] C. Y. Chen, et al., “Analysis and architecture design of variable block-size motion estimation for H.264/AVC,” IEEE Transactions on Circuits and Systems I, vol. 53, no. 3, pp. 578–593, March 2006.
[9] T.-C. Chen, et al., “Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 5, pp. 568–577, May 2007.
[10] J. Vanne, et al., “A high-performance sum of absolute difference implementation for motion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 7, pp. 876–883, July 2006.
[11] V. Oklobdzija, D. Villeger, and S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach,” IEEE Transactions on Computers, vol. 45, no. 3, pp. 294–306, March 1996.
