Bandwidth and Local Memory Reduction of Video Encoders Using Bit Plane Partitioning Memory Management
Yi-Nung Liu, Meng-Che Chuang, and Shao-Yi Chien
Media IC and System Lab, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University
MD-726, 1, Sec. 4, Roosevelt Rd., Taipei 106, Taiwan
[email protected]

Abstract— This paper presents a new memory management scheme for the reference frame buffer in a video encoder. The proposed BPPMM (Bit Plane Partitioning Memory Management) scheme reorganizes the pixel data so that different numbers of bit-planes can be accessed on demand. This technique is especially suitable for motion estimation with bit-truncation. Experiments show that when the BPPMM scheme is integrated into an H.264 hardware encoder, more than 46% of the local SRAM size and 31% of the external memory bandwidth can be saved with only slight quality degradation.

Fig. 1. Simplified block diagram of the ME module.

I. INTRODUCTION

Motion estimation (ME), which derives the temporal correspondence between two frames, is the heart of video coding systems. ME is used not only in video coding but also in frame rate up-conversion, de-interlacing, three-dimensional denoising, video analysis, and optical flow estimation. As the resolutions of video capturing and display devices grow, so does the computation of ME. Consequently, the performance, cost, and memory bandwidth of these video systems can be much improved by optimizing ME operations. Video coding systems are widely used video systems that contain an ME module and a motion compensation (MC) module. The H.264 [1] video encoding system is an example that illustrates the importance of motion estimation. The ME module takes the input macroblock (MB) as the current MB and takes the search range data from the reference frame memory as the reference data. The block matching operations between these pixel data are repeated until the best-matched block is found, which costs large chip area and high power consumption. Moreover, a large memory bandwidth is required to load data from the reference frame memory to the ME module. For these reasons, many fast algorithms and efficient architectures have been developed. Fig. 1 shows the simplified block diagram of an ME module. The reference frame data is stored in the external SDRAM. To reduce the memory bandwidth requirement of ME, an on-chip local SRAM is embedded in the ME engine as a cache memory. The tradeoff between local memory size and off-chip

978-1-4244-3828-0/09/$25.00 ©2009 IEEE

memory bandwidth, known as the memory reuse strategy, is analyzed by Chen et al. [2], and Tuan et al. [3] summarize the existing data-reuse schemes. Many fast algorithms also try to reduce the number of motion vector (MV) candidates. Another way to reduce the computation is bit-truncation [4]: even if the 8-bit data is directly truncated to 4 bits, the video quality is largely preserved when the results are derived from those partial pixels. It is possible to design a proprietary memory organization to store the reconstructed reference frames in the off-chip SDRAM, since they are accessed only by the deblocking filter (DF) and ME/MC. Existing hardware ME designs with the bit-truncation technique access the whole 8-bit pixel value and take only part of it to calculate the sum of absolute differences (SAD), which means some memory bandwidth is redundant. With a special memory management scheme for bit-truncation algorithms, this redundant bandwidth can be removed, and the size of the local SRAM can also be reduced. Therefore, in this paper, a new scheme named bit-plane partitioning memory management (BPPMM) is proposed. Note that the memory management scheme is orthogonal to memory reuse and fast algorithms; that is, other approaches can still be employed at the same time.

II. THE PROPOSED REFERENCE FRAME MANAGEMENT SCHEME


Authorized licensed use limited to: National Taiwan University. Downloaded on August 6, 2009 at 23:10 from IEEE Xplore. Restrictions apply.

Fig. 2. Conventional pixel data arrangement. (a) Line order. (b) Macroblock order.

Fig. 3. Proposed bit-plane based pixel data arrangement.

Since the reference frame memory is generated by the deblocking filter and used only by temporal prediction, the order of the pixel data in the external SDRAM can be freely decided by designers to fit the algorithm and architecture of temporal prediction.

A. Conventional Memory Management Strategy

Fig. 4. (a) Sub-block with 4×4 pixels. (b) Illustration of the proposed bit-plane partitioning memory management (BPPMM) scheme.

The drawback of macroblock order and BPPMM is the alignment problem: one may not always get exactly the required data within the same MB or SB. Fig. 5 shows the matched and mismatched conditions. When a mismatch occurs, more data accesses are required. Some of these accesses also introduce a small latency overhead, called a DRAM bank conflict. Fortunately, modern DRAMs usually have four independent banks, so a bank-interleaving technique can be used to reduce the number of bank conflicts.

III. IMPLEMENTATION

The proposed BPPMM scheme is integrated into an H.264 motion estimation system to demonstrate its efficiency. The H.264 encoder designed by Huang et al. [6], [7] is selected as


B. The Proposed BPPMM (Bit Plane Partitioning Memory Management) Scheme

The concept of the proposed BPPMM is similar to scalable video coding, where different levels of video quality can be achieved by decoding different amounts of the bitstream. In the BPPMM scheme, the reconstructed frame data is stored with the bit-planes grouped into several partitions, as shown in Fig. 3, where Y[7:6] denotes the two most significant bit-planes (bits 7 and 6) of the whole luminance image. The ME module can then access different numbers of bit-planes: the more bit-planes are accessed, the better the quality; the fewer bit-planes are accessed, the less on-chip local SRAM and memory bandwidth are required. The baseline version of BPPMM is shown in Fig. 4(b); it is based on a 32-bit bus system and a video processing system with 4×4 sub-blocks (SBs) and 16×16 MBs. The Y[7:6] data of all pixels in one SB is packed into one word, as shown in Figs. 4(a) and (b), where Y00[7:6] denotes the two most significant bits of pixel Y00. Within one MB, the 16 SBs are stored in raster-scan order, as shown in Fig. 3, and all the MBs in one frame are also stored in raster-scan order. Similarly, Y[5:4], Y[3:2], Y[1:0], U[7:0], and V[7:0] are stored in the external SDRAM after Y[7:6], as shown in Fig. 4(b). This scheme supports integer motion estimation (IME) with bit-truncation algorithms well, because the ME module can load only part of the bit-planes instead of loading all the bit-planes and truncating them in the processing units of ME. Note that each data order scheme has its own advantages and disadvantages.
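As an illustration, the packing step described above can be sketched in software (hypothetical helper functions, not the paper's hardware design): the two most significant bits of the 16 pixels in a 4×4 SB fit exactly into one 32-bit word.

```python
# A software sketch of the packing in Fig. 4 (illustrative helpers, not
# the paper's hardware): bits [7:6] of the 16 pixels in a 4x4 sub-block
# occupy exactly one 32-bit word (16 pixels x 2 bits).

def pack_bitplane_word(sb, hi, lo):
    """Pack bits [hi:lo] of each 8-bit pixel in a 4x4 sub-block
    (row-major list of 16 values) into one word."""
    nbits = hi - lo + 1
    word = 0
    for i, y in enumerate(sb):
        field = (y >> lo) & ((1 << nbits) - 1)
        word |= field << (i * nbits)
    return word

def unpack_bitplane_word(word, hi, lo):
    """Recover the partial pixels (bits [hi:lo], restored to their
    original significance) from a packed word."""
    nbits = hi - lo + 1
    mask = (1 << nbits) - 1
    return [((word >> (i * nbits)) & mask) << lo for i in range(16)]

sb = list(range(0, 256, 16))        # a toy 4x4 sub-block (16 pixels)
w = pack_bitplane_word(sb, 7, 6)    # one 32-bit word holding Y[7:6]
partial = unpack_bitplane_word(w, 7, 6)
```

The same helpers cover the Y[5:4], Y[3:2], and Y[1:0] partitions by changing the `hi`/`lo` arguments.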

There are two conventional pixel data arrangement strategies: "line order" and "macroblock order," which are illustrated in Fig. 2. In the line order scheme, the pixel data is stored in raster-scan order over the whole image, which is simple and easy to implement. On the other hand, when accessing a block of pixels, the macroblock order can improve the access time because the number of burst reads/writes to the external SDRAM is reduced. The SDRAM behavior and data mapping for video systems are studied by Yu et al. [5].
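For illustration, the two arrangements can be contrasted with hypothetical address calculations (the frame width and the flat byte addressing below are assumptions for the sketch, not from the paper):

```python
# Illustrative address mappings for the two conventional arrangements.
# Frame width and byte addressing are assumed; real mappings are
# design-specific.

W = 1280  # assumed frame width in pixels

def addr_line_order(x, y):
    """Raster-scan ('line order') address of pixel (x, y)."""
    return y * W + x

def addr_mb_order(x, y, mb=16):
    """'Macroblock order': MBs stored one after another in raster-scan
    order, with each MB's pixels raster-scanned internally."""
    mbs_per_row = W // mb
    mb_x, mb_y = x // mb, y // mb
    in_x, in_y = x % mb, y % mb
    return (mb_y * mbs_per_row + mb_x) * mb * mb + in_y * mb + in_x
```

Under line order, the 16 rows of one MB are spread across 16 distant SDRAM rows, while under macroblock order they occupy one contiguous span, which is why burst accesses are shorter in number.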

Fig. 5. Block alignment issue. (a) Matched boundary. (b) Mismatched boundary.


the target. It is a representative encoder design, and the basic architecture of most H.264 encoders is similar to this work. The ME engine in [6], [7] can be presented simply as in Fig. 7(a): the motion estimator contains an on-chip local SRAM, an integer motion estimation (IME) processing unit, and a fractional motion estimation (FME) processing unit. The off-chip bandwidth required by IME dominates the system bandwidth. This encoder adopts level C data reuse [3] to reduce the off-chip bandwidth. For each MB, the associated search range data is first loaded into the on-chip local SRAM, where 8-bit data is stored for each pixel. The IME processing unit, which is designed with the bit-truncation technique, accesses only 5 bits for each pixel to derive the integer MV. Next, the FME processing unit refines the MV to quarter-pel precision. It loads the search range data from the same on-chip local SRAM, where the full 8 bits are required for each pixel. The ME engine with the BPPMM scheme is shown in Fig. 7(b). The external SDRAM is now organized in the BPPMM scheme, and the on-chip SRAM is separated into two parts: SRAM0 stores N bits per pixel for bit-truncation IME, and SRAM1 stores the remaining (8-N) bits per pixel for FME; this is called the "N/(8-N)" partition in this paper. For each MB, the associated search range data is first loaded into the on-chip SRAM0, where N-bit data is stored for each pixel (N=5 achieves the same quality as [6]). The IME processing unit can then directly access the on-chip SRAM0 without truncation operations. After the integer MV is generated, the FME processing unit locally refines the MV to quarter-pel precision. When the required reference pixels overlap those used by IME, only the additional (8-N) bits need to be loaded into SRAM1 from the external SDRAM. However, when the required pixels do not exist in the on-chip SRAM0, the whole 8-bit data is loaded into both SRAM0 and SRAM1.
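The bit-truncation matching performed by the IME can be sketched as follows (an illustrative software model; the actual design computes SADs in hardware, and the helper names are not from the paper):

```python
# Sketch of SAD computation with bit-truncation (the idea of [4]): the
# IME matches blocks using only the N most significant bits of each
# pixel. With BPPMM these N bits are exactly what SRAM0 stores, so no
# truncation logic is needed in the processing units.

def truncate(pixel, n):
    """Keep the N most significant bits of an 8-bit pixel."""
    return pixel >> (8 - n)

def sad_truncated(cur, ref, n=5):
    """Sum of absolute differences over N-bit truncated pixels."""
    return sum(abs(truncate(c, n) - truncate(r, n))
               for c, r in zip(cur, ref))

cur = [120, 130, 125, 128]  # toy current-block pixels
ref = [118, 135, 120, 126]  # toy reference-block pixels
```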
With the BPPMM scheme, SBs with different numbers of bit-planes can easily be accessed on demand, and the on-chip SRAM size and memory bandwidth can be reduced, since the redundant memory storage for bit-truncation ME is avoided. For both the IME and DF modules of video encoding systems, BPPMM is a good solution because the boundaries of the accessed blocks always match those of the reference frames; there is only a little overhead for writing a macroblock into the SDRAM or reading it back. But when the whole encoding system is considered, the role of FME becomes critical. In the best case, the MV predictors of variable block size motion estimation (VBSME) in FME point to the same region, and the accessed blocks are aligned to the BPPMM block boundary, so the FME module does not need to load much additional data. For a 16×16 block, the minimum data required by FME is 22×22 pixels for fractional-pixel interpolation. With BPPMM, the data read from DRAM varies from 24×24 to 28×28 depending on the boundary match conditions, as shown in Fig. 5. If the reference data required by the MVs of different block sizes have no overlapping region and are mismatched to the SB boundaries, the additional memory accesses of the FME module will
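The range from 24×24 to 28×28 follows from aligning the 22×22 interpolation window to the 4×4 SB grid, which can be checked with a short sketch (illustrative helper names, not from the paper):

```python
# Why the fetch grows from 22x22 to between 24x24 and 28x28: with BPPMM,
# reads must cover whole 4x4 sub-blocks, so the 22-pixel interpolation
# window is widened to SB-aligned boundaries in each dimension.

SB = 4  # sub-block dimension in pixels

def aligned_span(start, length, step=SB):
    """Length of the smallest step-aligned span covering
    [start, start + length)."""
    lo = (start // step) * step                  # round start down
    hi = -((-(start + length)) // step) * step   # round end up (ceil)
    return hi - lo

# Depending on where the 22-pixel window starts within an SB, the
# aligned fetch per dimension is 24 or 28 pixels, hence regions from
# 24x24 up to 28x28.
spans = {aligned_span(off, 22) for off in range(SB)}
```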

Fig. 6. The motion vector difference analysis of VBSME (Night, 720p). (a) Horizontal direction. (b) Vertical direction.

be huge. Fortunately, this extreme condition rarely happens in real cases, as shown in Fig. 6. The motion vector difference in Fig. 6 is defined as the difference between the motion vector of the 16×16 block and the vectors of all other block sizes in the same MB. The statistical results show that the motion vector differences are usually very small: more than 90% of them are zero, and more than 95% are smaller than 1 pixel. Thus, different data for different block sizes seldom needs to be loaded.

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed BPPMM, Huang's H.264 encoder [6] is chosen as the reference platform. We assume the search range is ±64×±64, and VBSME with 4×4, 4×8, and 8×4 block sizes is disabled. In addition, the level C data reuse scheme [3] is employed. BPPMM allows the IME to take 4-bit (N=4, partition 4/4) or 6-bit (N=6, partition 6/2) data rather than 8-bit (N=8, partition 8/0). Partition 5/3 (N=5) is a special case in our analysis: some redundancy is introduced because the granularity of bit-plane accessing (two bit-planes) does not match this design. Table I shows the required on-chip local SRAM for different partitions of the BPPMM scheme, where partition 8/0 is the special case of BPPMM that is identical to the original reference design [6], [7]. With the BPPMM scheme, the on-chip memory size is greatly reduced; about 46% reduction is achieved with partition 4/4. Note that both partitions 6/2 and 5/3 provide the same quality as the reference design, because the reference encoder adopts 3-bit truncation for each pixel; therefore, a 35.5% on-chip memory size reduction is achieved without any quality degradation. For the 4/4 partition, the IME uses less data than the 8/0 partition; therefore, the quality drops slightly. The


Fig. 7. (a) ME engine with the conventional memory management scheme. (b) ME engine with the proposed BPPMM scheme.

Fig. 8. The RD-curves of different bit-truncation schemes. (a) Foreman (CIF). (b) Night (720p).

TABLE I
ON-CHIP LOCAL MEMORY USAGE ANALYSIS

Partition           | 8/0     | 6/2     | 5/3     | 4/4
IME SRAM (bits)     | 184,320 | 138,240 | 115,200 | 92,160
FME SRAM (bits)     | 0       | 3,768   | 5,652   | 7,536
Total SRAM (bits)   | 184,320 | 142,008 | 118,968 | 99,696
Reduction Ratio (%) | 0       | 23.0    | 35.5    | 45.9
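As a cross-check, the reduction and saving ratios reported in Tables I and II follow directly from the absolute figures in those tables (a sketch; the dictionary values below are copied from Tables I and II):

```python
# Cross-check of the ratio rows in Tables I and II: the "Reduction
# Ratio" and "Bandwidth Saving Ratio" rows follow from the absolute
# figures given in the tables, relative to the 8/0 baseline.

total_sram = {"8/0": 184320, "6/2": 142008, "5/3": 118968, "4/4": 99696}  # bits
avg_bw     = {"8/0": 18432,  "6/2": 15562,  "5/3": 17299,  "4/4": 12691}  # bits/MB

def ratios(figures, ndigits):
    """Percentage reduction of each entry relative to the 8/0 baseline."""
    base = figures["8/0"]
    return {p: round(100 * (base - v) / base, ndigits)
            for p, v in figures.items()}

sram_reduction = ratios(total_sram, 1)  # Table I, "Reduction Ratio (%)" row
bw_saving      = ratios(avg_bw, 2)      # Table II, "Bandwidth Saving Ratio" row
```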

associated PSNR curves for the different partitions are shown in Fig. 8. Table II shows the bandwidth analysis results: on average, BPPMM reduces the bandwidth by about 31% with partition 4/4. For some sequences with complicated motion, however, the VBSME of FME may access non-overlapping regions; in these extreme cases, BPPMM may introduce a bandwidth penalty.

TABLE II
OFF-CHIP BANDWIDTH ANALYSIS

Partition                    | 8/0    | 6/2    | 5/3    | 4/4
IME BW (bits/MB)             | 18,432 | 13,824 | 13,824 | 9,216
FME BW, Best Case (bits/MB)  | 0      | 1,152  | 2,304  | 2,304
FME BW, Worst Case (bits/MB) | 0      | 7,008  | 14,016 | 14,016
BW, Average Case (bits/MB)   | 18,432 | 15,562 | 17,299 | 12,691
Bandwidth Saving Ratio       | 0%     | 15.57% | 6.15%  | 31.15%

V. CONCLUSION

In this paper, a bit-plane based memory organization for ME is proposed. By using the bit plane partitioning of

reference frame memory, the local on-chip SRAM size can be reduced by more than 46%, and the off-chip memory bandwidth by 31%, with only slight quality degradation. The only drawback is that the bandwidth becomes non-constant. This bit plane partitioning of the reference frame memory also makes bit-plane hierarchical motion estimation possible. One is no longer restricted to loading 8-bit pixel data at a time, but can access different bit-planes on demand to optimize the ME memory usage together with other reuse schemes and fast algorithms. This scheme and its concept can also be adopted by other video analysis modules to reduce the on-chip SRAM size and external memory bandwidth.

REFERENCES

[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, 2003.
[2] C.-Y. Chen, C.-T. Huang, Y.-H. Chen, and L.-G. Chen, "Level C+ data reuse scheme for motion estimation with corresponding coding orders," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 553–558, Apr. 2006.
[3] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, Jan. 2002.
[4] Z.-L. He, C.-Y. Tsui, K.-K. Chan, and M. Liou, "Low-power VLSI design for motion estimation using adaptive pixel truncation," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 5, pp. 669–678, Aug. 2000.
[5] G.-S. Yu and T. Chang, "Optimal data mapping for motion compensation in H.264 video decoding," in Proc. IEEE Workshop on Signal Processing Systems (SiPS'07), Oct. 2007, pp. 505–508.
[6] Y.-W. Huang, T.-C. Chen, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, C.-S. Chen, C.-F. Shen, S.-Y. Ma, T.-C. Wang, B.-Y. Hsieh, H.-C. Fang, and L.-G. Chen, "A 1.3 TOPS H.264/AVC single-chip encoder for HDTV applications," in Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC'05), Feb. 2005.
[7] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, "Analysis and architecture design of an HDTV 720p 30 frames/s H.264/AVC encoder," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 673–688, June 2006.

