
Low-Complexity Shift-LDPC Decoder for High-Speed Communication Systems

Chuan Zhang, Li Li, and Jun Lin
Institute of VLSI Design, Nanjing University
LAPEM, Nanjing University
Nanjing, Jiangsu 210093, China
Email: {chzhang, lili, junlin}@nju.edu.cn

Zhongfeng Wang
Broadcom Corporation
5300 California Avenue, Irvine, CA 92617, USA
Email: [email protected]

Abstract: In this paper, an efficient high-speed low-density parity-check (LDPC) decoder is presented. Single minimum decoding and non-uniform quantization schemes are explored to reduce the complexity of the computation core and the memory requirement. A shifting structure is incorporated to significantly reduce the routing complexity of the LDPC decoder. The implementation of an 8192-bit LDPC decoder demonstrates that about 63.3% hardware reduction can be achieved compared with the state-of-the-art design for high-speed LDPC decoding. It is also shown that, using SMIC 0.18 μm CMOS technology, 5.4 Gb/s decoding throughput can be obtained at 15 decoding iterations.

978-1-4244-2342-2/08/$25.00 ©2008 IEEE.

I. INTRODUCTION

Low-density parity-check (LDPC) codes have been adopted by high-speed communication systems [1] due to their near-Shannon-limit error-correcting capability [2]. To achieve the desired bit error rate (BER), longer LDPC codes with higher code rates are preferred in practice. However, long codes usually lead to a significant increase in hardware complexity, especially when the target throughput is very high [3]. Recently, Zhong et al. [4] designed an LDPC decoder for high-rate quasi-cyclic (QC) LDPC codes, which achieved 2.1 Gb/s throughput at 16 iterations. The required silicon area is 2.32 mm² in 65 nm CMOS technology. Sha et al. [5] presented a more efficient high-speed LDPC decoder design, which achieves a slightly higher data rate with a similar gate count while using a much older CMOS technology.

To further reduce hardware complexity, single minimum decoding and non-uniform quantization schemes are explored in this paper. A shifting structure is incorporated to minimize the routing complexity. The implementation of an 8192-bit LDPC decoder demonstrates that about 63.3% hardware reduction can be achieved compared to the state-of-the-art design for high-speed decoding. With SMIC 0.18 μm CMOS technology, 5.4 Gb/s decoding throughput can be obtained at 15 decoding iterations.

The remainder of this paper is organized as follows. A brief review of shift-LDPC codes is provided in Section II. In Section III, efficient decoding approaches and simulation results are presented. The design of the low-complexity high-speed LDPC decoder is presented in detail in Section IV. The implementation results and comparisons with other references are presented in Section V. Finally, Section VI concludes the paper.

II. REVIEW OF SHIFT-LDPC CODES

A. Shift-LDPC Codes

Figure 1 shows the parity check matrix H of a regular (N, M) (c, t) shift-LDPC code. It consists of c×t submatrices. The c submatrices in the leftmost block column are random L×L permutation matrices. The matrix P is an L×L permutation matrix for a single-step column shift.

$$
H = \begin{bmatrix}
H_{11} & P H_{11} & \cdots & P^{t-1} H_{11} \\
H_{21} & P H_{21} & \cdots & P^{t-1} H_{21} \\
\vdots & \vdots & & \vdots \\
H_{c1} & P H_{c1} & \cdots & P^{t-1} H_{c1}
\end{bmatrix}
$$

Figure 1. Parity check matrix of a regular shift-LDPC code.

The structure of the H matrix of shift-LDPC codes is well suited to high-speed LDPC decoder implementation. Moreover, it has been shown in [5] that shift-LDPC codes have decoding performance comparable to that of computer-generated random codes.

B. Message Scheduling for Decoding Shift-LDPC Codes

[Figure 2: diagram of the H matrix with rows labeled CPU_1 to CPU_{N-M}, columns labeled VPU_1 to VPU_L, and block columns labeled Cycle 1, Cycle 2, ..., Cycle t.]

Figure 2. Decoding schedule of a regular shift-LDPC code.

Figure 2 shows the message passing schedule for decoding a regular (N, M) (c, t) shift-LDPC code. One check node processing unit (CPU) performs message updating for one row of the H matrix. The messages corresponding to all c × L rows of the H matrix are processed in parallel. One variable node processing unit (VPU) completes the message updating associated with one column of the H matrix in one clock cycle. In total, L VPUs perform message updating in parallel. Thus, t clock cycles are needed to complete both the row message updating and the column message updating in one decoding iteration.
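The structure and schedule described in this section can be illustrated with a small sketch. It is not the authors' construction: it assumes that P is the cyclic shift of the identity with P[a][(a + 1) mod L] = 1, uses toy parameters far smaller than the (8192, 7168) code, and the names `shift_ldpc_H` and `schedule` are hypothetical. It builds H block by block as P^j H_i1 and checks that every CPU (row) meets exactly one VPU (column) in each of the t cycles.

```python
import random

def shift_ldpc_H(c, t, L, seed=0):
    """Toy (c, t)-regular shift-LDPC matrix; block (i, j) is P^j H_i1.

    Assumption: P is the cyclic shift with P[a][(a + 1) % L] = 1, so row r
    of P^j H_i1 equals row (r + j) mod L of H_i1.
    """
    rng = random.Random(seed)
    H = []
    for _ in range(c):
        perm = list(range(L))
        rng.shuffle(perm)              # H_i1 has its 1 in row r at column perm[r]
        for r in range(L):
            row = [0] * (t * L)
            for j in range(t):
                row[j * L + perm[(r + j) % L]] = 1
            H.append(row)
    return H

def schedule(H, t, L):
    """For each cycle k, list per CPU (row) the VPU indices it exchanges with."""
    plan = []
    for k in range(t):
        cols = range(k * L, (k + 1) * L)   # block column handled in cycle k
        plan.append([[v - k * L for v in cols if row[v]] for row in H])
    return plan

c, t, L = 3, 4, 5                      # toy sizes, hypothetical
H = shift_ldpc_H(c, t, L)
plan = schedule(H, t, L)
row_weights = {sum(row) for row in H}
col_weights = {sum(row[n] for row in H) for n in range(t * L)}
per_cycle = {len(edges) for cycle in plan for edges in cycle}
print(row_weights, col_weights, per_cycle)
```

Because every block is a permutation matrix, each row has weight t, each column has weight c, and each CPU receives exactly one message per cycle, which is why one iteration fits in t cycles with c × L CPUs and L VPUs.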

III. DECODING ALGORITHMS

To reduce the hardware implementation complexity, single minimum decoding and non-uniform quantization schemes are explored in this work. The simulated decoding performance of the target LDPC code under these schemes is then compared.

A. Single Minimum Scheme

Although the normal Min-Sum algorithm requires less implementation complexity than the Sum-Product algorithm, the computation and storage of the second minimum value still result in high hardware consumption [5]. To further reduce hardware complexity, the single minimum scheme [6] can be employed. The details of the algorithm are shown below. Here the log-likelihood ratio (LLR) $L(c_i)$ is defined as:

$$L(c_i) = \log\left(\frac{\Pr(c_i = 0 \mid y_i)}{\Pr(c_i = 1 \mid y_i)}\right), \tag{1}$$

where $c_i$ is the i-th bit of the transmitted codeword $c$ and $y_i$ is the i-th bit of the received word $y$. $\alpha_{ij}$ and $\beta_{ij}$ are the sign and magnitude of $L(q_{ij})$, respectively:

$$L(q_{ij}) = \alpha_{ij}\,\beta_{ij}, \tag{2}$$

$$\alpha_{ij} = \operatorname{sign}\left[L(q_{ij})\right], \tag{3}$$

$$\beta_{ij} = \left|L(q_{ij})\right|. \tag{4}$$

The check-to-variable and variable-to-check messages are denoted $L(r_{ji})$ and $L(q_{ij})$, respectively. In Step 1, $N$ denotes the number of minimum values, $\alpha = 0.68$ is the scaling factor, and $\beta = -0.125$ is the offset factor.

The Single Minimum Min-Sum Algorithm

Initialization: $L(q_{ij}) = y_i$ for $i = 0, 1, \ldots, N-1$;
for $k = 0$ step 1 until $k_{\max}$ or $\hat{c}H^T = 0$ do
begin
  Step 1. $\beta_{\min} = \min_{i \in V_j} \beta_{ij}$;
    if $N = 1$ then
      $\beta'_{ij} = \begin{cases} \beta_{\min} - \beta & \text{if } \beta_{ij} = \beta_{\min} \\ \alpha \times \beta_{\min} & \text{otherwise} \end{cases}$;
    else $\beta'_{ij} = \alpha \times \beta_{\min}$;
    end
    $L(r_{ji}) = \prod_{i' \in V_j \setminus i} \alpha_{i'j} \cdot \beta'_{ij}$;
  Step 2. $L(q_{ij}) = L(c_i) + \sum_{j' \in C_i \setminus j} L(r_{j'i})$;
  Step 3. $L(Q_i) = L(c_i) + \sum_{j \in C_i} L(r_{ji})$;
  Step 4. for $i = 0, 1, \ldots, N-1$ do
      if $L(Q_i) < 0$ then $\hat{c}_i = 1$; else $\hat{c}_i = 0$;
    end
end
Output: decoded bit $\hat{c}_i$

B. Non-Uniform Quantization Scheme

For an LDPC decoder, the routing complexity as well as the memory requirement is linearly proportional to the word length of the soft messages. To further reduce the implementation complexity, it is desirable to use fewer quantization bits per soft message. However, a straightforward reduction of the word length usually leads to significant performance degradation, as shown in [7]. In this paper, an optimized non-uniform quantization scheme with 4 bits per soft message is presented, which achieves decoding performance comparable to that of the 6-bit-per-message quantization in the prior design [5].

Table I shows the conversion between 6-bit uniformly quantized messages and 4-bit non-uniformly quantized messages.

TABLE I. PROPOSED 4-BIT NON-UNIFORM QUANTIZATION

Range of Value      4-bit Non-uniform Quant
s00000~s00001       s000
s00010~s00011       s001
s00100~s00101       s010
s00110~s00111       s011
s01000~s01010       s100
s01011~s10000       s101
s10001~s11000       s110
s11001~s11111       s111

s denotes the sign bit of a soft message.

The non-uniformly quantized messages are used directly in the CPUs, which only find minimum values. In the VPUs, by contrast, both expansion and compression blocks are employed.
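The check node step of the single minimum scheme can be sketched in a few lines. This is a minimal floating-point illustration, not the authors' hardware datapath. It assumes that N in the algorithm counts the incoming magnitudes equal to the minimum, and it uses the scaling factor 0.68 and offset factor -0.125 quoted in the text. The names `single_min_check_update` and `n_min` are hypothetical.

```python
def single_min_check_update(q, alpha=0.68, beta=-0.125):
    """Check-to-variable messages for one check node.

    q holds the incoming variable-to-check messages L(q_ij) of that node.
    Only the first minimum is tracked; the edge that owns a unique minimum
    gets b_min - beta as a stand-in for the untracked second minimum.
    """
    signs = [-1 if x < 0 else 1 for x in q]
    mags = [abs(x) for x in q]
    b_min = min(mags)
    n_min = mags.count(b_min)          # assumed meaning of N in the algorithm
    total_sign = 1
    for s in signs:
        total_sign *= s
    out = []
    for s, m in zip(signs, mags):
        if n_min == 1 and m == b_min:
            mag = b_min - beta         # unique minimum: offset estimate
        else:
            mag = alpha * b_min        # every other edge: scaled minimum
        # total_sign * s is the product of the signs of all other edges
        out.append(total_sign * s * mag)
    return out

unique = single_min_check_update([1.5, -0.5, 2.0, -3.0])
tied = single_min_check_update([1.0, -1.0, 2.0])
print(unique, tied)
```

With a unique minimum of 0.5, the edge carrying it receives magnitude 0.625 and the others 0.34. When the minimum is tied, every edge receives 0.68 times the minimum, so no second-minimum register or comparison is needed.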

C. Simulation Results and Comparisons

The simulated performance of a regular (8192, 7168) (4, 32) shift-LDPC code with the discussed decoding schemes is shown in Figure 3. With 4-bit non-uniform quantization, the normal Min-Sum algorithm achieves decoding performance similar to that of the normal Min-Sum algorithm with (6:3) uniform quantization. The BER of the single minimum scheme with 4-bit non-uniform quantization suffers only a 0.03 dB performance loss.

[Figure 3: BER (10^-1 down to 10^-8) versus Eb/N0 (dB) from 3.2 to 4.2. Curves: 4-bit non-uniform, general min-sum; 4-bit non-uniform, single min; uniform (6:3), general min-sum.]

Figure 3. BER comparisons between different decoding algorithms for the (8192, 7168) (4, 32) shift-LDPC code.

IV. HARDWARE ARCHITECTURE OF SHIFT-LDPC DECODER

In this section, the hardware architecture of the shift-LDPC decoder using the proposed algorithms is developed. The overall decoder architecture for a regular (N, M) (c, t) shift-LDPC code is illustrated in Figure 4. As discussed in Section II, the structure is composed of L VPUs and c × L CPUs.

[Figure 4: block diagram. Labeled blocks: VPU_1 to VPU_L, Permutation Network, CPU_1 to CPU_{N-M}, CPU Communication Network.]

Figure 4. Decoder architecture of a regular shift-LDPC code.

Figure 5(a) shows the architecture of the CPU using the 4-bit non-uniform quantization scheme. The architecture of the CPU employing both the single minimum and 4-bit non-uniform quantization schemes is shown in Figure 5(b). The CPU executes the check-to-variable message computation and the magnitude comparison between the message from the VPU and the intermediate result of the row process. The old register, new register, and sign register store the row process result of the last iteration, the intermediate result of the row process, and the sign bits of the check-to-variable messages, respectively. Due to the adoption of the single minimum scheme and the non-uniform quantization scheme, the routing complexity and memory usage of the CPU are significantly decreased.

[Figure 5: block diagrams. Labeled blocks in both panels: Old Register, New Register, Sign Register, Compare, Index Set; inputs from the old and new registers of the prior CPU neighbor; outputs to the old and new registers of the succeeding CPU neighbor; message from VPU and message to VPU. Panel (a) carries min, 2nd-min, sign, and index and includes a Data Select block. Panel (b) carries min, sign, and index and includes a Data Computation block.]

(a) Architecture of CPU using 4-bit non-uniform quantization scheme.

(b) Architecture of CPU using single minimum scheme and 4-bit non-uniform quantization scheme.

Figure 5. Architecture of Check node Process Unit (CPU).

Figure 6 illustrates the architecture of the VPU, which adopts the same structure as that in [6]. To employ the non-uniform quantization scheme, an expansion block and a compression block are introduced in the VPU. Both blocks can be easily derived from the non-uniform quantization scheme. The overhead of this additional logic is small compared with the significant hardware reduction brought by the transformed Min-Sum algorithm. In this design, two pipeline stages are added to the VPU to shorten the critical path.

[Figure 6: block diagram. Inputs IN1 to IN4 and outputs OUT1 to OUT4, 4 bits each; an intrinsic information input; blocks labeled Exp, StoT, TtoS, and Comp on each path; bus widths of 4, 6, and 9 bits; a 1-bit sign output.]

Figure 6. Architecture of Variable node Process Unit (VPU).
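The compression and expansion blocks of the VPU can be sketched directly from the ranges in Table I. The compression thresholds below are taken from the table. The expansion values are not given in the text, so mapping each 3-bit code back to the lower bound of its range is purely an assumption made for illustration. The names `compress`, `expand`, `UPPER`, and `LOWER` are hypothetical, and messages are modeled as (sign bit, magnitude) pairs.

```python
import bisect

# Table I: upper and lower bounds of the 5-bit magnitude range of each 3-bit code
UPPER = [1, 3, 5, 7, 10, 16, 24, 31]
LOWER = [0, 2, 4, 6, 8, 11, 17, 25]

def compress(sign, mag):
    """6-bit sign-magnitude message (1 + 5 bits) -> 4-bit message (1 + 3 bits)."""
    if not 0 <= mag <= 31:
        raise ValueError("magnitude must fit in 5 bits")
    # index of the first range whose upper bound is >= mag
    return sign, bisect.bisect_left(UPPER, mag)

def expand(sign, code):
    """4-bit message -> 6-bit message.

    Assumption: each code expands to the lower bound of its Table I range;
    the paper does not state which representative value is used.
    """
    if not 0 <= code <= 7:
        raise ValueError("code must fit in 3 bits")
    return sign, LOWER[code]

codes = [compress(0, m)[1] for m in range(32)]
print(codes)
print(compress(1, 9), expand(1, 4))
```

Small magnitudes keep a fine step of 2 while large magnitudes share wide ranges, which matches the table's emphasis on resolving the small values that the minimum search depends on.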

V. IMPLEMENTATION RESULTS

The proposed shift-LDPC decoders are synthesized in SMIC 0.18 μm CMOS technology using Synopsys Design Compiler. Table II lists the implementation results of the decoders and comparisons with other references. Compared with [5], the proposed design with the 4-bit non-uniform quantization scheme saves 58.8% of the area while delivering more than twice the throughput. Furthermore, the proposed design that employs both the single minimum and 4-bit non-uniform quantization schemes saves about 63.3% of the area relative to [5] with only a small performance loss. The additional hardware reduction of the second approach is attributed to eliminating the second minimum. If the area is scaled to 65 nm for a fair comparison with [4], the two proposed designs achieve 157.1% higher throughput with less than half the hardware complexity. Our designs can also be expected to achieve a higher clock speed, and thus higher throughput, if a more advanced technology is employed. We therefore conclude that the proposed designs are well suited for very high speed communication systems.
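The percentages quoted in this section can be cross-checked against the Table II entries. The sketch below assumes that area scales with the square of the feature size. The paper does not state its scaling rule, but this rule reproduces the scaled areas listed in Table II. The helper name `scale_area` is hypothetical.

```python
def scale_area(area_mm2, from_nm, to_nm):
    """Scale silicon area between nodes, assuming area ~ (feature size)^2."""
    return area_mm2 * (to_nm / from_nm) ** 2

# Table II entries
area_I, area_II = 7.62, 6.79        # proposed designs, 180 nm, mm^2
area_ref5 = 18.5                    # design of [5], 180 nm, mm^2
area_ref4 = 2.32                    # design of [4], 65 nm, mm^2
tput_proposed, tput_ref4 = 5.4, 2.1 # Gb/s

saving_I = 100 * (1 - area_I / area_ref5)        # area saved versus [5]
saving_II = 100 * (1 - area_II / area_ref5)
gain_vs_ref4 = 100 * (tput_proposed / tput_ref4 - 1)

scaled_I = scale_area(area_I, 180, 65)
scaled_II = scale_area(area_II, 180, 65)
scaled_ref5 = scale_area(area_ref5, 180, 65)

print(round(saving_I, 1), round(saving_II, 1), round(gain_vs_ref4, 1))
print(round(scaled_I, 2), round(scaled_II, 2), round(scaled_ref5, 2))
```

Under this assumption both scaled areas come out below half of the 2.32 mm² reported for [4], consistent with the claim of less than half the hardware complexity.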

TABLE II. IMPLEMENTATION RESULTS AND COMPARISONS

Reference                 Proposed I*     Proposed II**             [4]             [5]
Code Length               8192            8192                      9216            8192
Code Rate                 7/8             7/8                       8/9             7/8
Technology                180 nm, 1.8 V   180 nm, 1.8 V             65 nm, 0.9 V    180 nm, 1.8 V
Quantization              4-bit           4-bit                     4-bit           6-bit
Algorithm                 Min-Sum         Min-Sum with single min   Min-Sum         Min-Sum
Frequency                 333 MHz         333 MHz                   300 MHz         150 MHz
Throughput                5.4 Gb/s        5.4 Gb/s                  2.1 Gb/s        1.8 Gb/s
Iteration                 15              15                        16              20
Area                      7.62 mm²        6.79 mm²                  2.32 mm²        18.5 mm²
Area (Scaled to 65 nm)    0.99 mm²        0.89 mm²                  2.32 mm²        2.41 mm²

* Proposed I employs the 4-bit non-uniform quantization scheme.
** Proposed II employs the single minimum and 4-bit non-uniform quantization schemes.

VI. CONCLUSION

A low-complexity shift-LDPC decoder architecture for high-speed communication systems has been proposed. Both the single minimum scheme and the non-uniform quantization scheme are explored to reduce hardware consumption while maintaining similar decoding performance. The two proposed designs for a regular (8192, 7168) (4, 32) shift-LDPC decoder, implemented in SMIC 0.18 μm CMOS technology, achieve a throughput of 5.4 Gb/s with areas of only 7.62 mm² and 6.79 mm², respectively, making them much more efficient than prior designs.

ACKNOWLEDGMENT

This work is jointly supported by the High-Tech Foundation of Jiangsu Province of China under Grant No. BG2005030, the National Natural Science Foundation of China under Grant No. 90307011, and the High-Tech Foundation of Guangdong Province of China under Grant No. 2006B50101003.

REFERENCES

[1] X.-Y. Shih, C.-Z. Zhan, C.-H. Lin, and A.-Y. Wu, “An 8.29 mm² 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 μm CMOS process,” IEEE J. Solid-State Circuits, vol. 43, pp. 672-683, March 2008.

[2] R. G. Gallager, “Low-density parity-check codes,” IEEE Trans. Inform. Theory, vol. 8, pp. 21-28, Jan. 1962.

[3] H. Sankar and K. R. Narayanan, “Memory-efficient sum-product decoding of LDPC codes,” IEEE Trans. Comm., vol. 52, pp. 1225-1230, Aug. 2004.

[4] H. Zhong, W. Xu, N. Xie, and T. Zhang, “Area-efficient min-sum decoder design for high-rate quasi-cyclic low-density parity-check codes in magnetic recording,” IEEE Trans. Magn., vol. 43, pp. 4117-4122, Dec. 2007.

[5] J. Sha, M. Gao, Z. Zhang, L. Li, and Z. Wang, “Efficient decoder implementation for QC-LDPC codes,” in Proc. Int. Conf. Commun. Circuits Syst., vol. 4, pp. 2498-2502, June 2006.

[6] Q. Wang, K. Shimizu, T. Ikenaga, and S. Goto, “A power-saved 1Gbps irregular LDPC decoder based on simplified min-sum algorithm,” in Proc. Int. Symp. Design, Automation and Test, pp. 1-4, April 2007.

[7] Z. Cui and Z. Wang, “Efficient message passing architecture for high throughput LDPC decoder,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 917-920, May 2007.