A Novel Storage Scheme for Parallel Turbo Decoder

Xiang He, HanWen Luo, HaiBin Zhang
EE Department, Shanghai Jiao Tong University, China
Email: [email protected]

Abstract—In this paper we discuss a novel storage scheme for simultaneous memory access in a parallel turbo decoder. The new scheme borrows its idea from vertex coloring in graph theory. Compared to a similar method that also uses an un-natural storage order [2], our scheme requires more memory blocks but allows a simpler configuration method when the code length changes, which can be implemented on-chip. The major bottleneck of our scheme is interconnection [3], since it puts no constraint on the interleaver. However, experiments show that, for a moderate decoding throughput (40~50 Mbits/sec), the hardware cost is still affordable with 3GPP's interleaver [4], 5 iterations and an 80~100 MHz system clock.

Keywords—parallel turbo decoding; vertex coloring

I. MEMORY MULTIPLE ACCESS PROBLEM

During FPGA implementation of a turbo decoder, a substantial amount of memory is assigned to store channel information and extrinsic information. A decoder using the parallel MAP algorithm contains multiple soft-input soft-output (SISO) modules [1], so parallel access to this storage is required. Translated into hardware design, this means that data required by different SISOs at the same time must not be stored in the same RAM block.

Fig. 1(a)(b) illustrates memory access during one iteration of turbo decoding, which is conceptually divided into two phases: (a) decoding against the 1st component code, and (b) decoding against the 2nd component code. During each phase, the trellis of the component code is divided into three segments, each taken care of by one SISO module. Suppose we avoid memory access contention in the 1st phase by storing the SISOs' outputs physically in 3 different RAM blocks. During the 2nd phase, however, the previously separate writing addresses from the SISOs are translated by the interleaver Π. Potentially they can end up on the same RAM, as shown in Fig. 1(b), and memory access contention still exists.

Designing the interleaving pattern wisely can prevent such collisions [1], and empirical results show these contention-free interleavers yield performance similar to conventional interleavers designed for serial implementation [5][6]. However, we notice they are all constructed in a semi-random fashion, so that at least part of the interleaver pattern must be stored explicitly, which is inconvenient when support for different code lengths is required. A solution is provided in [7], but it still requires explicit storage of the interleaver of the longest code; interleavers of other lengths are then obtained by pruning this longest pattern.

Other works try to solve this problem without redesigning the interleaver pattern. This is partially because in some applications the interleaving pattern is predefined as part of the standard [4]. Also, many excellent variable-length, real-time

0-7803-9152-7/05/$20.00 © 2005 IEEE


addressable interleavers exist, although they are not contention-free [8]. [3] proposes an architecture that buffers memory access requests when they point to the same address, but it requires a special hardware structure not available in today's FPGAs.

The idea of resolving memory access contention by storing data in an unnatural order was first introduced in [2], which proves that, in this way, P RAM blocks are adequate to support P concurrent memory accesses. It provides an "annealing procedure" that computes the memory storage order offline. In this paper we follow the idea of [2] but try to make the calculation of the storage order simple enough to be implemented on-chip. The new storage-order calculation borrows its idea from graph theory and is in essence a serial vertex coloring algorithm using a greedy heuristic [10]. Simulation with 3GPP's interleaver pattern shows that the solution given by this method requires a few more memory blocks than [2], but the storage-order calculation module can be implemented with simple logic, which reconfigures the decoder when the code length changes within O(10L) clock cycles, where L is the length of information bits. We also find that the major bottleneck of this un-natural order storage scheme is interconnection, in that the number of required tri-state buffers increases significantly for high throughput. However, we verify that for a moderate decoding throughput (40~50 Mbits/sec), the required number of tri-state buffers is still affordable with 5 iterations and an 80~100 MHz system clock.

The rest of the paper is organized as follows. Section II models concurrent memory access as a vertex coloring problem. Section III explains the resulting decoder architecture, the design of the vertex coloring algorithm in light of the interconnection bottleneck, and its hardware implementation. Finally, Section IV presents implementation results on a Xilinx Virtex II Pro xc2vp70 FPGA, testing the viability of our scheme.

Figure 1. Memory access in parallel Turbo decoder when (a) decoding the first component code (b) decoding the second component code

II. MEMORY ACCESS AS A VERTEX COLORING PROBLEM

Graph coloring was first used to solve the register allocation problem in compiler design [9]. The same principle can be borrowed to solve the memory access problem of Section I. It may be described as follows:

(1) Every stage of the trellis of the component code is modeled as a vertex. Any two vertices are connected by an (undirected) edge if and only if the data related to these two stages are accessed simultaneously by different SISO processors [5] in the parallel turbo decoding algorithm.

(2) Let each color represent a RAM block. As long as any two adjacent vertices are labeled with different colors, memory access will be collision-free. This is exactly the "vertex coloring" problem in graph theory.

(3) To use as few RAM blocks as possible, the number of colors in use should be minimized.

Principles (2) and (3) should be obvious. The graph construction of principle (1) is demonstrated with an example in Fig. 2, where there are 2 SISO processors and the trellis length is 8. Assume the interleaving pattern π(x) is 5,3,7,8,1,4,6,2 for x=1,2,…,8. During the 1st phase of one iteration, trellis stages requiring simultaneous access are paired as {1,5}, {2,6}, {3,7}, {4,8}, according to Fig. 2(a). During the 2nd phase, they are paired as {5,1}, {8,7}, {2,3}, {6,4}, according to Fig. 2(b). The resulting graph is Fig. 2(c). In the remainder of the paper, we call edges resulting from the 1st phase, drawn below the numbers, "low edges", and edges from the 2nd phase, drawn above the numbers, "high edges".
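The graph construction of principle (1) can be sketched in a few lines of Python. This is our own illustrative helper, not the paper's code; it assumes P SISOs sweeping contiguous trellis segments of length L/P in lockstep, with the 2nd-phase accesses given by the inverse interleaver:

```python
def build_conflict_graph(pi, P):
    """pi: 0-based interleaver, pi[x] = interleaved position of stage x."""
    L = len(pi)
    n = L // P                       # stages per SISO (assume P divides L)
    inv = [0] * L                    # inverse interleaver: inv[pi[x]] = x
    for x, y in enumerate(pi):
        inv[y] = x
    edges = set()
    for m in range(n):
        # "low edges": stages accessed together while decoding code 1
        nat = [p * n + m for p in range(P)]
        # "high edges": natural stage indices behind the interleaved
        # positions p*n + m accessed together while decoding code 2
        high = [inv[p * n + m] for p in range(P)]
        for group in (nat, high):
            for i in range(P):
                for j in range(i + 1, P):
                    edges.add(tuple(sorted((group[i], group[j]))))
    return edges

# The paper's example: L=8, P=2, interleaver 5,3,7,8,1,4,6,2 (1-based),
# expected pairs {1,5},{2,6},{3,7},{4,8} and {5,1},{8,7},{2,3},{6,4}.
pi = [x - 1 for x in [5, 3, 7, 8, 1, 4, 6, 2]]
for a, b in sorted(build_conflict_graph(pi, P=2)):
    print(a + 1, b + 1)              # print 1-based, as in the paper
```

Since the pair {5,1} appears in both phases, the resulting graph has 7 distinct edges.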

Since every SISO must support two input orders, every table has size 2⌈log2 χ⌉ · 2^⌈log2⌈L/P⌉⌉ bits (= 2 Kbits when χ = 10 and ⌈L/P⌉ = 256). During the "learning period" [2], each SISO consults its neighbors' tables, so the connections between the SISO processors and the tables may switch to the "dash-dot" lines in the figure. Data exchange between the SISO processors and the RAM stack is done via a "selection network", whose internals are shown in Fig. 4. The switches are implemented with tri-state buffers. We define χ_p as the number of output ports of the p-th switch. The input is copied to the port indicated by the control signal (drawn dotted), and all other ports are left in a high-impedance state. The port width of switch p (for read) is ⌈log2⌈L/P⌉⌉; the port width of switch' p (for write) is ⌈log2⌈L/P⌉⌉ + w, where w is the word length of the data. Thus the overall number of tri-state buffers is Σ_{p=1}^{P} (2⌈log2⌈L/P⌉⌉ + w)·χ_p.

Figure 3. Turbo decoder architecture, P=4, χ=6, L=20

Figure 2. Constructing the graph from the interleaving pattern when 2 SISO processors are used and the trellis has 8 stages

III. HARDWARE DESIGN

A. Decoder Architecture

For ease of explanation, the channel and extrinsic information of one trellis stage are simply called an "element". Let the number of stages in the trellis be L and the number of SISO processors be P, and suppose the graph is colored with χ colors. Access collisions can then be avoided by storing the L elements in χ memory blocks, each with a capacity of ⌈L/P⌉ elements, as follows: the data element of stage i (i=0,…,L−1) is stored in the c-th RAM block at position (i mod ⌈L/P⌉), where c is the stage's color.
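The placement rule above can be sketched as follows. This is a minimal illustration of ours, not the paper's code; the `place_elements` helper and the toy coloring are assumptions for the example:

```python
from math import ceil

# Sketch of the placement rule of section III.A: the element of stage i
# lives in RAM block color[i] at offset i mod ceil(L/P).
def place_elements(L, P, color):
    """color[i] = RAM block (color) assigned to trellis stage i."""
    n = ceil(L / P)
    placement = {}                   # (ram block, offset) -> stage index
    for i in range(L):
        slot = (color[i], i % n)
        assert slot not in placement, "two elements share one memory cell"
        placement[slot] = i
    return placement

# Toy example: L=8, P=2 (n=4); stages 0..3 colored 0 and 4..7 colored 1,
# a coloring that separates the phase-1 access pairs {i, i+4}.
coloring = [0, 0, 0, 0, 1, 1, 1, 1]
print(sorted(place_elements(8, 2, coloring).items()))
```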

Fig. 3 shows the turbo decoder's overall architecture. Only the extrinsic values of the information bits are stored with the new scheme; the channel values of the parity-check bits are still stored in the ordinary, natural order. The channel values of the information bits, which remain unchanged during decoding, are replicated twice and stored in natural and interleaved order respectively (see Fig. 8 in [6]). "x" indicates the part of a RAM occupied by a data element. The P tables store the color value c of the data elements used by each SISO.


Figure 4. Selection network with concurrent read and write support. Black lines: address bus; gray lines: data bus; dotted lines: control signals for switches and selectors. "delay p" and "delay' p" compensate the delay of reading table p; "delay'' p" compensates the delay of reading RAM q. Addresses range from 0 to ⌈L/P⌉ − 1.
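One read step through table lookup and selection network can be simulated as below. This is an end-to-end sketch under our own naming assumptions (`concurrent_read` and the toy coloring are not from the paper); it shows why a proper coloring makes the simultaneous requests land on distinct RAM blocks:

```python
# Hypothetical sketch of one phase-1 read step: each SISO requests the
# element of a stage k; the table lookup gives its color c, i.e. the RAM
# block holding stage k, and the switch routes the request to RAM c at
# offset k mod n.
def concurrent_read(stages, color, rams, n):
    targets = [color[k] for k in stages]       # one RAM block per request
    # a proper coloring guarantees simultaneous requests never collide
    assert len(set(targets)) == len(targets), "RAM access collision"
    return [rams[color[k]][k % n] for k in stages]

# Toy setup: L=8, P=2 (n=4); stages 0..3 colored 0, stages 4..7 colored 1,
# which separates the phase-1 access pairs {k, k+4}.
n = 4
color = [0, 0, 0, 0, 1, 1, 1, 1]
rams = [[None] * n for _ in range(2)]
for k in range(8):
    rams[color[k]][k % n] = "elem%d" % k       # placement rule, section III.A
print(concurrent_read([1, 5], color, rams, n))   # -> ['elem1', 'elem5']
```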

B. Graph coloring algorithm

In practice we find the number of tri-state buffers can be overwhelming: it equals 1600 when L=2048, P=8, w=9 and χ_p ≡ P, which is about 10% of all available tri-state buffers on a Xilinx xc2vp70 FPGA (7 million gates). We also notice that if the decoder's latency is fixed, which means 2⌈log2⌈L/P⌉⌉ + w remains constant, the number of tri-state buffers increases with P at the speed of O(P²). This rapid increase in hardware consumption is mainly due to the "time-varying" nature of the switches and is called the "interconnection bottleneck" in some literature [3].
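As a sanity check (our own script, not from the paper), the cost formula of Section III.A can be evaluated directly; it reproduces the count of 1600 quoted above and shows the O(P²) growth at fixed latency:

```python
from math import ceil, log2

# Evaluate sum_{p=1..P} (2*ceil(log2(ceil(L/P))) + w) * chi_p.
def tristate_buffers(L, P, w, chi):
    addr = ceil(log2(ceil(L / P)))          # address-bus width per port
    return sum((2 * addr + w) * c for c in chi)

# The paper's operating point: L=2048, P=8, w=9, chi_p == P everywhere.
print(tristate_buffers(2048, 8, 9, [8] * 8))        # -> 1600

# With 2*ceil(log2(L/P)) + w held constant (L/P fixed at 256), the count
# quadruples every time P doubles, i.e. it grows as O(P^2):
for P in (4, 8, 16):
    print(P, tristate_buffers(256 * P, P, 9, [P] * P))
```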

Here the χ_p for each p is taken as its maximum value over the supported code lengths. Calculation with Σ_{p=1}^{P} (2⌈log2⌈L/P⌉⌉ + w)·χ_p shows that 15% of the tri-state buffers are saved.

Designing the coloring scheme properly can alleviate this problem. We do so by restricting the number of colors seen by each SISO processor while it decodes the two component codes: if χ_p can be restricted, the total tri-state buffer consumption decreases. The resulting "reordered first fit" algorithm is described as follows. Let n = ⌈L/P⌉ and let π(x) be the interleaved index of x; for simplicity, we assume P divides L. Let A_p be the set of indices of the elements processed by SISO processor p:

A_p = { pn + m : m = 0…n−1 } ∪ { π⁻¹(pn + m) : m = 0…n−1 }

For p = 0…P−1, color the vertices in set A_p as follows: examine each vertex's adjacent vertices and record all colors (if any) already used by them; the smallest color not used by the adjacent vertices becomes the vertex's color.

The only difference between our algorithm and the canonical first fit algorithm [10], which could also be used here, is that the latter colors vertices in order from index 0 to L−1, while our algorithm colors the vertices in set A_0 first, then those in A_1, and so on. Thus every vertex is colored twice. In theory this does not increase coloring latency significantly if some complexity is added to jump over already-colored vertices.

The reordered first fit algorithm and its canonical version are compared in terms of the number of tri-state buffers and χ in Fig. 5, where L/P is kept constant at 256 and P is increased from 4 to 16. The number of tri-state buffers is calculated as Σ_{p=1}^{P} (2⌈log2⌈L/P⌉⌉ + w)·χ_p with w=9 and ⌈L/P⌉=256. We also provide the estimated number of tri-state buffers for [2], whose results imply χ_p ≡ P. According to Fig. 5(a), the reordered first fit algorithm saves 12%~20% of the tri-state buffers compared to the canonical first fit algorithm; for P=16 it even uses fewer tri-state buffers than [2]. According to Fig. 5(b), our coloring scheme uses 2~5 more RAM blocks than [2]. In practice this additional cost can be alleviated by implementing the last RAM block with slices, since it usually hosts fewer than 16 elements.

The reordered first fit algorithm makes the connection network "irregular", in that the χ_p differ from each other. When different code lengths are supported, we must ask whether connections saved under one code length may reappear under another. That this is unlikely is ensured by the following two observations:

1) Color indices are assigned to A_p sequentially. That is, if color index i appears in A_p, all colors with index smaller than i appear in A_p as well.
2) χ_p generally increases monotonically with p because of the greedy nature of our algorithm. This is confirmed by Fig. 6.
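The reordered first fit procedure can be sketched as follows. This is a minimal software model of ours (0-based indices, P dividing L, the conflict graph given as an adjacency dict are our layout assumptions), run on the paper's L=8, P=2 example:

```python
# "Reordered first fit": visit the vertices of A_0, then A_1, ...,
# assigning each vertex the smallest color unused by its neighbours.
def reordered_first_fit(adj, L, P, inv_pi):
    n = L // P
    color = {}
    for p in range(P):
        # A_p: stages SISO p touches in either phase (natural + interleaved)
        A_p = [p * n + m for m in range(n)] + \
              [inv_pi[p * n + m] for m in range(n)]
        for v in A_p:                   # a vertex may be visited twice
            used = {color[u] for u in adj[v] if u in color}
            color[v] = min(c for c in range(len(adj[v]) + 1) if c not in used)
    return color

# Paper's example: interleaver 5,3,7,8,1,4,6,2 (1-based); its conflict
# graph has low edges {1,5},{2,6},{3,7},{4,8} and high edges {8,7},{2,3},
# {6,4} (the pair {5,1} duplicates a low edge); all 0-based below.
pi = [x - 1 for x in [5, 3, 7, 8, 1, 4, 6, 2]]
inv = [0] * 8
for x, y in enumerate(pi):
    inv[y] = x
edges = [(0, 4), (1, 5), (2, 6), (3, 7), (6, 7), (1, 2), (3, 5)]
adj = {v: set() for v in range(8)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
color = reordered_first_fit(adj, 8, 2, inv)
assert all(color[a] != color[b] for a, b in edges)   # proper coloring
print(color, "colors used:", len(set(color.values())))
```

Re-coloring a vertex on its second visit stays legal, because the new color is again chosen to avoid all colored neighbours.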


Figure 5. (a) Tri-state buffer consumption and (b) RAM block consumption under different vertex coloring schemes. L/P = 256.

Figure 6. Comparison of χ_p for different coloring schemes when supporting 4 code lengths: L = 512, 1024, 2048, 4096; P = 8

C. Hardware design of the graph coloring algorithm

The "config" module in Fig. 3 computes the data for the P tables when the code length changes. At that time, the ports of the tables are switched to the connections shown in Fig. 7. Assume the configuration of one table entry takes N clock cycles. The first two cycles are used to read the colors of neighboring vertices from the tables: one for the neighbors seen via "high edges", the other for the neighbors connected via "low edges". The read ports are then idle during the next N−2 cycles, because the calculation for a new vertex can only start after the previous result has been written back into the tables. The "color calculator" picks the color for the vertex, which is written to the table indicated by the "write control" module if necessary. For simplicity we simply color every vertex twice, so the computation for all table entries takes O(2NL) clock cycles. The address buses addra and addrb are calculated as follows:

for p = 0…P−1
  for k ∈ A_p
    for t = 0…N−1
      addra = k mod ⌈L/P⌉,                       if t = 0
              π(k) mod ⌈L/P⌉ + 2^⌈log2⌈L/P⌉⌉,    otherwise
      addrb = ⌊k / ⌈L/P⌉⌋,                       if t = 0
              ⌊π(k) / ⌈L/P⌉⌋,                    otherwise

We note that the reordered first fit algorithm requires both π(x) and π⁻¹(x) to be real-time addressable, which can be inconvenient for some interleaver designs. If only π(x) or π⁻¹(x) is real-time addressable, the canonical first fit algorithm can be used instead.

The internals of the "color calculator" are shown in Fig. 8. An additional bit in every table entry indicates whether the entry has been initialized. The correct value of this bit for an initialized entry is given by the "flag" signal, which is flipped after every configuration round. Data read from the tables are first validated against this flag bit and then translated by the subsequent decoders (which output 1<<c for color index c).

IV. IMPLEMENTATION RESULTS

To test the viability of our storage scheme, we implement the selection network (Fig. 4) with on-chip reconfiguration capability (Fig. 7, Fig. 8) on a Xilinx Virtex II Pro FPGA (xc2vp70, 7 million gates). The cost summary after placement and routing is shown in Table I. Code lengths of 512, 1024, 1536 and 2048 are supported, with P=8. The latency of the coloring scheme is 20L clock cycles, i.e. 512 µs for an 80 MHz system clock and L=2048. The decoding throughput at this clock and code length is 45 Mbits/sec with 5 iterations.

V. CONCLUSION

We conclude that the new storage scheme is usable for a turbo decoder with moderate throughput (40~50 Mbits/sec). Like [2], the scheme supports arbitrary interleaving patterns, which may not be realizable with a parallel interleaver [1]. Moreover, our method offers a simpler configuration algorithm convenient for on-chip configuration, which does not require iterative adjustments as in [2]'s annealing method.

Figure 7. Configuration module. Gray lines: data bus; black lines: address bus; dotted lines: write-enable signals. The delay module compensates the latency of the "color calculator"; the tables store the color of every "element". addra and addrb are given in Section III.C.

Figure 8. Color calculator; the decoder outputs 1<<c for color index c.
TABLE I. HARDWARE COST OF THE STORAGE SCHEME WITH RECONFIGURATION CAPABILITY ON XC2VP70 FPGA

Number of RAMB16s:   19 out of 328    (5%)
Number of SLICEs:   653 out of 33088  (1%)
Number of TBUFs:   1976 out of 16544  (11%)

REFERENCES

[1] A. Nimbalker, T. K. Blankenship, B. Classon, T. E. Fuja, and D. J. Costello, Jr., "Contention-free interleavers," Int. Symp. on Information Theory, June 2004.
[2] A. Tarable, G. Montorsi, and S. Benedetto, "Mapping of interleaving laws to parallel turbo decoder architectures," Proc. 3rd Int. Symp. on Turbo Codes and Related Topics, Brest, France, Sep. 2003, pp. 153-156.
[3] M. J. Thul, F. Gilbert, and N. Wehn, "Optimized concurrent interleaving architecture for high-throughput turbo decoding."
[4] NTT DoCoMo, Nortel Networks, and SAMSUNG Electronics Co., "Updated text proposal for turbo code internal interleaver," TSGR1#6(99)927.
[5] J. Kwak and K. Lee, "Design of dividable interleavers for parallel decoding in turbo codes," IEE Electronics Letters, July 8, 2002.
[6] R. Dobkin, M. Peleg, and R. Ginosar, "Parallel VLSI architecture and parallel interleaver design for low-latency MAP turbo decoders," IEEE.
[7] L. Dinoi and S. Benedetto, "Variable-size interleaver design for parallel turbo decoder architectures," available online at http://www.commgroup.polito.it/Papers/files/PID41289.pdf
[8] S. Crozier and P. Guinand, "High-performance low-memory interleaver banks for turbo codes," available online at www.crc.ca/en/html/fec/home/publications/papers/CRO01_VTCFall_intbanks.pdf
[9] F. Mueller, "Register allocation by graph coloring: a review," available online at http://moss.csc.ncsu.edu/~mueller/ftp/pub/PART/color.ps.Z
[10] A. H. Gebremedhin and F. Manne, "Scalable parallel graph coloring algorithms," available online at www.ii.uib.no/~assefaw/pub/thesis/paper1.pdf

