
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.

Optimized Software Implementation of a Full-Rate IEEE 802.11a Compliant Digital Baseband Transmitter on a Digital Signal Processor

Yiyan Tang, Lie Qian, and Yuke Wang
Department of Computer Science, University of Texas at Dallas
Richardson, TX 75080, USA
{yiyan, lqian, yuke}@utdallas.edu

Abstract: The explosive growth of 802.11-based wireless LANs has spurred interest in higher data rates and greater system capacity. Among the IEEE 802.11 standards, the OFDM-based 802.11a standard was defined to address the challenges of high speed and large system capacity. Hardware implementations are often used to meet the high-data-rate requirements of the 802.11a standard. Software-based solutions are attractive because of their lower cost, shorter development time, and greater flexibility. However, meeting the high-data-rate requirements of 802.11a in software remains a challenge. In this paper, we implement a software-based 802.11a digital baseband transmitter on the TI TMS320C64x DSP. The transmitter can operate at all data rates defined in the 802.11a standard and is compatible with the high-rate portions of the 802.11g standard. Two major optimizations are used in the software implementation to achieve the high data rate: 1) parallelizing the scrambler function and 2) concatenating the FEC encoder, puncturing, and interleaver functions. Experimental results show that the optimized software implementation on a single C64x DSP with a clock frequency of 1.0 GHz can operate at up to 136 Mbits/s. This is twice as fast as the previous software implementation at the same clock frequency.

Keywords: IEEE 802.11a, digital baseband transmitter, digital signal processor, software implementation

I. INTRODUCTION

Because of their low cost and high data rates, IEEE 802.11-based wireless local area networks (WLANs) are growing rapidly in popularity. The 802.11 family has three major physical layer standards: the Complementary Code Keying (CCK)-based 802.11b [1], the Orthogonal Frequency Division Multiplexing (OFDM)-based 802.11a [2], and the OFDM-based 802.11g [3]. The 802.11b standard uses the 2.4 GHz band and supports data rates of 1, 2, 5.5, and 11 Mbits/s. The 802.11a standard operates in the 5 GHz band with possible data rates of 6, 9, 12, 18, 24, 36, 48, and 54 Mbits/s. The 802.11g standard, released in 2003, operates in the 2.4 GHz band and supports all the data rates defined in the 802.11a and 802.11b standards. For the higher data rates, it uses the same OFDM technology as 802.11a, while backward compatibility is added to support the lower data rates of 802.11b [4].

IEEE Globecom 2005

To support the high-data-rate requirements of the 802.11a and 802.11g standards, designers have used application-specific integrated circuits (ASIC) [5][6] and field-programmable gate arrays (FPGA) [7][8]. However, hardware-based implementations often lack flexibility, and the hardware development cycle is onerous. Software-based implementations, on the other hand, enable elegant reuse of silicon area and dramatically reduce time-to-market through software modification. However, they are typically much slower than hardware implementations in comparable technologies. An existing software implementation of a fully compliant 802.11a full-rate digital baseband transmitter requires a 22-processor array running at a 1.0 GHz clock frequency to reach 54 Mbits/s [9].

Digital signal processors (DSPs) are a special class of processors optimized for signal-processing applications in communication systems. DSPs have been used to implement the 802.11a standard [10]. However, they support only limited data rates because they cannot exploit the global parallelism found at the application level [9]. Hence, developing a software implementation of the 802.11a standard on a DSP that meets the high-data-rate requirements remains a major challenge.

In this paper, we present a software-based 802.11a digital baseband transmitter implemented on the TI TMS320C64x DSP. The transmitter can operate at all data rates defined in the 802.11a standard and is compatible with the high-rate portions of the 802.11g standard. We introduce two major optimizations that exploit the parallelism within and between the individual functions of the transmitter, and thereby meet the high-data-rate requirements: 1) parallelizing the scrambler function and 2) concatenating the FEC encoder, puncturing, and interleaver functions.
Experimental results show that the optimized software implementation on a single C64x DSP with a clock frequency of 1.0 GHz can operate at up to 136 Mbits/s. This is twice as fast as the software implementation in [9] at the same clock frequency.

0-7803-9415-1/05/$20.00 © 2005 IEEE

The remainder of the paper is organized as follows. Section II introduces the digital baseband transmitter defined in the 802.11a standard. Section III describes the details of the optimized software implementation. Section IV presents the experimental results. Finally, Section V draws the conclusions.

II. 802.11A DIGITAL BASEBAND TRANSMITTER

The OFDM modulation scheme used in 802.11a distributes the data over 52 subcarriers in a 20 MHz channel to mitigate the effects of multipath. Of the 52 subcarriers, 48 carry data and 4 carry pilot signals used for tracking. Each subcarrier is 312.5 kHz wide. The raw data rate per subcarrier ranges from 125 kbits/s to 1.125 Mbits/s. It depends on the modulation type, which is binary phase shift keying (BPSK), quaternary PSK (QPSK), 16-quadrature amplitude modulation (16-QAM), or 64-QAM. It also depends on the error-correcting code rate (1/2, 2/3, or 3/4). The composite signal therefore has a data rate ranging from 6 Mbits/s to 54 Mbits/s in the 20 MHz channel [11]. Table 1 lists the mode-dependent parameters of the 802.11a standard.

TABLE I. MODE-DEPENDENT PARAMETERS FOR 802.11A

Data rate (Mbits/s)   Modulation type   Code rate (R)   NBPSC   NCBPS   NDBPS
6                     BPSK              1/2             1       48      24
9                     BPSK              3/4             1       48      36
12                    QPSK              1/2             2       96      48
18                    QPSK              3/4             2       96      72
24                    16-QAM            1/2             4       192     96
36                    16-QAM            3/4             4       192     144
48                    64-QAM            2/3             6       288     192
54                    64-QAM            3/4             6       288     216

NBPSC: coded bits per subcarrier; NCBPS: coded bits per OFDM symbol; NDBPS: data bits per OFDM symbol.

Fig. 1 shows the block diagram of the digital baseband transmitter defined in the 802.11a standard. The transmitter produces one OFDM symbol at a time based on the parameters in Table 1. A scrambler first randomizes the input bit stream, and a convolutional encoder then encodes it at a coding rate of 1/2. Puncturing is used to obtain code rates other than 1/2. The bit stream is then interleaved. According to the modulation rules, it is mapped to complex numbers that represent the frequency-domain signals of the OFDM subcarriers. After the pilot signals are inserted, an Inverse Fast Fourier Transform (IFFT) converts the frequency-domain signals to time-domain signals. Finally, the time-domain signals are cyclically extended to form the guard interval of each OFDM symbol.

Figure 1. Block diagram for the digital baseband transmitter

Fig. 2 shows the scrambler structure defined in the 802.11a standard, which uses the generator polynomial $S(x) = x^7 + x^4 + 1$. The bit string {x7, x6, x5, x4, x3, x2, x1} is called the state of the scrambler. Given a non-zero initial state, the scrambler generates a pseudo-random sequence that is used to randomize the input bit stream.

Figure 2. Structure of the scrambler

After the input bits are scrambled, they are encoded by a rate R = 1/2 convolutional encoder, as shown in Fig. 3. The encoder uses the industry-standard generator polynomials G0 = 133 and G1 = 171 (octal). The bit string {Z5, Z4, Z3, Z2, Z1, Z0} is called the state of the convolutional encoder. For each input bit, the output bit X is emitted before Y.

Figure 3. Convolutional encoder in 802.11a standard

Because the rate of the encoder is fixed at 1/2, its output bit stream must be punctured to obtain the code rates of 2/3 and 3/4, as shown in Fig. 4. Puncturing omits some encoded bits according to a fixed pattern. For R = 3/4, four of every six encoded bits are kept, and for R = 2/3, three of every four are kept.

Figure 4. Puncturing patterns to obtain 2/3 and 3/4 code rates

Two interleavers process the output bits of the puncturing block. The first interleaver is a bit-wise block interleaver with 16 rows and NCBPS/16 columns that accepts NCBPS bits at a time. Bits are written into the block column by column and read out row by row. Fig. 5 shows the structure of the block interleaver and an example for NCBPS = 48.

Figure 5. Operation of the block interleaver for NCBPS = 48

Only the QAM modulation types use the second interleaver. It takes NCBPS bits at a time from the first interleaver and permutes them as follows.


The i-th input bit to the second interleaver, where i = 0, 1, ..., NCBPS − 1, is moved to position j:

$$ j = s \left\lfloor \frac{i}{s} \right\rfloor + \left( i + N_{CBPS} - \left\lfloor \frac{16\, i}{N_{CBPS}} \right\rfloor \right) \bmod s \qquad (1) $$

where s = max(NBPSC/2, 1) and ⌊·⌋ denotes the floor function, which returns the largest integer not exceeding its argument. For BPSK and QPSK, s = 1 and (1) reduces to j = i, which is why only the QAM modulation types need the second interleaver. The modulation mapping function for the chosen modulation type converts every NCBPS output bits of the interleavers into 48 complex numbers. Each complex number therefore represents NBPSC output bits. Before the 64-point IFFT processes the complex numbers, four complex numbers representing the pilot signals are inserted. The resulting 52 complex numbers are then extended to the 64 complex inputs of the IFFT. The four pilot values can be fetched from a predefined sequence. Fig. 6 illustrates the composition of the 64 complex IFFT inputs.

Figure 7. Operation of the scrambler for three consecutive input bits

The IFFT is computed as

$$ x(n) = \frac{1}{64} \sum_{k=0}^{63} X(k)\, W_{64}^{-kn}, \qquad n = 0, 1, \ldots, 63 \qquad (2) $$

where X(k) and x(n) are complex numbers, $j^2 = -1$, and $W_{64}^{-kn} = e^{j 2\pi nk/64}$.
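The two interleaver permutations can be made concrete in a few lines of C. The sketch below applies the block interleaver of Fig. 5 and then Eq. (1), and reports where each coded bit of an OFDM symbol ends up. It stores one bit per byte for readability, unlike the packed-word code the paper targets. The function names (interleave_index, interleave_symbol, interleaver_is_permutation) are hypothetical. The first permutation uses the standard's column-write, row-read rule i = (NCBPS/16)·(k mod 16) + floor(k/16).

```c
#include <assert.h>
#include <stddef.h>

/* First permutation (block interleaver of Fig. 5): coded bit k is written
   column by column into 16 rows of ncbps/16 columns and read row by row. */
static size_t first_perm(size_t k, size_t ncbps)
{
    return (ncbps / 16) * (k % 16) + k / 16;
}

/* Second permutation, Eq. (1), with s = max(nbpsc/2, 1). */
static size_t second_perm(size_t i, size_t ncbps, size_t nbpsc)
{
    size_t s = (nbpsc / 2 > 1) ? nbpsc / 2 : 1;
    return s * (i / s) + (i + ncbps - (16 * i) / ncbps) % s;
}

/* Final position of coded bit k inside one OFDM symbol. */
size_t interleave_index(size_t k, size_t ncbps, size_t nbpsc)
{
    assert(ncbps == 48 * nbpsc && k < ncbps);
    return second_perm(first_perm(k, ncbps), ncbps, nbpsc);
}

/* Interleave one OFDM symbol; bits are stored one per byte. */
void interleave_symbol(const unsigned char *in, unsigned char *out,
                       size_t ncbps, size_t nbpsc)
{
    for (size_t k = 0; k < ncbps; k++)
        out[interleave_index(k, ncbps, nbpsc)] = in[k];
}

/* Returns 1 if every output position is hit exactly once (ncbps <= 288). */
int interleaver_is_permutation(size_t ncbps, size_t nbpsc)
{
    unsigned char seen[288] = {0};
    assert(ncbps <= 288);
    for (size_t k = 0; k < ncbps; k++) {
        size_t j = interleave_index(k, ncbps, nbpsc);
        if (seen[j])
            return 0;
        seen[j] = 1;
    }
    return 1;
}
```

For example, interleave_index(1, 48, 1) returns 3, so the second coded bit of a BPSK symbol moves to the start of interleaver row 1. For 16-QAM (s = 2), the same bit lands at position 13 instead of 12, because Eq. (1) swaps bit pairs within each group of s bits.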

Figure 8. Parallelized scrambler to generate three bits at a time

Figure 6. The composition of the IFFT inputs
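The composition of the 64 IFFT inputs, Eq. (2), and the 16-sample guard interval described next can be sketched together in C. This is an illustration under stated assumptions, not the paper's code:

- It evaluates Eq. (2) directly in O(64²) operations with double-precision complex numbers, instead of calling TI's fixed-point DSP_fft16x16t.
- It takes the four pilot values as a parameter rather than generating the standard's pilot polarity sequence.
- It assumes the 802.11a subcarrier layout: data and pilots on subcarriers −26 to 26 excluding DC, pilots on ±7 and ±21, and subcarrier k stored in IFFT bin (k + 64) mod 64.

The function names are hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* e^{j 2 pi / 64} = cos(pi/32) + j sin(pi/32), written out so that the
   sketch does not need libm. */
static const double W_RE = 0.99518472667219688624;
static const double W_IM = 0.09801714032956060199;

/* Place 48 data and 4 pilot values on subcarriers -26..26 (DC excluded);
   the 12 remaining bins (DC and band edges) are set to zero. */
void build_ifft_input(const double complex data[48],
                      const double complex pilots[4], double complex X[64])
{
    static const int pilot_sc[4] = { -21, -7, 7, 21 };
    size_t d = 0, p = 0;
    for (int n = 0; n < 64; n++)
        X[n] = 0;
    for (int k = -26; k <= 26; k++) {
        if (k == 0)
            continue;                          /* DC subcarrier stays empty */
        X[(k + 64) % 64] = (p < 4 && k == pilot_sc[p]) ? pilots[p++] : data[d++];
    }
    assert(d == 48 && p == 4);
}

/* Eq. (2) evaluated directly: x(n) = (1/64) sum_k X(k) e^{j 2 pi n k / 64}. */
void ifft64_direct(const double complex X[64], double complex x[64])
{
    double complex w[64];                      /* w[m] = e^{j 2 pi m / 64} */
    w[0] = 1.0;
    for (int m = 1; m < 64; m++)
        w[m] = w[m - 1] * (W_RE + W_IM * I);
    for (int n = 0; n < 64; n++) {
        double complex acc = 0;
        for (int k = 0; k < 64; k++)
            acc += X[k] * w[(n * k) % 64];
        x[n] = acc / 64.0;
    }
}

/* Guard interval: copy the last 16 IFFT outputs in front of all 64. */
void add_guard_interval(const double complex x[64], double complex y[80])
{
    for (int n = 0; n < 16; n++)
        y[n] = x[48 + n];
    for (int n = 0; n < 64; n++)
        y[16 + n] = x[n];
}
```

With all data values set to 1 and all pilots to −1, x(0) equals (48 − 4)/64 = 0.6875, because W₆₄⁰ = 1. This gives a quick sanity check before swapping in an optimized FFT.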

Finally, the 64 complex outputs of the IFFT are extended to an array of 80 complex numbers. The last 16 outputs are copied in front of the first IFFT output, where they form the guard interval.

III. OPTIMIZED SOFTWARE IMPLEMENTATION

Constrained by the sequential execution model, a software implementation of the IEEE 802.11a transmitter on a single-chip VLIW DSP must sustain a high number of instructions per cycle to achieve the required data rate [9]. Instead of relying on a more powerful DSP with a higher clock frequency, we develop our software implementation with two major optimizations. Both exploit parallelism within and between the individual functions of the transmitter: 1) parallelizing the scrambler function and 2) concatenating the FEC encoder, puncturing, and interleaver functions.

A. Parallelizing the Scrambler Function

The scrambler in Fig. 2 processes one input bit at a time. We examined the operation of the 802.11a scrambler over consecutive input bits and found that three consecutive output bits can be generated solely from the current state of the scrambler, as shown in Fig. 7. This is possible because each new scrambler bit is the XOR of the bits generated four and seven steps earlier. As a result, the next three bits depend only on bits already held in the seven-bit state. Based on this observation, we parallelize the scrambler to take three consecutive input bits at a time and generate three output bits, as shown in Fig. 8.


Fig. 9 shows the corresponding C code, which computes three consecutive output bits from the current state of the scrambler. All variables are declared as unsigned int. The current state of the scrambler is stored in the seven most significant bits (MSBs) of sc_state, and the three consecutive output bits are collected in tmp_block. The _bitr( ) intrinsic reverses the bit order of a 32-bit variable. The reversal is necessary because the C64x processor uses little-endian bit order and the first output bit is the least significant bit (LSB) of tmp_block. Since all values of NDBPS in Table 1 are divisible by three, the parallelized scrambler works for all data rates in Table 1 without any bit-alignment problem.

Figure 9. Corresponding C code of the parallelized scrambler
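The portable C sketch below illustrates the idea of Fig. 8 and checks a three-bits-per-step scrambler against a bit-serial reference. It does not reproduce the Fig. 9 code. Its state layout (x1 in bit 0 through x7 in bit 6) and function names are assumptions of this sketch. The Fig. 9 code instead keeps the state in the seven MSBs of sc_state and uses the C64x _bitr() intrinsic. With feedback x4 XOR x7, the next three feedback bits are x4⊕x7, x3⊕x6, and x2⊕x5, all read from the current state.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch layout: bit (n - 1) of the state holds x_n, so x1 is bit 0 and x7
   is bit 6. S(x) = x^7 + x^4 + 1, i.e. feedback = x4 XOR x7. */
#define XBIT(s, n) (((s) >> ((n) - 1)) & 1u)

/* Bit-serial reference: one input bit per call. */
unsigned scramble_bit(unsigned *state, unsigned d)
{
    unsigned f = XBIT(*state, 4) ^ XBIT(*state, 7);
    *state = ((*state << 1) | f) & 0x7Fu;   /* f enters x1, old x7 drops out */
    return (d ^ f) & 1u;
}

/* Three bits per call (bit 0 of d3 is the first input bit). All three
   feedback bits are read from the current state before it is updated. */
unsigned scramble3(unsigned *state, unsigned d3)
{
    unsigned s = *state;
    unsigned f1 = XBIT(s, 4) ^ XBIT(s, 7);
    unsigned f2 = XBIT(s, 3) ^ XBIT(s, 6);
    unsigned f3 = XBIT(s, 2) ^ XBIT(s, 5);
    /* New state: x1 = f3, x2 = f2, x3 = f1, x4..x7 = old x1..x4. */
    *state = ((s << 3) | (f1 << 2) | (f2 << 1) | f3) & 0x7Fu;
    return (d3 ^ (f1 | (f2 << 1) | (f3 << 2))) & 7u;
}

/* Scramble nbits bits stored one per byte; nbits must be a multiple of 3,
   which holds for every N_DBPS in Table 1. */
void scramble_block3(unsigned *state, const unsigned char *in,
                     unsigned char *out, size_t nbits)
{
    assert(nbits % 3 == 0);
    for (size_t n = 0; n < nbits; n += 3) {
        unsigned d3 = (in[n] & 1u) | ((in[n + 1] & 1u) << 1)
                    | ((in[n + 2] & 1u) << 2);
        unsigned y3 = scramble3(state, d3);
        out[n] = (unsigned char)(y3 & 1u);
        out[n + 1] = (unsigned char)((y3 >> 1) & 1u);
        out[n + 2] = (unsigned char)((y3 >> 2) & 1u);
    }
}
```

Starting from the all-ones state and scrambling zeros, both versions produce 0000 1110 1111 0010 and so on. This is the beginning of the 127-bit scrambler sequence listed in the 802.11a standard.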

B. Concatenating the FEC Encoder, Puncturing, and Interleaver Functions

To exploit the parallelism between the individual FEC encoder, puncturing, and interleaver functions, we proceed in three steps. We first parallelize the FEC encoder function and then concatenate it with the puncturing function. Finally, we parallelize the concatenated FEC encoder and puncturing function again to concatenate it with the interleaver function.

1) Parallelizing the FEC Encoder Function

The operation of the convolutional encoder in Fig. 3 can be represented as logically ANDing two bit masks with the input bit stream, that is, the current input bit together with the six state bits. One bit mask, M0 = 133 (octal), generates X, and the other, M1 = 171 (octal), generates Y. The output bits X and Y are obtained by counting the number of '1' bits in the ANDed results. The output bit is '1' when the count is odd and '0' otherwise. Hence, the convolutional encoder in Fig. 3 can be redrawn as in Fig. 10, where bit masks represent the generator polynomials.

Figure 12. Concatenating the FEC encoder, puncturing, and interleaver functions for BPSK, R = 3/4

Figure 10. Bit mask representation of the FEC encoder
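The bit-mask form of Fig. 10 can be written in portable C as follows. The window layout is an assumption of this sketch: the current input bit sits in bit 6 and the bit from six steps earlier in bit 0. This layout lets the octal constants 133 and 171 apply unchanged. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generator polynomials 133 and 171 (octal) as bit masks over a 7-bit
   window: bit 6 = current input, bit 0 = input from six steps earlier. */
enum { CC_M0 = 0133, CC_M1 = 0171 };

/* Returns 1 if v contains an odd number of '1' bits, else 0. */
static unsigned parity32(uint32_t v)
{
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (unsigned)(v & 1u);
}

/* Rate-1/2 encoding in the bit-mask form: X = parity(window & M0),
   Y = parity(window & M1), X written before Y. *window carries the
   encoder state between calls and starts at 0. */
void cc_encode_masks(uint32_t *window, const unsigned char *in,
                     unsigned char *out, size_t nbits)
{
    uint32_t w = *window;
    assert(w < 128u);                                 /* 7-bit window only */
    for (size_t n = 0; n < nbits; n++) {
        w = (w >> 1) | ((uint32_t)(in[n] & 1u) << 6); /* shift in new bit */
        out[2 * n]     = (unsigned char)parity32(w & CC_M0);
        out[2 * n + 1] = (unsigned char)parity32(w & CC_M1);
    }
    *window = w;
}
```

Feeding a single '1' followed by zeros reproduces the generator polynomials themselves. The X outputs read 1011011 and the Y outputs 1111001, which are the binary forms of 133 and 171 (octal).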

By observation, the output bits of the convolutional encoder are independent of each other. Because the encoder has no feedback path, each output bit depends only on the input bits and never on previous output bits. Hence, the encoder can process more than one input bit at a time by using more than two bit masks. For example, the convolutional encoder in Fig. 11 generates 12 output bits at a time for 6 consecutive input bits by using 12 bit masks.
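Extending the bit-mask form of Fig. 10 to several input bits per step only requires shifted copies of the two masks. In the C sketch below, the 6 previous inputs and 6 new inputs share a 12-bit window. Input i of the block uses M0 << i and M1 << i, which gives the 12 masks of the rate-1/2 case. The window layout and names are assumptions of this sketch, not the layout of Fig. 11.

The sketch also shows why puncturing concatenates so easily. For R = 3/4, only X0 Y0 X1 Y2 out of every X0 Y0 X1 Y1 X2 Y2 are transmitted. The punctured encoder is therefore the same loop with the masks of the deleted bits left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CC_M0 = 0133, CC_M1 = 0171 };

static unsigned parity32(uint32_t v)
{
    v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return (unsigned)(v & 1u);
}

/* Masks over a 12-bit window B: bits 0..5 hold the previous 6 inputs
   (oldest in bit 0), bits 6..11 the 6 new inputs. */
typedef struct {
    uint32_t mask[12];
    size_t count;
} mask_set;

/* R = 1/2: all 12 masks, in output order X0 Y0 X1 Y1 ... X5 Y5. */
mask_set masks_rate_half(void)
{
    mask_set ms = { {0}, 0 };
    for (unsigned i = 0; i < 6; i++) {
        ms.mask[ms.count++] = (uint32_t)CC_M0 << i;
        ms.mask[ms.count++] = (uint32_t)CC_M1 << i;
    }
    return ms;
}

/* R = 3/4: keep X0 Y0 X1 Y2 of each group of 3 inputs; puncturing is
   simply the omission of the masks of the deleted bits. */
mask_set masks_rate_three_quarters(void)
{
    mask_set ms = { {0}, 0 };
    for (unsigned g = 0; g < 6; g += 3) {
        ms.mask[ms.count++] = (uint32_t)CC_M0 << g;         /* X0 */
        ms.mask[ms.count++] = (uint32_t)CC_M1 << g;         /* Y0 */
        ms.mask[ms.count++] = (uint32_t)CC_M0 << (g + 1);   /* X1 */
        ms.mask[ms.count++] = (uint32_t)CC_M1 << (g + 2);   /* Y2 */
    }
    return ms;
}

/* Encode 6 inputs (bit i of in6 = i-th input). Output bit t comes from
   mask t. *history holds the previous 6 inputs and is updated. */
uint32_t encode_block6(const mask_set *ms, uint32_t *history, uint32_t in6)
{
    uint32_t B = (*history & 0x3Fu) | ((in6 & 0x3Fu) << 6);
    uint32_t out = 0;
    assert(ms->count <= 12);
    for (size_t t = 0; t < ms->count; t++)
        out |= (uint32_t)parity32(B & ms->mask[t]) << t;
    *history = in6 & 0x3Fu;
    return out;
}
```

Because every output bit is an independent parity over the window, the order of the masks in a mask_set is free. This property allows the interleaver to be folded in as well.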

2) Concatenating the Interleaver with the FEC Encoder and Puncturing Function

The concatenated function in Fig. 11 generates its output bits independently of each other. We can therefore reorder the bit masks so that the output bits come out in the output order of the interleaver shown in Fig. 5 instead of the input order. In this way, the interleaver function is concatenated with the convolutional encoder and puncturing functions. Fig. 12 gives an example based on the concatenated function in Fig. 11 and the interleaver in Fig. 5. It further parallelizes the concatenated function to combine the convolutional encoder, puncturing, and interleaver functions for BPSK with code rate 3/4. In Fig. 12, 48 output bits are generated for 36 input bits at a time, and each row of three bit masks generates the output bits of one row of the interleaver. All values of NDBPS in Table 1 with code rate 3/4 are divisible by 36, and each block of 36 input bits yields 48 coded bits, which fill exactly three columns of the 16-row block interleaver. The concatenated function in Fig. 12 can therefore be used for all cases in which the code rate is 3/4.

Figure 11. Concatenating the parallelized convolutional encoder with the puncturing function for a code rate of 1/2

Fig. 13 shows the corresponding C code for the concatenated FEC encoder, puncturing, and interleaver function. The code generates the first 8 rows (row0 to row7) of output bits in Fig. 12. The remaining 8 rows (row8 to row15) can be generated in the same way.


Figure 13. Concatenating the FEC encoder, puncturing, and interleaver functions for BPSK, R = 3/4
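The C sketch below is independent of the Fig. 13 code. It builds the 48 masks for R = 3/4 directly in interleaver order and checks them against a straightforward encode, puncture, and block-interleave pipeline. Two choices are assumptions of this sketch: it packs the 6 history bits and 36 new input bits into one 64-bit window, whereas the paper's code works on 32-bit C64x registers, and it writes one output bit per byte. The function names are hypothetical. Because 48 consecutive coded bits fill exactly three columns of the 16-row interleaver, input block b of a symbol writes columns 3b to 3b + 2 of every row. The same 48 masks therefore serve every R = 3/4 mode in Table 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CC_M0 = 0133, CC_M1 = 0171 };

static unsigned parity64(uint64_t v)
{
    v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
    v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
    return (unsigned)(v & 1u);
}

/* 48 masks over a 42-bit window (previous 6 inputs in bits 0..5, the 36 new
   inputs in bits 6..41). Coded bit c follows the puncturing order X0 Y0 X1 Y2
   of each group of 3 inputs; its mask is stored at 3*(c%16) + c/16, i.e.
   interleaver row c%16 and column c/16, so masks are in interleaver order. */
void build_concat_masks(uint64_t masks[48])
{
    for (unsigned c = 0; c < 48; c++) {
        unsigned g = c / 4, slot = c % 4;
        unsigned i = 3 * g + (slot == 2 ? 1u : slot == 3 ? 2u : 0u);
        uint64_t m = (slot == 1 || slot == 3) ? CC_M1 : CC_M0;
        masks[3 * (c % 16) + c / 16] = m << i;
    }
}

/* Encode, puncture and block-interleave input block b (bit k of in36 is the
   k-th input bit) of a symbol with ncbps coded bits; out is one bit per byte
   in transmit order. *history holds the previous 6 inputs. */
void concat_block(const uint64_t masks[48], uint64_t *history, uint64_t in36,
                  size_t b, size_t ncbps, unsigned char *out)
{
    const uint64_t in_mask = ((uint64_t)1 << 36) - 1;
    uint64_t B = (*history & 0x3F) | ((in36 & in_mask) << 6);
    size_t cols = ncbps / 16;
    assert(ncbps % 48 == 0 && b < ncbps / 48);
    for (size_t r = 0; r < 16; r++)
        for (size_t q = 0; q < 3; q++)
            out[r * cols + 3 * b + q] = (unsigned char)parity64(B & masks[3 * r + q]);
    *history = (in36 >> 30) & 0x3F;
}
```

For QPSK at R = 3/4 (NCBPS = 96), two calls with b = 0 and b = 1 produce the whole interleaved symbol. The second permutation of Eq. (1) is the identity for QPSK, so no further reordering is needed.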

In Fig. 13, the six LSBs of cc_state store the state of the convolutional encoder. The variables cc_mask0 and cc_mask1 contain one row of the M0 and M1 bit masks of Fig. 12, respectively, and input_block contains the input bit stream. The variables cc_tmp0 to cc_tmp7 collect the output bits corresponding to the first eight rows of the interleaver. The concatenated functions for the rate 1/2 and rate 2/3 cases can be constructed in the same way as for R = 3/4.

IV. EXPERIMENTAL RESULTS

The optimized software implementation of the 802.11a transmitter was developed on the TI TMS320C64x DSP, a fixed-point DSP with an enhanced VLIW (Very Long Instruction Word) architecture. The C64x DSP has eight functional units that can complete up to eight operations in parallel, at a maximum clock frequency of 1 GHz [12].
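The data-rate figure of merit used in this section, Eq. (3) below, and the required rates of Table 1 can be reproduced with a few lines of C. The sketch assumes the 802.11a symbol timing of 80 samples at 20 Msamples/s, that is, one OFDM symbol every 4 µs. The paper does not state this timing explicitly. The type and function names are hypothetical.

```c
#include <assert.h>

/* One row of Table 1: coded bits per subcarrier and code rate r_num/r_den. */
typedef struct {
    int nbpsc;
    int r_num, r_den;
} phy_mode;

/* N_CBPS = 48 * N_BPSC coded bits per symbol; N_DBPS = N_CBPS * R. */
int ndbps(phy_mode m)
{
    assert(m.r_den > 0 && (48 * m.nbpsc * m.r_num) % m.r_den == 0);
    return 48 * m.nbpsc * m.r_num / m.r_den;
}

/* Required rate: N_DBPS bits every 4 us (80 samples at 20 Msamples/s). */
double required_mbps(phy_mode m)
{
    return ndbps(m) / 4.0;
}

/* Eq. (3): clock frequency (MHz) divided by cycles per data bit. */
double achieved_mbps(double clock_mhz, double cycles_per_symbol, phy_mode m)
{
    return clock_mhz / (cycles_per_symbol / ndbps(m));
}
```

Table 2 reports 1582 cycles per symbol for 64-QAM at R = 3/4. With that value, achieved_mbps(1000.0, 1582.0, m) evaluates to about 136.5 Mbits/s, against the 54 Mbits/s required.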


The functions of the transmitter are written in C and compiled with Code Composer Studio (CCS) v3.0 using maximum compiler optimization for execution speed. The IFFT function is implemented with the DSP_fft16x16t function from the TI C64x DSP library [13]. Table 2 shows the performance of the individual functions of the transmitter in terms of the clock cycles needed to compute each function for one OFDM symbol. Assuming a 1 GHz clock frequency, the data rate in megabits per second (Mbits/s) is computed as

$$ \text{Data rate (Mbits/s)} = \frac{1000\ \text{MHz}}{\text{Cycles} / N_{DBPS}\ \text{(bits)}} \qquad (3) $$

where Cycles is the number of clock cycles needed to produce one OFDM symbol.

TABLE II. PERFORMANCE OF THE OPTIMIZED SOFTWARE IMPLEMENTATION

Modulation type                      BPSK      BPSK      QPSK      QPSK      16-QAM    16-QAM    64-QAM    64-QAM
Code rate (R)                        1/2       3/4       1/2       3/4       1/2       3/4       2/3       3/4
Required data rate (Mbits/s)         6         9         12        18        24        36        48        54
Scrambler (cycles)                   72        115       138       196       218       332       405       468
Concatenated FEC encoder,
puncturing, interleaver (cycles)     93        106       123       139       165       201       242       263
Second interleaver (cycles)          Bypassed  Bypassed  Bypassed  Bypassed  82        82        113       113
Modulation mapping (cycles)          280       280       311       311       301       301       306       306
Pilot insertion (cycles)             44        44        44        44        44        44        44        44
IFFT (cycles)                        388       388       388       388       388       388       388       388
Transmitter as a whole (cycles)      877       933       1004      1078      1198      1348      1498      1582
Transmitter as a whole (Mbits/s)     28        38        48        66        80        106       128       136

From Table 2, the software implementation on a single C64x DSP with a clock frequency of 1 GHz can operate at up to 136 Mbits/s while satisfying all the data rate requirements defined in the 802.11a standard.

V. CONCLUSION

In this paper, we have implemented a software-based 802.11a digital baseband transmitter on the TI TMS320C64x DSP. The transmitter can operate at all data rates defined in the 802.11a standard and is compatible with the high-rate portions of the 802.11g standard. The software implementation uses two major optimizations to achieve the high data rate: 1) parallelizing the scrambler function and 2) concatenating the FEC encoder, puncturing, and interleaver functions. Experiments show that the software implementation on a single C64x DSP at a clock frequency of 1 GHz can operate at up to 136 Mbits/s. This is twice as fast as the previous software implementation at the same clock frequency.

REFERENCES

[1] IEEE 802.11b-1999, "Wireless LAN medium access control (MAC) and physical layer (PHY) specifications: High speed physical layer extension in the 2.4 GHz band," 1999.
[2] IEEE 802.11a-1999, "Wireless LAN medium access control (MAC) and physical layer (PHY) specifications: High speed physical layer in the 5 GHz band," 1999.
[3] IEEE 802.11g-2003, "Wireless LAN medium access control (MAC) and physical layer (PHY) specifications: Further high speed physical layer extension in the 2.4 GHz band," June 2003.
[4] W. Stallings, "IEEE 802.11: wireless LANs from a to n," IT Professional, vol. 6, no. 5, pp. 32-37, Sept.-Oct. 2004.
[5] R. Ahola et al., "A single-chip CMOS transceiver for 802.11a/b/g wireless LANs," IEEE Journal of Solid-State Circuits, vol. 39, no. 12, pp. 2250-2258, Dec. 2004.
[6] K. Vavelidis et al., "A dual-band 5.15-5.35-GHz, 2.4-2.5-GHz 0.18-µm CMOS transceiver for 802.11a/b/g wireless LAN," IEEE Journal of Solid-State Circuits, vol. 39, no. 7, pp. 1180-1184, July 2004.
[7] P. Coulton and D. Carline, "An SDR inspired design for the FPGA implementation of 802.11a baseband system," Proc. IEEE International Symposium on Consumer Electronics, pp. 470-475, Sept. 1-3, 2004.
[8] C. Dick and F. Harris, "FPGA implementation of an OFDM PHY," Proc. Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 1, pp. 905-909, Nov. 9-12, 2003.
[9] M. J. Meeuwsen, O. Sattari, and B. M. Baas, "A full-rate software implementation of an IEEE 802.11a compliant digital baseband transmitter," Proc. IEEE Workshop on Signal Processing Systems (SIPS 2004), pp. 124-129, Oct. 13-15, 2004.
[10] M. F. Tariq, Y. Baltaci, T. Horseman, M. Butler, and A. Nix, "Development of an OFDM based high speed wireless LAN platform using the TI C6x DSP," Proc. IEEE International Conference on Communications (ICC), vol. 1, pp. 522-526, Apr. 28-May 2, 2002.
[11] T. H. Meng, B. McFarland, D. Su, and J. Thomson, "Design and implementation of an all-CMOS 802.11a wireless LAN chipset," IEEE Communications Magazine, vol. 41, no. 8, pp. 160-168, Aug. 2003.
[12] Texas Instruments, "DSP Selection Guide," SSDV004P, Feb. 2005.
[13] Texas Instruments, "TMS320C64x DSP Library Programmer's Reference," SPRU565A, Apr. 2002.