Design and Verification for Dual Issue Digital Signal Processor

Cheng-Hung Lin
Dept. of Industrial Technology Education, National Taiwan Normal University, Taipei, Taiwan
[email protected]

Chun-Yu Lin
Dept. of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
[email protected]

Shih-Chieh Chang
Dept. of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan
[email protected]

Abstract: The Digital Signal Processor (DSP) is widely used to process video and audio streaming data. Because of the huge amount of streaming data, increasing throughput is the key issue in designing a DSP architecture. One way to increase the throughput of a DSP is to increase the instruction level parallelism. Many architectures have been proposed for this purpose and can be classified into two main approaches, the superscalar and the VLIW architectures. Among these, the VLIW attracts a lot of attention due to its low hardware complexity. However, the VLIW architecture suffers from instruction memory explosion due to the overhead of instruction grouping. In this paper, we propose a novel DSP architecture which contains three pipelines and performs dynamic instruction grouping in hardware. The experimental results show that our architecture can reduce memory requirements by 13% on average while maintaining the same performance.

Keywords: DSP; instruction level parallelism; pipelines

I. INTRODUCTION

Telecommunication and multimedia applications are among the fastest growing portions of the microelectronics market today. Many digital signal processors (DSPs) [6][7][8][9] have been proposed to enhance computing performance for streaming data in applications such as wireless communication, video and audio processing, and integrated networking. Because of the huge amount of streaming data, increasing throughput has become the key issue in designing a DSP architecture. One way to increase throughput is to increase the instruction level parallelism, which involves two main steps: instruction grouping and function unit assignment. Many architectures have been proposed for this purpose and can be classified into two main approaches, the superscalar [12][13] and the VLIW architectures [1][2][3][4][5][6][7]. The superscalar architectures perform instruction grouping and function unit assignment in hardware, which has the advantage of instruction compatibility. On the other hand, the VLIW architecture has low hardware complexity because it relies on software to perform instruction grouping and function unit assignment. However, the VLIW architecture suffers from large instruction memory due to the extra information needed to encode instruction grouping, and in many cases NOPs have to be inserted to pad the instruction groups.

In this paper, we propose a new architecture which combines the advantages of the superscalar and the VLIW architectures so as to reduce the memory size and increase the IPC (instructions per cycle). Our idea is to use a compiler to recognize instruction parallelism, as in the VLIW architecture. Since the compiler has a global view of the instructions, the results of instruction grouping are usually good. Still, the grouping of instructions may contain redundancy. We then remove the grouping information but maintain the instruction order. In our hardware architecture, we add a minimal amount of hardware to recognize instruction parallelism. Since the instructions have been ordered so that potentially parallel instructions are adjacent, our hardware (much simpler than a superscalar) can easily recognize the parallelism. In this way, we can take advantage of the memory efficiency of the superscalar while still maintaining the instruction parallelism of the VLIW architecture. We have implemented our idea on the target architecture in Figure 1, which consists of three pipeline datapaths: the P-path, the D-path, and the Co-DSP path. The experimental results show that our architecture can reduce memory requirements by an average of 13% while maintaining the same performance.

Figure 1. Our pipeline architecture (three pipeline datapaths, the P-path, D-path, and Co-DSP path, sharing AG, instruction prefetch, instruction dispatch, and decode stages, followed by M and WB stages).

II. REVIEWS OF INSTRUCTION LEVEL PARALLELISM

As described in the introduction, one way to increase the throughput of a DSP is to increase the instruction level parallelism. As shown in Figure 2(a), a single pipeline executes one instruction in each cycle. On the other hand, the superscalar shown in Figure 2(b) executes multiple instructions in each cycle. In terms of the instructions per cycle (IPC) rate, the superscalar in this example is three times better than the single pipeline.

Figure 2. Two kinds of pipelining architectures: (a) a single pipeline (IF, ID, EX, WB) and (b) a superscalar pipeline that issues several instructions to multiple EX units per cycle.

Instruction level parallelism involves two major steps: instruction grouping for independent instructions and function unit assignment for the limited hardware execution resources. As shown in Figure 3, a data dependency graph is first created from the dependencies among the instructions; instructions I1 and I2, as well as I3 and I4, are independent, so each pair can be grouped and executed concurrently. In the second step, the grouped instructions are assigned to different pipelines for execution according to their instruction types.

Figure 3. Instruction level parallelism flow: a data dependency graph is built, independent instructions are grouped, and the grouped instructions are assigned to function units.
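To make the two steps concrete, the following Python sketch models them in software. It is only an illustration of the flow in Figure 3, not the compiler or hardware used in this work; the register operands, the two-entry group limit, and the Func0/Func1 unit names are assumptions.

```python
# Illustrative model of the two ILP steps: (1) group adjacent independent
# instructions, (2) assign each grouped instruction to a function unit.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Inst:
    name: str
    kind: str                          # e.g. "load", "alu", "vector"
    dst: Optional[str] = None
    srcs: List[str] = field(default_factory=list)

def depends(a: Inst, b: Inst) -> bool:
    """True if b must wait for a (RAW, WAW, or WAR on a register)."""
    return (a.dst is not None and (a.dst in b.srcs or a.dst == b.dst)) or \
           (b.dst is not None and b.dst in a.srcs)

def group(insts, max_group=2):
    """Step 1: greedily pack adjacent, mutually independent instructions."""
    groups, cur = [], []
    for inst in insts:
        if len(cur) < max_group and all(not depends(p, inst) for p in cur):
            cur.append(inst)
        else:
            groups.append(cur)
            cur = [inst]
    if cur:
        groups.append(cur)
    return groups

def assign(groups, units=("Func0", "Func1")):
    """Step 2: give each instruction of a group its own function unit."""
    return [{unit: inst.name for unit, inst in zip(units, g)} for g in groups]

prog = [Inst("I1", "load", dst="r0"),
        Inst("I2", "load", dst="r1"),
        Inst("I3", "alu", dst="r2", srcs=["r0"]),
        Inst("I4", "alu", dst="r3", srcs=["r1"]),
        Inst("I5", "alu", dst="r4", srcs=["r2", "r3"])]

for cycle, slots in enumerate(assign(group(prog))):
    print(cycle, slots)   # (I1, I2) issue together, then (I3, I4), then I5
```

With these assumed operands, I1/I2 and I3/I4 form the parallel groups of Figure 3, while I5 issues alone because it depends on both preceding results.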


To increase the instruction level parallelism, many architectures have been proposed, and they can be classified into two main approaches, the superscalar and the VLIW architectures. The superscalar architectures perform instruction grouping and function unit assignment dynamically at run time in hardware, which has the advantage of instruction compatibility. On the other hand, the VLIW architectures perform instruction grouping and function unit assignment statically in the compiler, which has the advantage of low hardware complexity. In 2001, Intel proposed another architecture for instruction level parallelism, Explicitly Parallel Instruction Computing (EPIC) [1][2][3]. As shown in Figure 4, EPIC performs instruction grouping in the compiler and function unit assignment in hardware. The advantages of EPIC include global instruction grouping, good instruction compatibility, and good hardware flexibility.

Figure 4. EPIC functional block: the compiler (software) performs global instruction grouping, and the hardware performs function unit assignment.

Still, in EPIC, length information or redundant NOPs (no operations) are normally needed, and EPIC limits the number and types of instructions in an instruction group to reduce the memory overhead. For example, in Figure 5(a), the instructions (I0, I2), (I1, I5), and (I3, I4) can be grouped together, but extra overhead is needed; a potential length overhead is highlighted by the rectangular box. To reduce the overhead, EPIC may decide not to use the groups (I0, I2) and (I1, I5), and the final result is shown in Figure 5(b).

Figure 5. Cost of instruction grouping: (a) Option 1 of instruction grouping; (b) Option 2 of instruction grouping.
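The trade-off can be illustrated with a toy cost model. The sketch below assumes a 4-byte instruction, a 1-byte length/stop field per multi-instruction group, and single-cycle issue per group; these numbers, and the exact instruction layout of option 2, are assumptions for illustration only and are not the real EPIC/IA-64 encoding.

```python
# Toy cost model of the grouping trade-off: more groups buy parallelism
# (fewer cycles) at the price of extra grouping information (more bytes).
INST_BYTES = 4
GROUP_FIELD_BYTES = 1   # assumed overhead carried by every group of size > 1

def encoded_size(groups):
    """Bytes needed to encode the program for a given grouping choice."""
    return sum(len(g) * INST_BYTES + (GROUP_FIELD_BYTES if len(g) > 1 else 0)
               for g in groups)

def issue_cycles(groups):
    """Cycles needed if every group issues in a single cycle."""
    return len(groups)

option1 = [["I0", "I2"], ["I1", "I5"], ["I3", "I4"]]        # as in Figure 5(a)
option2 = [["I0"], ["I1"], ["I2"], ["I5"], ["I3", "I4"]]    # fewer groups, as in 5(b)

for name, opt in (("option 1", option1), ("option 2", option2)):
    print(name, encoded_size(opt), "bytes,", issue_cycles(opt), "cycles")
# option 1: 27 bytes in 3 cycles; option 2: 25 bytes in 5 cycles
```

Dropping groups saves encoding bytes but lengthens the schedule, which is exactly the overhead-versus-parallelism tension the proposed architecture tries to resolve.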


III. DUAL ISSUE VLIW DSP ARCHITECTURE

A. Main Idea

From the above example, there is still memory overhead in achieving instruction level parallelism, so we propose a new architecture that performs dynamic grouping to both reduce the memory overhead and increase the instruction level parallelism. For the same example as in Figure 5, the result after this optimization is shown in Figure 6.


Figure 6. Optimization Using Dynamic Scheduling

The main idea is to add hardware components that perform dynamic instruction grouping and function unit assignment in an EPIC architecture. As shown in Figure 7, the new architecture includes a dynamic instruction grouping unit, so it retains the advantages of EPIC while improving the instruction level parallelism. In the following, we detail the mechanism of the new architecture.



Figure 7. Improved EPIC hardware architecture: the compiler performs global instruction grouping and produces instruction bundles; the hardware performs dynamic instruction grouping, function unit assignment, and instruction split and issue to the function units.

B. New Hardware Architecture

To ease the discussion, we illustrate our new architecture using the target DSP design of Figure 1, which consists of three pipelines. As shown in Figure 8, the new architecture has three major components: instruction pre-fetch, dynamic scheduler, and execution pipelines. In the first stage, the instruction pre-fetch unit groups multiple instructions into a composite instruction packet, and the packet is then pushed into the instruction queue. The instruction pre-fetch unit also performs branch prediction that is determined at compile time; this includes function calls, direct jumps, and static conditional branch prediction.
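The following sketch models the behavior of such a pre-fetch stage under stated assumptions (up to two instructions per packet, an eight-entry queue, and a compile-time "taken" hint on flow-control instructions). It is an illustrative software model, not the RTL of the unit.

```python
# Sketch of the pre-fetch stage: fetch, form a packet, push it into the
# instruction queue, and follow compile-time (static) branch predictions.
from collections import deque

class PrefetchUnit:
    def __init__(self, imem, queue_depth=8):
        self.imem = imem                      # program as a list of instruction dicts
        self.queue = deque(maxlen=queue_depth)
        self.pc = 0

    def step(self):
        """Fetch up to two instructions and push them as one packet."""
        packet = []
        for _ in range(2):
            if self.pc >= len(self.imem):
                break
            inst = self.imem[self.pc]
            packet.append(inst)
            if inst["kind"] in ("call", "jump", "branch"):
                # static prediction: direction/target decided at compile time
                self.pc = inst["target"] if inst.get("taken", True) else self.pc + 1
                break                          # do not fetch past a flow-control inst
            self.pc += 1
        if packet:
            self.queue.append(packet)

imem = [{"kind": "load"}, {"kind": "alu"},
        {"kind": "branch", "target": 0, "taken": False},
        {"kind": "alu"}]
pf = PrefetchUnit(imem)
pf.step(); pf.step()
print(list(pf.queue))   # [[load, alu], [branch]]
```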



Figure 8. New EPIC architecture: instructions from the VLIW compiler are fetched by the instruction prefetch unit (IF), grouped and issued by the dynamic scheduler (ID), and executed on three execution pipelines (EX/WB).

In the second stage, the dynamic scheduler decodes instructions and issues them into the execution pipelines. The dynamic scheduler also detects data hazards caused by dynamically formed instruction packets. In the final stage, the execution pipelines perform the operations; at this stage, multiple instructions can be executed concurrently.

C. Dynamic Instruction Grouping and Prefetch

To execute multiple instructions in parallel, more instructions must be grouped into a composite packet. In a typical instruction sequence there often exists a subsequence in which a memory access instruction is followed by an ALU computing instruction, so the dynamic instruction grouping unit groups a memory access instruction and an ALU computing instruction into one packet. Other instructions, such as flow control and Vector/SIMD instructions, are isolated and cannot be merged into a packet. At the instruction prefetch stage, the dynamic instruction grouping unit does not check for data hazards. However, when the two instructions of one packet have a data dependency, the function unit assignment unit splits the packet and issues the two instructions sequentially.
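A minimal sketch of this grouping rule, with an assumed instruction representation, is shown below; hazard detection is intentionally left out here, since it is handled later at function unit assignment.

```python
# Dynamic grouping rule: only a memory access instruction followed by an
# ALU instruction is packed into one packet; flow control and Vector/SIMD
# instructions stay alone. Data hazards are NOT checked at this stage.
MEM, ALU, FLOW, VSIMD = "mem", "alu", "flow", "vsimd"

def can_group(first, second):
    """Packet rule: memory access followed by ALU computing."""
    return first["kind"] == MEM and second["kind"] == ALU

def dynamic_group(stream):
    """Turn a linear instruction stream into packets of one or two."""
    packets, i = [], 0
    while i < len(stream):
        if i + 1 < len(stream) and can_group(stream[i], stream[i + 1]):
            packets.append([stream[i], stream[i + 1]])
            i += 2
        else:
            packets.append([stream[i]])   # isolated instruction
            i += 1
    return packets

stream = [{"kind": MEM, "name": "LD r0"}, {"kind": ALU, "name": "ADD r1"},
          {"kind": VSIMD, "name": "VMUL v0"}, {"kind": MEM, "name": "ST r2"}]
for p in dynamic_group(stream):
    print([x["name"] for x in p])   # [LD, ADD], [VMUL], [ST]
```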


D. Instruction Schedule and Split Check

Instruction grouping falls into two categories: static instruction grouping and dynamic instruction grouping. As shown in Figure 9, a static instruction group must be split because the execution resources are three pipelines (Load, ALU, Vector). Because a static instruction group has no internal data dependence, data hazard detection is not necessary; the splitter can therefore issue two instructions (Load and Vector) concurrently, followed by the remaining Load instruction.

Figure 9. Static instruction group split: a static group of two memory access instructions and one Vector/SIMD instruction is issued as a first split (memory access and Vector/SIMD) followed by a second split (memory access).

Dynamic instruction grouping has two cases. First, if a data hazard is detected, the scheduler must split the group and issue the instructions in order, as shown in Figure 10. Second, if no data hazard is detected, the scheduler issues the instruction group concurrently.

Figure 10. Dynamic instruction group split: a dynamically grouped memory access and computing instruction pair is split into a first split (memory access) and a second split (computing) when a data hazard is detected.
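The split check itself can be sketched as a simple register-overlap test on the two instructions of a packet; the field names below are assumptions for illustration, not the actual pipeline signals.

```python
# Split check at function unit assignment: a dynamically grouped packet is
# split and issued over two cycles whenever its second instruction reads
# (or rewrites) a register written by the first one.
def has_hazard(first, second):
    """RAW/WAW hazard between the two instructions of one packet."""
    dst = first.get("dst")
    return dst is not None and (dst in second.get("srcs", []) or dst == second.get("dst"))

def issue(packet):
    """Return the issue slots: one cycle if safe, two cycles if split."""
    if len(packet) == 2 and has_hazard(packet[0], packet[1]):
        return [[packet[0]], [packet[1]]]      # first split, second split
    return [packet]                            # issue concurrently

ld  = {"name": "LD r0, [r4]",    "dst": "r0", "srcs": ["r4"]}
add = {"name": "ADD r1, r0, r2", "dst": "r1", "srcs": ["r0", "r2"]}
sub = {"name": "SUB r3, r5, r6", "dst": "r3", "srcs": ["r5", "r6"]}

print([[i["name"] for i in slot] for slot in issue([ld, add])])  # split (RAW on r0)
print([[i["name"] for i in slot] for slot in issue([ld, sub])])  # issued together
```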

IV. EXPERIMENTAL RESULTS

In this section, we describe our experimental environment and results. We built our simulation environment with Cadence's NC-Verilog simulator and used Blackfin Koop [14] as the golden model to verify our design. The C compiler for the MSA is built from the GNU C compiler (GCC) with Newlib as the C runtime library. The major benchmark is DSPstone [11], which contains several digital signal processing algorithms such as gsm, g721, adpcm, fir, and matrix multiply. The second benchmark is MediaBench [10], which contains media applications such as jpeg, mpeg2, epic, mesa, and ghostscript. Since file I/O is simulated through virtual system calls invoked by the Newlib library, file I/O latency is ignored in our experiments. The simulation flow is shown in Figure 11. First, a benchmark program is compiled by our modified GCC compiler. The resulting assembly code is simulated both on our design in the NC-Verilog simulator and on the golden model in a GDB simulation. After simulation, we compare the results of our design with the golden model and measure the execution cycle count and the number of instructions of each type. The processor golden model is written in Verilog and co-verified with a C model.

Figure 11. Experimental flow: benchmarks (DSPstone and MediaBench) are compiled by the GCC compiler and run both on the NC-Verilog simulation of our design and on the Blackfin Koop golden model; a performance monitor collects the results.

The experimental results are shown in Tables 1, 2, and 3. Table 1 shows the simulation results of the programs compiled for speed optimization, while Table 2 shows the simulation results of the programs compiled for memory size optimization. In Table 1, column 1 gives the benchmark name; columns 2 through 8 give the total instruction count, the load-store, ALU, vector, VLIW, and dual-issue instruction counts, and the memory size. Columns 9 and 10 give the clock cycles and the instructions per cycle (IPC) of the original design, while columns 11, 12, and 13 give the clock cycles, IPC, and improvement of our design. For example, the jpeg benchmark has 8,615,154 instructions, of which 4,317,122 are load-store instructions, 2,802,854 are ALU computing instructions, 32,423 are vector instructions, and 21,464 are VLIW instructions, and dynamic scheduling forms 601,021 instruction packets. The memory size is 93,072 bytes. The execution time of jpeg is 12,198,449 cycles without the dynamic scheduler and 11,573,176 cycles with the dynamic scheduler, a cycle improvement of 5.13%. We would like to mention that the cycle improvement comes from grouping instructions across the boundaries of branches or loops, which are hard to optimize in the compiler. Table 2 shows the results of optimization for memory size; the improvement is 3% to 6% on average. These results show that the dynamic scheduler can improve the performance of EPIC without consuming a large amount of memory. Table 3 compares memory size and performance between optimization for area and for speed. The results show that our architecture can reduce memory requirements by 13% on average and still achieve a 1% performance improvement.
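As a quick sanity check, the quoted jpeg numbers can be reproduced from the definitions IPC = instructions / cycles and improvement = saved cycles / original cycles:

```python
# Recomputing the jpeg entries of Table 1 from the reported raw counts.
insts          = 8_615_154
cycles_orig    = 12_198_449
cycles_dynamic = 11_573_176

ipc_orig    = insts / cycles_orig        # ~0.706 -> 0.71 in Table 1
ipc_dynamic = insts / cycles_dynamic     # ~0.744 -> 0.74 in Table 1
improvement = (cycles_orig - cycles_dynamic) / cycles_orig   # -> 5.13%

print(round(ipc_orig, 2), round(ipc_dynamic, 2), f"{improvement:.2%}")
```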


V. CONCLUSIONS

In this paper, we propose a novel DSP architecture which contains three pipelines and performs dynamic instruction grouping in hardware. The experimental results show that our architecture can reduce memory requirements by 13% on average while maintaining the same performance.

REFERENCES

[1] M.S. Schlansker and B.R. Rau, "EPIC: Explicitly Parallel Instruction Computing," Computer, 2000, pp. 37-45.
[2] M.S. Schlansker and B.R. Rau, "EPIC: An Architecture for Instruction-Level Parallel Processors," HPL Tech. Report, Jan. 2000.
[3] "IA-64 Application Developer's Architecture Guide," Intel Corp., 1999.
[4] V. Kathail, M. Schlansker, and B.R. Rau, "HPL-PD Architecture Specification: Version 1.1," HPL Tech. Report, Feb. 2000.
[5] H. Sharangpani and K. Arora, "Itanium Processor Microarchitecture," IEEE Micro, 2000, pp. 24-43.
[6] R.K. Kolagotla, et al., "A 333-MHz dual-MAC DSP architecture for next-generation wireless applications," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, May 2001, pp. 1013-1016.
[7] R.K. Kolagotla, et al., "High Performance Dual-MAC DSP Architecture," IEEE Signal Processing Magazine, vol. 19, 2002, pp. 42-53.
[8] D. Pham, et al., "The Design and Implementation of a First-Generation CELL Processor," in Proc. of ISSCC 2005.
[9] J.C. Chu, et al., "An embedded coherent-multithreading multimedia processor and its programming model," in Proc. of the 44th Annual Conference on Design Automation, 2007, pp. 652-657.
[10] C. Lee, M. Potkonjak, and W.H. Mangione-Smith, "MediaBench: a tool for evaluating and synthesizing multimedia and communications systems," in Proc. of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 1997, pp. 330-335.
[11] "DSPStone: A DSP-Oriented Benchmarking Methodology," Institute for Integrated Systems in Signal Processing, 1994.
[12] K.C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, 1996.
[13] J. Keller, "The 21264: a superscalar Alpha processor with out-of-order execution," presented at the 9th Annual Microprocessor Forum, San Jose, CA, 1996.
[14] Blackfin Koop website, https://blackfin.uclinux.org/gf/

Table 1. Experimental results for GCC optimization for speed (instruction issue types for GCC -O2)

Benchmark | Instru. | load-store | alu | Vector | VLIW | dual issue | Memory (byte) | Original clock cycle | Original IPC | Dyn. sched. clock cycle | Dyn. sched. IPC | Impro.
matrix2 | 11,236 | 6,532 | 3,870 | 1 | 0 | 18 | 4,964 | 20,883 | 0.54 | 20,864 | 0.54 | 0.09%
matrix1 | 14,790 | 6,242 | 6,130 | 1 | 1,000 | 8 | 4,940 | 28,587 | 0.52 | 28,578 | 0.52 | 0.03%
convolution | 62,142 | 49,412 | 12,396 | 1 | 0 | 10 | 4,808 | 91,179 | 0.68 | 91,169 | 0.68 | 0.01%
fir | 94,938 | 78,086 | 16,494 | 1 | 0 | 12 | 4,856 | 128,079 | 0.74 | 128,067 | 0.74 | 0.01%
lms | 107,291 | 78,116 | 28,789 | 1 | 3 | 14 | 4,956 | 189,591 | 0.57 | 189,576 | 0.57 | 0.01%
real updates | 156,406 | 114,972 | 41,070 | 1 | 0 | 10 | 4,904 | 201,830 | 0.77 | 201,820 | 0.77 | 0.00%
complex updates | 385,790 | 287,006 | 90,222 | 1 | 4,096 | 10 | 4,968 | 488,557 | 0.79 | 488,547 | 0.79 | 0.00%
biquad sections | 471,805 | 246,046 | 192,629 | 1 | 16,385 | 8 | 4,932 | 689,272 | 0.68 | 689,263 | 0.68 | 0.00%
jpeg | 8,615,154 | 4,317,122 | 2,802,854 | 32,423 | 21,464 | 601,021 | 93,072 | 12,198,449 | 0.71 | 11,573,176 | 0.74 | 5.13%
adpcm | 10,522,595 | 3,226,834 | 4,370,880 | 1,007,628 | 5,780 | 709,676 | 28,936 | 14,534,858 | 0.72 | 13,764,491 | 0.76 | 5.30%
epic | 11,252,102 | 2,266,122 | 5,633,490 | 1,689,770 | 35,116 | 328,323 | 55,796 | 14,668,632 | 0.77 | 14,225,943 | 0.79 | 3.02%
Average | | | | | | | | 1 | 1 | 96% | 1.01 |

Table 2. Experimental results for GCC optimization for memory size (instruction issue types for GCC -Os)

Benchmark | Instru. | load-store | alu | Vector | VLIW | dual issue | Memory (byte) | Original clock cycle | Original IPC | Dyn. sched. clock cycle | Dyn. sched. IPC | Impro.
matrix2 | 15,269 | 7,824 | 6,863 | 1 | 0 | 126 | 4,701 | 23,833 | 0.64 | 23,108 | 0.66 | 3.14%
matrix1 | 18,139 | 10,006 | 7,555 | 1 | 0 | 124 | 4,681 | 26,993 | 0.67 | 26,179 | 0.69 | 3.11%
convolution | 107,193 | 69,891 | 28,782 | 1 | 0 | 4,105 | 4,561 | 140,316 | 0.76 | 132,117 | 0.81 | 6.21%
fir | 135,877 | 86,273 | 32,876 | 1 | 0 | 8,204 | 4,593 | 173,100 | 0.78 | 164,895 | 0.82 | 4.98%
lms | 185,085 | 98,593 | 53,374 | 1 | 0 | 16,393 | 4,689 | 271,471 | 0.68 | 255,074 | 0.72 | 6.43%
real updates | 217,826 | 135,442 | 73,834 | 1 | 0 | 4,112 | 4,649 | 255,051 | 0.85 | 242,752 | 0.89 | 5.07%
complex updates | 443,114 | 291,090 | 102,504 | 1 | 0 | 24,594 | 4,705 | 537,683 | 0.82 | 504,902 | 0.87 | 6.49%
biquad sections | 701,155 | 180,504 | 331,891 | 1 | 0 | 94,218 | 4,657 | 869,464 | 0.80 | 717,908 | 0.97 | 21.11%
jpeg | 7,917,416 | 4,098,841 | 2,506,312 | 86,870 | 241 | 555,710 | 88,169 | 12,478,294 | 0.63 | 11,883,175 | 0.66 | 5.01%
adpcm | 10,588,673 | 3,307,355 | 4,814,547 | 1,007,776 | 5,632 | 480,843 | 11,385 | 14,767,492 | 0.71 | 14,197,380 | 0.74 | 4.02%
epic | 11,695,409 | 2,612,927 | 5,779,615 | 1,695,983 | 193 | 339,498 | 52,349 | 15,236,157 | 0.76 | 14,653,480 | 0.79 | 3.98%
Average | | | | | | | | 1 | 1 | 96% | 1.06 |

Table 3. Comparison of memory size and performance

Benchmark | GCC -O2 memory size (byte) | GCC -O2 clock cycles (w/o dynamic scheduler) | GCC -Os memory size (byte) | GCC -Os clock cycles (w/ dynamic scheduler)
matrix2 | 4,964 | 20,883 | 4,701 | 23,108
matrix1 | 4,940 | 28,587 | 4,681 | 26,179
convolution | 4,808 | 91,179 | 4,561 | 132,117
fir | 4,856 | 128,079 | 4,593 | 164,895
lms | 4,956 | 189,591 | 4,689 | 255,074
real updates | 4,904 | 201,830 | 4,649 | 242,752
complex updates | 4,968 | 488,557 | 4,705 | 504,902
biquad sections | 4,932 | 689,272 | 4,657 | 717,908
jpeg | 93,072 | 12,198,449 | 88,169 | 11,883,175
adpcm | 28,936 | 14,534,858 | 11,385 | 14,197,380
epic | 55,796 | 14,668,632 | 52,349 | 14,653,480
Average | 1 | 1 | 13% | 1%
