NETWORK ACCESS TRAFFIC MANAGER PROJECT REPORT

This project report is submitted in partial fulfillment of the requirements for the award of the degree

Master of Technology

in ELECTRONICS DESIGN AND TECHNOLOGY

Submitted by
Poovaiah M P and Sushanth Kini
(4610-510-081-05887, 4610-510-081-05942)

Under the guidance of
Kuruvilla Varghese

CENTRE FOR ELECTRONICS DESIGN AND TECHNOLOGY
Indian Institute of Science, Bangalore - 560012
June 8, 2010


Contents

I INTRODUCTION

1 Introduction
  1.1 Quality of Service

2 Market Survey
  2.1 QoSWORKS 10000
  2.2 Blue Coat PacketShaper series
  2.3 ET/R2816 10Gb/s Appliance
  2.4 VIPER TES-1000

3 Literature Survey
  3.1 Quality of Service
    3.1.1 Defining 'Flow'
    3.1.2 End-to-End QoS levels
    3.1.3 QoS Requirements of VoIP
    3.1.4 QoS Requirements of Video
    3.1.5 QoS Requirements for Data
  3.2 Packet Classification
    3.2.1 PQ: Prioritized Queuing
    3.2.2 Rules for classification of the packets
    3.2.3 HTTP Header Look-up
    3.2.4 Ternary Content Addressable Memory
    3.2.5 Decision tree based Packet Classification
  3.3 Scheduling
    3.3.1 First Come First Serve (FCFS)
    3.3.2 Round Robin
    3.3.3 Weighted Round Robin
    3.3.4 Deficit Round Robin
    3.3.5 Class-Based Weighted Fair Queuing
    3.3.6 Flow-Based Weighted Fair Queuing
  3.4 Traffic Shaping

II SYSTEM DESIGN

4 Motivation For the Proposed Design
  4.1 Downstream Traffic Monitoring
  4.2 TCP Window Manipulation
  4.3 HTTP Header Look-up
  4.4 TCP vs UDP
  4.5 Privileged Users
  4.6 Control Frames

5 Algorithms for the Design
  5.1 Packet Classification
    5.1.1 Layer 2 Parsing
    5.1.2 Layer 3 Parsing
    5.1.3 Layer 4 Parsing
    5.1.4 Algorithm to generate the groupNum
  5.2 Scheduling
  5.3 Shaping

6 Proposed Architecture
  6.1 Introduction
    6.1.1 Flow-based
    6.1.2 Class-based
    6.1.3 Proposed Architecture
  6.2 Overview of the Proposed Architecture
  6.3 Packet Accumulator
  6.4 Packet Classifier
  6.5 Queue Manager
  6.6 Scheduler
  6.7 Traffic Shaper
  6.8 Memory Read Controller
  6.9 Memory Write Controller
  6.10 DDR
  6.11 Egress Packet Accumulator
  6.12 DDR2 SDRAM Controller

7 Target Specifications

III IMPLEMENTATION

8 Micro-Architecture Details
  8.1 Packet Accumulator
    8.1.1 Brief Description of I/O Ports
    8.1.2 Interface to RxQueue
    8.1.3 Interface to the PC
    8.1.4 Arbiter for DPRAM access
  8.2 Packet Classifier
    8.2.1 Brief Description of I/O Ports
    8.2.2 Interface to PA
    8.2.3 L2 Header Parser
    8.2.4 L3 Header Parser
    8.2.5 L4 Header Parser
    8.2.6 Interface to QM
    8.2.7 Interface to PCI
  8.3 Queue Manager
    8.3.1 Brief Description of I/O Ports
    8.3.2 Tables in DPRAMs used in the design
    8.3.3 Space Calculation algorithm
    8.3.4 Interface to PC and MWC
    8.3.5 Interface to DDRC and SCH
  8.4 Memory Write Controller
    8.4.1 Brief Description of I/O Ports
    8.4.2 Interface to QM and PA
    8.4.3 Interface to DDRC
  8.5 Memory Read Controller
    8.5.1 Interface to QM
    8.5.2 Interface to DDRC WRAPPER
    8.5.3 Interface to TxQueue
    8.5.4 Interface to SHAPER
    8.5.5 Interface to SCHEDULER
    8.5.6 Microarchitecture
  8.6 Scheduler
    8.6.1 Interface to Shaper
    8.6.2 Interface to QM
    8.6.3 Interface to PCI
    8.6.4 Microarchitecture
  8.7 Shaper
    8.7.1 Brief Description of I/O Ports
    8.7.2 DPRAMs used in the design
    8.7.3 Interface to Scheduler
    8.7.4 Token Update Cycle
    8.7.5 Interface to MRC
    8.7.6 Interface from PCI
  8.8 DDRC Wrapper
    8.8.1 Brief Description of I/O Ports
    8.8.2 Interface to MWC
    8.8.3 Interface to MRC
    8.8.4 Interface to MRC

9 Host Software
  9.1 Introduction
  9.2 Software-Hardware Interface
  9.3 Software features
  9.4 Data formats of the configuration registers
    9.4.1 SCHEDULER UPDATE
    9.4.2 PRIV IP UPDATE
    9.4.3 PC WEIGHT UPDATE
    9.4.4 SHAPER TABLE UPDATE
    9.4.5 SHAPER TIMER UPDATE

10 Testing, Verification and Results
  10.1 Functional Verification
  10.2 Timing Simulation
  10.3 Hardware Testing
  10.4 Tools Used
  10.5 Resource Utilization
  10.6 Test Setup

11 Industrial Design for the Project

12 Future Scope and Conclusion
  12.1 Future Scope
  12.2 Conclusion

Appendix

A Logic Cores
  A.1 Rx Queue
    A.1.1 Functionality
    A.1.2 Interface Signals
  A.2 Tx Queue
    A.2.1 Functionality
    A.2.2 Interface
  A.3 RGMII I/O Unit
  A.4 CAM
    A.4.1 CAM Core Signals
    A.4.2 Read Operation
    A.4.3 Write Operation
    A.4.4 Specifying CAM Contents
  A.5 TEMAC (Tri-mode Ethernet MAC)

List of Figures

1.1 QoS requirements of different applications
2.1 QoSWorks from Sitara Networks
2.2 ET R2816
2.3 VIPER TES-1000
3.1 End-to-End QoS levels
3.2 Video conferencing packet-size breakdown
3.3 Video conferencing traffic rates (384 kbps session example)
3.4 Packets treated with different levels of priority
3.5 MIME types being sent back with the data content
3.6 HTTP transactions consisting of request and response messages
3.7 Example GET transaction
3.8 Step 1
3.9 Step 2
3.10 First Come First Serve
3.11 Code used to implement Round Robin
3.12 Simulation results for Round Robin
3.13 Code for the simulation of Weighted Round Robin
3.14 Pseudocode for Deficit Round Robin
3.15 Code for Deficit Round Robin
3.16 Simulation results for Deficit Round Robin with the quantum functions indicated below them
3.17 Flow-Based Weighted Fair Queuing
3.18 Leaky bucket algorithm (adapted from [6])
3.19 Illustrating the Token Bucket algorithm
3.20 Illustrating CIR
3.21 Single-rate dual token bucket
3.22 Dual-rate dual token bucket
5.1 Figure depicting the Token Bucket algorithm
6.1 Block schematic of the proposed architecture
6.2 Packet Accumulator
6.3 Interface diagram of Packet Classifier
6.4 Interface diagram of Queue Manager
6.5 Interface diagram of the Scheduler block
6.6 I/O diagram of the Shaper block
6.7 Interface diagram of Memory Read Controller
6.8 Interface diagram of Memory Write Controller
6.9 Interface diagram of Egress Packet Accumulator
6.10 I/O diagram of DDR2 SDRAM Controller
8.1 I/O diagram of Packet Accumulator
8.2 State machine of the controller in PA
8.3 I/O diagram of PC
8.4 pa pc readFSM in PC
8.5 Flowchart for the L2 Parser
8.6 Flowchart for the L3 Parser
8.7 Flowchart for the L4 Parser
8.8 I/O diagram of Queue Manager
8.9 FSM to read out entries from the pc qm cmdFIFO
8.10 FSM to record the packet length of the packets being enqueued
8.11 FSM to dequeue the packets out of the queues
8.12 I/O diagram of MWC
8.13 FSM to execute commands from qm mwc cmdFIFO
8.14 FSM to read out the data and address to the DDRC
8.15 I/O diagram of MRC
8.16 FSM for Address Generation in MRC
8.17 FSM for controlling the TxQ Interface in MRC
8.18 I/O diagram of SCH
8.19 I/O diagram of SHP
8.20 Token Bucket Update FSM in Shaper
8.21 Dequeue FSM in Shaper
8.22 I/O diagram of DDRC Wrapper
8.23 Memory Init FSM in DDRC Wrapper
8.24 Timing diagram of the DDR2 memory initialization command
8.25 Write FSM in DDRC Wrapper - Part 1
8.26 Write FSM in DDRC Wrapper - Part 2
8.27 Timing diagram of the Write command in DDRC Wrapper
8.28 Read FSM in DDRC Wrapper - Part 1
8.29 Read FSM in DDRC Wrapper - Part 2
8.30 Timing diagram of the Read command in DDRC Wrapper
10.1 Test Setup
11.1 Front-top-left perspective rendering of the device
11.2 Back-top-right perspective rendering of the device
A.1 Interface diagram of the RxQueue
A.2 Interface diagram of the Tx Queue
A.3 CAM schematic symbol
A.4 CAM read operation
A.5 CAM write operation
A.6 Normal transmission at 1 Gbps
A.7 Normal frame reception at 1 Gbps

List of Tables

2.1 Blue Coat Products
3.1 Voice Bandwidth (Without Layer 2 Overhead)
3.2 Voice Bandwidth (With Layer 2 Overhead)
5.1 Values in EtherType
5.2 Upper layer protocols and their weights
5.3 Values in the Protocol field of IP
5.4 Port numbers and their weights
5.5 Possible values of groupNum
8.1 Control signal values for End-of-frame in rxq pa cntrl
8.2 Protocols with their corresponding index and weight
8.3 Packet Length Table in Queue Manager
8.4 Enqueue Queue Next Pointer Table in Queue Manager
8.5 Dequeue Queue Next Pointer Table in Queue Manager
8.6 Queue Tokens Table in Shaper
9.1 Configurable registers of NATM and their PCI address
10.1 Protocols with their corresponding index and weight

Acknowledgements

We consider it a great privilege to thank all the people who are responsible for the successful completion of this project. First and foremost, we would like to express our heartfelt gratitude to our project guide, Mr. Kuruvilla Varghese, for his support and guidance at every stage of the project. We thank Dr. K. Gopakumar, Chairman, CEDT, and all the faculty members for their suggestions and support. We thank our sponsor, Xilinx, Inc., for providing us with the resources necessary for the completion of the project. We extend our special thanks to all the people in the NetFPGA community who have contributed their time and effort by answering all our queries related to the hardware. We thank all the others who have been directly or indirectly involved in the project.

Part I

INTRODUCTION


Chapter 1

Introduction

1.1 Quality of Service

The project aims at developing a unit that guarantees Quality of Service (QoS) for different users, applications and traffic classes. The unit is meant to reside between the Local Area Network edge and the WAN gateway router.

In the field of computer networking and other packet-switched telecommunication networks, the traffic-engineering term QoS refers to resource reservation control mechanisms. QoS is the ability to provide different priorities to different applications, users, or data flows, or to guarantee a certain level of performance to a data flow. For example, a required bit rate, delay, jitter, packet dropping probability and/or bit error rate may be guaranteed. QoS guarantees are important when network capacity is insufficient, especially for real-time streaming multimedia applications such as voice over IP, online games and IPTV, since these often require a fixed bit rate and are delay sensitive.

Different applications have different QoS requirements. For example, video-on-demand (VoD) applications require high throughput but can tolerate moderate end-to-end delay and relatively low reliability. In contrast, Internet telephony needs very low end-to-end latency and only moderate throughput, and a reliability slightly higher than that of VoD is acceptable. Application-level QoS parameters could include media quality, end-to-end delay requirements, inter/intra-stream synchronization, and others derived from the user's QoS specifications.

Considering the user profile in a Local Area Network, a few users could be categorized as a privileged group. For instance, in a LAN on an educational campus, the computer systems used by the Director, Departmental Heads, Deans and the like fall under this privileged group. The traffic originating from these users must be given preferential treatment, whereas the traffic generated by the computer systems used by the students on the campus could be treated with low priority. Such scenarios demand a device that can guarantee stringent Quality of Service to the users. The Access Traffic Manager caters to this need.

A major problem in managing bandwidth is that peer-to-peer applications such as BitTorrent have no single characteristic that identifies them, which makes them hard to detect. Such traffic reduces the bandwidth available for the other, more important traffic.


Figure 1.1: QoS requirements of different applications


Chapter 2

Market Survey

Bandwidth managers have been in use for a long time, and these devices now support ever higher throughputs as network bandwidths scale.

2.1 QoSWORKS 10000

Manufacturer: Sitara Networks

This model [Fig. 2.1] supports data rates of up to 100 Mbit/s and has an integrated Pentium III 600 MHz processor and dual 10/100 Mbit/s Ethernet interfaces. The QoSWorks 10000 features real-time traffic-flow monitoring. The unit can classify traffic by application/port or by the address of the sender or the receiver, and it can operate based on the application that is generating the traffic. If it encounters an unknown application, custom port filters can be built. As some applications use random ports on both ends, it becomes difficult to classify such traffic by application, so the unit also supports classification of traffic based on the port number.

Figure 2.1: QoSWorks from Sitara Networks

2.2 Blue Coat PacketShaper series

Manufacturer: Blue Coat

Blue Coat manufactures a series of bandwidth managers under the name 'PacketShaper'. The details of these products are listed in Table 2.1. These units support various protocols, the lists of which are available in [5].


PacketShaper model           900            1700            3500            7500              10000
IP Flows (TCP/Other IP)      5,000/2,500    30,000/15,000   40,000/20,000   200,000/100,000   300,000/150,000
Classes                      256            512             1,024           1,024             2,048
Dynamic Partitions           **             1,024           1,024           10,000            20,000
Static Partitions            128            256             512             512               1,024
Shaping Policies             256            512             1,024           1,024             2,048
Max no. of Matching Rules    640            2,562           2,562           5,120             5,000
No. of IP Hosts              5,000          15,000          20,000          150,000           200,000
Maximum Throughput           2 Mbps         10 Mbps         45 Mbps         200 Mbps          1 Gbps

Table 2.1: Blue Coat Products

2.3 ET/R2816 10Gb/s Appliance

Manufacturer: Emerging Technologies Inc.

Figure 2.2: ET R2816

• 8 cores capable of processing 16 simultaneous threads with Hyper-Threading
• 8 GB RAM
• A fully loaded R2816 can manage up to 8 Gb/s of traffic
• Two 10/100/1000 Mbit/s Ethernet ports, plus two high-performance 10/100/1000 failover/bypass ports
• Dual HDD
• Standard SVGA and keyboard inputs


• GUI for configuration
• Price: $21,995.00

2.4 VIPER TES-1000

• HP ProLiant DL140 G3 Server
• Dual-Core Intel Xeon 5110 CPU (1.6 GHz, 1066 MHz FSB)
• 1024 MB PC2-3200 Fully Buffered RAM with Advanced ECC
• 2x80 GB 7200 rpm SATA RAID-1 hard drives
• Intel 5000X chipset, mirroring mode
• Rail kit, 650 W power supply
• Two Broadcom 10/100/1000 Mbit/s NICs, DVD-ROM drive
• Price: hardware platform $1800, software platform $1000

Figure 2.3: VIPER TES-1000


Chapter 3

Literature Survey

3.1 Quality of Service

Network-level QoS can be characterized by the following parameters:

• Delay: time taken for a message to be transmitted;
• Response time: round-trip time from request transmission to reply receipt;
• Jitter: variation in delay or response time;
• Systems-level data rate: bandwidth required or available, in bits or bytes per second;
• Application-level data rate: bandwidth required or available, in application-specific units such as video frame rate;
• Transaction rate: number of operations requested or processed per second;
• Mean time to failure (MTTF): normal operation time between failures;
• Mean time to repair (MTTR): downtime from failure to restarting the next operation;
• Mean time between failures (MTBF): MTBF = MTTF + MTTR;
• Percentage of time available: MTTF / (MTTF + MTTR). These parameters typically form part of the service level agreement (SLA) between a network service provider and a service user. For the Internet, availability typically refers to the availability of the access link to the service provider;
• Packet loss rate: proportion of total packets that do not arrive as sent, e.g., lost because of congestion in the network;
• Bit error rate: proportion of total data that does not arrive as sent because of errors in the network transmission system. For example, the bit error rate increases if the transmission speed over a telephone line is increased.

3.1.1 Defining 'Flow'

A flow can be defined in a number of ways. One common way refers to a combination of source and destination addresses, source and destination socket numbers, and the session identifier. It can also be defined more broadly as a stream of packets from any source to a destination.

3.1.2 End-to-End QoS levels

Service levels refer to the actual end-to-end QoS capabilities, meaning the capability of a network to deliver the service needed by specific network traffic from end to end or edge to edge. The services differ in their level of QoS strictness, which describes how tightly the service can be bound by specific bandwidth, delay, jitter, and loss characteristics. Three basic levels of end-to-end QoS can be provided across a heterogeneous network, as shown in Figure 3.1:

• Best-effort service - Also known as lack of QoS, best-effort service is basic connectivity with no guarantees. It is best characterized by FIFO queues, which make no differentiation between flows.
• Differentiated service (also called soft QoS) - Some traffic is treated better than the rest (faster handling, more average bandwidth, and lower average loss rate). This is a statistical preference, not a hard and fast guarantee. It is provided by classification of traffic and the use of QoS tools such as PQ, CQ, WFQ, and WRED (all discussed later in this chapter).
• Guaranteed service (also called hard QoS) - An absolute reservation of network resources for specific traffic.

3.1.3 QoS Requirements of VoIP

VoIP deployments require the provisioning of explicit priority servicing for VoIP (bearer stream) traffic and a guaranteed bandwidth service for Call-Signaling traffic. These related classes are examined separately.

Voice (Bearer Traffic)

The following list summarizes the key QoS requirements and recommendations for voice (bearer traffic):

• Voice traffic should be marked to DSCP EF (Expedited Forwarding) as per the QoS Baseline and RFC 3246.
• Loss should be no more than 1 percent.
• One-way latency (mouth to ear) should be no more than 150 ms.
• Average one-way jitter should be targeted at less than 30 ms.


Figure 3.1: End-to-End QoS levels

Codec     Packetization Interval   Voice Payload (Bytes)   Packets Per Second   Bandwidth Per Conversation
G.711     20 ms                    160                     50                   80 kbps
G.711     30 ms                    240                     33                   74 kbps
G.729A    20 ms                    20                      50                   24 kbps
G.729A    30 ms                    30                      33                   19 kbps

Table 3.1: Voice Bandwidth (Without Layer 2 Overhead)

• A range of 21 to 320 kbps of guaranteed priority bandwidth is required per call (depending on the sampling rate, the VoIP codec, and the Layer 2 media overhead).
• Voice quality is directly affected by all three QoS quality factors: loss, latency, and jitter.

PPS (packets per second) is defined as the codec bit rate (in bits per second) divided by the voice payload size (in bits per packet).
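As a quick sanity check of the figures in Table 3.1, the per-call bandwidth can be derived from the codec bit rate, the payload size and the packet headers. The short sketch below is illustrative only; it assumes a 40-byte IP/UDP/RTP header and is not part of the design:

    # Layer-3 VoIP bandwidth check (illustrative; 40-byte IP/UDP/RTP header assumed)
    def voip_bandwidth_kbps(codec_rate_kbps, payload_bytes):
        pps = codec_rate_kbps * 1000.0 / (payload_bytes * 8)   # packets per second
        return (payload_bytes + 40) * 8 * pps / 1000.0         # kbps per conversation

    print(voip_bandwidth_kbps(64, 160))   # G.711, 20 ms packetization -> 80.0 kbps at 50 pps
    print(voip_bandwidth_kbps(8, 20))     # G.729A, 20 ms packetization -> 24.0 kbps at 50 pps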

Call-Signaling Traffic

The following list summarizes the key QoS requirements and recommendations for Call-Signaling traffic:


Bandwidth Consumption   802.1Q Ethernet   PPP       MLP       Frame Relay with FRF.12   ATM
G.711 at 50 pps         93 kbps           84 kbps   86 kbps   84 kbps                   106 kbps
G.711 at 33 pps         83 kbps           77 kbps   78 kbps   77 kbps                   84 kbps
G.729A at 50 pps        37 kbps           28 kbps   30 kbps   28 kbps                   43 kbps
G.729A at 33 pps        27 kbps           21 kbps   22 kbps   21 kbps                   28 kbps

Table 3.2: Voice Bandwidth (With Layer 2 Overhead)

• Call-Signaling traffic should be marked as DSCP CS3 per the QoS Baseline (during migration, it can also be marked with the legacy value of DSCP AF31).
• 150 bps (plus Layer 2 overhead) per phone of guaranteed bandwidth is required for voice control traffic; more may be required, depending on the Call-Signaling protocol(s) in use.

3.1.4 QoS Requirements of Video

Two main types of video traffic exist: Interactive-Video (videoconferencing) and Streaming-Video (both unicast and multicast). Each type of video is examined separately.

Interactive-Video

When provisioning for Interactive-Video (videoconferencing) traffic, the following guidelines are recommended:

• Interactive-Video traffic should be marked to DSCP AF41; excess videoconferencing traffic can be marked down by a policer to AF42 or AF43.
• Loss should be no more than 1 percent.
• One-way latency should be no more than 150 ms.
• Jitter should be no more than 30 ms.
• Assign Interactive-Video to either a preferential queue or a second priority queue (when supported); when using Cisco IOS LLQ, overprovision the minimum-priority bandwidth guarantee to the size of the videoconferencing session plus 20 percent. (For example, a 384-kbps videoconferencing session requires 460 kbps of guaranteed priority bandwidth.)

Because IP videoconferencing (IP/VC) includes a G.711 audio codec for voice, it has the same loss, delay, and delay-variation requirements as voice, but the traffic patterns of videoconferencing are radically different from those of voice. For example, videoconferencing traffic has varying packet sizes and extremely variable packet rates. These are illustrated in Figures 3.2 and 3.3.


Figure 3.2: Video conferencing Packet-size breakdown

Figure 3.3: Video Conferencing Traffic Rates (384 kbps Session Example)

Streaming-Video

When addressing the QoS needs of Streaming-Video traffic, the following guidelines are recommended:

• Streaming-Video (whether unicast or multicast) should be marked to DSCP CS4, as designated by the QoS Baseline.
• Loss should be no more than 5 percent.
• Latency should be no more than 4 to 5 seconds (depending on the video application's buffering capabilities).
• There are no significant jitter requirements.


• Guaranteed bandwidth (CBWFQ) requirements depend on the encoding format and rate of the video stream.
• Streaming-Video is typically unidirectional; therefore, remote branch routers might not require provisioning for Streaming-Video traffic on their WAN or VPN edges (in the direction of branch to campus).

Streaming-Video applications have more lenient QoS requirements because they are not delay sensitive (the video can take several seconds to cue up) and are largely not jitter sensitive (because of application buffering). However, Streaming-Video might contain valuable content, such as e-learning applications or multicast company meetings, in which case it requires service guarantees. The QoS Baseline recommendation for Streaming-Video marking is DSCP CS4. An interesting consideration with respect to Streaming-Video comes into play when designing WAN and VPN edge policies on branch routers: because Streaming-Video is generally unidirectional, a separate class is likely not needed for this traffic class in the branch-to-campus direction of traffic flow.

3.1.5 QoS Requirements for Data

Best-Effort Data

When addressing the QoS needs of Best-Effort traffic, the following guidelines are recommended:

• Best-Effort traffic should be marked to DSCP 0.
• Adequate bandwidth should be assigned to the Best-Effort class as a whole, because the majority of applications default to this class. It is recommended to reserve at least 25 percent for Best-Effort traffic.

The Best-Effort class is the default class for all data traffic. An application is removed from the default class only if it has been selected for preferential or deferential treatment.

Bulk Data

When addressing the QoS needs of Bulk Data traffic, the following guidelines are recommended:

• Bulk Data traffic should be marked to DSCP AF11; excess Bulk Data traffic can be marked down by a policer to AF12 or AF13.
• Bulk Data traffic should have a moderate bandwidth guarantee but should be constrained from dominating a link.

The Bulk Data class is intended for applications that are relatively non-interactive and not drop sensitive, and that typically span their operations over a long period of time as background occurrences. Such applications include FTP, e-mail, backup operations, database synchronizing or replicating operations, video content distribution, and any other type of application in which the operation runs in the background and users do not wait for its completion.


3.2 Packet Classification

3.2.1 PQ: Prioritized Queuing

PQ ensures that important traffic gets the fastest handling at each point where it is used. It was designed to give strict priority to important traffic. Priority queuing can flexibly prioritize according to network protocol (for example, IP, IPX, or AppleTalk), incoming interface, packet size, source/destination address, and so on. In PQ, each packet is placed in one of four queues - high, medium, normal, or low - based on an assigned priority. Packets that are not classified by this priority-list mechanism fall into the normal queue (see Figure 3.4). During transmission, the algorithm gives higher-priority queues absolute preferential treatment over lower-priority queues.

PQ is useful for making sure that mission-critical traffic traversing various WAN links gets priority treatment. For example, Cisco uses PQ to ensure that important Oracle-based sales reporting data gets to its destination ahead of other, less critical traffic. PQ currently uses static configuration and thus does not automatically adapt to changing network requirements.
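As an illustration of the four-queue strict-priority behaviour described above, a minimal software model might look like the following (a sketch only; the queue names follow the text, not the eventual hardware):

    from collections import deque

    # Strict priority queuing: always serve the highest-priority non-empty queue.
    queues = {p: deque() for p in ("high", "medium", "normal", "low")}

    def enqueue(packet, priority="normal"):
        # Packets not matched by the priority list fall into the normal queue.
        queues[priority if priority in queues else "normal"].append(packet)

    def dequeue():
        for level in ("high", "medium", "normal", "low"):
            if queues[level]:
                return queues[level].popleft()
        return None   # all queues empty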

Figure 3.4: Packets treated with different levels of Priority

3.2.2 Rules for classification of the packets

Packets can be classified based on the content found in their headers or by inspecting the data they contain. Header-based classification can be made at each level of the communication stack. The classification can also be made on the type of protocol used at every level; for example, the L3 protocol can be IPv4, IPv6, IPX, etc. Therefore, deciding which protocols to support becomes a primary design issue. Packets carrying an unsupported protocol may be dropped.

Possible classifications


Link Layer

At the link layer, packets can be classified based on the link-level protocol, the source and destination addresses, and the type of data carried (i.e., link control frame or data frame). Rate control at this layer can be most effective in the case of Denial of Service.

Network Layer

At the network layer, the source addresses of a few privileged users can be given very high priority, thereby classifying them as the privileged set of people of any institution. Also, any packet that carries routing information (information necessary to keep the routers in shape) can be treated differently from packets carrying actual user data. IPv4 is the most widespread protocol used in the Internet, and it makes much sense to support it, whereas protocols like IPX are used very rarely. Hence, we have to strike a compromise between the number of protocols we support and the amount of FPGA hardware consumed by the implementation. The IP header, for example, has a 'protocol' field containing a code used to demultiplex to the higher-level protocol that invoked its service; this field tells us which transport-layer protocol is in use, and differentiation can be made based on it. Also, error messages such as ICMP can be given higher priority, because delaying these packets may make the network unstable and hence unusable until it settles.

Transport Layer

At the transport layer there are many protocols standardized by the IETF. These protocols are designed to suit the requirements of almost all the applications that depend on them, so it is practically impossible to support all of them while keeping the on-chip area low. However, a small set of protocols is used by most applications, and supporting these is enough to provide the required QoS. Rules may be formed based on the source and destination port numbers, among other fields. The throughput (and hence the rate) of any reliable transport protocol can be controlled by modifying the window size carried in the packet. For inherently unreliable protocols like UDP, the only way to reduce the rate is to drop packets once they exceed a threshold rate.

Application Layer

A few protocols, such as BitTorrent and movie downloads, cannot be detected unless we do DPI (Deep Packet Inspection). Many protocols are overlays (i.e., they use TCP and build a custom protocol on top of it) and are hard to detect unless inspection is done at the application layer. Hence, checking the HTTP objects would also be a good way to control the incoming traffic.
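To make the idea of header-based rules concrete, the fragment below sketches a first-match classifier over a few L3/L4 fields. The rule set and field names here are hypothetical placeholders; the rules actually used by the design are described later (Chapter 5):

    # First-match classification over header fields (illustrative rule set only).
    RULES = [
        {"src_ip": "10.0.0.1", "priority": 0},                # privileged host
        {"l4_proto": "ICMP", "priority": 1},                  # control/error traffic
        {"l4_proto": "TCP", "dst_port": 80, "priority": 2},   # web traffic
        {"l4_proto": "UDP", "priority": 3},
    ]

    def classify(pkt):
        """Return the priority of the first rule whose fields all match, else a default."""
        for rule in RULES:
            if all(pkt.get(f) == v for f, v in rule.items() if f != "priority"):
                return rule["priority"]
        return 4   # unsupported protocol / no rule matched

    print(classify({"src_ip": "10.0.0.7", "l4_proto": "TCP", "dst_port": 80}))   # -> 2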

3.2.3 HTTP Header Look-up

One of the ways quality of service could be provided is by inspecting packets at the very top layer, the Application Layer. Since most web traffic nowadays is HTTP based, examining the HTTP header could provide more assistance in offering better quality of service to the users. For instance, as of today, almost all applications are embedded into the popular browsers: a user could stream video and audio, chat, and browse the Internet all using the same browser. Thus real-time video and audio data is received over the Internet within the HTTP payload itself. In fact, even file downloads are initiated via HTML requests which


ultimately appear inside the HTTP stream. Hence, inspecting the HTTP header can determine the application protocol. File downloads tend to occupy all the available bandwidth in the network if not restricted. If the QoS provider wants to restrict the bandwidth allocated to file downloads to a certain limit, this could be done by inspecting the HTTP header and subsequently assigning low priority to the HTTP file-download queue. Similarly, since video and audio streams have to be real-time, they will have to be given high priority once HTTP header look-up identifies them as video/audio streams. Thus, various applications can be distinguished and allotted different bandwidths and priorities based on the QoS guarantee to the user.

HTTP Object Tags

Because the Internet hosts many thousands of different data types, HTTP carefully tags each object being transported through the Web with a data format label called a MIME type. MIME (Multipurpose Internet Mail Extensions) was originally designed to solve problems encountered in moving messages between different electronic mail systems. MIME worked so well for email that HTTP adopted it to describe and label its own multimedia content. Web servers attach a MIME type to all HTTP object data (see Figure 3.5). When a web browser gets an object back from a server, it looks at the associated MIME type to see if it knows how to handle the object. Most browsers can handle hundreds of popular object types: displaying image files, parsing and formatting HTML files, playing audio files through the computer's speakers, or launching external plug-in software to handle special formats.

Figure 3.5: MIME Types being sent back with the data content

A MIME type is a textual label, represented as a primary object type and a specific subtype, separated by a slash. For example:

• An HTML-formatted text document would be labelled with type text/html.
• A plain ASCII text document would be labelled with type text/plain.
• A JPEG version of an image would be image/jpeg.
• A GIF-format image would be image/gif.
• An Apple QuickTime movie would be video/quicktime.
• A Microsoft PowerPoint presentation would be application/vnd.ms-powerpoint.


HTTP Transactions

An HTTP transaction consists of a request command (sent from client to server) and a response result (sent from the server back to the client). This communication happens with formatted blocks of data called HTTP messages, as illustrated in Figure 3.6.

Figure 3.6: HTTP transactions consisting of request and response messages

HTTP supports several different request commands, called HTTP methods. Every HTTP request message has a method, which tells the server what action to perform (fetch a web page, run a gateway program, delete a file, etc.). Common methods include GET, POST, PUT, DELETE, and HEAD; GET is the most common and usually is used to ask a server to send a resource.

Since the QoS device is most likely to be situated at the edge between the WAN gateway and the campus user hosts, it would be a better approach to providing QoS guarantees if the users' outgoing HTTP request messages are inspected. Prevention is better than cure! That is, if it is possible to identify the purpose of the outgoing HTTP request message, suitable action can be taken then and there. Say a campus user is initiating a file download: the requested resource would be present in the GET method's argument list. If it is possible to detect this, then the request could be dropped right at the QoS unit, put into a very low priority queue, or allowed to pass through to the gateway, depending on the time of day and on which resources are allowed to be accessed then, as per the campus Internet access rules.

However, such deep Layer 7 packet inspection is easier to do in software than in hardware, although a hardware implementation would inevitably be much faster than a software one. This HTTP header inspection is being looked into, and a decision on it will be taken when the target specifications of the project are finalized later in the course of the project.

About Group Level Shaping

As per the suggested architecture for the QoS unit, separate queues of packets are maintained within the unit. The way these queues are formed depends mainly on the IP addresses in the packets and also on the protocol encapsulated within the IP payload, i.e. either TCP or UDP.


Figure 3.7: Example GET transaction

The plan, as mentioned earlier, is to enqueue each packet into the appropriate queue based on the following algorithm (a sketch of this decision appears below):

• Initially, check if the packet originated from a machine that is classified as a privileged user. If so, enqueue the packet in the higher-priority queue.
• If step 1 is not true, i.e. the packet did not originate from the machine of a privileged user, then it has to be inspected further. Layer 3 payload inspection is done to find out which Layer 4 protocol data is encapsulated within the IP packet, and the packet is enqueued accordingly. Say, if TCP data happens to lie inside an IP packet, it goes into the TCP queue; similarly, UDP data goes to the UDP queue. For unsupported protocols, we plan to maintain a default queue.

In reality, the above queues are actually groups of queues. In step 1, there is a bunch of higher-priority queues, thereby forming a higher-priority group. Similarly, in step 2 the TCP queue is actually a group of queues, as is the UDP queue.
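A compact rendering of the enqueue decision outlined above (the privileged-IP set and the queue-group names are placeholders used only for illustration):

    PRIVILEGED_IPS = {"10.0.0.1", "10.0.0.2"}    # hypothetical privileged hosts

    def select_queue_group(pkt):
        """Step 1: a privileged source wins; step 2: fall back to the L4 protocol."""
        if pkt["src_ip"] in PRIVILEGED_IPS:
            return "high_priority"
        if pkt.get("l4_proto") == "TCP":
            return "tcp"
        if pkt.get("l4_proto") == "UDP":
            return "udp"
        return "default"                          # unsupported protocols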


Need for queue groups

The main reason the above queue grouping is done is to carry out group shaping in addition to per-queue shaping. Each queue in a group is subjected to shaping as per the final shaping algorithm chosen, say the token bucket algorithm. This step ensures that no queue ever exceeds the bandwidth allocated to it. Note that this has to be done for the higher-priority queues also.

In addition, the entire queue group is subjected to group shaping. This is essential because the entire group will have been allocated a fixed percentage of the bandwidth on the outgoing link. The limit on the bandwidth for a group is important because, without it, the lower-priority queues may be starved of bandwidth. Consider a scenario wherein all the queues in the higher-priority queue group are being filled up continuously: these queues would tend to take up the entire link bandwidth had there not been a group bandwidth limit. Therefore, group shaping shall be done on the queue groups. A simple token bucket algorithm will suffice for this purpose. The token rate of this bucket and the maximum threshold, which bounds the number of tokens that can accumulate, should be set appropriately by the administrator of the QoS unit.

Hence, when scheduling the queue to be dequeued next, the group shaper is consulted first. If a group is found to be running at a rate faster than its allocated bandwidth, the shaper indicates this using flags, and the scheduler looks at the status of these flags. If the flags say Go, the queue group is selected for deciding the next queue to be dequeued. If the flags indicate OverMaxRate, the entire group is skipped and the scheduler moves on to the next queue group. A while later, the queue group which had gone OverMaxRate may come back into contention when the shaper resets the OverMaxRate flag; this happens when, after some time, tokens start accumulating in the token bucket used for shaping that particular group's traffic.

Only when a queue group is found to be operating within its maximum rate does the scheduler go on to its second step, which is to select a queue among the queues in that group. The procedure again depends on the scheduling algorithm used. If, say, round-robin scheduling is being followed, the scheduler selects the next queue after consulting the shaper. The consultation is necessary because the shaper maintains flags indicating the shaped status of traffic on each queue. Here again, if a queue is found to be going over its maximum rate, it is not considered for dequeuing in that round; instead the scheduler simply moves on to the next eligible queue. The ineligible queue becomes eligible again once sufficient tokens accumulate in the token bucket used for shaping that queue's traffic.
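The two-level consultation described above (group shaper first, then the per-queue shaper, with a round-robin scan inside the group) can be summarised by the following sketch. The flag dictionaries stand in for the Go/OverMaxRate flags maintained by the shaper and are assumed to be refreshed elsewhere by the token-bucket update:

    def pick_next_queue(groups, group_go, queue_go, rr_pos):
        """groups: {group: [queue ids]}; *_go: True when the shaper says 'Go'."""
        for g, queues in groups.items():
            if not group_go[g]:                     # group is OverMaxRate: skip it
                continue
            start = rr_pos.get(g, 0)                # round-robin position within the group
            for i in range(len(queues)):
                q = queues[(start + i) % len(queues)]
                if queue_go[q]:                     # queue within its own shaped rate
                    rr_pos[g] = (start + i + 1) % len(queues)
                    return q
        return None                                 # nothing eligible this round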

3.2.4 Ternary Content Addressable Memory

There are many ways in which a packet can be classified based on a set of governing rules, of which using TCAMs is an attractive option due to their speed of operation. However, the major


drawbacks are their high power consumption and their inefficiency in handling port ranges; for example, to represent a filter for ports ranging from 1000 to 1999, we need 1000 entries in the table. In comparison to SRAM-based implementations, TCAMs are fast. Extended TCAMs can deliver high performance (100 million lookups per second) for large filter sets (100,000 filters), while reducing power consumption by a factor of ten and improving space efficiency by a factor of three [2]. The paper [4] also discusses a distributed TCAM scheme that exploits chip-level parallelism to greatly improve throughput.

3.2.5 Decision tree based Packet classification

Decision-tree based packet classification [4] implemented on FPGAs can achieve a throughput of 80 Gbps for minimum-size (40-byte) packets and can support about 10k unique rules. The algorithm for this implementation is shown in Fig. 3.8 and Fig. 3.9. The operation of this algorithm is two-fold: 1) building the decision tree, and 2) mapping the decision tree onto a pipeline.


Figure 3.8: Step 1


Figure 3.9: Step 2


3.3 Scheduling

Scheduling Algorithms

Scheduling of resources such as link bandwidth and available buffers is the key to providing performance guarantees to applications that require QoS support from the network. The QoS box needs to distinguish between the flows requiring different QoS (and possibly sort them into separate queues) and then, based on a scheduling algorithm, send these packets to the outgoing links. The goals to be achieved by the scheduling techniques that support QoS in packet-switching networks are:

• Sharing bandwidth;
• Providing fairness to competing flows;
• Meeting bandwidth guarantees (minimum and maximum);
• Meeting loss guarantees (multiple levels);
• Meeting delay guarantees (multiple levels);
• Reducing delay variations.

3.3.1 First Come First Serve (FCFS)

Packets from all flows are enqueued into a common buffer, and a server serves packets from the head of the queue. This scheme is called either first come first serve (FCFS) or first in first out (FIFO), as shown in Figure 3.10, which depicts the arrival of packets into a FCFS queue and the scheduler serving packets from the head of the queue.

Figure 3.10: First Come First Serve

Figure 3.10(b) shows a scenario in which all buffers are occupied and a newly arrived packet is dropped. FCFS fails to allocate a max-min fair-share


bandwidth allocation to individual flows. A greedy source can occupy most of the queue and cause delay to other flows using the same queue.

3.3.2 Round Robin

To address the fairness problem of a single FCFS queue, the round robin scheduler maintains one queue for each flow. Each incoming packet is placed in the appropriate queue. The queues are served in a round robin fashion, taking one packet from each non-empty queue in turn; empty queues are skipped over. This scheme is fair in that each busy flow gets to send exactly one packet per cycle, and it also balances load among the various flows. Note that there is no advantage to being greedy: a greedy flow finds that its queue becomes long, increasing its own delay, whereas other flows are unaffected by this behaviour.

If the packet sizes are fixed, such as in ATM networks, round robin provides a fair allocation of link bandwidth. If packet sizes are variable, which is the case in the Internet, there is a fairness problem. Consider a queue with very large packets and several other queues with very small packets. With round robin, the scheduler will come back to the large-packet queue quickly and spend a long time serving it, so on average the large-packet queue will get the larger share of the link bandwidth. Another problem with round robin is that it tries to allocate fair bandwidth to all queues, so differential treatment, or any specific allocation of bandwidth to specific queues, is not achieved.

The code in Fig. 3.11 was used to simulate the round robin scheduling algorithm. The variable queue is a three-dimensional array that has the queue index as the first dimension, the packet priority as the second dimension, and the packet size as the third dimension. It has been observed that round robin achieves fairness among the queues, but it may take many rounds to reach the steady state (Fig. 3.12); the figure shows that only after a long time do the queues achieve fairness of service.

Figure 3.11: Code used to implement Round Robin
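The simulation code itself survives only as the image in Fig. 3.11, so a minimal sketch of the same round-robin idea is given below. It is an illustrative Python reconstruction, not the original code; the queue contents are assumed packet sizes in bytes.

```python
from collections import deque

def round_robin(queues):
    """Serve one packet per non-empty queue per round; return bytes served per queue."""
    served = [0] * len(queues)
    while any(queues):                     # loop until every queue is empty
        for i, q in enumerate(queues):
            if q:                          # skip empty queues
                served[i] += q.popleft()   # send exactly one packet (its size in bytes)
    return served

# Three flows: one with large packets, two with small packets.
flows = [deque([1500] * 20), deque([64] * 20), deque([64] * 20)]
print(round_robin(flows))                  # the large-packet flow gets the larger byte share
```

Running it shows the behaviour noted above: with one packet per turn, the flow carrying large packets ends up with the larger share of the served bytes.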


Figure 3.12: Simulation results for Round Robin

3.3.3 Weighted Round Robin

Weighted round robin (WRR) is a simple modification to round robin. Instead of serving a single packet from a queue per turn, it serves n packets. Here n is adjusted to allocate a specific fraction of link bandwidth to that queue. Each flow is given a weight that corresponds to the fraction of link bandwidth it is going to receive. The number of packets to serve in one turn is calculated from this weight and the link capacity. The WRR works fine with fixed size packets, such as in ATM networks. However, WRR has difficulty in maintaining bandwidth guarantees with variable size packets (the Internet). The problem with a variable size packet is that flows with large packets will receive more than the allocated weight. In order to overcome this problem, the WRR server needs to know the mean packet size of sources a priori. Short-term fairness is another problem encountered by WRR. On a small time scale, WRR does not meet the fairness criteria, since some flows may transmit more than others. The other advantage of WRR is that we can provide differential treatment for the queues.

3.3.4 Deficit Round Robin

Deficit round robin (DRR) improves WRR by being able to serve variable-length packets without knowing the mean packet size of connections a priori. The algorithm works as follows. Initially, a variable quantum is initialized to represent the number of bits to be served from each queue. The scheduler starts serving each queue that has a packet to be served. If the packet size is less than or equal to the quantum, the packet is served. However, if the packet is bigger than the quantum size, the packet has to wait for another round. In this case another counter, called a deficit counter, is maintained for this queue: if a packet can't be served in a round, the queue's deficit counter is incremented by the size of the quantum. The following pseudocode, adapted from Shreedhar and Varghese [1], describes the DRR scheme. We observe that the DRR scheme can be used to provide a QoS promise on bandwidth; the algorithm makes sure that the required QoS is achieved for the flow down to the byte level. The simulation was carried out for different weighing functions and the results are as indicated in Fig. 3.16. DRR should set the quantum (bits to be served) to at least one packet from each connection.


Figure 3.13: Code for the simulation of Weighted Round Robin

This would require one to set the quantum to the MTU of the link. For example, if the link carries Ethernet packets, it should be set to 1,500 bytes. DRR is not fair at time scales shorter than a packet time. However, its ease of implementation makes it an attractive scheduler.
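The pseudocode of Shreedhar and Varghese and the simulation code appear only as Figures 3.14 and 3.15. The sketch below captures the deficit-counter idea in Python under assumed inputs (per-queue quantum in bytes, illustrative packet lists); it is not the original code.

```python
from collections import deque

def drr(queues, quantum, rounds):
    """Deficit Round Robin: each round a queue earns 'quantum[i]' bytes of credit
    and may send packets as long as the deficit counter covers their size."""
    deficit = [0] * len(queues)
    sent = [0] * len(queues)
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0             # an empty queue keeps no credit
                continue
            deficit[i] += quantum[i]
            while q and q[0] <= deficit[i]:
                pkt = q.popleft()
                deficit[i] -= pkt
                sent[i] += pkt
    return sent

flows = [deque([1500] * 50), deque([300] * 200), deque([700] * 100)]
print(drr(flows, quantum=[1500, 1500, 1500], rounds=20))  # roughly equal byte service
```

Even with very different packet sizes, each queue ends up with roughly the same number of served bytes per round, which is the property WRR alone cannot provide.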

3.3.5 Class-Based Weighted Fair Queuing

Class-based WFQ (CBWFQ) is one of the congestion-management tools that provides greater flexibility. CBWFQ is used when you want to guarantee a minimum amount of bandwidth, as opposed to capping traffic at a maximum amount. CBWFQ allows a network administrator to create minimum guaranteed bandwidth classes. Instead of providing a queue for each individual flow, a class is defined that consists of one or more flows, and each class can be guaranteed a minimum amount of bandwidth.

One example in which CBWFQ can be used is in preventing multiple low-priority flows from swamping out a single high-priority flow. For example, a video stream that needs half the bandwidth of a T1 will be provided that by WFQ if there are only two flows. As more flows are added, the video stream gets less of the bandwidth, because WFQ's mechanism creates fairness: if there are 10 flows, the video stream will get only 1/10th of the bandwidth, which is not enough. Even setting the IP precedence bit to 5 does not solve this problem, since 9 × 1 + 6 = 15 and video then gets only 6/15th of the bandwidth, which is less than the bandwidth video needs. A mechanism must be invoked to provide the half of the bandwidth that video needs, and CBWFQ provides this. The network administrator defines a class, places the video stream in the class, and tells the router to provide 768 kbps (half of a T1) of service for the class. Video is now given the bandwidth that it needs. A default class is used for the rest of the flows; this class is serviced using flow-based WFQ, allocating the remainder of the bandwidth (half of the T1, in this example).


Figure 3.14: Pseudocode for Deficit Round Robin

Note that with CBWFQ, a minimum amount of bandwidth can be reserved for a certain class. If more bandwidth is available, that class is welcome to use it. The key is that it is guaranteed a minimum amount of bandwidth. Also, if a class is not using its guaranteed bandwidth, other applications may use that bandwidth.

3.3.6 Flow-Based Weighted Fair Queuing

For situations in which it is desirable to provide consistent response time to heavy and light network users alike without adding excessive bandwidth, the solution is flow-based WFQ (commonly referred to as just WFQ). It is a flow-based queuing algorithm that creates bit-wise fairness by allowing each queue to be serviced fairly in terms of byte count. For example, if queue 1 has 100-byte packets and queue 2 has 50-byte packets, the WFQ algorithm will take two packets from queue 2 for every one packet from queue 1. This makes service fair for each


Figure 3.15: Code for Deficit Round Robin

queue: 100 bytes each time the queue is serviced. WFQ ensures that queues do not starve for bandwidth and that traffic gets predictable service. Low-volume traffic streams, which comprise the majority of traffic, receive increased service, transmitting the same number of bytes as high-volume streams. This behavior results in what appears to be preferential treatment for low-volume traffic, when in actuality it is creating fairness, as shown in Fig. 3.17. WFQ is designed to minimize configuration effort, and it automatically adapts to changing network traffic conditions.

Flow-based WFQ creates flows based on a number of characteristics in a packet. Each flow (also referred to as a conversation) is given its own queue for buffering if congestion is experienced. The weighted portion of WFQ comes from the use of IP precedence bits to provide greater service for certain queues. Using settings 0 to 5 (6 and 7 are reserved), WFQ uses its algorithm to determine how much more service to provide to a queue. WFQ is efficient in that it uses whatever bandwidth is available to forward traffic from lower-priority flows if no traffic from higher-priority flows is present. This is different from strict time-division multiplexing (TDM), which simply carves up the bandwidth and lets it go unused


Figure 3.16: Simulation results for Deficit Round Robin with the quantum functions indicated below them

Figure 3.17: Flow-Based Weighted Fair Queuing

if no traffic is present for a particular traffic type. WFQ works with both IP precedence and the Resource Reservation Protocol (RSVP), described later in this chapter, to help provide differentiated QoS as well as guaranteed services. The WFQ algorithm also addresses the problem of round-trip delay variability. If multiple high-volume conversations are active, their transfer rates and inter-arrival periods are made much


more predictable. This is created by the bit-wise fairness: if conversations are serviced in a consistent manner in every round-robin cycle, delay variation (or jitter) stabilizes. WFQ greatly enhances algorithms such as SNA Logical Link Control (LLC) and the Transmission Control Protocol (TCP) congestion control and slow-start features. A weight is a number calculated from the IP precedence setting for a packet in a flow. This weight is used in WFQ's algorithm to determine when the packet will be serviced:

Weight = 4096 / (IP precedence + 1)

or, in later implementations,

Weight = 32384 / (IP precedence + 1)
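As a quick numerical check of the formula (assuming, as is usual for WFQ, that a smaller weight means earlier service, so a flow's share grows with its IP precedence):

```python
def wfq_weight(ip_precedence, constant=4096):
    # The later constant 32384 changes only the scale, not the ratio between flows.
    return constant // (ip_precedence + 1)

for prec in (0, 5):
    print(prec, wfq_weight(prec), wfq_weight(prec, constant=32384))
# precedence 0 -> weights 4096 / 32384; precedence 5 -> weights 682 / 5397,
# about six times smaller, so a precedence-5 conversation receives roughly
# six times the service of a precedence-0 one.
```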


3.4 Traffic Shaping

Traffic shaping is about regulating the average rate (and burstiness) of data transmission. When a connection is set up, the user and the ISP (i.e., the customer and the carrier) agree on a certain traffic pattern (i.e., shape) for that circuit; sometimes this is called a service level agreement (SLA). As long as the customer fulfils his part of the deal and only sends packets according to the agreed-on contract, the carrier promises to deliver them all in a timely fashion. Traffic shaping reduces congestion and thus helps the carrier live up to its promise. Such agreements are not so important for file transfers but are of great importance for real-time data, such as audio and video connections, which have stringent quality-of-service requirements. Traffic shaping smooths out the traffic on the server side, rather than on the client side. Monitoring a traffic flow is called traffic policing. There is a slight difference between traffic shaping and traffic policing, as seen in the following definitions:
• Policing: Policing typically limits bandwidth by discarding traffic that exceeds a specified rate. However, policing can also remark traffic that exceeds the specified rate and attempt to send the traffic anyway. Since policing's drop behaviour causes TCP retransmits, it is recommended for use on higher-speed interfaces. Also, note that policing can be applied inbound or outbound on an interface.
• Shaping: Shaping limits excess traffic, not by dropping it but by buffering it and retransmitting it at the required rate. This buffering of excess traffic can lead to delay. Because of this delay, shaping is recommended for slower-speed interfaces. Unlike policing, shaping cannot remark traffic. As a final contrast, shaping can be applied only in the outbound direction on an interface.

The Leaky Bucket Algorithm

To understand the leaky bucket algorithm, consider a bucket with a small hole in the bottom, as illustrated in Fig. 3.18. No matter the rate at which water enters the bucket, the outflow is at a constant rate when there is any water in the bucket, and zero when the bucket is empty. Also, once the bucket is full, any additional water entering it spills over the sides and is lost (i.e., does not appear in the output stream under the hole). The same idea is applied to packets, as shown in the figure. Conceptually, each host is connected to the network by an interface containing a leaky bucket, that is, a finite internal queue. If a packet arrives at the queue when it is full, the packet is discarded. In other words, if one or more processes within the host try to send a packet when the maximum number is already queued, the new packet is discarded. This arrangement can be built into the hardware interface or simulated by the host operating system. The host is allowed to put one packet per clock tick onto the network; again, this can be enforced by the interface card or by the operating system. This mechanism turns an uneven flow of packets from the user processes inside the host into an even flow of packets onto the network, smoothing out bursts and greatly reducing the chances of congestion. When the packets are all of the same size (e.g., ATM cells), this algorithm can be used as described. However, when variable-sized packets are being used, it is often better to allow a fixed


number of bytes per tick, rather than just one packet. Thus, if the rule is 1024 bytes per tick, a single 1024-byte packet can be admitted on a tick, two 512-byte packets, four 256-byte packets, and so on. If the residual byte count is too low, the next packet must wait until the next tick. For implementing the leaky bucket algorithm all that is needed is a finite queue. When a packet arrives, if there is room on the queue it is appended to the queue; otherwise, it is discarded. At every clock tick, one packet is transmitted (unless the queue is empty). The byte-counting leaky bucket is implemented almost the same way. At each tick, a counter is initialized to n. If the first packet on the queue has fewer bytes than the current value of the counter, it is transmitted, and the counter is decremented by that number of bytes. Additional packets may also be sent, as long as the counter is high enough. When the counter drops below the length of the next packet on the queue, transmission stops until the next tick, at which time the residual byte count is reset and the flow can continue.

Figure 3.18: Leaky bucket algorithm (adapted from [6])
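A minimal sketch of the byte-counting leaky bucket just described, assuming a queue limit in packets and a per-tick byte budget n (both values illustrative):

```python
from collections import deque

class LeakyBucket:
    def __init__(self, queue_limit_pkts, bytes_per_tick):
        self.queue = deque()
        self.limit = queue_limit_pkts
        self.n = bytes_per_tick

    def arrive(self, pkt_len):
        """Append the packet if there is room on the queue, otherwise discard it."""
        if len(self.queue) < self.limit:
            self.queue.append(pkt_len)
            return True
        return False

    def tick(self):
        """At each clock tick, send packets as long as the byte budget allows."""
        budget = self.n
        sent = []
        while self.queue and self.queue[0] <= budget:
            pkt = self.queue.popleft()
            budget -= pkt
            sent.append(pkt)
        return sent          # residual budget is NOT carried over to the next tick

lb = LeakyBucket(queue_limit_pkts=8, bytes_per_tick=1024)
for p in (1024, 512, 512, 256):
    lb.arrive(p)
print(lb.tick(), lb.tick())  # at most 1024 bytes leave per tick, smoothing the burst
```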

Token Bucket Algorithm

The leaky bucket algorithm enforces a rigid output pattern at the average rate, no matter how bursty the traffic is. For many applications, it is better to allow the output to speed up somewhat when large bursts arrive, so a more flexible algorithm is needed, preferably one that never loses data. One such algorithm is the token bucket algorithm. In this algorithm, the leaky bucket holds tokens, generated by a clock at the rate of one token every ∆T sec. In Fig. 3.19(a) we see a bucket holding three tokens, with five packets waiting to be transmitted. For a packet to be transmitted, it must capture and destroy one token. In Fig. 3.19(b) we see that three of the five packets have gotten through, but the other two are stuck waiting for two more tokens to be generated. The token bucket algorithm provides a different kind of traffic shaping than that of the leaky


Figure 3.19: Illustrating Token Bucket Algorithm

bucket algorithm. The leaky bucket algorithm does not allow idle hosts to save up permission to send large bursts later. The token bucket algorithm does allow saving, up to the maximum size of the bucket, n. This property means that bursts of up to n packets can be sent at once, allowing some burstiness in the output stream and giving faster response to sudden bursts of input. Another difference between the two algorithms is that the token bucket algorithm throws away tokens (i.e., transmission capacity) when the bucket fills up but never discards packets, whereas the leaky bucket algorithm discards packets when the bucket fills up. Here, too, a minor variant is possible, in which each token represents the right to send not one packet but k bytes. A packet can only be transmitted if enough tokens are available to cover its length in bytes; fractional tokens are kept for future use.

The implementation of the basic token bucket algorithm is just a variable that counts tokens. The counter is incremented by one every ∆T and decremented by one whenever a packet is sent. When the counter hits zero, no packets may be sent. In the byte-count variant, the counter is incremented by k bytes every ∆T and decremented by the length of each packet sent. A potential problem with the token bucket algorithm is that it allows large bursts again, even though the maximum burst interval can be regulated by careful selection of ρ and M. It is frequently desirable to reduce the peak rate, but without going back to the low value of the original leaky bucket.

Token Bucket - Leaky Bucket Combination

One way to get smoother traffic is to insert a leaky bucket after the token bucket. The rate of the leaky bucket should be higher than the token bucket's but lower than the maximum rate of the network. This way, the excessive burstiness at the output of the token bucket is lessened and hence a better notion of shaping is achieved.


Dual Token Bucket

The Committed Information Rate, or CIR, in a network is the average bandwidth for a virtual circuit guaranteed by an ISP to work under normal conditions. At any given time, the bandwidth should not fall below this committed figure. The number of bits or bytes that is sent during a timing interval is called the Committed Burst (Bc); the timing interval is written as Tc. For example, consider that you have a physical line rate of 128 kbps, but the CIR is only 64 kbps (Figure 3.20). Also consider that there are eight timing intervals in a second (that is, Tc = 1/8 of a second = 125 ms), and during each of those timing intervals, 8000 bits (that is, the committed burst parameter) are sent at line rate. Therefore, over the period of a second, 8000 bits were sent (at line rate) eight times, for a grand total of 64,000 bits per second, which is the CIR.

However, if all the Bc bits (or bytes) were not sent during a timing interval, you have an option to "bank" those bits and use them during a future timing interval. The parameter that allows this storing of unused potential bandwidth is called the Excess Burst (Be) parameter. The Be parameter in a shaping configuration specifies the maximum number of bits or bytes that can be sent in excess of the Bc during a timing interval, if those bits are indeed available. For those bits or bytes to be available, they must have gone unused during previous timing intervals. Policing tools, however, use the Be parameter to specify the maximum number of bytes that can be sent during a timing interval. Therefore, in a policing configuration, if the Bc equals the Be, no excess bursting occurs. If excess bursting does occur, policing tools consider this excess traffic as exceeding traffic. Traffic that conforms to (that is, does not exceed) the specified CIR is considered by a policing tool to be conforming traffic. As part of your policing configuration, you can specify what action to take when traffic conforms to the CIR and what other action to take when the traffic exceeds the CIR.

Figure 3.20: Illustrating CIR


The relationship between Tc, Bc, and CIR is given by the following formula:

CIR = Bc / Tc

Alternatively, the formula can be written as follows:

Tc = Bc / CIR

Therefore, if you want a smaller timing interval, you could configure a smaller Bc.

Figure 3.21: Single-Rate dual token bucket

In a dual token bucket, two buckets exist. The first bucket has a depth of Bc, and the second bucket has a depth of Be. If a packet can be forwarded using bytes in the Bc bucket, it is said to be conforming. If the packet cannot be forwarded using the bytes in the Bc bucket, but it can be forwarded using the bytes in the Be bucket, it is said to be exceeding. If the packet cannot be forwarded using either of the buckets individually, it is said to be violating. These packet markings could be reflected in the shaper flag tables which the scheduler maintains per queue and per group; the scheduler responds accordingly, considering the severity of the flag set.
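A small sketch of the single-rate dual-bucket decision just described (token counts are illustrative; in the actual design the marking would drive the shaper flags mentioned above):

```python
def mark_packet(pkt_len, bc_tokens, be_tokens):
    """Single-rate dual token bucket: classify a packet and spend tokens.
    Returns (marking, bc_tokens, be_tokens)."""
    if pkt_len <= bc_tokens:                        # fits in the committed (Bc) bucket
        return "conform", bc_tokens - pkt_len, be_tokens
    if pkt_len <= be_tokens:                        # fits in the excess (Be) bucket
        return "exceed", bc_tokens, be_tokens - pkt_len
    return "violate", bc_tokens, be_tokens          # fits in neither bucket

bc, be = 1000, 800                                  # illustrative token counts in bytes
for length in (600, 700, 700):
    mark, bc, be = mark_packet(length, bc, be)
    print(length, mark, bc, be)                     # prints conform, exceed, violate in turn
```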

Dual-Rate Dual Token Bucket

Instead of using a single-rate dual token bucket, dual-rate shaping / policing can be done. With dual-rate policing, you still have two token buckets. The first bucket is the Committed Information Rate (CIR) bucket, and the second bucket is the Peak Information Rate (PIR) bucket. These buckets are replenished with tokens at different rates, with the PIR bucket being filled at a faster rate. When a packet arrives, the dual-rate shaper/policer checks to see whether the PIR bucket has enough tokens to send the packet. If there are not sufficient tokens, the packet is said to be violating, and it is discarded. Otherwise, the shaper checks to see whether the CIR bucket has enough tokens to forward the packet. If the packet can be sent using the CIR bucket's tokens, the packet is conforming. If the CIR bucket's tokens are not sufficient, but the PIR bucket's tokens are sufficient, the packet is said to be exceeding, and the exceed action (for example,


transmit with a DSCP value of AF11) is applied.

Figure 3.22: Dual-Rate Dual Token Bucket
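A matching sketch for the dual-rate case of Figure 3.22, checking the PIR bucket first and then the CIR bucket (token counts are illustrative; on conform, both buckets are charged):

```python
def dual_rate_mark(pkt_len, cir_tokens, pir_tokens):
    """Dual-rate policer: check the PIR bucket first, then the CIR bucket.
    Returns (marking, cir_tokens, pir_tokens)."""
    if pkt_len > pir_tokens:                       # not even the peak bucket can cover it
        return "violate", cir_tokens, pir_tokens   # e.g. drop the packet
    if pkt_len <= cir_tokens:                      # covered by the committed bucket
        return "conform", cir_tokens - pkt_len, pir_tokens - pkt_len
    return "exceed", cir_tokens, pir_tokens - pkt_len  # e.g. remark to AF11 and send

cir, pir = 1000, 3000                              # illustrative token counts in bytes
for length in (800, 800, 800, 800):
    mark, cir, pir = dual_rate_mark(length, cir, pir)
    print(length, mark, cir, pir)                  # conform, exceed, exceed, violate
```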


Part II

SYSTEM DESIGN


Chapter 4

Motivation For the Proposed Design

In the following sections, certain topics are briefly discussed which led to the design being proposed in the form described later in this document.

4.1 Downstream Traffic Monitoring

To provide QoS guarantees, either the upstream traffic or the downstream traffic could be monitored. Upstream traffic here refers to traffic flowing from the LAN users to the Internet, while downstream traffic refers to the traffic in the opposite direction. Monitoring upstream traffic does not provide as many details as monitoring the incoming / downstream traffic would. It is also observed that downstream bandwidth is a more premium resource than upstream bandwidth. Hence it has been decided that the downstream traffic will be monitored. Monitoring downstream also provides the device with the future scope of having an HTTP header look-up based Packet Classifier.

4.2 TCP Window Manipulation

TCP window manipulation is a technique wherein the advertised window size is modified so that the receiver of the packet slows down. However, modifying the contents of any packet incurs the overhead of recomputing checksums at the TCP and Ethernet levels. Also, from our experiments with TCP, we can conclude that communication generally never happens at the advertised window size; it is almost always limited by the congestion window, at about 3 to 5 times the MSS (Maximum Segment Size). Hence changing the advertised window is not incorporated in the design.

4.3 HTTP Header Look-up

In the current scenario, a lot of network traffic is HTTP based: for instance, audio streaming, video streaming, online chats, video chats, etc. Hence, merely classifying the traffic as HTTP and enacting some policy on it, without going ahead with further inspection, will not yield accurate results.


In order to distinguish more precisely between audio, video, file traffic, etc., a deep packet inspection will have to be done. A study of HTTP headers reveals that there could be a way to distinguish between various kinds of traffic: by inspecting the keyword that follows the Content-Type keyword inside the HTTP header. The keywords following Content-Type could be audio, video, text, gif, jpeg, application/*, multipart, etc., and each of these translates to some class of traffic. Hence, if some efficient pattern matching algorithm could be employed to look for these keywords, then the packet classification would be more complete. However, since there is no proper standard for traffic types to follow, and since the keywords to look for do not appear at a fixed location in the header, it becomes prohibitive to implement this in hardware.

4.4 TCP vs UDP

TCP is a reliable, connection-oriented protocol. Hence, whenever a TCP segment is dropped, the segment will be retransmitted and the rate of transmission is lowered. On the other hand, when a UDP segment is dropped, it is just lost and the sender has no knowledge of it (due to the lack of acknowledgement). Under such circumstances, the application running over UDP could initiate a lot of retransmissions, which consumes a lot of bandwidth. Hence, in a congested scenario, UDP, if not curtailed, would end up acquiring the entire bandwidth and TCP communication would not happen. It is therefore very essential to rate-limit UDP traffic.

4.5 Privileged Users

In an institution, a few users need to be given prime importance and their traffic should be treated with utmost priority; hence they should be given a superior QoS. Traffic directed to privileged users can be identified by the destination address in the IP header.

4.6 Control Frames

A few packets are not meant for carrying data. They are necessary to facilitate communication in a layered environment and also to facilitate network maintenance. These packets are to be given the highest priority, even higher than that of the privileged users. Delaying these control packets may make the network unstable and hence oscillatory.


Chapter 5

Algorithms for the Design

In this chapter, all the algorithms that shall be used in the project are discussed in the sections below.

5.1 Packet Classification

The objective of classifying a packet is to assign a priority to every packet and hence a queue number for it. The Packet Classifier comes up with a queueNum for the packet that was just processed. This queueNum identifies the queue into which the packet shall actually be enqueued inside the DDR2 memory. As has been mentioned earlier, the queue numbers increase in decreasing order of priority, i.e. queue 0 has the highest priority over all the other queues, queue 1 is the next higher priority queue, while queue 31 is the lowest priority queue. The higher priority queues will typically store those packets which need to be serviced with the utmost importance; examples include ARP and RARP packets. Such factors come into play in the algorithm which the Packet Classifier follows to form the queueNum. The algorithm can be explained as follows: at every level of packet parsing, i.e. Layer 2, Layer 3 etc., a number or weight is decided upon. As parsing proceeds to higher levels, all the weights generated thus far are added up to yield a queueNum Partial. Simultaneously, a groupNum is decided upon too; this, when prefixed to the queueNum Partial, forms the final queueNum.

5.1.1 Layer 2 Parsing

Layer 2 parsing commences by extracting the 2-byte Type field that follows the Source MAC Address in the Layer 2 frame. The value in the Type field corresponds to the various higher layer protocols encapsulated in the Layer 2 frame. The values in these 2 bytes could be any of those tabulated in Table 5.1.



Table 5.1: Values in EtherType

Value in Type    Description
0x0800           Carries IPv4
0x0806           Carries ARP
0x8035           Carries RARP
0x8100           VLAN-Tagged (802.1Q) Frame
0x9100           VLAN-Double Tagged (802.1Q-in-Q) Frame
0x9200           VLAN-Double Tagged (802.1Q-in-Q) Frame
0x9300           VLAN-Double Tagged (802.1Q-in-Q) Frame
0x88A8           VLAN-Double Tagged (802.1ad) Frame
0x0 - 0x5DC      Length in bytes of the frame (802.3 format)

• If the value in Type turned out to be 0x0800, 0x0806 or 0x8035, it means it is an Ethernet II frame carrying an IP packet, an ARP packet or a RARP packet respectively within it.
• If the value turned out to be any number between 0x0 and 0x5DC, then it is the length of the frame in bytes. Thus, the frame is in 802.3 format and the EtherType appears 8 bytes later, after the LLC-SNAP header.
• If the value was 0x8100, this particular frame is VLAN (Virtual LAN) tagged, wherein the VLAN Id of 2 bytes width has been inserted into the header. Hence, further parsing is done on such frames to extract the 2 bytes following the VLAN Id. The values in these 2 bytes are inspected to determine the upper layer protocol, which could correspond to IPv4, ARP or RARP.
• If the value was 0x9100, 0x9200, 0x9300 or 0x88A8, the particular frame is double VLAN tagged. Hence, the extraction pointer is moved ahead by 4 bytes to position itself to extract the EtherType, and parsing proceeds as mentioned earlier.

The ARP and RARP containing frames are to be treated as high priority frames because of the nature of the data that they carry. The frames that carry IPv4 will be treated as the next most important frames. In order to generate a queueNum wherein the lower the queueNum, the higher the priority, the weight generated for ARP and RARP containing frames shall be the lowest possible, while the weight generated for IPv4 containing frames shall be slightly higher. Along the same lines, the weight generated for frames holding a Type value other than the ones listed above will be the highest possible. The weight generated is denoted as a2. It has been decided that the weight at this stage of parsing, a2, will be as per Table 5.2.


Table 5.2: Upper layer protocol and their weights

Upper layer protocol    weight, a2
ARP                     0
RARP                    0
IPv4                    1
Others                  15
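The Layer 2 stage is implemented in hardware, but a behavioural Python sketch of the same parse (following Tables 5.1 and 5.2) helps make the weight assignment concrete. The function name, offsets and example frame below are illustrative only, not the RTL.

```python
ETHERTYPE_WEIGHT = {0x0806: 0, 0x8035: 0, 0x0800: 1}   # ARP, RARP, IPv4 (Table 5.2)
VLAN_TAGS   = {0x8100}                                  # single 802.1Q tag
DOUBLE_TAGS = {0x9100, 0x9200, 0x9300, 0x88A8}          # Q-in-Q / 802.1ad tags

def layer2_weight(frame: bytes):
    """Return (a2, offset of the encapsulated header) for an Ethernet frame."""
    pos = 12                                            # Type field follows the MAC addresses
    etype = int.from_bytes(frame[pos:pos + 2], "big")
    if etype in DOUBLE_TAGS:                            # skip the outer tag, re-read the type
        pos += 4
        etype = int.from_bytes(frame[pos:pos + 2], "big")
    if etype in VLAN_TAGS:                              # skip the 802.1Q tag
        pos += 4
        etype = int.from_bytes(frame[pos:pos + 2], "big")
    if etype <= 0x05DC:                                 # 802.3 length field: the EtherType
        pos += 8                                        # appears 8 bytes later (LLC-SNAP)
        etype = int.from_bytes(frame[pos:pos + 2], "big")
    return ETHERTYPE_WEIGHT.get(etype, 15), pos + 2

# An untagged IPv4 frame: 6-byte dst MAC, 6-byte src MAC, EtherType 0x0800, padding
frame = b"\xff" * 6 + b"\xaa" * 6 + b"\x08\x00" + b"\x00" * 46
print(layer2_weight(frame))   # (1, 14): a2 = 1, encapsulated header starts at byte 14
```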

In case the Layer 2 frame contained ARP or RARP, the results of any further packet processing (i.e. higher layer processing) shall not be considered; so is the case with frames containing Others. The queueNum Partial will be generated based on the following equation:

a2 + a3 + a4 = queueNum Partial

where a3 and a4 are the weights that will be generated when packet parsing takes place at Layers 3 and 4. Hence, as per the above equation, for ARP and RARP frames the queueNum Partial generated will always be 0, because a3 and a4 will be zero due to no further packet processing for these frames. Similarly, for Others, the queueNum Partial shall be 15. For the rest, i.e. IPv4-containing frames, the queue number won't get decided until packet processing occurs at the upper layers too.

5.1.2 Layer 3 Parsing

At the next layer of packet processing, which is Layer 3, the fields of interest extracted from the IPv4 header are the Destination IP Address and the Protocol field. The Destination IP Address will be subjected to a CAM lookup to determine whether it is a privileged IP address or not; the CAM would have been initialized with a list of privileged IP Addresses. The result of this lookup also determines the groupNum, which is explained in Section 5.1.4. Following this, the value in the Protocol field, which is one byte wide, decides the weight a3 to be generated. This a3 will aid in the generation of the queueNum Partial. It has been decided that the weight at this stage of parsing, a3, will be as per the following table.

Table 5.3: Values in Protocol field of IP

Value in Protocol field    Corresponding Layer 4 Protocol    weight, a3
0x01                       ICMP                              0
0x02                       IGMP                              0
0x32                       ESP                               1
0x33                       AH                                1
0x73                       L2TP                              1
0x38                       TLS                               1
0x06                       TCP                               2
0x84                       SCTP                              2
0x11                       UDP                               3
0x88                       UDP Lite                          4
–                          Others                            13

Observations from Table 5.3 are as follows:
• The ICMP and IGMP packets will avail a queueNum Partial of 1, because a2 + a3 = 1 + 0 = 1. Hence, they fall into the highest priority queue among the queue numbers generated at this stage of packet processing.
• The next set of protocols being given higher preference are ESP (Encapsulating Security Payload) and AH (Authentication Header), which are constituents of the IPSec framework. At the same preference level are the packets using L2TP (Layer 2 Tunneling Protocol) as well as the packets using the TLS (Transport Layer Security) protocol. All of these will attain a queueNum Partial of a2 + a3 = 1 + 1 = 2.
• Note that for all the above protocols, i.e. ICMP, IGMP, ESP, AH and TLS, further packet processing is not done, and hence queue number computation for these ends at this stage itself.
• TCP and SCTP (Stream Control Transmission Protocol) packets are next in line of preference. TCP is widely used in the Internet, while SCTP is very similar to TCP and shall soon find itself being used in TCP-like implementations. Hence, these are given the same weight of a3 = 2. The packets using the SCTP protocol will get enqueued into queues partially numbered as 3, because of the equation a2 + a3 = 1 + 2 = 3. However, the packets using the TCP protocol will be subjected to further packet processing, and hence the queue number is not completely determined yet at this stage.
• Packets belonging to the UDP protocol are the next preferred. They will be subjected to further packet processing as well. However, packets following the UDP Lite protocol will not be further processed and instead will be given a weight at this stage of a3 = 4. For all unsupported and less relevant protocols, the weight added would be a large number (i.e. a3 = 13) so that the queueNum Partial turns out to be 14 for them. Later, in conjunction with the groupNum, the final queueNum would be computed, which will result in a queueNum of either 14 or 30.


5.1.3 Layer 4 Parsing

This stage of packet parsing is done on only two kinds of packets, i.e. TCP and UDP; these are the most widely used protocols over the Internet, hence this choice. At this stage, the field in the packet header of primary interest is the Source Port Number. The value in this field shall determine the weight a4 that would be added to the results of the previous stages of packet parsing to yield the final queueNum Partial. The weight a4 could be picked up from a CAM which would have been initialized by the device administrator; this provides for user configurability. The various values in the Source Port Number field and the corresponding default weights are tabulated below.

Table 5.4: Port Numbers and their weights

Source Port Number        Upper layer protocol                                                          Default weight, a4
20, 21                    FTP (File Transfer Protocol)                                                  4
22                        SSH (Secure Shell)                                                            1
23                        Telnet Protocol                                                               1
25                        SMTP (Simple Mail Transfer Protocol)                                          2
43                        Whois Protocol                                                                11
53                        DNS (Domain Name Service)                                                     0
80                        HTTP (Hypertext Transfer Protocol)                                            3
109                       POP2 (Post Office Protocol 2)                                                 2
110                       POP3 (Post Office Protocol 3)                                                 2
123                       NTP (Network Time Protocol)                                                   0
143                       IMAP (Internet Message Access Protocol)                                       2
194                       IRC (Internet Relay Chat)                                                     5
220                       IMAP version 3                                                                2
443                       HTTPS (HTTP Secure)                                                           3
465                       SMTP Secure                                                                   2
554                       RTSP (Real Time Streaming Protocol)                                           6
593                       Remote Procedure Call over HTTP                                               11
989, 990                  FTPS (FTP Secure)                                                             4
992                       Telnet Secure                                                                 1
993                       IMAP Secure                                                                   2
995                       POP3 Secure                                                                   2
1214                      Kazaa P2P File-Sharing Port                                                   8
1234                      VLC Media Player - Streaming Port                                             6
1293                      IPSec                                                                         1
1512                      WINS (Windows Internet Name Service)                                          0
1701                      L2F (Layer 2 Forwarding) & L2TP (Layer 2 Tunneling Protocol)                  1
1719, 1720                H.323                                                                         7
1723                      PPTP (Point-to-Point Tunneling Protocol)                                      1
1755                      Windows Media TCP Unicast Stream Port                                         6
1863                      Windows Live Messenger Port                                                   5
3128                      HTTP Squid-cache Port                                                         3
5004, 5005                RTP (Real-time Transport Protocol)                                            6
5050                      Chat Port for Yahoo! Messenger                                                5
5060                      SIP (Session Initiation Protocol)                                             7
5061                      SIP Secure                                                                    7
5100                      Webcam port for Yahoo! Messenger                                              5
5190                      AOL Instant Messenger Port                                                    5
5222, 5223, 8010          Xtensible Messaging Protocol Port - used by many IM clients including GTalk   5
6881                      Most common port for BitTorrent clients                                       8
8008, 8080, 8081, 8090    HTTP Alternate Port                                                           3
23399                     Skype Port (Default)                                                          7
–                         All the rest                                                                  11

The default weights, in ascending order (and hence decreasing order of priority), are assigned to the protocols as follows:
• DNS, NTP, WINS
• SSH, Telnet, IPSec, L2F & L2TP, PPTP
• SMTP, POP2, POP3, IMAP, IMAPv3, SMTPS, IMAPS, POP3S
• HTTP, HTTPS, ports numbered 3128, 8008, 8080, 8081, 8090
• FTP, FTPS
• IRC and ports numbered 1863, 5050, 5100, 5190, 5222, 5223, 8010
• VLC Streaming Port, Windows Media Streaming Port, RTP, RTSP
• SIP, SIPS, Skype, H.323
• Kazaa, BitTorrent
• WHOIS, RPC-over-HTTP and the rest of the protocols

By default, higher priority has been given to email clients over web browsers, to browsers over chat clients, and to chat clients over streaming multimedia clients. BitTorrent applications are of lower priority. All the other unsupported protocols default to the highest weight and hence the lowest possible priority. For TCP and UDP packets, the partial queue number is obtained as the sum of all 3 weights that were generated at the 3 stages of packet parsing. Mathematically,

a2 + a3 + a4 = queueNum Partial


Since, at this stage, for such packets a2 = 1 and a3 = 2 or 3, the user-configurable weight a4 shall be a number between 0 and 11 only; that is because the queueNum Partial can assume a maximum value of 15 only. If it so happens that this packet does not belong to the Privileged Group, then a bit '1' will be prefixed to the partial queue number to yield the final queue number. However, if it does belong, then bit '0' would be prefixed to the partial queue number, which leaves its numeric value unchanged.

5.1.4 Algorithm to generate the groupNum

The groupNum is to be generated in parallel with the queueNum Partial. The concatenation of the groupNum with the queueNum Partial yields the actual queueNum. There shall be three groups of queues in total. Group 0 would be called the Control Group and shall hold queues of packets which typically are Control packets such as ARP, RARP, ICMP etc.; this group shall consist of only two queues, numbered 0 and 1. The next group would be called the Privileged Group, which shall contain queues numbered from 2 to 15; it holds queues of packets whose destination IP Address was listed among the privileged IP Addresses. The final group would be called the Non-Privileged Group, which shall contain queues numbered from 16 to 31; it holds queues of packets whose destination IP Address did not figure in the list of privileged IP Addresses. The three groups listed above would be numbered 0, 1 and 2 respectively, and this number is called the groupNum. Following is the groupNum for the various kinds of packets that will be encountered during processing.

Table 5.5: Possible values of groupNum

Packet/Frame Type    groupNum
ARP, RARP            0
Others Frame Type    2
IPv4                 0 – if IP Protocol == ICMP or IGMP
IPv4                 1 – if privileged IP Address is present
IPv4                 2 – if privileged IP Address is absent
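Putting Sections 5.1.1–5.1.4 together, the queueNum composition can be sketched in software as below. This is an illustrative model of the algorithm as described (a 4-bit partial number prefixed by one bit for the data groups); the helper name and example weights are not taken from the design files.

```python
def final_queue_num(group_num, a2, a3=0, a4=0):
    """Compose the 5-bit queueNum from the group and the partial queue number.
    Group 0 -> queues 0-1, group 1 -> 2-15 (prefix '0'), group 2 -> 16-31 (prefix '1')."""
    partial = min(a2 + a3 + a4, 15)         # queueNum_Partial is at most 15
    if group_num == 0:                      # control traffic maps onto queues 0 and 1
        return min(partial, 1)
    prefix = 0 if group_num == 1 else 1     # privileged vs non-privileged
    return (prefix << 4) | partial

print(final_queue_num(0, 0))          # ARP / RARP                       -> queue 0
print(final_queue_num(0, 1, 0))       # ICMP / IGMP                      -> queue 1
print(final_queue_num(1, 1, 2, 3))    # privileged HTTP over TCP         -> queue 6
print(final_queue_num(2, 1, 13))      # non-privileged unsupported proto -> queue 30
```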

5.2 Scheduling

In the design, the packets are stored in the form of queues. The queues, which are 32 in number, are divided into Queue-Groups. There are 3 groups, namely:
• Control Group


• Privileged Group
• Non-Privileged Group

Priority-wise, the Control Group scores over the Privileged Group, and the Privileged Group scores over the Non-Privileged Group. The Control Group consists of 2 queues, numbered 0 and 1. These queues will consist of packets following the ARP, RARP, ICMP protocols etc., which are typically called Control Packets. These packets are to be treated with utmost importance, and hence they are given the highest priority. Following the Control Group are the two groups called Privileged Group and Non-Privileged Group. These two groups contain all non-control (data) packets destined to the users. The deciding factor, which determines whether a packet gets enqueued in the Privileged Group or not, is the presence of a privileged IP Address in the destination IP Address field of the IP packet header. A set of IP Addresses belonging to a certain class of users in a network are classified as privileged ones to provide QoS guarantees to the users owning them. The Privileged Group contains 14 queues of packets, globally numbered from 2 to 15. The Non-Privileged Group contains 16 queues of packets, globally numbered from 16 to 31.

The scheduler works as follows:
• The scheduler first decides upon a group among the 3 groups.
• In the group selected, the scheduler then selects one among the several queues.
• For deciding upon a group, the scheduler follows a simple round-robin algorithm. Every time a scheduling round begins, the group selected shall be the next in succession to the group selected in the previous scheduling round.
• However, that particular group shall only be selected if the GroupEmpty flag is not set for it. This GroupEmpty flag is generated as an AND of all the QueueEmpty flags, one of which is maintained per queue, over all the queues belonging to that group.
• Once the group is selected, the scheduler will follow the Weighted Round Robin algorithm to decide among the queues in that group.

A note on the WRR algorithm: Weighted round robin (WRR) is a simple modification to the round robin algorithm. Instead of serving a single packet from a queue per turn, it serves n packets. Here n is adjusted to allocate a specific fraction of link bandwidth to that queue. Each flow is given a weight that corresponds to the fraction of link bandwidth it is going to receive. The number of packets to serve in one turn is calculated from this weight and the link capacity. WRR works fine with fixed-size packets, such as in ATM networks. However, WRR has difficulty in maintaining bandwidth guarantees with variable-size packets (the Internet). The problem with a variable-size packet is that flows with large packets will receive more than the allocated weight. In order to overcome this problem, the WRR server needs to know the mean packet size of sources a priori. Short-term fairness is another problem encountered by WRR: on a small time scale, WRR does not meet the fairness criteria, since some flows may transmit more than others. The other advantage of WRR is that we can provide differential treatment for the queues. The inadequate ability of the WRR algorithm to handle variable-sized packets efficiently will not be much of a problem for the project, since most of the packets will be of full frame size.


Also, differential treatment of the queues can easily be given with WRR. Hence, this is the algorithm of choice for the scheduler in the design. Once a queue is chosen for dequeue (based on the round-robin order), a certain number of packets will be read out from this queue, the number being proportional to the weight assigned to that queue. The next queue will then be serviced based on its weight, and so on. In the architecture, since a lower queue number represents a queue of higher priority, the weight assigned to a queue is tied to its queue number, with the lower-numbered (higher-priority) queues receiving the larger weights.
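A behavioural sketch of this two-level policy (plain round robin over the groups, weighted round robin within the selected group) is given below. It is a simplified software model only; the group names, weights and packet labels are illustrative, and the real scheduler additionally consults the shaper flags.

```python
from collections import deque
from itertools import cycle

class TwoLevelScheduler:
    """Round-robin among groups, weighted round-robin among the queues of a group.
    'groups' maps a group name to a list of (queue, weight) pairs."""
    def __init__(self, groups):
        self.groups = groups
        self.order = cycle(groups)            # simple round robin over the groups

    def next_burst(self):
        for _ in range(len(self.groups)):     # skip groups whose queues are all empty
            name = next(self.order)
            if any(q for q, _ in self.groups[name]):
                burst = []
                for q, weight in self.groups[name]:
                    for _ in range(weight):   # serve up to 'weight' packets per turn
                        if q:
                            burst.append(q.popleft())
                return name, burst
        return None, []

sched = TwoLevelScheduler({
    "control":        [(deque(["arp0"]), 1)],
    "privileged":     [(deque(["p2a", "p2b", "p2c"]), 3), (deque(["p3a"]), 2)],
    "non_privileged": [(deque(["n16a", "n16b"]), 2)],
})
for _ in range(4):
    print(sched.next_burst())   # control, privileged, non-privileged, then nothing left
```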

5.3 Shaping

The shaper works in conjunction with the scheduler and is responsible for performing the traffic shaping using a suitable algorithm. The algorithm chosen in the design is the classical Token Bucket Algorithm. A note on the Token Bucket Algorithm follows.

Figure 5.1: Figure depicting the Token Bucket algorithm

In this algorithm, a bucket holds tokens, generated by a clock at the rate of one token every ∆T sec. In Fig. 5.1(a) we see a bucket holding three tokens, with five packets waiting to be transmitted. For a packet to be transmitted, it must capture and destroy one token. In Fig. 5.1(b) we see that three of the five packets have gotten through, but the other two are stuck waiting for two more tokens to be generated. The token bucket algorithm provides a different


kind of traffic shaping than that of the leaky bucket algorithm. The leaky bucket algorithm does not allow idle hosts to save up permission to send large bursts later. The token bucket algorithm does allow saving, up to the maximum size of the bucket, n. This property means that bursts of up to n packets can be sent at once, allowing some burstiness in the output stream and giving faster response to sudden bursts of input. Another difference between the two algorithms is that the token bucket algorithm throws away tokens (i.e., transmission capacity) when the bucket fills up but never discards packets, whereas the leaky bucket algorithm discards packets when the bucket fills up. Here, too, a minor variant is possible, in which each token represents the right to send not one packet but k bytes. A packet can only be transmitted if enough tokens are available to cover its length in bytes; fractional tokens are kept for future use. The implementation of the basic token bucket algorithm is just a variable that counts tokens. The counter is incremented by one every ∆T and decremented by one whenever a packet is sent. When the counter hits zero, no packets may be sent. In the byte-count variant, the counter is incremented by k bytes every ∆T and decremented by the length of each packet sent. A potential problem with the token bucket algorithm is that it allows large bursts again, even though the maximum burst interval can be regulated by careful selection of ρ and M. It is frequently desirable to reduce the peak rate, but without going back to the low value of the original leaky bucket.

In the design, traffic shaping is done on a per-queue basis. A token bucket is maintained per queue, whose token bucket rate and maximum burst threshold shall be programmable. Once a packet is read out of a queue, the packet length in bytes is subtracted from the corresponding token bucket. If the tokens in the bucket are now non-positive, no more packets can be read out of this queue for a while, until the tokens re-accumulate in the token bucket, at which time the queue again becomes eligible for dequeue. This is conveyed to the scheduler by means of certain set and reset flags.

Since there are only 3 groups, of which one is a control group while the other two are data groups, the idea of group shaping proposed earlier is no longer being followed up. It becomes redundant because each queue is being individually shaped, which is sufficient for the overall traffic shaping, since ultimately all traffic ends up in queues. Moreover, since the queues are serviced in Weighted Round Robin fashion, if it so happens that for a while a major chunk of the bandwidth is taken up by packets belonging to a single queue, the Group Token Bucket may run out of its quota of tokens. This means that, had there been no Group Token Bucket, the next queue could be dequeued, but with such a bucket, the other queues would have to starve for a while through no fault of theirs. Hence group shaping is not done in the design; instead, shaping relies solely on individual per-queue traffic shaping using the token buckets.
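A minimal model of the per-queue shaper described above (rate and burst values are illustrative; in hardware the eligibility flag corresponds to the set/reset indication given to the scheduler):

```python
class QueueShaper:
    """Per-queue token bucket: rate and burst threshold are programmable.
    A queue is eligible for dequeue only while its token count is positive."""
    def __init__(self, rate_bytes_per_tick, burst_bytes):
        self.rate = rate_bytes_per_tick
        self.burst = burst_bytes
        self.tokens = burst_bytes

    def tick(self):
        # replenish tokens, never beyond the configured burst threshold
        self.tokens = min(self.burst, self.tokens + self.rate)

    def dequeue(self, pkt_len):
        # the packet length is subtracted after the packet is read out
        self.tokens -= pkt_len

    def eligible(self):
        # flag reported to the scheduler (cleared while tokens are non-positive)
        return self.tokens > 0

shaper = QueueShaper(rate_bytes_per_tick=1500, burst_bytes=3000)
shaper.dequeue(1500); shaper.dequeue(1500); shaper.dequeue(1500)
print(shaper.eligible())   # False: the queue must wait for tokens to re-accumulate
shaper.tick(); shaper.tick()
print(shaper.eligible())   # True again after two replenishment ticks
```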


Chapter 6

Proposed Architecture

6.1 Introduction

Bandwidth managers have been in use for a long time, most of them implemented in software. These designs are inherently serial in nature and hence are not suitable for high-speed applications. This problem was not so pronounced until recently, when communication speeds became so high that software running on processors fails to cater to these needs. Hence a hardware solution is needed, whose memory requirements and speed should scale to current demands. In this project, an architecture is proposed whose memory requirements do not increase with the number of users in the network or the bandwidth of communication.

6.1.1 Flow-based

Packet classification and categorization involve a lot of computation. If this is done for every packet passing through the network, the computation overhead limits the speed of operation. Hence, classification and categorization are done only once and the 4-tuple is remembered; this makes the subsequent operations just a look-up. However, this architecture suffers from the problem of memory explosion. In a scenario where the number of connections is high (due to an increased number of users and applications) and where the network bandwidth is high, such a solution handicaps the network.

6.1.2 Class-based

Class-based classification is stateless at its highest level of operation. That is, every packet that traverses the network is subjected to inspection and classification, and there is no need to maintain any information regarding the connections that were established. This requires an optimized hardware implementation because of the high processing overhead. However, the problem becomes one-dimensional, as the architecture no longer depends on the number of users using the network and depends only on the network bandwidth.


6.1.3 Proposed Architecture

The proposed architecture follows the guidelines below.
• The architecture is class based.
• The design of the packet classifier is critical and hence the algorithm is chosen carefully.
• The sizes of the queues are selected based on the priority of the traffic class and the operating bandwidth.
• The number of memory accesses required dictates the throughput, and hence care has been taken to keep the number of memory accesses to a minimum.

6.2 Overview of the Proposed Architecture

Figure 6.1: Block schematic of the proposed architecture

The traffic found at the point of the network where the device is installed consists of 2 types: uplink traffic (from the LAN to the Internet) and downlink traffic (from the Internet to the LAN). It has been observed that in large LANs connected to the Internet, the uplink traffic volume is much lower compared to the downlink traffic volume. Hence, controlling the downlink traffic provides us with better control of the QoS.


The proposed architecture is of the store-and-forward type. The MAC core accepts packets from the physical layer and provides them on a 64-bit output datapath. A packet entering the device is stored in a queue, and its exit is decided by the scheduling decisions taken by the scheduler. The burst rates of the output as specified by the SLA (Service Level Agreement) can be honoured by having a traffic shaper which controls the output of the device; the burst rate of each type of traffic can be configured using the traffic shaper for each queue. To accommodate the input burst, an input packet accumulator is used. The entries in the queues are maintained by the Queue Manager, while the most important function of parsing and classifying the packets into different queues is taken care of by the Packet Classifier. The queues are formed within the DDR2 SDRAM memory; hence a DDR2 controller is needed to perform reads and writes into the memory. The block schematic of the proposed architecture is shown in Figure 6.1.

The major advantage of such a design is that every block operates independently. The intercommunication depends on the handshake signals provided by the interfaced blocks, which allows each block to operate independently at high speed. Also, the clock resources are used intelligently so that the design fits into the NetFPGA board (FPGA - Virtex2P50 FF1152 with speed grade of -7).

6.3 Packet Accumulator

The Packet Accumulator lies at the output interface of the Rx queue0. The I/O diagram of the Packet Accumulator is shown in Figure 6.2. Basically, this block makes use of a sufficiently large FIFO which acts as a placeholder to hold

the packets temporarily before they are read out subsequently to be written into the DDR2.

Figure 6.2: Packet Accumulator

The functionality could be described as follows:

1. The packets trickling in at the rate of 8 bytes (from the input bus data_out_rxq0[63:0]) per clock cycle of the operating frequency start accumulating in a FIFO where each


entry is 8 bytes wide. The start of a new packet is identified by the encoding followed on the input bus cntrl_out_rxq0[7:0].

2. When an End of Frame is encountered (again identified by the encoding on cntrl_out_rxq0[7:0]), the FIFO write pointer is adjusted so that the next write (which would mean the next Ethernet frame) happens in the next location inside the FIFO and not in the same location where the earlier frame ended.

3. A few clock cycles after the Start of Frame is sensed, a signal is conveyed to the Packet Classifier which triggers processing on this current packet in steps, slowly reading out the packet in successive reads from the enqueue FIFO.

4. The FIFO write pointer rolls around when the depth of the FIFO is hit, and the write continues smoothly. It is assumed that the FIFO is so deep that when this pointer roll-over occurs, the subsequent location into which the write will occur would already have been read out by the MWC; hence, no loss of data occurs.

5. In addition to the interface to the Packet Classifier, there exists an interface to the MWC as well, which is used by the latter to read the packets out from the enqueue FIFO inside the Packet Accumulator. It is to be noted that the Packet Classifier and the MWC maintain their own read pointers, which are sent to the enqueue FIFO to read the packet at the correct location. This is necessary since the Packet Classifier and the MWC will be operating on different locations and different packets altogether.

6. Hence, there is an arbiter in the logic inside the Packet Accumulator which arbitrates among the read requests coming in from the MWC and the Packet Classifier.

7. An important function performed by the Packet Accumulator is that it computes the packet length for all incoming packets. The same is communicated to the Packet Classifier when the latter requests the packet length using the pc_enqfifo_read_new_pkt signal. Since many packets accumulate in the Packet Accumulator, all the corresponding packet lengths are maintained in a tiny FIFO too. On every packet-length read request, this FIFO is accessed and its topmost entry is read out to drive the enqfifo_pc_pktlength bus. The data is validated by a corresponding signal called enqfifo_pc_pktlength_valid.

8. Also, the Packet Classifier starts reading out packets from the Packet Accumulator only after at least one packet has accumulated in it and its packet length has been computed. Hence, this status is communicated by the Packet Accumulator to the Packet Classifier via the enqfifo_pc_data_ready signal.
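A simplified software model of this block is sketched below. It keeps the data FIFO as a list of 8-byte words, a side FIFO of packet lengths, and separate read positions for the Packet Classifier and the MWC; wrap-around, the arbiter and the actual handshake signals are omitted, and all names are illustrative rather than taken from the RTL.

```python
from collections import deque

class PacketAccumulator:
    """Behavioural model: a word FIFO for packet data plus a side FIFO of packet lengths.
    The classifier and the MWC each keep their own read position into the word FIFO."""
    def __init__(self):
        self.words = []          # 8-byte words of all accumulated packets
        self.lengths = deque()   # one length entry per complete packet
        self.rd_pc = 0           # read position used by the Packet Classifier
        self.rd_mwc = 0          # read position used by the Memory Write Controller

    def write_packet(self, payload: bytes):
        # pad to a whole number of 8-byte words, since a new frame starts on a fresh entry
        padded = payload + b"\x00" * (-len(payload) % 8)
        self.words += [padded[i:i + 8] for i in range(0, len(padded), 8)]
        self.lengths.append(len(payload))

    def data_ready(self):
        return bool(self.lengths)          # at least one full packet has accumulated

    def read_word_pc(self):
        word = self.words[self.rd_pc]
        self.rd_pc += 1
        return word

    def read_word_mwc(self):
        word = self.words[self.rd_mwc]
        self.rd_mwc += 1
        return word

acc = PacketAccumulator()
acc.write_packet(b"\x00" * 60)
print(acc.data_ready(), acc.lengths[0], len(acc.words))   # True 60 8
```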

6.4 Packet Classifier

This block is a very important block in the overall design and is in charge of the packet classification based on traffic classes, which happens to be the crux of the project. The I/O diagram of the block is shown in Figure 6.3.

Figure 6.3: Interface Diagram of Packet Classifier

The functionality of the packet classifier could be explained using the following points:

1. The packet classifier starts processing the packets a few clock cycles after the packet starts to accumulate in the enqueue FIFO inside the Packet Accumulator. Typically,


the packet classifier can start packet processing only after at least one full packet has accumulated in the Packet Accumulator. Once enqfifo_pc_data_ready is sensed to be high, the classifier starts reading the packets in chunks. However, when it is about to read a new packet, it asserts the signal pc_enqfifo_readNewPkt so that the Packet Accumulator can respond with the packet length on enqfifo_pc_pktLength, validated by enqfifo_pc_pktlength_valid.

2. As it issues a read command to the FIFO in the Packet Accumulator, it keeps gathering the packet into its local buffer. In the process, it also extracts the necessary fields and saves them for future use.

3. The parsing progress is indicated by the setting of appropriate flags (like L2_parsing_done, L3_parsing_done, L4_parsing_done etc.) as the packet parsing proceeds from the Layer 2 frame onwards to the Layer 3 header and so on.

4. The fields of interest to us in the Layer 2 frame header are the EtherType or the Length field. Based on the value in the Type field, the parser finds out if the frame is actually an Ethernet frame or an 802.3 frame, or if it is a VLAN-tagged frame.

5. Once the parser moves on to the Layer 3 header (IP), it extracts fields like the Destination IP Address, Protocol, IPHeaderLength etc. Based on the value contained in the Protocol field, the L4 header type is learnt. In the meanwhile, the extracted parameters are all saved in local registers. The extraction of IPHeaderLength is necessary too, because it helps the parser determine the presence / absence of the Options field in the IP header.

6. The L4 packet could be of the type TCP, UDP, RTSP, ICMP etc. Only if the connection is of type TCP or UDP will the parser go ahead into the next stage, which is the further probing of the TCP / UDP Source Port. This step helps in finding out the nature of the application being carried as part of the payload. Based on the value in the Source Port Number, different weights are assigned. The weights could be user configurable or could be default assigned too.

7. Now that the parsing is complete and we have all the necessary fields extracted, the queueNum formation goes into its final stage. At every stage of parsing (L3, L4 etc.), a unique weight would be allocated to the packet depending on what it is actually turning out to be. For instance, after the L3 header was parsed, the parser could find that the L4

54/145

June 8, 2010

Chapter 6. Proposed Architecture

6.5. Queue Manager

header is ICMP. In such a case, this would be given a high priority and hence a lower weight. In the entire design, it is assumed that a higher queueNum translates to a lower priority. Also, if the L4 header was RTSP, it would be given the lowest possible weight. If the connection was UDP based, it would be given a very high weight as it is intended to be given the lowest priority. Any type detected which is not among the set of ICMP, TCP, UDP or RTP will be given a medium weight such that it translates to a queueNum of medium priority. 8. Once all the weights have been assigned at all levels of parsing, the same is summed up at the end, to get a number which is then encoded to yield a unique 5-bit wide queueNum. 9. This will be sent to the Queue Manager along with the packet length which would be computed during the parsing process itself. The signals used for communicating this are pc qm queueNum, pc qm pktlength validated by the signal pc qm enqueue. The group number which is determined in parallel is told to the QM too, over the signal pc qm groupNum. 10. Once the queue number is determined for a packet, the parser need not continue reading out the entire packet. Instead, it could move onto the next packet, because of which the necessity for the signal called pc enqfifo readNewPkt comes in. This signal is asserted along with the next pc enqfifo re and the packet accumulator responds by sending the next packet.
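
As a concrete illustration of steps 7 and 8 above, the following is a minimal Verilog sketch of the weight-summing and encoding step. The per-stage weight widths, the saturating sum and the module and port names are illustrative assumptions; the report only states that the stage weights are summed and encoded into a unique 5-bit queueNum, with a larger queueNum meaning a lower priority.

module queuenum_form_sketch (
    input  wire [3:0] l2_weight,    // weight decided while parsing the L2 header
    input  wire [3:0] l3_weight,    // weight decided while parsing the L3 header
    input  wire [3:0] l4_weight,    // weight decided from the TCP/UDP source port
    input  wire       priv_match,   // privileged-IP CAM match on the destination address
    output wire [4:0] queueNum,     // higher number = lower priority
    output wire [1:0] groupNum      // 0x1 = privileged, 0x2 = non-privileged
);
    // Sum the weights collected at every parsing stage.
    wire [5:0] sum = l2_weight + l3_weight + l4_weight;

    // Encode the sum into the 5-bit queue number space, saturating at queue 31.
    assign queueNum = (sum > 6'd31) ? 5'd31 : sum[4:0];

    // Control traffic (group 0x0) is assigned earlier in the parse; here only the
    // privileged / non-privileged split driven by the CAM match is shown.
    assign groupNum = priv_match ? 2'd1 : 2'd2;
endmodule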

6.5 Queue Manager

The Queue Manager is a significant block in the design of the Network Traffic Manager. It is responsible for maintaining the information about the queues of packets which are stored in the DDR2 memory. The I/O diagram of the QM is as shown in figure 6.4.

Figure 6.4: Interface Diagram of Queue Manager

The functionality of the queue manager can be described with the following steps:


1. Initially, the packet classifier, after due processing on a packet which was read out from the enqueue FIFO, comes up with a unique queueNum as well as a groupNum. This is conveyed to the QM over the signals pc qm queueNum and pc qm groupNum. The packet length is also informed to the QM over the bus pc qm pktlength.
2. The QM now has to account for this queue. So, it indexes into the QueueTailPointer table with queueNum and updates the Tail Pointer for that queue. The Tail Pointer is moved by a value equal to the packet length in bytes. Also, if it was the first packet in that queue, the QueueHeadPointer too is properly updated (a sketch of this update is given after this list). To begin with, the QueueHeadPointer and the QueueTailPointer would have been initialized to the beginning address of the memory segment that has been allocated to that queue.
3. Now, if the status of the queue changed from empty to non-empty, i.e. an enqueue happened into an earlier empty queue, then the scheduler is informed over the signals qm sch update, qm sch empty and qm sch queueNum. The signal qm sch groupNum is also appropriately driven with the group number to which the queueNum belongs.
4. The entire DDR2 memory space is statically divided amongst the 32 possible queues and hence the start and end addresses of the memory segments are known. So, allocation of memory (i.e. the update of the queue Tail and Head pointers) is done using a valid address in this range of memory addresses.
5. The QM regularly receives dequeue requests from the scheduler, which schedules the next queue for dequeue based on the scheduling algorithm that the latter follows. The scheduler issues the requests over the signals sch qm deq req and sch qm queueNum.
6. On receipt of such requests from the scheduler, the QM stores and processes the request. Processing the dequeue request here means that the QM looks up the QueueHeadPointer table with sch qm queueNum as the index, retrieves the Head Pointer, and drives it on the signal qm mrc deq raddr along with the signals qm mrc queueNum and qm mrc dequeue.
7. The MRC acknowledges the successful readout of a packet from a queue, along with the length of the packet read out, to the QM over the signals rmrc qm deq ack, mrc qm queueNum and mrc qm pktLength. When this happens, the queue Head Pointer for that queue is updated, i.e. incremented by the length of the packet that was just read out of the memory by the MRC.
8. Once the queue Head and Tail Pointers merge, it means that the queue is empty and hence the pointers are reset to the beginning of the memory address range meant for that queue. The status of the queue becoming empty is also informed to the scheduler over the interface to the latter, i.e. over the signals qm sch update, qm sch empty and qm sch queueNum.
9. The QM also issues write commands to the MWC in order to write into the DDR2 the packet that is so far lying inside the enqueue FIFO. In the process, it informs the MWC of the packet length too, so that the MWC knows how many bytes it needs to read from the enqueue FIFO.
10. In some cases, the QM may issue a drop signal to the MWC along with the pktLength. Such a scenario occurs when there is not sufficient space left in the DDR2 to accommodate another packet inside the same queue. This is essentially a DropTail behavior. The MWC then responds by not actually reading the packet out of the enqueue FIFO but by merely advancing its local enqueue FIFO read pointer by the packet length of the packet that is being dropped from the enqueue FIFO.
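
To make step 2 concrete, below is a minimal Verilog sketch of the tail-pointer update performed on an enqueue, assuming the 25-bit DDR2 addresses and 11-bit packet lengths used elsewhere in this report. The simple roll-over to the segment start and all port names are illustrative assumptions; in the real design the pointers live in the QueueTailPointer / QueueHeadPointer tables.

module qm_tail_update_sketch (
    input  wire        clk,
    input  wire        enqueue,           // pc qm enqueue, qualified by the space check
    input  wire [10:0] pktLength,         // pc qm pktlength
    input  wire [24:0] queue_start_addr,  // segment start for this queue
    input  wire [24:0] queue_end_addr,    // segment end for this queue
    input  wire [24:0] tail_ptr_in,       // current tail pointer read from the table
    output reg  [24:0] tail_ptr_out,      // updated tail pointer to write back
    output reg         tail_ptr_we
);
    always @(posedge clk) begin
        tail_ptr_we <= 1'b0;
        if (enqueue) begin
            // Advance the tail by the packet length; if the statically allocated
            // segment is exhausted, the packet is placed from the segment start
            // instead (roll-over policy assumed for illustration).
            if (tail_ptr_in + {14'd0, pktLength} <= queue_end_addr)
                tail_ptr_out <= tail_ptr_in + {14'd0, pktLength};
            else
                tail_ptr_out <= queue_start_addr + {14'd0, pktLength};
            tail_ptr_we <= 1'b1;
        end
    end
endmodule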

6.6 Scheduler

The main function of the scheduler in the design is to decide the next queue for dequeuing. The I/O block diagram of the scheduler is as shown in the figure 6.5:

Figure 6.5: The Interface diagram of the Scheduler Block

1. The scheduler operates in two steps. In the first step, one of the three queue groups is selected. In the second step, a decision is made among the queues in that queue group only.
2. The three queue groups are the Control queue group, the Privileged queue group and the Non-Privileged queue group. The privileged queue group has a higher priority than the non-privileged group, while the Control queue group has the highest priority among all the groups.
3. Hence, in the group scheduling round, the algorithm is as follows (a sketch of this round is given after this list):
• The scheduler follows a simple round-robin algorithm to decide upon a group. Hence, it simply checks whether the next group in succession is eligible for dequeue or not. The eligibility check here means that it first checks the groupEmpty flag for the group. If the group is empty, the scheduler moves on to the next succeeding group. If not, the scheduler selects this particular group.
• This is an extensible design, and hence, if we decide in future to support a larger number of queue groups, that can easily be done.


4. Once group scheduling is completed, the scheduler moves on to queue scheduling, wherein a weighted round robin algorithm is employed to select the next queue to be dequeued. This is done as follows:
• If the nextQueue is not empty and not outOfTokens, it shall be selected.
• Else, the scheduler moves on to the succeeding queue. To figure out whether a queue is empty or not, the scheduler maintains an empty bit per queue in a QueueEmptyTable.
• Also, a QueueOutOfTokens table is maintained with one entry per queue. A bit in this table is set, if the shaper says so, when it drives the signals shp sch queueoutOfTokens set and shp sch queueoutOfTokens set qNum. An already set bit in this table is reset when the shaper asserts the signals shp sch queueoutOfTokens reset and shp sch outOfTokens reset qNum.
5. Now that the queue is decided for dequeue, the scheduler makes a dequeue request for it to the QM. In the process, the scheduler drives the signals sch qm deq req and sch qm deq queueNum.
6. The scheduler is updated about the status of a queue changing from empty to non-empty (due to a packet enqueue) over the inputs qm sch update, qm sch queueNum and qm sch empty. The empty signal is kept low in this case.
7. The scheduler is also updated about the status of a queue changing from non-empty to empty (due to the dequeue of the last packet in this queue). However, in this case, the qm sch empty signal is driven high.
8. An important input to the scheduler is the signal coming from the MRC, namely mrc sch pause. Under normal circumstances, the scheduler keeps running at its own pace, issuing dequeue requests at a constant rate to the QM, which subsequently issues dequeue requests to the MRC. However, if it so happens that the MRC is slow in actually reading out the packets from the DDRC and driving them out, then the rate at which the MRC services the pending dequeue requests will be lower. Hence, the dequeue command FIFO in the MRC may fill up, in which case it can accept no more dequeue requests. In this case, the MRC informs the scheduler that it cannot accept dequeue requests for a while and that the scheduler should slow down. For this purpose, the MRC asserts the signal mrc sch pause.
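
The group-scheduling round of item 3 can be pictured with the small Verilog sketch below. The single-cycle decision, the wrap at three groups and all signal names are illustrative assumptions; the queue-level weighted round robin and the token checks of item 4 are not shown.

module group_rr_sketch (
    input  wire       clk,
    input  wire       rst,
    input  wire       step,          // advance to the next group-scheduling decision
    input  wire [2:0] group_empty,   // per-group empty flags: control, privileged, non-privileged
    output reg  [1:0] selected_group,
    output reg        selection_valid
);
    reg [1:0] next_group;   // the group checked in the current round

    always @(posedge clk) begin
        if (rst) begin
            next_group      <= 2'd0;
            selection_valid <= 1'b0;
        end else if (step) begin
            // Select the group under inspection only if it is not empty;
            // otherwise simply move on to the succeeding group.
            selection_valid <= ~group_empty[next_group];
            selected_group  <= next_group;
            next_group      <= (next_group == 2'd2) ? 2'd0 : next_group + 2'd1;
        end
    end
endmodule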

6.7 Traffic Shaper

The shaper works in conjunction with the scheduler and helps in shaping the queue traffic. The I/O diagram of the shaper is as shown in the figure 6.6:

Figure 6.6: I/O diagram of the Shaper Block

1. For each queue, the shaper maintains a token bucket. In addition, the MaxTokenRate and the MaxBurstThreshold for each bucket are also maintained by the shaper. These two parameters shall be configurable by the NATM administrator.
2. Tokens keep accumulating in the token buckets at the programmed token rate and saturate at the MaxBurstThreshold (a sketch of one such bucket is given after this list).
3. Also, tokens are decremented from the buckets as and when the MRC informs the shaper about the successful read-out of a packet from the DDR2, along with the packet length. The signals exercised during this process are mrc shp queueNum, mrc shp rd done and mrc shp pktLength.
4. If any decrement results in the tokens becoming zero or negative, this is immediately informed to the scheduler over the shp sch queueOutOfTokens set signal, which is accompanied by the queue number information on the shp sch queueOutOfTokens set qNum output.
5. As and when the tokens accumulate, if it so happens that the number of tokens becomes positive, then the shaper informs the scheduler about it using the output signals shp sch queueOutOfTokens reset and shp sch queueOutOfTokens reset qNum.
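
The per-queue bucket described in items 1-5 can be sketched as follows in Verilog. The counter widths, the byte-granular refill on a periodic tick and the port names are illustrative assumptions; only the saturation at MaxBurstThreshold, the debit on a read-out and the two out-of-tokens notifications are taken from the text.

module token_bucket_sketch (
    input  wire        clk,
    input  wire        rst,
    input  wire        tick,                 // asserted once per token-refill interval
    input  wire [15:0] tokens_per_tick,      // programmed MaxTokenRate, in bytes per tick
    input  wire [23:0] max_burst_threshold,  // programmed MaxBurstThreshold, in bytes
    input  wire        rd_done,              // mrc shp rd done for this queue
    input  wire [10:0] pkt_length,           // mrc shp pktLength
    output reg         out_of_tokens_set,    // drives shp sch queueOutOfTokens set
    output reg         out_of_tokens_reset   // drives shp sch queueOutOfTokens reset
);
    // Signed so that the count may go negative after a debit, as the text allows.
    reg  signed [25:0] tokens;

    wire signed [25:0] refill  = tick    ? $signed({10'd0, tokens_per_tick}) : 26'sd0;
    wire signed [25:0] debit   = rd_done ? $signed({15'd0, pkt_length})      : 26'sd0;
    wire signed [25:0] burst   = $signed({2'd0, max_burst_threshold});
    wire signed [25:0] updated = tokens + refill - debit;
    // Saturate the accumulation at the burst threshold.
    wire signed [25:0] next_tokens = (updated > burst) ? burst : updated;

    always @(posedge clk) begin
        if (rst) begin
            tokens              <= 26'sd0;
            out_of_tokens_set   <= 1'b0;
            out_of_tokens_reset <= 1'b0;
        end else begin
            // Notify the scheduler only when the bucket crosses zero in either direction.
            out_of_tokens_set   <= (tokens > 0)  && (next_tokens <= 0);
            out_of_tokens_reset <= (tokens <= 0) && (next_tokens > 0);
            tokens              <= next_tokens;
        end
    end
endmodule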

6.8 Memory Read Controller

Figure 6.7: Interface Diagram of Memory Read Controller

The Memory Read Controller is responsible for reading out the packets stored in the DDR2 memory. It interfaces to the Egress Packet Accumulator as well as the shaper, the scheduler and the Queue Manager. The I/O diagram of the MRC is as shown in figure 6.7:
1. The MRC receives a dequeue request from the QM (over the signal qm mrc dequeue) for a particular queue, qm mrc queueNum. This dequeue request is accompanied by the DDR2 memory address, qm mrc deq raddr, from where the next packet in this queue must be read out.
2. In response to this, the MRC issues a read command to the DDRC along with the read address that it received from the QM. The signals activated during this transaction are mrc ddrc re and mrc ddrc raddr.
3. After a while, when the first few bytes of the packet at the head of the queue are read out from the memory by the MRC, received on ddrc mrc rdata and validated by ddrc mrc rdvalid, the tag information stored as a prefix to the actual packet is also obtained with the first few packet bytes.
4. The pktLength is stored in this tag. It is now used in two ways.
• Firstly, the pktLength is used to compute the number of further read commands to be issued to the DDRC by the MRC in order to read out the packet entirely (see the sketch after this list).
• Secondly, the pktLength is communicated back to the QM along with the corresponding queueNum and a read done signal. This piece of information is needed by the QM since it has to move the QueueHeadPointer for that queue in its QueueHeadPointer table to account for the dequeue of a packet from the memory. The signals used in this process are rmrc qm deq ack, mrc qm queueNum and mrc qm pktLength.
5. There is an interface from the MRC to the shaper as well. Whenever the MRC reads out a packet from the DDR2 memory, it knows the length of the packet. This is conveyed to the shaper over the signals mrc shp pktLength and mrc shp queueNum, validated by mrc shp rd done. The shaper now makes use of this information to decrement the tokens in the token bucket corresponding to mrc shp queueNum by an amount equal to mrc shp pktLength. If this deduction results in the tokens becoming non-positive, then the shaper informs the scheduler over dedicated signal lines.
6. The other interface of the MRC is to the Egress Packet Accumulator. Whenever the EPA is ready, indicated by the assertion of epa mrc ready, it means that the FIFO inside the EPA has sufficient space to accept new packets from the MRC. In this situation, the MRC sends packets to the EPA, in effect writing the packets read out from the memory into the FIFO inside the EPA using the signal lines epa mrc we and epa mrc wdata.
7. The MRC also has an output signal going to the scheduler called mrc sch pause. This is an important signal and is asserted when the dequeue command FIFO inside the MRC is near full or full. It acts as a pause signal to the scheduler, in response to which the scheduler pauses for a while by not issuing any more dequeue requests to the QM. Later, when this signal gets deasserted because the FIFO level has gone down, the scheduler resumes its normal functioning.
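
For the first use of the pktLength in item 4, a minimal Verilog sketch of the read-command count is given below. The assumptions are an 8-byte word per DDRC read response and a one-word prefix tag; the report does not fix these numbers, so they are purely illustrative.

module mrc_read_count_sketch (
    input  wire [10:0] pkt_length,      // length taken from the prefix tag
    output wire [8:0]  total_reads,     // 8-byte reads needed for tag plus packet
    output wire [8:0]  remaining_reads  // reads still to be issued after the first one
);
    // Ceiling division by 8 for the packet body, plus one word for the prefix tag.
    wire [11:0] words = (({1'b0, pkt_length} + 12'd7) >> 3) + 12'd1;

    assign total_reads     = words[8:0];
    assign remaining_reads = words[8:0] - 9'd1;  // the first read returned the tag already
endmodule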


6.9 Memory Write Controller

This block is responsible for issuing the write commands to the DDRC to write the packets into the DDR2 memory. The I/O diagram of the MWC is as shown in the figure 6.8

Figure 6.8: Interface Diagram of Memory Write Controller

The functionality of this block can be explained in the following steps:
1. Typically, the MWC receives a write request from the QM along with the packet length and the write address into the DDR2, over the input signals qm mwc write, qm mwc waddr and qm mwc pktLength.
2. Before the MWC communicates this to the DDRC, the MWC needs to read the packet out from the enqueue FIFO housed within the Packet Accumulator block. It does this through a series of reads, issuing the qm mwc pktLength and the mwc enqfifo rdptr to receive the packet on the enqfifo mwc rdata bus, validated by the signal enqfifo mwc rdvalid. The MWC knows how many read requests are to be issued, based on the packet length that it has to read and the granularity of each read response.
3. Once the entire packet is read out, it is sent to the DDRC by issuing commands to the latter in the format that the DDRC expects. Hence, the MWC does have a small storage area to hold the packet while it is being sent to the DDRC. The signals used for this process are mwc ddrc we, mwc ddrc wdata and mwc ddrc waddr.
4. When a new packet is being written into the DDRC, a prefix tag which holds the packet length is written prior to the actual packet write, i.e. the packet length is prefixed to the actual packet and only then is it written into the DDR2 memory.
5. Occasionally, the MWC may also receive an assertion on the input signal qm mwc drop along with the other three input signals from the QM. In response to this, the MWC does not actually read the packet out of the enqueue FIFO; instead, it merely updates its read pointer, i.e. mwc enqfifo rdptr, by the packet length of the packet that is about to be dropped. Such a scenario occurs when there is not sufficient memory left in the DDR2 for a new packet to be written.


6.10 DDR

6.11 Egress Packet Accumulator

The egress packet accumulator is the final block in the overall design. Its primary role is to accumulate the data, format it and send it out in the desired pattern. The I/O diagram of the block is as shown in the figure 6.9

Figure 6.9: Interface diagram of Egress Packet Accumulator

1. Whenever the local FIFO inside the EPA is not full, the signal epa mrc ready is asserted. On seeing this, if the MRC has data to send, it does so by sending data on the mrc epa wdata signal, validated by mrc epa we.
2. Upon receipt of such a command, the block honours it by writing the data into its input FIFO. The EPA then looks for the assertion of the signal txq epa out ready. Whenever this is high, it means that the succeeding block, i.e. the TxQueue, has sufficient space within itself to accept packets from the EPA.
3. If the TxQueue is ready to accept the data, the EPA transmits the data over the 64-bit output bus called epa txq data in, which is validated by the signal epa txq wr in.
4. To indicate the end of a frame / packet, a signal called epa txq control in [7:0] is used. This is a one-hot encoded bit pattern, each bit representing the corresponding byte in the 8-byte wide output data bus. Only at the frame boundaries is the bit corresponding to the ending byte high, while all other bits remain low (a sketch of this encoding is given after this list).
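
Item 4 can be illustrated with the small combinational sketch below, which maps the number of valid bytes in the final 8-byte word to a one-hot marker. The bit-to-byte mapping is assumed to mirror Table 8.1 of the implementation chapter; the module and port names are hypothetical.

module epa_eof_encode_sketch (
    input  wire       last_word,               // high only on the final data beat of a frame
    input  wire [3:0] valid_bytes_last_word,   // 1..8 valid bytes in that final word
    output reg  [7:0] epa_txq_control_in
);
    always @(*) begin
        if (!last_word)
            epa_txq_control_in = 8'h00;           // mid-frame beats carry no marker
        else
            case (valid_bytes_last_word)
                4'd1:    epa_txq_control_in = 8'h80;
                4'd2:    epa_txq_control_in = 8'h40;
                4'd3:    epa_txq_control_in = 8'h20;
                4'd4:    epa_txq_control_in = 8'h10;
                4'd5:    epa_txq_control_in = 8'h08;
                4'd6:    epa_txq_control_in = 8'h04;
                4'd7:    epa_txq_control_in = 8'h02;
                default: epa_txq_control_in = 8'h01;   // all 8 bytes valid
            endcase
    end
endmodule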

6.12 DDR2 SDRAM Controller

This is a very important block in the overall design. It interfaces to the DDR2 module on the NetFPGA board. The block diagram of the DDR2 module with the I/O pins is as shown in figure 6.10.

Figure 6.10: I/O Diagram of DDR2 SDRAM Controller

1. The DDR2 module being used is the Micron MT47H16M16, which is a 32 MB chip. Two such chips are in place, yielding a total memory size of 64 MB.
2. The DDR2 controller issues read and write commands to the DDR2 in which the packets are stored.
3. It is evident that the DDRC has two interfaces over which it receives the R/W commands, i.e. one interfacing to the MWC and the other to the MRC. Hence, an arbiter is needed inside the DDRC to handle all the commands. The write commands are given a higher preference over the read commands coming from the MRC. There is a FIFO at each interface which stores the commands from the MWC and the MRC. An arbiter pops commands out of the FIFOs and services the requests (a sketch of this arbitration is given after this list).
4. The DDR2 Controller shown in the diagram above actually wraps around the core DDR2 controller inherited as a logic core from the NetFPGA code database. This wrapper contains all the FIFOs and implements the logic described in the point above. The wrapper translates the commands and data into the pattern in which the logic core expects them.
5. The output interface of the DDR2 Controller to the actual DDR2 memory module can be logically divided into a command interface and a data interface.
• The command interface is made up of the following signals:
– cntrl0 ddr2 casb: memory column access strobe
– cntrl0 ddr2 cke: clock enable for the DDR2 memory
– cntrl0 auto ref req: auto refresh request from the controller to the DDR2 memory module
– cntrl0 ddr2 csb: chip select signal to the DDR2
– cntrl0 ddr2 rasb: row address strobe
– cntrl0 ddr2 web: DDR2 write enable signal
– cntrl0 ddr2 dm: data mask signal for the write data
– cntrl0 ddr2 dqs, cntrl0 ddr2 dqs n: differential data strobe signals
• The data interface is made up of the following signals:
– cntrl0 ddr2 dq [31:0]: bidirectional DDR2 memory data
– cntrl0 ddr2 ba [1:0]: memory bank address
– cntrl0 ddr2 address [12:0]: memory address, with row and column address time-multiplexed
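
Item 3 can be pictured with the following minimal Verilog sketch of the pop logic, in which pending MWC (write) commands are always served before pending MRC (read) commands. The FIFO handshake and the reformatting for the logic core are omitted, and all names are illustrative assumptions.

module ddrc_arb_sketch (
    input  wire clk,
    input  wire rst,
    input  wire wr_cmd_fifo_empty,   // MWC command FIFO status
    input  wire rd_cmd_fifo_empty,   // MRC command FIFO status
    input  wire core_ready,          // DDR2 logic core can accept a new command
    output reg  wr_cmd_fifo_re,      // pop the next write command
    output reg  rd_cmd_fifo_re       // pop the next read command
);
    always @(posedge clk) begin
        wr_cmd_fifo_re <= 1'b0;
        rd_cmd_fifo_re <= 1'b0;
        if (!rst && core_ready) begin
            if (!wr_cmd_fifo_empty)
                wr_cmd_fifo_re <= 1'b1;    // writes are preferred over reads
            else if (!rd_cmd_fifo_empty)
                rd_cmd_fifo_re <= 1'b1;
        end
    end
endmodule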


6. Further details about the DDR2 Controller Logic Core are mentioned in Chapter 7.


Chapter 7

Target Specifications

• Monitors downstream network traffic
• Store-and-Forward architecture
• Packets queued up in on-board DDR2 memory of size 32 MB
• Maximum of 32 queues and 3 queue-groups supported
• Maximum space of 1 MByte per queue
• Estimated throughput of 200 Mbps
• Supports a peak input burst rate of 1 Gbps
• Packet classification based on Layer 2, Layer 3 and Layer 4 parameters to provide class-based QoS
• Configurable shaping for every queue (Token Bucket)
• Provision to edit the list of Privileged IP Addresses in the Packet Classifier
• Provision to implement a dynamic memory allocation technique in the Queue Manager
• Configurable, Weighted Round Robin scheduling of the packet queues
• Provision in the Packet Classifier to update weights to form a desired queue number


Part III

IMPLEMENTATION


Chapter 8

Micro-Architecture Details

8.1 Packet Accumulator

The I/O diagram of the Packet Accumulator (PA) is as shown in Figure 8.1. The PA has three interfaces, one each for RxQueue, Packet Classifier (PC) and Memory Write Controller (MWC). The working of PA can be summarized as follows:

Figure 8.1: I/O diagram of Packet Accumulator

• The PA accepts packets from the RxQueue, which are saved in an implemented DPRAM.
• The presence of data in the DPRAM is intimated to the PC.
• The PA responds to the PC's readNewPkt request by sending the packet bytes to the PC, reading them out of the DPRAM.
• The PA also responds to the MWC's getNewPkt request by sending the packet bytes to the MWC by reading the DPRAM.

8.1.1 Brief Description of I/O ports:

1. rxq pa data[63:0]: The 8-byte wide data that comes in from the RxQueue to the PA.
2. rxq pa cntrl [7:0]: An encoding on this bus indicates whether it is the start of frame or the end of frame with respect to the data on rxq pa data. This encoding is explained in detail later in this chapter.
3. rxq pa write: The signal that validates the values on rxq pa data and rxq pa cntrl.
4. pa rxq ready rxq0: A ready signal asserted from the PA to the RxQueue which indicates that the PA has space to accommodate more packets.
5. pa mwc pktDone: This signal is asserted high for one clock cycle when the PA sends the last 8 bytes of a packet to the MWC.
6. pa mwc rdata[63:0]: The 8-byte wide bus over which the PA sends packet data to the MWC.
7. pa mwc rdvalid: The data valid signal corresponding to the data on pa mwc rdata.
8. mwc pa getNewPkt: A request from the MWC to read the next packet out of the PA. This is asserted high for 1 clock cycle only.
9. mwc pa drop: A one-clock-cycle high signal which is asserted by the MWC when it instructs the PA to drop the next packet that is present in the enq dpram in the PA.
10. mwc pa incrPtr [7:0]: This signal carries a value that is interpreted in two circumstances. If mwc pa drop is set high, the value on this bus indicates by how much the locally maintained mwc rdptr (for the DPRAM) should be advanced to drop the next packet. If, however, mwc pa drop is low, the value on this bus indicates how many entries the PA should read out from the enq dpram to completely send the packet to the MWC.
11. pa pc data ready: A high on this signal indicates that at least one packet has accumulated in the enq dpram and the PC may start processing the headers of this packet.
12. pa pc pktLength[10:0]: The 11-bit wide signal over which the packet length of the next packet is sent to the PC from the PA.
13. pa pc pktLength valid: The valid signal for the value on the bus pa pc pktLength.
14. pc pa readNewPkt: A request signal from the PC to the PA, asserted high for one clock cycle, to start reading out the next packet from the PA.
15. pa pc rdata[63:0]: The PA sends packet bytes to the PC over this 8-byte wide bus.
16. pa pc rdvalid: The signal that validates the data being sent by the PA to the PC.
17. pc pa stopCurrentPkt: A one-clock-cycle high signal asserted by the PC to tell the PA to stop sending the current packet to the PC.

The implementation of the PA is described in the sections below.

8.1.2 Interface to RxQueue

Data from RxQueue can come into the Packet Accumulator at a peak rate of 1 Gbps. The data flows in over the signals rxq pa data, validated by rxq pa cntrl and rxq pa write. The value on rxq pa cntrl is 0xFF for the first 8 bytes, 0x00 for the remaining octets and has a value as per the following table for the last octet. In the last octet, only n, (n ≤ 8) bytes may be valid. The valid data appears in the big-endian format. Hence, depending on n, the rxq pa cntrl may have one of the following values.


Table 8.1: The control signal values for the End-of-frame on rxq pa cntrl

n              1     2     3     4     5     6     7     8
rxq pa cntrl   0x80  0x40  0x20  0x10  0x08  0x04  0x02  0x01

Figure 8.2: State Machine of the controller in PA

The data from the RxQueue flows into the PA only when pa rxq ready rxq0 is asserted HIGH. The accumulation of data occurs into a DPRAM which is 2K deep and 64 bits wide, to suit the data width of the incoming rxq pa data. The accumulation into the DPRAM is controlled by an FSM whose state diagram is as shown in Figure 8.2. The state machine in the "initialize" state keeps polling for the start of frame, which is identified by 0xFF on the rxq pa cntrl bus. When 0xFF is identified, it transits to the "rcv Pkt" state, wherein it stays till the end of frame is reached. All the accumulated bytes are written to the enq dpram. As the packet bytes are flowing in, the FSM also computes the packet length. For every 8-byte word received, the packet length is incremented by 0x8. At the end of frame, the packet length is incremented as per the value in the rxq pa cntrl field. The computed packet length is then written into a FIFO called pktLength FIFO. The pktLength FIFO is 256 entries deep, with each entry being 11 bits wide. Since the packet length can never exceed 1536 in a practical scenario, the choice of 11 bits per FIFO entry was made. In order to prevent the unread bytes in the DPRAM from being overwritten by incoming packet bytes, a handshake mechanism is maintained using the pa rxq ready rxq0 signal. This signal is asserted as

pa rxq ready rxq0 = dpram space sufficient && !pktLength FIFO full

Two read pointers are maintained within the PA. One keeps track of the reads from the PC while the other keeps track of the reads from the MWC. The dpram space sufficient is computed such that it indicates that there is space left in the DPRAM for at least one more full-sized packet (1536 bytes) to accumulate further.
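
A minimal Verilog sketch of this length accounting is given below. It assumes the control encodings of Table 8.1 and an 11-bit length counter; the module and port names are illustrative and the DPRAM write path is not shown.

module pa_pktlen_sketch (
    input  wire        clk,
    input  wire        rst,
    input  wire        rxq_pa_write,      // validates rxq pa data / rxq pa cntrl
    input  wire [7:0]  rxq_pa_cntrl,      // 0xFF = SOF word, 0x00 = middle, one-hot = EOF word
    output reg  [10:0] pktlen_fifo_din,   // computed length, pushed on the EOF word
    output reg         pktlen_fifo_we
);
    reg [10:0] len;

    // Number of valid bytes in the final 8-byte word, as per Table 8.1.
    function [3:0] eof_bytes (input [7:0] c);
        case (c)
            8'h80: eof_bytes = 4'd1;  8'h40: eof_bytes = 4'd2;
            8'h20: eof_bytes = 4'd3;  8'h10: eof_bytes = 4'd4;
            8'h08: eof_bytes = 4'd5;  8'h04: eof_bytes = 4'd6;
            8'h02: eof_bytes = 4'd7;  default: eof_bytes = 4'd8;
        endcase
    endfunction

    always @(posedge clk) begin
        pktlen_fifo_we <= 1'b0;
        if (rst) begin
            len <= 11'd0;
        end else if (rxq_pa_write) begin
            if (rxq_pa_cntrl == 8'hFF)
                len <= 11'd8;                          // start of frame: first 8 bytes
            else if (rxq_pa_cntrl == 8'h00)
                len <= len + 11'd8;                    // full middle word
            else begin
                pktlen_fifo_din <= len + {7'd0, eof_bytes(rxq_pa_cntrl)};
                pktlen_fifo_we  <= 1'b1;               // push the length into the pktLength FIFO
                len <= 11'd0;
            end
        end
    end
endmodule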

8.1.3 Interface to the PC

• The PA indicates the presence of an accumulated packet to the PC over the signal pa pc data rdy.
• If this is high, the PC responds with a pc pa readNewPkt command.
• The PA provides the packet length information to the PC over the signals pa pc pktLength and pa pc pktLength valid.
• The PA also starts sending out the packet bytes, in units of 8 bytes, over the signal pa pc rdata, validated by pa pc rdvalid.
• The PA continues to do so until either the PC issues a stop command over pc pa stopCurrentPkt or the PA completely reads out the packet.
• For every read of the DPRAM, the PA increments the locally maintained pc rdptr, which is used to keep track of the reads towards the PC.
• The pc rdptr is suitably altered in the case when pc pa stopCurrentPkt is issued. This is because the PC might have issued the stop command while the PA was still in the middle of sending the packet to the PC. Hence, the current value in pc rdptr would point to somewhere in the middle of a packet. When the PC next issues a read command, the PA should start reading out the bytes of the next packet. Hence this calls for an alteration of pc rdptr, which is done accordingly.
• The PA also asserts pa pc pktDone when it sends out the last 8 bytes of any packet. This signal is asserted along with pa pc rdvalid.

8.1.4 Arbiter for DPRAM access

Since there will be simultaneous read requests to the DPRAM from both the PC and the MWC, there is an arbiter in the PA to handle this case. The arbiter arbitrates between the PC and MWC read requests in a round-robin fashion in the case of simultaneous reads. When there are no simultaneous read requests, the arbiter simply services whichever source is reading, be it the MWC or the PC. The mwc rdptr or pc rdptr maintained locally is incremented for every serviced read. A sketch of this arbitration is given below.
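
The following minimal Verilog sketch shows that arbitration, under the assumption of single-cycle grant pulses and hypothetical port names.

module pa_rd_arb_sketch (
    input  wire clk,
    input  wire rst,
    input  wire pc_req,     // read request on behalf of the PC read pointer
    input  wire mwc_req,    // read request on behalf of the MWC read pointer
    output reg  grant_pc,
    output reg  grant_mwc
);
    reg last_was_pc;   // who won the previous simultaneous round

    always @(posedge clk) begin
        grant_pc  <= 1'b0;
        grant_mwc <= 1'b0;
        if (rst) begin
            last_was_pc <= 1'b0;
        end else if (pc_req && mwc_req) begin
            // Simultaneous requests: alternate between the two readers.
            if (last_was_pc) grant_mwc <= 1'b1;
            else             grant_pc  <= 1'b1;
            last_was_pc <= ~last_was_pc;
        end else if (pc_req) begin
            grant_pc <= 1'b1;
        end else if (mwc_req) begin
            grant_mwc <= 1'b1;
        end
    end
endmodule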

8.2 Packet Classifier

The I/O diagram of the Packet Classifier (PC) module is as shown in Figure 8.3. The functioning of the PC can be summarized as follows:
• The PC requests packet headers from the PA.
• The packet header parsers operate on the headers and generate a queue number and a group number.
• The PC then issues commands to the QM along with the queue number and group number that were just generated.

Figure 8.3: I/O diagram of PC

The PC has three I/O interfaces, one each for the Packet Accumulator (PA), the Queue Manager (QM) and the PCI, from where user configurability is provided.

8.2.1 Brief Description of I/O Ports

1. pc pa readNewPkt: A request signal from the PC to the PA, asserted high for one clock cycle, to start reading out the next packet from the PA.
2. pc pa stopCurrentPkt: A one-clock-cycle high signal asserted by the PC to tell the PA to stop sending the current packet to the PC.
3. pa pc rdata[63:0]: The PA sends packet bytes to the PC over this 8-byte wide bus.
4. pa pc rdvalid: The signal that validates the data being sent by the PA to the PC.
5. pa pc pktLength[10:0]: The 11-bit wide signal over which the packet length of the next packet is sent to the PC from the PA.
6. pa pc pktLength valid: The valid signal for the value on the bus pa pc pktLength.
7. pa pc data ready: A high on this signal indicates that at least one packet has accumulated in the enq dpram and the PC may start processing the headers of this packet.
8. qm pc ready: A high on this signal indicates that the QM is ready to accept more commands from the PC.
9. pc qm enqueue: A one-clock-cycle high signal which validates the values on the signals pc qm pktLength, pc qm queueNum and pc qm groupNum. This signal is used to give commands to the QM.
10. pc qm pktLength[10:0]: The packet length of the packet for which an enqueue command is being issued by the PC is sent over this signal.
11. pc qm queueNum[4:0]: The value on this signal indicates the queue number to which the packet will belong.
12. pc qm groupNum[1:0]: This signal is used to describe which of the 3 groups the packet belongs to.
13. pci pc update: This signal is a one-CAM-clock high signal that validates the signal pci pc IP.
14. pci pc IP [31:0]: This signal is used to update an entry in the list of privileged IP addresses maintained in the CAM.
15. pci pc weight update: This signal validates the values on the signals pci pc weight and pci pc weight index.
16. pci pc weight[3:0]: The value on this signal is a weight which will overwrite the default weight in the weight array at an index equal to pci pc weight index[5:0].

8.2.2 Interface to PA

This has been implemented using a Finite State Machine (FSM). The FSM state diagram is as shown in Figure 8.4.

Figure 8.4: pa pc readFSM in PC

The pa pc readFSM looks for the assertion of the input signal pa pc data rdy from the PA. The assertion of this signal indicates that the next and subsequent packets have accumulated in the PA and the PC may read them out to process their headers. The FSM generates an output pc pa readNewPkt in response. The PA then sends the packet length of the next packet to the PC. Following this, the PA also starts sending out the packet in units of 8 bytes to the PC. Since the PC does not need all the packet bytes to decide upon a queue number and instead requires only the packet headers up to Layer 4, the PC has a mechanism to prevent the PA from sending out the entire packet to the PC. Calculations suggest that the worst-case maximum header length that the PC needs is 232 bytes. Hence, it is enough if the PC reads this many bytes out of the PA to decide on a queue number. The PC reads a maximum of 256 bytes out of the PA, after which it issues the pc pa stopCurrentPkt signal. The pa pc boundary counter is needed to ensure that the PC reads out data from the PA in multiples of 256 bits (or 32 bytes), because the packet header parsers in the PC operate on 256 bits of data at a time. state machines ready is a signal generated internally by the packet parser state machines when they have completed operating on the headers of the previous packet. All the data read out from the PA by the PC is stored in a local FIFO called pa pc FIFO, which is 64 bits wide and 64 entries deep. The packet lengths which are read out from the PA are stored in another FIFO called pa pc pktLength FIFO, which is 16 entries deep and has a bit-width of 11 per entry. This state machine is also responsible for resetting the pa pc FIFO before reading out the next new packet. This is because the parsers might have read out only a few bytes from the pa pc FIFO before the queue number got decided. Hence, there would be residual bytes left behind in the pa pc FIFO which the parsers should not read out thinking that they are the starting bytes of the next packet. To avoid this, this state machine resets the pa pc FIFO as soon as the queue number is generated for the current packet being processed.

8.2.3 L2 Header Parser

The working of the Layer 2 header parser can be summarized using Figure 8.5. The Layer 2 header parser starts when it sees the pa pc FIFO empty signal go low, indicating the presence of data. The L2 parser reads out 4 entries from this FIFO to form a 256-bit wide field, called L2 reg, to operate upon. The parser first looks at bits [159:144]. The value here indicates the kind of frame this particular L2 frame is, i.e. it could be a frame carrying an ARP packet, a RARP packet, an IPv4 packet etc.

Figure 8.5: Flowchart for the L2 Parser

• If the frame carries an ARP or RARP packet, the L2 Header parser does not trigger the L3 Parser or the L4 Parser. Instead, the final queue number of 0x0 is generated. The corresponding group number 0x0 is also generated.
• If the frame is VLAN single tagged or VLAN double tagged, then the L2 parser moves its L2 reg read indices forward by 16 bits or 32 bits respectively, and the parsing is run once again.
• If the frame carries an IPv4 packet, do L3 parsing is set so that the L3 parser starts operating.
• If the frame carries a Length field, it implies that there is an LLC-SNAP header following this field, and hence the L2 reg indices are moved ahead by a total of 8 bytes so that the L2 parsing runs once again on this.
• If the frame carries any other Ethertype value, then the frame is directed towards the lowest priority queue, i.e. the queue numbered 31. The group number is set as 0x2.

If do L3 parsing is not set, then, once the queue number is generated at this stage of parsing, the state machines ready signal is set so that the pa pc FSM can proceed with reading out the next packet from the PA into the pa pc FIFO. However, if do L3 parsing is set, then the L3 parser begins operation. A sketch of this Ethertype decision is given below.
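
The branch selection of the flowchart can be sketched as the combinational decode below. The EtherType constants are the standard values for ARP, RARP, VLAN tags and IPv4; the output flag names and the module itself are illustrative assumptions, and the index adjustments for VLAN and LLC-SNAP re-parsing are not shown.

module l2_type_decision_sketch (
    input  wire [15:0] ethertype,        // L2 reg bits [159:144] in the text
    output reg         is_arp_rarp,      // -> queue 0x0, group 0x0
    output reg         is_vlan_tagged,   // -> move the L2 reg indices and parse again
    output reg         do_L3_parsing,    // -> IPv4: hand over to the L3 parser
    output reg         is_llc_snap,      // Length field: LLC-SNAP header follows
    output reg         is_other          // -> lowest priority queue 31, group 0x2
);
    always @(*) begin
        {is_arp_rarp, is_vlan_tagged, do_L3_parsing, is_llc_snap, is_other} = 5'b0;
        if (ethertype == 16'h0806 || ethertype == 16'h8035)       // ARP / RARP
            is_arp_rarp = 1'b1;
        else if (ethertype == 16'h8100 || ethertype == 16'h88A8)  // single / double VLAN tag
            is_vlan_tagged = 1'b1;
        else if (ethertype == 16'h0800)                           // IPv4
            do_L3_parsing = 1'b1;
        else if (ethertype <= 16'd1500)                           // a Length field (802.3 + LLC-SNAP)
            is_llc_snap = 1'b1;
        else
            is_other = 1'b1;
    end
endmodule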

8.2.4 L3 Header Parser

The L3 header parser starts functioning when do L3 parsing is set by the L2 parser. The L3 header parser looks for values in the Protocol field inside the IP header. The Protocol field could lie anywhere inside L2 reg, which was being used by the L2 Parser, or would have to be found in the data that will be read out next from the pa pc FIFO. The decision whether to read the FIFO or not is informed by the L2 state machine. The L2 parser also communicates the indices inside L2 reg or L3 reg within which the Protocol field values shall appear. Accordingly, the L3 parser reads or does not read the pa pc FIFO and inspects the value in the Protocol field. The L3 Parser is illustrated in Figure 8.6.

Figure 8.6: Flowchart for the L3 Parser

• If the value in the Protocol field is 0x01 (ICMP) or 0x02 (IGMP), then the queue number generated is 0x1. The corresponding group number generated is 0x0.
• If the upper layer protocol is TCP or UDP, then do L4 parsing is set so that the L4 parser starts functioning.
• However, if the Protocol field has values other than the ones mentioned above, then the queue number gets generated at this stage of parsing itself, and state machines ready is set high so that the pa pc FSM reads the next packet out of the PA.

Meanwhile, the L3 parser extracts the destination IP address from the IP header and subjects it to a PrivilegedIP-CAM lookup. A CAM with 128 entries, each entry being 32 bits wide, is maintained within the PC. This CAM is initialized with 128 IP addresses which shall be the privileged addresses. The CAM initialization happens at boot-up. User configurability of the privileged IP addresses is provided through the PCI, which will be explained in Section 8.2.7. If the IP address placed on the cmp din bus of the CAM matches any of the 128 IP addresses within the CAM, then the Match output is asserted. This output is recorded by the L3 Parser. The CAM is operated at half of the core clock frequency (i.e. at 62.5 MHz) so that the CAM logic does not appear in the critical path. The L3 parser combines the output of the CAM (match) and the sum of the weights generated at the L2 stage and L3 stage to generate the final queue number. If the CAM match was high, then the group number is 0x1, else the group number is 0x2. A sketch of this decision is given below.
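
The Protocol-field branch and the CAM-driven group selection can be sketched as below. The IANA protocol numbers (1, 2, 6, 17) are standard; the way the match is folded into the group number follows the text, while the module and port names are illustrative assumptions.

module l3_protocol_decision_sketch (
    input  wire [7:0] protocol,       // Protocol field of the IP header
    input  wire       cam_match,      // privileged-IP CAM hit on the destination address
    output wire       is_icmp_igmp,   // -> queue 0x1, group 0x0
    output wire       do_L4_parsing,  // TCP or UDP: continue with the L4 parser
    output wire [1:0] groupNum
);
    assign is_icmp_igmp  = (protocol == 8'd1) || (protocol == 8'd2);    // ICMP / IGMP
    assign do_L4_parsing = (protocol == 8'd6) || (protocol == 8'd17);   // TCP / UDP

    // Control traffic is placed in group 0x0; otherwise the CAM match selects
    // the privileged (0x1) or non-privileged (0x2) group.
    assign groupNum = is_icmp_igmp ? 2'd0 : (cam_match ? 2'd1 : 2'd2);
endmodule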

8.2.5 L4 Header Parser

The L4 parser inspects the fields in the L4 headers and is set off the moment the L3 parser finishes its job and asserts the flag do L4 parsing. The L4 Parser basically inspects the source port number, in the case of both TCP and UDP, and generates weights accordingly. The indices within L3 reg or L4 reg at which the relevant L4 header fields can be found are informed by the L3 Parser. The L3 Parser also indicates whether the L4 Parser needs to read entries out of the pa pc FIFO or not. The final queue number and group number are generated as a combination of the CAM match output, the L2 level weight, the L3 level weight and the L4 level weight. This is illustrated in the flowchart shown in Figure 8.7.

Figure 8.7: Flowchart for the L4 Parser


8.2.6 Interface to QM

The packet parsers operate on the packet headers and come up with a queue number and a group number which should now be communicated to the QM. The same is done using the signals pc qm enqueue, pc qm queueNum and pc qm groupNum. The PC also reads out the pa pc pktLength FIFO and asserts the pc qm pktLength with the value read out from the FIFO.

8.2.7 Interface to PCI

In order to provide user configurability, the PCI-to-host-PC interface is used in the design. The user inputs commands through the PCI either to update the list of Privileged IP addresses maintained in the CAM or to update the default weights that have been assigned to the various protocols. If the user wants to edit the list of privileged IP addresses, he executes the corresponding command. The PCI translator transfers this user command and drives the pci pc IP and pci pc update signals. The CAM entries get updated to the ones that were just supplied. The pci pc weight update, pci pc weight and pci pc weight index signals are driven by the PCI translator when the user wishes to update the weights. The default weights, their indices and the protocols are listed in the table below.

Table 8.2: Protocols with their corresponding index and weight

Protocol                 weight index   weight
ARP/RARP                       0           0
IPv4                           1           1
L2 OTHERS                      2          15
ICMP                           3           0
IGMP                           4           0
ESP                            5           1
AH                             6           1
L2TP                           7           1
TLS                            8           1
TCP                            9           2
SCTP                          10           2
UDP                           11           3
UDP Lite                      12           4
L3 OTHERS                     13          13
FTP                           14           4
SSH                           15           1
Telnet                        16           1
SMTP                          17           2
WHOIS                         18          11
DNS                           19           0
HTTP                          20           3
POP2                          21           2
POP3                          22           2
NTP                           23           0
IMAP                          24           2
IRC                           25           5
IMAPv3                        26           2
HTTPS                         27           3
SMTPS                         28           2
RTSP                          29           6
RPC HTTP                      30          11
FTPS                          31           4
TELNETS                       32           1
IMAPS                         33           2
POP3S                         34           2
Kazaa                         35           8
VLC                           36           6
IPSec                         37           1
WINS                          38           0
L2F                           39           1
H.323                         40           7
PPTP                          41           1
Windows Media Player          42          65
AOL Instant Messenger         43           3
Squid                         44           6
RTP                           45           5
Yahoo Messenger               46           7
SIP                           47           7
SIPS                          48           5
Webcam                        49           5
AOL                           50           5
XMPP                          51           5
Torrent Clients               52           8
HTTP Alternate Ports          53           3
Skype                         54           7
L4 OTHERS                     55          11

8.3 Queue Manager

The Queue Manager (QM) is responsible for maintaining the information about the queues of packets that are stored in the DDR2 memory. In brief, the functionality of the QM can be explained as follows:
• The PC gives an enqueue command to the QM, for a particular packet, along with the corresponding packet length and the queue number into which the packet should be enqueued.
• The QM looks up its tables, which maintain the Queue Head Pointers and Queue Tail Pointers on a per-queue basis, and determines if there is enough space in the DDR2 memory to house the next packet.
• If there is space available in the DDR2 memory to enqueue the next packet, then it issues a write command to the MWC.
• However, if there is not enough space to hold the next packet inside the memory, the QM issues a drop command to the MWC.
• In addition to the above, when the DDRC responds with an enqueue acknowledgement for a particular packet, the QM informs the SCH about it.
• The QM also receives the dequeue requests from the SCH, which it subsequently processes, issuing read commands to the MRC.

Figure 8.8: I/O diagram of Queue Manager

8.3.1 Brief Description of I/O Ports

The I/O diagram of the QM is as shown in Figure 8.8.
1. qm mwc write: A one-clock-cycle high signal which is asserted when the QM issues write commands to the MWC. This signal also validates the qm mwc waddr, qm mwc queueNum, qm mwc groupNum and qm mwc pktLength signals.
2. qm mwc waddr [24:0]: The 25-bit wide DDR2 memory address starting from which the current packet should be written is asserted on this bus.
3. qm mwc pktLength[10:0]: An 11-bit wide bus on which the packet length of the current packet is driven by the QM.
4. qm mwc queueNum[4:0]: The value on this signal indicates the queue number to which the current packet belongs.
5. qm mwc groupNum[1:0]: The group number to which the current packet belongs is indicated by the value on this bus.
6. qm mwc drop: This signal is asserted high for one clock cycle along with qm mwc write if the QM instructs the MWC to drop the current packet and inform the PA of this as well.
7. mwc qm ready: A ready signal asserted by the MWC to the QM to inform the QM that it is ready to accept more write commands from the QM.
8. qm pc ready: A high on this signal indicates that the QM is ready to accept more commands from the PC.
9. pc qm enqueue: A one-clock-cycle high signal which validates the values on the signals pc qm pktLength, pc qm queueNum and pc qm groupNum. This signal is used to give commands to the QM.
10. pc qm pktLength[10:0]: The packet length of the packet for which an enqueue command is being issued by the PC is sent over this signal.
11. pc qm queueNum[4:0]: The value on this signal indicates the queue number to which the packet will belong.
12. pc qm groupNum[1:0]: This signal is used to describe which of the 3 groups the packet belongs to.
13. qm mrc dequeue: This signal is asserted high for one clock cycle when the QM sends commands to the MRC to read out the next packet from the DDR2 memory. This signal also validates the values on the buses qm mrc pktLength, qm mrc deq raddr and qm mrc queueNum.
14. qm mrc deq raddr [24:0]: The 25-bit DDR2 address starting from which the current packet should be read out is informed to the MRC by the QM by asserting that value on this bus.
15. qm mrc pktLength[10:0]: The MRC is informed about the packet length of the current packet by the value on this signal.
16. qm mrc queueNum[4:0]: The queue number to which the current packet belongs is informed by the QM to the MRC on this bus.
17. mrc qm pause: A low on this signal indicates that the MRC is in a position to accept more read requests from the QM.
18. qm sch queueNum[4:0]: This signal is driven with the queue number to which a packet belongs whose successful write into the DDR2 memory was just acknowledged by the DDRC wrapper.
19. qm sch update: A one-clock-cycle high pulse on this signal validates the value on the qm sch queueNum bus.
20. qm sch ready: A high on this signal indicates the readiness of the QM to accept more commands from the SCH.
21. sch qm deq req: A one-clock-cycle high assertion is done by the SCH when it issues a dequeue request to the QM.
22. sch qm queueNum[4:0]: This signal is driven with the queue number from which the next packet should be dequeued by the QM.
23. ddrc qm enq ack qNum[4:0]: The DDRC wrapper acknowledges the successful write of a packet into the DDR2 memory by driving this signal with the queue number to which the packet belonged.
24. ddrc qm enq ack: A one-clock-cycle high assertion on this signal validates the value on the bus ddrc qm enq ack qNum.


8.3.2 Tables in DPRAMs used in the design

1. Queue Head Pointer table
(a) This is maintained in a DPRAM that is 64 entries deep, with each entry being 25 bits wide.
(b) Each entry in this table holds a 25-bit DDR2 memory address. This entry is the value of the head pointer of a queue, which points to the first 4 bytes of the packet at the head of that queue inside the DDR2 memory.
(c) A DPRAM is needed because this table is very often accessed concurrently during a packet enqueue and a packet dequeue.
2. Queue Tail Pointer table
(a) This is maintained in a 64x25 SPRAM.
(b) Each entry in this table holds a 25-bit DDR2 memory address that is the tail pointer of a queue. This tail pointer points to the last 4 bytes of the last packet stored in that queue.
(c) An SPRAM suffices in this case because this table is accessed only during the enqueue of a packet.
3. Queue Start Address table
(a) This is maintained in a 64x25 SPRAM.
(b) Each entry in this table holds a 25-bit DDR2 memory address that is the starting address of the memory quota allocated to that queue.
(c) Since this table will be accessed only during the enqueue of a packet, it is housed in an SPRAM.
4. Queue End Address table
(a) A 64x25 SPRAM maintains the Queue End Address table.
(b) Each entry in this table holds a 25-bit DDR2 memory address that is the ending address of the memory quota allocated to that queue.
(c) Since this table is accessed only during the enqueue of a packet, an SPRAM suffices to hold it.
5. Packet Length table
(a) This is a 32K-deep DPRAM with each row being 27 bits wide.

Table 8.3: Packet Length Table in Queue Manager
Bits [26:16]: Packet Length
Bits [15:0]:  Address in this DPRAM where the packet length of the next packet in this queue is recorded

(b) This table is accessed very often, concurrently, during enqueue and dequeue operations and is hence maintained in a DPRAM.
(c) This table is used to maintain the packet lengths of the packets belonging to a certain queue in the form of a linked list within the DPRAM.
6. Enqueue Queue Next Pointer table
(a) A 64x16 DPRAM maintains this table.

Table 8.4: Enqueue Queue Next Pointer Table in Queue Manager
Bit [15]:    Valid
Bits [14:0]: Enqueue Next Pointer

(b) This table is accessed concurrently during the enqueue and dequeue processes and hence a DPRAM is needed to hold it.
(c) The "Valid" bit indicates whether the entry holds valid data or not.
(d) The "Enqueue Next Pointer" is an address into the Packet Length table, which holds the packet lengths of all packets currently enqueued.
7. Dequeue Queue Next Pointer table
(a) A 64x16 DPRAM maintains this table.

Table 8.5: Dequeue Queue Next Pointer Table in Queue Manager
Bit [15]:    Valid
Bits [14:0]: Dequeue Next Pointer

(b) This table is accessed concurrently during the enqueue and dequeue processes and hence a DPRAM is needed to hold it.
(c) The "Valid" bit indicates the validity of that particular entry.
(d) The "Dequeue Next Pointer" is an address into the Packet Length table where the packet length of the next packet to be dequeued from this queue is stored.

A point to be noted is that 6 out of the 7 tables above have 64 entries on a per-queue basis, even though the architecture supports 32 actual queues of packets. This 'disparity' is to provide for the future implementation of the Dynamic Memory Allocation scheme. In this scheme, when a certain queue runs out of memory space in the DDR2, it is allocated additional space obtained by taking a chunk of memory out of the quota allocated to a lower priority queue. The algorithm starts by looking at the queue space remaining for queue number 31 and upwards. When it sees that there is space available which is needed by a higher priority queue, this chunk of memory is deallocated from the lower priority queue and allocated to the higher priority queue. The extra space allocated is given a new queue number which is 1 more than the actual queue number, i.e. if queue number 0 ran out of memory space, then the extra space taken out of a lower priority queue will be allocated to queue number 1. In the design, only even-numbered queues are operational within the Queue Manager when dynamic memory allocation is not in effect. So, only queues numbered 0, 2, 4, ... actually hold packets. These queues are reflected as queues 0, 1, 2, ... to all the other blocks in the design, including the DDR2 memory.


8.3.3 Space Calculation algorithm

During the enqueue of a packet into a queue, it is necessary to calculate whether there is sufficient space left in that queue inside the DDR2 memory to hold the next packet. This algorithm proceeds as follows.
1. The Queue Start Address table, Queue End Address table, Queue Head Pointer table and Queue Tail Pointer table are read for the queue into which the next packet is to be enqueued.
2. The algorithm then proceeds as per the following pseudo-code:

// Legend:
//   QHP = QueueHeadPointer
//   QTP = QueueTailPointer
//   QSA = QueueStartAddress
//   QEA = QueueEndAddress

if (QHP[QueueNum] == QTP[QueueNum]) {
    space_available = 1;
} else if (QHP[QueueNum] < QTP[QueueNum]) {
    if ((QEA[QueueNum] - QTP[QueueNum]) >= packet_length) {
        space_available = 1;
    } else if ((QHP[QueueNum] - QSA[QueueNum]) >= packet_length) {
        space_available = 1;
    } else {
        space_available = 0;
    }
} else { // Implies QTP < QHP due to rollover of QTP
    if ((QHP[QueueNum] - QTP[QueueNum]) >= packet_length) {
        space_available = 1;
    } else {
        space_available = 0;
    }
}


3. If the queue Head Pointer is equal to the queue Tail Pointer, it means that all packets have been dequeued and no more unread packets exist in this queue. Hence, there is certainly space available for the next packet to be enqueued into the queue.
4. However, if the queue head pointer is less than the queue tail pointer, it means that the tail pointer has moved ahead because of some packet enqueues into this queue which are yet to be read out. Hence, in this case, the difference between the Queue End Address and the queue tail pointer is calculated to check whether there is sufficient space to hold the next packet, whose packet length has been intimated by the PC. In case there is not enough space left to enqueue the next packet, the QM issues a drop command to the MWC, which acts accordingly.
5. If the queue tail pointer is less than the queue head pointer, it is the scenario wherein the queue tail pointer has rolled over after reaching the queue end address limit. In this case, the space left in the queue is calculated as the queue head pointer minus the queue tail pointer. If this space is less than the space required for the packet length, the QM issues a drop to the MWC. A combinational sketch of this check is given below.
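
The same check, written as combinational Verilog in the form in which it could sit inside the QM datapath, is sketched below. The 25-bit addresses and 11-bit packet length follow the port descriptions of Section 8.3.1; the module name is illustrative.

module qm_space_check_sketch (
    input  wire [24:0] qhp,            // Queue Head Pointer
    input  wire [24:0] qtp,            // Queue Tail Pointer
    input  wire [24:0] qsa,            // Queue Start Address
    input  wire [24:0] qea,            // Queue End Address
    input  wire [10:0] pkt_length,
    output wire        space_available
);
    wire        empty      = (qhp == qtp);
    wire        tail_ahead = (qhp <  qtp);
    wire [24:0] need       = {14'd0, pkt_length};

    wire fits_at_end   = (qea - qtp) >= need;   // room between the tail and the segment end
    wire fits_at_start = (qhp - qsa) >= need;   // room at the segment start (tail will roll over)
    wire fits_wrapped  = (qhp - qtp) >= need;   // tail already rolled over: room between tail and head

    assign space_available = empty
                           | (tail_ahead  & (fits_at_end | fits_at_start))
                           | (~empty & ~tail_ahead & fits_wrapped);
endmodule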

8.3.4 Interface to PC and MWC

Whenever there is a command from the PC to the QM, the QM queues it up into a FIFO called pc qm cmdFIFO. This FIFO has been generated using Xilinx CoreGen and is 32 entries deep and 18 bits wide. The QM also issues a handshake signal to the PC, called qm pc ready, generated as an inversion of pc qm cmdFIFO prog full. The pc qm cmdFIFO prog full is set to go high at a FIFO write depth of 26. It is necessary to generate the qm pc ready signal from pc qm cmdFIFO prog full and not from pc qm cmdFIFO full, so that commands from the PC to the QM which are already on the way get honoured, while the PC subsequently recognizes qm pc ready going low and stops sending enqueue commands to the QM. Each entry in the FIFO holds data in the format {11'd pc qm pktLength, 2'd pc qm groupNum, 5'd pc qm queueNum}.

FSM to read out the entries from the pc qm cmdFIFO

An FSM called pcqm FSM reads out the entries from the pc qm cmdFIFO in order and processes them. The FSM state diagram is shown in Figure 8.9.

Figure 8.9: FSM to read out entries from the pc qm cmdFIFO

s0
1. If pc qm cmdFIFO empty is low, which means that there is a command in the FIFO, and if mwc qm ready is high, the FSM transits to s1 by issuing a read enable to the pc qm cmdFIFO. The mwc qm ready is a handshake input signal which indicates the readiness of the MWC to accept write commands from the QM.

s1
1. Once pc qm cmdFIFO rdvalid is seen, a read command is issued to the Queue Head Pointer table, Queue Tail Pointer table, Queue Start Address table and Queue End Address table, with the address being the queue number obtained from pc qm cmdFIFO rdata.
2. The state machine then transits to state s2.

s2
1. Once the data is read out of all the tables, which is flagged using a generated signal called all rdvalids 1 seen, the state transits to s3.

s3
1. The space calculation algorithm indicates the presence or absence of sufficient space for the next packet to be enqueued.
2. If there is space available for the packet to be enqueued, the pointers related to this queue must be suitably altered. Hence, a write is done into the queue tail pointer table to move the tail pointer ahead by an amount proportional to the packet length.
3. In case the queue head pointer was equal to the queue tail pointer before the enqueue, but they were not equal to the queue start address, it is necessary to restore the queue head pointer to the queue start address. Hence, a write is done into the queue head pointer table to do this.
4. A flag called record pktLength is set to record the packet length of the packet being enqueued. This is done by the record pktLength FSM.

s4
1. Here the commands are issued to the MWC. The signals used for this purpose are qm mwc write, qm mwc drop, qm mwc waddr and qm mwc pktLength. The qm mwc waddr is driven with the value of the queue tail pointer as it was just before its update in the previous state.

FSM to record the packet length of the packets being enqueued

An FSM called record pktLength FSM does the job of saving the packet lengths of the packets being enqueued. The state diagram of this FSM is as shown in Figure 8.10.

s0
1. When record pktLength is issued by the pcqm FSM, a read is issued into the Enqueue Queue Next Pointer table at an address equal to the queue number. The queue number is obtained from pc qm cmdFIFO rdata. The state then transits to s1.

s1
1. If this is the very first packet to be enqueued into this queue, then the Enqueue Queue Next Pointer table rdata will be invalid.
(a) Hence, the current packet length is recorded into the Packet Length DPRAM at its current write address. The actual data written into the DPRAM is {11'd current pktLength, -1}. The "-1" is needed to signify that this is the last packet in this queue.
(b) The current Packet Length DPRAM write address is then stored into the Enqueue Queue Next Pointer table at an index equal to the queue number. The entry is validated by setting the "Valid" bit high in the entry.
(c) Also, the same entry is made into the Dequeue Queue Next Pointer table at an address equal to the queue number. The validation of the entry is also done by setting the "Valid" bit.


Figure 8.10: FSM to record the packet length of the packets being enqueued

(d) The state then transits to s0.
2. However, if it was not the first packet for this queue:
(a) The current packet length is recorded into the Packet Length DPRAM at its current write address. The actual data written into the DPRAM is {11'd current pktLength, -1}. The "-1" signifies that this is the last packet in the queue, since every packet is appended to the tail of its queue.
(b) The current Packet Length DPRAM write address is then stored into the Enqueue Queue Next Pointer table at an index equal to the queue number.


(c) Also, a read is issued to the Packet Length table at an address equal to the value on the bus enq qnp table rdata[14:0].
(d) The state then transits to s2.

s2
1. At an address equal to dpram rdata[14:0], the Packet Length table DPRAM is written with {dpram rdata[26:16], 1'b0, previous dpram waddr[14:0]}. The previous dpram waddr is the value that dpram waddr held in the previous state, i.e. s1.
2. The purpose of this step is to maintain the packet lengths of all the packets belonging to the same queue in the form of a linked list.
3. Hence, in the design, a maximum of 32 linked lists can be active at the same time.
4. The state then transits to s3.

s3
1. No activity happens in this state other than waiting for one clock cycle so that the DPRAM write issued in the previous state completes.
2. The state then transits to s0.
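As a rough behavioural sketch of the enqueue-side bookkeeping described above (the structure and variable names are invented for this sketch, and the valid bit packed into the table words is modelled as a separate flag), the per-queue packet-length linked list could look like this in C:

#include <stdint.h>
#include <stdbool.h>

#define NUM_QUEUES  32
#define PKT_RAM_SZ  32768              /* depth of the Packet Length DPRAM (assumed)   */
#define END_MARK    0xFFFF             /* "next" value meaning last packet in the list */

typedef struct { uint16_t length; uint16_t next; } pkt_entry_t;  /* {11-bit len, next} */

static pkt_entry_t pkt_ram[PKT_RAM_SZ];
static uint16_t    enq_next[NUM_QUEUES];   /* tail link per queue (Enqueue QNP table)   */
static uint16_t    deq_next[NUM_QUEUES];   /* head link per queue (Dequeue QNP table)   */
static bool        qnp_valid[NUM_QUEUES];
static uint16_t    wr_addr;                /* current Packet Length DPRAM write address */

/* Append one packet length to the linked list of queue q. */
void record_pkt_length(unsigned q, uint16_t pkt_len)
{
    pkt_ram[wr_addr] = (pkt_entry_t){ .length = pkt_len, .next = END_MARK };
    if (!qnp_valid[q]) {
        enq_next[q] = deq_next[q] = wr_addr;   /* first packet: head = tail = new entry */
        qnp_valid[q] = true;
    } else {
        pkt_ram[enq_next[q]].next = wr_addr;   /* link the old tail to the new entry    */
        enq_next[q] = wr_addr;                 /* and move the tail                     */
    }
    wr_addr = (uint16_t)((wr_addr + 1) % PKT_RAM_SZ);
}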

8.3.5 Interface to DDRC and SCH

The DDRC acknowledges the successful write of a packet into the DDR2 memory to the QM by asserting the signals ddrc qm enq ack and ddrc qm enq ack qNum. The ddrc qm enq ack qNum is an aggregation of the queue number and the corresponding group number and is hence 7 bits wide. When this is received by the QM, it issues update commands to the Scheduler to inform it of the successful enqueue event. The signals used for this are qm sch update and qm sch queueNum. In reply, the scheduler runs its own algorithm and keeps issuing dequeue requests to the QM. All these commands are written into the sch qm cmdFIFO. This FIFO is 64 entries deep, with each entry being 7 bits wide. The 7-bit write data into this FIFO is the sch qm queueNum, which is asserted along with sch qm deq req. The commands issued by the SCH to the QM are executed by the dequeue FSM in the QM. The QM generates a handshake signal called qm sch ready, which is an inversion of the sch qm cmdFIFO prog full signal. The programmable full threshold for this FIFO has been set at 56 for a total depth of 64.

FSM to dequeue the packets out of the queues

The deq FSM takes care of dequeuing the packets out of the queues and issues commands to the MRC. The state diagram for this FSM is shown in Figure 8.11.

Figure 8.11: FSM to dequeue the packets out of the queues

s0
1. If there are unserviced commands in the sch qm cmdFIFO and if mrc qm pause is low, a sch qm cmdFIFO re is issued. The state then transits to s1. The mrc qm pause is a handshake signal issued by the MRC to the QM to indicate its readiness to accept commands; if this signal is high, the MRC is not in a position to accept any more commands from the QM.


s1
1. If sch qm cmdFIFO rdvalid is high, a read is issued to the Dequeue Queue Next Pointer table and the Queue Head Pointer table at an address equal to the queue number obtained from sch qm cmdFIFO rdata.
2. The state then transits to s2.

s2
1. A read is issued to the Packet Length DPRAM at an address equal to deq qnp table rdata.
2. The state now transits to s3.

s3
1. The commands to the MRC are issued over the signals qm mrc deq req, qm mrc raddr, qm mrc queueNum and qm mrc pktLength. The qm mrc raddr is driven with the data read out of the Queue Head Pointer table. The qm mrc pktLength is driven with the upper 11 bits of the data read out of the Packet Length DPRAM in the previous state.
2. A write is issued to the Queue Head Pointer table to move the head pointer by an amount proportional to the packet length of the packet being dequeued.
3. If dpram rdata[15:0] = 0xFFFF, this is the last packet in the queue being dequeued. Hence, a write is done into the Dequeue Queue Next Pointer table to invalidate the entry corresponding to this queue number. A corresponding invalidation is also done in the Enqueue Queue Next Pointer table.
4. If this is not the last packet in the queue, the Enqueue Queue Next Pointer table is left untouched, and the Dequeue Queue Next Pointer table is written with {1'b1, dpram rdata[14:0]}, i.e. the dequeue queue next pointer is updated to point to the location in the Packet Length DPRAM where the packet length of the next packet is stored.
5. The state then transits to s0.
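Continuing the same illustrative C sketch (same invented structures as the enqueue sketch above), the dequeue side pops the head of the per-queue list and invalidates the links when the end marker is reached:

/* Pop the head packet length of queue q; returns false if no list exists for q.
   Uses the structures from the enqueue sketch above. */
bool dequeue_pkt_length(unsigned q, uint16_t *pkt_len)
{
    if (!qnp_valid[q])
        return false;
    pkt_entry_t head = pkt_ram[deq_next[q]];
    *pkt_len = head.length;
    if (head.next == END_MARK)
        qnp_valid[q] = false;      /* last packet: invalidate both head and tail links */
    else
        deq_next[q] = head.next;   /* advance the head link to the next packet         */
    return true;
}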

8.4 Memory Write Controller

The Memory Write Controller (MWC) is responsible for issuing write commands to the DDR2 Controller (DDRC). Its functionality can be summarized as follows.
1. The MWC accepts write commands from the QM.
2. In response, it issues read commands to the PA.
3. Data coming in from the PA is stored locally in FIFOs.
4. The MWC then issues commands to the DDRC to actually write into the DDR2 memory.

8.4.1 Brief Description of I/O Ports

The I/O diagram of the MWC is shown in Figure 8.12.
1. qm mwc write: A one-clock-cycle high signal which is asserted when the QM issues write commands to the MWC. This signal also validates the qm mwc waddr, qm mwc queueNum, qm mwc groupNum and qm mwc pktLength signals.
2. qm mwc waddr[24:0]: The 25-bit DDR2 memory address starting from which the current packet should be written is asserted on this bus.
3. qm mwc pktLength[10:0]: An 11-bit bus on which the packet length of the current packet is asserted by the QM.
4. qm mwc queueNum[4:0]: The value on this bus indicates the queue number to which the current packet belongs.


Figure 8.12: I/O diagram of MWC

5. qm mwc groupNum[1:0]: The group number to which the current packet belongs is indicated by the value on this bus.
6. qm mwc drop: This signal is asserted high for one clock cycle along with qm mwc write when the QM instructs the MWC to drop the current packet and to inform the PA of this as well.
7. mwc qm ready: A ready signal asserted by the MWC to inform the QM that it is ready to accept more write commands.
8. pa mwc pktDone: This is asserted high for one clock cycle when the PA is sending the last 8 bytes of a packet to the MWC.
9. pa mwc rdata[63:0]: The 8-byte wide bus over which the PA sends packet data to the MWC.
10. pa mwc rdvalid: The valid signal corresponding to the data on pa mwc rdata.
11. mwc pa getNewPkt: A request from the MWC to read the next packet out of the PA. This is asserted high for one clock cycle only.
12. mwc pa drop: A one-clock-cycle high signal asserted by the MWC to instruct the PA to drop the next packet present in the enq dpram in the PA.
13. mwc pa incrPtr[7:0]: The value on this bus is interpreted in two ways. If mwc pa drop is high, it indicates by how much the locally maintained mwc rdptr (for the DPRAM) should be advanced to drop the next packet. If mwc pa drop is low, it indicates how many entries the PA should read out of the enq dpram to completely send the packet to the MWC.
14. mwc ddrc waddr[24:0]: The DDR2 memory address to which the 128 bits on the bus mwc ddrc wdata[127:0] should be written is driven over this bus.
15. mwc ddrc wdata[127:0]: The 128-bit write data for the 4 DDR2 memory addresses starting from the current value on mwc ddrc waddr is driven on this bus. This bus is 128 bits wide because the DDR2 controller operates at a burst length of 4 and needs 32 bits per DDR2 memory address.
16. mwc ddrc we: This signal is driven high to validate the values on mwc ddrc waddr and mwc ddrc wdata.
17. mwc ddrc record qNum[6:0]: This bus is an aggregation of the 5-bit queue number and the 2-bit group number to which the packet currently being written into the DDR2 belongs.


18. mwc ddrc record endAddr en: A high driven on this signal validates the value on the bus mwc ddrc record qNum.
19. ddrc mwc ready: This signal is asserted high by the DDRC wrapper when it wishes to express its readiness to accept more write commands from the MWC.

8.4.2 Interface to QM and PA

All commands from the QM are written into local FIFOs. There are three FIFOs at this interface.
1. qm mwc cmdFIFO
(a) This is a 16-deep, 12-bit wide FIFO.
(b) It stores the qm mwc drop bit aggregated with qm mwc pktLength.
2. qm mwc qNumFIFO
(a) This is a 16-deep, 7-bit wide FIFO.
(b) It stores the qm mwc groupNum aggregated with qm mwc queueNum.
3. qm mwc waddrFIFO
(a) This is a 16-deep, 25-bit wide FIFO.
(b) It stores the qm mwc waddr bits.
The mwc qm ready signal is generated by inverting the OR of the prog full signals of the above FIFOs; the prog full threshold is set at 10 for each of them. The mwc qm ready indicates the readiness of the MWC to accept more commands from the QM. To service the commands stored in the qm mwc cmdFIFO, an FSM called cmdFIFO FSM has been designed.

FSM to execute commands from qm mwc cmdFIFO

The state diagram of this cmdFIFO FSM is shown in Figure 8.13. This FSM mainly reads out the data from the PA and saves it in another local FIFO called mwc ddrc FIFO.

s0
1. The FSM issues a qm mwc cmdFIFO re when there are unserviced commands left in the FIFO, as indicated by qm mwc cmdFIFO empty being low.
2. The state then transits to s1.

s1
1. If the drop bit is set in the qm mwc cmdFIFO rdata, the MWC issues a mwc pa drop to the PA. This drop is accompanied by a value on mwc pa incrPtr, which is the amount by which the PA should advance its local mwc rdptr.
2. However, if the drop bit is not set in the qm mwc cmdFIFO rdata, the MWC asserts the mwc pa getNewPkt output to the PA to read the next packet out of the PA. The mwc pa incrPtr is driven with a value proportional to the packet length; the PA uses this as the number of reads it should perform on the enq dpram.
3. Also, the packet length is written into the mwc ddrc FIFO in this state. The packet length is zero-extended to a 64-bit value because the mwc ddrc FIFO is a 1024-deep, 64-bit wide FIFO. The prog full threshold for this FIFO is set at 1020.


Figure 8.13: FSM to execute commands from qm mwc cmdFIFO

4. The state then transits to s2.

s2
1. The mwc ddrc FIFO is written with the values incoming on the bus pa mwc rdata, validated by pa mwc rdvalid.
2. The state transits to s0 when pa mwc pktDone is asserted by the PA. The pa mwc pktDone is asserted when the last 8 bytes of the packet are being sent out by the PA.
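The exact scaling of mwc pa incrPtr is not spelled out beyond being proportional to the packet length; assuming the PA's enq dpram holds the packet as 64-bit entries (as the width of pa mwc rdata suggests), a plausible sketch of the value driven in the non-drop case is:

/* Illustrative assumption only: number of 8-byte enq_dpram entries occupied by a
   packet of pkt_length_bytes, i.e. ceil(pkt_length / 8). */
static inline unsigned pa_entries_for_packet(unsigned pkt_length_bytes)
{
    return (pkt_length_bytes + 7) / 8;
}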

8.4.3 Interface to DDRC

The MWC has to send the data from the mwc ddrc FIFO out to the DDRC so that the packets get written into the DDR2 memory. There is an FSM to take care of this task. The state diagram of the ddrc FSM is shown in Figure 8.14.


Figure 8.14: FSM to read out the data and address to the DDRC

s0
1. If mwc ddrc FIFO empty is low and ddrc mwc ready is high, a read is issued to the mwc ddrc FIFO. ddrc mwc ready is a handshake signal issued by the DDRC to the MWC to indicate its readiness to accept commands from the MWC.
2. The state then transits to s1.

s1
1. The mwc ddrc FIFO rdata bus in this state carries the packet length of the packet to be sent out. The MWC uses this value to initialize a counter, num ddrc FIFO pending re, proportional to the packet length. The value in this counter indicates how many more read enables should be issued to the mwc ddrc FIFO to completely transfer this packet out.
2. This counter is needed because the DDRC expects 128-bit write data, and if the packet length is not a multiple of 128 bits the zero-appending has to be done by the MWC. So, when the counter reaches zero, if only 64 bits of the packet are left to be sent out, the FSM prepends 64 bits of zeros to form a 128-bit word and sends this out.
3. Also, in this state another mwc ddrc FIFO re is issued to read out the next bytes of the packet.
4. A read is also issued to the qm mwc waddrFIFO to read out the starting DDR2 address to which these packet bytes must be driven.
5. The state then transits to s2.

s2
1. Here, the MWC issues commands to the DDRC over the signals mwc ddrc we, mwc ddrc waddr and mwc ddrc wdata.
2. A read is issued to the mwc ddrc FIFO to read out the next bytes of the packet.
3. The state then transits to s3.

s3
1. The FSM remains in this state and keeps issuing mwc ddrc FIFO re until the counter, num ddrc FIFO pending re, reaches zero; this condition is reflected by the mwc ddrc FIFO readDone signal.
2. The commands to the DDRC continue to be given by incrementing mwc ddrc waddr by 4 every time and by forming 128-bit write data once every 2 successive clock cycles.
3. Once all the bytes are sent out, the state transits to s0.
4. Before transiting to s0, in the last clock cycle, as the MWC is sending out the last bytes of the packet and the last write address for this packet, the MWC asserts the output mwc ddrc record endAddr en along with mwc ddrc record qNum. The bus mwc ddrc record qNum is driven with the value on the qm mwc qNumFIFO rdata bus, for which a read would have been issued in the previous clock cycle in this state.
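As an illustrative C sketch of the data path just described (not the RTL; whether the zero padding lands in the upper or lower half of the final beat is an assumption here), packing the 64-bit FIFO words into 128-bit write beats with the address advancing by 4 per beat could look like this:

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t hi, lo; } beat128_t;   /* one 128-bit DDR2 write beat */

/* Pack num_words 64-bit words into 128-bit beats; each beat covers 4 DDR2 addresses.
   If an odd number of words remains, the last beat is padded with 64 bits of zeros. */
size_t form_write_beats(const uint64_t *words, size_t num_words,
                        uint32_t start_addr, beat128_t *beats, uint32_t *addrs)
{
    size_t nbeats = 0;
    for (size_t i = 0; i < num_words; i += 2) {
        beats[nbeats].lo = words[i];
        beats[nbeats].hi = (i + 1 < num_words) ? words[i + 1] : 0;   /* zero padding */
        addrs[nbeats]    = start_addr + 4u * (uint32_t)nbeats;       /* +4 per beat  */
        nbeats++;
    }
    return nbeats;
}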

8.5 Memory Read Controller

The Memory Read Controller (MRC) issues read commands to the DDRC WRAPPER unit. The MRC receives the frame length and the starting address where the frame is stored from the QM. On receipt of the commands from the QM, the MRC issues the DDR2 addresses that have to be read out. The data is then sent out to the TXQ with the proper control signals. The SHAPER and the QM are informed when a frame has been completely read out from the DDR2 memory to the TXQ. The I/O diagram of the MRC is shown in Figure 8.15.

8.5.1 Interface to QM

Signals from QM to MRC


Figure 8.15: I/O diagram of MRC

1. qm mrc dequeue: This signal is a valid signal for qm mrc queueNum, qm mrc pktLength and qm mrc deq raddr. When this signal goes high, the data on the above-mentioned ports is stored for processing. When this signal is low, the data on these buses is not valid.
2. qm mrc queueNum[4:0]: This bus contains the queue number of the packet that is requested to be dequeued. Whenever the data on this bus is valid, it is stored in the qm mrc queueNumFIFO.
3. qm mrc pktLength[10:0]: This bus contains the frame length of the frame that is to be dequeued. Whenever the data on this bus is valid, it is stored in the qm mrc pktLengthFIFO.
4. qm mrc deq raddr[24:0]: This bus contains the address from where the frame can be dequeued. This is the starting address of the memory area where the frame is stored. Whenever the data on this bus is valid, it is stored in the qm mrc cmdFIFO.

Signals from MRC to QM
1. mrc qm deq ack: This signal is generated when a frame has been completely read out from the DDR2 module to the TXQueue. It is the valid signal for mrc qm queueNum and mrc qm pktLength. When this signal is low, the data on these buses is not valid.
2. mrc qm queueNum[4:0]: This bus contains the queue number of the frame that has been successfully dequeued.
3. mrc qm pktLength[10:0]: This bus contains the frame length of the frame that has been successfully dequeued.
4. mrc qm pause: When this signal goes high the QM suspends issuing commands to the MRC. This signal is used to prevent the QM from overflowing the MRC's FIFOs.

8.5.2 Interface to DDRC WRAPPER

Signals from DDRC WRAPPER to MRC
1. ddrc mrc ready: This signal goes high when the DDRC is ready to take commands from the MRC. Commands should not be issued when this signal is low.
2. ddrc mrc rdvalid: This signal is the valid for ddrc mrc rdata. The MRC reads the ddrc mrc rdata bus only when this signal is high.
3. ddrc mrc rdata[63:0]: This bus carries the data sent from the DDRC WRAPPER. The data appears in the order of the addresses that were issued. The valid data from this bus is stored in the mrc txq FIFO.

Signals from MRC to DDRC WRAPPER
1. mrc ddrc re: This is the valid signal for mrc ddrc raddr. When this signal is high, the DDRC latches the address from mrc ddrc raddr and reads the data out.
2. mrc ddrc raddr[24:0]: This bus contains the address from which the data has to be dequeued. For every address that is issued, 16 bytes of data will be read out.

8.5.3 Interface to TXQueue

Signals from TXQ to MRC
1. txq mrc in ready: The MRC proceeds to write to the TXQueue only when this signal is high. When it is low, no transactions are made by the MRC with the TXQ.

Signals from MRC to TXQ
1. mrc txq wr in: This signal is the valid for mrc txq control in and mrc txq data. It is driven to 1 whenever the data on the above-mentioned buses is to be sampled by the TXQ.
2. mrc txq data[63:0]: This bus is used to send the frame data from the MRC to the TXQ.
3. mrc txq control in[7:0]: This bus is used to indicate the start and end of a frame. The bus carries 0xFF for the first bytes of the data, carries a unique number (needed to identify the byte boundary) at the end of the frame, and stays at 0x00 during the transmission of the other bytes.

8.5.4 Interface to SHAPER

Signals from MRC to SHAPER
1. mrc shp rddone: This signal goes high when the frame has been completely read out from the DDRC WRAPPER to the TXQ. It is the valid for the following signals: mrc shp queueNum and mrc shp pktLength.
2. mrc shp queueNum[4:0]: This bus carries the queue number of the last frame read out, from the MRC to the SHAPER. The data on this bus is valid only when mrc shp rddone is high.
3. mrc shp pktLength[10:0]: This bus carries the frame length of the last frame read out, from the MRC to the SHAPER. The data on this bus is valid only when mrc shp rddone is high.


8.5.5 Interface to SCHEDULER

1. mrc qm pause: This signal is asserted to inform the Scheduler to pause making scheduling decisions. When this signal is high, the scheduler is suspended.

8.5.6 Microarchitecture

The MRC basically has the following parts:
1. Address Generation Unit
2. TXQ Interface

Address Generation Unit

The address generation unit reads the commands from the FIFOs and issues the right number of mrc ddrc re assertions to the DDRC. Since every frame is stored with its frame length at the head, this extra word must also be accounted for. The FSM for this controller is shown in Figure 8.16. Description of the FSM shown in Figure 8.16:

S0: In this state, the controller waits until a command is available for execution. Whenever the qm mrc cmdFIFO and the qm mrc pktLengthFIFO have data, it moves to the next state.

S1: In this state, the controller initializes the counters for the frame length and the base address of the frame. The controller waits in this state until the data from the FIFO is valid and then proceeds to the next state.

S2: In this state, the address issued is incremented by 4 and the ddrc mrc pktLen counter is decremented by 1. When the last address for the packet has been issued, the state goes back to S0.

TxQ Interface

The TxQ interface has two functions:
1. Since every assertion of mrc ddrc re results in 2 assertions of ddrc mrc rdvalid, a packet that needs an odd number of 64-bit data words ends up with one extra word on ddrc mrc rdata which has to be discarded. For example, a frame length of 64 bytes is stored in the DDR2 as 8 + 1 (frame length) = 9 words of 64 bits each. This can be read out by issuing 5 read commands; however, the 10th data word has to be discarded.
2. When communicating with the TxQ, the start-of-frame and end-of-frame indications should be issued correctly on mrc txq control in, and mrc txq wr in should be asserted properly.


Figure 8.16: FSM for Address Generation in MRC

With reference to the FSM in Figure 8.17:

S0: In this state the controller waits until there is an unserviced command. This is done by polling the qm mrc pktLengthFIFO empty signal, which tells whether there is a frame that needs attention.

S1: In this state the controller issues the command to read the mrc txq FIFO. The controller then moves to the next state.

S2: The first data word of any frame is its length. Hence the controller loads its counter with the output of the mrc txq FIFO. This counter decrements every time a read command is issued to the FIFO, so that the controller knows when the frame has been read out completely.

S3: The controller stays in this state only at the start of the frame. It issues the start-of-frame value on mrc txq control in and transits to the next state.

S4: The controller stays in this state until the end of transmission. After the last 64 bits of the frame are transmitted, the controller moves to the next state. During transmission the controller keeps mrc txq control in at 0x00, and it drives the end-of-frame value onto the bus with the last data.


Figure 8.17: FSM for controlling TxQ Interface in MRC

S5: This state is necessary if an extra 64-bit word remains in the FIFO (due to point 1 under the TxQ Interface above); the extra word is flushed out of the FIFO here.
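The read-command and discard accounting described above can be sketched behaviourally in C as follows (the names are invented; the sketch assumes the frame is stored as 64-bit words preceded by one 64-bit length word, as in the example above):

#include <stdbool.h>

typedef struct { unsigned read_cmds; bool discard_last_word; } mrc_read_plan_t;

/* One mrc_ddrc_re returns two 64-bit words (one 16-byte burst), so an odd word
   count leaves one surplus word that the TxQ interface flushes in state S5. */
mrc_read_plan_t plan_frame_readout(unsigned frame_len_bytes)
{
    unsigned words = 1 + (frame_len_bytes + 7) / 8;   /* +1 for the length word      */
    mrc_read_plan_t p;
    p.read_cmds         = (words + 1) / 2;            /* two words per read command  */
    p.discard_last_word = (words % 2) != 0;           /* 64 B -> 9 words, 5 reads,
                                                         10th word discarded         */
    return p;
}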


8.6 Scheduler

The scheduler maintains the status of each queue and issues dequeue requests to the QM. The scheduling action is a function of the responses from QM and Shaper.

8.6.1 Interface to Shaper

The I/O Diagram of Scheduler is as shown in Figure 8.18.

Figure 8.18: I/O Diagram of SCH

Signals from Shaper to Scheduler
1. shp sch queueOutOfTokens set: This signal is a valid for shp sch queueOutOfTokens set qNum. It is polled every clock cycle and, when high, the bus shp sch queueOutOfTokens set qNum is read in to be processed. The signal is used by the scheduler to know which queue has exhausted all its tokens.
2. shp sch queueOutOfTokens reset: This signal is a valid for shp sch queueOutOfTokens reset qNum. It is polled every clock cycle and, when high, the bus shp sch queueOutOfTokens reset qNum is read in to be processed. The signal is used by the scheduler to know which queue has gained tokens.
3. shp sch queueOutOfTokens set qNum[4:0]: This port is used to inform the scheduler about the queue that has exhausted all its tokens. The data on this port is sampled only when shp sch queueOutOfTokens set is high.
4. shp sch queueOutOfTokens reset qNum[4:0]: This port is used to inform the scheduler about the queue that has recently gained tokens. The data on this port is sampled only when shp sch queueOutOfTokens reset is high.

8.6.2 Interface to QM

Signals from the QM to the Scheduler
1. qm sch ready: When this signal is high, the scheduler knows that the QM is ready to take commands from the scheduler. When this signal is low, no requests are made to the QM.


2. qm sch queueNum[4:0]: This port is used by the QM to inform the scheduler about the queue number of the frame that got queued in the memory. This signal is sampled only when qm sch update is high.
3. qm sch update: This signal is the valid signal for qm sch queueNum.

Signals from the Scheduler to the QM
1. sch qm deq req: This signal goes high whenever the scheduler has to inform the QM about the next queue to be dequeued.
2. sch qm queueNum[4:0]: This port is used by the scheduler to inform the QM of the queue number of the queue that is to be dequeued. The data on this bus is valid only when sch qm deq req is high.

8.6.3 Interface to PCI

Signals from the PCI to the Scheduler
1. update weight: This is a valid signal for the data coming from the PCI to the scheduler. The scheduler keeps polling this signal; when it goes high, the weight of the indicated queue is modified.
2. update weight qNum[4:0]: This port is used by the PCI to inform the scheduler of the queue number whose weight is to be modified. The data on this bus is valid only when update weight is high.
3. update weight weight[15:0]: This port is used by the PCI to supply the new weight for the queue being modified. The data on this bus is valid only when update weight is high.

8.6.4 Microarchitecture

The scheduler has the following important parts:
1. Weights and tokens table
2. The scheduler engine
3. Scheduler FIFO

Weights and tokens table

The scheduler stores the weight of each queue and the token status of all the queues in the arrays weight and shp sch queueHasTokens respectively. The weight is initialized to a default value on reset and can be modified over the PCI interface through software. The value in weight for a queue is the maximum number of frames that can be scheduled in one scheduling decision. The shp sch queueHasTokens array is modified by the commands issued by the shaper. A queue gets scheduled only if the bit corresponding to that queue is set in the array shp sch queueHasTokens.

The scheduler engine

The scheduler engine performs:


1. Group scheduling: The group scheduling scheme is round robin. There are 3 groups, so every scheduling decision picks the next non-empty group. After a group has been picked, the engine proceeds to the queue scheduling of that group.
2. Queue scheduling: The queue scheduling implemented is weighted round robin, i.e., when a non-empty queue in a group is selected by round robin, the number of frames dequeued from it is equal to the value in the array weight (a sketch of this decision is given at the end of this subsection).
The scheduler engine proceeds to schedule a queue only when the prog full of the Scheduler FIFO is 0 and mrc sch pause is 0. The scheduling decision for a queue is made in a single clock cycle.

Scheduler FIFO

The Scheduler FIFO stores the scheduling decisions of the scheduler engine, and the accompanying logic decodes the FIFO data and issues the commands on the interface to the QM. The use of a FIFO makes it possible for the scheduler engine to work without having to handle the issuing of commands to the QM interface itself, which allows the scheduler engine to operate in a single clock cycle.
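For illustration, one scheduling decision under the scheme described above could be modelled behaviourally in C as follows (the data layout, array names and per-queue group mapping are assumptions of this sketch, not the RTL):

#include <stdbool.h>

#define NUM_GROUPS 3
#define NUM_QUEUES 32

extern unsigned queue_group[NUM_QUEUES];      /* group each queue belongs to        */
extern bool     queue_non_empty[NUM_QUEUES];
extern bool     queue_has_tokens[NUM_QUEUES]; /* mirrors shp_sch_queueHasTokens     */
extern unsigned weight[NUM_QUEUES];           /* max frames per scheduling decision */

static unsigned last_group;                   /* round-robin position over groups   */
static unsigned last_queue;                   /* round-robin position over queues   */

/* Returns the number of dequeue requests to issue for the picked queue (0 if none). */
unsigned schedule_once(unsigned *picked_queue)
{
    for (unsigned g = 1; g <= NUM_GROUPS; g++) {
        unsigned grp = (last_group + g) % NUM_GROUPS;
        for (unsigned q = 1; q <= NUM_QUEUES; q++) {
            unsigned qu = (last_queue + q) % NUM_QUEUES;
            if (queue_group[qu] == grp && queue_non_empty[qu] && queue_has_tokens[qu]) {
                last_group    = grp;
                last_queue    = qu;
                *picked_queue = qu;
                return weight[qu];   /* up to 'weight' frames dequeued in one decision */
            }
        }
    }
    return 0;   /* nothing eligible to schedule */
}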

8.7 Shaper

The I/O diagram of the Shaper is shown in Figure 8.19. The Shaper is responsible for shaping the traffic on a per-queue basis by employing the token bucket mechanism. The functionality of the Shaper can be summarized as follows:
1. The Shaper maintains a token bucket for each of the 32 queues, into which tokens are added based on the token bucket rate and maximum burst threshold set for that queue.
2. A token update cycle is initiated every X clocks, where X is a user-configurable value.
3. During an update cycle, if the tokens in a bucket go from negative to positive, the Shaper informs the Scheduler of this.
4. The Shaper decrements the tokens in a bucket in response to the commands coming in from the MRC, which indicate the packet length of the packet that was dequeued from a particular queue.
5. Whenever the tokens in a bucket go negative or zero, the Shaper informs the Scheduler about this by asserting its outputs.

Figure 8.19: I/O Diagram of SHP


8.7.1 Brief Description of I/O Ports

1. shp sch queueOutOfTokens set: This signal is asserted high for one clock cycle when the Shaper wants to set the QueueOutOfTokens flag in the SCH for a particular queue number.
2. shp sch queueOutOfTokens reset: This signal is asserted high for one clock cycle when the Shaper wants to unset the QueueOutOfTokens flag in the SCH for a particular queue number.
3. shp sch queueOutOfTokens set qNum[4:0]: The value on this bus indicates the queue number for which the Shaper wants to set the QueueOutOfTokens flag in the SCH.
4. shp sch queueOutOfTokens reset qNum[4:0]: The value on this bus indicates the queue number for which the Shaper wants to unset the QueueOutOfTokens flag in the SCH.
5. pci write update timer reg: This signal is driven high if the user wants to update the update timer reg by means of the PCI.
6. pci update timer reg value[15:0]: The value with which the update timer reg in the SHP should be updated is put on this bus.
7. pci write qtokens table: If the user wishes to update the Queue Tokens table by means of the PCI, this signal is driven high.
8. pci write qtokens table index[4:0]: The index into the Queue Tokens table at which the tokens should be updated is driven on this bus.
9. pci write qtokens table value[15:0]: The value with which the Queue Tokens table should be updated at the given index is driven on this bus.
10. mrc shp pktLength[10:0]: The MRC informs the SHP of the successful read-out of a packet from the DDR2 memory by driving this bus with the packet length of the packet just read out.
11. mrc shp queueNum[4:0]: The MRC also informs the SHP of the queue number to which the recently read-out packet belongs by driving it on this bus.
12. mrc shp rd done: A one-clock-cycle high assertion on this signal validates the buses mrc shp pktLength and mrc shp queueNum.

8.7.2 DPRAMs used in the design

The Queue Tokens table is maintained in a DPRAM in the Shaper. This DPRAM is a Xilinx Coregen core. The DPRAM has 32 entries, one per queue, and each entry is 64 bits wide. The layout of an entry is given in Table 8.6.

Table 8.6: Queue Tokens Table in Shaper
Bits 63:48 : Last Update Time
Bits 47:44 : Maximum Burst Threshold
Bits 43:12 : Queue Tokens
Bits 11:0  : Bucket Rate

• The ”Last Update Time” field stores the value of the free running timer at the time the tokens for this particular entry were last updated. The free running timer is a timer running at the core clock frequency of 125 MHz.
• ”Maximum Burst Threshold” is a 4-bit value which decodes to the maximum possible number of tokens that can accumulate in this bucket.


• ”Queue Tokens” is a 32-bit value which indicates the number of tokens for this queue. The MSB of this field indicates whether the value is negative or positive, depending on whether it is set or reset respectively.
• ”Bucket Rate” is a 4-bit value which indicates the rate at which tokens accumulate in this token bucket. The value stored here may be viewed as tokens per unit time.
The Queue Tokens table is initialized to default values on reset. Only the Maximum Burst Threshold and Bucket Rate are assigned non-zero values, while the Last Update Time and Queue Tokens are initialized to zero on reset.

8.7.3 Interface to Scheduler

The interface of the Shaper towards the Scheduler has ports over which the Shaper informs the Scheduler of the presence or absence of tokens in the bucket of a given queue.

8.7.4 Token Update Cycle

A token update cycle is initiated periodically in the Shaper. The update cycle is triggered when a free-running update timer equals the value in the update timer reg. The update timer reg is initialized to a default value on reset. A flag called begin update cycle is set to trigger the update FSM.

FSM to do the token bucket updates

This FSM, called update FSM, is triggered when the begin update cycle flag is set. It performs a token update for all the queues in a single update cycle. The state diagram of this FSM is shown in Figure 8.20.

s0
1. If begin update cycle is seen to be high, the state transits to s1 after issuing a read to the Queue Tokens table at address 0x0.

s1
1. When the read of the Queue Tokens table completes, as indicated by the qtokens table rdvalid signal, the read data is latched and the state transits to s2.

s2
1. In the first clock cycle in this state, the time difference between now and the last update time, in terms of clock cycles, is obtained. This is done by subtracting the ”Last Update Time” from the current free running timer. The ”Last Update Time” is obtained as the upper 16 bits of the data read out of the Queue Tokens table for the queue currently being updated. The FSM remains in this state after this calculation. The following pseudo-code illustrates the activity here:
time diff = free running timer - last updated time;


Figure 8.20: Token Bucket Update FSM in Shaper

2. In the next clock cycle, the time difference obtained in the previous clock is multiplied with the ”Bucket Rate” obtained from the Queue Tokens Table read data. This multiplication yields the total number of tokens to be added to the token bucket. The FSM continues to stay in this state after the multiplication.
tokensToadd = Bucket Rate * time diff;
3. In the next clock cycle, the sum of the existing tokens in the bucket and the tokens to be added is compared with the maximum permissible tokens in the bucket, which is proportional to the ”Maximum Burst Threshold” for this queue's token bucket. The following pseudo-code explains the activity in this cycle.
if ((current qTokens + tokensToadd) > max possible tokens) {
    qTokens = max possible tokens;
} else {
    qTokens = current qTokens + tokensToadd;
}


The FSM continues to remain in s2 even after this clock cycle.
4. In the subsequent clock cycle, a write is issued to the Queue Tokens Table with the current queue as the write address, to update the entry with the new Queue Tokens and the new ”Last Update Time”. The FSM continues to stay in s2 after this clock cycle.
5. In the next clock cycle, if a write acknowledge is seen for the write issued to the Queue Tokens Table in the previous clock, the Queue Tokens Table read address is incremented by 1 in order to update the next queue in succession.
6. The FSM then continues to stay in s2 until all 32 queues have been updated in this manner, after which the state transits to s0.
In the queue token update step in state s2 above, if the tokens for any queue changed from a negative count to positive or from zero to positive, the outputs to the Scheduler are asserted over the signals shp sch queueOutOfTokens reset and shp sch queueOutOfTokens reset qNum. If, however, the token count was positive even before updating, these outputs to the Scheduler are not asserted.
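Putting the pseudo-code fragments above together, one token update for a single queue can be sketched behaviourally in C as follows (the field widths follow Table 8.6; the scaling of the Maximum Burst Threshold to an absolute token count, and all the names, are assumptions of this sketch):

#include <stdint.h>

#define NUM_QUEUES 32

typedef struct {
    uint16_t last_update_time;   /* bits 63:48 */
    uint8_t  max_burst_thresh;   /* bits 47:44 */
    int32_t  queue_tokens;       /* bits 43:12, MSB acts as the sign */
    uint16_t bucket_rate;        /* bits 11:0, tokens per unit time  */
} qtoken_entry_t;

extern qtoken_entry_t qtokens[NUM_QUEUES];
extern uint16_t free_running_timer;

/* Called once per queue during an update cycle. Returns 1 when the bucket goes from
   non-positive to positive, i.e. when shp_sch_queueOutOfTokens_reset would be pulsed. */
int update_queue_tokens(unsigned q)
{
    qtoken_entry_t *e = &qtokens[q];
    uint16_t time_diff     = (uint16_t)(free_running_timer - e->last_update_time);
    int64_t  tokens_to_add = (int64_t)e->bucket_rate * time_diff;
    int64_t  max_tokens    = (int64_t)e->max_burst_thresh << 12;   /* assumed scaling */

    int32_t old_tokens = e->queue_tokens;
    int64_t new_tokens = (int64_t)old_tokens + tokens_to_add;
    if (new_tokens > max_tokens)
        new_tokens = max_tokens;                 /* clamp to the maximum burst size   */

    e->queue_tokens     = (int32_t)new_tokens;
    e->last_update_time = free_running_timer;
    return (old_tokens <= 0 && new_tokens > 0);
}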

8.7.5 Interface to MRC

The Shaper receives commands from the MRC which indicate the successful read-out of a packet of a queue from the DDR2 memory. An FSM called deqFSM is triggered when the MRC commands come in. MRC commands are validated by mrc shp rd done; the accompanying queue number is sent over mrc shp queueNum and the packet length is transmitted over the bus mrc shp pktLength.

Dequeue FSM

This FSM is used to alter the tokens in the bucket of the particular queue from which the last packet was read out, as informed by the MRC to the Shaper. The state diagram of this FSM is shown in Figure 8.21.

s0
1. When mrc shp rd done is seen, the state transits from s0 to s1, in the process issuing a read to the Queue Tokens Table at an address given by mrc shp queueNum.

s1
1. When the Queue Tokens Table has been read out successfully, which is indicated by the qtokens table rdvalid signal, the state transits to s2, having latched this data.

s2
1. The packet length of the packet that was last dequeued from this queue is now subtracted from the token count read out of the Queue Tokens Table.
2. The state then transits to s3.


Figure 8.21: Dequeue FSM in Shaper

s3
1. To write the resultant tokens into the table, a write is issued to the Queue Tokens Table.
2. In the next clock cycle, when the write acknowledge signal is seen for this table, the state transits to s0.
During the process of subtracting tokens from the queue token bucket above, if the token count changes from positive to negative or from positive to zero, the outputs to the Scheduler are asserted. However, if the token count was negative even before the subtraction, the outputs to the Scheduler are not asserted. The outputs to the Scheduler are driven over the signals shp sch queueOutOfTokens set and shp sch queueOutOfTokens set qNum.
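The dequeue-side token debit, with the positive-to-non-positive notification rule just described, can be sketched in the same behavioural C style (same assumed structures as the update sketch above):

/* Returns 1 when the bucket goes from positive to zero or negative, i.e. when
   shp_sch_queueOutOfTokens_set would be pulsed for this queue. */
int debit_queue_tokens(unsigned q, unsigned pkt_length)
{
    qtoken_entry_t *e = &qtokens[q];
    int32_t old_tokens = e->queue_tokens;
    e->queue_tokens    = old_tokens - (int32_t)pkt_length;
    return (old_tokens > 0 && e->queue_tokens <= 0);
}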


8.7.6 Interface from PCI

User configurability has been provided through the PCI for editing any entry in the Queue Tokens Table and for updating the update timer reg. If the user wishes to change an entry in the Queue Tokens Table for some queue, the signals pci write qtokens table, pci write qtokens table index and pci write qtokens table value must be asserted appropriately. The bit-width of pci write qtokens table value is only 16 bits because the user can only update the fields ”Maximum Burst Threshold” and ”Bucket Rate” of an entry. When an entry is updated by the PCI, the other fields in that entry, i.e. ”Last Update Time” and ”Queue Tokens”, are set to 0x0. If the user wishes to change the value in the update timer reg, the signal pci write update timer reg must be set along with the value to be written on the bus pci update timer reg value. Since there is a good chance of the Queue Tokens Table being accessed simultaneously by the PCI and the regular update FSMs, a clash-handling arbiter has been incorporated into the design to handle this case and prevent a clash.

8.8 DDRC Wrapper

The I/O diagram of the DDRC wrapper and the DDR2 Controller is shown in Figure 8.22. The DDRC wrapper interfaces to the DDR2 Controller which is a Xilinx Coregen core. The functionality of the DDRC wrapper can be summarized as follows:

Figure 8.22: I/O Diagram of DDRC Wrapper

1. It accepts both write and read commands, from the MWC and the MRC respectively.
2. The arbitrator inside the DDRC wrapper arbitrates among these commands and services them.
3. The wrapper issues commands and data to the actual DDR2 controller in the format that it desires.
4. For every successful write of a packet into the DDR2 memory, the wrapper acknowledges this to the QM.
5. The wrapper transfers the data read out from the memory to the MRC over the output signals.


8.8.1 Brief Description of I/O Ports

1. mwc ddrc waddr[24:0]: The DDR2 memory address to which the 128 bits on the bus mwc ddrc wdata[127:0] should be written is driven over this bus.
2. mwc ddrc wdata[127:0]: The 128-bit write data for the 4 DDR2 memory addresses starting from the current value on mwc ddrc waddr is driven on this bus. This bus is 128 bits wide because the DDR2 controller operates at a burst length of 4 and needs 32 bits per DDR2 memory address.
3. mwc ddrc we: This signal is driven high to validate the values on mwc ddrc waddr and mwc ddrc wdata.
4. mwc ddrc record qNum[6:0]: This bus is an aggregation of the 5-bit queue number and the 2-bit group number to which the packet currently being written into the DDR2 belongs.
5. mwc ddrc record endAddr en: A high driven on this signal validates the value on the bus mwc ddrc record qNum.
6. ddrc mwc ready: This signal is asserted high by the DDRC wrapper when it wishes to express its readiness to accept more write commands from the MWC.
7. mrc ddrc re: The MRC asserts this signal high to issue read commands to the DDRC wrapper.
8. mrc ddrc raddr[24:0]: The 25-bit DDR2 address from which the next 16 bytes should be read out is conveyed by the MRC by driving a value on this bus. The DDR2 is read at 4 successive locations starting from the value on this bus.
9. ddrc mrc rdata[63:0]: The 8-byte wide bus which carries the data read out of the DDR2 memory for the address that was issued previously.
10. ddrc mrc rdvalid: A high on this signal validates the value on the bus ddrc mrc rdata.
11. ddrc mrc ready: The DDRC wrapper asserts this signal high to indicate that it is ready to accept commands from the MRC.
12. ddrc qm enq ack qNum[4:0]: The DDRC wrapper acknowledges the successful write of a packet into the DDR2 memory by driving this bus with the queue number to which the packet belonged.
13. ddrc qm enq ack: A one-clock-cycle high assertion on this signal validates the value on the bus ddrc qm enq ack qNum.
14. dip1: The active-low clock enable signal for the DDR2 controller, which is driven by the DDRC wrapper.
15. dip3: An active-high DDR2 chip enable signal which should be driven by the DDRC wrapper.
16. burst done: A signal which should be driven high continuously for two DDR2 clock cycles to indicate the end of the current read or write burst.
17. input addr[21:0]: The aggregation of the row and column bits of a DDR2 memory address is put on this bus by the DDRC wrapper.
18. bank addr[1:0]: The bank bits of a DDR2 memory address are driven on these two bits.
19. cmd reg[3:0]: The 4-bit wide command bus over which the DDRC wrapper drives the read, write or init commands.
20. input data[63:0]: The data to be written into the DDR2 memory is issued in units of 8 bytes over this bus by the DDRC wrapper.
21. data mask[7:0]: An encoded bus with each bit corresponding to one byte on the input data bus. If any bit is set, the corresponding byte in input data is masked out. The DDRC wrapper always drives 0x00 on this bus.


22. cmd ack: The DDR2 controller acknowledges receipt of any command driven over cmd reg by asserting this signal high.
23. wait 200us: This signal is asserted high by the DDR2 controller on reset to allow for the initial 200 µs that the DDR2 memory needs to initialize and come to a stable state. After the initial 200 µs, this signal stays low.
24. output data[63:0]: The 8-byte wide bus over which the DDR2 controller sends the data read out of the DDR2 to the DDRC wrapper.
25. data valid: A signal that is driven high to validate the bus output data.
26. init val: This signal is driven high once the DDR2 memory has been initialized following the user init command.

8.8.2 Interface to MWC

1. The commands coming in from the MWC are put into two FIFOs called wr cmdFIFO and wr cmdFIFO clk90. Each of these FIFOs is 128 deep, with each entry being 153 bits wide.
2. The write data into these FIFOs is an aggregation of the 25-bit mwc ddrc waddr and the 128-bit mwc ddrc wdata.
3. The queue number that is asserted over the input bus mwc ddrc record qNum, validated by mwc ddrc record endAddr en, is put into another FIFO called mwc ddrc qNumFIFO. This FIFO is 32 deep, with each entry being 33 bits wide.
4. Each entry in this FIFO holds the aggregation of mwc ddrc record qNum along with the value on mwc ddrc waddr at the time mwc ddrc record endAddr en was asserted. This is needed because it is later used to generate the enqueue acknowledgements to the QM.
5. All three FIFOs mentioned above are dual-clock FIFOs. The write clock for these FIFOs is the 125 MHz core clock, while the read clock varies: the wr cmdFIFO and the mwc ddrc qNumFIFO are read out with a faster 200 MHz clock, while the wr cmdFIFO clk90 is read out using a 200 MHz clock having a 90 degree phase shift with respect to that clock.
6. These two clocks are outputs of the DDR2 controller and are made use of inside the DDRC wrapper.
7. A handshake signal called ddrc mwc ready is generated as an output by inverting the wr cmdFIFO prog full signal. The programmable full threshold for the FIFOs wr cmdFIFO and wr cmdFIFO clk90 is set at 120 for a total depth of 128. This signal indicates the readiness of the DDRC wrapper to accept commands from the MWC.

8.8.3 Interface to MRC

The commands coming in from the MRC are put into a FIFO called rd cmdFIFO. This FIFO is 256 deep and 25 bits wide, to hold the 25-bit mrc ddrc raddr coming in. The prog full threshold for this FIFO is set at 120. A handshake signal called ddrc mrc ready is generated as the inversion of rd cmdFIFO prog full, to indicate whether the DDRC wrapper is in a position to accept more commands from the MRC.


Memory Initialization FSM

Before read and write commands are issued to the DDRC, the DDR2 memory must be initialized using the memory initialization command. Hence, an FSM is designed to take care of this at the very beginning. After power-on reset, the DDR2 memory needs about 200 µs to come to a stable state. No commands must be issued to the DDR2 until then. Hence, the DDRC maintains a counter which counts the initial 200 µs. The completion of the wait time is indicated to the DDRC wrapper using the signal mem interface top wait 200us. Once this signal is deasserted, the memory initialization FSM starts operating. The state diagram of this FSM is shown in Figure 8.23.

Figure 8.23: Memory Init FSM in DDRC Wrapper

s0
1. The FSM remains in this state until mem interface top wait 200us is deasserted by the DDRC. Once this happens, config register1 is driven with 0x1032 while config register2 is driven with 0x0.
2. config register1 holds the configuration data for DDR2 memory initialization. The contents of this register are loaded into the Load Mode register during the initialization process. The format of config register1 and its bitwise description are given in the Xilinx application note XAPP549 [33]. The config register1 value is chosen so that the memory initializes for a burst length of 4, a burst type of Sequential and a CAS latency of 3.
3. The state then transits to s1.

s1
1. The command register is driven with 0x2, which is the value corresponding to the memory-init command.
2. The state then transits to s2.

s2
1. The FSM stays in this state from now on until a system reset. Once the init val output from the DDRC is seen, the signal memInitFSM done is set high to indicate that the DDR2 SDRAM memory initialization is complete. Figure 8.24 shows the behaviour of the DDRC when the memory initialization FSM is in operation.

Figure 8.24: Timing Diagram of DDR2 Memory Initialization Command

FSM to issue write commands to DDRC

This FSM has been designed in accordance with the requirements of the DDRC as specified in the XAPP549 document. The FSM conforms to this datasheet and issues commands as per the instructions in it. The state diagram of this FSM is shown in Figures 8.25 and 8.26. This FSM runs with the 200 MHz clock generated as an output by the DDRC.

s0
1. If the write FSM has gained the grant from the read-write arbiter, which is indicated by the assertion of the signal green signal to writeFSM, the state transits to s1 by issuing a read to the wr cmdFIFO.


Figure 8.25: Write FSM in DDRC Wrapper - Part 1

s1
1. The FSM just saves the wr cmdFIFO rdata into suitable registers like addr to write.


Figure 8.26: Write FSM in DDRC Wrapper - Part 2

2. The state then transits to s2.

s2
1. The FSM issues the write command onto the command register.


2. The first address to be written to is also asserted on the bus user input address[24:0].
3. The state then transits to s3.

s3
1. The FSM stays in this state until cmd ack is asserted by the DDRC in response to the write command issued in the previous state.
2. The state then transits to s4.
3. Before transiting, a read is issued to the wr cmdFIFO to read the next address to write to.

s4
1. This state is just to spend two clock cycles so that the write address remains asserted on user input address for these two clock cycles, which is as per the requirement of the DDR2 controller.
2. The state then transits to s5.
3. Before the transition, an address comparison is done here. The 25-bit address coming in from the MWC or the MRC is in the form {1'bX, 2'bBANK, 13'bROW, 9'bCOL}. There are 4 banks in the DDR2 memory, with each bank having 8K rows and each row having 512 columns. Each {ROW, COL} combination points to one location which holds 4 bytes of data.
4. The DDRC is best utilized when it does reads and writes in bursts. Burst writes and reads can be done only onto entries of the same ROW and BANK; thus the maximum burst length is 512.
5. So, an address comparison is done in this state between the {BANK, ROW} bits of the bus wr cmdFIFO rdata and those of the current write address. If the next address to write to belongs to the same ROW and BANK, a flag called wr addr comparison passed is set, so that this new address can be issued as a command immediately hereafter. If, however, the address comparison fails, the address cannot be driven now and the FSM must start from state s2 all over again.

s5
1. If the wr addr comparison passed flag is set, the user input address is driven with the next address to write to; else the state transits to s7.
2. A read is issued to the wr cmdFIFO to read the next address to write to.
3. The state then transits to s6.

s6
1. Here, another address comparison is done between the current write address and the next address to write to, which was just read out of the FIFO. The result of the address comparison is recorded in the flag wr addr comparison passed.
2. The state transits to s5 irrespective of the result of the address comparison.

s7


1. This state is needed to spend two clock cycles so that the last address in the previous burst stays on the user input address bus for at least two clock cycles.
2. The state then transits to s8.

s8
1. In this state, the burst done signal is asserted. This is to be done once the burst of commands has been given completely.
2. The burst done should be asserted for two clock cycles, after which the state transits to s9.

s9
1. The write command is now de-asserted on the command register.
2. The state now transits to s10.

s10
1. The DDRC acknowledges the deassertion of the write command by deasserting the user cmd ack signal, which was high all this while.
2. The state then transits to s11.

s11
1. In this state, a decision is made whether to transit to state s0 or s2 depending on the status of the signal wr cmd read out from FIFO. This signal is asserted throughout the state machine at all points where a command is read out of the FIFO, and it is deasserted at all those instances where the data read out of the FIFO is made to drive the input address bus. If, in this state, wr cmd read out from FIFO is set, which happens when the next address read out fails the address comparison in s6, the state transits to s2; else the state transits to s0 to start afresh by reading out the next command from the wr cmdFIFO.

FSM to issue write data to DDRC

This FSM has been designed in accordance with the requirements of the DDRC as specified in the XAPP549 document. The FSM conforms to this datasheet and issues write data as per the instructions in it. It runs with the 90 degree phase-shifted 200 MHz clock generated as an output by the DDRC. This FSM, called state wrFSM clk90, is exactly similar to the state wrFSM described in the previous section; the only difference is that it runs with the 90 degree phase-shifted clock. This is because the write input data to the DDRC must be asserted with the rising edge of clk90 and not with clk0. The timing diagram when both these FSMs are in operation is shown in Figure 8.27.


Figure 8.27: Timing Diagram of Write command in DDRC Wrapper

Issuing of enqueue acknowledgements to QM

The DDRC wrapper acknowledges the successful write of a packet into the DDR2 memory to the QM. This is done by an address comparator which is active every clock. The comparator compares the user input address being driven by the write FSM with the address in the data that was just read out of the mwc ddrc qNumFIFO. When the address comparison passes, the acknowledgement is sent over the signals ddrc qm enq ack and ddrc qm enq ack qNum[6:0]. After every comparison, the next entry of the mwc ddrc qNumFIFO is read out and the comparator starts all over again. However, since the comparator operates at the 200 MHz frequency, while the QM needs acknowledgements asserted with the 125 MHz core clock, a dual-clock FIFO is made use of. This FIFO, called ddrc qm enq ack FIFO, is a dual-clock, 32-deep, 7-bit wide FIFO. It is written when the output of the address comparator goes high; the corresponding queue number, which would have been read out of the mwc ddrc qNumFIFO, is written into this FIFO. Another block, operating on the rising edge of the slower 125 MHz core clock, keeps checking whether this FIFO has data. Whenever it does, a FIFO read is issued and the read data is sent out as the acknowledgement to the QM at the desired clock frequency of 125 MHz.

Arbiter

This round-robin arbiter arbitrates between granting access to the write FSM and the read FSM for the address and data bus to the DDRC. The arbiter remembers the last serviced FSM and gives higher priority to the other FSM in the next round if it requests access. Both the read and write FSMs release control of the address and data bus only when they are done with their current set of burst commands. At that point, the arbiter grants the go-ahead to the other FSM or continues giving the go-ahead to the currently operating FSM. The pseudo-code of the arbiter is furnished here for a clearer understanding:


if (last serviced FSM == WRITE && writeFSM current done == 1 &&
    green signal to writeFSM == 0) {
    if (rd cmdFIFO empty == 0 || rd cmd read out from FIFO == 1) {
        green signal to readFSM = 1;
        last serviced FSM = READ;
    } else if (wr cmdFIFO empty == 0 || wr cmd read out from FIFO == 1) {
        green signal to writeFSM = 1;
        last serviced FSM = WRITE;
    }
} else if (last serviced FSM == READ && readFSM current done == 1 &&
    green signal to readFSM == 0) {
    if (wr cmdFIFO empty == 0 || wr cmd read out from FIFO == 1) {
        green signal to writeFSM = 1;
        last serviced FSM = WRITE;
    } else if (rd cmdFIFO empty == 0 || rd cmd read out from FIFO == 1) {
        green signal to readFSM = 1;
        last serviced FSM = READ;
    }
}

if (state wrFSM == s2 && green signal to writeFSM == 1)
    green signal to writeFSM = 0;
if (state rdFSM == s2 && green signal to readFSM == 1)
    green signal to readFSM = 0;

FSM to issue read commands to DDRC

This FSM issues read commands (i.e. addresses) to the DDRC and operates on the 200 MHz clock with no phase shift. It issues its outputs as per the requirements mentioned in the Xilinx XAPP549 datasheet. The state diagram of this state machine is shown in Figures 8.28 and 8.29.

s0
1. If there are commands present in the rd cmdFIFO and the arbiter has granted access rights to this FSM by asserting the green signal to readFSM signal, the state transits to s1 after issuing a read to the rd cmdFIFO.

s1
1. The rd cmdFIFO rdata is latched in this state, after which the state transits to s2.


Figure 8.28: Read FSM in DDRC Wrapper - Part 1

s2
1. The cmd register is driven with 0x6 (the value for the READ command).
2. The input address is driven with the address to read from.


Figure 8.29: Read FSM in DDRC Wrapper - Part 2

3. The bank address is driven with the BANK bits from the read address.
4. The state then transits to s3.


s3
1. When cmd ack is seen, the state transits to s4. The cmd ack is asserted by the DDRC in response to the assertion of the READ command in the previous state.
2. Before the transition, the next read is issued to the rd cmdFIFO if that FIFO is not empty.

s4
1. This state is used to spend 4 clock cycles so that the read address stays asserted on the input address bus for that many clock cycles, as required by the DDRC inputs.
2. During the last of the 4 clock cycles, an address comparison is done between the {BANK, ROW} bits of the data read out from the FIFO and those of the current read address. The result of this comparison is recorded in the flag rd addr comparison passed.

s5
1. If the rd addr comparison passed flag has been set, the input address is driven with the next address to read from.
2. If rd cmdFIFO empty is low, a read is issued to the FIFO to fetch the next address to read from.
3. If the rd addr comparison passed flag has not been set, the state transits to s6; else it transits to s7.

s6
1. The address comparison is done here again and the rd addr comparison passed flag is updated accordingly.
2. The state then transits back to s5.

s7
1. Two clock cycles are spent here so that the last address of the current burst stays asserted on the input address bus for two clock cycles.
2. After this, the state transits to s8.

s8
1. burst done is asserted for two clock cycles here, after which the state transits to s9.

s9
1. This state is used to de-assert the burst done which was asserted in the previous state.
2. The state then transits to s10.

s10
1. The READ command on the cmd register is de-asserted in this state.


2. The state transits to s11.

s11
1. The state transits to s12 when cmd ack is de-asserted by the DDRC in response to the de-assertion of the READ command in the previous state.

s12
1. If a command read out from the rd cmdFIFO is unserviced at this point, the state transits to s2; else the state transits to s0 to start over by reading the FIFO.

When the read FSM is in operation, the timing diagram is as shown in Figure 8.30.

Figure 8.30: Timing Diagram of Read Command in DDRC Wrapper

8.8.4 Interface to MRC

In response to the read commands issued by the read FSM, the DDRC reads the DDR2 memory and returns the read data over the user output data signals, validated by user data valid. This happens at the 200 MHz clock frequency. However, the MRC, operating at the core-clock frequency of 125 MHz, expects the data at that frequency. Hence, the data coming out of the DDRC is written into a dual-clock FIFO called ddr2 ddrc FIFO. This FIFO is read out at the slower core clock and is 1024 deep, with each entry being 64 bits wide. The outputs of the FIFO are used to drive the signals ddrc mrc rdvalid and ddrc mrc rdata. The inversion of ddr2 ddrc prog full generates the signal ddrc mrc ready; the prog full threshold is set at 512 for this FIFO. The assertion of ddrc mrc ready indicates the readiness of the DDRC wrapper to accept more read commands from the MRC.


Chapter 9

Host Software

9.1 Introduction

The Network Access Traffic Manager (NATM) core inside the NetFPGA board can be controlled using the Host Interface. The software takes care of sending the commands to the device for configuring its parameters. The configurable parameters are:

1. The list of privileged-IP users
2. The weights of different protocols in the PC
3. The weights of different queues in the scheduler
4. The Tokens' table in the Shaper
5. The Timer register of the Shaper

9.2 Software-Hardware Interface

All the user configurations are communicated to the core using the CPCI interface implemented in the Spartan FPGA on the NetFPGA board. The commands from the CPCI bus are read by the reg top unit and then decoded. The reg top unit then asserts the correct signals and configures the registers of each module. Every command is mapped to a PCI register, so a command is decoded by parsing the PCI address of the target module. The software abstracts all these details from the user and provides an interface for the user to configure these values. The PCI addresses of the various registers are tabulated in Table 9.1; an illustrative register-write sketch follows the table.

Table 9.1: Configurable registers of NATM and their PCI addresses

Register name           PCI Address
SCHEDULER UPDATE        0x2000118
PRIV IP UPDATE          0x200011C
PC WEIGHT UPDATE        0x2000120
SHAPER TABLE UPDATE     0x2000124
SHAPER TIMER UPDATE     0x2000128
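As an illustration of how the host software can use this register map, the following C sketch writes two of the registers from Table 9.1. This is not the report's actual 'C' code: the natm_reg_write() helper, its logging behaviour and the example values are assumptions standing in for the board-specific register-write routine that goes through the CPCI interface.

#include <stdint.h>
#include <stdio.h>

/* PCI addresses taken from Table 9.1 */
#define SCHEDULER_UPDATE    0x2000118u
#define PRIV_IP_UPDATE      0x200011Cu
#define PC_WEIGHT_UPDATE    0x2000120u
#define SHAPER_TABLE_UPDATE 0x2000124u
#define SHAPER_TIMER_UPDATE 0x2000128u

/* Hypothetical register-write helper: a real implementation would go through
 * the NetFPGA/CPCI register access path; here it only logs the access. */
static void natm_reg_write(uint32_t addr, uint32_t value)
{
    printf("write 0x%08X -> PCI register 0x%07X\n",
           (unsigned)value, (unsigned)addr);
}

int main(void)
{
    /* Example: add 10.0.0.5 (0x0A000005) to the privileged-IP list. */
    natm_reg_write(PRIV_IP_UPDATE, 0x0A000005u);

    /* Example: load the shaper timer register with an arbitrary value. */
    natm_reg_write(SHAPER_TIMER_UPDATE, 0x1000u);
    return 0;
}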

9.3 Software features

The User Interface is a bash shell script together with a 'C' program for writing into the PCI registers. It has the following features:

1. Help: using the help feature (-h), the user can understand the usage of the software and also the serial numbers of the various protocols.
2. Data edit: The software provides various editing options to the user. After the proper table to be edited is selected, the software invokes the vi editor for editing the file. The user then saves the file, and the software makes the corresponding changes in the hardware as well.
3. Data format: Once the user edits the table, the software converts the data into the proper format (suitable for decoding by the hardware) and sends the commands.
4. Self-initialization: The initial values of all the parameters are set to the default values configured in the hardware.

9.4 Data formats of the configuration registers

The data formats for the configuration registers are presented below.

9.4.1 SCHEDULER UPDATE

• SCHEDULER UPDATE[31:27] = Queue number
• SCHEDULER UPDATE[5:0] = Weight

9.4.2 PRIV IP UPDATE

• PRIV IP UPDATE[31:0] = IP address which should be added to the list

9.4.3 PC WEIGHT UPDATE

• PC WEIGHT UPDATE[31:28] = Weight to be assigned
• PC WEIGHT UPDATE[5:0] = Protocol Number


9.4.4 SHAPER TABLE UPDATE

• SHAPER TABLE UPDATE[31:27] = Queue Number
• SHAPER TABLE UPDATE[15:0] = Token update value

9.4.5 SHAPER TIMER UPDATE

• SHAPER TIMER UPDATE[15:0] = Timer Value

Note: The bit positions which have not been assigned any significance should be left at '0' to allow for future modifications.
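To make the bit layouts above concrete, the following C sketch packs the 32-bit configuration words in software before they are written to the corresponding PCI registers. The field positions come from Sections 9.4.1 to 9.4.5; the function names and the example values in main() are illustrative assumptions only, and unassigned bits are left at '0' as required by the note above.

#include <stdint.h>
#include <stdio.h>

/* SCHEDULER UPDATE: [31:27] = queue number, [5:0] = weight. */
static uint32_t pack_scheduler_update(uint32_t queue_num, uint32_t weight)
{
    return ((queue_num & 0x1Fu) << 27) | (weight & 0x3Fu);
}

/* PC WEIGHT UPDATE: [31:28] = weight, [5:0] = protocol number. */
static uint32_t pack_pc_weight_update(uint32_t weight, uint32_t protocol_num)
{
    return ((weight & 0xFu) << 28) | (protocol_num & 0x3Fu);
}

/* SHAPER TABLE UPDATE: [31:27] = queue number, [15:0] = token update value. */
static uint32_t pack_shaper_table_update(uint32_t queue_num, uint32_t tokens)
{
    return ((queue_num & 0x1Fu) << 27) | (tokens & 0xFFFFu);
}

int main(void)
{
    /* Example: queue 5 gets weight 20 in the scheduler. */
    printf("SCHEDULER UPDATE word: 0x%08X\n",
           (unsigned)pack_scheduler_update(5, 20));
    /* Example: protocol 3 gets weight 7 in the packet classifier. */
    printf("PC WEIGHT UPDATE word: 0x%08X\n",
           (unsigned)pack_pc_weight_update(7, 3));
    /* Example: queue 5 receives 1500 tokens per timer tick in the shaper. */
    printf("SHAPER TABLE UPDATE word: 0x%08X\n",
           (unsigned)pack_shaper_table_update(5, 1500));
    return 0;
}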


Chapter 10

Testing, Verification and Results

The implemented design was tested and verified at different stages of the design hierarchy and was ascertained to function as intended.

10.1 Functional Verification

Functional verification was done at two stages in the design. They are:

1. Module level: After each block was coded, a testbench was developed to test the block for its functionality. The testbenches were designed with processes that compared the module outputs against the expected values and flagged any errors that were detected. Errors found were debugged at this level itself.
2. System level: After module-level testing, the module was integrated with the other blocks and system-level testing was done using automated testbenches. It is necessary to test the outputs of every block whenever a new block is added to the design. Any errors detected were fixed in the module and re-tested at both levels.

10.2 Timing simulation

Once the functional verification was over, the design was synthesized, translated, mapped to hardware blocks, placed and routed. Proper constraints had to be specified to meet the timing requirements. After this, the post-place-and-route model was simulated and tested for timing errors. This verification was conducted at the module and system levels. After the timing simulation was done and the errors were fixed, the bit file was generated to be implemented on hardware.

10.3 Hardware Testing

Once all the levels of simulation were over, the design was programmed onto the FPGA. For hardware testing, Xilinx Chipscope was used and the waveforms were observed. Any errors


found at this level were rectified in the design. After the hardware testing was done, application-level testing was done to verify the specifications of the product.

10.4 Tools Used

1. Xilinx ISE 10.1.03: Used for design entry, synthesis, implementation, timing model generation and bitfile generation
2. Xilinx Impact 10.1: Used to program the NetFPGA through the JTAG port
3. NF2 download (developed by the NetFPGA community): Used to program the NetFPGA through the PCI interface
4. ModelSim 6.4c: Used for functional and timing simulations
5. Xilinx Chipscope 10.1: Used for hardware co-simulation through the JTAG port

10.5 Resource Utilization

The amount of resources utilized by the design is extracted from the synthesis report generated by Xilinx ISE. The FPGA used is a Virtex-II Pro, FF1152 package, speed grade -7. The details are given in the following table.

Table 10.1: Resource utilization of the design

Logic Utilization                                    Used     Available   Utilization
Number of Slice Flip Flops                           12,553   47,232      26%
Number of 4 input LUTs                               13,283   47,232      28%

Logic Distribution
Number of occupied Slices                            11,086   23,616      46%
Number of Slices containing only related logic       11,086   11,086      100%
Number of Slices containing unrelated logic          0        11,086      0%
Total Number of 4 input LUTs                         14,141   47,232      29%
  Number used as logic                               12,730
  Number used as a route-thru                        858
  Number used for Dual Port RAMs                     288
  Number used as Shift registers                     265
Number of bonded IOBs                                136      692         19%
  Number of bonded IOB Flip Flops                    198
  IOB Master Pads                                    5
  IOB Slave Pads                                     5
Number of RAMB16s                                    149      232         64%
Number of BUFGMUXs                                   11       16          68%
Number of DCMs                                       6        8           75%
Number of BSCANs                                     1        1           100%
Number of RPM macros                                 29

10.6 Test Setup

The device was tested with a server and two hosts, as shown in Figure 10.1. The hosts carried different traffic classes and the QoS actions were verified.

Figure 10.1: Test Setup


Chapter 11

Industrial Design for the Project

The project aims at developing an IP core; hence the full product cannot be made within its current scope. However, we propose an Industrial Design for its enclosure. The Industrial Design for the project is a suggestive enclosure for the device which can be used for the complete product. The points of consideration while designing the enclosure are:

• The surface finish should be pleasing to the eye and to the touch.
• Enough ventilation has to be provided for heat dissipation.
• Holes should be provided for the LEDs.
• The device should have flexibility in its mounting positions.

The proposed design was modelled using SolidWorks and rendered. The rendering of the device from the front is shown in Fig 11.1. The key points to observe are:

• The smooth tapering finish of the front gives a good visual appeal.
• Heat sinks are provided at the top in the form of slots. These slots have filleted edges for ease of manufacture.
• The LEDs have a star-shaped look that increases their visibility while making them more attractive.
• The left and right sides are curved, again for visual appeal.
• The mounting positions are table top-horizontal (as shown in Fig 11.1), table top-vertical and wall mountable. For mounting in the latter two positions, a handle-like structure is provided at the rear of the device; this is clearly visible in Fig 11.2. The handle opens to a maximum of 180° and therefore acts as a stand for the table top-vertical position. The same handle in its 90° position can be used as a structure for mounting on a wall.
• There is a rubber grip at the hind part of the device that provides a pleasing touch and also an anti-slip grip for holding the device.


Figure 11.1: The Front-Top-Left Perspective rendering of the device

Figure 11.2: The Back-Top-Right Perspective rendering of the device


Chapter 12

Future Scope and Conclusion

12.1 Future Scope

• An attempt has been made in this project to identify and classify network traffic by parsing packets and processing header information up to and including the Layer 4 headers.
• This means of classification is fairly good and helps to identify various classes of traffic nearly accurately. However, the classification can be further improved if packet parsing and header processing proceed beyond the Layer 4 headers.
• Hence, a very strong additional feature for the project would be the inclusion of HTTP header lookup and analysis. The design would then need pattern-matching units that try to match the Content-Type field in the HTTP header against the various possible values that identify several classes of applications [3].
• If HTTP header search and lookup is incorporated into the design, there arises a need for flow maintenance. The HTTP header is present in only a few of the TCP packets (typically only in the connection-time packets), while the rest of the TCP packets carrying HTTP payload only have fragments of the HTTP payload with no headers. Hence, to identify such packets as belonging to an earlier flow, flow maintenance is required. The design should therefore maintain flows, and one method for this purpose is hash-based flow checks.
• In hash-based flow checks, the hash of a certain 4-tuple or 5-tuple is calculated. This n-tuple is formed out of various fields in the packet headers. If the hash turns out to be an already existing one, the database of which is maintained locally in an aging-enabled buffer, then the current packet belongs to an earlier flow. If the hash does not match any of the hashes in the database, the packet is identified as belonging to a new flow, and a record of this new flow is added to the database. (A software sketch of this idea is given after this list.)
• One limiting point in the current design could be the lack of hardware resources such as buffer space and memory space. On the NetFPGA board currently used, the buffer space and DDR2 memory size are limited. Hence, if packet drop scenarios are encountered, they could very well be due to lack of FIFO space or lack of DDR2 memory space. If the design is ported onto a board with a larger memory and an FPGA offering more buffer space, it will function even better.


• In the current design, a large number of transport layer protocols are supported. This number could be further enhanced by identifying more crucial and important protocols that will be of significance in the future.
• 32 queues of packets are maintained in the design, with these queues collectively forming 3 queue groups. As a future enhancement, more queue groups could be formed by proper demarcation amongst the queues. More queues could also be supported if the memory size increases. If more groups are supported, group shaping could be added to the design as well.
• Currently, the design statically allocates a memory chunk to each individual queue. As an additional feature, dynamic allocation of additional memory to the queues could be performed in future: a chunk of memory could be taken out of the quota that was statically allocated to a lower-priority queue and assigned to the quota of a higher-priority queue.
• The overall throughput of the design can certainly be improved if a DDR2 module operating at a higher clock frequency can be obtained. The market currently offers DDR2 modules with speeds up to 1033 MHz, while the DDR2 module on the NetFPGA board runs at a maximum speed of only 200 MHz.
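The hash-based flow check described above can be sketched in software as follows. This is only an illustration of the idea, not part of the implemented design: the table size, the ageing policy, the FNV-style hash and all names are assumptions, and hash collisions are handled simply by overwriting the slot.

#include <stdint.h>

#define FLOW_TABLE_SIZE 1024   /* power of two, so the hash can be masked   */
#define FLOW_MAX_AGE    256    /* entries older than this are treated stale */

struct flow_key {              /* classic 5-tuple taken from the headers */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

struct flow_entry {
    struct flow_key key;
    uint32_t age;              /* incremented periodically, reset on a hit */
    int      valid;
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

/* FNV-1a style hash over the 5-tuple fields (any reasonable hash would do). */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;
    uint32_t words[4] = { k->src_ip, k->dst_ip,
                          ((uint32_t)k->src_port << 16) | k->dst_port,
                          k->protocol };
    for (int i = 0; i < 4; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    return h & (FLOW_TABLE_SIZE - 1u);
}

static int same_key(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->protocol == b->protocol;
}

/* Returns 1 if the packet belongs to an already known flow, 0 if a new flow
 * record was created (stale or empty slots are simply overwritten here). */
static int flow_lookup_or_insert(const struct flow_key *k)
{
    struct flow_entry *e = &flow_table[flow_hash(k)];

    if (e->valid && e->age < FLOW_MAX_AGE && same_key(&e->key, k)) {
        e->age = 0;            /* refresh the ageing counter: existing flow */
        return 1;
    }
    e->key   = *k;             /* start a new flow record in this slot */
    e->age   = 0;
    e->valid = 1;
    return 0;
}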

12.2 Conclusion

The Network Traffic Manager has been designed with as ingenious an architecture as possible, and the architecture is scalable with respect to the hardware resources. The NTM has been implemented on the NetFPGA board, which has a Virtex-II Pro FPGA on it. The design monitors the downstream traffic and provides QoS guarantees to the various classes of users of the network as well as to the various classes of applications run by the users. The future enhancements suggested above, if implemented, will result in an even better product.


Appendix A

Logic Cores

A.1 Rx Queue

A.1.1 Functionality

The functionality implemented by this block is:

1. Accept frames from the Tri-Mode Ethernet MAC (logic core) interface. The frames are received at the output of the TEMAC at the rate of one byte per receive clock (125 MHz).
2. Store the frames until the Packet Accumulator picks them up. The size of the buffer needs to be determined based on the worst-case rate at which the Packet Accumulator picks up the data.
3. Pass only the good frames downstream to the Packet Accumulator and discard the bad frames. As soon as a full good frame is available to be read out, its availability must be signalled to the Packet Accumulator so that it can pick up that frame. Bad frames must be dropped internally. There must be a feature to pause the transfer of a frame if the Packet Accumulator runs low on its internal buffer space; the transfer should resume once the Packet Accumulator's internal buffer starts clearing up. This involves handshaking control signals from the Packet Accumulator side indicating its willingness to accept the data.
4. Manage the clock domain crossing of data and control signals. The Rx clock and core clock are both 125 MHz clocks, but they are not in phase, so data crossing from one clock domain to the other has to be handled. In addition to the data, all the TEMAC signals needed on the core-clock side must also undergo clock domain crossing. Possible loss of synchronization of these signals by one cycle due to meta-stability must be factored into the design. The data crossover must not involve delays or handshaking because of the cycle-by-cycle data arrival.


Figure A.1: Interface diagram of the RxQueue

A.1.2 Interface Signals

The interface in a block schematic view is shown in Figure A.1. Details of the interface signals are as follows:

1. mac rx data[7:0]: Byte-wide data input from the tri-mode Ethernet MAC logic core. During frame reception in 1 Gbps mode, one byte of data is active for a single clock cycle.
2. mac rx dvld: The data valid signal for mac rx data[7:0]. Goes active high when frame reception starts and remains high till the end of the frame. When this signal is high, mac rx data[7:0] contains the received frame data.
3. mac rx goodframe: This signal goes high one clock cycle after the end of a good frame reception (mac rx dvld going low) and stays high for one clock cycle. If a bad frame is received, this signal remains at zero.
4. mac rx badframe: This signal goes high one clock cycle after the end of a bad frame reception (mac rx dvld going low) and stays high for one clock cycle. If a good frame is received, this signal remains at zero.
5. out ready: When this signal is high, the downstream unit (the Packet Accumulator) is ready to accept data from the Rx Queue unit. This signal going low is a pause request from the Packet Accumulator to the Rx Queue. Data will be output from the Rx Queue for one more clock cycle after this signal goes low.
6. data out[63:0]: The data of good frames is made available to the input stream engine through this interface. This data is valid only when the out wr en signal is high. The data changes cycle by cycle during a typical transfer.


7. control out[7:0]: This is the control bus, identifying the frame boundary. The bus is a one-hot encoding indicating the byte boundary of the frame ending within the current data out[63:0]. All bits of control out are low (logic zero) everywhere except at frame boundaries. (A small software model of this convention is given after this list.)
8. out wr en: Active-high data valid signal for data out[63:0] and control out[7:0]. Both data out[63:0] and control out[7:0] must be accepted while out wr en is high.
9. reset: Active-high reset signal synchronous with the core clock. It internally crosses a clock domain through a double synchronizer, hence the reset must be held active high for more than 2 clock cycles to be effective. All of the design (except the logic cores) uses a synchronous reset methodology.
10. rx clk: The MAC rx clock.
11. clk: Core clock.
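The one-hot frame-boundary convention on control out[7:0] can be modelled in software as below. This is an illustration only, not code from the design; in particular, the assumption that bit i of the control byte marks byte lane i of data out[63:0] as the last byte of the frame may differ from the actual lane ordering.

#include <stdint.h>
#include <stdio.h>

/* Returns the byte lane (0..7) on which the frame ends, or -1 if this 64-bit
 * word carries no frame boundary (control byte is all zero). */
static int frame_end_lane(uint8_t control)
{
    for (int i = 0; i < 8; i++)
        if (control & (1u << i))
            return i;
    return -1;
}

int main(void)
{
    printf("%d\n", frame_end_lane(0x00)); /* -1: no boundary in this word  */
    printf("%d\n", frame_end_lane(0x20)); /*  5: frame ends on byte lane 5 */
    return 0;
}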

A.2 Tx Queue

A.2.1 Functionality

Figure A.2: Interface diagram of Tx Queue

The functionality implemented by this block is:

• Pass frames to the Tri-Mode Ethernet MAC (logic core) interface. The frames are transmitted to the input of the TEMAC at the rate of one byte per transmit clock, tx clk.
• Store the frames passed from the Egress Packet Accumulator till the Tri-Mode Ethernet MAC picks up the data.


• Manage the clock domain crossing of data and control signals. The tx clock and core clock are both 125 MHz clocks but they are not in phase, so data crossing from one clock domain to the other has to be handled. The data crossover must not involve delays or handshaking because of the cycle-by-cycle data arrival.
• Conform to the interface timings of both the MAC and the Transfer Engine.

A.2.2 Interface

The interface in a block schematic view is shown in Figure A.2. Details of the interface signals are as follows.

1. mac tx data[7:0]: Byte-wide data output to the tri-mode Ethernet MAC logic core. During frame transmission in 1 Gbps mode, one byte of data is active for a single clock cycle.
2. mac tx dvld: The data valid signal for mac tx data[7:0]. Goes active high when frame transmission starts and remains high till the end of the frame. When this signal is high, mac tx data[7:0] contains the transmitted frame data. This signal is synchronous with the tx clock.
3. mac rx ack: Once a data transfer request is made with mac tx dvld going high, this signal goes high for one clock cycle if the TEMAC is ready to transmit a frame. The mac tx data and mac tx dvld must not be changed until transmission starts, as indicated by mac rx ack going high. After mac rx ack is received, cycle-by-cycle data change-over happens.
4. in ready: When this signal is high, it indicates the availability of Tx Queue buffer space. It should remain high during normal operation.
5. data in[63:0]: The data to be transmitted, passed from the transfer engine. This data is valid only when the wr in signal is high. The data changes cycle by cycle during a typical transfer.
6. control in[7:0]: This is the control bus, identifying the frame boundary. The bus is a one-hot encoding indicating the byte boundary of the frame ending within the current data in[63:0]. All bits of control in are low (logic zero) everywhere except at frame boundaries.
7. wr in: Active-high data valid signal for data in[63:0] and control in[7:0]. Both data in[63:0] and control in[7:0] must be accepted while wr in is high.
8. reset: Active-high reset signal synchronous with the core clock. It internally crosses a clock domain through a double synchronizer, hence the reset must be held active high for more than 2 clock cycles to be effective.
9. tx clk: The MAC tx clock.
10. clk: Core clock.

A.3 RGMII I/O Unit

This unit follows the recommendations in Xilinx application notes [18] and [19]. Refer to these for the circuit structure, timing and control of this GMII-to-RGMII protocol translation unit. It was used to reduce the clocking resources so as to support all four Ethernet ports [18].

A.4 CAM

CAM (Content Addressable Memory) aids in faster comparison or searching of a keyword among a set of words. The set of words is written into the CAM while the keyword to be searched is written into a search register. The CAM does the comparison in one clock cycle and outputs the line number / line numbers where the search-word matched. Compared to a linear search, this is a faster option. In the design, the CAM will be used in two places. Firstly, a CAM will be used to match the destination IP address in a packet with a list of privileged IP addresses. Secondly, a CAM will be used to match the TCP/UDP Source port number with a list of known port numbers as a step towards the generation of queue number for a packet. Some details about the CAM are mentioned in the paragraphs below. A CAM implemented with SelectRAM primitives has a single clock-cycle latency on its read operation, and two clock-cycle latency on its write operation [17].

A.4.1 CAM Core Signals

• CLK (Clock): The CAM module is fully synchronous with the rising edge of the clock input. All input pins have setup time referenced to the CLK signal. All output ports have clock-to-out times referenced to the CLK signal.
• EN (Enable): When active, the optional enable signal allows the CAM to execute write and read operations. If the enable is inactive during normal operation of the core, the output pins hold their previous state and all internal states freeze. Any new input signal is ignored until the enable is driven active, at which time the CAM resumes all of its halted operations.
• DIN[n:0] (Data In Bus): The DIN bus provides the data to be written into or read from the CAM core, depending on the operation. If the simultaneous read/write option is selected, this bus is used only for the write operation, and the CMP DIN bus is used exclusively for the read operation.
• CMP DIN[n:0] (Compare Data In Bus): When the simultaneous read/write option is selected, this optional input bus provides the data for the read operation of the CAM. When the simultaneous read/write option is not selected, this bus is not available.
• WE (Write Enable): The optional write enable signal allows data on the DIN bus to be written into the CAM. When this signal is asserted, the contents of the DIN bus are written into the location selected by the write address bus WR ADDR. This signal is not present if the read-only CAM option is selected. This signal is optional when the CAM initialization option is selected.


• WR ADDR[m:0] (Write Address Bus): The optional write address bus determines the memory location to be written during the CAM's write operation. This bus is not present if the read-only CAM option is selected. This bus is optional when the CAM initialization option is selected.
• BUSY (Busy): The busy signal indicates that a write operation is currently being executed. It remains asserted until the multi-clock-cycle write operation is completed. The user cannot start a new write operation while this signal is active.
• MATCH ADDR[j:0] (Match Address Bus): This output bus indicates the address that matches the contents of the DIN bus, or the CMP DIN bus if the simultaneous read/write option is selected. The match address can be encoded (binary), single-match un-encoded (one-hot), or multiple-match un-encoded.
• MATCH (Match): The match signal is asserted for one clock cycle when data on the DIN bus matches data in one or more locations in the CAM. If the simultaneous read/write option is selected, data on the CMP DIN bus is used to search for a match instead of the DIN bus.
• READ WARNING (Read Warning): The optional read warning signal is asserted when data for the write in progress in the CAM is the same as data for a read initiated on the CAM. Since write operations take multiple cycles, writes performed prior to reads may not have completed when the read is executed. READ WARNING is asserted to let the user know that the match address and match signals do not reflect the results of the most recent write operation being executed.

Figure A.3: CAM Schematic Symbol


Figure A.4: CAM Read Operation

A.4.2 Read Operation

Figure A.4 shows three consecutive read operations of a Block SelectRAM memory CAM, with the second operation not having a match. Three of the possible configurations of the MATCH ADDR and MATCH signals are displayed. By default, the Block SelectRAM memory CAM has a single-clock read latency. However, the user can add an extra clock cycle to the read latency by selecting the Register Outputs option in the CORE Generator GUI. The CAM we implemented has unregistered outputs. New data written into the CAM is available to be read on the second rising edge of the clock after the write operation begins.

A.4.3 Write Operation

Figure A.5 shows three consecutive write operations of a Block SelectRAM memory CAM. The figure also shows when the new data is available to be read by a read operation. The Block SelectRAM memory CAM has a two-clock-cycle write latency; when executing consecutive write operations, each write operation must be two clock cycles apart. The design uses the 'READ WARNING' option. The Read Warning signal indicates that the data applied to the CAM for a read operation matches the data that is currently being written into the CAM by the unfinished write operation. This ensures that a CAM look-up can be completed in one cycle, irrespective of whether that content is still being written or is already present in the CAM.


Figure A.5: CAM Write Operation

A.4.4 Specifying CAM Contents

The initial contents of the CAM can be specified in a .coe file. In our design, the CAM is initialized with 0, and a location with content 0 indicates that the location is empty. To make an entry into the CAM, a search for '0' is done and, if found, the corresponding location is written with the desired data.
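The insertion scheme described above can be modelled in software as follows. This is an illustration only, with an assumed depth of 32: a location holding 0 is treated as empty, so a new entry is written to the first location whose content matches 0. In the real design the search itself is done by the CAM hardware in a single cycle rather than by a loop.

#include <stdint.h>

#define CAM_DEPTH 32

static uint32_t cam[CAM_DEPTH];   /* initialised to 0, i.e. all locations empty */

/* Returns the first matching line number, or -1 when there is no match. */
static int cam_search(uint32_t key)
{
    for (int line = 0; line < CAM_DEPTH; line++)
        if (cam[line] == key)
            return line;
    return -1;
}

/* Writes 'data' into the first empty (zero) location; returns the line
 * written, or -1 if the CAM is full. */
static int cam_insert(uint32_t data)
{
    int line = cam_search(0);
    if (line >= 0)
        cam[line] = data;
    return line;
}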

A.5 TEMAC (Tri-mode Ethernet MAC)

This is one of the logic cores used in the design. It is generated using the Xilinx CORE Generator tool. This block is placed at the input and the output interface of the overall design and is hence instantiated twice. It interfaces to the RGMII unit and the Tx Queue-Rx Queue pair. Some details about this block are presented in the paragraphs below.

Figure A.6 shows a typical frame transmission at 1 Gbps. The client asserts clientemactxdvld and puts the first byte of frame data on the clientemactxd bus. The client then waits until the TEMAC asserts emacclienttxack before sending the rest of the data. At the end of the frame, clientemactxdvld is de-asserted. At 1 Gbps, clientemactxenable should be set high.

Figure A.7 shows the reception of a good frame at the client interface at 1 Gbps. The core asserts emacclientdvld for the duration of the frame data. At the end of the frame, the emacclientrxgoodframe signal is asserted to indicate that the frame passed all error checks. The receiver output is only valid when the clientemacrxenable input is high; at 1 Gbps, clientemacrxenable should be set high. All receive client logic should also be enabled using the clientemacrxenable signal.


Figure A.6: Normal Transmission at 1 Gbps

Figure A.7: Normal Frame Reception at 1Gbps


Bibliography

[1] "Efficient Fair Queuing using Deficit Round Robin", M. Shreedhar (Microsoft Corp.) and George Varghese (Washington University in St. Louis), SIGCOMM '95, Cambridge, MA, USA.
[2] "Packet Classification Using Extended TCAMs", Ed Spitznagel, David Taylor, Jonathan Turner (Applied Research Laboratory, Washington University, St. Louis), Proceedings of the 11th IEEE International Conference on Network Protocols (ICNP '03).
[3] "Large-Scale Wire-Speed Packet Classification on FPGAs", Weirong Jiang and Viktor K. Prasanna (Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles), FPGA '09, Feb 22-24, 2009, Monterey, California, USA.
[4] "DPPC-RE: TCAM-Based Distributed Parallel Packet Classification with Range Encoding", Kai Zheng, Hao Che, Zhijun Wang, Bin Liu, and Xin Zhang, IEEE Transactions on Computers, Vol. 55, No. 8, August 2006.
[5] http://www.bluecoat.com/doc/direct/7942
[6] "Computer Networks", 4th ed., Andrew S. Tanenbaum (Vrije Universiteit, Amsterdam), Prentice-Hall India.
[7] "End-to-End QoS Network Design", Cisco Press, Cisco Systems.
[8] "Internetworking with TCP/IP: Principles, Protocols and Architectures", 4th ed., Douglas E. Comer, Prentice-Hall India.
[9] "HTTP: The Definitive Guide", David Gourley, Brian Totty, Marjorie Sayer, Sailu Reddy, Anshu Aggarwal, O'Reilly Publications.
[10] "Engineering Internet QoS", Sanjay Jha and Mahbub Hassan, Artech House.


[11] "Campus LAN Design Guide - Design Considerations for the High-Performance Campus LAN", Juniper Networks.
[12] "End-to-End QoS Network Design", Tim Szigeti, Cisco.
[13] "Campus LAN Design Guide - Design Considerations for the High-Performance Campus LAN", Juniper Networks.
[14] "Internetworking Technologies Handbook", Chapter 49, "Quality of Service Networking", Cisco.
[15] "Cisco IP Telephony Network Design Guide", Cisco.
[16] "Non-intrusive TCP Connection Admission Control for Bandwidth Management of an Internet Access Link", Anurag Kumar, Malati Hegde, S. V. R. Anand, B. N. Bindu, Dinesh Thirumurthy, and Arzad A. Kherani, IEEE Communications Magazine, May 2000.
[17] "Internet QoS: Architectures and Mechanisms for Quality of Service", Zheng Wang, Morgan Kaufmann Publishers.
[18] "On Packet Switches with Infinite Storage", John B. Nagle, IEEE Transactions on Communications, 1987.
[19] "Analysis and Simulation of a Fair Queueing Algorithm", Alan Demers, Srinivasan Keshav, Scott Shenker, ACM, 1989.
[20] "Network Processor Design: Issues and Practices, Volume 1", Patrick Crowley, Mark Franklin, Haldun Hadimioglu, Peter Z. Onufryk, Morgan Kaufmann Publishers.
[21] "Internet QoS: Architectures and Mechanisms for Quality of Service", Zheng Wang, Morgan Kaufmann Publishers.
[22] "Engineering Internet QoS", Sanjay Jha and Mahbub Hassan, Artech House, London, 2002.
[23] Cisco QoS, www.cisco.com
[24] "HTTP: The Definitive Guide", O'Reilly Publications.


[25] "Computer Networks", Andrew Tanenbaum, Prentice Hall India.
[26] "Internetworking with TCP/IP: Principles, Protocols and Architecture", Douglas E. Comer, 4th ed., Prentice Hall.
[27] NetFPGA board manuals, www.netfpga.org
[28] http://support.microsoft.com/kb/927847 - For Windows Live Messenger
[29] http://support.microsoft.com/kb/189416 - For Windows Media Services
[30] http://www.videolan.org/doc/videolan-howto/en/ch09.html - For VLC Media Player
[31] http://help.yahoo.com/l/uk/yahoo/messenger/messenger10/messenger/mstafireconfig.html - For Yahoo! Messenger
[32] http://www.aim.com/help_faq/common_problems.adp - For AOL Instant Messenger
[33] Xilinx Application Note XAPP549 - DDR2 SDRAM Memory Interface for Virtex-II Pro FPGA.
[34] Xilinx Application Note XAPP688 - Creating High-Speed Memory Interfaces with Virtex-II and Virtex-II Pro FPGAs.
[35] Micron Data Sheet MT47H16M16FG-37E, www.download.micron.com
[36] Xilinx IP Release Note Guide, XTP025 - Xilinx CORE Generator.
[37] Xilinx Virtex-II Pro Datasheet, DS083, www.xilinx.com
[38] Xilinx DS253, Content Addressable Memory datasheet.

