Logical Path Delay Distribution And Transistor Sizing

A. Kabbani, D. Al-Khalili+, and A. J. Al-Khalili*
Dept. of Electrical and Computer Engineering, Ryerson University, Toronto, CANADA
+ Dept. of Electrical and Computer Engineering, Royal Military College of Canada, Kingston, CANADA
* Dept. of Electrical and Computer Engineering, Concordia University, Montreal, CANADA

Abstract -- The merits of high performance design are high speed, low power consumption, and small silicon area. Area optimization can be achieved at different levels of design abstraction. In this paper, an area-delay optimization technique based on library-free synthesis and transistor sizing is presented. The technique can be used either to optimize the path delay or to minimize the path area for a given required time. It builds on the CMOS inverter delay model, the Modified Logical Effort (MLE) model [1], and the CMOS gate transition time model [2]. The proposed technique achieves better performance than Synopsys's Design Compiler: for a given required time, it reduces the area-delay product by about 50% on average.

I. INTRODUCTION

Standard cell libraries appeared almost three decades ago to facilitate digital system design. They provide well designed and characterized cells that represent the most common functions used in digital designs. These may be as simple as basic logic functions, or more complicated, such as registers and counters. The library cells come with different driving strengths to optimize the design speed, area, power, or, as is normally the case, a combination of these. Since the library cells may be repeated thousands of times during the digital design process, their quality determines the final product performance [3]. The number of driving strengths available for each cell also has a crucial impact on the design performance. For instance, a design implemented with a single-driving-strength library performs up to 27% worse than the same design implemented with a library that offers three levels of driving strength [4]. Hence, there is a need for multiple libraries for each technology process, which is impractical. The situation is exacerbated when a diversity of libraries is needed from different suppliers, each with its own tools and documentation. As a result, the virtual library concept has emerged as a solution to this problem [5].

Virtual library, or library-free, mapping means mapping the design's Boolean functions directly to the transistor level instead of using pre-characterized cells. Usually, the Boolean functions in this mapping technique are realized using Static CMOS Complex Gates (SCCGs) [5]. The number of SCCGs (Boolean functions) in a virtual library is determined by the allowed number of serially connected transistors.
It has been reported [5] that the total number of transistors in a circuit can be reduced by up to 30% when library-free mapping with s(4,4) is used instead of simple gate library mapping (a simple gate library has three cells: an inverter, a 2-input NAND gate, and a 2-input NOR gate). The same comparison shows that the maximum number of transistors a signal must traverse from a primary input to a primary output may be reduced by up to 20%. The first and second results translate into area and delay improvements respectively.

0-7803-8935-2/05/$20.00 ©2005 IEEE.

Fig. 1: A two-stage logical path. (Gate 1 and Gate 2 are annotated with their input capacitance, logical effort, and parasitic delay, C1, g1, p1 and C2, g2, p2; the path drives the load CL.)

II. THE PROPOSED TECHNIQUE

The delay of a logical path is the sum of the delays of its individual gates. Using the Modified Logical Effort technique [1], the delay of the two-stage path shown in Fig. 1, measured in units of τ, is

$$D = (g_1 h_1 + \vartheta_1) M_{f1} + (g_2 h_2 + \vartheta_2) M_{f2} \tag{1}$$

where $h_1 = C_2 / C_1$, $h_2 = C_L / C_2$, $C_L$ is the path load capacitance, $M_{f1}$ and $M_{f2}$ are the modification factors for gate 1 and gate 2 respectively, and $\vartheta_i = r_i + q_i + p_i$, where $r_i$, $q_i$, and $p_i$ are the gate's normalized ramp, internodal, and parasitic delays respectively. To write the delay equation in terms of device sizes, all input capacitances are expressed as functions of the transistor widths, i.e., $C_i = \alpha W_i$. This assumes that all the devices have the same length. Expressing the electrical efforts $h_1$ and $h_2$ in terms of the transistor widths gives

$$h_1 = W_2 / W_1, \qquad h_2 = C_L / (\alpha W_2) \tag{2}$$

Substituting (2) into (1) yields

$$D = \left( g_1 \frac{W_2}{W_1} + \vartheta_1 \right) M_{f1} + \left( g_2 \frac{C_L}{\alpha W_2} + \vartheta_2 \right) M_{f2} \tag{3}$$

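Taking the derivative of (3) with respect to $W_2$ and setting it to zero gives

$$\frac{\partial D}{\partial W_2} = \frac{g_1}{W_1} M_{f1} - \frac{g_2 C_L}{\alpha W_2^2} M_{f2} = 0 \tag{4}$$

Multiplying (4) by $W_2$ and using the electrical efforts defined in (2) gives

$$g_1 h_1 M_{f1} = g_2 h_2 M_{f2} \tag{5}$$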

Thus, to minimize the delay between input and output while minimizing the area, each stage should bear the same product of effort $f = gh$ and modification factor. This conclusion can be generalized to a path with N gates [6], and (5) may be written as

$$g_i h_i M_{fi} = g_j h_j M_{fj}, \qquad \forall (i, j) \in 1 \ldots N \tag{6}$$
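As a numerical check of (3) to (5), the following minimal Python sketch evaluates the two-stage delay and the width W2 obtained by solving (4) in closed form. The function names and all numeric values (g, ϑ, M_f, α, C_L, W_1) are illustrative assumptions and are not data from the paper.

```python
import math

def two_stage_delay(w2, w1, c_load, alpha, g, theta, m_f):
    """Delay of the two-stage path, eq. (3), in units of tau."""
    h1 = w2 / w1                  # electrical effort of gate 1, eq. (2)
    h2 = c_load / (alpha * w2)    # electrical effort of gate 2, eq. (2)
    return (g[0] * h1 + theta[0]) * m_f[0] + (g[1] * h2 + theta[1]) * m_f[1]

def optimal_w2(w1, c_load, alpha, g, m_f):
    """Width W2 that zeroes the derivative in eq. (4)."""
    return math.sqrt(g[1] * m_f[1] * c_load * w1 / (alpha * g[0] * m_f[0]))

# Hypothetical example values (assumptions for illustration only).
W1, C_LOAD, ALPHA = 1.0, 64.0, 1.0
G = (1.0, 4.0 / 3.0)     # logical efforts g1, g2
THETA = (1.0, 2.0)       # theta_i = r_i + q_i + p_i
M_F = (1.0, 1.2)         # modification factors M_f1, M_f2

w2_opt = optimal_w2(W1, C_LOAD, ALPHA, G, M_F)
stage1 = G[0] * (w2_opt / W1) * M_F[0]                 # g1 * h1 * M_f1
stage2 = G[1] * (C_LOAD / (ALPHA * w2_opt)) * M_F[1]   # g2 * h2 * M_f2
d_opt = two_stage_delay(w2_opt, W1, C_LOAD, ALPHA, G, THETA, M_F)
print(f"W2 = {w2_opt:.3f}, stage products = {stage1:.3f}, {stage2:.3f}, D = {d_opt:.3f}")
```

At the computed W2 the two stage products coincide, which is condition (5), and perturbing W2 in either direction increases D.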

Considering a chain of buffers where $g_i = g_j = 1$ and $M_{fi} = M_{fj} = 1$, equation (6) becomes $h_i = h_j$. This is compatible with the result introduced in [7] for minimizing the overall area of a chain of buffers between input and output.

For networks with loads off the logical path, as shown in Fig. 2, the branching effort b should be introduced. The branching effort b at the output of a logical cell can be defined as [6]

$$b = \frac{C_{on-path} + C_{off-path}}{C_{on-path}} \tag{7}$$

where $C_{on-path}$ is the load capacitance of a logical gate along the considered path, and $C_{off-path}$ is the load capacitance of the logical gate(s) off the path. The branching effort along an entire path, B, is the product of the branching efforts at each of the stages along the path:

$$B = \prod_{i=1}^{N} b_i \tag{8}$$

Fig. 2: A logical path with branching. (The path is built from gates labeled NAND_A, NAND_B, and NAND_C, where A, B, and C represent the gate sizes; the branching points are annotated b = 2/1 and b = 3/1.)

Thus, to generalize (6) taking the branching effort into consideration, the path electrical and branching efforts can be related to the electrical effort of each stage [6] as

$$h_1 h_2 \ldots h_N = BH \tag{9}$$

where H is the ratio of the load capacitance of the last stage to the input capacitance of the first stage:

$$H = C_L / C_{in} \tag{10}$$

The path logical effort G and the path modified factor M can be defined as

$$g_1 g_2 \ldots g_N = G \tag{11}$$

$$M = \prod_{i=1}^{N} M_{fi} \tag{12}$$

Multiplying (9), (11) and (12) together yields

$$(g_1 h_1 M_{f1})(g_2 h_2 M_{f2}) \ldots (g_N h_N M_{fN}) = GHMB = F \tag{13}$$

Under the condition presented in (6), the stage effort $\hat{f}$ will be equal for all stages:

$$\hat{f} = (GBMH)^{1/N} = F^{1/N} = g_i h_i M_{fi} \tag{14}$$

The path minimum delay $\hat{d}$ can be expressed as

$$\hat{d} = N (GBMH)^{1/N} + \vartheta_M \tag{15}$$

where

$$\vartheta_M = \sum_{i=1}^{N} M_{fi} (r_i + q_i + p_i) \tag{16}$$

Now consider the problem of finding the optimal gate sizes of a logical path that is characterized by a user specified required time $T_R$, the number of stages, the path input capacitance $C_{in}$, and the path load capacitance $C_L$. For the case where only inverters are used in the path, the solution for this problem is when the minimum delay equals exactly the required time [7]. The switching input of a CMOS gate can be roughly approximated as the input of the switching inverter. This switching inverter consists of the gate's switching NMOS and PMOS transistors. Hence, the proposed solution in [7] can be generalized to CMOS gates. Equating the minimum delay with the required time yields

$$T_R = N (GBMH)^{1/N} + \vartheta_M$$

The required electrical effort $\hat{h}_{iR}$ of each gate in the path can be calculated using (16). Re-arranging (15) and (16), and considering (14) yields

$$\hat{T}_R - \vartheta_M = N (GBMH_R)^{1/N} \tag{17}$$

$$\hat{h}_{iR} = (\hat{T}_R - \vartheta_M) / (N M_{fi} g_i) \tag{18}$$
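The path-level quantities in (7) to (18) can be sketched in a few lines of Python. The function names and the example numbers below are assumptions chosen for illustration; in practice the gate parameters g_i, M_fi, r_i, q_i, and p_i would come from [1] and [6].

```python
import math

def branching_effort(c_on_path, c_off_path):
    """Branching effort at a gate output, eq. (7)."""
    return (c_on_path + c_off_path) / c_on_path

def path_effort(g, m_f, b, c_load, c_in):
    """Path effort F = G*H*M*B, eqs. (8) to (13)."""
    return math.prod(g) * (c_load / c_in) * math.prod(m_f) * math.prod(b)

def theta_m(m_f, r, q, p):
    """Weighted sum of ramp, internodal, and parasitic delays, eq. (16)."""
    return sum(m * (ri + qi + pi) for m, ri, qi, pi in zip(m_f, r, q, p))

def min_path_delay(f_path, n, th_m):
    """Minimum path delay, eq. (15)."""
    return n * f_path ** (1.0 / n) + th_m

def required_electrical_efforts(t_req, th_m, g, m_f):
    """Required electrical effort of each gate, eq. (18)."""
    n = len(g)
    return [(t_req - th_m) / (n * m * gi) for gi, m in zip(g, m_f)]

# Hypothetical two-stage path without branching (illustrative values only).
g = [1.0, 4.0 / 3.0]
m_f = [1.0, 1.0]
b = [1.0, 1.0]
r, q, p = [0.0, 0.0], [0.0, 0.0], [1.0, 2.0]

F = path_effort(g, m_f, b, c_load=12.0, c_in=1.0)   # G*H*M*B = (4/3)*12
th = theta_m(m_f, r, q, p)
d_min = min_path_delay(F, len(g), th)
h_req = required_electrical_efforts(d_min, th, g, m_f)
print(F, th, d_min, h_req)
```

When the required time is set equal to the minimum delay, the product of the required electrical efforts equals BH, which is consistent with (9).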

III. DESIGN STEPS

In this section, a six-step process is presented for sizing a logical path to achieve a required time with minimized area.

1. Determine the input capacitance of the logical path and calculate H and F using (10) and (13) respectively, and the optimal number of gates N [6]. If N is greater than the actual number of gates, add buffers to the path to match N.
2. Calculate $p_i$ and $g_i$ [6], and $M_{fi}$ as shown in [1], for each gate.
3. Calculate $\hat{f}$ and roughly estimate the fanout of each gate using (8), (10), (11), (12) and (14). The calculation should start from the last gate and proceed towards the first gate (at the input). This allows a rough estimation of the transistor sizes for each gate, $C_i = (g_i C_{i+1}) / \hat{f}$, where $C_i$ is the input capacitance of the current gate and $C_{i+1}$ is the input capacitance of the next gate, the one closer to the path output. $q_i$ can also be calculated at this stage.
4. Determine the worst case expected transition time at the path input. Starting from the first gate, calculate the transition time at the output of each gate using the gate transition time models developed in [2] and [8].
5. Use the delay model developed in [1] to calculate the delay of each gate caused by the transition time. To do that, use the inverter delay model presented in [9] (calculate the delay of a standard inverter that has the same loading conditions as the considered gate).
6. Evaluate $\hat{h}_{iR}$ for each gate using (18). This allows the calculation of the input capacitance of each gate, $C_i = C_{i+1} / \hat{h}_{iR}$, and consequently the transistor widths.
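Steps 3 and 6 both walk the path from the output back to the input. The following minimal sketch shows that backward pass; the function names and example values are hypothetical, branching is ignored, and the gate models of steps 4 and 5 are not reproduced here.

```python
def caps_from_stage_effort(g, c_load, f_hat):
    """Step 3: rough input capacitances, C_i = g_i * C_{i+1} / f_hat."""
    caps, c_next = [], c_load
    for g_i in reversed(g):           # start at the gate driving the load
        c_next = g_i * c_next / f_hat
        caps.append(c_next)
    return caps[::-1]                 # reorder so the first gate comes first

def caps_from_required_effort(h_req, c_load):
    """Step 6: final input capacitances, C_i = C_{i+1} / h_iR."""
    caps, c_next = [], c_load
    for h_i in reversed(h_req):
        c_next = c_next / h_i
        caps.append(c_next)
    return caps[::-1]

def widths_from_caps(caps, alpha):
    """Invert C_i = alpha * W_i to obtain the transistor widths."""
    return [c / alpha for c in caps]

# Hypothetical two-gate path (illustrative values only).
rough = caps_from_stage_effort(g=[1.0, 4.0 / 3.0], c_load=12.0, f_hat=4.0)
final = caps_from_required_effort(h_req=[4.0, 3.0], c_load=12.0)
widths = widths_from_caps(final, alpha=2.0)
print(rough, final, widths)
```

Keeping the two passes as separate functions mirrors the procedure: the first pass only seeds the transition time and delay calculations, while the second produces the sizes that meet the required time.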

IV. MODEL VALIDATION

The developed technique has been validated using 0.18 μm TSMC technology. Three types of logical paths are considered. As shown in Fig. 3, these paths have different combinations of gates and different logic depths to account for the cases where the logic depth is less than, equal to, or greater than the optimal depth. These paths are synthesized using the Design Compiler (DC) from Synopsys. During synthesis, the DC changes the gate types and/or the logic depth of a logical path depending on the loading conditions and the design constraints, such as the timing constraints. Fig. 4 (a), (b) and (c) show logical path no. 1 when the loads are set to 20 fF, 150 fF, and 300 fF respectively and the DC is directed to provide the smallest delay regardless of the area. Throughout the validation process, the mapped logic is maintained as obtained from the DC. To estimate the performance of the DC, the mapped path netlists are extracted at the transistor level and Spectre is used to predict the delay. To determine the performance of Logical Effort (LE) and MLE, the transistor sizes of the extracted netlists are modified as dictated by LE and MLE respectively. This procedure is repeated for each path and for each load.

The validation results are summarized for the four scenarios in Table 1. Throughout the synthesis process, the DC was directed to achieve the smallest possible delay regardless of the area, as shown in the Synopsys row of Table 1. In order to have complete control over transistor sizing, the LE algorithm was used with the objective of achieving minimum delay; this scenario is referred to as LE min. delay. The best delay achieved by the DC is then taken as the required time. Accordingly, the transistor sizes in the logic path are determined using LE and MLE, which are identified as LE req. delay and MLE req. delay respectively. During the optimization using the LE technique, the effect of the transition time on the delay was considered to improve the model performance. It is worth mentioning that the area is calculated as the sum of the widths of the path's transistors.

Fig. 3: Path no. 1, path no. 2, and path no. 3.

Fig. 4: Logical path no. 1 as obtained for different design constraints and loading conditions, panels (a), (b), and (c).

The metrics presented in Table 1 are delay, area, area-delay product, and the ratio of the path input capacitance achieved by the first scenario (Cs) to the input capacitances achieved by the other three scenarios. The individual capacitances are defined as follows: CLE is the path input capacitance obtained by LE for the minimum possible delay; CLEr is the path input capacitance achieved by LE when it is used to produce the required time; CMLE is the path input capacitance obtained by MLE when it is used to attain the required time. This last metric indicates the loading that each of the last three scenarios places on the driving circuit, and hence the speed improvement.
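The area-delay product and its average improvement over the DC can be recomputed from the Table 1 entries. The sketch below is a hedged illustration: the function and variable names are assumptions, the numbers are transcribed from the delay, area, and area * delay rows of Table 1, and averaging the per-case relative reduction is an assumed way of forming the average, since the text does not state how it was computed.

```python
from statistics import mean

# Synopsys row of Table 1: nine cases (3 paths x loads of 20, 150, 300 fF).
SYN_DELAY_PS = [207, 323, 347, 292, 332, 361, 412, 376, 406]
SYN_AREA_UM = [81, 83, 109, 97, 143, 155, 43, 188, 200]

# Area * delay rows of Table 1, in the same case order.
ADP = {
    "Synopsys": [16767, 26809, 37823, 28324, 47476, 55955, 17716, 70688, 81200],
    "LE min.":  [13608, 28128, 38880, 11481, 44992, 69324, 21200, 54964, 92097],
    "LE req.":  [10200, 10092, 34755, 6381, 29302, 40135, 16146, 34316, 55554],
    "MLE req.": [7260, 8328, 33532, 6153, 19800, 34132, 10738, 31875, 44400],
}

def area_delay_product(delay_ps, area_um):
    """ADP as used in Table 1: path delay times summed transistor width."""
    return delay_ps * area_um

def average_adp_improvement(adp, reference):
    """Mean relative ADP reduction with respect to the reference scenario."""
    return mean(1.0 - a / ref for a, ref in zip(adp, reference))

improvement = {
    name: average_adp_improvement(values, ADP["Synopsys"])
    for name, values in ADP.items()
    if name != "Synopsys"
}
for name, value in improvement.items():
    print(f"{name}: {100 * value:.1f}% average ADP reduction")
```

Under these assumptions the sketch gives roughly 4 to 5%, 38%, and 50% for the LE min., LE req., and MLE req. scenarios, in line with the averages quoted in the discussion of the results.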

TABLE 1: Delay and area comparison between Synopsys, LE and MLE.

Path no.                     1         1         1         2         2         2         3         3         3
Cload [fF]                  20       150       300        20       150       300        20       150       300

Synopsys
  min. delay [ps]          207       323       347       292       332       361       412       376       406
  area [μm]                 81        83       109        97       143       155        43       188       200
  area * delay           16767     26809     37823     28324     47476     55955     17716     70688     81200

LE min. delay
  min. delay [ps]          189       293       324       267       296       318       400       364       379
  area [μm]                 72        96       120        43       152       218        53       151       243
  area * delay           13608     28128     38880     11481     44992     69324     21200     54964     92097
  Cs/CLE                  12/2    12/0.5    12/4.9    12/0.9  36.4/3.8  36.4/3.8    20/0.8      90/5      90/5

LE req. delay
  min. delay [ps]          200       348       331       301       322       349       414       373       394
  area [μm]                 51        29       105       221        91       115        39        92       141
  area * delay           10200     10092     34755      6381     29302     40135     16146     34316     55554
  Cs/CLEr               12/5.8      12/1    12/4.8    12/0.9  36.4/1.5  36.4/1.5    20/0.8      90/5      90/5

MLE req. delay
  min. delay [ps]          220       347       332       293       330       371       413       375       400
  area [μm]                 33        24       101        21        60        92        26        85       111
  area * delay            7260      8328     33532      6153     19800     34132     10738     31875     44400
  Cs/CMLE               12/3.1      12/1    12/4.8    12/0.8  36.4/1.5  36.4/1.5    20/0.8      90/5      90/5

Fig. 5 shows the area-delay products (ADP) produced by the DC, the LE model, and the MLE model. As shown in Table 1 and Fig. 5, the developed technique consistently achieves better performance than the Synopsys DC and the LE technique. LE is always able to produce less delay than the DC, though in many cases this comes at the expense of area. For the minimum delay scenario, the LE technique improves the ADP by about 4% on average. On the other hand, when the DC delay is used as the required time, both the LE and MLE sizing techniques reduce the design area. In a few cases, the delay also increases slightly. Nevertheless, these two techniques attain very good performance compared to the DC regarding the ADP: on average, the ADP improvement achieved by LE is around 38% and that achieved by MLE is around 50%. Furthermore, both the LE and MLE techniques reduce the path input capacitance by a factor of about 15 on average compared to that attained by the DC, as shown in Table 1. In conclusion, the developed optimization technique using the MLE model has provided better performance than the LE model and the DC. The significant area reduction allows more functions to be integrated on the same chip. Also, less area translates into less parasitic capacitance and consequently less power consumption.

V. CONCLUSION


In this paper, a technique has been proposed to optimize the area of a logical path for a given required time. Compared to Synopsys DC, this technique reduces the area-delay product by about 50% on average. Combining this technique with the previously developed transition time and delay models allows the characterization of virtual cells and assists library-free synthesis.

Fig. 5: Area-delay comparison between Synopsys DC, LE and MLE. (Three panels, Path # 1, Path # 2, and Path # 3, plot area x delay against Cload [fF] at 20, 150, and 300 fF for the Synopsys, LE min., LE req., and MLE req. scenarios.)

VI. REFERENCES

[1] A. Kabbani, D. Al-Khalili, and A. J. Al-Khalili, "Delay macro modeling of CMOS gates using modified logical effort technique," Proc. of the Int. Conf. on Sem. Elec., Dec. 2004.
[2] A. Kabbani, Timing driven IP block design methodology with emphasis on reusability, Royal Military College of Canada, 2004.
[3] J. Maxey, K. A. Wolf, J. Lewis, M. Lefebvre, and D. Pietromonaco, "PANEL: cell libraries build vs. buy; static vs. dynamic," Proc. of 36th IEEE DAC, pp. 341-342, 1999.
[4] K. Scott and K. Keutzer, "Improving cell libraries for synthesis," Proc. of IEEE CIC Conference, pp. 128-131, 1994.
[5] A. Reis, M. Robert, and R. Reis, "Topological parameters for library free technology mapping," IEEE Proc. of XI Brazilian Symp. on IC Design, pp. 213-216, 1998.
[6] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann Publishers, January 1999.
[7] P. Rezvani, A. H. Ajami, M. Pedram, and H. Savoj, "LEOPARD: A logical effort-based fanout optimization for area and delay," Proc. of IEEE Int. Conf. on CAD, pp. 516-518, 1999.
[8] A. Kabbani, D. Al-Khalili, and A. J. Al-Khalili, "Technology-portable analytical model for DSM CMOS inverter transition time estimation," IEEE Trans. on CAD of IC and Sys., vol. 22, no. 9, pp. 1177-1187, Sept. 2003.
[9] A. Kabbani, D. Al-Khalili, and A. J. Al-Khalili, "Technology portable delay model for DSM CMOS inverters," Proc. of the second IEEE NEWCAS, pp. 13-16, June 2004.

Leerprogramma lichtdiode - weerstand - diode - transistor - condensator.pdf. Leerprogramma lichtdiode - weerstand - diode - transistor - condensator.pdf. Open.