
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 35, NO. 4, APRIL 2000

Low-Power CMOS Digital Design with Dual Embedded Adaptive Power Supplies

Tadahiro Kuroda and Mototsugu Hamada

Abstract—A low-power CMOS design methodology with dual embedded adaptive power supplies is presented. A variable supply-voltage scheme for dual power supplies, namely, the dual-VS scheme, is proposed. It is found that the lower supply voltage should be set at 0.7 of the higher supply voltage to minimize chip power dissipation. This finding helps designers decide on the optimal supply voltages within a restricted design time. An MPEG-4 video codec chip is designed with internal circuits at 2.5 and 1.75 V. The dual-VS circuits generate both voltages from an external 3.3-V power supply. Compared to a conventional CMOS design, power dissipation is reduced by 57% without degrading circuit performance.

Index Terms—Adaptive power-supply system, clustered voltage scaling, low-power CMOS design, multiple supply voltages.

I. INTRODUCTION

Lowering the supply voltage $V_{DD}$ is an effective way to reduce power dissipation, but it causes two design problems.

One problem is that chip throughput is degraded because circuit delay increases at reduced voltages. One way to maintain throughput is to use multiple supply voltages, so that $V_{DD}$ is lowered only for noncritical circuits. A clustered voltage scaling (CVS) scheme has been proposed [1] for this purpose. It minimizes the area and delay penalties of inserting level-converters at boundaries from lower to higher $V_{DD}$'s.

The other problem in lowering $V_{DD}$ is that many power supplies are required on a board. Optimal voltages vary from chip to chip depending on performance requirements and circuit types, and they even change over time as the workload changes. Furthermore, interfacing chips that run under different supply voltages requires complicated and expensive circuits and device structures. One solution is to use a universal supply voltage for the interface circuits and to generate the individual optimal voltages for internal circuits with an embedded dc–dc converter [2], [3]. A variable supply-voltage (VS) scheme [3] monitors circuit speed using a critical path replica. Through feedback control, it generates the lowest supply voltage at which the monitored circuit delay matches the cycle time of an input clock.

In theory, the VS scheme and the CVS scheme could be applied together for a multiplicative effect, but in practice design issues may arise. For instance, in the CVS scheme a critical path may consist of circuits under multiple supply voltages. Suppose a critical path replica is built from a combination of these supplies. The feedback control in the VS scheme may then be unstable, because the extra delay from a lower voltage on one supply may be offset by a higher voltage on another. Even with separate replicas, the supplies may interact because of delay in the feedback control.

Another issue in the CVS scheme is that design cannot start until the supply voltages are decided, and determining the optimal supply voltages takes a long time. Consider dual supply voltages $V_{DDH}$ and $V_{DDL}$. The lower $V_{DDL}$ is, the lower the power dissipation per gate under $V_{DDL}$. However, fewer gates can then be assigned to $V_{DDL}$, because circuits run slower. An optimal $V_{DDL}$ that minimizes total power dissipation is therefore expected to exist. Finding it requires repeating every design task for each candidate value: characterizing the library, partitioning circuits, designing the layout, and monitoring the power dissipation.

This paper presents a low-power CMOS design methodology in which dual supply voltages are adaptively generated and optimally provided to internal circuits. The VS scheme for dual power supplies, namely, the dual-VS scheme, is proposed. A theory of the optimal $V_{DDL}$ is also presented, which helps designers decide on the optimal supply voltages. Section II proposes the dual-VS scheme. Section III presents the theory of the optimal $V_{DDL}$. Section IV reports the development and evaluation of an MPEG-4 video codec chip. Section V presents the conclusions.

Manuscript received December 11, 1998; revised November 30, 1999. The authors are with the System ULSI Engineering Laboratory, Mobile & Network LSI Development Group, Toshiba Corp., Kawasaki 210-8520, Japan. Publisher Item Identifier S 0018-9200(00)02869-9.

0018–9200/00$10.00 © 2000 IEEE

II. DUAL-VS SCHEME

The dual-VS scheme is illustrated in Fig. 1. There are two circuit clusters: one operating under $V_{DDH}$ ($V_{DDH}$ cells) and the other under $V_{DDL}$ ($V_{DDL}$ cells). $V_{DDH}$ and $V_{DDL}$ are each generated by an embedded power supply. For adaptive control, each supply monitors the speed of a critical path replica running under the voltage it generates. Both flip-flops and clock circuits are operated under $V_{DDL}$ to reduce power dissipation. It is therefore necessary to insert level-converters between the flip-flops and the $V_{DDH}$ cells.

Fig. 1. Dual-VS scheme.

A circuit that acts as both flip-flop and level converter has been developed [4]: the level-conversion flip-flop (LC-F/F) shown in Fig. 2.

- When the clock CLK is high (at $V_{DDL}$), the two clocked n-channel transistors are turned off, and the slave is equivalent to a conventional level-conversion circuit. Meanwhile, the master holds data, which passes to the level-conversion circuit.
- When CLK is low, these transistors are turned on, and the slave is equivalent to a latch. Meanwhile, the master is transparent but disconnected from the slave. The data captured while CLK was high are therefore stored in the slave and output.

In this way, $V_{DDL}$-swing data are captured at the rising CLK edge. They are stored and converted to a $V_{DDH}$-swing output signal, which holds until the next CLK edge. Compared with a conventional flip-flop plus level-converter, the LC-F/F reduces power, delay, and area by 14%, 41%, and 26%, respectively.

Fig. 2. Level-conversion flip-flop.

For stable control, consider two virtual critical paths. One is composed only of $V_{DDH}$ cells and the other only of $V_{DDL}$ cells, and each has a delay equal to or slightly longer than that of the real critical path. These virtual critical paths are placed in the replica circuits of the $V_{DDH}$ and $V_{DDL}$ supplies. The dual supplies can then be controlled independently and hence stably.

The critical path replicas of the $V_{DDH}$ and $V_{DDL}$ supplies are implemented as gate chains. The number of stages in each chain is chosen so that the delay of each replica equals the cycle time at the optimal $V_{DDH}$ and $V_{DDL}$. In practice, the delay is set slightly longer as a safety margin. In other words, the ratio of the numbers of stages in the two chains is inversely proportional to the ratio of the gate delays at the optimal $V_{DDH}$ and $V_{DDL}$. Consequently, the gate delay under $V_{DDL}$ stays proportional to the gate delay under $V_{DDH}$, so the relative delays of all the real paths are maintained. This guarantees that no new critical path slower than the virtual critical path appears. Section IV examines the stability and error of this control scheme through chip evaluation.

III. THEORY

$V_{DDL}$ should be chosen to minimize the power dissipation of the circuits. This section develops a theory of the optimal $V_{DDL}$.

When $V_{DDL} = V_{DDH}$, the power dissipation is given by

$$P = f\,C\,V_{DDH}^2 \tag{1}$$

where $f$ is the operating frequency and $C$ is the capacitance. Under dual power supplies with $V_{DDL} < V_{DDH}$, the power dissipation becomes

$$P' = f\left(C_H V_{DDH}^2 + C_L V_{DDL}^2\right) \tag{2}$$

where $C_H$ is the total capacitance of the $V_{DDH}$ cells, $C_L$ is the total capacitance of the $V_{DDL}$ cells, and $C = C_H + C_L$. The power dissipation ratio $\rho$ is then given by

$$\rho = \frac{P'}{P} = 1 - \frac{C_L}{C}\left[1 - \left(\frac{V_{DDL}}{V_{DDH}}\right)^2\right] \tag{3}$$
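The sketch below is a quick numerical check of (1)–(3). It computes the power ratio two ways: directly from the single- and dual-supply power expressions, and from the closed form in (3). The function names and the frequency and capacitance values are illustrative assumptions, not chip data. Only the relations themselves come from the text.

```python
def power_ratio(cl_fraction: float, v_ratio: float) -> float:
    """Closed form (3): rho = 1 - (C_L/C) * (1 - (V_DDL/V_DDH)^2).

    cl_fraction: C_L / C, the share of switched capacitance in V_DDL cells.
    v_ratio:     V_DDL / V_DDH.
    """
    if not 0.0 <= cl_fraction <= 1.0:
        raise ValueError("cl_fraction must lie in [0, 1]")
    return 1.0 - cl_fraction * (1.0 - v_ratio ** 2)


def dual_supply_power(f: float, c_h: float, c_l: float,
                      v_ddh: float, v_ddl: float) -> float:
    """Eq. (2): P' = f * (C_H * V_DDH^2 + C_L * V_DDL^2)."""
    return f * (c_h * v_ddh ** 2 + c_l * v_ddl ** 2)


if __name__ == "__main__":
    # Illustrative values only; the ratio does not depend on f or the absolute C.
    f, c_h, c_l = 30e6, 1e-9, 1e-9
    v_ddh, v_ddl = 2.5, 1.75
    p_single = f * (c_h + c_l) * v_ddh ** 2                  # eq. (1), C = C_H + C_L
    p_dual = dual_supply_power(f, c_h, c_l, v_ddh, v_ddl)    # eq. (2)
    print(p_dual / p_single, power_ratio(c_l / (c_h + c_l), v_ddl / v_ddh))
```

Suppose half of the capacitance runs at $V_{DDL} = 0.7\,V_{DDH}$. Then (3) gives $\rho = 1 - 0.5\,(1 - 0.49) = 0.745$, roughly the 25% reduction cited in Section V.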

Slower paths often contain more cells. As a first-order approximation, the capacitance in a path can therefore be assumed proportional to the delay of the path. Consequently, $C_L/C$ is given by

$$\frac{C_L}{C} = \frac{\int_0^1 r(t)\,t\,p(t)\,dt}{\int_0^1 t\,p(t)\,dt} \tag{4}$$

where:

- $p(t)$ is a path-delay distribution function. It represents the normalized number of paths whose delay is $t$ when $V_{DDL} = V_{DDH}$.
- $r(t)$ is the ratio of the total delay of the $V_{DDL}$ cells in a path to the total path delay $t$ at $V_{DDL} = V_{DDH}$.

The path delay is normalized by the cycle time, and $p(t)$ is normalized as

$$\int_0^1 p(t)\,dt = 1. \tag{5}$$

When $V_{DDL} < V_{DDH}$, the $V_{DDL}$ cells are slower, and the total path delay becomes

$$t' = t\left[(1 - r) + r\,\frac{d(V_{DDL})}{d(V_{DDH})}\right] \tag{6}$$

where $d(V)$ is a representative delay function of the supply voltage, obtained by measurement or simulation. To minimize power dissipation, as many cells as possible should be assigned to $V_{DDL}$ within the cycle-time budget ($t' \le 1$). Setting $t' = 1$ in (6), and noting that $r$ cannot exceed 1, gives

$$r(t) = \min\left\{1,\ \frac{1/t - 1}{d(V_{DDL})/d(V_{DDH}) - 1}\right\} \tag{7}$$
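The procedure of (3)–(7) can be sketched numerically as below. This is a minimal illustration under stated assumptions, and the following choices are hypothetical and made here:

- a lambda-shaped (triangular) $p(t)$ peaking at $t = 0.4$;
- an alpha-power-law $d(V)$ with $V_{th} = 0.2$ V and $\alpha = 1.3$.

Neither reflects the delay data or the five example distributions behind Fig. 3. Equation (4) is evaluated with the midpoint rule.

```python
from typing import Callable

V_TH = 0.2    # assumed threshold voltage [V] (hypothetical)
ALPHA = 1.3   # assumed alpha-power-law exponent (hypothetical)
V_DDH = 2.5   # higher supply voltage [V]


def gate_delay(v: float) -> float:
    """Hypothetical d(V), alpha-power-law form, arbitrary units."""
    return v / (v - V_TH) ** ALPHA


def lambda_pdf(t: float, peak: float = 0.4) -> float:
    """Hypothetical lambda-shaped p(t) on [0, 1]; integrates to 1 as in (5)."""
    if t < 0.0 or t > 1.0:
        return 0.0
    if t <= peak:
        return 2.0 * t / peak
    return 2.0 * (1.0 - t) / (1.0 - peak)


def r_of_t(t: float, k: float) -> float:
    """Eq. (7): largest V_DDL share of a path of delay t that keeps t' <= 1.

    k = d(V_DDL) / d(V_DDH); k <= 1 means V_DDL cells are no slower.
    """
    if k <= 1.0:
        return 1.0
    return min(1.0, (1.0 / t - 1.0) / (k - 1.0))


def cl_fraction(p: Callable[[float], float], k: float, n: int = 2000) -> float:
    """Eq. (4) by the midpoint rule (capacitance taken as proportional to delay)."""
    num = den = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        w = t * p(t)
        num += r_of_t(t, k) * w
        den += w
    return num / den


def rho(v_ratio: float, p: Callable[[float], float] = lambda_pdf) -> float:
    """Eq. (3) with C_L/C from (4) and r(t) from (7); v_ratio = V_DDL / V_DDH."""
    k = gate_delay(v_ratio * V_DDH) / gate_delay(V_DDH)
    return 1.0 - cl_fraction(p, k) * (1.0 - v_ratio ** 2)


if __name__ == "__main__":
    ratios = [x / 100 for x in range(30, 101)]
    best = min(ratios, key=rho)
    print(f"minimum rho = {rho(best):.3f} at V_DDL/V_DDH = {best:.2f}")
```

Where the minimum falls depends on the assumed $d(V)$ and $p(t)$. To use such a sweep in practice, substitute measured delay data and a path-delay histogram from static timing analysis.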

Given $p(t)$, (3)–(5) and (7) yield the power ratio $\rho$ as a function of $V_{DDL}/V_{DDH}$. The power ratio is calculated for five artificial examples of $p(t)$, depicted in Fig. 3. Interestingly, $\rho$ reaches its minimum at $V_{DDL}$ between $0.6\,V_{DDH}$ and $0.7\,V_{DDH}$ for all the examples, even though the minimum value of $\rho$ depends on $p(t)$. This means that $V_{DDL}$ should always be set at around 0.6–0.7 $V_{DDH}$ to minimize the power dissipation.

Fig. 3. Power reduction ratio versus $V_{DDL}/V_{DDH}$ calculated from theory.

To verify this theory, a discrete cosine transform (DCT) block of 5466 cells in an MPEG-4 video codec [4] is designed at various $V_{DDL}$'s, and its power dissipation is monitored. The design uses a proprietary electronic design automation (EDA) tool [5] for the CVS scheme. As shown in Fig. 4, the experimental result agrees well with the theory when a lambda-shaped $p(t)$ is assumed.

Fig. 4. Simulated power dissipation dependence on $V_{DDL}/V_{DDH}$ in DCT block.

IV. MPEG-4 VIDEO CODEC CHIP

An MPEG-4 (Moving Picture Experts Group phase 4) video codec chip [4] has been designed using the dual-VS scheme. For 30-MHz operation, the dual-VS scheme typically generates $V_{DDH}$ = 2.5 V ± 5% and $V_{DDL}$ = 1.75 V ± 5% from an external 3.3-V ± 10% power supply. All the memories are operated under $V_{DDH}$. The threshold voltage $V_{th}$ is controlled by the VTCMOS technology [6]. The chip is fabricated for a $V_{th}$ of 0.1 ± 0.1 V. The VTCMOS technology adjusts $V_{th}$ to 0.2 ± 0.05 V in the active mode and 0.5 ± 0.05 V in the standby mode. The control circuits for the dual-VS scheme and the VTCMOS technology are placed at the four corners of the chip.

Each cell row in the standard-cell layout is dedicated either to $V_{DDH}$ cells (a $V_{DDH}$ row) or to $V_{DDL}$ cells (a $V_{DDL}$ row). The level-conversion flip-flops (LC-F/F's) are placed in the $V_{DDH}$ rows. The master latch of an LC-F/F requires the $V_{DDL}$ power supply, which is routed from the adjacent row by a signal interconnection. Since the LC-F/F can receive a $V_{DDL}$-swing clock, clock buffers are placed in the $V_{DDL}$ rows to reduce power dissipation in the clock distribution. The dual-VS layout of the MPEG-4 chip is only 5% larger than the conventional layout. Details of the chip design and the design methodology are given in [4] and [5].

Table I summarizes the cell count, cell area, and row count in the logic layout of the chip. 68% of the cells are replaced with $V_{DDL}$ cells. The number of $V_{DDH}$ cells is about 1/3.5 of the number of $V_{DDL}$ cells, which agrees with the simulation result in Fig. 4. Because the relatively large LC-F/F's are also placed in the $V_{DDH}$ rows, the total cell area in the $V_{DDH}$ rows is about the same as that in the $V_{DDL}$ rows. Accordingly, the number of $V_{DDH}$ rows is balanced with the number of $V_{DDL}$ rows. If the two numbers differed greatly, the interconnection length between the rows would increase. From the layout viewpoint, it is therefore more desirable to choose $V_{DDL} = 0.7\,V_{DDH}$ rather than $0.6\,V_{DDH}$.

Fig. 5. Path-delay distribution before and after dual-VS scheme in MPEG-4 video codec submodules.

The power dissipation of the MPEG-4 chip is simulated by a transistor-level power analysis tool, using test vectors for practical operation.

- With the VS scheme and the VTCMOS technology, the supply voltage can be lowered from 3.3 to 2.5 V. This reduces power dissipation by 43% across all the circuits.
- The dual-VS scheme further lowers the supply voltage of noncritical circuits to 1.75 V. This reduces power dissipation by another 25%. By circuit category, the reduction is 30% in logic gates, 37% in flip-flops, and 51% in clock distribution.

The fabricated chips are measured by a tester, using test vectors for practical operation. Excluding the power penalty of the dc–dc converters, the average power dissipation is 62 mW with the VS scheme and 45 mW with the dual-VS scheme. The converter penalty is 10 mW for the VS scheme and 15 mW for the dual-VS scheme.
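The percentages above follow from the quadratic dependence of dynamic power on supply voltage in (1). The short script below reproduces that arithmetic. Its one modeling assumption, that half of the power moves to $V_{DDL}$, is taken from Section V. Everything else uses numbers stated in the text.

```python
def voltage_scaling_ratio(v_new: float, v_old: float) -> float:
    """Dynamic-power ratio for a supply change at fixed f and C, from P = f C V^2 in (1)."""
    return (v_new / v_old) ** 2


# VS scheme alone: every circuit moves from 3.3 V to 2.5 V.
vs_ratio = voltage_scaling_ratio(2.5, 3.3)

# Dual-VS: assume (as stated in Section V) half of the power moves
# from 2.5 V to 1.75 V = 0.7 * V_DDH; the other half stays at 2.5 V.
dual_ratio = 0.5 + 0.5 * voltage_scaling_ratio(1.75, 2.5)

# Combined ratio relative to a conventional 3.3-V design.
total_ratio = vs_ratio * dual_ratio

# Measured averages (mW) from the text: VS 62 mW, dual-VS 45 mW,
# plus dc-dc converter penalties of 10 mW and 15 mW, respectively.
measured = 1 - 45 / 62
measured_with_penalty = 1 - (45 + 15) / (62 + 10)

if __name__ == "__main__":
    for label, saving in [
        ("VS scheme vs. 3.3 V", 1 - vs_ratio),
        ("dual-VS vs. VS", 1 - dual_ratio),
        ("dual-VS vs. 3.3 V", 1 - total_ratio),
        ("measured, no converter penalty", measured),
        ("measured, with converter penalty", measured_with_penalty),
    ]:
        print(f"{label}: {100 * saving:.1f}% lower")
```

The script yields about 42.6%, 25.5%, and 57.2% for the three modeled reductions, consistent with the rounded 43%, 25%, and 57%. For the measured change from 62 to 45 mW it yields about 27%, or about 17% once the converter penalties are included.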
To investigate how much of the timing slack in noncritical paths is exploited to reduce power dissipation, the path-delay distribution is analyzed with a static timing analyzer in nine submodules of the MPEG-4 chip. The result is depicted in Fig. 5. The horizontal axis is the path delay normalized by the cycle time, and the vertical axis is the normalized number of paths. The dual-VS scheme increases the average path delays from 0.31–0.51 to 0.41–0.69 of the cycle time. This shift of the path-delay distribution is the slack that is converted into power reduction by lowering the supply voltage.

TABLE I
MPEG-4 LAYOUT RESULT. F/F: SLLC-F/F; LC: LEVEL CONVERTER

The shmoo plot in Fig. 6 is obtained by varying $V_{DDH}$ and $V_{DDL}$ in the DCT module. It shows that the generated $V_{DDH}$ and $V_{DDL}$ are the minimum supply voltages for 30-MHz operation, including a safety margin. No problems with the stability or error of the dual-VS scheme have been observed.

When $V_{DDL}/V_{DDH}$ is lowered to around 0.5, functional errors occur. These errors may be caused by crosstalk noise from $V_{DDH}$ signals to $V_{DDL}$ signals. As long as $V_{DDL}/V_{DDH}$ is around 0.7, no functional error occurs. However, the lower $V_{DDL}/V_{DDH}$ is, the larger the crosstalk noise and its effect on signal propagation delay. The maximum operating speed may therefore be degraded to some extent.

Fig. 6. Shmoo plot of DCT module in MPEG-4 video codec.

V. CONCLUSIONS

The dual-VS scheme has been presented. An implementation of critical path replicas for stable control of dual power supplies has been investigated, and a level-conversion flip-flop has been developed. Theory, simulation, and a real design all show that the lower supply voltage $V_{DDL}$ should be set at 0.7 of the higher supply voltage $V_{DDH}$ to minimize chip power dissipation. This finding helps designers decide on the optimal supply voltages, which is essential for a short design time.

An MPEG-4 video codec chip has been designed with internal circuits at 2.5 and 1.75 V, both generated from an external 3.3-V power supply. Sixty-eight percent of the cells, occupying 50% of the cell area, operate under $V_{DDL}$. Half of the total power dissipation is therefore scaled by $0.7^2$. This gives a 25% power reduction compared with a 2.5-V design using the conventional VS scheme, and 57% less power than a conventional CMOS design at 3.3 V. The chip area overhead is only 5%.

ACKNOWLEDGMENT

The authors would like to thank H. Takahashi, H. Arakida, T. Nishikawa, T. Fujita, F. Hatori, K. Suzuki, S. Mita, H. Hara, M. Ashino, F. Sano, A. Chiba, S. Kitabayashi, T. Terazawa, and Y. Watanabe for help with the chip design and evaluation; T. Ishikawa, M. Kanzawa, M. Igarashi, and K. Usami for EDA tool support; T. Sakurai for technical advice and discussion; and T. Furuyama, M. Saito, S. Nishio, T. Mitsuhashi, and Y. Unno for encouragement.

REFERENCES

[1] K. Usami and M. Horowitz, “Clustered voltage scaling technique for low-power design,” in Proc. ISLPD’95, Apr. 1995, pp. 3–8.

[2] V. Gutnik and A. Chandrakasan, “An efficient controller for variable supply-voltage low power processing,” in Symp. VLSI Circuits Dig. Tech. Papers, June 1996, pp. 158–159.

[3] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama, “Variable-supply voltage scheme for low-power high-speed CMOS digital design,” IEEE J. Solid-State Circuits, vol. 33, pp. 454–462, Mar. 1998.

[4] M. Takahashi, M. Hamada, T. Nishikawa, H. Arakida, T. Fujita, F. Hatori, S. Mita, K. Suzuki, A. Chiba, T. Terasawa, F. Sano, Y. Watanabe, K. Usami, M. Igarashi, T. Ishikawa, M. Kanazawa, T. Kuroda, and T. Furuyama, “A 60-mW MPEG4 video codec using clustered voltage scaling with variable supply-voltage scheme,” IEEE J. Solid-State Circuits, vol. 33, pp. 1772–1779, Nov. 1998.

[5] K. Usami, M. Igarashi, T. Ishikawa, M. Kanazawa, M. Takahashi, M. Hamada, H. Arakida, T. Terazawa, and T. Kuroda, “Design methodology of ultra low-power MPEG4 codec core exploiting voltage scaling techniques,” in Proc. DAC’98, June 1998, pp. 483–488.

[6] T. Kuroda, T. Fujita, S. Mita, T. Nagamatu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai, “A 0.9 V 150 MHz 10 mW 4 mm² 2-D discrete cosine transform core processor with variable-threshold-voltage scheme,” IEEE J. Solid-State Circuits, vol. 31, pp. 1770–1779, Nov. 1996.
