Optical Fiber Technology 17 (2011) 363–367


100GbE and beyond for warehouse scale computing interconnects

Bikash Koley *, Vijay Vusirikala
Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

* Corresponding author. E-mail address: [email protected] (B. Koley).
1068-5200/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.yofte.2011.06.008

Article history: Available online 28 July 2011
Keywords: 100 Gigabit Ethernet; Warehouse scale computer; Internet; Datacenter; WDM; Cluster

Abstract

Increasing broadband penetration in the last few years has resulted in dramatic growth in innovative, bandwidth-intensive applications that have been embraced by consumers. Coupled with this consumer trend is the migration from a local compute/storage model to a cloud computing paradigm. As computation and storage continue to move from desktops to large internet services, the computing platforms running such services are transforming into warehouse scale computers. 100 Gigabit Ethernet and beyond will be instrumental in scaling the interconnection within and between these ubiquitous warehouse scale computing infrastructures. In this paper, we describe the drivers for such interfaces and some methods of scaling Ethernet interfaces to speeds beyond 100GbE.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

As computation continues to move into the cloud, computing platforms are no longer stand-alone servers but homogeneous interconnected computing infrastructures hosted in mega-datacenters. These warehouse scale computers (WSCs) provide a ubiquitous interconnected compute platform as a shared resource for many distributed services, and are therefore very different from the traditional racks full of collocated servers in a datacenter [1]. Interconnecting such WSCs in a cost-effective yet scalable way is a unique challenge that is being addressed through network design and technology transformation, which in turn is driving the evolution of the modern internet.

The central core of the internet, which was dominated by traditional backbone providers, is now connected by hyper giants offering rich content, hosting, and CDN (Content Distribution Network) services [2]. It is not difficult to imagine the network moving towards more and more direct connection between content providers and content consumers, with the traditional core providers facing disintermediation. Table 1 lists the ATLAS top-10 inter-domain Autonomous Systems (ASes) in the public internet in 2007 and 2009. Content providers such as Google and Comcast, which were not ranked in 2007, occupy prominent places in 2009. Note that this report only accounts for publicly measurable bandwidth between the ASes where the measurement was taken. Left uncounted here are three types of traffic: (1) traffic inside datacenters, (2) the backend bandwidth used to interconnect datacenters and operate the content distribution networks, and (3) Virtual


Private Network (VPN) traffic. These data demonstrate the transformation from the original focus on network connectivity by traditional carriers to a focus on content by non-traditional companies. New internet applications such as cloud computing and CDN are reshaping the network landscape: content providers and cloud computing operators have now become the major driving forces behind large-capacity optical network deployments [3].

2. Intra-datacenter connectivity

A WSC is a massive computing infrastructure built with homogeneous hardware and system software, arranged in racks and clusters interconnected by a massive networking infrastructure [1]. Fig. 1 shows the common architecture of a WSC. A set of commodity servers is arranged into racks and interconnected through a top of rack (TOR) switch. Rack switches are connected to cluster switches, which provide connectivity between racks and form the cluster fabrics for warehouse scale computing. Ideally, one would like an intra-datacenter switching fabric with sufficient bisection bandwidth to accommodate non-blocking connections from every server to every other server in a datacenter, so that applications do not require location awareness within a WSC infrastructure. However, such a design would be prohibitively expensive. More commonly, interconnections are aggregated into hierarchies of distributed switching fabrics with an oversubscription factor for communication across racks (Fig. 2) [4]. Intra-datacenter networking takes advantage of a fiber-rich environment to drive very large bandwidth within and between clusters.
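To make the oversubscription arithmetic concrete, the following minimal Python sketch (not from the paper; all numbers are hypothetical examples) shows how per-stage oversubscription factors compound and cut the bandwidth available to cross-rack traffic:

```python
# Illustrative sketch (not from the paper; all numbers are hypothetical):
# how oversubscription at each aggregation stage cuts the bandwidth
# available to cross-rack traffic in a hierarchical cluster fabric.

def uplink_bandwidth_gbps(servers_per_rack: int,
                          nic_gbps: float,
                          oversubscription: float) -> float:
    """Uplink capacity a TOR switch provisions toward the cluster
    fabric, given an oversubscription factor >= 1."""
    offered = servers_per_rack * nic_gbps   # sum of server NIC rates
    return offered / oversubscription       # capacity actually provisioned

# Example: 40 servers with 10GbE NICs behind a TOR, 4:1 oversubscribed.
print(uplink_bandwidth_gbps(40, 10, 4))     # 100.0 Gbps toward the fabric

# Stages compound: a 4:1 TOR stage followed by a 2:1 aggregation stage
# leaves 8:1 end-to-end oversubscription for cross-cluster flows.
print(40 * 10 / (4 * 2))                    # 50.0 Gbps effective
```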


Table 1
ATLAS top-10 public internet bandwidth-generating domains [2].

(a) Top-10 in 2007

Rank  Provider         Percentage
1     Level(3)         5.77
2     Global Crossing  4.55
3     ATT              3.35
4     Sprint           3.2
5     NTT              2.6
6     Cogent           2.77
7     Verizon          2.24
8     TeliaSonera      1.82
9     Savvis           1.35
10    AboveNet         1.23

(b) Top-10 in 2009

Rank  Provider         Percentage
1     Level(3)         9.41
2     Global Crossing  5.7
3     Google           5.2
-     Comcast          3.12
(remaining entries intentionally omitted in the source)

However, fiber infrastructure itself is becoming a significant cost driver for such large WSC infrastructures. To address this, reuse of existing fiber infrastructure and scaling of cross-sectional bandwidth by increasing per-port bandwidth are becoming critical. The introduction of higher port-speed optical interfaces always goes through a natural evolution from bleeding-edge technology (e.g. 100GbE today) to maturity (e.g. 10GbE today, which was bleeding edge 10 years ago), with a gradual reduction in power consumption per gigabit/second of interconnect [4] (Fig. 3). Broadly, one can break this technology evolution down into three stages:


(a) Bleeding edge: a 10× speed increase is obtained for a 20× increase in power consumption (e.g. 100GbE 10×10 MSA modules consume ~14 W, compared to a 10GbE SFP+ consuming <1 W).
(b) Parity: a 10× speed increase is obtained for a 10× increase in power consumption.
(c) Maturity: a 10× speed increase is obtained for a 4× increase in power consumption (e.g. 10GbE today as compared to 1GbE interfaces).

Increasing per-port bandwidth directly translates into a reduction of radix for the individual switches [4]. As a result, a larger number of switching nodes or fabric stages may become necessary to build the same cross-sectional bandwidth. Fig. 4 illustrates the example of a cluster fabric with 10 Tbps cross-sectional bandwidth. If the fabric is built with switching nodes capable of 1 Tbps switching bandwidth, the use of increasingly higher-speed interfaces leads to step-function jumps in power consumption as more stages are introduced due to the radix constraint, as sketched below.
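The three stages can be summarized as power-law scalings of per-port power versus port speed. The sketch below is a simplifying model, not the authors' data; the ~1 W SFP+ baseline is taken from the example above, and the exponents are derived from the three stage multipliers:

```python
import math

# A minimal model (not the authors' data): per-port power grows as a
# power law of port speed, with the exponent set by the multiplier each
# maturity stage assigns to a 10x speed increase.
P0 = 1.0          # assumed baseline: ~1 W per 10GbE port (SFP+, per the text)
BASE_GBPS = 10.0  # baseline port speed

ALPHA = {
    "bleeding edge": math.log10(20),  # 20x power for 10x speed
    "parity":        math.log10(10),  # 10x power for 10x speed
    "maturity":      math.log10(4),   # 4x power for 10x speed
}

def per_port_power(speed_gbps: float, stage: str) -> float:
    """Per-port power (W) under the assumed power-law scaling."""
    return P0 * (speed_gbps / BASE_GBPS) ** ALPHA[stage]

for stage in ALPHA:
    print(f"{stage:13s}: 100GbE port ~ {per_port_power(100, stage):5.1f} W")
# bleeding edge: ~20 W, parity: ~10 W, maturity: ~4 W
```

At the fabric level, total power additionally scales with the number of ports and fabric stages; since higher port speeds shrink switch radix, the extra stages produce the step-function jumps seen in Fig. 4.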

Fig. 2. Hierarchies of intra-datacenter cluster-switching interconnect fabrics: (a) within a single building; (b) across multiple buildings.

Fig. 1. Typical elements in a warehouse scale computer.



Fig. 3. Evolution of relative per-port power consumption with port speed for intra-datacenter interconnects at three different technology maturity levels (constant power/Gbps; 4× power for 10× speed; 20× power for 10× speed).

Fig. 4. Total power consumption in a 10 Tbps fabric built with nodes capable of 1 Tbps switching at various interface speeds; three technology maturity curves in terms of power consumption are considered.

In order to meet the bandwidth- and power-scaling requirements of warehouse scale computing infrastructures, 100 Gbps and faster optical interfaces must target 4× power consumption for 10× speed as the guiding design principle.

3. Inter-datacenter connectivity

A WSC infrastructure can span multiple datacenters. Consequently, the cluster aggregation switching fabric will span multiple datacenters as well, as shown in Fig. 5. Inter-datacenter connection fabrics are typically implemented over a fiber-scarce physical layer, as the link distances range from tens to thousands of kilometers. The fiber scarcity and the limited bandwidth available on metro and long-haul fiber links leave an undesirably low bandwidth available between clusters, as illustrated in Fig. 6. If capacity per fiber-pair is not maximized, a bottleneck is introduced due to high oversubscription for inter-datacenter communication [1].

Accelerating broadband penetration and the uptake of internet-based applications with rich multi-media content have led to a >40% compound annual growth rate (CAGR) of internet traffic [2] (Fig. 7), with 9 exabytes of traffic volume per month. While the exponential growth of internet traffic drives bandwidth demand for inter-datacenter networks, the Moore's-law growth of the processing and storage capacity [5] utilized in the WSC infrastructure drives bandwidth at an even faster pace. Extrapolating the average CAGR of 60% seen in processing power and storage capacity, one can see that Ethernet standards and port speeds have kept up well with internet-scale traffic growth but are falling behind Moore's-law (machine-to-machine) traffic growth (Fig. 8). The need for 100 Gbps and beyond interconnect technologies is therefore immediate for inter-datacenter connections. Various emerging technology building blocks offer the potential for this cost-effective capacity scaling, as described below:

(a) Higher capacity per fiber: Optical transport solutions that increase the maximum capacity per fiber beyond today's commercially available 8 Tb/s (based on 80 channels of 100 Gbps transmission in the C-band); see the sketch after this list. Published literature has shown a roadmap to continued fiber-capacity scaling using a number of approaches for increasing the spectral range and spectral efficiency [6,7]. These include higher data rates, higher-order modulation, OFDM, multiple transmission bands, etc.
(b) Unregenerated reach: As transmission data rates increase, the unregenerated reach typically decreases due to the higher OSNR required. The use of techniques such as soft-decision FEC will help bridge the gap. In addition, techniques for maximizing optical link OSNR across the transmission spectral range, such as optimized Raman amplification, tilt control, spectral equalization, and per-span launch power adjustment, can be used to increase the maximum unregenerated reach.
(c) Variable-rate optics: With coherent optical transmission systems, it is possible to have a variable transmission rate based on link quality and condition. For example, for shorter links or links with "good" fiber types, the additional optical link margin can be used to transmit at higher data rates. This type of variable-rate transmission has to be tightly integrated with the packet layer and managed at the system/network level to realize overall throughput maximization.
(d) Flexible-grid ROADMs: Currently, commercially available ROADMs are based on 50 GHz or 100 GHz ITU grid spacing. These fixed-grid ROADMs become a limitation for future capacity scaling and network flexibility. Emerging flexible-grid or gridless ROADMs [8,9], which provide the ability to arbitrarily determine spectral pass bands and the spacing between them, enable two key functionalities: (i) support for higher data rates in a spectrally efficient way, by packing the wavelengths in a manner determined by the spectral content of the waves rather than the limitations of the ITU grid, and (ii) flexibility for arbitrary add/drop of wavelengths independent of the underlying data rate or modulation scheme.
(e) Large-core fibers: With the ability of coherent systems to compensate for fiber impairments such as chromatic dispersion and polarization mode dispersion, the major remaining fiber impairments that limit transmission are fiber attenuation and fiber non-linearities. Recent advances in large-core (110 μm² effective area), low-attenuation (<0.17 dB/km) fibers demonstrate the capability to increase the transmission distance for a given fiber capacity [10]. The large effective area enables a lower power density, which helps alleviate penalties from fiber non-linearities.
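As a rough illustration of item (a), the following sketch relates spectral efficiency to per-fiber capacity. The ~4 THz of usable C-band spectrum is implied by the text's 80 × 50 GHz channels; the 4 b/s/Hz case is a hypothetical example of what higher-order modulation and flexible-grid packing could enable, not a measured result:

```python
# Back-of-the-envelope sketch for item (a). Assumptions: the text's
# 80 channels x 100 Gbps on a 50 GHz fixed grid imply ~4 THz of usable
# C-band spectrum at 2 b/s/Hz; the 4 b/s/Hz case is a hypothetical
# illustration of higher-order modulation / flexible-grid packing.

C_BAND_HZ = 80 * 50e9   # 4 THz of usable spectrum implied by the text

def fiber_capacity_tbps(spectral_efficiency_bps_per_hz: float) -> float:
    """Per-fiber capacity (Tb/s) for a given spectral efficiency."""
    return C_BAND_HZ * spectral_efficiency_bps_per_hz / 1e12

print(fiber_capacity_tbps(2.0))   # 8.0  -- today's 100G on a 50 GHz grid
print(fiber_capacity_tbps(4.0))   # 16.0 -- hypothetical doubled efficiency
```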


Fig. 5. Inter-datacenter networks connecting multiple WSCs (fiber-rich intra-datacenter interconnect fabrics joined by fiber-scarce inter-datacenter links).

Fig. 6. Inter-datacenter available bandwidth as a function of distance between the compute elements.

Fig. 7. A >40% CAGR of internet traffic [3].

Fig. 8. Ethernet standards and port speeds compared to internet and extrapolated Moore's-law (machine-to-machine) traffic growth.
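The compounding arithmetic behind Fig. 8 can be sketched directly. Assuming the CAGRs quoted in the text (40% for internet traffic, 60% for machine-to-machine demand) and normalizing both to 1.0 at year zero:

```python
# Sketch of the compounding argument behind Fig. 8 (normalized units;
# the 40% and 60% CAGRs are the figures quoted in the text).

def growth(cagr: float, years: int) -> float:
    """Multiplicative growth after `years` at a constant CAGR."""
    return (1 + cagr) ** years

for years in (1, 5, 10):
    internet = growth(0.40, years)  # internet traffic, ~40% CAGR
    m2m = growth(0.60, years)       # machine-to-machine, ~60% CAGR
    print(f"{years:2d} yr: internet x{internet:6.1f}, "
          f"m2m x{m2m:7.1f}, gap {m2m / internet:4.1f}x")
# After 10 years, machine-to-machine demand is ~3.8x the internet-traffic
# extrapolation -- the widening gap that Ethernet speeds must chase.
```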

4. Conclusions

The advent of warehouse scale computing is driving the need for bandwidth within and between datacenters. While intra-datacenter connections can take advantage of a fiber-rich physical layer, the bandwidth needs of fiber-scarce inter-datacenter connections will drive the adoption of 100GbE and beyond in massive WSC environments. Deployment of network technologies beyond 100GbE will be needed within the next three to five years for WSC interconnects.

References


[1] L.A. Barroso, U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, 2009.
[2] C. Labovitz et al., ATLAS Internet Observatory 2009 Annual Report.

[3] C.F. Lam, H. Liu, B. Koley, X. Zhao, V. Kamalov, V. Gill, Fiber optic communication technologies: what's needed for datacenter network operations, IEEE Commun. Mag. 48 (7) (2010).
[4] B. Koley, Requirements for data center interconnects, in: 20th Annual Workshop on Interconnections within High Speed Digital Systems, Santa Fe, New Mexico, 3–6 May 2009, Paper TuA2.
[5] R.J.T. Morris, B.J. Truskowski, The evolution of storage systems, IBM Syst. J. 42 (2) (2003).
[6] R.-J. Essiambre, G. Kramer, P.J. Winzer, G.J. Foschini, B. Goebel, Capacity limits of optical fiber networks, J. Lightw. Technol. 28 (2010) 662–701.


[7] K. Roberts, Digital coherent optical communications beyond 100 Gb/s, in: Signal Processing in Photonic Communications, OSA Technical Digest (CD), Optical Society of America, 2010. Paper JTuA1. [8] C.F. Lam, W.I. Way, A System’s View of Metro and Regional Optical Networks, Photonics West, San Jose, CA, January 29, 2009. [9] M. Jinno et al., IEEE Commun. Mag. (2009) 66–73. [10] R. Chen, M. O’Sullivan, C. Ward, S. Asselin, M. Belanger, Next generation transmission fiber for coherent systems, in: Optical Fiber Communication Conference, OSA Technical Digest (CD), Optical Society of America, 2010. Paper OTuI1.
