Path Consolidation for Dynamic Right-Sizing of Data Center Networks

Muhammad Abdullah Adnan and Rajesh Gupta
University of California San Diego
{madnan,rgupta}@ucsd.edu

Abstract—Data center topologies typically consist of multi-rooted trees with many equal-cost paths between a given pair of hosts. Existing power optimization techniques do not utilize this property of data center networks for power proportionality. In this paper, we exploit this opportunity and show that significant energy savings can be achieved via path consolidation in the network. We present an offline formulation for the flow assignment in a data center network and develop an online algorithm based on path consolidation for dynamic right-sizing of the network to save energy. To validate our algorithm, we build a flow-level simulator for a data center network. Our simulation on flow traces generated from a MapReduce workload shows ~80% reduction in network energy consumption in data center networks and ~25% more energy savings compared to the existing techniques for saving energy in data center networks.

I. INTRODUCTION

Increase in energy prices and the rise of cloud computing bring up the issue of making data centers energy efficient. A significant portion (~20%) of the energy consumption in data centers comes from the networking elements. According to an EPA report [1], the networking elements in data centers consumed 3 billion kWh in 2006, and the power demand is rising every year. Hence achieving energy proportionality in data center networks has been a recent focus of research in the area [2]. The architecture of a data center network provides many redundant paths to ensure full bisection bandwidth. But the full bisection bandwidth is only useful when the network utilization is near 100%. Benson et al. [3] have found that in a data center network (e.g. Figure 1(a)), the average link utilization of the aggregation-layer links is about 8% during 95% of the time, while the link utilization of core-layer links is 20-40%. This is due to the large variation of data center traffic during different times of the day. To achieve energy proportionality in the data center network, efforts [4], [5] have been made to dynamically turn on/off subsets of the network elements (switches and links) based on traffic demands. These techniques use predictions on the flow rates and provide heuristics without any bound on the quality of the solutions. In this paper, we exploit the topological characteristics of data center networks and devise an online algorithm to dynamically right-size the network for energy proportionality. Right-sizing a data center network is fundamentally a trade-off between power consumption, fault tolerance and performance. Data center topologies offer many redundant paths between a given pair of hosts. We exploit this degree of freedom, an inherent property of data center networks, for selecting paths from the topology. We choose the paths that lead to better energy savings. We also control the level of redundancy by a parameter in our algorithm to ensure robustness as well as energy efficiency.

We devise a formal model for the right-sizing of data center networks. Although our model is for general data center networks, we evaluate our algorithm on the widely used fat-tree structure [6]. In a pod-k fat-tree topology, there are (k/2)^2 possible paths between any pair of hosts. We exploit this opportunity of multiple paths and select the best paths that fulfill our goals for energy reduction. Figure 1(a) shows the fat-tree topology for data center networks, comprising many redundant links and switches. We dynamically right-size the network to obtain a subset of active nodes and links, as shown in Figure 1(b). The state-of-the-art method for right-sizing the data center network, e.g. Elastic Tree [4], considers flows at each switch and proposes heuristics to determine the power state (on/off) of each switch and its adjacent links locally. Unlike the previous method, we leverage the graph-theoretic properties of the fat-tree network to select and update paths globally. Our method takes into account the flows between end hosts and selects the best paths considering the network as a graph. Hence our method is immediately applicable to any topology for data center networks, like DCell [7], VL2 [8], BCube [9], etc. Moreover, the advent of OpenFlow [10] switches facilitates the collection of flow information and promotes centralized control over flows. In this regard, CARPO [5] proposed traffic consolidation methods for correlation-aware placement of flows on the available paths. But the correlation information among the flow demands is not available in practice during flow scheduling. Instead, we place flows on paths based on their dynamic flow demands and the dynamic capacity of the paths. For fault tolerance, we do not use prediction for flow rates; rather, we ensure safety margins by adding redundant paths to the active network.

978-0-7695-5028-2/13 $26.00 © 2013 IEEE   DOI 10.1109/CLOUD.2013.106

This paper makes three contributions. First, we present an offline optimization for right-sizing of data center networks with dynamic flow rates. Second, we design an online algorithm to dynamically turn on/off network elements in a data center network. The algorithm does not make any assumption about the underlying flow scheduling mechanism in the network, and hence it can work on top of any flow scheduler. Third, we build a flow-level simulator to simulate flow traffic in a data center network. Implementing our algorithm on top of the network simulator, we demonstrate ~20-25% energy reduction with respect to the existing techniques for achieving energy proportionality of data center networks. The rest of the paper is organized as follows. Section II illustrates our model for a data center network and provides the offline formulation. In Section III, we present the online algorithm for right-sizing of the data center network. Section IV describes the simulator we built for simulating flows in data center networks. In Section V, we present an empirical evaluation of our algorithm with respect to the state-of-the-art solutions. Section VI discusses the related work and Section VII concludes the paper.

Fig. 1. Dynamically turn off a subset of links and switches in order to save energy on a pod-k fat-tree: (a) Fat-tree, (b) Right-sized fat-tree.

internal nodes (or switches). This is a common practice for the abstraction of the fat-tree network for flow scheduling [11], as we never turn off any active host or the link connecting the host to an edge switch; it thus enables us to concentrate only on the power consumption of the network elements. In our algorithm, we only consider the flows that are active. The edge switches that do not have flows are essentially turned off.

II. MODEL FORMULATION

We model the data center network as a graph G = (V, E), where V = S ∪ H is the set of all switches S and end hosts H, and E is the set of all links in the network. A path from node u to node v is an ordered sequence e_1, e_2, ..., e_l of distinct links, where e_1 and e_l have their respective end nodes at u and v, and each pair of adjacent links {e_{j-1}, e_j} has a common distinct end node w with w ≠ u, v. The length of the path is l. We denote a path between the nodes u, v as p_{u,v} and the set of all paths between them as P_{u,v}. Note that by definition a vertex w ∈ V or a link e ∈ E can appear only once in a path p_{u,v}, and thus we avoid loops or cycles in a path. Each link e(u, v) ∈ E has a capacity C(e), which represents the maximum bandwidth it can support. Let c_t(e) be the available (unused) bandwidth of a link at time t, where c_t(e) ≤ C(e) for all t. The capacity c_t(p) of a path p is the minimum capacity of any edge along that path, i.e. c_t(p) = min_{e∈p} c_t(e). The capacity of a set of paths P is the summation of the capacities of all the paths in that set, i.e. c_t(P) = Σ_{p∈P} c_t(p). A path p_{u,v} is said to have minimum capacity if c_t(p) is the minimum among all the paths between nodes u, v at time t. Similarly, a path p_{u,v} is said to have maximum capacity if c_t(p) is the maximum among all the paths between nodes u, v at time t.
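The path and path-set capacities defined above can be expressed in a few lines. The following is a small illustrative sketch (not the authors' code); the link capacities and node names are made-up toy values:

```python
# Sketch of c_t(p) = min over links and c_t(P) = sum over paths (Section II).
# The capacity map and the two paths below are a hypothetical example.

def path_capacity(path, cap):
    """c_t(p): minimum available bandwidth over the links of the path."""
    return min(cap[e] for e in path)

def set_capacity(paths, cap):
    """c_t(P): sum of the path capacities of all paths in the set."""
    return sum(path_capacity(p, cap) for p in paths)

# Toy example: two paths between hosts u and v through nodes a and b.
cap = {("u", "a"): 10, ("a", "v"): 4, ("u", "b"): 7, ("b", "v"): 9}
p1 = [("u", "a"), ("a", "v")]   # capacity min(10, 4) = 4
p2 = [("u", "b"), ("b", "v")]   # capacity min(7, 9) = 7

print(path_capacity(p1, cap))       # 4
print(set_capacity([p1, p2], cap))  # 4 + 7 = 11
```

The bottleneck (min) rule makes a path only as useful as its most loaded link, while the sum rule lets a set of paths jointly cover a demand.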

A. Flow Specification

We denote the set of all flows in the network by F, where |F| = K. Each flow F_i is represented by (s_i, d_i, f_{i,t}, t_i, t_i'), where (s_i, d_i) is the source-destination pair, f_{i,t} is the dynamic size in bytes, t_i is the start time and t_i' is the stop time. We do not estimate bandwidth requirements for flows and instead use dynamic flow sizes. This makes our formulation more realistic, as bandwidth demand for network flows changes over time [4], [11]. A flow is not active outside its start and stop time, hence f_{i,t} = 0 for t < t_i, t > t_i'. We represent the assignment of a flow to a link e and a path p by F_i → e and F_i → p, respectively. Our goal is to assign each flow to a path or a set of paths. Consequently, multiple flows can share a link, but the total assigned flow through a link e(u, v) should be less than its capacity: Σ_{F_i→e} f_{i,t} ≤ C(e), for all t.

C. Power Model

The power consumption in the network is the total power consumed by the network elements: switches, ports and links.¹ The total power consumption at time t is

  ε_t = ε_t^switch + ε_t^port + ε_t^link    (1)

where ε_t^switch, ε_t^port and ε_t^link are the power consumption of the active switches, ports and links respectively. If n_t^switch, n_t^port and n_t^link are the number of active switches, ports and links in the network at time t, then

  ε_t = α_s n_t^switch + β_s + α_p n_t^port + β_p + α_l n_t^link + β_l    (2)

where α_s, α_p, α_l, β_s, β_p, β_l are the constants for power consumption. The power consumption in the links is negligible [5], [12] and can be incorporated into the power consumption of the ports, and the total number of ports is twice the total number of links. Hence

  ε_t = α_s n_t^switch + β_s + 2α_p n_t^link + β_p    (3)

The total energy consumption over a time interval of T can be obtained by

  E = Σ_{t=1}^{T} ε_t = α_s Σ_{t=1}^{T} n_t^switch + T β_s + 2α_p Σ_{t=1}^{T} n_t^link + T β_p    (4)
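Equations (3)-(4) are straightforward to evaluate given per-time-step counts of active elements. A minimal sketch (not the authors' code) using the constants reported later in Section V (α_s = 67.7 W, α_p = 2 W, β_s = β_p = 0); the count series is a made-up example:

```python
# Sketch of the power model of Equations (3)-(4): power per time step from
# active switch/link counts, summed over the consolidation period.
# Constants follow Section V; the traces are hypothetical.

ALPHA_S, BETA_S = 67.7, 0.0   # per-switch power constants (W)
ALPHA_P, BETA_P = 2.0, 0.0    # per-port power constants (W)

def power(n_switch, n_link):
    """Equation (3): power at one time step; two ports per active link."""
    return ALPHA_S * n_switch + BETA_S + 2 * ALPHA_P * n_link + BETA_P

def energy(n_switch_series, n_link_series):
    """Equation (4): total energy over T steps (watts x time step)."""
    return sum(power(s, l) for s, l in zip(n_switch_series, n_link_series))

# Hypothetical trace: three time steps of active switch/link counts.
print(round(power(10, 20), 1))                     # 757.0
print(round(energy([10, 8, 6], [20, 15, 10]), 1))  # 1804.8
```

Because β_s and β_p only contribute fixed Tβ terms, right-sizing can only reduce the α-weighted terms, i.e. the number of active switches and links.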

D. Offline Optimization

We now formulate the offline optimization for the flow assignment problem with dynamic flow rates. In this case, we have complete information about the flows and we optimize energy over a consolidation period T. Finding a flow assignment in a general network while not exceeding the capacity of any link is called the Multi-Commodity Flow problem. But the problem we are dealing with

B. Model Adaptation for Fat-Tree

In the fat-tree topology (e.g. Figure 1(a)), hosts are connected to the edge switches by only one link. To adapt our model to that topology, we consider the edge switches as the end hosts and the aggregation and core switches as the

¹The energy wastage due to heat density is not considered in this paper.

here is different from the Multi-Commodity Flow problem in three aspects. First, the flows are dynamic, i.e. the number of active flows changes over time. Second, the flow rate of an active flow may change over time. Third, the flows cannot be split, as flow splitting is undesirable in TCP due to packet reordering effects [13]. Hence each flow should be assigned to a single path, which makes the problem a 0-1 (Boolean) flow assignment problem. We formulate the flow assignment problem with binary decision variables associated with each path and switch the links accordingly. The binary decision variables indicate the power state of each switch and link. The formulation presented below minimizes the total network power by solving a 0-1 (Boolean) linear program.

Let X_{i,u,v,t} be the binary decision variable representing the assignment of flow i to a path p_{u,v} at time t. A link (u, v) is active at time t if ∃i, X_{i,u,v,t} = 1. We have the following constraints on the variables X_{i,u,v,t}:

• The network can sustain all the flows, i.e. at any time there is a valid assignment for each flow from its source to its destination:

  X_{i,s_i,d_i,t} = 1,   ∀i, t_i ≤ t ≤ t_i'    (5)

• A flow F_i is not active outside its start and stop times:

  X_{i,u,v,t} = 0,   ∀i, ∀u, ∀v, t < t_i, t > t_i'    (6)

• All flows and paths are bidirectional:

  X_{i,u,v,t} = X_{i,v,u,t},   ∀i, ∀u, ∀v, ∀t    (7)

Let Y_{u,t} be the binary decision variable indicating the power state (on/off) of switch u at time t. Then the following two constraints correlate the variables X_{i,u,v,t} and Y_{u,t}:

• If a switch is off, then all the links incident to it are off. Let V_u be the set of switches adjacent to switch u:

  X_{i,u,w,t} ≤ Y_{u,t},   ∀i, ∀u, ∀w ∈ V_u, ∀t    (8)

• If all the links incident to a switch are off, the switch is off:

  Y_{u,t} ≤ Σ_{i=1}^{K} Σ_{w∈V_u} X_{i,u,w,t},   ∀u, ∀t    (9)

We also have constraints on the capacity of links and the path assignment of flows.

Capacity Constraints: The total bandwidth requirement of the flows assigned to each link cannot exceed the link capacity:

  Σ_{i=1}^{K} (f_{i,t} × X_{i,u,v,t}) ≤ C(u, v),   ∀(u, v) ∈ E, ∀t    (10)

Path Constraints: The following two constraints ensure that each flow is assigned to a path:

• An assigned path is connected, i.e. if there is a path p_{u,w} and a path p_{w,v} then there is a path p_{u,v}, where w ∈ V_u:

  X_{i,u,v,t} ≤ (1/2) Σ_{w∈V_u} (X_{i,u,w,t} + X_{i,w,v,t}),   ∀i, ∀(u, v) ∉ E, ∀t    (11)

• Each flow is assigned to exactly one path, i.e. a flow uses at most one incoming and one outgoing link at any node and hence cannot branch:

  Σ_{w∈V_u} X_{i,u,w,t} ≤ 2,   ∀i, ∀u, ∀t    (12)

Then the number of active switches and links at time t are given by n_t^switch = Σ_{u∈V} Y_{u,t} and n_t^link = Σ_{(u,v)∈E} Σ_{i=1}^{K} X_{i,u,v,t}. Under constraints (5)-(12), the objective function to minimize the total energy consumption E from Equation (4) over a consolidation period T can be represented as:

  min_{X_{i,u,v,t}, Y_{u,t}}  α_s Σ_{t=1}^{T} Σ_{u∈V} Y_{u,t} + T β_s + 2α_p Σ_{t=1}^{T} Σ_{(u,v)∈E} Σ_{i=1}^{K} X_{i,u,v,t} + T β_p

Finding the optimal flow assignment by solving this optimization is NP-complete due to the integer (Boolean) constraints. While we can use an optimization toolbox to solve the binary ILP for small network instances or to obtain a near-optimal solution, this is not applicable in practice for two reasons. First, the formulation assumes that we have perfect knowledge of the flows, but in a data center network flows change unpredictably over time. Second, data center networks consist of a huge number of links and switches, and solving the ILP on such huge networks takes exponential time. In a real data center network, we neither have knowledge about the incoming traffic nor information about bandwidth demands. Hence in this paper we provide an online algorithm that reduces energy consumption by dynamically determining the set of switches and links to keep active.

III. ONLINE ALGORITHM

Dynamically turning off elements (switches, links) in a network will affect the flows and packets routed through those network elements. Thus dynamic right-sizing is challenging, because after right-sizing we need to provide alternate routing paths through which the affected flows can be re-routed. We also need to determine how often the network should be resized: periodically vs. event-driven (threshold-based). In this regard, we design an event-driven online algorithm to reduce energy consumption in the data center network. The algorithm guarantees sufficient capacity in the network to schedule the flows at any time, such that the network is neither over-utilized nor under-utilized. At the same time, the algorithm ensures fault tolerance and reliability by providing a level of redundancy in the network. Our strategy is to reduce energy consumption by adjusting the topology of the network. We compute sorted lists of the most overlapping paths between each pair of hosts. When the total flow rate at any link crosses a pre-specified threshold (upper/lower), we efficiently update the topology by selecting (adding/removing) the best overlapping paths that satisfy the flow demand, while keeping margins (r×) for fault tolerance. We call r the redundancy parameter of our algorithm. We name our approach 'path consolidation', as opposed to 'traffic consolidation' in Elastic Tree [4] and 'link rate adaptation' in CARPO [5]. Our algorithm consists of two steps: (i) add paths for new flows in the topology, (ii) update

the paths that are either under-utilized or over-utilized. Below we discuss the two steps in detail.

A. Path Selection

For each pair (u, v) of source and destination, we compute a set of paths that supports the total flow (bandwidth) demand F_{u,v} between source u and destination v. To ensure fault tolerance, we choose paths that are capable of supporting an r × F_{u,v} flow rate. Given the network G = (V, E) with link capacity list C, our goal is to choose a set of paths P^s_{u,v} that have minimum capacity and together support an r × F_{u,v} flow rate. By selecting the paths that have minimum capacity, we increase the overlap between the paths of different pairs. The process of path selection continues until the flow requirement of every active source-destination pair has been considered. We then compute the new subgraph G_on as the union of the selected paths P^s_{u,v} for all active source-destination pairs (u, v). We turn off all the links and nodes in the subgraph G_off = G − G_on.
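The greedy core of this selection can be sketched in a few lines. This is an illustrative sketch (not the authors' implementation): candidate paths, capacities and the demand are toy values, and each path is abstracted to a (name, capacity) pair:

```python
# Sketch of the greedy selection: sort candidate paths by capacity and keep
# accumulating the smallest-capacity ones until the selected set supports
# r times the demand F_{u,v}. Toy data, not the paper's code.

def select_paths(paths, demand, r):
    """paths: list of (path_id, capacity). Returns the selected path ids."""
    selected, total = [], 0.0
    for pid, cap in sorted(paths, key=lambda pc: pc[1]):  # min capacity first
        selected.append(pid)
        total += cap
        if total >= r * demand:   # enough capacity, with redundancy margin r
            break
    return selected

# Three candidate paths between one source-destination pair.
candidates = [("p1", 5.0), ("p2", 2.0), ("p3", 8.0)]
print(select_paths(candidates, demand=6.0, r=1))  # ['p2', 'p1'] (7.0 >= 6.0)
print(select_paths(candidates, demand=6.0, r=2))  # all three (15.0 >= 12.0)
```

Preferring the minimum-capacity (i.e. most loaded) paths is what packs flows onto shared links and leaves the high-capacity paths free to be switched off.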

Fig. 2. The active nodes and the routes with (a) conventional routing (Equal Cost Multipath) and (b) our algorithm (path consolidation). The active nodes are shown in grey.

Algorithm 1 Path Selection (r, t)
Input: Flow list F = ∪{F_i}, network G = (V, E), link capacity list C.
Output: Active network G_on = (V_on, E_on), where V_on ⊆ V, E_on ⊆ E.
1: G_on := {}
2: for all source-destination pairs (u, v) in F do
3:   F_{u,v} := total bandwidth requirement for (u, v)
4:   P_{u,v} := all paths between (u, v)
5:   sort P_{u,v} according to path capacity c_t(p), ∀p ∈ P_{u,v}
6:   P^s_{u,v} := {}
7:   for all p ∈ P_{u,v} do              ▷ select min-capacity paths
8:     P^s_{u,v} := P^s_{u,v} ∪ p
9:     if C(P^s_{u,v}) ≥ r × F_{u,v} then
10:      break
11:    end if
12:  end for
13:  G_on := G_on ∪ P^s_{u,v}            ▷ add selected paths to G_on
14:  update capacity list C
15: end for

Algorithm 2 Path Deletion (r, t)
Input: Flow list F(E_L), active network G_on = (V_on, E_on), link capacity list C.
Output: Active network G'_on = (V'_on, E'_on), where V'_on ⊆ V_on ⊆ V, E'_on ⊆ E_on ⊆ E.
1: for all F_i ∈ F(E_L) do
2:   (u, v) := source-destination pair for F_i
3:   P_{u,v} := all paths between (u, v)
4:   p^m_{u,v} := arg max_{p ∈ P_{u,v}} c_t(p)
5:   G_on := G_on − {p^m_{u,v}}          ▷ remove path from G_on
6: end for

Figure 2 illustrates an example scenario of routing and active nodes in a data center network. In this figure, there are two flows, going from src1 → dst1 and src2 → dst2. Conventional flow scheduling algorithms schedule flows in a manner that spreads the flows across the network. In contrast, our algorithm selects paths such that there is maximum overlap between the paths from different sources and destinations. In Figure 2(a) there are 9 active switches, but in Figure 2(b) there are 7 active switches and the two flows share a common link.

We now analyze the time complexity of Algorithm 1. The minimum-capacity paths in lines 6-14 between a source-destination pair (u, v) can be computed using depth-first or breadth-first search on the network G. And all the paths from a single source u to different destinations can be computed using only one search traversal. Hence the time required in step 4 is O(|V| + |E|). Note that we can pre-compute the set of all paths between each pair (u, v) of hosts, so that we do not need to compute step 4 at every invocation of Algorithm 1. Hence the sorting in step 5 dominates the time complexity, which is O(|E| log |E|).

B. Path Update

When the flow level on a link crosses a threshold limit, we update the active network topology G_on. Before the update operation, the link capacities of G_on and G are updated from the network, and the link capacities of the inactive links in G are set to their corresponding maximum capacities. There are two threshold limits: (i) the upper threshold and (ii) the lower threshold. We discuss the path update procedure for the two cases below.

1) Path Update for Upper Threshold: Suppose the flow level through the link set E_U crossed the upper threshold, and F(E_U) is the set of flows using those links. We need to add links to the active network G_on so that the updated network has enough capacity to support the flows F(E_U). We add the most overlapping paths to the network. We invoke Algorithm 1 (Path Selection) with the flow list F(E_U) to compute new paths and add them to G_on.

2) Path Update for Lower Threshold: Suppose the flow level through the link set E_L crossed the lower threshold, and F(E_L) is the set of flows using those links. We need to remove links from G_on so that the updated network does not have more spare capacity than needed to support the flows F(E_L). For each pair (u, v) in the flow list F(E_L), we remove only one least overlapping path from the network. We invoke Algorithm 2 (Path Deletion) with the flow list F(E_L) to compute the paths to be removed and delete them from G_on.

The algorithms for updating paths are applied only to the affected flows that have crossed the threshold. Step 4 in Path Selection and step 3 in Path Deletion are pre-computed. Thus we only need to sort the paths for the source-destination pairs of the affected flows. In a data center network, flow rates do not change very frequently. By keeping threshold margins, we avoid frequent updates of the paths. We also do not need to predict flow rates since we provide redundancy


in the algorithm. Since we add paths as soon as the threshold is crossed, there is no latency/performance overhead. In a real network, the algorithm may need to wait for capacity to come online. This will not affect the performance/latency of the flows, as there are redundant paths in the network to re-route the affected flows. In the algorithm, we also do not make any assumption about the underlying flow scheduling mechanism in the network. Hence our algorithm can work on top of any flow scheduler. Our algorithm can also be combined with the flow scheduling algorithm to better exploit the topology during flow scheduling.

IV. SIMULATOR

We implemented a discrete event simulator for simulating the flow-level TCP traffic of a data center network. The existing packet-level simulators (e.g. ns2) are not suitable for simulation of traffic in data center networks because of the huge number of packets generated at any time in a decent-sized network [11]. Hence we implemented a flow-level simulator. The simulation events occur when a flow starts or ends. If multiple flows start and end at the same time, they are aggregated to form a single event. At each event, the simulator performs three tasks: (i) ECMP (Equal Cost Multipath), (ii) max-min fair share, (iii) right-size network. Then we transfer flows, update each flow value and determine the next flow event. Each of these steps is described below.
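The event structure described above (flow start/end events, simultaneous events aggregated, three tasks per event) can be sketched as a minimal priority-queue loop. This is an illustrative sketch, not the paper's simulator; the three task functions are stand-in no-ops:

```python
# Sketch of the simulator's discrete-event loop: events are flow start/end
# times; simultaneous events are batched; at each event the three tasks
# (ECMP, MMF, right-sizing) run in order. Flow data here is hypothetical.
import heapq

def simulate(flows, ecmp, mmf, rightsize):
    """flows: list of (start, stop, flow_id). Returns processed event times."""
    events = []                       # min-heap of (time, kind, flow_id)
    for start, stop, fid in flows:
        heapq.heappush(events, (start, "start", fid))
        heapq.heappush(events, (stop, "end", fid))
    processed = []
    while events:
        t = events[0][0]
        batch = []                    # aggregate all events at the same time
        while events and events[0][0] == t:
            batch.append(heapq.heappop(events))
        ecmp(batch); mmf(batch); rightsize(batch)   # the three tasks
        processed.append(t)
    return processed

noop = lambda batch: None
times = simulate([(0, 5, "f1"), (0, 3, "f2"), (2, 7, "f3")], noop, noop, noop)
print(times)   # [0, 2, 3, 5, 7]
```

Flow-level simulation advances directly from event to event, which is what makes it tractable where packet-level simulators like ns2 are not.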

Fig. 3. Block diagram showing the execution steps of the simulator along with the right-sizing algorithm. The dashed box is the emulation of the hardware.

A. Equal Cost Multipath (ECMP)

Equal Cost Multipath (ECMP) forwarding is widely used for flow scheduling in data center networks in order to take advantage of the multiple paths between a pair of source and destination in a data center topology. For each pair of hosts, we compute all the equal-cost shortest paths. Then, for each flow from the source, we pick one of the shortest paths by hashing on (flowid, src, dst). Thus the flows from a source to a destination are distributed among different paths, and a flow is not split among multiple paths. ECMP also ensures a path assignment for each flow.
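The hash-based pick can be sketched as follows (illustrative only; the hash function and path names are assumptions, not the paper's implementation):

```python
# Sketch of the ECMP pick described above: a flow is mapped to one of its
# precomputed equal-cost shortest paths by hashing (flowid, src, dst),
# so the same flow always takes the same path and is never split.
import hashlib

def ecmp_pick(paths, flowid, src, dst):
    """Deterministically pick one path for a flow from its equal-cost set."""
    key = f"{flowid}:{src}:{dst}".encode()
    digest = hashlib.md5(key).digest()              # stable across runs
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]

shortest_paths = ["via-agg1-core1", "via-agg1-core2",
                  "via-agg2-core3", "via-agg2-core4"]
p = ecmp_pick(shortest_paths, flowid=42, src="h1", dst="h9")
# The same flow always hashes to the same path:
print(p == ecmp_pick(shortest_paths, flowid=42, src="h1", dst="h9"))  # True
```

Determinism matters for TCP: hashing per flow (rather than per packet) avoids the packet reordering that flow splitting would cause.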

Figure 3 illustrates the block diagram of the simulator along with the algorithm. The simulator first invokes the Path Selection algorithm. Then it simulates the network by ECMP and MMF and assigns a path and bandwidth to each flow. Then it checks whether the upper or lower threshold is crossed. If either threshold is crossed, it applies the Path Update algorithm to resize the network for the flows that have crossed the threshold. When the topology contains enough nodes and links such that none of the links crosses the threshold, the simulator advances to the next event. If the next event is a flow-arrival event, then the simulator repeats by selecting new paths for the newly starting flows. Otherwise, if it is a flow-ending event, the simulator terminates the flow and checks the thresholds. Thus the simulator executes until all flows are finished. In a real network, the two steps in the dashed box in Figure 3 would be computed in the hardware.
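The threshold check that drives the resizing step can be sketched as a small controller hook. This is an illustrative sketch under assumed threshold values (the paper does not specify them here); the add/remove callbacks stand in for invocations of Algorithms 1 and 2:

```python
# Sketch of the event-driven threshold check: classify links whose
# utilization crossed the upper/lower threshold, then dispatch to the
# path-update procedures. Thresholds and data are hypothetical.

UPPER, LOWER = 0.9, 0.1   # assumed fractions of link capacity

def classify_links(load, capacity):
    """Return (over, under): links whose load crossed a threshold."""
    over = [e for e in load if load[e] > UPPER * capacity[e]]
    under = [e for e in load if load[e] < LOWER * capacity[e]]
    return over, under

def on_event(load, capacity, add_paths, remove_paths):
    """One controller step: add capacity for overloaded links (Algorithm 1),
    reclaim capacity from under-utilized ones (Algorithm 2)."""
    over, under = classify_links(load, capacity)
    if over:
        add_paths(over)
    if under:
        remove_paths(under)
    return over, under

# Toy example with two links of capacity 10: e1 is overloaded, e2 is idle.
cap = {"e1": 10.0, "e2": 10.0}
load = {"e1": 9.5, "e2": 0.5}
events = on_event(load, cap, add_paths=lambda es: None,
                  remove_paths=lambda es: None)
print(events)   # (['e1'], ['e2'])
```

Keeping a margin between the two thresholds is what prevents the controller from oscillating between adding and removing the same paths.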

B. Max-Min Fair Share (MMF)

The max-min fair share algorithm allocates bandwidth to each flow along the path determined by ECMP. A link can be assigned to many flows, so the bandwidth of the link is distributed across all the flows sharing it. At each link, bandwidth is allocated to the candidate flows in monotonically increasing order of their respective demands. The rate of a flow is the minimum assigned flow rate on any link along its path. For our simulation, we use the max-min fair share (MMF) algorithm for the whole network and globally compute the bandwidth allocation for each link. We use MMF to allocate bandwidth because the common transport layer protocol, TCP, tries to achieve an MMF allocation of bandwidth among flows.
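The per-link allocation in increasing order of demand can be sketched as follows (an illustrative single-link version, not the paper's network-wide implementation; demands and capacity are toy values):

```python
# Sketch of max-min fair allocation on one link: consider flows in
# increasing order of demand; each gets its demand or an equal share of the
# remaining capacity, whichever is smaller. Toy data, not the paper's code.

def mmf_link(demands, capacity):
    """demands: {flow: demand}. Returns {flow: allocated rate}."""
    alloc, remaining = {}, capacity
    pending = sorted(demands.items(), key=lambda kv: kv[1])  # smallest first
    for i, (flow, demand) in enumerate(pending):
        fair_share = remaining / (len(pending) - i)  # equal split of residue
        alloc[flow] = min(demand, fair_share)
        remaining -= alloc[flow]
    return alloc

# Three flows share a 10-unit link: the small flow is fully satisfied and
# the two large flows split the remainder equally.
print(mmf_link({"f1": 2, "f2": 8, "f3": 8}, capacity=10))
# {'f1': 2, 'f2': 4.0, 'f3': 4.0}
```

Running this per link and then taking the minimum allocation along each flow's path mirrors the global MMF computation described above.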

C. Right-size Network

Each switch keeps track of whether the total allocated bandwidth on any of its links has crossed the threshold limit. If the limit is crossed or a flow ends, the simulator invokes the path update procedures to adjust the topology. If a new flow arrives, the simulator invokes the Path Selection algorithm to select new paths for the arriving flows.

V. EVALUATION

In this section, we evaluate our method based on energy reduction with respect to the state-of-the-art. We use an energy model for switches and links to compute energy savings for the network. The energy savings depend on the traffic patterns, the level of desired redundancy and the size of the data center itself. We vary all three parameters in the evaluation. Energy, performance and robustness all depend heavily on the nature of the traffic pattern. We now explore the possible energy savings over communication patterns generated from a real production data center workload.

A. Flow Traces

There are no publicly available flow traces from data center networks. Due to the size of the workload traffic, a huge number of


flows are generated in a reasonably sized data center network. Hence it is not practical to capture all the flows from a production data center network. Given the scarcity of flow traces, previous studies [4], [5] have generated flows from workload traces or used patterns of synthetic traces to analyze the worst case or the expected case. In this paper, we generate flow traces from the MapReduce traces available at [14]. The MapReduce traces came from a Facebook data center running jobs on a cluster of 600 machines over a period of one day (24 hours). Elastic Tree [4] uses traces from an e-commerce website, which are service-type requests with short flows. In contrast, we look at cluster computing jobs, which consist of a mixture of short and long flows (interactive and batch jobs). CARPO [5] uses HTTP traces from Wikipedia to generate flows, assigning flow sizes using the filenames in the requests. In this paper, we generate flows from more realistic MapReduce traces. We now describe the method for generating flow traces from the MapReduce traces. The MapReduce performance model is illustrated in Figure 4. A MapReduce job J is defined in [15] as a 7-tuple (S, S', S'', X, Y, f(x), g(x)), where S is the size of the input data (map stage); S' is the size of the intermediate data (shuffle stage); S'' is the size of the output data (reduce stage); X is the number of mappers that J is divided into; Y is the number of reducers assigned to J for output; f(x) is the running time of a mapper vs. the size x of its input; and g(x) is the running time of a reducer vs. the size x of its input. We compute the number of map and reduce tasks by dividing the input size S and the output size S'' by the HDFS (Hadoop Distributed File System) block size, respectively. The HDFS block size is typically 64MB. We now determine the number of flows for the MapReduce job. Since map and reduce tasks are placed close to the nodes where the data are stored in HDFS, we consider only the flows corresponding to the shuffle stage.
There are XY transfers in this stage, each corresponding to a flow from a map node to a reduce node. The size of each flow is S'/(XY). Let V_n be the network transfer rate. Then the duration of the flow can be estimated from the starttime, which is the job release time, and stoptime = starttime + S'/(XY·V_n). For our calculation, we used V_n = 10MB/sec. Since we do not know the placement of the map and reduce tasks, we randomly pick hosts for the source and destination of each flow.
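The shuffle-flow generation described above can be sketched in a few lines. This is an illustrative sketch (not the paper's trace generator): the job parameters and host names are made up, and sizes are handled in MB:

```python
# Sketch of shuffle-flow generation: a job contributes X*Y flows, each of
# size S'/(X*Y) (MB), released at the job's start time and lasting
# size / V_n seconds. Hosts are picked randomly since placement is unknown.
import random

V_N = 10.0   # network transfer rate used in the paper, MB/sec

def shuffle_flows(release_time, s_prime_mb, x, y, hosts, rng):
    """Return (src, dst, size_mb, start, stop) tuples for one job."""
    size = s_prime_mb / (x * y)          # MB moved per map-reduce pair
    stop = release_time + size / V_N     # duration = size / rate
    flows = []
    for _ in range(x * y):
        src, dst = rng.sample(hosts, 2)  # random source/destination hosts
        flows.append((src, dst, size, release_time, stop))
    return flows

hosts = [f"h{i}" for i in range(16)]
flows = shuffle_flows(release_time=0.0, s_prime_mb=640.0, x=4, y=4,
                      hosts=hosts, rng=random.Random(0))
print(len(flows))    # 16 flows (X*Y)
print(flows[0][2])   # 40.0 MB each (640 / 16)
```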

Fig. 4. MapReduce performance model.

Generating flows using the above method from the MapReduce traces gives 107389 flows, with flow sizes varying from bytes to hundreds of MBytes. This is a more realistic mixture of large flows and small flows than those used in the literature. Since the trace yields a lot of flows, we simulate the first one hour, covering 78 flows. The flows are started when the MapReduce job corresponding to each flow is released. Since the flows are released at different times, the utilization of the data center network varies over time.

B. Power Model Parameters

The typical power consumption of a switch is 67.7W, and the power consumption of each active port is 2.0W [5]. For the experiments, we use α_s = 67.7W, α_p = 2W, β_s = 0 and β_p = 0. Let E_e and E_p be the total energy consumption of the two algorithms: traffic consolidation in Elastic Tree and path consolidation in our algorithm. Then the energy reduction of our algorithm with respect to Elastic Tree is

  Red_ep = (E_e − E_p)/E_e × 100%    (13)

The energy reduction of our algorithm with respect to the original network is

  Red_op = (E_o − E_p)/E_o × 100%    (14)

where E_o is the total energy consumption when all the switches are active. For a pod-k fat tree, the total number of switches is (k/2)² + k², and we can compute E_o from Equation (4) as

  E_o = α_s Σ_{t=1}^{T} ((k/2)² + k²) + T β_s + 2α_p Σ_{t=1}^{T} n_t^link + T β_p    (15)

C. Results and Discussion

When the network utilization is 100%, clearly there are no energy savings, due to the lack of any shutdown opportunities. Our right-sizing algorithm uses the variation in network utilization to reduce energy use.

Energy Consumption: The energy consumption over time using path consolidation from our algorithm and traffic consolidation from Elastic Tree is plotted in Figure 5. Figure 5(a) shows the energy consumption over time in the network of size k = 6 with redundancy parameter r = 1. Figure 5(b) shows the number of active flows over time in the network. The original network without right-sizing always consumes energy, even if there is no flow. The energy consumption actually varies in the original network due to the variation in the number of active ports with the flow variation. The energy consumption in Elastic Tree and in our algorithm varies with the number of active flows in the network. When there are few flows in the network, the energy consumption of Elastic Tree and our algorithm is the same, because there are not enough flows in the network to overlap on common paths. The main energy saving from our algorithm occurs when there are a lot of flows in the network. Our algorithm can overlap flows on common paths when possible, which results in the reduction. Table I depicts the energy consumption and energy reduction of our algorithm with various parameter changes. For a fat tree of size pod-k = 6, our algorithm yields 18.70% total energy reduction with respect to Elastic Tree, calculated using Equation (13). Our algorithm also gives 72.60% energy reduction with respect to the original network according to

our algorithm can choose among more paths and pack the paths more efficiently, as there are more available paths in the network. For reasonable sizes of data center networks, our algorithm shows a 25% reduction in energy with respect to Elastic Tree.

VI. RELATED WORK

Recent studies have focused on reducing energy consumption in data center networks by designing energy-proportional networks. While there are suggestions [16], [17], [18], [19] for reducing energy in general networks, there are not many studies on the power proportionality of data center networks. Recently, Abts et al. [20] proposed power-efficient data center network topologies and used link rate adaptation to dynamically reconfigure the data rate of the links for energy proportionality in the network. Right-sizing of the data center network for energy efficiency was first proposed by Heller et al. [4]. They proposed the 'Elastic Tree' algorithm, which continuously monitors data center traffic conditions and chooses a subset of network elements that must be kept active to meet energy efficiency and fault tolerance goals. Their algorithm is specialized for the fat tree only, and switches use only local information from the 'uplink' and 'downlink' to determine the power state of the links, ports and switches. In contrast, we use global information, as we take into account 'path capacity' for the selection of paths in order to determine the active switches and links. In other words, our algorithm uses more information to adjust the topology, and hence it achieves more energy savings. In terms of redundancy, Elastic Tree only adds extra aggregation or core switches for fault tolerance, without being aware of traffic or paths. Although this includes redundant paths for some of the flows, it does not ensure fault tolerance for every flow. In contrast, we ensure redundancy at the path level for each flow, and thus our algorithm is more robust. Wang et al. [5] proposed an algorithm (CARPO) for traffic consolidation in a data center network that looks at the correlation among the flows. They place negatively correlated flows on the same path and positively correlated flows on different paths. They simulated a small topology with fixed flows with varying rates.
They assume that the flows are long and fixed, which is unrealistic since 80% of the flows in a data center network last less than 10 seconds [21]. Computing correlation among such a huge number of flows and allocating paths based on that correlation within such a short time is not feasible in large data centers. Moreover, computing correlation requires statistical information that is not available in practice. In contrast, our algorithm needs no statistical information or prediction for flows; rather, we place flows on paths based on their dynamic flow demands and the dynamic capacity of the paths. In our experiments, we do not compare our algorithm with CARPO [5], as correlation information about the flow rates is not available. CARPO also used link rate adaptation [22] to adjust the port speed of the switches; incorporating link rate adaptation into our algorithm would further reduce energy consumption. Data center networks are commonly designed to support full bisection bandwidth with redundancy in the topology. None of the above studies took advantage of this topological structure. Moreover, data center networks are generally configured to use Equal-Cost Multipath routing (ECMP) to schedule flows. ECMP tries to spread traffic across the different available paths.
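The contrast with correlation-based consolidation can be made concrete. Below is a minimal sketch of demand-aware path consolidation: each flow is packed onto the tightest already-active path that can still carry its current demand, and a new path is activated only when no active path fits, so the remaining paths (and their switches and links) can be powered off. All names, the unit link capacity, and the best-fit heuristic are illustrative assumptions, not the paper's actual formulation.

```python
def consolidate(flows, paths, capacity=1.0):
    """flows: {flow_id: current demand}, paths: list of path ids.
    Returns ({flow_id: path_id}, set of active path ids)."""
    residual = {}     # remaining capacity of each *active* path
    assignment = {}
    # Packing larger demands first tends to activate fewer paths.
    for fid, demand in sorted(flows.items(), key=lambda kv: -kv[1]):
        # Best-fit: the tightest active path that still has room.
        candidates = [p for p, r in residual.items() if r >= demand]
        if candidates:
            best = min(candidates, key=lambda p: residual[p])
        else:
            # Activate a fresh path; its switches/links must be powered on.
            inactive = [p for p in paths if p not in residual]
            if not inactive:
                raise RuntimeError("demand exceeds total path capacity")
            best = inactive[0]
            residual[best] = capacity
        residual[best] -= demand
        assignment[fid] = best
    return assignment, set(residual)
```

For example, four flows with demands 0.6, 0.5, 0.4, and 0.3 over four unit-capacity paths are consolidated onto only two active paths (0.6 + 0.4 and 0.5 + 0.3), whereas ECMP-style spreading would typically keep all four paths active.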


Fig. 5. The energy consumption over time (a) and the number of active flows (b) in the data center network of size k = 6 and redundancy parameter r = 1.

TABLE I
ENERGY CONSUMPTION AND REDUCTION WITH DIFFERENT PARAMETERS FROM DIFFERENT ALGORITHMS

k   r   Eo (MW)   Ee (MW)   Ep (MW)   Redop (%)   Redep (%)
4   1    4.14      1.29      1.17      71.75        9.36
4   2    4.14      1.36      1.33      67.80        1.93
4   3    4.14      1.34      1.34      67.61        0.00
6   1    9.98      3.36      2.74      72.60       18.70
6   2   10.04      3.64      3.19      68.26       12.32
6   3   10.05      3.63      3.42      65.97        5.74
8   1   16.13      2.74      2.05      87.31       25.37
8   2   16.13      2.76      2.47      84.66       10.38
8   3   16.13      2.72      2.68      83.39        1.52
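The reduction columns appear to follow Redop = 100(Eo - Ep)/Eo and Redep = 100(Ee - Ep)/Ee, where Eo, Ee, and Ep are the energy of the original network, Elastic Tree, and our algorithm, respectively. This is an assumption inferred from the table; a quick check (the tabulated energies are rounded, so the recomputed reductions match only to within a small tolerance):

```python
rows = [  # (k, r, Eo, Ee, Ep, Redop, Redep), energies in MW
    (4, 1, 4.14, 1.29, 1.17, 71.75, 9.36),
    (4, 2, 4.14, 1.36, 1.33, 67.80, 1.93),
    (4, 3, 4.14, 1.34, 1.34, 67.61, 0.00),
    (6, 1, 9.98, 3.36, 2.74, 72.60, 18.70),
    (6, 2, 10.04, 3.64, 3.19, 68.26, 12.32),
    (6, 3, 10.05, 3.63, 3.42, 65.97, 5.74),
    (8, 1, 16.13, 2.74, 2.05, 87.31, 25.37),
    (8, 2, 16.13, 2.76, 2.47, 84.66, 10.38),
    (8, 3, 16.13, 2.72, 2.68, 83.39, 1.52),
]

for k, r, eo, ee, ep, red_op, red_ep in rows:
    # Reduction w.r.t. the original network and w.r.t. Elastic Tree.
    assert abs(100 * (eo - ep) / eo - red_op) < 0.5, (k, r)
    assert abs(100 * (ee - ep) / ee - red_ep) < 0.5, (k, r)
```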

Equation (14). The variation in energy consumption with changes in parameters is discussed below.

Robustness Analysis: To adapt to dynamic variations in traffic flows and to network failures, data center networks incorporate redundancy in the topology. Due to redundancy, the network uses more switches and links than the flow demand strictly requires. In our formulation we used a redundancy parameter r. Varying the redundancy parameter from r = 1 to r = 4, we compare the total energy consumption of Elastic Tree and our algorithm. For this comparison, we do not add any redundancy to Elastic Tree. Even with redundancy, our algorithm achieves more energy savings than Elastic Tree with no redundancy. Figures 6(a) and 6(b) illustrate that the energy consumption of our algorithm increases as we add more redundant paths to the topology. In Figure 6(c), we depict the reductions Redop and Redep, both of which decrease as the redundancy r increases in our algorithm.

Size of Data Center Network: We compare the impact of the size of the fat tree on the energy savings of the different algorithms. A pod-k fat-tree topology consists of k^2/4 core switches, k^2/2 aggregation switches, k^2/2 edge switches, and k^3/4 end hosts. For the given flow traces, we change the number of hosts by varying the parameter k, and then randomly assign the source and destination of each flow from the set of hosts. We compare the energy savings for fat-tree sizes k = 4 to k = 10, as illustrated in Figure 7. For a larger fat tree, the energy consumption of the original network increases, but for the right-sizing algorithms it varies somewhat randomly due to the random selection of hosts for the flows. Although the absolute energy consumption varies, the energy reduction with respect to both the original network and Elastic Tree increases with the size of the fat tree. Figure 7(c) shows that the reduction in energy from our algorithm grows with the size k of the fat tree.
This is because, with larger k, more equal-cost paths become available for consolidation.
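The element counts above follow directly from the pod-k fat-tree construction [6]: (k/2)^2 core switches, k pods of k/2 aggregation and k/2 edge switches each, and k/2 hosts per edge switch. A small sketch (illustrative helper, not part of the paper's simulator):

```python
def fat_tree_counts(k):
    """Switch and host counts of a pod-k fat tree (k must be even)."""
    assert k % 2 == 0, "pod count k must be even"
    core = (k // 2) ** 2             # k^2/4 core switches
    aggregation = k * (k // 2)       # k^2/2 aggregation switches
    edge = k * (k // 2)              # k^2/2 edge switches
    hosts = k * (k // 2) * (k // 2)  # k^3/4 end hosts
    return core, aggregation, edge, hosts

# For k = 4: 4 core, 8 aggregation, 8 edge switches, and 16 hosts.
```

Larger k also increases the number of equal-cost paths between a pair of hosts, which is the source of the growing energy reduction.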



Fig. 6. The energy consumption of our algorithm with respect to the original network (a) and with respect to Elastic Tree (b), and the energy reduction (c) with respect to both, for different redundancy parameters with fat-tree size k = 6.


Fig. 7. The energy consumption of our algorithm with respect to the original network (a) and with respect to Elastic Tree (b), and the energy reduction (c) with respect to both, for different fat-tree sizes with redundancy parameter r = 1.

Hence, algorithms like Elastic Tree that do not restrict the topology prior to routing decisions cannot save much energy. In this paper, we guide ECMP by restricting the topology to the best available paths. Hence our algorithm uses fewer active links and switches than Elastic Tree to schedule flows, and thus saves energy.

VII. CONCLUSION

In this paper, we dynamically turn off unused network elements (switches and links) for energy savings and turn elements back on when the network is over-utilized. This improves the utilization of data center networks, which are generally designed for peak traffic. Our algorithm achieves significant energy savings: ~80% in the fat-tree topology and 20-25% with respect to the Elastic Tree algorithm. Our ongoing work is to design a distributed right-sizing algorithm for data center networks that will not require global monitoring of TCP flows. We would also like to incorporate right-sizing techniques into flow scheduling algorithms for better routing in data center networks.

REFERENCES

[1] Server and Data Center Energy Efficiency, Final Report to Congress, U.S. Environmental Protection Agency, 2007.
[2] L. A. Barroso and U. Hölzle, The case for energy-proportional computing, Computer, vol. 40, no. 12, pp. 33-37, 2007.
[3] T. Benson, A. Anand, A. Akella, and M. Zhang, Understanding data center traffic characteristics, In Proc. of ACM SIGCOMM, 2009.
[4] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, ElasticTree: Saving energy in data center networks, In Proc. of USENIX NSDI, pp 249-264, 2010.
[5] X. Wang, Y. Yao, X. Wang, K. Lu, and Q. Cao, CARPO: Correlation-Aware Power Optimization in Data Center Networks, In Proc. of IEEE INFOCOM, March 2012.
[6] M. Al-Fares, A. Loukissas, and A. Vahdat, A Scalable, Commodity Data Center Network Architecture, In Proc. of ACM SIGCOMM, 2008.
[7] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers, In Proc. of ACM SIGCOMM, 2008.
[8] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, VL2: A scalable and flexible data center network, ACM SIGCOMM Computer Communication Review, 2009.
[9] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, BCube: A high performance, server-centric network architecture for modular data centers, In Proc. of ACM SIGCOMM, 2009.
[10] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, OpenFlow: Enabling innovation in campus networks, ACM SIGCOMM Computer Communication Review, 38(2), 2008.
[11] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, Hedera: Dynamic Flow Scheduling for Data Center Networks, In Proc. of USENIX NSDI, April 2010.
[12] G. Ananthanarayanan and R. H. Katz, Greening the Switch, In Proc. of HotPower, 2008.
[13] S. Kandula, D. Katabi, S. Sinha, and A. Berger, Dynamic Load Balancing Without Packet Reordering, ACM SIGCOMM Computer Communication Review, 37(2), pp 51-62, 2007.
[14] https://github.com/SWIMProjectUCB/SWIM/wiki
[15] Y. Xiao and S. Jianling, Reliable Estimation of Execution Time of MapReduce Program, China Communications, vol. 8(6), pp 11-18, 2011.
[16] M. Gupta and S. Singh, Greening of the Internet, In Proc. of ACM SIGCOMM, pp 19-26, 2003.
[17] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy, and D. Wetherall, Reducing network energy consumption via sleeping and rate-adaptation, In Proc. of USENIX NSDI, 2008.
[18] M. Andrews, A. Fernandez Anta, L. Zhang, and W. Zhao, Routing and scheduling for energy and delay minimization in the power-down model, In Proc. of IEEE INFOCOM, March 2010.
[19] J. Chabarek, J. Sommers, P. Barford, C. Estan, D. Tsiang, and S. Wright, Power Awareness in Network Design and Routing, In Proc. of IEEE INFOCOM, April 2008.
[20] D. Abts, M. Marty, P. Wells, P. Klausler, and H. Liu, Energy proportional datacenter networks, In Proc. of ISCA, 2010.
[21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, The nature of data center traffic: measurements & analysis, In Proc. of ACM SIGCOMM, 2009.
[22] C. Gunaratne, K. Christensen, B. Nordman, and S. Suen, Reducing the energy consumption of Ethernet with adaptive link rate (ALR), IEEE Transactions on Computers, 57(4), pp 448-461, 2008.
