2013 IEEE Sixth International Conference on Cloud Computing

Path Consolidation for Dynamic Right-Sizing of Data Center Networks Muhammad Abdullah Adnan and Rajesh Gupta University of California San Diego {madnan,rgupta}@ucsd.edu Abstract—Data center topologies typically consist of multirooted trees with many equal-cost paths between a given pair of hosts. Existing power optimization techniques do not utilize this property of data center networks for power proportionality. In this paper, we exploit this opportunity and show that significant energy savings can be achieved via path consolidation in the network. We present an offline formulation for the flow assignment in a data center network and develop an online algorithm by path consolidation for dynamic right-sizing of the network to save energy. To validate our algorithm, we build a flow level simulator for a data center network. Our simulation on flow traces generated from MapReduce workload shows ∼80% reduction in network energy consumption in data center networks and ∼25% more energy savings compared to the existing techniques for saving energy in data center networks.

networks, we evaluate our algorithm on the widely used fattree structure [6]. In a pod − k fat-tree topology, there are (k/2)2 possible paths between any pair of hosts. We exploit this opportunity of multiple paths and select the best paths that fulfill our goals for energy reduction. Figure 1(a) shows the Fat-tree topology for data center networks comprising of many redundant links and switches. We dynamically right-size the network to achieve a subset of active nodes and links as shown in Figure 1(b). State-of-the-art method e.g. Elastic Tree [4] for right-sizing the data center network considers flows at each switch and proposes heuristics to determine the power state (on/off) for each switch and adjacent links locally. Unlike the previous method, we leverage the graph-theoretic properties of the fat-tree network to select and update paths globally. Our method takes into account the flows between end hosts and selects the best paths considering the network as a graph. Hence our method is immediately applicable to any topology for data center networks like DCell [7], VL2 [8], BCube [9], etc. Moreover the advent of OpenFlow [10] switches facilitates the collection of flow information and promotes centralized control over flows. In this regard, CARPO [5] proposed traffic consolidation methods for correlation-aware placement of flows on the available paths. But the correlation information among the flow demands is not available in practice during flow scheduling. Instead, we place flows on paths based on their dynamic flow demands and dynamic capacity of the paths. For fault tolerance, we do not use prediction for flow rates rather ensure safety margins by adding redundant paths to the active network.

I. I NTRODUCTION Increase in energy prices and the rise of cloud computing brings up the issue for making data centers energy efficient. A significant portion (∼20%) of the energy consumption in data centers comes from the networking elements. According to an EPA report [1], the networking elements in data centers consumed 3 billion kWh in 2006 and the power demand is rising every year. Hence achieving energy proportionality in data center networks has been a recent focus of research in the area [2]. The architecture of a data center network provides many redundant paths to ensure full bisection bandwidth. But the full bisection bandwidth is only useful when the network utilization is near 100%. Benson et al. [3] have found that in a data center network (e.g. Figure 1(a)), the average link utilization of the aggregation layer links is about 8% during 95% of the time and the link utilization of core layer links are 20-40%. This is due to the large variation of data center traffic during different times of the day. To achieve energy proportionality in the data center network, efforts [4], [5] have been made to dynamically turn on/off subsets of the network elements (switches and links) based on traffic demands. These techniques use predictions on the flow rates and provide heuristics without any bound on the quality of the solutions. In this paper, we exploit the topological characteristics of data center networks and devise an online algorithm to dynamically right-size the network for energy proportionality. Right sizing in a data center network is fundamentally a trade-off between power consumption, fault tolerance and performance. Data center topologies offer many redundant paths between a given pair of hosts. We exploit this degree of freedom for selecting paths from the topology which is an inherent property of data center networks. We choose the paths that lead to better energy savings. We also control the level of redundancy by a parameter in our algorithm to ensure robustness as well as energy efficiency. We devise a formal model for the right-sizing of data center networks. Although our model is for general data center 978-0-7695-5028-2/13 $26.00 © 2013 IEEE DOI 10.1109/CLOUD.2013.106

This paper makes three contributions. First, we present an offline optimization for right-sizing of data center networks with dynamic flow rates. Second, we design an online algorithm to dynamically turn on/off network elements in a data center network. The algorithm does not make any assumption about the underlying flow scheduling mechanism in the network and hence it can work on top of any flow scheduler. Third, we build a flow level simulator to simulate flow traffic in a data center network. Implementing our algorithm on top of the network simulator, we demonstrate ∼20-25% energy reduction with respect to the existing techniques for achieving energy-proportionality of data center networks. The rest of the paper is organized as follows. Section II illustrates our model for a data center network and provides the offline formulation. In Section III, we present the online algorithm for right-sizing of the data center network. Section IV describes the simulator we built for simulating flows in data center networks. In Section V, we present empirical evaluation of our algorithm with respect to the state-of-the-art solutions. Section VI illustrates the related work and Section VII concludes the paper. 581

(a) Fat-tree Fig. 1.

(b) Right-sized Fat-tree

Dynamically turn off subset of links and switches in order to save energy on a pod − k fat-tree.

internal nodes (or switches). This is a common practice for the abstraction of the fat-tree network for flow scheduling [11] as we never turn off any active host and the link connecting the host to an edge switch; and thus enables us to concentrate only on the power consumption from the network elements. In our algorithm, we only consider the flows that are active. The edge switches that do not have flows are essentially turned off.

II. M ODEL F ORMULATION We model the data center network as a graph G = (V, E), where V = S ∪H is the set of all switches S and end hosts H, and E is the set of all links in the network. A path from node u to node v is an ordered sequence e1 , e2 , . . . , el of distinct links where e1 and el have their respective end nodes at u and v and each adjacent links {ej−1 , ej } have a common distinct end node w where w = u = v. The length of the path is l. We denote a path between the nodes u, v as pu,v and the set of all paths between them as Pu,v . Note that by definition a vertex w ∈ V or a link e ∈ E can appear only once in a path pu,v and thus we avoid loops or cycles in a path. Each link e(u, v) ∈ E has a capacity C(e) which represents the maximum bandwidth it can support. Let ct (e) be the available (unused) bandwidth of a link at time t where ct (e) ≤ C(e) for all t. The capacity ct (p) of a path p is the minimum capacity edge along that path i.e. ct (p) = mine∈p ct (e). The capacity of a set of paths P is the summation of the capacities of all the  paths in that set i.e. ct (P ) = p∈P ct (p). A path pu,v is said to have minimum capacity if ct (p) is the minimum among all the paths between nodes u, v at time t. Similarly, a path pu,v is said to have maximum capacity if ct (p) is the maximum among all the paths between nodes u, v at time t.

C. Power Model The power1 consumption in the network is the total power consumed by the network elements: switches, ports and links. The total power consumption at time t is + εport + εlink εt = εswitch t t t


where εswitch , εport and εlink are the power consumption in t t t active switches, ports and links respectively. If nswitch , nport t t link and nt are the number of active switches, ports and links in the network at time t, then εt


αs nswitch + βs + αp nport + βp t t +αl nlink + β l t


where αs , αp , αl , βs , βp , βl are the constants for power consumption. The power consumption in the link is negligible [5], [12] and can be incorporated with the power consumption in the ports. And the total number of ports is twice the total number of links. Hence

A. Flow Specification We denote the set of all flows in the network by F where |F| = K. Each flow Fi is represented by (si , di , fi,t , ti , ti ) where (si , di ) is the source-destination pair, fi,t is the dynamic size in bytes, ti is the start time and ti is the stop time. We do not estimate bandwidth requirement for flows and instead use dynamic flow sizes. This makes our formulation more realistic as bandwidth demand for network flows change over time [4], [11]. A flow is not active outside start and stop time, hence fi,t = 0 for t < ti , t > ti . We represent the flow assignment to a link e and a path p by Fi → e and Fi → p, respectively. Our goal is to assign each flow to a path or a set of paths. Consequently, multiple flows can share a link but the total assigned flow  through a link e(u, v) should be less than the capacity, Fi →e fi,t ≤ C(e), for all t.

+ βs + 2αp nlink + βp εt = αs nswitch t t


The total energy consumption over a time interval of T can be obtained by E=

T  t=1

ε t = αs

T  t=1

nswitch + T βs + 2αp t


nlink + T βp (4) t


D. Offline Optimization We now formulate the offline optimization for the flow assignment problem with dynamic flow rates. In this case, we have complete information about the flows and we optimize energy over a consolidation period T . Finding flow assignment in a general network while not exceeding the capacity of any link is called the Multi Commodity Flow problem. But the problem we are dealing with

B. Model Adaptation for Fat-Tree In the fat-tree topology (e.g. Figure 1(a)), hosts are connected to the edge switches by only one link. To adapt our model to that topology, we consider the edge switches as the end hosts and the aggregation and core switches as the

1 The


energy wastage due to heat density is not considered in this paper.

here is different from the Multi Commodity Flow problem in three aspects. First, the flows are dynamic i.e. the number of active flows change over time. Second, the flow rate for an active flow may change over time. Third, the flows cannot be split as flow splitting is undesirable in TCP due to packet reordering effects [13]. Hence each flow should be assigned to a path which makes the problem a 0-1 (Boolean) flow assignment problem. We formulate the flow assignment problem with binary decision variables associated with each path and switch the links accordingly. The binary decision variable will indicate the power state of each switch and link. The formulation is presented below which minimizes the total network power by solving a 0-1 (Boolean) linear program. Let Xi,u,v,t be the binary decision variable representing the assignment of flow i to a path pu,v at time t. A link (u, v) is active at time t if ∃i, Xi,u,v,t = 1. We have the following constraints on the variable Xi,u,v,t : • The network can sustain all the flows i.e. at any time there is a valid assignment for each flow from its source to destination. Xi,si ,di ,t = 1, •

A flow Fi is not active outside its start time time ti . Xi,u,v,t = 0,

∀i, ti ≤ t ≤ ti

Then the number of active and links at time T switches  link t is given by nswitch = Y = t u∈V u,t and nt t=1 T   K X . Under constraints (5)-(12), the (u,v)∈E t=1 i=1 i,u,v,t objective function to minimize the total energy consumption E from Equation (4) over a consolidation period T can be represented as: min

Xi,u,v,t ,Yu,t

∀i, ∀u, ∀v, ∀t

∀u, ∀w ∈ Vu , ∀t




i=1 w∈Vu

We have constraints on the capacity of links and the path assignment of flows: Capacity Constraints: The total bandwidth requirement for flows assigned to each link cannot exceed the link capacity. (fi,t × Xi,u,v,t ) ≤ C(u, v)

∀(u, v) ∈ E, ∀t



Path Constraints: The following two constraints ensure that each flow is assigned to a path: • An assigned path is connected i.e. if there is a path pu,w and a path pw,v then there is a path pu,v where w ∈ Vu .  1 Xi,u,v,t ≤

(Xi,u,w,t + Xi,w,v,t ) , 2 w∈Vu

∀i, ∀(u, v) ∈ / E, ∀t

Yu,t + T βs K  

Xi,u,v,t + T βp

III. O NLINE A LGORITHM Dynamically turning off elements (switches, links) in a network will affect the flows and packets routing through that network element. Thus dynamic right-sizing is challenging because after right-sizing we need to provide alternate routing paths such that the affected flows can be re-routed through those paths. Also we need to determine how often the network should be resized: periodic vs event-driven (threshold). In this regard, we design an event-driven online algorithm to reduce energy consumption in the data center network. The algorithm guarantees sufficient capacity in the network to schedule the flows at any time such that the network is neither overutilized nor under-utilized. At the same time, the algorithm ensures fault-tolerance and reliability by providing a level of redundancy in the network. Our strategy is to reduce energy consumption by adjusting the topology of the network. We compute sorted lists of most overlapping paths between each pair of hosts. When the total flow rate at any link crosses a pre-specified threshold (upper/lower), we efficiently update the topology by selecting (add/remove) the best overlapping paths that satisfy the flow demand, while keeping margins (r×) for fault tolerance. We call r as the redundancy parameter for our algorithm. We name our approach as ‘path consolidation’ as opposed to ‘traffic consolidation’ in Elastic Tree [4] and ‘link rate adaptation’ in CARPO [5]. Our algorithm consists of two steps: (i) Add paths for new flows in the topology, (ii) Update

If all the links incident to a switch are off, the switch is off. K   Xi,u,w,t , ∀u, ∀t (9) Yu,t ≤


t=1 u∈V T 

Finding the optimal flow assignment by solving the optimization is NP-complete due to the integer (boolean) constraints. While we can use optimization toolbox to solve the binary ILP for small network instances or obtain near optimal solution, it is not applicable in practice for two reasons. First, the formulation assumes that we have perfect knowledge of the flows. But in the data center network flows change unpredictably over time. Second, data center networks consist of huge number of links and switches and solving the ILP on these huge networks takes exponential time. In a real data center network, we neither have knowledge about the incoming traffic nor have information about bandwidth demands. Hence in this paper we provide an online algorithm for reducing energy consumption by determining the set of switches and links to keep active dynamically.

Let Yu,t be the binary decision variable indicating the power state (on/off) of switch u at time t. Then the following two constraints correlate the variables Xi,u,v,t and Yu,t : • If a switch is off, then all the links incident to it are off. Let Vu be the set of switches adjacent to switch u. Xi,u,w,t ≤ Yu,t ,


t=1 (u,v)∈E i=1

All flows and paths are bidirectional. Xi,u,v,t = Xi,v,u,t ,



and stop

∀i, ∀u, ∀v, t < ti , t > ti



(5) ti

Each flow is assigned to exactly one path.  Xi,u,w,t ≤ Xi,u,v,t , ∀i, ∀u, ∀v, ∀t



the paths that are either under utilized or over utilized. Below we discuss the two steps in detail. A. Path Selection For each pair (u, v) of source and destination, we compute a set of paths that support the total flow (bandwidth) demand Fu,v between source u and destination v. To ensure fault tolerance, we choose the paths that are capable of supporting r×Fu,v flow rate. Given the network G = (V, E) with the link s that have capacity list C, our goal is to choose the paths Pu,v minimum capacity as well as supports r × Fu,v flow rate. By selecting the paths that have minimum capacity, we increase the overlapping between paths among different pairs. The process of path selection continues until the flow requirement for all the active source-destination pair is considered. We then compute the new subgraph Gon from the union of the s for all active (u, v) source-destination selected paths Pu,v pairs. We turn off all the links and nodes in the subgraph Gof f = G − Gon .

(a) Equal Cost Multipath

(b) Path Consolidation

Fig. 2. The active nodes and the routes with (a) conventional routing and (b) our algorithm . The active nodes are shown in grey.

Hence the sorting in step 5 dominates the time complexity which is O(|E|log|E|). B. Path Update When the flow level in a link crosses threshold limit, we update the active network topology Gon . Before the update operation, the link capacities of Gon and G are updated from the network; and the link capacities of the inactive links in G are set to the corresponding maximum capacity. There are two threshold limits (i) upper threshold and (ii) lower threshold. We discuss the path update procedure for the two cases below. 1) Path Update for Upper Threshold: Suppose the flow level through the link set EU crossed the upper threshold and F(EU ) is the set of flows using those links. We need to add links to the active network Gon so that the updated network has enough capacity to support the flows F(EU ). We add most overlapping paths into the network. We invoke Algorithm 1 (Path Selection) with the flow list F(EU ) to compute new paths and add them to Gon . 2) Path Update for Lower Threshold: Suppose the flow level through the link set EL crossed the lower threshold and F(EL ) is the set of flows using those links. We need to remove links from Gon so that the updated network does not have spare capacity than needed to support the flows F(EL ). For each pair (u, v) in the flow list F(EL ), we remove only one least overlapping path from the network. We invoke Algorithm 2 (Path Deletion) with the flow list F(EL ) to compute the paths to be removed and delete them from Gon .

Algorithm 1 Path Selection (r, t) Input: Flow list F = ∪{Fi }, Network G = (V, E), Link capacity list C. Output: Active Network Gon = (Von , Eon ) where Von ⊆ V , Eon ⊆ E. 1: Gon = {} 2: for all (u, v) source-destination pairs in F do 3: Fu,v := Total bandwidth requirement for (u, v). 4: Pu,v := All paths between (u, v). 5: Sort Pu,v according to path capacity ct (p), ∀p ∈ Pu,v . s 6: Pu,v := {} 7: for all p ∈ Pu,v do  Select min-capacity paths s s 8: Pu,v = Pu,v ∪p s 9: if C(Pu,v ) ≥ r × Fu,v then 10: break 11: end if 12: end for s 13: Gon = Gon ∪ Pu,v  Add selected paths to Gon 14: Update capacity list C 15: end for

Algorithm 2 Path Deletion (r, t) Input: Flow list F(EL ), Active Network Gon = (Von , Eon ), Link capacity list C.   , Eon ) where Output: Active Network Gon = (Von   Von ⊆ Von ⊆ V , Eon ⊆ Eon ⊆ E. 1: for all Fi ∈ F(EL ) do 2: (u, v) := Source-destination pair for Fi . 3: Pu,v := All paths between (u, v). 4: pm u,v := arg maxp∈Pu,v ct (p). 5: Gon = Gon − {pm  Remove path from Gon . u,v } 6: end for

Figure 2 illustrates an example scenario of routing and active nodes in a data center network. In this figure, there are two flows going from src1 → dst1 and src2 → dst2. Conventional flow scheduling algorithms schedule flows in a manner that spread the flow across the network. In contrast our algorithm selects the paths such that there is maximum overlapping between paths from different sources and destinations. In Figure 2(a), there are 9 active switches but in Figure 2(b) there are 7 active switches and the two flows share a common link. We now analyze the time complexity of Algorithm 1. The minimum capacity paths in lines 6-14 between a pair of source-destination (u, v) can be computed using Depth-first or Breadth-first search on the network G. And all the paths from a single source u to different destinations can be computed using only one search traversal. Hence the time required in step 4 is O(|V | + |E|). Note that we can pre-compute the set of all paths between each pair (u, v) of hosts so that we don’t need to compute step 4 at every invocation of Algorithm 1.

The algorithms for updating paths are applied only to the affected flows that have crossed the threshold. Step 4 in Path Selection and step 3 in Path Deletion are pre-computed. Thus we only need to sort the paths for the source-destination pairs of the affected flows. In a data center network, flow rates do not change very frequently. By keeping threshold margins, we avoid frequent updates of the paths. We also do not need to predict flow rates since we provide redundancy


in the algorithm. Since we add paths as soon as the threshold is crossed, there is no overhead of latency/performance. In a real network, the algorithm may need to wait for capacity to become online. This will not affect the performance/latency of the flows as there are redundant paths in the network to re-route the affected flow. In the algorithm, we also do not make any assumption about the underlying flow scheduling mechanism in the network. Hence our algorithm can work on top of any flow scheduler. Our algorithm can also be combined with the flow scheduling algorithm to better exploit the topology during flow scheduling. IV. S IMULATOR We implemented a discrete event simulator for simulating the flow level TCP traffic of a data center network. The existing packet level simulators (e.g. ns2) are not suitable for simulation of traffic in data center networks because of the huge number of packets generated at any time in a decent sized network [11]. Hence we implemented a flow level simulator. The simulation events occur when a flow starts or ends. If multiple flows start and end at the same time, they are aggregated to form a single event. At each event, the simulator performs three tasks: (i) ECMP (Equal Cost Multipath), (ii) Max-min fair share, (iii) Right-size Network. Then we transfer flows, update each flow value and determine the next flow event. Each of these steps are described below.

Fig. 3. Block diagram showing the execution steps of the simulator along with the right-sizing algorithm. The dashed box is the emulation of the hardware.

A. Equal Cost Multipath (ECMP) Equal Cost Multipath (ECMP) forwarding is used widely for flow scheduling in data center networks in order to take advantage of multiple paths in data center topology between a pair of source and destination. For each pair of hosts, we compute all the equal cost shortest paths. Then for each flow from the source, we pick one of the shortest paths by hashing on (f lowid, src, dest). Thus the flows from a source to a destination are distributed among different paths and a flow is not split among multiple paths. ECMP also ensures path assignment for each flow.

Figure 3 illustrates the block diagram of the simulator along with the algorithm. The simulator first invokes the Path Selection algorithm. Then it simulates the network by ECMP and MMF and assigns path and bandwidth to each flow. Then it checks whether the upper or lower threshold is crossed. If any of the threshold is crossed, then it applies the Path Update algorithm to resize the network for the flows that have crossed the threshold. When the topology contains enough nodes and links such that none of the links crosses the threshold, the simulator advances to the next event. If the next event is a flow arriving event, then the simulator repeats by selecting new paths for the new starting flows. Otherwise, if it is a flow ending event then the simulator terminates the flow and checks for the threshold. Thus the simulator executes until all flows are finished. In a real network, the two steps in the dashed box in Figure 3 will be computed in the hardware.

B. Max-Min Fair Share (MMF) The max-min fair share algorithm allocates bandwidth to each flow along the path determined by ECMP. A link can be assigned to many flows. Hence the bandwidth of the link is distributed across all the flows sharing the link. At each link, the bandwidth is allocated to each candidate flow by monotonic increasing sequence for their respective demands. The minimum flow rate in the path of a flow is the minimum assigned flow rate on any link along that path. For our simulation we use the max-min fair share (MMF) algorithm for a network and globally compute bandwidth allocation for each link. We use MMF to allocate bandwidth because common transport layer protocol i.e. TCP tries to achieve MMF allocation of bandwidth among flows.

V. E VALUATION In this section, we evaluate our method based on energy reduction with respect to the state-of-the-art. We use energy model for switches and links to compute energy savings for the network. The energy savings depend on the traffic patterns, the level of desired redundancy and the size of the data center itself. We vary all the three parameters for the evaluation. Energy, performance and robustness all depend heavily on the nature of traffic pattern. We now explore the possible energy savings over communication patterns generated from a real production data center workload.

C. Right-size Network Each switch keeps track of whether the total allocated bandwidth at any link has crossed the threshold limit. If the limit is crossed or a flow ends, the simulator invokes the path update procedures to adjust the topology. If a new flow arrives, the simulator invokes the Path Selection algorithm to select new paths for the arriving flows.

A. Flow Traces There is no publicly available flow traces from data center networks. Due to the size of workload traffic, huge amount of


flows are generated in a reasonable size data center network. Hence it is not practical to capture all the flows from a production data center network. Given the scarcity of flow traces, previous studies [4], [5] have generated flows from workload traces or use patterns of synthetic traces to analyze the worst case or the expected case. In this paper, we generate flow traces from MapReduce traces available at [14]. The MapReduce traces came from a Facebook datacenter running jobs on a cluster of 600 machines over a period of one day (24 hours). Elastic Tree [4] uses traces from e-commerce website which are service type requests with short flows. In contrast, we look at cluster computing jobs which consists of mixture of short and long flows (interactive and batch jobs). CARPO [5] uses HTTP traces from wikipedia to generate flows by assigning flow sizes using the filenames in the request. In this paper, we generate flows from more realistic MapReduce traces. We now describe the method for generation of flow traces from the MapReduce traces. The MapReduce performance model is illustrated in Figure 4. A MapReduce job J is defined in [15] as a 7 tuple (S, S  , S  , X, Y, f (x), g(x)), where S is the size of input data (map stage); S  is the size of intermediate data (shuffle stage); S  is the size of output data (reduce stage); X is the number of mappers that J is divided into; Y is the number of reducers assigned for J to output; f (x) is the running time of a mapper vs size x of input; g(x) is the running time of a reducer vs size x of input. We compute the number of map and reduce tasks by dividing the input size S and output size S  by the HDFS (Hadoop Distributed File System) block size respectively. HDFS block size is typically 64MB. We now determine the number of flows for the MapReduce job. Since map and reduce tasks are placed close to the nodes where the data are stored in the HDFS, we consider only the flows corresponding to the shuffle stage. There are XY transfers in this stage, each corresponding to a flow from a S . map node to a reduce node. The size of each flow is XY Let Vn be the network transfer rate. Then the rate of the flow can be estimated from the starttime which is the job release  time and stoptime = XYS Vn . For our calculation, we used Vn = 10M B/sec. Since we don’t know the placement of the map and reduce tasks, we randomly pick hosts for the source and destination of the flow.

Fig. 4.

bytes to hundreds of MBytes. This is more realistic mixture of large flows and small flows than those used in the literature. Since the trace yields a lot of flows, we simulate for the first one hour on 78 flows. The flows are started when the MapReduce job corresponding to that flow is released. Since the flows are released at different times, the utilization of data center network varies over time. B. Power Model Parameters The typical power consumption of a switch is 67.7W and the power consumption for each active port is 2.0W [5]. For the experiments, we use αs = 67.7W , αp = 2W , βs = 0 and βp = 0. Let Ee and Ep be the total power consumption from the two algorithms: traffic consolidation in Elastic Tree and path consolidation in our algorithm. Then the energy reduction in our algorithm with respect to Elastic Tree is E e − Ep × 100% (13) Ee The energy reduction in our algorithm with respect to the original network is Redep =

E o − Ep × 100% (14) Eo where Eo is the total energy consumption when all the switches are active. For a pod − k fat tree, the total number of switches is (k/2)2 + k 2 and we can compute Eo from Equation (4) as Redop =

E o = αs

T  t=1

((k/2)2 + k 2 ) + T βs + 2αp


nlink + T βp (15) t


C. Results and Discussion When the network utilization is 100%, clearly there is no energy savings due to lack of any shutdown opportunities. Our right sizing algorithm uses variation in network utilization to reduce energy use. Energy Consumption: The energy consumption over time using path consolidation from our algorithm and traffic consolidation from Elastic Tree are plotted in Figure 5. Figure 5(a) shows the energy consumption over time in the network of size k = 6 and redundancy parameter r = 1. Figure 5(b) shows the number of active flows over time in the network. The original network without right-sizing always consumes energy even if there is no flow. The energy consumption actually varies in the original network due to the variation in the number of active ports with the flow variation. The energy consumption in Elastic Tree and our algorithm varies with the number of active flows in the network. When there are few flows in the network, the energy consumptions from Elastic tree and our algorithm are the same. Because there are not enough flows in the network to overlap in common paths. The main energy saving from our algorithm occurs when there are lot of flows in the network. Our algorithm can overlap flows in common paths when possible, which results in the reduction. Table I depicts the energy consumption and energy reduction from our algorithm with various parameter changes. For a fat tree of size pod − k = 6, our algorithm yields 18.70% total energy reduction with respect to Elastic Tree calculated using Equation (13). Our algorithm also gives 72.60% energy reduction with respect to the original network according to

MapReduce Performance Model.

Generating flows using the above method from the MapReduce traces gives 107389 flows with flow sizes varying from


our algorithm can choose among more paths and pack the paths more efficiently as there are more available paths in the network. For reasonable size of data center networks, our algorithm shows 25% reduction in energy with respect to Elastic Tree. (a) Energy Consumption

VI. R ELATED W ORK Recent studies have focused on reducing energy consumption in data center networks by designing energy proportional networks. While there are suggestions [16], [17], [18], [19] about reducing energy in general networks there are not many studies on power-proportionality of data center networks. Recently Abts et al. [20] proposed power efficient data center network topologies and used link rate adaptation to dynamically reconfigure the data rate of the links for energy proportionality in the network. Right-sizing of data center network for energy efficiency was first proposed by Heller et al. [4]. They proposed ‘Elastic Tree’ algorithm that continuously monitors data center traffic conditions and chooses a subset of network elements that must be kept active for energy efficiency and fault tolerance goals. There algorithm is specialized for fat tree only and switches use only local information from the ‘uplink’ and ‘downlink’ to determine the power state of the links, ports and switches. In contrast we use global information as we take into account ‘path capacity’ for the selection of paths in order to determine active switches and links. In other words our algorithm uses more information to adjust the topology and hence it achieves more energy savings. In terms of redundancy, Elastic Tree only adds extra aggregation or core switches for fault tolerance without being aware of traffic or path. Although this includes redundant path for some of the flows, this does not ensure fault tolerance for every flow. In contrast, we ensure redundancy at path level for each flow and thus our algorithm is more robust. Wang et al. [5] proposed an algorithm (CARPO) for traffic consolidation in a data center network by looking at the correlation among the flows. They place negatively correlated flows on the same path and positively correlated flows on different paths. They simulated on a small topology with fixed flows with varying rates. They assume that the flows are long and fixed which is unrealistic as 80% of the flows last less than 10 sec in a data center network [21]. Computing correlation among these huge number of flows and allocating paths based on the correlation in such a small time is not feasible in large data centers. Also computing correlation requires statistical information which is not available in practice. In contrast, in our algorithm, we do not need any statistical information or prediction for flows rather we place flows on paths based on their dynamic flow demands and dynamic capacity of the paths. In our experiments, we do not compare our algorithm with CARPO [5], as correlation information about the flow rate is not available. CARPO also used link rate adaptation [22] with the algorithm to adjust the port speed of the switch. Incorporating the technique of link rate adaptation with our algorithm will further reduce the energy consumption. Data center networks are commonly designed to support full bisection bandwidth with redundancy in the topology. None of the above studies took advantage of this topological structure. Moreover data center networks are generally configured to use Equal Cost Multipath routing (ECMP) to schedule flows. ECMP tries to spread traffic across different available paths.

(b) Number of Active Flows

Fig. 5. The energy consumption over time (a) and the number of active flows (b) in the data center network of size k = 6 and redundancy parameter r = 1. TABLE I E NERGY C ONSUMPTION AND R EDUCTION WITH DIFFERENT PARAMETERS FROM DIFFERENT ALGORITHMS

k 4 6 8

r 1 2 3 1 2 3 1 2 3

Eo (MW) 4.14 4.14 4.14 9.98 10.04 10.05 16.13 16.13 16.13

Ee (MW) 1.29 1.36 1.34 3.36 3.64 3.63 2.74 2.76 2.72

Ep (MW) 1.17 1.33 1.34 2.74 3.19 3.42 2.05 2.47 2.68

Redop (%) 71.75 67.80 67.61 72.60 68.26 65.97 87.31 84.66 83.39

Redep (%) 9.36 1.93 0.00 18.70 12.32 5.74 25.37 10.38 1.52

Equation (14). The variation in energy consumption with changes in parameters are discussed below. Robustness Analysis: To adapt with dynamic variation in traffic flows and network failures, data center networks incorporate redundancy in the topology. Due to redundancy, the network uses more switches and links than necessary for the flow demand. In our formulation we used a redundancy parameter r. Varying the redundancy parameter from r = 1 to r = 4, we compare the total energy consumption from Elastic Tree and our algorithm. For the comparison, we do not add any redundancy with the Elastic Tree. Even with the redundancy, our algorithm achieves more energy savings than Elastic Tree with no redundancy. Figure 6(a) and 6(b) illustrates that the energy consumption from our algorithm increases as we add more redundant paths to the topology. In Figure 6(c), we depict the reductions Redop and Redep where both of them decreases with the increase of redundancy r in our algorithm. Size of Data Center Network: We compare the impact of the size of fat tree on the energy savings from different algorithms. A pod − k fat tree topology consists of k 2 /4 core switches, k 2 /2 aggregation switches, k 2 /2 edge switches and k 3 /4 end hosts. For the given flow traces, we change the number of hosts by varying the parameter k. Then we randomly assign the source and destination for the flows from the set of hosts. We compare the energy savings for fat tree sizes of k = 4 to k = 10 as illustrated in Figure 7. For larger fat tree, the energy consumption increases for the original network but for the right-sizing algorithms it is somewhat random due to the random selection of hosts for the flows. Although the energy consumption is arbitrary, the energy reduction with respect to both original network and Elastic Tree increases with the size of fat tree. Figure 7(c) shows that the reduction in energy from our algorithm grows with the increase in the size k of fat tree. This is because with the increase in k,


(a) Energy Consumption

(b) Energy Consumption

(c) Energy Reduction

Fig. 6. The energy consumption from our algorithm with respect to the original network (a), Elastic Tree (b) and energy reduction (c) with respect to the original network, Elastic Tree for different redundancy parameter with fat tree size k = 6.

(a) Energy Consumption

(b) Energy Consumption

(c) Energy Reduction

Fig. 7. The energy consumption from our algorithm with respect to the original network (a), Elastic Tree (b) and energy reduction (c) with respect to the original network, Elastic Tree for different sizes of fat tree with redundancy parameter r = 1.

[7] C. Guo, H. Wu, K. Tan, L. Shiy, Y. Zhang, S. Luz DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers, In Proc. of ACM SIGCOMM, 2008. [8] A. Greenberg, J.R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D.A. Maltz, P. Patel, and S. Sengupta, VL2: a scalable and flexible data center network, ACM SIGCOMM Computer Communication Review, 2009. [9] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, BCube: a high performance, server-centric network architecture for modular data centers, In Proc. of ACM SIGCOMM, 2009. [10] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford , S. Shenker , J. Turner, OpenFlow: enabling innovation in campus networks, ACM SIGCOMM Computer Communication Review, 38(2), 2008. [11] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, Hedera: Dynamic Flow Scheduling for Data Center Networks, In Proc. of USENIX NSDI, April 2010. [12] G. Ananthanarayanan and R. H. Katz, Greening the Switch, In Proc. of HotPower, 2008. [13] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic Load Balancing Without Packet Reordering. ACM SIGCOMM Computer Communication Review, 37(2), pp 51-62, 2007. [14] https://github.com/SWIMProjectUCB/SWIM/wiki [15] Y. Xiao, S. Jianling, Reliable Estimation of Execution Time of MapReduce Program, China Communications, Vol. 8(6), pp 11–18, 2011. [16] M. Gupta and S. Singh, Greening of the Internet, In Proc. of ACM SIGCOMM, pp 19–26, 2003. [17] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy and D. Wetherall, Reducing network energy consumption via sleeping and rate-adaptation, In Proc. of USENIX NSDI, 2008. [18] M. Andrews, A. Fernandez Anta, L. Zhang, and W. Zhao. Routing and scheduling for energy and delay minimization in the power-down model. In Proc. of IEEE INFOCOM, March 2010. [19] J. Chabarek, J. Sommers, P. Barford, C. Estan, D. Tsiang, and S. Wright, Power Awareness in Network Design and Routing, In Proc. of INFOCOM, April 2008. [20] D. Abts, M. Marty, P. Wells, P. Klausler, and H. Liu, Energy proportional datacenter networks, In Proc. of ISCA, 2010. [21] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, The nature of data center traffic: measurements & analysis, In Proc. of ACM SIGCOMM, 2009. [22] C. Gunaratne, K. Christensen, B. Nordman and S. Suen, Reducing the energy consumption of Ethernet with adaptive link rate (ALR), IEEE Transactions on Computers, 57(4), pp 448–461, 2008.

Hence algorithms like Elastic Tree that do not restrict the topology prior to routing decisions, cannot save much energy. In this paper, we guide ECMP by restricting the topology to the best available paths. Hence our algorithm uses less number of active links and switches than Elastic Tree to schedule flows and thus save energy. VII. C ONCLUSION In this paper, we dynamically turn off unused network elements (switches and links) for energy savings and turn on elements when the network is over-utilized. This improves the utilization of data center networks which are generally designed for peak traffic. Our algorithm achieves significant energy savings ∼80% in the fat-tree topology and 20-25% with respect to the Elastic Tree algorithm. Our ongoing work is to design a distributed right-sizing algorithm for data center networks which will not require global monitoring of TCP flows. We also would like to incorporate right-sizing techniques into the flow scheduling algorithms for better routing in data center networks. R EFERENCES [1] Server and Data Center Energy Efficiency, Final Report to Congress, U.S. Environmental Protection Agency, 2007. [2] L. A. Barroso, and U. H¨olzle, The case for energy-proportional computing. Computer, vol. 40, no. 12, pp. 33-37, 2007. [3] T. Benson, A. Anand, A. Akella, and M. Zhang, Understanding data center traffic characteristics, In Proc. of ACM SIGCOMM, 2009. [4] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, Elastictree: Saving energy in data cneter networks, In Proc. of USENIX NSDI, pp 249-264, 2010. [5] X. Wang, Y. Yao, X. Wang, K. Lu, Q. Cao, CARPO: CorrelationAware Power Optimization in Data Center Networks, In Proc. of IEEE INFOCOM, March 2012. [6] M. Al-Fares, A. Loukissas, and A. Vahdat, A Scalable, Commodity, Data Center Network Architecture, In Proc. of ACM SIGCOMM, 2008.


Path Consolidation for Dynamic Right-Sizing of ... - Semantic Scholar

is the number of reducers assigned for J to output; f(x) is the running time of a mapper vs size x of input; g(x) is the running time of a reducer vs size x of input. We compute the number of map and reduce tasks by dividing the input size S and output size S by the HDFS (Hadoop Distributed File System) block size respectively.

606KB Sizes 0 Downloads 97 Views

Recommend Documents

implementing dynamic semantic resolution - Semantic Scholar
testing of a rule of inference called dynamic semantic resolution is then ... expressed in a special form, theorem provers are able to generate answers, ... case of semantic resolution that Robinson called hyper-resolution uses a static, trivial mode

Secure Dependencies with Dynamic Level ... - Semantic Scholar
evolve due to declassi cation and subject current level ... object classi cation and the subject current level. We ...... in Computer Science, Amsterdam, The Nether-.

Somatosensory Integration Controlled by Dynamic ... - Semantic Scholar
Oct 19, 2005 - voltage recording and representative spike waveforms (red) and mean ..... Note the deviation of the experimental data points from the unity line.

Allocating Dynamic Time-Spectrum Blocks for ... - Semantic Scholar
The problem of allocating spectrum in cognitive radio net- works poses new challenges that do not arise in traditional wireless technologies, including Wi-Fi.

Marriage and Career: The Dynamic Decisions of ... - Semantic Scholar
I especially want to thank Simcha Srebnik for computer support. I also thank the Maurice Falk ...... White-collar wage mis- match adjustment parameter (a3). J.285.