Capacity planning for the Google backbone network
Christoph Albrecht, Ajay Bangla, Emilie Danna, Alireza Ghaffarkhah, Joe Jiang, Bikash Koley, Ben Preskill, Xiaoxue Zhao
July 13, 2015, ISMP
Multiple large backbone networks
B2: Internet-facing backbone; 70+ locations in 33 countries
B4: Global software-defined inter-datacenter backbone
20,000+ circuits in operation; 40,000+ submarine fiber-pair miles
Multiple layers
Flows: routed on logical links
Logical Topology (L3): router adjacencies
Physical Topology (L1): fiber paths
[Figure: nodes A, B and C drawn at the flow, logical and physical layers]
Failures propagate from layer to layer: an L1 fiber failure removes the L3 links mapped onto that fiber, which in turn affects the flows routed on those links.
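A minimal sketch of this propagation, assuming a hypothetical L3/L1 mapping in which each logical link lists the fiber segments it rides on. The link names, fiber names and the `surviving_l3_links` helper are illustrative and do not come from the talk.

```python
# Hypothetical L3/L1 mapping: each logical link lists the fiber segments it uses.
L3_TO_L1 = {
    ("A", "B"): ["fiber_AB"],
    ("B", "C"): ["fiber_BC"],
    ("A", "C"): ["fiber_AB", "fiber_BC"],  # logical A-C rides over two fiber segments
}


def surviving_l3_links(failed_fibers):
    """Return the logical links whose fiber path avoids every failed fiber."""
    failed = set(failed_fibers)
    return sorted(
        link for link, fibers in L3_TO_L1.items() if failed.isdisjoint(fibers)
    )


# A single fiber cut can remove several logical links at once.
after_cut = surviving_l3_links(["fiber_AB"])
```

In this toy mapping one fiber cut takes down two of the three logical links, which is why the L3/L1 mapping matters for the capacity that remains under failure.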
Multiple time horizons
O(seconds)
• Failure events
• Fast protection/restoration
• Routing changes
• Definitive failure repair: O(hours or days)
O(months)
• Demand variation
• Capacity changes
• Risk assessment
O(years)
• Long-term demand forecast
• Topology optimization and simulation
• What-if business case analysis
Multiple objectives
Strategic objectives: minimize cost, ensure scalability
Service level objectives (SLO): latency, availability
➔ Failures are modeled probabilistically
➔ The objective is defined as points on the probability distribution
➔ Latency example: the 95th percentile of latency from A to B is at most 17 ms
➔ Availability example: 10 Gbps of bandwidth is available from A to B at least 99.9% of the time
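The two SLO examples above can be checked against a discrete distribution of network states. The sketch below assumes each state is a (value, probability) pair; the state values and the helper names `weighted_percentile` and `availability` are hypothetical, chosen only to illustrate the definitions.

```python
def weighted_percentile(states, q):
    """Smallest value v such that P(X <= v) >= q, for (value, probability) pairs."""
    total = sum(p for _, p in states)
    cumulative = 0.0
    for value, p in sorted(states):
        cumulative += p
        if cumulative >= q * total:
            return value
    return max(v for v, _ in states)


def availability(states, required_gbps):
    """Share of probability mass in which at least required_gbps is available."""
    total = sum(p for _, p in states)
    return sum(p for bw, p in states if bw >= required_gbps) / total


# Hypothetical network states for the flow from A to B.
latency_states = [(12.0, 0.90), (16.0, 0.06), (25.0, 0.04)]          # (ms, probability)
bandwidth_states = [(40.0, 0.9950), (10.0, 0.0045), (0.0, 0.0005)]   # (Gbps, probability)

latency_ok = weighted_percentile(latency_states, 0.95) <= 17.0
availability_ok = availability(bandwidth_states, 10.0) >= 0.999
```

Both objectives are points on a distribution rather than worst-case bounds, so a rare state with 25 ms latency or zero bandwidth does not by itself violate the SLO.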
Multiple practical constraints
Example: Routing of flows on the logical graph
➔ A flow can take a limited number of paths
➔ Routing is sometimes not deterministic
➔ There is a time delay to modify routing after a failure happens
Deterministic optimization
Problem
What is the cheapest network that can route flows during a given set of failure scenarios?
➔ L3-only version: physical topology and logical/physical mapping are fixed, decide logical capacity
➔ Cross-layer version: decide physical and logical topology, mapping between the two, and logical capacity
Mixed integer program: building block
Each flow is satisfied during each failure scenario in the given set.
For each failure scenario:
Variables: how the flow is routed
Constraints: for each link, utilization does not exceed the capacity under failure
[Figure: constraint matrix with one block of columns for routing variables and one for capacity variables]
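One way to write this building block down is the path-based program below. It is a sketch under stated assumptions, not the formulation used in the talk. The notation is introduced here for illustration: K is the set of flows with demand d_k, S the set of failure scenarios, E the set of links, c_e the cost of one capacity unit on link e, u_e the integer number of capacity units, a_e^s equal to 1 if link e survives scenario s and 0 otherwise, P_k the candidate paths of flow k, and x_p^s the traffic on path p during scenario s.

```latex
\begin{align*}
\min_{u,\,x}\quad & \sum_{e \in E} c_e\, u_e \\
\text{s.t.}\quad
& \sum_{p \in P_k} x_p^s = d_k
  && \forall k \in K,\ \forall s \in S \\
& \sum_{k \in K} \ \sum_{p \in P_k :\, e \in p} x_p^s \le a_e^s\, u_e
  && \forall e \in E,\ \forall s \in S \\
& x_p^s \ge 0, \qquad u_e \in \mathbb{Z}_{\ge 0}
\end{align*}
```

The routing variables x are duplicated per scenario while the capacity variables u are shared across scenarios, which matches the two column blocks of the matrix.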
Mixed integer program: L3-only
[Figure: constraint matrix. Column blocks: L3 link capacity variables; L3 routing for each L1 failure scenario; L1 variables (equipment placement, line system capacity, ...)]
Mixed integer program: cross-layer
[Figure: constraint matrix. Column blocks: L3 link capacity variables; L3 link capacity under failure variables; L3/L1 mapping variables; L3 routing for each L1 failure scenario; extended L1 variables. The L3/L1 constraints express how the L3/L1 mapping affects the L3 capacity available under L1 failure]
Routing for each failure scenario
Path formulation
For each src-dst flow:
. Multiple paths are generated from src to dst
. One variable per path for the amount of traffic along the path
Edge formulation
For each src-dst flow:
. One variable for the amount of traffic for each link and each direction
. At each node: flow conservation constraint
Edge formulation for single source to multiple destinations flow
. All flows of the same source are combined into one flow with multiple destinations
. For each src-multiple dst flow: (link, direction) variables and flow conservation constraints
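To relate the path and edge formulations, the sketch below converts path variables into (link, direction) variables and checks flow conservation at each node. The function names and the three-node example are hypothetical.

```python
from collections import defaultdict


def edge_flows_from_paths(path_flows):
    """Convert path variables {path: amount} into edge variables {(u, v): amount}."""
    edge_flow = defaultdict(float)
    for path, amount in path_flows.items():
        for u, v in zip(path, path[1:]):
            edge_flow[(u, v)] += amount
    return dict(edge_flow)


def conservation_violations(edge_flow, src, dst, demand, tol=1e-9):
    """Return the nodes where outflow minus inflow differs from the required value."""
    net_out = defaultdict(float)
    for (u, v), amount in edge_flow.items():
        net_out[u] += amount
        net_out[v] -= amount
    # The source emits the demand, the destination absorbs it, other nodes balance.
    required = {src: demand, dst: -demand}
    nodes = set(net_out) | {src, dst}
    return sorted(n for n in nodes if abs(net_out[n] - required.get(n, 0.0)) > tol)


# Hypothetical flow of 10 units from A to C split over two paths.
path_flows = {("A", "B", "C"): 6.0, ("A", "C"): 4.0}
edge_flow = edge_flows_from_paths(path_flows)
violations = conservation_violations(edge_flow, "A", "C", demand=10.0)
```

The path formulation needs one variable per generated path, while the edge formulation needs one per link and direction plus a conservation constraint per node; combining all flows of one source reduces the number of edge-variable sets from one per src-dst pair to one per source.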
Latency constraints
Strict version: edge formulation for single source to multiple destinations, restricted to the shortest path tree
Challenges:
. How to make the constraint less strict?
. How to make it probabilistic?
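For the strict version, the shortest path tree of a source can be computed with Dijkstra's algorithm, and the edge variables of that source are then allowed only on tree edges. The sketch below assumes link latencies in milliseconds on a hypothetical three-node graph; it illustrates the tree computation only, not the optimization model.

```python
import heapq


def shortest_path_tree(graph, source):
    """Dijkstra: return (latency to each node, parent of each node in the tree)."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter route to u was already found
        for v, latency in graph[u].items():
            candidate = d + latency
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    return dist, parent


# Hypothetical link latencies in ms.
graph = {
    "A": {"B": 5.0, "C": 12.0},
    "B": {"A": 5.0, "C": 4.0},
    "C": {"A": 12.0, "B": 4.0},
}
dist, parent = shortest_path_tree(graph, "A")
tree_edges = sorted((p, n) for n, p in parent.items() if p is not None)
```

Here traffic from A to C would have to go through B, because the direct link A-C is not on the tree. This shows why the strict version can be too restrictive: a slightly longer path is forbidden even when it would still meet the latency target.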
Results
Potential cost reduction: cross-layer optimization can reduce cost 2x more than L3-only optimization
Stochastic simulation
Problem
Does a given network meet availability and latency SLOs?
➔ Current network: risk assessment
➔ Hypothetical future networks: what-if analysis
Monte Carlo simulation
Input:
. Demand
. Cross-layer topology
. Failure data
Output: whether each flow meets its SLO
Simulation
➔ Draw many samples from the failure distribution; each sample is a random combination of failures
➔ Derive the corresponding L3 topologies with link capacity and deduplicate them into unique L3 topologies
➔ For each unique L3 topology: evaluate the satisfied demand and latency with the traffic engineering (TE) simulator
➔ Combine the TE results into a satisfied demand and latency distribution
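The sample, deduplicate, evaluate and combine steps can be sketched as follows. The fiber failure probabilities, the L3/L1 mapping and the reachability-based `te_simulator` stand-in are all assumptions made for illustration; a real TE simulator would model routing and capacity rather than connectivity alone.

```python
import random
from collections import Counter

# Hypothetical failure data and L3/L1 mapping.
FIBER_FAILURE_PROB = {"f1": 0.01, "f2": 0.02, "f3": 0.005}
L3_TO_L1 = {("A", "B"): ["f1"], ("B", "C"): ["f2"], ("A", "C"): ["f3"]}


def sample_l3_topology(rng):
    """Draw one random combination of fiber failures and derive the L3 topology."""
    failed = {f for f, p in FIBER_FAILURE_PROB.items() if rng.random() < p}
    return frozenset(
        link for link, fibers in L3_TO_L1.items() if failed.isdisjoint(fibers)
    )


def te_simulator(topology, src, dst):
    """Stand-in for the TE simulator: the flow is satisfied if dst is reachable."""
    adjacency = {}
    for u, v in topology:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return dst in seen


def estimate_availability(num_samples, src, dst, seed=0):
    """Return (estimated availability, number of unique L3 topologies)."""
    rng = random.Random(seed)
    # Deduplicate: each unique topology is evaluated once and weighted by its count.
    counts = Counter(sample_l3_topology(rng) for _ in range(num_samples))
    satisfied = sum(n for topo, n in counts.items() if te_simulator(topo, src, dst))
    return satisfied / num_samples, len(counts)


estimate, unique_topologies = estimate_availability(10_000, "A", "C")
```

Deduplication is what keeps the expensive step affordable: the TE evaluation runs once per unique topology, not once per sample.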
Parallel implementation
Pipeline: random seeds ➔ logical topologies (1 billion samples) ➔ deduplicated logical topologies with number of samples (74 million unique topologies) ➔ flows with satisfied demand and number of samples ➔ flow availability
Bottlenecks marked in the figure: speed (simulation) and memory (two stages)
Scale: 10,000 threads, 2 hours, > 1000x speedup
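The structure of such a pipeline can be sketched with one random seed per worker, local deduplication inside each worker, and a final merge of the per-worker counts. This is an illustration of the pattern only: the failure probabilities and function names are hypothetical, and Python threads are used for brevity even though they do not speed up CPU-bound work in CPython, where processes or distributed workers would be the realistic choice.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

LINK_FAILURE_PROB = {"A-B": 0.01, "B-C": 0.02, "A-C": 0.005}  # hypothetical


def sample_shard(seed, num_samples):
    """One worker: draw samples from its own seed and deduplicate locally."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_samples):
        up_links = frozenset(
            link for link, p in LINK_FAILURE_PROB.items() if rng.random() >= p
        )
        counts[up_links] += 1
    return counts


def parallel_sample(seeds, samples_per_seed, workers=4):
    """Run one shard per seed and merge the deduplicated counts."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        shards = pool.map(lambda s: sample_shard(s, samples_per_seed), seeds)
        for shard in shards:
            merged.update(shard)  # Counter.update adds counts per topology
    return merged


merged = parallel_sample(range(8), samples_per_seed=1000)
```

Passing only seeds to the workers keeps the input small, and merging counts instead of raw samples limits the memory needed for the combined result.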
Results
[Figure: availability over time]
Data and automation are transforming
➔ our decision making
➔ the definition of our business: measurable service quality and guarantees
Stochastic optimization
Problem
What is the cheapest network that can meet SLO?
➔ Probabilistic modeling of failures
➔ SLO = chance constraints
◆ Probability (latency from A to B <= 17 ms) >= 0.95
◆ Probability (satisfied demand from A to B >= 10 Gbps) >= 0.999
Simulation / Optimization loop with scenario-based approach
➔ Deterministic optimization produces an optimized topology for the current set of failure scenarios
➔ Stochastic simulation evaluates that topology and produces a new set of failure scenarios
[Figure: the two steps form a loop]
Greedy approach to meet SLO by optimizing with the smallest number of failure scenarios
Add failure scenarios with
. highest probability
. highest number or volume of flows that miss SLO and are not satisfied during that failure scenario
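A minimal sketch of the loop with the greedy rule, assuming the optimizer and simulator are passed in as callables. Ranking scenarios by probability times missed volume is one possible way to combine the two criteria and is an assumption of this sketch. The toy `optimize` and `simulate` stand-ins treat failure scenarios as mutually exclusive and are not real models.

```python
def design_loop(scenario_prob, optimize, simulate, max_rounds=100):
    """Alternate optimization and simulation, adding one failure scenario per round.

    optimize(chosen)   -> topology that routes all flows in the chosen scenarios
    simulate(topology) -> (slo_met, {scenario: missed volume})
    """
    chosen = []
    for _ in range(max_rounds):
        topology = optimize(chosen)
        slo_met, missed = simulate(topology)
        if slo_met:
            return topology, chosen
        candidates = [
            s for s in scenario_prob if s not in chosen and missed.get(s, 0.0) > 0.0
        ]
        if not candidates:
            break
        # Greedy rule (assumed): probability weighted by the volume that misses SLO.
        chosen.append(max(candidates, key=lambda s: scenario_prob[s] * missed[s]))
    raise RuntimeError("SLO not met with the available failure scenarios")


# Toy stand-ins with hypothetical numbers.
scenario_prob = {"cut_f1": 0.010, "cut_f2": 0.004, "cut_f3": 0.0005}
DEMAND_GBPS = 10.0
TARGET_AVAILABILITY = 0.995


def optimize(chosen):
    # The toy "topology" is just the set of scenarios it was built to survive.
    return frozenset(chosen)


def simulate(topology):
    missed = {s: DEMAND_GBPS for s in scenario_prob if s not in topology}
    availability = 1.0 - sum(scenario_prob[s] for s in missed)
    return availability >= TARGET_AVAILABILITY, missed


topology, chosen = design_loop(scenario_prob, optimize, simulate)
```

In this toy run a single added scenario is enough to reach the target, so the optimizer never has to handle the two rarer scenarios.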
Challenges
Tradeoff between accuracy, optimality, scalability, complexity and speed
Examples
➔ Accuracy: routing convergence
➔ Optimality: better stochastic optimization
➔ Scalability: more failure scenarios
➔ Complexity: explanation of solutions to our users
➔ Speed: repair the topology on the fly (transport SDN)
Thank you
Scenario-based approach
Chance constraint: Probability (satisfied demand from A to B >= 10 Gbps) >= 0.999
Choose a subset of failure scenarios where demand is satisfied such that Sum Probability(failure scenario) >= 0.999
[Figure: probability of each failure scenario, with a 0.001 mark on the probability axis. Demand must be satisfied during all failures in one group and during some failures in the other group]
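One simple way to choose such a subset is to take scenarios in order of decreasing probability until the covered probability reaches the target. This is a sketch of that heuristic with hypothetical scenario names and probabilities; the talk does not state which selection rule is used.

```python
def choose_scenarios(scenario_prob, target):
    """Pick highest-probability scenarios until their total probability >= target."""
    chosen, covered = [], 0.0
    ranked = sorted(scenario_prob.items(), key=lambda item: item[1], reverse=True)
    for name, probability in ranked:
        if covered >= target:
            break
        chosen.append(name)
        covered += probability
    if covered < target:
        raise ValueError("the listed scenarios do not reach the target probability")
    return chosen, covered


# Hypothetical scenario probabilities; "no_failure" is the state with nothing down.
scenario_prob = {
    "no_failure": 0.9900,
    "cut_f1": 0.0060,
    "cut_f2": 0.0032,
    "cut_f1_and_f2": 0.0008,
}
chosen, covered = choose_scenarios(scenario_prob, target=0.999)
```

Taking the most likely scenarios first reaches the target with the fewest scenarios, but not necessarily at the lowest cost, since a rare scenario may be cheap to protect against.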
Accuracy
[Figure: confidence interval for the availability estimator, shown against the target availability and the estimated availability, for a conclusive flow and an inconclusive flow]
The availability calculation is a statistical estimation.
A flow is conclusive if the confidence interval lies entirely on one side of its target availability.
This is used to determine the necessary number of samples.
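The conclusive/inconclusive test can be sketched with a binomial confidence interval. The Wilson score interval and z = 1.96 (roughly a 95% interval) are choices made for this sketch; the talk does not say which interval is used.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def classify(successes, n, target, z=1.96):
    """Compare the interval for the availability estimate with the target."""
    low, high = wilson_interval(successes, n, z)
    if low >= target:
        return "conclusive: meets target"
    if high < target:
        return "conclusive: misses target"
    return "inconclusive: draw more samples"


# Hypothetical counts of samples in which the flow's demand was satisfied.
small_run = classify(999, 1_000, target=0.999)
large_run = classify(999_800, 1_000_000, target=0.999)
```

With 1,000 samples the interval straddles the 0.999 target, while with 1,000,000 samples it lies entirely above it, which is how the interval width drives the number of samples needed.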