The Zero Touch Network
Bikash Koley, for Google Technical Infrastructure
CNSM 2016

Confidential + Proprietary


For the past 15 years, Google has been building out the largest cloud infrastructure on the planet.

100 billion searches per month on google.com (Source: Google, 2012)

Images by Connie Zhou

A Global Cloud Network


Google Backbone(s)
● B2: Internet-facing backbone, 70+ locations in 33 countries
● B4: global software-defined inter-DC backbone

Operational scale
● 30,000+ circuits in operation
● Many tens of network element roles
● Dozen+ vendors
● 4M lines of configuration files
● ~30K configuration changes per month
● > 8M OIDs collected every 5 minutes

At scale stuff breaks!


The Nines and the Outage Budgets
● … for four 9s availability? 99.99% uptime: 4 minutes per month
● … for five 9s availability? 99.999% uptime: 24 seconds per month
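The budgets on this slide follow from simple arithmetic. A minimal sketch, assuming a 30-day month (the function name `outage_budget_seconds` is illustrative, not from the talk): the exact figures are about 4.32 minutes for four 9s and about 25.9 seconds for five 9s, which the slide rounds to 4 minutes and 24 seconds.

```python
def outage_budget_seconds(availability: float, days: float = 30.0) -> float:
    """Allowed downtime, in seconds, for a given availability over `days` days.

    Assumption: a "month" is 30 days; the slide does not state which
    month length it uses.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    period_seconds = days * 24 * 60 * 60
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    for label, availability in [("four 9s", 0.9999), ("five 9s", 0.99999)]:
        budget = outage_budget_seconds(availability)
        print(f"{label}: {budget / 60:.2f} minutes ({budget:.1f} seconds) per 30-day month")
```

Each extra nine divides the budget by ten, which is why a single multi-minute failure can consume several months of a five 9s budget.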

Why is high network availability a challenge?
● Velocity of evolution
● Scale
● Management complexity

Google’s Network Hardware Evolves Constantly

[Figure: capacity (vertical axis) over time (horizontal axis); labeled generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]

As does the Network Software

[Figure: timeline from 2006 to 2014; labeled systems: Google Global Cache, Watchtower, B4, Andromeda, BwE, Jupiter, Freedome, gRPC, QUIC]

… driven by ever-evolving products


Network Operation is a tradeoff

Traditional network: pick any two of the three

[Figure: triangle with corners reliability, scale, and efficiency; the edges are labeled {scalable, reliable, inefficient}, {unscalable, reliable, efficient}, and {scalable, unreliable, efficient}]

We want all three!

Lessons learned from a decade of high-availability network design

We analyzed over 100 post-mortem reports written over a two-year period

What is a Post-mortem?
● Carefully curated description of a previously unseen failure that had significant availability impact
● Blame-free process
● Learn from failures


Where do failures happen?

No one network or plane dominates

How long do the failures last?
● Shorter failures on B2
● Durations much longer than outage budgets

What role does network evolution play?

70% of failures happen when a management operation is in progress


The Zero Touch Network

{reliability, efficiency, scale} are NOT tradeoffs … if network operation is fully intent-driven

[Figure labels: reliability, efficiency, scale; intent-driven operation]

Evolution is inevitable: Design for it!

The Zero Touch Network
● All network operations are automated, requiring no operator steps beyond the instantiation of intent
● Changes applied to individual network elements are fully declarative, vendor-neutral, and derived by the network infrastructure from the high-level network-wide intent
● Any network changes are automatically halted and rolled back if the network displays unintended behavior
● The infrastructure does not allow operations which violate network policies


ZTN Architecture

[Figure: layered architecture. Operators express an intent such as “drain a link” to the Workflow Engine; the Workflow Engine calls the Workflow API to update the network model (Topology, Config); the Network Management Layer exchanges configuration, commands, and telemetry with network devices/systems]

Workflow Engine
● The workflow engine executes a goal-seeking workflow graph
● Workflows are expressed in a meta-language
● All interesting metrics of execution are logged
● Workflows have the same test coverage as any software system
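The slides do not show the workflow meta-language or the engine itself. As a rough illustration of what "goal-seeking" could mean, here is a minimal sketch under stated assumptions: each step declares the facts it requires and the fact it achieves, and the engine keeps running whichever step is currently runnable until the goal fact holds. All names (`Step`, `run_workflow`, the fact strings) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, bool]


@dataclass
class Step:
    name: str
    requires: List[str]               # facts that must already hold
    achieves: str                     # fact this step establishes
    action: Callable[[State], None]   # placeholder for real work


def run_workflow(steps: List[Step], state: State, goal: str) -> List[str]:
    """Run runnable steps until `goal` holds; return the execution log."""
    log: List[str] = []
    while not state.get(goal, False):
        runnable = [
            s for s in steps
            if not state.get(s.achieves, False)
            and all(state.get(r, False) for r in s.requires)
        ]
        if not runnable:
            raise RuntimeError(f"goal {goal!r} is unreachable from the current state")
        step = runnable[0]
        step.action(state)
        state[step.achieves] = True
        log.append(step.name)         # stands in for execution metrics
    return log


def demo() -> List[str]:
    def noop(state: State) -> None:
        """Stand-in for a real device or API call."""

    steps = [
        Step("mark link drained", ["traffic_shifted"], "link_drained", noop),
        Step("shift traffic away", [], "traffic_shifted", noop),
    ]
    return run_workflow(steps, {}, goal="link_drained")
```

Because steps are selected by their preconditions rather than by list position, `demo()` shifts traffic before marking the link drained even though the steps are declared in the opposite order. A plain function like this is also easy to unit test, in the spirit of the last bullet.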

Network intent
● Intent-based network management
● The workflow engine interacts with the infrastructure over transactional APIs
● Workflow intents are expressed at the network level, as changes to
  ○ Topology
  ○ Config
  ○ Functional calls

[Figure: operators issue “drain a link” to the Workflow Engine, which calls the Workflow API]
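The Workflow API itself is not shown in the slides. The sketch below only illustrates the idea of a transactional, network-level change: an intent such as “drain a link” edits a draft copy of the topology model, and the draft replaces the live model only if every edit succeeds. `NetworkModel`, `transaction`, and `drain_links` are hypothetical names.

```python
import copy
from contextlib import contextmanager
from typing import Dict, Iterator, List

Links = Dict[str, Dict[str, bool]]


class NetworkModel:
    """Hypothetical in-memory topology model: link name -> attributes."""

    def __init__(self, links: Links) -> None:
        self.links = links

    @contextmanager
    def transaction(self) -> Iterator[Links]:
        """Yield a draft copy; commit it only if the block raises no error."""
        draft = copy.deepcopy(self.links)
        yield draft
        self.links = draft  # not reached if the with-block raised


def drain_links(model: NetworkModel, names: List[str]) -> None:
    """Express "drain these links" as one all-or-nothing model change."""
    with model.transaction() as draft:
        for name in names:
            if name not in draft:
                raise KeyError(f"unknown link {name!r}")
            draft[name]["drained"] = True
```

If any link in the request is unknown, the exception propagates and the live model is left untouched, so a half-applied intent is never visible to the rest of the system.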

Network Models
● OpenConfig (www.openconfig.net) for vendor-neutral configuration model
  ○ YANG for data modeling, gRPC as transport
  ○ Both configuration and op-state models
  ○ BGP, MPLS, ISIS, L2, Optical-transport, ACL, policy...
● “Unified Network Model” for topology
  ○ Protocol Buffer based Google internal schema
  ○ Describes all layer-0/1/2/3 abstractions

[Figure: the network model update feeds the config / topology models (Topology, Config); figure labels: base model, extended model, local modifications, vendor modifications]
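To make "vendor-neutral configuration model" concrete, here is a small sketch that flattens a nested intent into path/value pairs shaped like OpenConfig interface paths. The tree follows the general `interfaces/interface/config` layout of the public OpenConfig interfaces model; the interface name `eth1`, the values, and the `flatten` helper are assumptions made for illustration, and no real OpenConfig or gRPC library is used.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten(tree: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for each leaf of a nested dict, in insertion order."""
    for key, value in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical intent for one interface; the same tree applies to any vendor.
interface_intent = {
    "interfaces": {
        "interface[name=eth1]": {
            "config": {
                "description": "to-cluster-a",
                "enabled": False,
                "mtu": 9000,
            }
        }
    }
}

if __name__ == "__main__":
    for path, value in flatten(interface_intent):
        print(path, "=", value)
```

The point of the design is that the path names carry the meaning, so the same update can be sent to any device that implements the model, with vendor-specific details layered on separately.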

Network Management Services
● Compose full config (vendor-neutral and vendor-specific) from topology/config intent update
● Provide secure transport of full config to network elements (OpenConfig+gRPC)
● Enforce operational policies
  ○ Rate limiting
  ○ Blast radius containment
  ○ Minimum survivable topology
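The slides name the policies but not how they are evaluated. The sketch below shows one plausible reading of two of them, under stated assumptions: "blast radius containment" as a cap on the fraction of links a single change may touch, and "minimum survivable topology" as a requirement that the remaining links keep every node connected. Rate limiting is omitted. `check_drain`, `max_fraction`, and the violation strings are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]


def is_connected(nodes: Set[str], links: Iterable[Link]) -> bool:
    """True if every node can reach every other node over `links`."""
    if not nodes:
        return True
    neighbors: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary node
        for nxt in neighbors[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


def check_drain(
    nodes: Set[str],
    links: Set[Link],
    to_drain: Set[Link],
    max_fraction: float = 0.25,
) -> List[str]:
    """Return policy violations for draining `to_drain`; empty means allowed."""
    violations: List[str] = []
    if len(to_drain) > max_fraction * len(links):
        violations.append("blast radius exceeded")
    if not is_connected(nodes, links - to_drain):
        violations.append("minimum survivable topology violated")
    return violations
```

For a three-node triangle, draining one link with `max_fraction=0.5` passes both checks, while draining both links of one node trips both policies. An infrastructure that refuses any request with a non-empty violation list matches the slide's claim that policy-violating operations are not allowed.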

Streaming Telemetry
Network state changes are observed by analyzing a comprehensive time-series data stream
● Common schema for operational state data in OpenConfig
● Stream data continuously, with incremental updates
● Efficient, secure transport protocol: gRPC
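A minimal sketch of the "incremental updates" idea, with no real gRPC or OpenConfig library involved: the collector keeps a cache of the last known value per path, applies each streamed update to it, and reports only the updates that actually change state. The `Update` record, the `apply_updates` helper, and the example path are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple

Change = Tuple[int, str, Any, Any]  # (timestamp, path, old value, new value)


@dataclass(frozen=True)
class Update:
    timestamp: int           # e.g. nanoseconds since the epoch
    path: str                # schema path of the leaf
    value: Optional[Any]     # None means the leaf was deleted


def apply_updates(cache: Dict[str, Any], updates: Iterable[Update]) -> List[Change]:
    """Apply incremental updates to `cache`; return only the real state changes."""
    changes: List[Change] = []
    for u in updates:
        old = cache.get(u.path)
        if u.value is None:
            cache.pop(u.path, None)
        else:
            cache[u.path] = u.value
        if old != u.value:
            changes.append((u.timestamp, u.path, old, u.value))
    return changes
```

Feeding it `UP`, `UP`, `DOWN` for the same path yields two changes (first sighting and the transition to `DOWN`), which is the kind of time-series signal the slide describes analyzing, as opposed to re-reading every value on a fixed polling interval.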

Workflow Safety
● Ability to automatically check the safety of operations
● Ability to repeatedly validate the network state against the stated intent
● Ability to recognize “bad” network behavior
● Ability to roll back to the original state
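These four abilities combine naturally into an apply, validate, roll back loop. The sketch below is one possible shape, not the mechanism used in the talk: each change is pre-checked for safety, the state is re-validated after every step, and any failure restores the original state. `apply_safely`, `is_safe`, and `is_healthy` are hypothetical names, and the network state is reduced to a plain dictionary.

```python
import copy
from typing import Callable, Dict, List

State = Dict[str, str]


def apply_safely(
    state: State,
    changes: List[State],
    is_safe: Callable[[State], bool],
    is_healthy: Callable[[State], bool],
) -> bool:
    """Apply changes one at a time; halt and restore the original on failure."""
    original = copy.deepcopy(state)

    def roll_back() -> None:
        state.clear()
        state.update(original)

    for change in changes:
        # Pre-check: would the resulting state be acceptable?
        if not is_safe({**state, **change}):
            roll_back()
            return False
        state.update(change)
        # Post-check: does the observed state still look healthy?
        if not is_healthy(state):
            roll_back()
            return False
    return True
```

With a health check such as "at least one link is still up", draining one of two links succeeds, while a sequence that drains both is halted at the second step and the state returns to both links up.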

Do not treat a change to the network as an exceptional event
Lessons learned from a decade of high-availability network design

Changes are common
↓
Make it safe to evolve the network daily
↓
Scale just-in-time, scale often
↓
Evolve into a Zero Touch Network

References
● B4: Experience With a Globally Deployed Software Defined WAN [SIGCOMM 2013]
● Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network [SIGCOMM 2015]
● Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure [SIGCOMM 2016]
● Andromeda: Google’s cloud networking stack
● OpenConfig: http://www.openconfig.net
● gRPC: http://www.grpc.io
