The Zero Touch Network
Bikash Koley, for Google Technical Infrastructure
CNSM 2016

Confidential + Proprietary


For the past 15 years, Google has been building out the largest cloud infrastructure on the planet.


100 billion searches per month on google.com (Source: Google, 2012)

Images by Connie Zhou

A Global Cloud Network


Google Backbone(s)
● Internet-facing backbone, B2: 70+ locations in 33 countries
● Global software-defined inter-DC backbone: B4

Operational scale
● 30,000+ circuits in operation
● Many tens of network element roles
● Dozen+ vendors
● 4M lines of configuration files
● ~30K configuration changes per month
● > 8M OIDs collected every 5 minutes

At scale, stuff breaks!


The Nines and the Outage Budgets
● … for four 9s availability? 99.99% uptime: 4 minutes per month
● … for five 9s availability? 99.999% uptime: 24 seconds per month
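The outage budgets follow directly from the availability target. The sketch below reproduces the arithmetic, assuming a 30-day month (an assumption; the slide does not state the month length). The function name `outage_budget_seconds` is illustrative only.

```python
def outage_budget_seconds(availability_pct: float, days_in_month: int = 30) -> float:
    """Return the allowed downtime per month, in seconds, for an uptime target."""
    seconds_in_month = days_in_month * 24 * 60 * 60
    return seconds_in_month * (1 - availability_pct / 100)


four_nines = outage_budget_seconds(99.99)
five_nines = outage_budget_seconds(99.999)

print(f"99.99%  -> {four_nines / 60:.2f} minutes per month")
print(f"99.999% -> {five_nines:.1f} seconds per month")
```

Under that assumption the budgets come out at roughly 4.3 minutes and 26 seconds per month, in line with the rounded figures of 4 minutes and 24 seconds quoted on the slide.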

Why is high network availability a challenge?
● Velocity of evolution
● Scale
● Management complexity

Google’s Network Hardware Evolves Constantly
[Figure: capacity over time for successive datacenter network generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]

As does the Network Software
[Figure: timeline from 2006 to 2014 of network software systems: Google Global Cache, Watchtower, B4, Andromeda, BwE, Freedome, Jupiter, gRPC, QUIC]

… driven by ever-evolving products


Network Operation is a tradeoff
Traditional network: pick any two of the three
[Figure: triangle with corners reliability, scale, and efficiency; the sides are labeled {scalable, reliable, inefficient}, {unscalable, reliable, efficient}, and {scalable, unreliable, efficient}]
We want all three!

Lessons learned from a decade of high-availability network design

We analyzed over 100 post-mortem reports written over a two-year period

What is a Post-mortem?
● Carefully curated description of a previously unseen failure that had significant availability impact
● Blame-free process
Learn from failures

Where do failures happen?

No one network or plane dominates

How long do the failures last?
● Shorter failures on B2
● Durations much longer than outage budgets

What role does network evolution play?

70% of failures happen when a management operation is in progress


The Zero Touch Network
{reliability, efficiency, scale} are NOT tradeoffs ... if network operation is fully intent-driven
[Figure labels: Reliability, efficiency, scale; Intent-driven Operation]
Evolution is inevitable: design for it!

The Zero Touch Network
● All network operations are automated, requiring no operator steps beyond the instantiation of intent
● Changes applied to individual network elements are fully declarative, vendor-neutral, and derived by the network infrastructure from the high-level network-wide intent
● Any network changes are automatically halted and rolled back if the network displays unintended behavior
● The infrastructure does not allow operations which violate network policies


ZTN Architecture
[Figure: operators submit an intent such as “drain a link” to the Workflow Engine; the Workflow Engine uses the Workflow API to update the Network Model (Topology, Config); the Network Management Layer exchanges configuration, commands, and telemetry with network devices/systems]

Workflow Engine
● The workflow engine executes a goal-seeking workflow graph
● Workflows are expressed in a meta-language
● All interesting metrics of execution are logged
● Workflows have the same test coverage as any software system
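One way to picture a goal-seeking workflow graph is as a set of steps with dependencies, where each step declares the goal it must reach and the engine acts only when that goal is not yet met. The sketch below is a minimal illustration under that assumption; `Step`, `run_workflow`, and the two example steps are hypothetical names, not the engine or meta-language described in the talk.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

State = Dict[str, bool]


@dataclass
class Step:
    name: str
    goal_met: Callable[[State], bool]   # is this step's goal already satisfied?
    action: Callable[[State], None]     # work intended to make goal_met true
    depends_on: Set[str] = field(default_factory=set)


def run_workflow(steps: List[Step], state: State) -> List[str]:
    """Run steps in dependency order, acting only where the goal is not yet met."""
    by_name = {s.name: s for s in steps}
    order = TopologicalSorter({s.name: s.depends_on for s in steps}).static_order()
    log: List[str] = []
    for name in order:
        step = by_name[name]
        if step.goal_met(state):
            log.append(f"{name}: already satisfied")
            continue
        step.action(state)
        if not step.goal_met(state):
            raise RuntimeError(f"{name}: goal not reached, halting workflow")
        log.append(f"{name}: done")
    return log


def drain(state: State) -> None:
    state["drained"] = True


def disable(state: State) -> None:
    state["link_enabled"] = False


steps = [
    Step("disable_link", lambda s: not s["link_enabled"], disable, {"drain_traffic"}),
    Step("drain_traffic", lambda s: s["drained"], drain),
]
state = {"drained": False, "link_enabled": True}
first_run = run_workflow(steps, state)
second_run = run_workflow(steps, state)  # nothing left to do: goals already hold
```

Because every step checks its goal before acting, re-running the same workflow is harmless, and the returned log gives a simple record of what was executed.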

Network intent: intent-based network management
● The workflow engine interacts with the infrastructure over transactional APIs
● Workflow intents are expressed at the network level, as changes to
  ○ Topology
  ○ Config
  ○ Functional calls
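As a rough illustration of a network-level intent applied transactionally, the sketch below treats “drain a link” as a data object and applies a batch of intents to a copy of the model, so that either all of them take effect or none do. The model layout, `DrainLinkIntent`, `apply_intent`, and `commit` are assumptions made for this example and are not the Workflow API from the talk.

```python
import copy
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical network model: link name -> attributes of that link.
Model = Dict[str, Dict[str, object]]


@dataclass(frozen=True)
class DrainLinkIntent:
    link: str


def apply_intent(model: Model, intent: DrainLinkIntent) -> Model:
    """Return a new model with the intent applied; the input is never mutated."""
    if intent.link not in model:
        raise KeyError(f"unknown link: {intent.link}")
    candidate = copy.deepcopy(model)
    candidate[intent.link]["drained"] = True
    return candidate


def commit(model: Model, intents: Iterable[DrainLinkIntent]) -> Model:
    """Apply all intents or none: any failure leaves the original model in place."""
    candidate = model
    try:
        for intent in intents:
            candidate = apply_intent(candidate, intent)
    except KeyError:
        return model
    return candidate


model = {"a-b": {"drained": False}, "b-c": {"drained": False}}
updated = commit(model, [DrainLinkIntent("a-b")])
rejected = commit(model, [DrainLinkIntent("a-b"), DrainLinkIntent("x-y")])
```

The operator only names the link to drain; how that becomes per-device configuration is left to the layers below.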

Network Models
● OpenConfig (www.openconfig.net) for vendor-neutral configuration model
  ○ YANG for data modeling, gRPC as transport
  ○ Both configuration and operational-state models
  ○ BGP, MPLS, ISIS, L2, Optical-transport, ACL, policy...
● “Unified Network Model” for topology
  ○ Protocol Buffer based Google internal schema
  ○ Describes all layer-0/1/2/3 abstractions
[Figure: Update Network model (Topology, Config); config / topology models composed of a base model, an extended model, vendor modifications, and local modifications]

Network Management Services
● Compose full config (vendor-neutral and vendor-specific) from topology/config intent update
● Provide secure transport of full config to network elements (OpenConfig+gRPC)
● Enforce operational policies
  ○ Rate limiting
  ○ Blast radius containment
  ○ Minimum survivable topology
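Two of these policies can be sketched as a pre-check on a drain request: a cap on how many links one change may touch (blast radius containment) and a test that the remaining topology is still connected (one possible reading of minimum survivable topology). The functions `is_connected` and `check_drain`, the `max_links_per_change` parameter, and the connectivity criterion are all assumptions for illustration; rate limiting is left out for brevity.

```python
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]


def is_connected(nodes: Set[str], links: Iterable[Link]) -> bool:
    """Check that every node can reach every other node over the given links."""
    if not nodes:
        return True
    adjacency: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def check_drain(
    nodes: Set[str],
    links: List[Link],
    to_drain: Iterable[Link],
    max_links_per_change: int = 1,
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a request to drain the given links."""
    drained = set(to_drain)
    if len(drained) > max_links_per_change:
        return False, "blast radius: too many links in one change"
    remaining = [link for link in links if link not in drained]
    if not is_connected(nodes, remaining):
        return False, "minimum survivable topology violated"
    return True, "ok"


nodes = {"A", "B", "C"}
links = [("A", "B"), ("B", "C"), ("A", "C")]

single = check_drain(nodes, links, [("A", "B")])
too_wide = check_drain(nodes, links, [("A", "B"), ("A", "C")])
isolating = check_drain(nodes, links, [("A", "B"), ("A", "C")], max_links_per_change=2)
```

Running the check before any configuration is pushed means a violating operation is refused outright rather than detected after the fact.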

Streaming Telemetry
Network state changes are observed by analyzing a comprehensive time-series data stream
● Common schema for operational state data in OpenConfig
● Stream data continuously, with incremental updates
● Efficient, secure transport protocol: gRPC
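The idea of incremental updates can be sketched as a collector that keeps a cache of the latest value per path and folds each update into it, instead of re-reading the full state on every poll. The `Update` record, the `apply_updates` function, and the stale-update rule below are assumptions for illustration; the paths are written in an OpenConfig-like style but the transport (gRPC) is not modeled here.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Update(NamedTuple):
    timestamp: int            # e.g. nanoseconds since epoch
    path: str                 # schema path of the leaf that changed
    value: Optional[object]   # None marks a deleted leaf


def apply_updates(
    state: Dict[str, object],
    last_seen: Dict[str, int],
    updates: Iterable[Update],
) -> None:
    """Fold incremental updates into the cached state, ignoring stale ones."""
    for update in updates:
        if update.timestamp < last_seen.get(update.path, -1):
            continue  # an older update arrived late; keep the newer value
        last_seen[update.path] = update.timestamp
        if update.value is None:
            state.pop(update.path, None)
        else:
            state[update.path] = update.value


ETH0 = "/interfaces/interface[name=eth0]/state/oper-status"
ETH1 = "/interfaces/interface[name=eth1]/state/oper-status"

state: Dict[str, object] = {}
last_seen: Dict[str, int] = {}
apply_updates(state, last_seen, [
    Update(1, ETH0, "UP"),
    Update(2, ETH1, "UP"),
    Update(3, ETH0, "DOWN"),
    Update(2, ETH0, "UP"),    # stale: older than the DOWN already recorded
    Update(4, ETH1, None),    # leaf removed
])
```

Only changed leaves travel over the wire, and the cache always reflects the most recent value seen for each path.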

Workflow Safety
● Ability to automatically check the safety of operations
● Ability to repeatedly validate the network state against the stated intent
● Ability to recognize “bad” network behavior
● Ability to roll back to the original state
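These abilities can be combined into a staged rollout: apply the change to one device at a time, re-validate after each step, and restore every touched device as soon as bad behavior is seen. The sketch below assumes an in-memory dictionary of device configs and a caller-supplied `is_healthy` check; both, along with the name `apply_with_rollback`, are hypothetical stand-ins for real devices and real validation.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(
    devices: Dict[str, Config],
    new_config: Config,
    is_healthy: Callable[[Dict[str, Config]], bool],
) -> bool:
    """Apply new_config device by device; restore all originals on bad behavior."""
    originals: Dict[str, Config] = {}
    for name in sorted(devices):
        originals[name] = dict(devices[name])   # snapshot before touching
        devices[name].update(new_config)
        if not is_healthy(devices):
            # Halt and roll back every device changed so far.
            for changed, old in originals.items():
                devices[changed] = old
            return False
    return True


def make_devices() -> Dict[str, Config]:
    return {name: {"mtu": "1500"} for name in ("r1", "r2", "r3")}


# Simulated "bad" behavior: the network misbehaves once r2 takes the change.
def r2_breaks(devs: Dict[str, Config]) -> bool:
    return devs["r2"]["mtu"] != "9000"


failed = make_devices()
failed_ok = apply_with_rollback(failed, {"mtu": "9000"}, r2_breaks)

succeeded = make_devices()
succeeded_ok = apply_with_rollback(succeeded, {"mtu": "9000"}, lambda devs: True)
```

In the failing case the rollout stops at r2, never reaches r3, and r1 is returned to its original setting.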

Do not treat a change to the network as an exceptional event
Lessons learned from a decade of high-availability network design

Changes are common
↓
Make it safe to evolve the network daily
↓
Scale just-in-time, scale often
↓
Evolve into a Zero Touch Network

References
● B4: Experience With a Globally Deployed Software Defined WAN [SIGCOMM 2013]
● Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network [SIGCOMM 2015]
● Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure [SIGCOMM 2016]
● Andromeda: Google’s cloud networking stack
● OpenConfig: http://www.openconfig.net
● gRPC: http://www.grpc.io
