F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business
Jeff Shute, Mircea Oancea, Stephan Ellner, Ben Handy, Eric Rollins, Bart Samwel, Radek Vingralek, Chad Whipkey, Xin Chen, Beat Jegerlehner, Kyle Littlefield, Phoenix Tong
SIGMOD, May 22, 2012

Today's Talk

F1 - A hybrid database combining the
● Scalability of Bigtable
● Usability and functionality of SQL databases

Key Ideas
● Scalability: Auto-sharded storage
● Availability & Consistency: Synchronous replication
● High commit latency: Can be hidden
  ○ Hierarchical schema
  ○ Protocol buffer column types
  ○ Efficient client code

Can you have a scalable database without going NoSQL? Yes.

The AdWords Ecosystem

One shared database backing Google's core AdWords business.

[Ecosystem diagram: the shared DB is used by the Java "frontend" (advertisers via the SOAP API, web UI, and reports), the C++ "backend" (ad servers, ad approvals, spam analysis, with log aggregation over ad logs), and ad-hoc SQL users.]

Our Legacy DB: Sharded MySQL

Sharding Strategy
● Sharded by customer
● Apps optimized using shard awareness

Limitations
● Availability
  ○ Master/slave replication -> downtime during failover
  ○ Schema changes -> downtime for table locking
● Scaling
  ○ Grow by adding shards
  ○ Rebalancing shards is extremely difficult and risky
  ○ Therefore, limit size and growth of data stored in the database
● Functionality
  ○ Can't do cross-shard transactions or joins

Demanding Users

Critical applications driving Google's core ad business
● 24/7 availability, even with datacenter outages
● Consistency required
  ○ Can't afford to process inconsistent data
  ○ Eventual consistency too complex and painful
● Scale: 10s of TB, replicated to 1000s of machines

Shared schema
● Dozens of systems sharing one database
● Constantly evolving - multiple schema changes per week

SQL Query
● Query without code

Our Solution: F1

A new database,
● built from scratch,
● designed to operate at Google scale,
● without compromising on RDBMS features.

Co-developed with a new lower-level storage system, Spanner.

Underlying Storage - Spanner

Descendant of Bigtable, successor to Megastore

Properties
● Globally distributed
● Synchronous cross-datacenter replication (with Paxos)
● Transparent sharding, data movement
● General transactions (see the sketch below)
  ○ Multiple reads followed by a single atomic write
  ○ Local or cross-machine (using 2PC)
● Snapshot reads
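A minimal Java sketch of the "multiple reads followed by a single atomic write" transaction shape. The Transaction and Row interfaces are hypothetical, for illustration only - they are not Spanner's or F1's actual API; only the read-then-commit structure comes from the slide.

```java
// Hypothetical client-side interfaces, for illustration only.
interface Transaction {
    Row read(String table, Object... primaryKey);    // read at the transaction's snapshot
    void bufferUpdate(String table, Row newValue);    // queued locally, not yet sent
    void commit();                                    // single atomic write (2PC if cross-machine)
}
interface Row {
    long getLong(String column);
    Row with(String column, Object value);
}

class BudgetTransfer {
    // Move budget between two campaigns of the same customer:
    // several reads, then one atomic commit.
    static void transfer(Transaction txn, long customerId,
                         long fromCampaign, long toCampaign, long amount) {
        Row from = txn.read("Campaign", customerId, fromCampaign);
        Row to   = txn.read("Campaign", customerId, toCampaign);

        txn.bufferUpdate("Campaign", from.with("Budget", from.getLong("Budget") - amount));
        txn.bufferUpdate("Campaign", to.with("Budget", to.getLong("Budget") + amount));

        txn.commit();  // all buffered writes applied atomically, or none
    }
}
```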

F1

Architecture
● Sharded Spanner servers
  ○ data on GFS and in memory
● Stateless F1 server
● Pool of workers for query execution

Features
● Relational schema
  ○ Extensions for hierarchy and rich data types
  ○ Non-blocking schema changes
● Consistent indexes
● Parallel reads with SQL or Map-Reduce

[Architecture diagram: F1 client -> F1 server and F1 query workers -> Spanner servers -> GFS]

How We Deploy
● Five replicas needed for high availability
● Why not three?
  ○ Assume one datacenter down
  ○ Then one more machine crash => partial outage

Geography
● Replicas spread across the country to survive regional disasters
  ○ Up to 100ms apart

Performance
● Very high commit latency - 50-100ms
● Reads take 5-10ms - much slower than MySQL
● High throughput

Hierarchical Schema

Explicit table hierarchies. Example:
● Customer (root table): PK (CustomerId)
● Campaign (child): PK (CustomerId, CampaignId)
● AdGroup (child): PK (CustomerId, CampaignId, AdGroupId)

[Storage layout figure: rows stored in primary-key order, with child rows interleaved under their root row - Customer (1), Campaign (1,3), AdGroup (1,3,5), AdGroup (1,3,6), Campaign (1,4), AdGroup (1,4,7), Customer (2), Campaign (2,5), AdGroup (2,5,8).]

Clustered Storage
● Child rows under one root row form a cluster
● Cluster stored on one machine (unless huge)
● Transactions within one cluster are most efficient
● Very efficient joins inside clusters (can merge with no sorting)

[Storage layout figure: same interleaved layout as on the previous slide.]

Rows and PKs 1

1,3

1,3,5

1,3,6

2

1,4

2,5

1,4,7

2,5,8

Customer Campaign AdGroup AdGroup Campaign AdGroup Customer Campaign AdGroup

(1) (1,3) (1,3,5) (1,3,6) (1,4) (1,4,7) (2) (2,5) (2,5,8)

Protocol Buffer Column Types

Protocol Buffers
● Structured data types with optional and repeated fields
● Open-sourced by Google, APIs in several languages

Column data types are mostly Protocol Buffers
● Treated as blobs by underlying storage
● SQL syntax extensions for reading nested fields
● Coarser schema with fewer tables - inlined objects instead

Why useful?
● Protocol Buffers pervasive at Google -> no impedance mismatch
● Simplified schema and code - apps use the same objects (see the sketch below)
  ○ Don't need foreign keys or joins if data is inlined
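A minimal Java sketch of "apps use the same objects": the protocol buffer read from a column is the very message class the application already uses. Whitelist and Feature are assumed to be classes generated from a .proto matching the query on the next slide (a repeated feature field with feature_id and status); F1Row is a hypothetical column accessor. Only the builder-call style mirrors the real protobuf Java API.

```java
// A sketch, assuming generated protocol buffer classes Whitelist and Feature
// (repeated Feature feature; int feature_id; status). F1Row is hypothetical.
class ProtoColumnExample {
    interface F1Row {
        Whitelist getWhitelist();               // proto deserialized straight from the column blob
        void setWhitelist(Whitelist value);     // written back as the same proto
    }

    static void enableFeature(F1Row customerRow, int featureId) {
        // The column value *is* the application object - no ORM mapping layer,
        // no separate Feature table, no foreign keys or joins.
        Whitelist updated = customerRow.getWhitelist().toBuilder()
            .addFeature(Feature.newBuilder().setFeatureId(featureId))
            .build();
        customerRow.setWhitelist(updated);
    }
}
```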

SQL Query
● Parallel query engine implemented from scratch
● Fully functional SQL, joins to external sources
● Language extensions for protocol buffers

SELECT CustomerId
FROM Customer c
  PROTO JOIN c.Whitelist.feature f
WHERE f.feature_id = 302
  AND f.status = 'STATUS_ENABLED'

Making queries fast
● Hide RPC latency
● Parallel and batch execution
● Hierarchical joins

Coping with High Latency

Preferred transaction structure (see the sketch below)
● One read phase: No serial reads
  ○ Read in batches
  ○ Read asynchronously in parallel
● Buffer writes in the client, send as one RPC

Use coarse schema and hierarchy
● Fewer tables and columns
● Fewer joins

For bulk operations
● Use small transactions in parallel - high throughput

Avoid ORMs that add hidden costs
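A minimal Java sketch of the preferred structure: one parallel, batched read phase, then writes buffered in the client and committed as a single RPC. The F1 client API is not public; F1Client, readAsync, and bufferWrite are hypothetical names, with the parallelism coming from the standard CompletableFuture API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class UpdateCampaigns {
    // Hypothetical client interface, for illustration only.
    interface F1Client {
        CompletableFuture<List<Row>> readAsync(String table, Object... keyPrefix);
        void bufferWrite(String table, Row row);   // held in the client until commit()
        void commit();                              // one RPC carrying all buffered writes
    }
    interface Row { Row with(String column, Object value); }

    static void renameCampaigns(F1Client db, long customerId, String suffix) {
        // Read phase: issue all reads up front, in parallel, instead of serially.
        CompletableFuture<List<Row>> campaigns = db.readAsync("Campaign", customerId);
        CompletableFuture<List<Row>> adGroups  = db.readAsync("AdGroup", customerId);
        CompletableFuture.allOf(campaigns, adGroups).join();

        // Write phase: buffer every mutation locally...
        for (Row campaign : campaigns.join()) {
            db.bufferWrite("Campaign", campaign.with("NameSuffix", suffix));
        }
        // ...then send them all in a single commit RPC, paying the 50-100ms once.
        db.commit();
    }
}
```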

ORM Anti-Patterns
● Obscuring database operations from app developers
● Serial reads
  ○ for loops doing one query per iteration (see the sketch below)
● Implicit traversal
  ○ Adding unwanted joins and loading unnecessary data

These hurt performance in all databases. They are disastrous on F1.
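A sketch of the serial-read anti-pattern in Java, contrasted with a single batched read. The query and queryBatch calls are hypothetical data-access methods used only to illustrate the access pattern; the point is the number of round trips, not the API.

```java
import java.util.ArrayList;
import java.util.List;

class SerialReadAntiPattern {
    // Hypothetical data-access calls, for illustration only.
    interface Dao {
        Campaign query(long customerId, long campaignId);                    // one RPC per call
        List<Campaign> queryBatch(long customerId, List<Long> campaignIds);  // one RPC total
    }
    record Campaign(long customerId, long campaignId, String name) {}

    // Anti-pattern: N campaigns => N sequential RPCs.
    // At 5-10ms per read, 100 iterations is already ~1 second of pure latency.
    static List<Campaign> loadSerially(Dao dao, long customerId, List<Long> ids) {
        List<Campaign> result = new ArrayList<>();
        for (long id : ids) {
            result.add(dao.query(customerId, id));   // hidden round trip on every iteration
        }
        return result;
    }

    // Preferred: one batched read, a single round trip regardless of N.
    static List<Campaign> loadBatched(Dao dao, long customerId, List<Long> ids) {
        return dao.queryBatch(customerId, ids);
    }
}
```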

Our Client Library
● Very lightweight ORM - doesn't really have the "R"
  ○ Never uses Relational joins or traversal
● All objects are loaded explicitly
  ○ Hierarchical schema and protocol buffers make this easy
  ○ Don't join - just load child objects with a range read (see the sketch below)
● Ask explicitly for parallel and async reads
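A minimal sketch of loading child objects with a range read over the hierarchical key: because every AdGroup key starts with (CustomerId, CampaignId), one range read over that prefix returns all children without a join. The rangeRead call and the row types are hypothetical, for illustration.

```java
import java.util.List;

class LoadChildren {
    // Hypothetical client interface, for illustration only.
    interface F1Client {
        // Returns all rows whose primary key starts with the given prefix.
        List<AdGroup> rangeRead(String table, Object... keyPrefix);
    }
    record AdGroup(long customerId, long campaignId, long adGroupId) {}

    static List<AdGroup> adGroupsOfCampaign(F1Client db, long customerId, long campaignId) {
        // AdGroup PK is (CustomerId, CampaignId, AdGroupId), so the campaign's children
        // occupy one contiguous key range in the same cluster - no join needed.
        return db.rangeRead("AdGroup", customerId, campaignId);
    }
}
```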

Results

Development
● Code is slightly more complex
  ○ But predictable performance, scales well by default
● Developers happy
  ○ Simpler schema
  ○ Rich data types -> lower impedance mismatch

User-Facing Latency
● Avg user action: ~200ms - on par with legacy system
● Flatter distribution of latencies
  ○ Mostly from better client code
  ○ Few user actions take much longer than average
  ○ Old system had severe latency tail of multi-second transactions

Current Challenges
● Parallel query execution
  ○ Failure recovery
  ○ Isolation
  ○ Skew and stragglers
  ○ Optimization
● Migrating applications, without downtime
  ○ Core systems already on F1, many more moving
  ○ Millions of LOC

Summary

We've moved a large and critical application suite from MySQL to F1.

This gave us
● Better scalability
● Better availability
● Equivalent consistency guarantees
● Equally powerful SQL query

And also similar application latency, using
● Coarser schema with rich column types
● Smarter client coding patterns

In short, we made our database scale, and didn't lose any key database features along the way.
