SS 2009 Proseminar: Beautiful Code

Distributed Programming with MapReduce
by Jeffrey Dean and Sanjay Ghemawat
Presentation: Elke Weber, 04 June 2009

Structure
1. What is MapReduce?
2. Motivation
3. Example program: word count
4. Naïve, non-parallel word count program
5. Parallelized word count programs
6. The MapReduce Programming Model
7. Implementations
8. Demonstration: Disco
9. Conclusion

What is MapReduce?
● a programming system for large-scale data processing problems
● a parallelization pattern/programming model
● separates the details of the original problem from the details of parallelization

Motivation
● simplifying the large-scale computations at Google
● originally developed for rewriting the indexing system for the Google web search product
● MapReduce programs are automatically parallelized and executed on a large-scale cluster
● programmers without any experience with parallel and distributed systems can easily use large distributed resources

Example program: word count
● you have 20 billion documents
● generate a count of how often each word occurs in the documents
● average document size 20 kilobytes → 400 terabytes of data
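The 400 terabyte figure follows directly from the two numbers on the slide. A quick check, assuming decimal units (1 kilobyte = 10^3 bytes, 1 terabyte = 10^12 bytes); the constant names are only illustrative:

```python
# Back-of-the-envelope size of the word count input (decimal units assumed).
NUM_DOCUMENTS = 20 * 10**9   # 20 billion documents
AVG_DOC_BYTES = 20 * 10**3   # 20 kilobytes per document on average
TERABYTE = 10**12            # bytes per terabyte

total_bytes = NUM_DOCUMENTS * AVG_DOC_BYTES
total_terabytes = total_bytes / TERABYTE
print(total_terabytes)  # prints 400.0
```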

Naïve, non-parallel word count program

Assumption: Machine has sufficient memory!

    map<string, int> word_count;
    for each document d {
      for each word w in d {
        word_count[w]++;
      }
    }
    … save word_count to persistent storage …

→ will take roughly 4 months
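For readers who want to run the naïve version, here is a minimal Python sketch of the same loop. It assumes the documents are plain strings and that words are separated by whitespace; `naive_word_count` and the sample documents are hypothetical names introduced for illustration.

```python
from collections import Counter

def naive_word_count(documents):
    """Count words sequentially in a single in-memory table."""
    word_count = Counter()
    for d in documents:
        for w in d.split():      # assumption: whitespace-separated words
            word_count[w] += 1
    return word_count

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = naive_word_count(docs)
print(counts["the"], counts["fox"], counts["dog"])  # prints 3 2 1
```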

Parallelized word count program

    Mutex lock;                    // protects word_count
    map<string, int> word_count;
    for each document d in parallel {
      for each word w in d {
        lock.Lock();
        word_count[w]++;
        lock.Unlock();
      }
    }
    … save word_count to persistent storage …

→ problem: uses a single global data structure for generated counts

Parallelized word count program with partitioned storage 1

    struct CountTable {
      Mutex lock;
      map<string, int> word_count;
    };
    const int kNumBuckets = 256;
    CountTable tables[kNumBuckets];

Parallelized word count program with partitioned storage 2

    for each document d in parallel {
      for each word w in d {
        int bucket = hash(w) % kNumBuckets;
        tables[bucket].lock.Lock();
        tables[bucket].word_count[w]++;
        tables[bucket].lock.Unlock();
      }
    }
    for (int b = 0; b < kNumBuckets; b++) {
      … save tables[b].word_count to persistent storage …
    }
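The bucketed-lock idea can be tried out with Python threads. This is only a sketch under stated assumptions: `zlib.crc32` stands in for the slide's `hash(w)`, a thread pool stands in for "in parallel", and the helper names are hypothetical. In standard CPython the global interpreter lock prevents a real speedup here, so the example shows the locking structure rather than the performance.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

K_NUM_BUCKETS = 256

class CountTable:
    """One partition of the counts, protected by its own lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.word_count = {}

tables = [CountTable() for _ in range(K_NUM_BUCKETS)]

def bucket_of(word):
    # Stable stand-in for hash(w) % kNumBuckets.
    return zlib.crc32(word.encode("utf-8")) % K_NUM_BUCKETS

def count_document(document):
    for w in document.split():
        table = tables[bucket_of(w)]
        with table.lock:  # only this bucket is locked, not the whole table
            table.word_count[w] = table.word_count.get(w, 0) + 1

def count_all(documents, num_threads=4):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(count_document, documents))  # list() surfaces exceptions

docs = ["the quick brown fox", "the lazy dog", "the fox"]
count_all(docs)

# Each word lives in exactly one bucket, so merging is a plain union.
merged = {}
for t in tables:
    merged.update(t.word_count)
```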

Parallelized word count program with partitioned storage 3
● scales to no more than the number of processors in a single machine
● most machines have 8 or fewer processors → still requires multiple weeks of processing
● solution: distribute the data and the computation across multiple machines

Parallelized word count program with partitioned processors 1

Assumption: Machines do not fail!

    const int M = 1000;  // number of input processes
    const int R = 256;   // number of output processes

Parallelized word count program with partitioned processors 2

    main() {
      // Compute the number of documents to
      // assign to each process
      const int D = number of documents / M;
      for (int i = 0; i < M; i++) {
        fork InputProcess(i * D, (i + 1) * D);
      }
      for (int i = 0; i < R; i++) {
        fork OutputProcess(i);
      }
      … wait for all processes to finish …
    }

Parallelized word count program with partitioned processors 3

    void InputProcess(int start_doc, int end_doc) {
      // Separate table per output process
      map<string, int> word_count[R];
      for each doc d in range [start_doc .. end_doc-1] {
        for each word w in d {
          int b = hash(w) % R;
          word_count[b][w]++;
        }
      }
      for (int b = 0; b < R; b++) {
        string s = EncodeTable(word_count[b]);
        … send s to output process b …
      }
    }

Parallelized word count program with partitioned processors 4

    void OutputProcess(int bucket) {
      map<string, int> word_count;
      for each input process p {
        string s = … read message from p …
        map<string, int> partial = DecodeTable(s);
        for each <word, count> in partial {
          word_count[word] += count;
        }
      }
      … save word_count to persistent storage …
    }
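The data flow between input and output processes can be simulated in a single Python process. This is only a sketch under stated assumptions: plain function calls replace `fork` and message passing, JSON strings stand in for `EncodeTable`/`DecodeTable`, `zlib.crc32` stands in for `hash(w)`, and M and R are shrunk to toy values.

```python
import json
import zlib

M = 2  # number of input processes (the slides use 1000)
R = 4  # number of output processes (the slides use 256)

def bucket_of(word):
    return zlib.crc32(word.encode("utf-8")) % R

def input_process(docs):
    """Count words in one slice; return one encoded table per output process."""
    word_count = [dict() for _ in range(R)]  # separate table per output process
    for d in docs:
        for w in d.split():
            table = word_count[bucket_of(w)]
            table[w] = table.get(w, 0) + 1
    return [json.dumps(t) for t in word_count]  # stands in for encode + send

def output_process(messages):
    """Merge the partial tables that every input process sent to this bucket."""
    word_count = {}
    for s in messages:
        for word, count in json.loads(s).items():
            word_count[word] = word_count.get(word, 0) + count
    return word_count

docs = ["the quick brown fox", "the lazy dog", "the fox"]
D = (len(docs) + M - 1) // M  # ceiling division so no document is dropped
sent = [input_process(docs[i * D:(i + 1) * D]) for i in range(M)]
results = [output_process([sent[i][b] for i in range(M)]) for b in range(R)]

merged = {}
for r in results:
    merged.update(r)  # buckets are disjoint, so this is a plain union
```

The sketch rounds D up, whereas the slide's integer division would leave trailing documents unassigned whenever the document count is not a multiple of M.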

Parallelized word count program with partitioned processors 5
● scales nicely on a network of workstations
● but is more complicated and hard to understand
● does not deal with machine failures
● adding failure handling would further complicate things

The MapReduce Programming Model
● separate the details of the original problem from the details of parallelization
● parallelization pattern:
    ● For each input record, extract a set of key/value pairs that we care about.
    ● For each extracted key/value pair, combine it with other values that share the same key.

Division of word counting problem into Map and Reduce

    void Map(string document) {
      for each word w in document {
        EmitIntermediate(w, "1");
      }
    }

    void Reduce(string word, list<string> values) {
      int count = 0;
      for each v in values {
        count += StringToInt(v);
      }
      Emit(word, IntToString(count));
    }

Driver for Map and Reduce 1

    map<string, list<string> > intermediate_data;

    void EmitIntermediate(string key, string value) {
      intermediate_data[key].append(value);
    }

    void Emit(string key, string value) {
      … write key/value to final data file …
    }

Driver for Map and Reduce 2

    void Driver(MapFunction mapper, ReduceFunction reducer) {
      for each input item do {
        mapper(item);
      }
      for each key k in intermediate_data {
        reducer(k, intermediate_data[k]);
      }
    }

    main() {
      Driver(Map, Reduce);
    }
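The Map, Reduce and driver slides translate almost line by line into runnable Python. The sketch below makes a few assumptions: words are whitespace-separated, the input items are passed to the driver explicitly, and a dictionary named `final_data` (hypothetical) stands in for the final data file.

```python
from collections import defaultdict

intermediate_data = defaultdict(list)  # key -> list of intermediate values
final_data = {}                        # stand-in for the final data file

def emit_intermediate(key, value):
    intermediate_data[key].append(value)

def emit(key, value):
    final_data[key] = value

def map_fn(document):
    for w in document.split():
        emit_intermediate(w, "1")

def reduce_fn(word, values):
    count = 0
    for v in values:
        count += int(v)
    emit(word, str(count))

def driver(mapper, reducer, inputs):
    for item in inputs:              # map phase
        mapper(item)
    for k in intermediate_data:      # reduce phase, one call per distinct key
        reducer(k, intermediate_data[k])

driver(map_fn, reduce_fn, ["the quick brown fox", "the lazy dog", "the fox"])
print(final_data["the"])  # prints 3
```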

The MapReduce Programming Model
● the example implementation runs on a single machine
● because of the separation → we can now change the implementation of the driver program
● the driver deals with:
    ● distribution
    ● automatic parallelization
    ● fault tolerance
● all of this is independent of the Map and Reduce functions
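To illustrate that the driver can change while Map and Reduce stay the same, the sketch below runs identical functions under a sequential driver and under a thread-pool driver. It is not how the Google implementation works: threads on one machine merely stand in for distribution, there is no fault tolerance, and passing the emit callbacks as arguments is an assumption made so that map tasks share no global state. All helper names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_fn(document, emit_intermediate):
    for w in document.split():
        emit_intermediate(w, "1")

def reduce_fn(word, values, emit):
    emit(word, str(sum(int(v) for v in values)))

def run_map_task(mapper, item):
    """Run one map task and collect the pairs it emits."""
    pairs = []
    mapper(item, lambda k, v: pairs.append((k, v)))
    return pairs

def group_by_key(task_outputs):
    grouped = defaultdict(list)
    for pairs in task_outputs:
        for k, v in pairs:
            grouped[k].append(v)
    return grouped

def run_reduce(reducer, grouped):
    out = {}
    for k in grouped:
        reducer(k, grouped[k], out.__setitem__)
    return out

def sequential_driver(mapper, reducer, inputs):
    outputs = [run_map_task(mapper, i) for i in inputs]
    return run_reduce(reducer, group_by_key(outputs))

def threaded_driver(mapper, reducer, inputs, workers=4):
    # Only the map phase is parallelized in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(lambda i: run_map_task(mapper, i), inputs))
    return run_reduce(reducer, group_by_key(outputs))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
seq_result = sequential_driver(map_fn, reduce_fn, docs)
par_result = threaded_driver(map_fn, reduce_fn, docs)
```

Both drivers produce the same table, and neither `map_fn` nor `reduce_fn` contains any parallelization logic.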

Other Map and Reduce Examples 1
● distributed grep
● reverse web-link graph
● term vector per host
● inverted index
● distributed sort
● many more ...
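As one illustration from this list, an inverted index fits the same pattern: Map emits (word, document ID) pairs, and Reduce collects and sorts the document IDs for each word. The sketch below is a single-machine toy under stated assumptions: the document IDs, the sample texts and the small `run` helper are hypothetical, and words are whitespace-separated.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (word, doc_id) once for every distinct word in the document."""
    for w in set(text.split()):
        yield w, doc_id

def reduce_fn(word, doc_ids):
    """Produce the sorted posting list for one word."""
    return word, sorted(doc_ids)

def run(inputs):
    intermediate = defaultdict(list)
    for doc_id, text in inputs.items():
        for k, v in map_fn(doc_id, text):
            intermediate[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

docs = {"d1": "the quick fox", "d2": "the lazy dog", "d3": "fox and dog"}
index = run(docs)
print(index["fox"])  # prints ['d1', 'd3']
```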

Other Map and Reduce Examples 2
● complex computations:
    ● a sequence of MapReduce steps
    ● iterative application of a MapReduce computation
● March 2003: small handful of MapReduce programs at Google
● December 2006: 6,000 distinct MapReduce programs

A distributed MapReduce Implementation
● for running large-scale MapReduce jobs
● large clusters of PCs connected together with switched Ethernet (in wide use at Google)
● machines: dual-processor x86, Linux, 2-4 GB of memory
● GFS (Google File System): file chunks of 64 MB, 3 copies of each chunk on different machines
● user submits jobs to a scheduling system

[Figure: Relationships between processes in MapReduce]

Implementation details
● load balancing
● fault tolerance
● locality
● backup tasks

Extensions to the Model
● partitioning function
● ordering guarantees
● skipping bad records
● some other extensions (see the MapReduce paper under Further Reading)
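A partitioning function decides which of the R reduce partitions an intermediate key is sent to. The sketch below shows a default hash-based partitioner and a custom one that groups URL keys by host name, so that all URLs of one host end up in the same partition. The function names are hypothetical, and `zlib.crc32` is used as an assumed stand-in for whatever hash a real system uses.

```python
import zlib
from urllib.parse import urlparse

R = 8  # number of reduce partitions (toy value)

def stable_hash(s):
    # Deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(s.encode("utf-8"))

def default_partition(key, r=R):
    """hash(key) mod R: spreads keys roughly evenly across partitions."""
    return stable_hash(key) % r

def host_partition(url_key, r=R):
    """hash(hostname(url)) mod R: keeps all URLs of a host together."""
    host = urlparse(url_key).hostname or ""
    return stable_hash(host) % r

a = host_partition("http://example.com/a")
b = host_partition("http://example.com/b/c")
print(a == b)  # prints True
```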

Implementations
● The Google MapReduce framework
● Apache Hadoop (Yahoo!, Facebook, IBM, Last.fm ...)
● Cell Broadband Engine
● NVIDIA GPUs (Graphics Processors)
● Apache CouchDB
● Skynet
● Disco (Nokia)
● Aster Data Systems (MySpace)
● Bashreduce

Demonstration: Disco
massive data – minimal code
● open-source implementation of MapReduce
● Nokia Research Center
● supports parallel computations over large data sets on unreliable clusters of computers
● Disco core: Erlang
● jobs: Python

Conclusion
● Google:
    ● early 2007: more than 6,000 distinct MapReduce programs
    ● more than 35,000 MapReduce jobs per day
    ● about 8 petabytes of input data per day
    ● about 100 gigabytes per second

Conclusion
● useful across a very broad range of problems:
    ● machine learning
    ● statistical machine translation
    ● log analysis
    ● information retrieval experimentation
    ● general large-scale data processing and computation tasks

References
● Jeffrey Dean & Sanjay Ghemawat: "Distributed Programming with MapReduce", in: Beautiful Code, edited by Andy Oram & Greg Wilson, chapter 23, pages 371-384, O'Reilly, 2007
● list of different implementations: http://en.wikipedia.org/wiki/MapReduce
● Disco: http://discoproject.org/

Further Reading
● "MapReduce: Simplified Data Processing on Large Clusters." http://labs.google.com/papers/mapreduce.html
● "The Google File System." http://labs.google.com/papers/gfs.html
● "Web Search for a Planet: The Google Cluster Architecture." http://labs.google.com/papers/googlecluster.html
● "Interpreting the Data: Parallel Analysis with Sawzall." http://labs.google.com/papers/sawzall.html
