Advanced Tools, Techniques and Applications Unit No: 6
• MapReduce
• Combiner
• Partitioner
• MapReduce Word Count Example
• MongoDB and MapReduce Functions
• ETL Processing
• Apache Pig
• Pig Features
• Pig Execution Modes
• Pig Running Modes
• Pig UDFs
• Word Count Example using Pig
Advanced Tools, Techniques, Applications
2
Large-scale data processing was difficult!
◦ Managing hundreds or thousands of processors
◦ Managing parallelization and distribution
◦ I/O scheduling
◦ Status and monitoring
◦ Fault/crash tolerance
MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
What is it?
◦ A programming model used by Google
◦ A combination of the Map and Reduce models with an associated implementation
◦ Used for processing and generating large data sets
How does it solve our previously mentioned problems? ◦ MapReduce is highly scalable and can be used across many computers. ◦ Many small machines can be used to process jobs that normally could not be processed by a large machine.
Inputs a key/value pair
◦ Key is a reference to the input value
◦ Value is the data set on which to operate
Evaluation
◦ Function defined by the user
◦ Applied to every value in the input
  May need to parse the input
Produces a new list of key/value pairs
◦ Can be a different type from the input pair
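As a concrete sketch of this shape (plain Python, not the Hadoop API; the function name is illustrative), a word-count mapper could look like:

```python
def word_count_map(key, value):
    """Map step: key is a reference to the input (e.g. a line offset),
    value is the text to operate on. Parses the input and produces a
    new list of key/value pairs -- a different type from the input pair."""
    pairs = []
    for word in value.split():   # parse the input value into tokens
        pairs.append((word, 1))  # emit one (word, 1) pair per token
    return pairs
```

For example, `word_count_map(0, "to be or not to be")` emits one `(word, 1)` pair per token, including repeated entries for the same word.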
Starts with intermediate key/value pairs
Ends with finalized key/value pairs
Starting pairs are sorted by key
An iterator supplies the values for a given key to the Reduce function
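A minimal sketch of this shuffle-and-reduce step in plain Python (the names are illustrative, not the Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def word_count_reduce(key, values):
    # Reduce step: sum all counts supplied for one key.
    return (key, sum(values))

def shuffle_and_reduce(intermediate_pairs):
    """Sort intermediate pairs by key, then hand each key's values
    to the reduce function through an iterator."""
    pairs = sorted(intermediate_pairs, key=itemgetter(0))
    return [word_count_reduce(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=itemgetter(0))]
```

For example, `shuffle_and_reduce([("be", 1), ("to", 1), ("be", 1)])` sorts the pairs by key and produces the finalized pairs `[("be", 2), ("to", 1)]`.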
Typically a function that:
◦ Starts with a large number of key/value pairs
  One key/value pair for each word in all files being grepped (including multiple entries for the same word)
◦ Ends with very few key/value pairs
  One key/value pair for each unique word across all the files, with the number of instances summed into this entry
Broken up so that a given worker works with input of the same key.
Map returns information
Reduce accepts information
Reduce applies a user-defined function to reduce the amount of data
Yahoo!
◦ The Webmap application uses Hadoop to create a database of information on all known webpages
Facebook
◦ The Hive data center uses Hadoop to provide business statistics to application developers and advertisers
Rackspace
◦ Analyzes server log files and usage data using Hadoop
Creates an abstraction for dealing with complex overhead
◦ The computations are simple; the overhead is messy
Removing the overhead makes programs much smaller and thus easier to use
◦ Less testing is required as well: the MapReduce libraries can be assumed to work properly, so only user code needs to be tested
Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);   // emit (word, 1) for each token
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();  // sum all counts for this word
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // Reduce doubles as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
MapReduce is composed of several components, including:
◦ JobTracker -- the master node that manages all jobs and resources in a cluster
◦ TaskTrackers -- agents deployed to each machine in the cluster to run the map and reduce tasks
◦ JobHistoryServer -- a component that tracks completed jobs, typically deployed as a separate function or with the JobTracker
MapReduce operates in parallel across massive cluster sizes. MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and Python. Programmers can use MapReduce libraries to create tasks without dealing with communication or coordination between nodes. MapReduce is also fault-tolerant, with each node periodically reporting its status to a master node. If a node doesn't respond as expected, the master node reassigns that piece of the job to other available nodes in the cluster. This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity servers.
Specifying ranges in FOREACH Operator
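The slide body did not survive extraction; as a hedged sketch, Pig Latin lets FOREACH project a range of fields with the `..` operator (the relation and field names below are illustrative):

```pig
-- assume a relation with five named fields
data = LOAD 'input.txt' AS (f1, f2, f3, f4, f5);

first_three = FOREACH data GENERATE f1..f3;  -- fields f1 through f3
tail_fields = FOREACH data GENERATE f3..;    -- f3 through the last field
head_fields = FOREACH data GENERATE ..f2;    -- first field through f2
```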
The NESTED FOREACH
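The slide body was lost in extraction; a minimal sketch of a nested FOREACH, which applies relational operators inside a FOREACH block to each group (relation and field names are illustrative):

```pig
-- count the distinct URLs visited by each user
logs    = LOAD 'visits.txt' AS (user:chararray, url:chararray);
by_user = GROUP logs BY user;
summary = FOREACH by_user {
    urls = logs.url;        -- project a column of the inner bag
    uniq = DISTINCT urls;   -- nested operator applied per group
    GENERATE group AS user, COUNT(uniq) AS distinct_urls;
};
```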
Splitting Data Sets
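The slide body was lost in extraction; as a sketch, Pig's SPLIT operator partitions one relation into several by condition (the names and threshold below are illustrative):

```pig
scores = LOAD 'marks.txt' AS (name:chararray, mark:int);
SPLIT scores INTO passed IF mark >= 40,
                  failed IF mark < 40;
```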
UDFs
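The slide body was lost in extraction; as a sketch, a user-defined function is typically written in Java (extending `org.apache.pig.EvalFunc`), packaged in a jar, and then registered and invoked from Pig Latin (the jar, class, and relation names below are illustrative):

```pig
REGISTER myudfs.jar;              -- jar containing the compiled Java UDF
DEFINE UPPER myudfs.UPPER();      -- shorthand alias for the UDF class
students = LOAD 'students.txt' AS (name:chararray);
upper    = FOREACH students GENERATE UPPER(name);
```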
1. How does Hadoop help to process big data? Explain with a suitable case study.
2. Why is MapReduce important as far as big data is concerned? Explain with some analogy.
3. Write a short note on: mapper, reducer, combiner, partitioner.
4. Write and explain a MapReduce program (Java/Python) to count word occurrences in a text file.
5. How can MongoDB be helpful in the MapReduce paradigm?
6. What is Pig? Where does it fit in the Hadoop architecture?
7. Explain the execution modes of Pig.
8. Write and explain any 10 HDFS commands.
9. Write a short note on UDFs.
10. Write and explain the word count example using Pig.
Thank You http://www.pavanjaiswal.com