
MapReduce-Recoded NPB/EP Benchmark and Its Performance Comparison Against MPI

Huiwei Lv, Hua Cheng, and Weidan Wu

Abstract—MPI is the dominant programming model used in high-performance computing today. However, as data-intensive super computing becomes more common, MapReduce is being used in more and more fields. This paper presents a MapReduce-recoded implementation of the NPB/EP benchmark and compares it against the original MPI version. We evaluate the performance of the two versions; the evaluation shows that MapReduce is slower in the case of NPB/EP.

Index Terms—MapReduce, MPI, Distributed computing, Data-intensive computing, Parallel programming, NPB, Hadoop

I. INTRODUCTION

Parallel computing, which is mainly used in high-performance computers, is a form of computing in which many instructions execute simultaneously. MPI is the de facto standard parallel programming model. It is widely used on heterogeneous systems, typically systems with distributed memory. The key concept of MPI is interprocess communication through messages. In a system with distributed memory, each computing element owns its local address space; the memory is logically distributed and often physically distributed as well. Non-local memory is accessible, but at a much lower speed than local memory. Communication is the key technology in parallel computing, by which different processors or processes cooperate with one another. In distributed-memory systems, message passing is the main technique: computing elements communicate by sending and receiving messages.

MPI (Message Passing Interface)[2] is the most widely used message-passing API. It provides a library (in C) or subroutines (in Fortran) that are inserted into source code to perform data communication between processes. These processes, each working on its own local data, have purely local variables, and there is no mechanism for any process to directly access the memory of another. Sharing of data between different processes takes place by message passing, which is what MPI provides. As a programming model, MPI is designed to be practical, portable, efficient, and flexible for writing message-passing programs. It avoids memory-to-memory copying, allows overlap of computation and communication, and supports offloading to a communication co-processor where available. The interface has convenient C and Fortran 77 bindings. The user need not cope with communication failures; such failures are dealt with by the underlying communication subsystem, which makes MPI a reliable message-passing model. The semantics of the interface are language independent, and the interface allows for thread safety. Furthermore, the interface is not too different from existing practice such as PVM, NX, Express, and p4; it provides extensions that allow greater flexibility, and it can be implemented on many vendors' platforms with no significant changes to the underlying communication and system software. All these features make MPI the most widely used parallel programming model, and its users range from individual application programmers to developers of software designed to run on parallel machines and creators of environments and tools.

Recently, Google and its competitors have created a new class of large-scale computer systems to support Internet search. These "Data-Intensive Super Computing" (DISC)[3] systems differ from conventional supercomputers in their focus on data. The programming model Google uses in these systems is MapReduce[1], which allows simple programs to benefit from advanced mechanisms for communication, load balancing, and fault tolerance. MapReduce is a programming model and an associated implementation for processing and generating large data sets. MapReduce is designed to meet different goals, so its programming model is quite different from MPI's. It is largely a functional programming model: users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model.
Programs written in MapReduce are automatically parallelized and executed on a large cluster of commodity machines. The MapReduce run-time system deals with the tasks of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. The programmer need not take care of the details of utilizing the resources, as MPI users must. Table I summarizes the comparison.

Huiwei Lv is a graduate student at the Institute of Computing Technology, Beijing, China (email: [email protected]).
Hua Cheng is a PhD student at the Institute of Computing Technology, Beijing, China (email: [email protected]).
Weidan Wu is a graduate student at the Institute of Computing Technology, Beijing, China (email: [email protected]).

TABLE I
THE COMPARISON OF MPI AND MAPREDUCE

                                MPI                              MapReduce
  Task distribution             explicit                         automatic
  Language                      C or Fortran                     Java
  Design goal                   general message-passing model    process large data sets, cope with
                                                                 hardware failure, high throughput
  Implementation                MPICH                            Hadoop
  Reliability and portability   yes                              yes

II. MAPREDUCE

A. Overview
MapReduce is a programming model and an associated implementation for processing and generating large data sets. It is largely drawn from functional programming. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in MapReduce are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.

B. Programming Model
The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator; this allows the library to handle lists of values that are too large to fit in memory.

C. Example
The following example is taken from [1], the paper that first introduced MapReduce. Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word. In addition, the user writes code to fill in a MapReduce specification object with the names of the input and output files and optional tuning parameters, then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library.
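To make the mapping from this pseudo-code to a runnable program concrete, the same word count can be sketched against the classic Hadoop API (org.apache.hadoop.mapred) that is also used for the EP implementation in Section IV. The sketch below is ours and not part of the original EP work: the class names WordCountMap and WordCountReduce are illustrative, and the classes are assumed to be nested in a driver class in the usual way.

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Map: emit (word, 1) for every word in an input line.
  public static class WordCountMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);          // EmitIntermediate(w, "1")
      }
    }
  }

  // Reduce: sum all counts emitted for a particular word.
  public static class WordCountReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();          // result += ParseInt(v)
      }
      output.collect(key, new IntWritable(sum));   // Emit(result)
    }
  }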

D. Hadoop
Hadoop[5] is a free Java framework for storing and processing petabytes of data. Released under the Apache License[11], Hadoop is free to download and use. It is composed of HDFS[10], HBase, and MapReduce. The goals of Hadoop MapReduce are:
1. Process large data sets.
2. Cope with hardware failure.
3. Provide high throughput.
Note that low latency is not a goal of Hadoop MapReduce, since the framework is dedicated to processing large datasets rather than compute-intensive applications.
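Under Hadoop, the specification object described in Section II-C corresponds to a JobConf. A minimal driver for the word-count sketch above might look as follows; this is an illustrative sketch only, with WordCount, WordCountMap, and WordCountReduce referring to the hypothetical classes above and args[0]/args[1] standing for the input and output paths.

  // Requires org.apache.hadoop.fs.Path and the org.apache.hadoop.mapred classes
  // JobConf, JobClient, FileInputFormat, and FileOutputFormat.
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(WordCountMap.class);
  conf.setCombinerClass(WordCountReduce.class);   // optional local pre-aggregation
  conf.setReducerClass(WordCountReduce.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);                         // submits the job and waits for completion

The driver fills in the input and output file names and any tuning parameters, which is exactly the role of the specification object in Section II-C.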

III. THE NPB/EP BENCHMARK

A. Overview of the NPB Benchmark
The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications and come in several "flavors." NAS solicits performance results for each from all sources [6][7].

B. The EP Benchmark
The EP benchmark is one of the kernel benchmarks in the NPB suite. It provides an estimate of the upper achievable limit for floating-point performance, i.e., the performance without significant interprocessor communication. The program generates pairs of Gaussian random deviates from pseudorandom numbers uniformly distributed on the interval (-1, 1), and then verifies the result against the reference values.
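To make the structure of this kernel concrete, its acceptance-rejection step can be sketched in Java as follows. This is a simplified illustration only: it uses java.util.Random in place of NPB's own 46-bit linear congruential generator, so its sums will not reproduce the official reference values, and the class and variable names are ours.

  import java.util.Random;

  public class EpKernelSketch {
    public static void main(String[] args) {
      final int n = 1 << 20;                 // number of candidate pairs (Class A uses 2^28)
      final int nq = 10;                     // number of annuli to tally
      Random rng = new Random(271828183L);   // stand-in for the NPB pseudorandom generator
      double sx = 0.0, sy = 0.0;             // sums of the Gaussian deviates
      long[] q = new long[nq];               // counts of accepted pairs per annulus

      for (int i = 0; i < n; i++) {
        double x = 2.0 * rng.nextDouble() - 1.0;   // uniform deviate in (-1, 1)
        double y = 2.0 * rng.nextDouble() - 1.0;
        double t = x * x + y * y;
        if (t <= 1.0) {                            // acceptance-rejection test
          double f = Math.sqrt(-2.0 * Math.log(t) / t);
          double gx = x * f;                       // pair of independent
          double gy = y * f;                       // Gaussian deviates
          sx += gx;
          sy += gy;
          // tally by annulus l <= max(|X|,|Y|) < l+1 (values >= nq essentially never occur)
          int l = Math.min(nq - 1, (int) Math.max(Math.abs(gx), Math.abs(gy)));
          q[l]++;
        }
      }
      System.out.printf("sx = %f, sy = %f%n", sx, sy);
    }
  }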

IV. IMPLEMENTATION

A. Motivation
We chose to implement NPB because of its importance in scientific computing. So far, MapReduce has mainly been used for processing large datasets such as web pages and search results. We use MapReduce to recode NPB/EP with an eye toward large scientific datasets that may need to be processed in this way. Moreover, by comparing the MapReduce and MPI implementations of the same EP application, we can gain insight into these two different programming models.

B. The MapReduce Version of the EP Benchmark
The work is carried out under the Hadoop framework. To make the program work, three classes need to be implemented, namely the JobConf class, the Map class, and the Reduce class.

1) The JobConf class: In the JobConf class, the number of map tasks and reduce tasks is set, and the EP work is distributed to the different maps. Typically, programs written in MapReduce are automatically parallelized and executed on a large cluster of commodity machines, with the run-time system taking care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. However, in the case of NPB/EP, to preserve the correctness and the sequential order of the pseudorandom numbers generated by the parallel program, the index of each map needs to be assigned explicitly:

  jobConf.setNumMapTasks(numMaps);
  jobConf.setNumReduceTasks(numReds);
  for (int idx = 0; idx < numMaps; ++idx) {
    Path file = new Path(inDir, "part" + idx);
    SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, jobConf, file,
        IntWritable.class, FloatWritable.class, CompressionType.NONE);
    writer.append(new IntWritable(idx), new FloatWritable(idx));   // one record per map: its index
    writer.close();
    System.out.println("Wrote input for Map #" + idx);
  }

The task is distributed by creating numMaps input files. Each map function then reads one file and starts computing with the parameter idx passed to it.

2) The Map class: In the Map class, we implement the map function, which computes its share of the pseudorandom numbers and Gaussian deviates.

  public static class Map extends MapReduceBase implements Mapper {

    // Fields such as conf (the JobConf), nn (the problem size), np, np_add,
    // no_large_maps, nq, q, sx, and sy are declared elsewhere and omitted here,
    // as in the original listing.
    public void map(IntWritable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      // distribute tasks
      int no_maps = conf.getNumMapTasks();
      int imap = key.get();
      np = nn / no_maps;
      no_large_maps = nn % no_maps;
      if (imap < no_large_maps) {
        np_add = 1;
      } else {
        np_add = 0;
      }
      np = np + np_add;

      // Compute AN = A ^ (2 * NK) (mod 2^46)
      // Compute the k offsets for each loop
      // Find starting seed
      // Compute uniform pseudorandom numbers
      // Compute Gaussian deviates

      // collect the results
      float qq[] = new float[nq];
      for (int i = 0; i < nq; i++) {
        qq[i] = (float) q[i];
        output.collect(new IntWritable(i), new FloatWritable(qq[i]));
      }
      float fsx = (float) sx;
      float fsy = (float) sy;
      output.collect(new IntWritable(10), new FloatWritable(fsx));
      output.collect(new IntWritable(11), new FloatWritable(fsy));
    }
  }

In MPI, the id of the distributed task is obtained by calling mpi_comm_rank, while in MapReduce it is passed to each map function as the parameter key. The rest of the MapReduce code is much the same as the original MPI version, which carries out the task of computing the pseudorandom numbers and the Gaussian deviates. The collect function is the interface to the reduce phase: it collects the results of the map function and passes them to the reduce function, where they will be reduced.
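The computations elided in the comments above ("Find starting seed", "Compute uniform pseudorandom numbers") rely on NPB's pseudorandom generator, the recurrence x_{k+1} = a * x_k mod 2^46 with a = 5^13; assigning each map an explicit index lets it jump its seed forward so that the parallel run reproduces the same sequence as a sequential run. One step of the recurrence might be rendered in Java roughly as below; the Fortran original (randlc) computes the same recurrence with split double-precision arithmetic, and the method name here is ours.

  // One step of the NPB 46-bit linear congruential generator (sketch).
  // The seed is passed in a one-element array so the updated value can be
  // returned alongside the uniform deviate.
  static double randlcStep(long[] seed) {
    final long a = 1220703125L;            // 5^13
    final long mask23 = (1L << 23) - 1;
    final long mask46 = (1L << 46) - 1;
    long x0 = seed[0] & mask23;            // low 23 bits of the seed
    long x1 = seed[0] >>> 23;              // high 23 bits of the seed
    long hi = ((a * x1) & mask23) << 23;   // high partial product, reduced mod 2^46
    seed[0] = (hi + a * x0) & mask46;      // new seed = a * x mod 2^46
    return seed[0] * 0x1.0p-46;            // uniform deviate in (0, 1)
  }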


 4.295875165629892  103 and  1.580732573678431104 . public static class Reduce extends MapReduceBase implements Reducer { public void reduce(IntWritable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { if(nkey == 10){ while(values.hasNext()){ float num = values.next().get(); sx += num; } output.collect(key, new FloatWritable(sx)); } else if(nkey == 11){ while(values.hasNext()){ float num = values.next().get(); sy += num; } output.collect(key,new FloatWritable(sy)); } else{ while(values.hasNext()){ float num = values.next().get(); qq[nkey] += num; } output.collect(key,new FloatWritable(qq[nkey])); } } C. Dataset The dataset was produced use the pseudorandom algorithm provided by the NPB/EP Benchmark. For the Class A problem, the number of Gaussian deviates to be generated is 2 28 .

V. PERFORMANCE EVALUATION

A. Platform
Each node of the Dawning 5000A high-performance computer features four dual-core AMD Opteron 275 processors at 2193.784 MHz. Each node has 7.78 GB of memory and 1 GB of cache. Both the MPI version and the Hadoop version in this paper occupied 4 nodes, running Linux AS 3.0. We repeated the experiments three times and observed small performance variations (the standard deviation over the three measurements is less than 5%).

B. Hadoop test
Under the Hadoop environment, we rewrote the NPB-EP test program with the MapReduce programming model. The number of reduces was varied over 1, 2, 4, and 8, and the number of maps over 1, 2, 4, 8, 16, 32, and 64. The test results are shown in Figure 1. In Figure 1, we notice that when the map number grows to 2 and 4, the execution time decreases remarkably; when the map number grows to 8, 16, and 32, the execution time increases gradually, and when it grows to 64, the execution time increases sharply. When NPB-EP is executed on a single processor (Map = 1, Reduce = 1), that is, with no distribution and no communication overhead, the execution time is 57 s, which can be regarded as the baseline time on Hadoop. The overall overhead increases by 28% when the map number grows to 2, 56% when it grows to 4, 242% when it grows to 8, and as much as 7,760% when it grows to 64. The overhead for the different configurations of NPB-EP on Hadoop is shown in Figure 2. Because there is no inter-process communication in NPB-EP itself, the Hadoop results indicate that the framework's overhead is considerable, especially as the map number increases.

Execution time (seconds) of NPB-EP on Hadoop:

              Map number:    1      2      4      8      16     32     64
  1 reduce                  62.6   34.9   43.6   43.6   84.7   35.6   63.7
  2 reduces                 57.6   36.5   22.5   24.5   28.5   34.6   69.7
  4 reduces                 64.6   36.5   28.5   29.5   26.5   35.5   58.8
  8 reduces                 61.6   44.5   27.5   34.5   27.5   48.6   91.9

Fig. 1. The test results of NPB-EP on Hadoop (execution time in seconds).

  Map number    1     2      4      8      16    32     64
  Overhead      0     0.28   0.56   2.42   7     18.4   77.6

Fig. 2. The overhead of different configurations of NPB-EP on Hadoop (overhead ratio).

C. MPI test
In our test, the MPI implementation is MPICH 1.2.7 and the NPB version is NPB2.4-MPI. We ran the standard benchmark with 1, 2, 4, 8, 16, 32, and 64 processes. The test results are shown in Figure 3. Figure 3 presents the time spent in the MPI test: as the number of processes grows from 1 through 16, there is nearly linear acceleration. The reason is that there is essentially no communication requirement in NPB-EP; the processes need not communicate with each other after MPI has distributed the task to them, which allows full use of the computational capability. When the number of processes grows to 64, because the total number of cores is 32, the limited computational resources introduce additional process-switching overhead, so the execution time is slightly longer than with 16 processes.

  Processor number    1       2       4       8      16     32     64
  Time (s)           64.28   32.19   16.06   8.08   4.27   4.44   4.35

Fig. 3. The test results of NPB-EP in the MPI environment.

D. Performance Comparison
Figure 4 merges the Hadoop and MPI test results. From the figure we can conclude that Hadoop shows good, nearly linear acceleration only in a narrow range, for instance when the map number is 1 or 2; as the degree of parallelism grows, the overhead of Hadoop increases, while MPI still shows good linear acceleration. Over the whole test, the overhead of Hadoop is much bigger than that of MPI; for scientific computation, communication within a node has been optimized in MPI, so MPI achieves better efficiency. But as multi-core processors develop rapidly, whether the MapReduce programming model can be adapted to exploit the fine-grained parallelism within a node remains to be explored in future work.

  MAP/Processor number    1     2     4     8     16    32    64
  1 reduce               63    35    44    44    85    36    64
  2 reduces              58    37    22    24    29    35    70
  4 reduces              65    36    28    30    26    36    59
  8 reduces              62    45    28    35    28    49    92
  MPI                    64    32    16    8.1   4.3   4.4   4.4

Fig. 4. The combined test results of NPB-EP on MPI and Hadoop (execution time in seconds).



VI. CONCLUSION AND DISCUSSION
As data-intensive super computing becomes more common, MapReduce will be used in more and more fields. In this paper, we analyse the efficiency of the MapReduce framework in the field of scientific computation. As a preliminary work, we have implemented only one of the NPB benchmarks. The performance analysis shows that in the case of EP, MPI is more efficient than MapReduce. To get a full picture of the comparison between MPI and MapReduce, more applications should be implemented and tested. The straightforward extension of this work is the implementation of the other benchmarks in the NPB suite.

ACKNOWLEDGMENT
This work was carried out under the guidance of Prof. Kai Hwang. The authors would like to thank Jianwei Xu and Wenli Zhang, who gave much support and advice on testing the NPB.

REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004.
[2] "MPI: A Message-Passing Interface Standard," Message Passing Interface Forum, November 15, 2003.
[3] Randal E. Bryant, "Data-Intensive Supercomputing: The Case for DISC," May 2007.
[4] Message Passing Interface (MPI), https://computing.llnl.gov/tutorials/mpi/
[5] Tom White, "A Tour of Apache Hadoop," ApacheCon EU, April 2008.
[6] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS Parallel Benchmarks," RNR Technical Report RNR-94-007, March 1994.
[7] Rob F. Van der Wijngaart, "NAS Parallel Benchmarks Version 2.4," Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division, NAS Technical Report NAS-02-007, October 2002.
[8] Apache Hadoop, http://hadoop.apache.org/core/
[9] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters," in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007.
[10] The Hadoop Distributed File System: Architecture and Design, http://hadoop.apache.org/core/docs/current/hdfs_design.html
[11] Apache License Version 2.0, http://www.apache.org/licenses/LICENSE-2.0
[12] Ralf Lammel, "Google's MapReduce Programming Model," 2008.
