Cloud MapReduce: a MapReduce Implementation on top of a Cloud Operating System Huan Liu, Dan Orban Accenture Technology Labs {huan.liu, dan.orban}@accenture.com Abstract—Like a traditional Operating System (OS), a cloud OS is responsible for managing the low level cloud resources and presenting a high level interface to the application programmers in order to hide the infrastructure details. However, unlike a traditional OS, a cloud OS has to manage these resources at scale. If a cloud OS has already taken on the complexity to make its services scalable, we should be able to greatly simplify a large-scale system design and implementation if we build on top of it. Unfortunately, a cloud’s scale comes at a price. For example, Amazon cloud not only relies on horizontal scaling, but it also adopts a weaker consistency model called eventual consistency. We describe Cloud MapReduce (CMR), which implements the MapReduce programming model on top of the Amazon cloud OS. CMR is a demonstration that it is possible to overcome the cloud limitations and simplify system design and implementation by building on top of a cloud OS. We describe how we overcome the limitations presented by horizontal scaling and the weaker consistency guarantee. Our experimental results show that CMR runs faster than Hadoop, another implementation of MapReduce, and that CMR is a practical system. We believe that the techniques we used are general enough that they can be used to build other systems on top of a cloud OS.

I. INTRODUCTION

Like a server Operating System (OS), a cloud OS is responsible for managing resources. In a server (e.g., a PC), the OS is responsible for managing the various hardware resources, such as CPU, memory, disks, network interfaces – everything inside a server's chassis. It hides the hardware operation details and allows these scarce resources to be efficiently shared. A cloud OS serves the same purpose. Instead of managing a single machine's resources, a cloud OS is responsible for managing the cloud infrastructure, hiding the cloud infrastructure details from the application programmers and coordinating the sharing of the limited resources. But unlike a traditional OS, a cloud OS is much more complex, not only because it has to manage a much bigger infrastructure, but also because it has to serve many more customers. IBM CEO Thomas J. Watson is well known for his 1943 statement (although only scant evidence exists): "I think there is a world market for maybe five computers." Although it has often been laughed at since the advent of Personal Computers, it is becoming a reality again. The only difference is that we refer to these computers as clouds. Today, only a handful of companies, such as Google, Microsoft, Amazon and Yahoo, need and are capable of building a cloud – a large server farm with hundreds of thousands of servers. For example, it is reported that Google has well over 1 million servers,

and it serves millions of customers. Managing such a big infrastructure and supporting so many users require the OS to be extremely robust and scalable. In this paper, we explore how to build systems on top of a cloud OS. We have implemented the MapReduce[1] programming model using services provided by the Amazon cloud OS. By leaving the scalability concerns and the implementation complexity to the cloud OS, we demonstrate that we can easily build highly scalable systems with much less code. Beyond simplicity, our implementation is not only faster than other implementations, but it is also fully distributed, i.e., it has no single point of failure and no scalability bottleneck.

A. Cloud OS

Even though the underlying resources it manages are different, a cloud OS is similar to a traditional server OS in terms of the services it provides. First, it provides compute services, such as Amazon EC2 and Windows Azure workers. They provide computing power in the form of Virtual Machines (VMs) instead of threads. Second, it provides storage services, such as Amazon S3 and Windows Azure blob storage. Third, a cloud OS provides communication services, such as Amazon's Simple Queue Service (SQS) and Windows Azure queue service, which are similar to a pipe on a UNIX OS, where a user can push in messages at one end and pop out messages at the other end. Last, a cloud OS also provides persistent storage services, such as Amazon's SimpleDB and Windows Azure table services. They provide persistent storage similar to the registry service in a Windows OS. We refer interested readers to cloud vendors' documentation (e.g., at aws.amazon.com) for details on cloud services (e.g., EC2, S3, SQS, SimpleDB) provided as part of a cloud OS. Compared to a server OS, a key difference for a cloud OS is that it is scalable. This is for two reasons. First, a cloud OS has to manage a much bigger infrastructure. Second, a cloud has to support hundreds of thousands of people instead of just a few users on a PC. To meet the scalability challenges, cloud providers are forced to build new solutions from the ground up. For example, Google designed their own GFS [2] to manage files and their own BigTable [3] to store a large amount of semi-structured data. Similarly, Amazon designed Dynamo[4] to manage storage and built their own management infrastructure to support their web services API. Even though a cloud OS is complex to implement, out of necessity, cloud providers have already spent a large amount

of engineering effort on building a highly scalable cloud OS that can manage a large infrastructure shared by many people. If we leverage the existing cloud OS, we can potentially lower the application complexity, yet achieve high scalability.

B. Challenges posed by a cloud OS

A cloud OS' scalability comes at a price. It has to be traded off against other desirable system properties. Eric Brewer, in a keynote address to the PODC (Principles of Distributed Computing) 2000 conference[5], presented the CAP theorem. The theorem states that, of the three properties of shared-data systems – data consistency, system availability, and tolerance to network partition – only two can be achieved at any given time. A more formal confirmation of the CAP theorem can be found in [6]. Because a cloud is used by thousands of people, it has to be highly scalable and always available; thus, the only property it can give up is data consistency. Indeed, the Amazon cloud OS has embraced a weaker consistency model called "Eventual Consistency"[7]. Under the eventual consistency model, the system guarantees that if no new updates are made to an object, eventually all accesses will return the last updated value. However, during a small time window, clients may observe inconsistent states. The inconsistency window size cannot be determined a priori because it depends on communication delays, the load on the system, the number of replicas involved in the replication scheme, and the extent of component failures (both their number and duration), if any. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a domain name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update. In addition to eventual consistency, a cloud also employs horizontal scaling. For example, SimpleDB can only sustain a small write throughput per domain (a domain is Amazon's terminology; it can be considered equivalent to a table in a database); but a user can write to multiple domains at the same time to increase the aggregate write throughput. Although each Amazon account has 100 domains by default, one can simply send an email to request more domains. This is similar to EC2, which by default has a 20-instance (Amazon's term for virtual machines) limit that can be lifted by a simple email request. Building applications on top of a cloud OS must overcome these limitations. We describe the manifestations of the eventual consistency model that we are able to observe, and how we architect and implement Cloud MapReduce (CMR) to overcome them.

C. Advantages of Cloud MapReduce

In this paper, we show that we can greatly simplify the design and implementation of MapReduce by leveraging what a cloud OS has implemented already. By using queues, we easily parallelize the Map and the Shuffling stages. By using Amazon's visibility timeout mechanism, we easily implement fault-tolerance. By leveraging a cloud OS's fully distributed

implementation, we are able to implement a fully distributed architecture with no single point of failure and no scalability bottleneck. The simplicity means that it is easy to extend the framework beyond simply MapReduce. Many applications do not conform to the MapReduce model. If implemented in the MapReduce framework, such an application could experience slow performance. For example, many applications, such as distributed grep, require the map stage only, i.e., these applications only need to spread out the work to as many workers as possible. Using MapReduce, the map output has to go through the reduce phase, which consumes unnecessary compute resources. Using a simple implementation like ours, we can easily change our framework to not only refine the MapReduce model, but also implement a different model (e.g., Dryad[8]). Cloud MapReduce has several highly desirable properties, which seem to be shared by other highly-scalable systems (such as Dynamo[4]). Incremental scalability: Cloud MapReduce can scale incrementally in the number of computing nodes. A user not only can launch a number of servers at the beginning, but can also launch additional servers in the middle of a computation if the user thinks the progress is too slow. The new servers can automatically figure out the current job progress and poll the queues for work to process. Symmetry and Decentralization: Every computing node in Cloud MapReduce has the same set of responsibilities as its peers. There are no master or slave nodes. Symmetry simplifies system provisioning, configuration and failure recovery. As implied by symmetry, there is no single central agent (master), which makes the system more available. Heterogeneity: The computing nodes could have varying computation capacities. The faster nodes would do more work than the slower nodes. In addition, the computing nodes could be distributed geographically. In the extreme, a user can even harvest idle computing capacity from servers/desktops distributed on the Internet.

D. Contributions

The contributions of the paper include the following. First, we propose, implement and evaluate a new architecture for the MapReduce programming model on top of a cloud OS. Beyond using a cloud OS to simplify the implementation, the architecture is novel in a couple of respects compared with the master/slave architecture used in the Google [1] and Hadoop [9] implementations. Nodes in the architecture act independently, and they pull for job assignments and global status in order to determine their individual actions (instead of a job scheduler pushing job assignments). The architecture also uses queues to shuffle results from Map to Reduce. Mappers write results as soon as they are available, and Reducers filter out results from failed nodes, as well as duplicate results. The parallelization of the Map and Shuffling stages improves MapReduce performance. Second, we propose several techniques to detect and work around problems arising

from eventual consistency. The techniques are general enough that they can be applied to other systems implemented on top of a cloud OS.

II. CLOUD MAPREDUCE ARCHITECTURE AND IMPLEMENTATION

In this section, we describe how we implement Cloud MapReduce using the Amazon cloud OS. We start with the high level architecture, then delve into detailed implementation issues we have encountered. We use the Word Count application as an example to describe our implementation. We use various cloud services. We store our input and possibly output data in S3. By leveraging the distributed nature of S3, we can achieve higher data throughput since data comes from multiple servers and communications with the servers potentially all traverse different network paths. We also use SQS, which is a critical component that allows us to design MapReduce in a simple way. A queue serves two purposes. First, it is a synchronization point where workers (processes running on instances) can coordinate job assignments. Second, a queue serves as a decoupling mechanism to coordinate data flow between different stages. Lastly, we use SimpleDB, which serves as the central job coordination point in our fully distributed implementation. We keep all workers' status here.

A. The architecture

The Cloud MapReduce architecture is shown in Fig. 1. There are several SQS queues: one input queue, one master reduce queue, one output queue and many reduce queues. As its name implies, the input queue holds the inputs to the MapReduce computation. At the start of the computation, the user provides an input queue which contains a list of S input key-value pairs. Each key-value pair corresponds to a split of the input data that will be processed by one map task. To facilitate tracking, each key-value pair also has a unique map ID. In the word count application, this queue contains the document collection, where the key is the document name and the value is a pointer into S3 storage. SQS is designed for message communication, hence it has an 8KB message size limitation. Because 8KB could be too small to fit a large document, we store a pointer to S3, instead of the data directly, in SQS. In addition to pointing to the location in S3, the pointer could also contain a range specification, specifying a chunk of the file. Using ranges, the user could split up a bigger file into pieces and process them separately; a short sketch below illustrates this. Similar to the input queue, the output queue holds the results of the MapReduce computation. In the word count application, the output holds the resulting key-value pairs. There is only one master reduce queue and it holds many pointers, one for each reduce queue. As we will see, the master reduce queue is used to assign reduce tasks. There are a large number of reduce queues. The number of them, denoted by Q, is a configurable parameter that is set by the user. The reduce queues and the master reduce queue, as well as the entries in the master reduce queue, are created in a distributed fashion before the start of the MapReduce job.
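The sketch referenced above generates the S input key-value pairs for one large file using byte ranges. The pointer syntax ("s3://...#start-end") and the enqueueInput helper are our own illustrative assumptions, not CMR's documented format.

    // A minimal sketch, in Java, of splitting one large S3 object into S map
    // tasks by byte range. fileLength and inputQueueUrl are assumed given.
    long splitSize = 64L * 1024 * 1024;                  // e.g., 64MB per split
    for (long offset = 0, mapId = 0; offset < fileLength;
            offset += splitSize, mapId++) {
        long end = Math.min(offset + splitSize, fileLength) - 1;
        // key = document name; value = S3 pointer plus byte range
        String value = "s3://bucket/wikipedia.txt#" + offset + "-" + end;
        enqueueInput(inputQueueUrl, mapId, "wikipedia.txt", value);  // hypothetical helper
    }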

A set of map workers, each running as a separate thread on an EC2 instance, poll the input queue for work. When a map worker dequeues one key-value pair, it invokes the user-defined map function to process it. Just like in other MapReduce implementations, the user-defined function processes the input key-value pair and emits a set of intermediate key-value pairs. In the word count example, the input value is a pointer to a document stored in S3. The map function first downloads the document from S3 to the local machine. It then parses the document, and for each word (e.g., "talk") it sees, it emits a key-value pair, where the key is the word (e.g., "talk") and the value is simply "1" to indicate that it has seen this word once. The MapReduce framework collects the intermediate key-value pairs from the map function, then writes them to the reduce queues. A reduce key maps to one of the reduce queues through a hash function. A default hash function is provided, but the users could also supply their own. Since the number of reduce keys could be much bigger than Q, several keys may map to the same queue. As we will see, each reduce queue is processed by a separate reduce worker; thus, Q should be set at least as large as the number of reduce workers. Preferably, Q should be much bigger in order to maximize load balancing. CMR uses the network to transfer the intermediate key-value pairs as soon as they are available; thus, it overlaps data shuffling with map processing. This is different from other implementations, where the intermediate key-value pairs are only copied after a map task finishes. Overlapping shuffling is also used when pipelining MapReduce [10]. Compared to the implementation in [10], which has to implement pairwise socket connections and a buffering/copying mechanism, our implementation using queues is much simpler. Since the map phase is typically long, overlapping shuffling has the effect of spreading out traffic. This can help alleviate the incast problem [11] [12] [13] (switch buffer overflow caused by simultaneous transfer of a large amount of data) if it occurs. Once the map workers finish their jobs, the reduce workers start to poll the master reduce queue for work. Once a reduce worker dequeues a message, it is responsible for processing all data in the reduce queue indicated by the message. It dequeues messages from the reduce queue and feeds them into the user-defined reduce function as an iterator. After the reduce function finishes processing all data in the reduce queue, the worker goes back to the master reduce queue to fetch the next message to process. Just like in other MapReduce implementations, the user-defined reduce function writes a set of key-value pairs as the outputs. The reduce workers collect the outputs and write them to the output queue. The name of the output queue has been specified before the start of the MapReduce job. It can be used either as the final output or as the input to the next MapReduce job. Besides reading from and writing to the various queues, the workers also read from and write to SimpleDB to update their status. By communicating status with a central, scalable SimpleDB service, we not only avoid a single-point bottleneck in our architecture, but we also make our implementation fully distributable.
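For concreteness, the word count map function described above might be written roughly as follows. downloadFromS3 is a hypothetical helper that fetches the document (or byte range) the input value points to, and we assume the OutputCollector exposes a collect() method; CMR's actual interface may differ in detail.

    // Sketch of a word count map function: download the document, tokenize
    // it, and emit (word, "1") for every word seen.
    public void map(String key, String value, OutputCollector output) {
        String document = downloadFromS3(value);      // value is an S3 pointer
        for (String word : document.split("\\s+")) {  // crude whitespace tokenizer
            if (!word.isEmpty()) {
                output.collect(word, "1");            // saw this word once
            }
        }
    }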

Fig. 1. Cloud MapReduce architecture

Workers work independently of all other workers, and they do not care how many other workers there are. In addition, workers can be heterogeneous. They can be located anywhere in the world and they can have vastly different computing capacities. Even though we have shown two sets of workers (map workers and reduce workers) in Fig. 1, both could run on the same set of computing nodes (e.g., EC2 instances). Reduce workers start only after the map phase has finished. A new worker can join the computation at any time. When it joins, it can determine whether the map phase has finished by querying SimpleDB, and it can then poll the input queue or the master reduce queue accordingly for work. In our architecture, it is easy for the job owner to get a rough sense of the job progress. The input queue length as a percentage of S – the original input queue length – is a good approximation of the map progress. Similarly, the master reduce queue length as a percentage of Q – the original master reduce queue length – is a good approximation of the reduce progress. Obtaining the approximate queue length is a simple call to the SQS GetQueueAttributes API.
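A minimal sketch of the map-progress estimate, using v1-style AWS SDK for Java names (exact class and method names vary across SDK versions):

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;

    // Estimate map progress from the input queue's approximate length.
    // s is S, the original number of input splits.
    public static double mapProgress(AmazonSQS sqs, String inputQueueUrl, int s) {
        GetQueueAttributesRequest req = new GetQueueAttributesRequest(inputQueueUrl)
                .withAttributeNames("ApproximateNumberOfMessages");
        int remaining = Integer.parseInt(sqs.getQueueAttributes(req)
                .getAttributes().get("ApproximateNumberOfMessages"));
        return 1.0 - (double) remaining / s;  // fraction of splits consumed so far
    }

The same call against the master reduce queue, with Q in place of S, approximates the reduce progress.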

Our current implementation is written in Java. Since the interface functions are in Java, all user-defined map and reduce functions (at least their interface part) have to be written in Java. This limitation could easily be removed by using a mechanism similar to the Streaming mechanism in the Hadoop [9] implementation. Because the nodes are symmetric, it is easy to launch a MapReduce job. Users simply launch a certain number of virtual machines from our custom Amazon Machine Image (AMI), and pass a few job-specific parameters to the virtual machines as the user data. There is no complicated cluster setup and configuration, and there is no need to select a master. Our AMI contains a simple script which parses the user data passed in during launch to determine what application to run and which data set to use; the script then automatically starts the MapReduce job.

B. Cloud challenges and our general solution approaches

Even though the architecture presented above is simple, we have to get around several limitations posed by the cloud. We list the key challenges we encountered and the general techniques we used to get around them. One contribution of this paper is the set of general techniques for getting around the cloud's challenges. We believe they could be used for other applications built on a cloud OS. In the subsequent sections, we get into more details on the implementation. Long latency: Since Amazon services are accessed through the network, the latency could be significant. In our measurements, SQS latency ranges from 20ms to 100ms even from within EC2. Hence, a significant portion of the time would be spent waiting for SQS to respond if we accessed it synchronously. We get around this limitation through two techniques: message aggregation and multi-threading (described in Sec. II-E1). Horizontal scaling: Although all Amazon cloud services are based on horizontal scaling, we are only able to observe one concrete manifestation: when using SimpleDB, each SimpleDB domain is only able to sustain a small write throughput. In our experiments, the threshold is roughly 30-40 items per second. To get around this problem, we spread the write workload across many domains, and we aggregate the results from all domains when reading the status, as sketched below.
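A minimal sketch of this domain-striping strategy; putItem and queryDomain are hypothetical wrappers around SimpleDB's PutAttributes and Select calls:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Stripe status writes across several SimpleDB domains to stay under the
    // per-domain write throughput limit; reads fan out to every domain.
    public class StripedStatusStore {
        private final List<String> domains;  // e.g., "status-0" .. "status-49"
        private final Random random = new Random();

        public StripedStatusStore(List<String> domains) { this.domains = domains; }

        // Each write goes to a randomly chosen domain, spreading the load.
        public void commit(String itemName, String attribute, String value) {
            String domain = domains.get(random.nextInt(domains.size()));
            putItem(domain, itemName, attribute, value);
        }

        // Reads query every domain (in CMR, via parallel threads) and
        // aggregate the partial results into one view.
        public List<String> readAll(String selectExpression) {
            List<String> results = new ArrayList<>();
            for (String domain : domains) {
                results.addAll(queryDomain(domain, selectExpression));
            }
            return results;
        }

        private void putItem(String d, String item, String attr, String value) {
            /* hypothetical wrapper around SimpleDB PutAttributes */
        }

        private List<String> queryDomain(String d, String select) {
            /* hypothetical wrapper around SimpleDB Select */
            return new ArrayList<>();
        }
    }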

Unlike SimpleDB, other services, such as S3 and SQS, hide the horizontal scaling details from the end users. Not knowing when a queue is created for the first time: According to Amazon documentation, to know whether a worker is the first to create a queue, the worker can call the CreateQueue SQS API with a unique visibility timeout (the time for a message to reappear after a read) setting. If a queue already exists but has a different visibility timeout, Amazon returns an error message; otherwise, it returns success. In practice, due to eventual consistency, if two workers create the queue at the same time, both may return success. We do not encounter this problem in our current architecture; however, it did constrain our architecture design to avoid dynamic queue creation. Duplicate messages: According to Amazon documentation, when a worker reads a message from an SQS queue, the message disappears from the queue for a certain amount of time (the visibility timeout). In practice, two workers (or two threads) may both read the same message if they read at the same time. This is another manifestation of eventual consistency, because each read modifies the message state – hiding it for a visibility timeout. Our solution approach depends on the queue purpose. We use filtering for the input and reduce queues (Sec. II-E2), but we use conflict resolution for the master reduce queue (Sec. II-D). Note that duplicate messages happen rarely, so even if the recovery mechanism is expensive, it will not impact the performance much. Potential node failure: A worker may fail in the middle of processing a map or reduce task. We use a status update to a central place (SimpleDB) as a commit mechanism, and we use filtering to remove uncommitted results. Indeterminate eventual consistency windows: This problem has a different manifestation in SQS and SimpleDB. In SQS, we find that it frequently reports the queue to be empty even when there are still messages in the queue, especially when there are only a few messages left. Amazon documentation attributes this to the distributed nature of the SQS implementation, where messages for the same queue are stored on different servers. The Amazon documentation states that one can call the dequeue API a few times and the queue would return all messages. Unfortunately, there is neither a bound on the number of API calls nor a bound on the time to wait. Similarly, in SimpleDB, when we read an item right after it is written, we may not get the latest value. One solution is to wait for an arbitrarily long time; unfortunately, this not only provides no guarantee, but also results in much slower performance, since workers are frequently idle waiting. Our solution strategy is to set an expectation before reading. For example, we record the number of key-value pairs generated by each map task for each reduce queue. Then, in the reduce phase, we know exactly how many key-value pairs to expect, and we poll from the reduce queue until all are read. As another example, when tallying the total key-value pairs generated for a reduce queue, we make sure we get S counts from SimpleDB, one reported by each map task.
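A minimal sketch of the expectation-based polling loop. expected is the count committed to SimpleDB (for a reduce queue, the sum of Rij over all map tasks i), Message is the SDK's message type, and dequeueBatch is a hypothetical wrapper around the SQS ReceiveMessage call; an empty response is treated as inconclusive rather than as end-of-data.

    int read = 0;
    while (read < expected) {
        List<Message> batch = dequeueBatch(reduceQueueUrl);
        if (batch.isEmpty()) {
            continue;              // eventual consistency: "empty" is not trusted
        }
        for (Message m : batch) {
            process(m);            // hand the key-value pairs to the reducer
            read++;
        }
    }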

C. Status tracking

Due to eventual consistency, we have no reliable way of knowing whether a queue is empty or not. To facilitate tracking, each worker updates its progress to SimpleDB. The worker then uses the progress reports from all nodes, including its own, to determine whether there is more to get from a queue. When a map worker finishes a map task, it writes two pieces of information to SimpleDB: the worker ID and map ID i pair, and the number of reduce key-value pairs the worker generated for each reduce queue j while processing map ID i (denoted by Rij). Updating the status to SimpleDB serves as the commit mechanism to signify that the input split corresponding to the map ID has been processed successfully. The worker ID and map ID pair information is used to determine when the input queue is empty. When SQS indicates that there are no more key-value pairs to process in the input queue, the map worker queries SimpleDB to get the list of all worker ID and map ID pairs. It counts the number of unique map IDs that have been processed, since some map IDs may have been processed more than once. This happens for one of two reasons. First, two map workers may have received the same map ID due to the eventual consistency problem. Second, a map worker may have failed after committing to SimpleDB but before removing the corresponding message from the input queue (Sec. II-D); thus, a second map worker would have processed the same map ID again after the visibility timeout. If the number of unique map IDs is the same as S, Cloud MapReduce proceeds to the reduce phase; otherwise, the worker goes back to query the input queue again for more work. To minimize the load on SimpleDB, we only query for the missing map IDs that have not been committed yet, if the number of missing map IDs is small. The commit records are stored and indexed with the map ID as the record name, and it is more efficient to read the missing map IDs one by one when there are only a few of them left. The reduce key-value pair count (Rij) is used to determine when a reduce queue has been processed. When a reduce worker is assigned reduce queue j (by querying the master reduce queue), it first queries SimpleDB to sum up Rij over all i to see how many key-value pairs are in reduce queue j. It then queries reduce queue j until all Σi Rij key-value pairs have been read. In the reduce phase, we use a similar commit mechanism. When a reduce worker finishes a reduce task, it writes two pieces of information to SimpleDB: the worker ID and reduce ID i pair, and the number of output key-value pairs the worker generated while processing reduce queue i. Similar to what we have done with the map stage, we use the number of committed reduce tasks to determine when the reduce stage finishes, and we use message tagging and filtering to remove output results generated by failed nodes (details omitted due to space limitation; see Sec. II-E2 for what we do with the map outputs).
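The map-phase completion test described above then reduces to counting unique committed map IDs, as in this sketch (the SimpleDB query that produces committedMapIds is assumed):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // The input queue looking empty is not trusted. Instead, count the unique
    // map IDs committed to SimpleDB (duplicate commits are possible) and
    // compare against S, the number of input splits.
    public static boolean mapPhaseDone(List<String> committedMapIds, int s) {
        Set<String> unique = new HashSet<>(committedMapIds);
        return unique.size() == s;   // every split processed at least once
    }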

SimpleDB is designed with the same eventual consistency principle; however, this is not a problem for us because we wait for the eventual result. At the end of the map stage, we make sure all S map results are committed before tallying up Rij for the reduce queue size, and we make sure we tally up S Rij counts, one from each committed map task. Similarly, at the end of the reduce stage, we make sure all Q reduce results are committed. To overcome the write throughput limitation of a single SimpleDB domain, each worker randomly picks one of several domains to write its status. When querying SimpleDB for results, each worker launches multiple threads to read from all domains at the same time, and then aggregates the overall result. We find that 50 domains are enough even for our test cases with 1000 nodes. Even though statuses are maintained centrally, SimpleDB would not be a bottleneck, since it is itself implemented in a distributed fashion.

D. Failure detection/recovery and conflict resolution

We use SQS's visibility timeout mechanism for failure detection and recovery. After a worker reads a message from a queue, the message disappears from the queue for a certain period of time (the visibility timeout). Unless deleted explicitly, a message will reappear after the visibility timeout passes. The input queue holds the task assignment messages for map tasks and the master reduce queue holds the task assignment messages for reduce tasks. Both queues have a short visibility timeout (default 30 seconds). While a map/reduce worker is still processing a task, it periodically resets the corresponding message's visibility timeout. After a task has been successfully processed, the worker removes the corresponding message from the queue to prevent other workers from repeating the same work. If a worker fails while processing a map or reduce task, the message will reappear in the input or master reduce queue shortly (within the default timeout, e.g., 30 seconds), so that other workers can take over. All status updates to SimpleDB are done before removing the message from the queue, to make sure that the result is committed fully first. Two map workers may work on the same map task due to either node failure or message duplication as a result of eventual consistency. In the MapReduce programming model, it is OK to process the same map task twice, so we do not take extra steps. In Sec. II-E2, we discuss how to filter out duplicate map outputs. However, two reduce workers processing the same reduce queue could pose a problem. If it happens, we use SimpleDB for conflict resolution. When SQS reports that a reduce queue j is empty, but the reduce worker has not processed all key-value pairs (fewer than Σi Rij), the reduce worker suspects that there may be a conflict, so it enters the conflict resolution mode. It first writes the reduce queue ID j and worker ID pair into SimpleDB, and it then queries to see if other workers have claimed the same reduce queue ID. If so, it invokes a deterministic resolution algorithm (the same for all nodes) to determine who should be in charge of processing this reduce queue. If the worker loses, it abandons what it has processed and moves on. However, if the worker wins, it goes back to query the reduce queue again. Even if other workers have read some messages from the reduce queue, the messages will reappear after the visibility timeout for the winning worker to finish its processing.
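A minimal sketch of the conflict resolution step. The paper only requires that every node apply the same rule; picking the smallest worker ID is our illustrative choice, and claimQueue/readClaims stand in for the SimpleDB write and read.

    import java.util.Collections;
    import java.util.List;

    // Record our claim on a contended reduce queue, read back all claims, and
    // let the same deterministic rule everywhere pick one winner.
    public static boolean winsConflict(String reduceQueueId, String myWorkerId) {
        claimQueue(reduceQueueId, myWorkerId);               // hypothetical write
        List<String> claimants = readClaims(reduceQueueId);  // hypothetical read
        return Collections.min(claimants).equals(myWorkerId);
    }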

E. Working with SQS

1) Hide access latency: We use two techniques – message aggregation and multi-threading – to hide SQS latency when accessing the reduce queues. Message aggregation takes advantage of the 8KB SQS message limit, which is typically much bigger than a key-value pair. By aggregating, we turn multiple round trips into one, which reduces both the number of queue write requests and the number of read requests during the reduce phase. Note that message aggregation is different from the combiner in the MapReduce framework. A combiner is an application-specific function which reduces the intermediate result size by applying application-specific logic. In contrast, our message aggregation is a framework implementation optimization. The optimization works regardless of the application. To hide latency further, we use a thread pool of multiple threads for both writing to and reading from the reduce queues. When a worker has a message to write, it pushes the message into a local queue. When a thread in the thread pool becomes idle, it pops off a new message from the local queue and sends the message to SQS synchronously. For reading from the reduce queues, we allocate a read buffer and set a read buffer threshold. When the number of messages in the buffer falls below the threshold, we ask idle threads to download additional messages. Each idle thread performs one bulk read of 10 messages (10 is the maximum allowed by the SQS API). The reduce workers read directly from the read buffer, instead of interfacing with SQS. Fig. 2 shows the time for the word count application as a function of the number of threads in the thread pool. The word count application runs on a single m1.small EC2 instance (the smallest EC2 instance) and processes a 25MB data set. We show cases with the combiner both enabled and disabled. When the combiner is disabled, more data is shuffled between the map and reduce stages. As shown, the time quickly decreases as we add more threads, suggesting that threads are effective at hiding the latency. The latency from within EC2 to SQS ranges from 20ms to 100ms. If computation nodes are further away from SQS, more threads are needed to hide the latency. A user can specify the number of threads simply by using a command line option. Since extra threads in the thread pool have little adverse impact on the performance, we initialize 100 threads in the thread pool by default to support a large amount of data transfer and to hide larger latencies. The message aggregation and multi-threading techniques are only used on the reduce queues, since the input queue and the master reduce queue serve a very different purpose. The reduce queues are intermediary staging points between the map and reduce phases; thus, they require high throughput. In contrast, the input queue and the master reduce queue are used for job assignments. It is better to read from them one message at a time to ensure a more even workload distribution.
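A minimal sketch of the write path combining both techniques: pairs are aggregated up to the 8KB limit, queued locally, and drained by a pool of threads that each call SQS synchronously. sendToSqs stands in for the SQS SendMessage call, and character count approximates message bytes.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BufferedQueueWriter {
        private static final int MAX_MESSAGE_BYTES = 8 * 1024;  // SQS limit
        private final BlockingQueue<String> localQueue = new LinkedBlockingQueue<>();
        private final StringBuilder current = new StringBuilder();

        public BufferedQueueWriter(int threads, String queueUrl) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    try {
                        while (true) {
                            // Blocks when idle; each send is synchronous.
                            sendToSqs(queueUrl, localQueue.take());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }

        // Called by a map worker for each tagged key-value pair.
        public synchronized void emit(String taggedPair) {
            if (current.length() > 0
                    && current.length() + taggedPair.length() > MAX_MESSAGE_BYTES) {
                localQueue.add(current.toString());  // hand off a full aggregate
                current.setLength(0);
            }
            current.append(taggedPair).append('\n');
        }

        private void sendToSqs(String queueUrl, String body) {
            /* hypothetical wrapper around the SQS SendMessage call */
        }
    }

A symmetric reader keeps a buffer filled via bulk reads of 10 messages and lets reduce workers consume from the buffer directly.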

Fig. 2. Computation time as a function of the number of threads in the thread pool. Word Count on 25MB data on a single m1.small EC2 node.

2) Duplicate detection: Due to eventual consistency, we may read a message twice from a queue. We use tagging to overcome this problem for the reduce queues. When a map worker writes an SQS message, it tags the message with three pieces of information: the worker ID, the map ID and a unique number. The unique number is used to distinguish between the messages generated by the same map worker while processing the same map ID. The tag is simply prepended to the message. When a reduce worker reads an SQS message, it checks the tag to see if it has seen the message before. If so, the reduce worker ignores the message; otherwise, it stores the tag in its database to facilitate future duplicate detection and then processes the message. Since we only need to detect duplicates within one reduce queue, and since each message is an aggregate of up to 8KB, the number of tags we need to store is small. If two map workers read a duplicate message from the input queue (or if a worker failed in the middle of processing a map task), there will be redundant map outputs in the reduce queues. The reduce workers consult SimpleDB to construct a list of committed worker ID and map ID pairs (picking a winner at random if more than one worker ID committed the same map ID), and they then filter out redundant messages by checking the worker ID and the map ID in the message tag against the list. Two workers may also read a duplicate message from the master reduce queue. As discussed in Sec. II-D, we use a conflict resolution mechanism to get around that problem.
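A minimal sketch of the reduce-side filter. We assume a tag of the form "workerId:mapId:seq"; the exact tag format is our illustration, and committedPairs is the winner list ("workerId:mapId") built from SimpleDB.

    import java.util.HashSet;
    import java.util.Set;

    public class DuplicateFilter {
        private final Set<String> seenTags = new HashSet<>();
        private final Set<String> committedPairs;  // one winner per map ID

        public DuplicateFilter(Set<String> committedPairs) {
            this.committedPairs = committedPairs;
        }

        // Returns true if a message carrying this tag should be processed.
        public boolean accept(String tag) {
            String pair = tag.substring(0, tag.lastIndexOf(':'));  // drop seq
            if (!committedPairs.contains(pair)) {
                return false;          // output of a failed or losing map attempt
            }
            return seenTags.add(tag);  // add() returns false on a duplicate read
        }
    }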

F. Map and Reduce interfaces

The user-defined Map function must implement the following interface.

void map(String key, String value, OutputCollector output)

Just as described in [1], both the key and the value are passed to the user-defined Map function as strings. The OutputCollector is provided to the Map function so that it can emit the output key-value pairs. The user-defined reduce function in the MapReduce programming model requires an iterator interface for the list of values for each reduce key. In our architecture, we have Q fixed reduce queues; thus, it is possible to have multiple reduce keys in the same reduce queue. Since values for different reduce keys may be mingled in the same reduce queue, we cannot simply feed the queue outputs to the reduce function. We have evaluated several implementation options.

1) Pull iterator with sorting: In a pull iterator implementation, the user-defined reduce function must implement the following interface.

void reduce(String key, Iterable values, OutputCollector output)

The user-defined function can simply iterate through the values iterator to retrieve all values associated with the key. Like other MapReduce implementations, we can implement this interface through sorting. We first download all messages in a reduce queue locally, sort them in order to group values by key, then feed the values to the reduce function. We have implemented a version with in-memory sort, which requires that all data from a reduce queue fit in main memory. A more complete implementation should assume that the data may not fit in memory; thus, a paging mechanism has to be built in to dump partial results to disk. Unfortunately, this would increase the implementation complexity. We adopt the simpler in-memory implementation because we can bound the size of each reduce queue by increasing Q. Unlike in other MapReduce implementations, increasing Q does not have the adverse effect of stressing a master node. Amazon's SQS does not support a FIFO (first in, first out) queuing discipline, which prevents us from adopting a solution similar to that used in other MapReduce implementations for sorting (each map node sorts locally, and each reduce node merge sorts). However, we are currently porting CMR to run on top of an internal cloud OS, which supports FIFO queues. We are experimenting with how to best implement sorting.
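A minimal sketch of the in-memory variant: drain the queue, group values by key (a TreeMap keeps the keys sorted), then feed each group to the user-defined reduce function. drainReduceQueue is a hypothetical helper returning the already de-duplicated (key, value) pairs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    Map<String, List<String>> groups = new TreeMap<>();
    for (String[] kv : drainReduceQueue()) {   // kv[0] = key, kv[1] = value
        groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
    }
    for (Map.Entry<String, List<String>> entry : groups.entrySet()) {
        reduce(entry.getKey(), entry.getValue(), output);  // Iterable of values
    }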

2) Pull iterator with multi-threading: An alternative way to implement the pull iterator is to use multi-threading. In a multi-threading implementation, we start a new thread which runs the user-defined reduce function whenever we see a new key from the reduce queue, and we pass new values for the key through a buffer. When the reduce function asks for new values as part of the iterator interface, it calls the hasNext() function to query the existence of new values. The thread invokes wait() if the buffer is empty, so that it can sleep and wait for the next value. When the main thread dequeues a new value for a key, it inserts the value into the buffer for the corresponding reduce function, then invokes the notify() function to wake up the sleeping thread. When the reduce queue is empty, the main thread iterates through all sleeping reduce functions and passes a special nil value to the sleeping threads. The hasNext() function recognizes the special value and returns false to indicate that there are no more values in the iterator. The Java language does not provide a mechanism for us to implement our own thread scheduling algorithm. Since there is only one reduce function to wake up when we dequeue a new message, and since we know exactly which one to wake up, we might be able to implement a more efficient scheduling algorithm. Using the built-in Java scheduling algorithm, we are able to launch a little more than 5,000 threads, which limits the number of reduce keys one can have in a reduce queue.

3) Push iterator: We have also implemented a push iterator interface for the user-defined reduce function to facilitate an efficient implementation. This interface is currently the default in Cloud MapReduce. In the push iterator implementation, we pass to the reduce function one value at a time as we dequeue from the reduce queue, instead of passing it an explicit iterator. The reduce function is called once for each new value. The push iterator interface consists of three interface functions. The first is the start interface.

T start(String key, OutputCollector output)

The start interface is called when we see a key for the first time while dequeuing from the reduce queue. It is called before passing the first value to the reduce function. T is a user-defined class which holds the state that the reduce function needs to keep track of. The key associated with this reduce function is also passed in. For example, in the word count example, the start function initializes a count variable in object T and sets its value to 0. The second is the actual reduce function.

void next(String key, String value, T state, OutputCollector output)

A new value for the reduce key is passed in every time this interface is called. As in the Google implementation, both the key and the value are generic strings. The reduce function parses the string to derive the correct data. T is the object that holds the current state. The reduce function processes the current value and updates the state as necessary. For example, in the word count example, the reduce function converts the string to a numerical value, then adds the value to the count variable stored in T. The last interface is the end interface.

void complete(String key, T state, OutputCollector output)

This interface is called when there are no more values associated with the reduce key. In the word count example, the reduce function emits a key-value pair, where the key is the reduce key and the value is the count stored in T. In our implementation, a reduce worker first dequeues a message from the master reduce queue to learn which reduce queue it is responsible for. Then the worker dequeues messages from the reduce queue one by one. If it sees a reduce key for the first time, it invokes the start interface function and keeps the state object T in a collection. For every new key-value pair, it finds the state object T associated with the key, then calls the next interface function. When there are no more messages in the reduce queue, it calls the complete interface function for each reduce key it has seen. Even though we have to keep a reduce key collection and search the key collection for each new key-value pair, this can be implemented efficiently because the number of reduce keys in each reduce queue is expected to be small.
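A minimal sketch of that driver loop, assuming a Reducer<T> interface that bundles the three functions above and a hypothetical KeyValue holder for one dequeued pair:

    import java.util.HashMap;
    import java.util.Map;

    static <T> void runReduceQueue(Reducer<T> reducer, Iterable<KeyValue> messages,
                                   OutputCollector output) {
        Map<String, T> states = new HashMap<>();
        for (KeyValue kv : messages) {
            // start() on the first value for this key, next() on every value.
            T state = states.computeIfAbsent(kv.key, k -> reducer.start(k, output));
            reducer.next(kv.key, kv.value, state, output);
        }
        // Queue drained: finish every key seen in this reduce queue.
        states.forEach((key, state) -> reducer.complete(key, state, output));
    }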

One drawback of the push iterator implementation is that we need to maintain a set of states. This is not a problem for a reduce worker, since we can bound the number of reduce keys in each reduce queue by increasing Q. However, it may be a problem for a combiner, since a map worker may generate key-value pairs for a large number of keys. Fortunately, a combiner does not need to combine all values for a particular key. Cloud MapReduce currently sets a 64MB memory limit on the total amount of combiner state a map worker can keep. If the limit is reached, we flush the buffer by invoking the complete interface for all reduce keys in the combiner buffer. Because Amazon SQS does not support FIFO and because we store results in an output queue, the current CMR does not generate sorted output, regardless of the iterator interface we use. Again, this is a feature that we will support when CMR is ported to run on an internal cloud OS with a queue service supporting FIFO. Currently, CMR provides a sorting function which sorts the outputs from the output queue and stores them in a sequential file in sorted order. Applications requiring a sorted output can invoke our sorting function as the last step.

4) Comparison between the push and pull iterators: We compare the performance of the different iterator interface implementations. As a test case, we use the word count application processing a 25MB file on an m1.small EC2 instance. We disable the combiner in order to generate a larger amount of intermediate data. The multi-threaded pull iterator implementation takes 311 seconds, the sorting pull iterator implementation takes 161 seconds, and the push iterator implementation takes 124 seconds. The multi-threaded pull iterator is the least efficient because starting and stopping threads carries a large overhead. The sorting pull iterator takes slightly longer than the push iterator. Even though the overhead of the sorting pull iterator is low, a complete implementation with paging support would be complex. As a result, Cloud MapReduce currently uses the push iterator interface by default.

III. PROS AND CONS OF CLOUD MAPREDUCE

Leveraging what a cloud OS has built already, we are able to greatly simplify CMR's implementation. CMR currently has around 3,000 lines of Java code (including generous comments, three MapReduce applications, and our profiling code to collect statistics), compared to 285,387 lines in Hadoop 0.21.0. CMR is simpler for several reasons, including the following. First, S3 presents a large and reliable file storage abstraction, which relieves us of having to design our own file system. Second, SimpleDB presents a high-bandwidth status vault, which can sustain a high read and write (through striping) throughput. The high read throughput, in particular, enables our distributed architecture. Instead of relying on a master to instruct the slave nodes (to alleviate the stress on the master), we let all workers query the central store for global knowledge first, and then derive their local actions on their own. Third, both S3 and SQS present a single point of

contact that is capable of sustaining a high throughput. We no longer need to worry about communicating with many nodes at the same time. Last, we simply use Amazon's visibility timeout mechanism to handle failure. No extra logic is needed to detect and recover from failure. One drawback of the current CMR implementation is that it does not employ any locality optimization. It uses the network exclusively for I/O, bypassing all local storage. Such an architecture would eventually encounter a network bottleneck in today's cloud infrastructure, when the network links between the computing nodes and the cloud services become saturated. The lack of locality optimization is not only because the cloud services run on a separate set of nodes from our map and reduce workers, but also because the Amazon cloud does not expose any locality hints. We are currently porting CMR to run on an internal cloud OS (hosted in an enterprise's data center), and the new version will support locality optimization. The internal cloud OS is based on a commercial product (from Appistry) which allows the queue and storage services to co-locate on the same nodes as the computing nodes, and it exposes locality hints so that we are able to optimize data placement. In the future, locality optimization may no longer be necessary. Future data center architectures (e.g., fat tree [14], Portland [15], Bcube [16], Dcell [17], and VL2 [18]) would support full bisectional bandwidth, removing the network as a bottleneck.

IV. EXPERIMENTAL EVALUATION

A thorough performance comparison between CMR and other MapReduce implementations, such as Hadoop, is beyond the scope of this paper. In this section, we instead show a few sample applications, contrast CMR with Hadoop to highlight their differences, and show that CMR is a practical system. We have implemented three common MapReduce applications to evaluate Cloud MapReduce's performance: Word Count, Reverse Index, and String Matching (Distributed Grep). Unless specified otherwise, all experiments reported in this section use default parameters in both Cloud MapReduce and Hadoop. We deploy Hadoop ourselves in our EC2 cluster, instead of using Amazon's Elastic MapReduce. Elastic MapReduce uses Hadoop version 0.18.3, whereas we want to compare against the latest stable 0.21.0 version. We compare the performance between Hadoop 0.21.0 and Cloud MapReduce on a cluster of 100 m1.small EC2 instances. An m1.small instance provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor, which is about 40% of a CPU core on the Amazon physical machines. The CPU capacity is significantly smaller than that of state-of-the-art physical machines. For example, m1.small's CPU capacity is roughly one tenth or less of that of the nodes used in [19]. We use m1.small because it has the same 1Gbps network IO potential as the larger EC2 instances; thus, the cost per unit of network bandwidth is more favorable. To make sure the master node in Hadoop is not interfering with the slave node tasks, we put the master on a separate node (so 101 nodes in total for Hadoop vs. 100 for CMR). Note that we are not comparing CMR with Hadoop in a dedicated cluster environment, not

                          combiner   no combiner
Hadoop                       264         537
Cloud MapReduce              104         372
Cloud MapReduce w/sort       221         463

TABLE I
Time (in seconds) to run the word count application. Roughly 13GB data on 100 nodes.

only because we have no access to such a cluster, but also because a full performance comparison is beyond the scope of this paper. For Word Count and Distributed Grep, we use the examples provided by the Hadoop distribution. However, for Reverse Index, we have to implement our own, since it is not included in the Hadoop distribution. We first run the Word Count application on roughly 13GB of Wikipedia article text. Using the default 64MB block size in HDFS (Hadoop File System), there are exactly 200 splits, which correspond to 200 map tasks. For CMR, the data is stored in S3 as several files, since S3 has a 5GB per-file limit, and we use file ranges to create an equal number of splits (200) as inputs to the CMR jobs. Using the default setting of 2 map tasks per node, all map tasks finish in the first wave. We set the number of reduce tasks to 100, so that they also finish in one wave. To see the effects of larger data, we run the test with and without the combiner enabled. To enable side-by-side comparison, we also run a version of Cloud MapReduce with the pull iterator interface implemented with in-memory sorting. Table I shows the time it takes to run the MapReduce jobs. In both cases, Cloud MapReduce is roughly twice as fast as Hadoop. Even with sorting, Cloud MapReduce is still faster. Fig. 3 shows two of the data points in Table I in more detail. Fig. 3(a) shows Hadoop's computation progress for the case with no combiner. Hadoop's report on the reduce progress consists of three components. The first (0-33%) is for data shuffling, the second (33-66%) is for data sorting and the third (66-100%) is for applying the user-defined reduce function. At 125s, some map tasks have finished, and Hadoop starts to shuffle data. At 231s, the map stage finishes. However, there is still some intermediate data that needs to be shuffled. Shuffling continues until 385s, when the reduce progress reaches 33%. While data is shuffling between 231s and 385s, the CPU is underutilized. In comparison, Fig. 3(b) shows CMR's computation progress. The map tasks finish at 210s, and the reduce tasks start processing almost immediately. In CMR, the reduce progress report only tracks the progress on applying the user-defined reduce function, since data is already loaded into SQS when the reduce stage starts. Since data uploading happens in parallel with map computation, CMR makes more efficient use of both CPU and network IO. In addition, CMR employs many parallel threads for SQS upload, which helps to increase the throughput.

Fig. 3. Computation progress for Hadoop and CMR. Word Count on roughly 13GB data. No combiner. (a) Hadoop progress. (b) Cloud MapReduce progress.

Fig. 4. Word count on a 200MB data set on an m1.small instance. (a) CPU usage. (b) Network usage.

In the next experiment, we use a larger block size to see its

impact on the run time. We use roughly 52GB of Wikipedia article text with a 256MB block size in HDFS, so that there are again exactly 200 splits. With the combiner, CMR takes 436s, whereas Hadoop takes 747s. Without the combiner, CMR takes 1,247s, whereas Hadoop takes 1,733s. Even though the map stage is longer due to the larger block size, the shuffling phase is also longer in Hadoop, which translates into a larger relative advantage for CMR. Fig. 4 presents a different view on the overlap between map and shuffling. It shows the CPU and network usage during one run of the Word Count application on a single m1.small instance processing 200MB of data. We disable the combiner in order to stress the network. The CPU remains mostly at peak utilization throughout the job (40% is the highest utilization on m1.small). In Fig. 4(a), at around 21:42, the map phase finishes and the worker waits to flush out all SQS messages before starting the reduce phase. While waiting, there is a short drop in the CPU utilization, but the reduce stage starts soon after, making full use of the CPU again. Fig. 4(b) shows the network IO happening in parallel, including both downloading files from S3 and accessing SQS. Because network access is spread out, the network bandwidth demand is small, staying under 60Mbps, even with the combiner disabled. In our independent tests, an EC2 instance is able to sustain roughly 800Mbps throughput, so the network interface is far from being the bottleneck. Fig. 3 also shows that CMR's map stage is faster than that of Hadoop (209s vs. 385s). There are several reasons for this. Beyond avoiding sorting in our iterator interface, we also avoid going to the disk for the intermediate results. Hadoop always stores the intermediate results on disks and

                  combiner   no combiner
Hadoop               2058        4379
Cloud MapReduce      1324        3213

TABLE II
Time (in seconds) to run the word count application. 100GB data on 100 nodes.

then copies the results over to the hard disks on the destination node when instructed by the master. As a result, the data not only transits through the network once, but it also transits twice through the local disk (at least the local disk buffer). The latencies are incurred sequentially, and they cannot be hidden beyond the granularity of a map task's worth of output. In comparison, Cloud MapReduce uses SQS as a staging area so that it can do everything in memory; therefore, the data only transits once through the network. Even though SQS itself may page to disk, that cost is part of the long latency presented by SQS, which can be effectively hidden with multiple threads, down to the granularity of a single key-value pair. Lastly, Hadoop also has a higher overhead of launching Map tasks, which accounts for a few hundred milliseconds of the difference. We next run Word Count on a much larger data set, which has 100GB of data and 1,563 splits when stored in HDFS using the default block size. Table II shows the results. Since there are many map tasks, the shuffling of the earlier map outputs overlaps with later map processing; thus, the overlapping advantage of CMR is not as pronounced. However, removing sorting and avoiding disk staging still result in a significant reduction. We are not able to run the sorting version of CMR due to the memory limit. We also compare CMR with Hadoop by taking Amazon's resource usage into account. Since we have no visibility into the resource usage in SimpleDB and SQS (they run in a different cluster), we are not able to compare directly. Instead, we take Amazon's pricing as an approximate measure of Amazon resource usage and compare based on the cost. We

run Word Count on 400GB of data using 100 nodes, which takes roughly an hour (the unit of EC2 billing) to complete. During the test, we generate 983,152 SQS requests, which cost $0.98. For SimpleDB, we consume 3.7 machine hours, which would have cost $0.52 if it were beyond the first 25 free machine hours per month. We store very little state in SimpleDB for only a short period, so the SimpleDB storage cost is negligible. In total, SQS and SimpleDB add 17.6% on top of the machine cost (100 instances at $0.085 per hour, the lowest-cost Linux server in N. Virginia, is $8.50; $0.98 + $0.52 = $1.50, and 1.50/8.50 ≈ 17.6%), so the overhead is small. For applications that do not generate as much traffic (e.g., Reverse Index and Distributed Grep), the SQS cost is even lower. For the Distributed Grep application, we use the same 13GB data as used in the Word Count example, and we grep for the keyword "which". Cloud MapReduce takes 962 seconds, whereas Hadoop takes 1,047 seconds. Adding sorting or the combiner makes little difference, since the amount of data in the reduce stage is small. The time difference is not as large because this job is dominated by string (regular expression) matching in the map phase, which is CPU intensive. Also, the map output data is small for the reduce stage; thus, the effect of overlapping map and shuffling is not as pronounced. For the Reverse Index application, we use the same 1.2GB data that is used in the Phoenix evaluation [20]. We duplicate the input data 10 times to create 12GB of data, so that it is reasonably large for our 100-node cluster. The resulting data set contains 923,670 HTML files. Using the default settings, it takes Hadoop more than 6 hours to process all the data, whereas CMR only takes 297s. Most of the long running time is due to Hadoop's high overhead of task creation, where it sometimes takes a few hundred milliseconds to create a new task. Since each input file is a separate map task, the overhead adds up. To get around the overhead, we use MultiFileInputFormat as the input format, which aggregates many small files into one input split. Using MultiFileInputFormat, Hadoop's computing time reduces to 638 seconds, more than twice that of CMR.

V. CONCLUSION

It is far from obvious that we can simplify large-scale systems' design and implementation if we build them on top of a cloud OS. The tradeoffs a cloud has made in favor of scaling make it hard to build systems on top. For example, a naive approach to get around the eventual consistency problem is to wait long enough, which could lead to very poor system performance. Using MapReduce as an example, we have demonstrated that it is possible to overcome the cloud limitations without performance degradation. The techniques we used are general enough that they can be used for other systems. We also proposed a new fully distributed architecture to implement the MapReduce programming model. Nodes pull for job assignments and global status in order to determine their individual actions. The architecture also uses queues to shuffle results from Map to Reduce. Mappers write results

Even though a full-scale performance evaluation is beyond the scope of this paper, our preliminary results indicate that CMR is a practical system and that its performance is on par with that of Hadoop. Our experimental results also indicate that using queues to overlap the map and shuffle stages seems to be a promising approach to improving MapReduce performance.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in OSDI'04: Sixth Symposium on Operating System Design and Implementation, December 2004.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in 19th ACM Symposium on Operating Systems Principles, October 2003.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, and D. A. Wallach, "Bigtable: a distributed storage system for structured data," in Proc. OSDI, 2006, pp. 205–218.
[4] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 205–220, 2007.
[5] E. A. Brewer, "Towards robust distributed systems (abstract)," in PODC '00: Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing. New York, NY, USA: ACM, 2000, p. 7.
[6] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," SIGACT News, vol. 33, no. 2, pp. 51–59, 2002.
[7] W. Vogels, "Eventually consistent," Commun. ACM, vol. 52, no. 1, pp. 40–44, 2009.
[8] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in European Conference on Computer Systems (EuroSys), March 2007.
[9] "Hadoop," http://lucene.apache.org/hadoop.
[10] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein, "MapReduce online," in Proc. NSDI, 2010.
[11] A. Phanishayee, E. Krevat, V. Vasudevan, D. Andersen, G. Ganger, G. Gibson, and S. Seshan, "Measurement and analysis of TCP throughput collapse in cluster-based storage systems," in Proc. File and Storage Technologies (FAST), Feb. 2008.
[12] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. Andersen, G. Ganger, G. Gibson, and B. Mueller, "Safe and effective fine-grained TCP retransmissions for datacenter communication," in Proc. SIGCOMM, 2009.
[13] R. Griffith, Y. Chen, J. Liu, A. Joseph, and R. Katz, "Understanding TCP incast throughput collapse in datacenter networks," in Proc. SIGCOMM WREN Workshop, 2009.
[14] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in Proc. SIGCOMM, 2008.
[15] R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, "PortLand: a scalable fault-tolerant layer 2 data center network fabric," in Proc. SIGCOMM, 2009.
[16] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "BCube: a high performance, server-centric network architecture for modular data centers," in Proc. SIGCOMM, 2009.
[17] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, "DCell: a scalable and fault-tolerant network structure for data centers," in Proc. SIGCOMM, 2008.
[18] A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "VL2: a scalable and flexible data center network," in Proc. SIGCOMM, 2009.
[19] Y. Yu, P. K. Gunda, and M. Isard, "Distributed aggregation for data-parallel computing: interfaces and implementations," in SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2009, pp. 247–260.
[20] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA), Feb. 2007.
