IJRIT International Journal of Research in Information Technology, Volume 2, Issue 4, April 2014, Pg: 697- 701

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

A Self-Tuning System for Big Data Analytics: STARFISH

Y. Sai Pramoda1, C. Mary Sindhura2, T. Hareesha3, K. Sindhuri4

1 B.Tech, Computer Science & Engineering, JNTUA College of Engineering, Pulivendula, Andhra Pradesh, India [email protected]

2 B.Tech, Computer Science & Engineering, JNTUA College of Engineering, Pulivendula, Andhra Pradesh, India [email protected]

3 B.Tech, Computer Science & Engineering, JNTUA College of Engineering, Pulivendula, Andhra Pradesh, India [email protected]

4 B.Tech, Computer Science & Engineering, JNTUA College of Engineering, Pulivendula, Andhra Pradesh, India [email protected]

Abstract

Timely and cost-effective analytics over “Big Data” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack, which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces, is a popular choice for big data analytics. Most practitioners of big data analytics, such as computational scientists, systems researchers, and business analysts, lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges, leading us to different design choices in Starfish.

Keywords: Big data, Hadoop, MapReduce programs, MAD & MADDER system

I. INTRODUCTION

Big data is the term for data sets so large and complex that they become difficult to process using traditional data management tools or processing applications. Starfish is one tool for improving the performance of a big data analytics system.


Timely and cost-effective analytics over “Big Data” has emerged as a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. Web search engines and social networks capture and analyze every user action on their sites to improve site design, spam and fraud detection, and advertising opportunities. Powerful telescopes in astronomy, genome sequencers in biology, and particle accelerators in physics are putting massive amounts of data into the hands of scientists. Key scientific breakthroughs are expected to come from computational analysis of such data. Many basic and applied science disciplines now have computational subareas, e.g., computational biology, computational economics, and computational journalism.

The features that users expect from a system for big data analytics are summarized by the acronym MAD (Magnetism, Agility, and Depth). Getting the desired performance from a MAD system can be a major exercise. The practitioners of big data analytics, such as data analysts, computational scientists, and systems researchers, usually lack the expertise to tune system internals. Such users would rather use a system that can tune itself and provide good performance automatically.

Hadoop is a MAD system that is becoming popular for big data analytics. An entire ecosystem of tools is being developed around Hadoop. Hadoop itself has two primary components: a MapReduce execution engine and a distributed file system. The properties that make Hadoop MAD pose new challenges in the path to self-tuning:
1. Data opacity until processing
2. File-based processing
3. Heavy use of programming languages

Traditional data warehouses are kept non-MAD by their administrators because it is easier to meet performance requirements in tightly controlled environments. To further complicate matters, three more features in addition to MAD are becoming important in analytics systems:
1. Data-lifecycle awareness
2. Elasticity
3. Robustness

Together with MAD, these features make a system MADDER. Hadoop has the core mechanisms to support these features, but most of these mechanisms have to be managed manually. The questions that Hadoop leaves open to its users include: How many machines, and what type of machines, should be bought? How should the system be configured? How should failures be dealt with? How should the system be tuned? To address these limitations, Starfish is introduced on top of Hadoop to provide intelligent, predictive, and automated management.

Starfish is a MADDER and self-tuning system for analytics on big data. An important design decision we made is to build Starfish on the Hadoop stack. The primary focus of this paper is on using experimental results to illustrate the challenges in each component and to motivate Starfish’s solution approach. A number of ongoing projects aim to improve Hadoop’s peak performance, especially to match the query performance of parallel database systems. Starfish has a different goal. The peak performance a manually tuned system can achieve is not the primary concern, especially if this performance is for one of the many phases in the data lifecycle. Starfish’s goal is to enable Hadoop users and applications to get good performance automatically throughout the data lifecycle in analytics, without any need on their part to understand and manipulate the many tuning knobs available.

II. LITERATURE REVIEW

Starfish is a self-tuning system for big data analytics. It is a tool to understand, optimize, and strategize Hadoop applications. It is an open-source project hosted on GitHub.

As a recent effort, Dittrich et al. give a tutorial on optimizing big data processing efficiency in Hadoop and MapReduce. To be specific, the authors focused on introducing different data management techniques, e.g., job optimization and physical data organization such as data layouts and indexes. A comprehensive comparison between Hadoop MapReduce and parallel DBMSs was given [2].

From an architecture perspective, Ferguson reported progress on accelerating big data analytics. This work introduces IBM’s efforts in architecting its big data platforms to meet the requirement that one new analytical ecosystem can support the entire spectrum of big data analytics. The reported technology utilized Hadoop and the IBM Smart Analytics System with a built-in NoSQL graph store [4].

Herodotou et al. proposed Starfish, a self-tuning system for big data analytics. The focus of this work is to mitigate the knowledge gap between new users and the sophisticated configurations of Hadoop and its default MapReduce layer. Moreover, Starfish can adapt to user needs and system workloads for better performance. Starfish builds on work on self-tuning database systems. Nevertheless, it is not clear how well Starfish can react to high-rate streaming data [1].

Ori Brafman and Rod Beckstrom, both entrepreneurs, observed that a decentralized organization (the starfish, as opposed to the spider) can be defined by several characteristics.

III. DESCRIPTION

A. Architecture review

The tuning challenges present at each level of workload processing led us to the Starfish architecture. Broadly, the functionality of the components in this architecture can be categorized into job-level tuning, workflow-level tuning, and workload-level tuning. These components interact to provide Starfish’s self-tuning capabilities. Starfish enables Hadoop users and applications to get good performance automatically.

1) Job-level tuning: It has three components.

i) Just-in-Time Optimizer: Starfish’s Just-in-Time Optimizer addresses unique optimization problems in order to automatically select efficient execution techniques for MapReduce jobs. The term “just-in-time” captures the online nature of the decisions forced on the optimizer by Hadoop’s MADDER features. The optimizer takes the help of the Profiler and the Sampler.

ii) Job profile/Profiler: The Profiler uses a technique called dynamic instrumentation to learn performance models, called job profiles, for unmodified MapReduce programs written in languages like Java and Python.

iii) Sampler: The Sampler efficiently collects statistics about the input, intermediate, and output key-value spaces of a MapReduce job. A unique feature of the Sampler is that it can sample the execution of a MapReduce job in order to enable the Profiler to collect approximate job profiles at a fraction of the full job execution cost.

2) Workflow-level tuning: Workflow execution brings out some critical and unanticipated interactions between the MapReduce task scheduler and the underlying distributed file system. Significant performance gains are realized in parallel task scheduling by moving the computation to the data. Efficient scheduling of a Hadoop workflow is further complicated by concerns like avoiding cascading re-execution under node failure or data corruption, ensuring power-proportional computing, and adapting to imbalance in load or cost of energy across geographic regions and time at the datacenter level. Workflow-level tuning has two main components: the Workflow-aware Scheduler and the What-if Engine. Starfish’s Workflow-aware Scheduler addresses such concerns in conjunction with the What-if Engine and the Data Manager. This scheduler communicates with, but operates outside, Hadoop’s internal task scheduler.

3) Workload-level tuning:

i) Workload Optimizer: Given a workload consisting of a collection of workflows, Starfish’s Workload Optimizer generates an equivalent, but optimized, collection of workflows that are handed off to the Workflow-aware Scheduler for execution. Three important categories of optimization are data-flow sharing, materialization, and reorganization.

ii) Elastisizer: Starfish’s Elastisizer automates provisioning decisions, such as how many machines and what type of machines to use. The intelligence in the Elastisizer comes from a search strategy in combination with the What-if Engine, which uses a mix of simulation and model-based estimation to answer what-if questions regarding workload performance on a specified cluster configuration.
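The interplay between a search strategy and a what-if estimator, as described for the Elastisizer, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ClusterConfig` fields, the "waves of tasks" runtime model, and the prices are hypothetical and are not taken from Starfish, whose What-if Engine relies on job profiles, simulation, and model-based estimation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical candidate cluster; not a Starfish data structure."""
    nodes: int
    slots_per_node: int
    price_per_node_hour: float  # assumed price, for illustration only


def what_if_runtime_hours(num_tasks: int, task_minutes: float,
                          cfg: ClusterConfig) -> float:
    """Toy what-if model: tasks run in waves across all task slots."""
    slots = cfg.nodes * cfg.slots_per_node
    waves = -(-num_tasks // slots)  # ceiling division
    return waves * task_minutes / 60.0


def what_if_cost(runtime_hours: float, cfg: ClusterConfig) -> float:
    """Estimated cost of keeping every node up for the whole run."""
    return runtime_hours * cfg.nodes * cfg.price_per_node_hour


def cheapest_config(
    num_tasks: int,
    task_minutes: float,
    candidates: Iterable[ClusterConfig],
    deadline_hours: float,
) -> Optional[Tuple[ClusterConfig, float, float]]:
    """Exhaustive search: cheapest candidate whose estimate meets the deadline."""
    best = None
    for cfg in candidates:
        runtime = what_if_runtime_hours(num_tasks, task_minutes, cfg)
        if runtime > deadline_hours:
            continue  # estimated to miss the time budget
        cost = what_if_cost(runtime, cfg)
        if best is None or cost < best[2]:
            best = (cfg, runtime, cost)
    return best


if __name__ == "__main__":
    candidates = [ClusterConfig(n, 2, 0.5) for n in (5, 10, 20)]
    print(cheapest_config(450, 3.0, candidates, deadline_hours=1.5))
```

The search here is a plain enumeration over a handful of candidates. The point of the sketch is the separation of concerns: the search only asks what-if questions, so a better estimator can be substituted without changing the search.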


Fig: Starfish architecture

B. ACHIEVEMENTS: Using the Starfish architecture, which is built on the Hadoop architecture, the system can:
• Perform in-depth job analysis with profiles
• Predict the behavior of hypothetical job executions
• Optimize arbitrary MapReduce programs

C. STARFISH FEATURES: Starfish addresses these challenges using a combination of techniques from cost-based database query optimization, robust and adaptive query processing, static and dynamic program analysis, dynamic data sampling and run-time profiling, and statistical machine learning applied to streams of system instrumentation data. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance. Starfish is used to visualize, optimize, and strategize Hadoop applications.

1) VISUALIZE: Starfish provides visualizations that enable Hadoop users to see how their MapReduce applications are performing, understand the bottlenecks existing in Hadoop, and find Hadoop parameters that may have been misconfigured. Overall, with regular use of Starfish, application developers become more efficient and learn to develop better MapReduce applications.

2) OPTIMIZE: Hadoop contains many dozens of configuration parameters that affect the performance of MapReduce applications written in languages like Java, Hive, Pig, Cascading, and others. There are literally billions of possible combinations of settings for these parameters, and the operator of every Hadoop cluster struggles with how to configure them. Starfish addresses this problem by providing automatic health checks that use profiling data from the Hadoop cluster to identify misconfigured parameters. Starfish also recommends settings for all these parameters, and in many cases it self-tunes them, simplifying the task of tuning application performance.

3) STRATEGIZE: Another major challenge today is how to allocate resources for Hadoop and within Hadoop. Intelligent resource allocation is needed in order to meet application requirements, such as completion time and dollar cost, when running Hadoop clusters on platforms such as Amazon EC2. Starfish helps to find optimal EC2 instances for workloads and to meet time and cost budgets with ease.

D. STARFISH USAGE: Using Starfish is easy. It consists of three steps:
1. Collect profiling data from your Hadoop cluster. Starfish supports different forms of profiling data, such as job history logs or Hadoop logs, MR runtimes, and Ganglia metrics.
2. Import the profiling data into a profile store, such as a local cache system, Amazon S3, or an RDBMS.
3. Start the graphical or command-line interface to invoke the visualize, optimize, and strategize features.
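To make the idea of a profile-driven health check from the Optimize feature more concrete, the following minimal sketch applies two rules of thumb to profiling counters. The rules, the threshold, and the profile keys (`map_output_records`, `map_spilled_records`, `shuffle_bytes`) are hypothetical stand-ins chosen for illustration; they are not Starfish's actual checks. The parameter names `io.sort.mb` and `mapred.reduce.tasks` are the Hadoop 0.20-era names of the map-side sort buffer size and the number of reduce tasks.

```python
from typing import Dict, List, Tuple

GIB = 1024 ** 3  # bytes in one gibibyte


def health_check(profile: Dict[str, int],
                 config: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (parameter, reason) pairs for settings that look suspicious.

    Both rules are illustrative assumptions, not Starfish's rules.
    """
    findings: List[Tuple[str, str]] = []

    # Rule 1 (assumed): if map tasks spill more records than they emit,
    # records were written to disk more than once, which suggests the
    # sort buffer is too small for the map output.
    if profile["map_spilled_records"] > profile["map_output_records"]:
        findings.append(
            ("io.sort.mb", "map output is spilled to disk more than once")
        )

    # Rule 2 (assumed): a single reduce task receiving more than 1 GiB
    # of shuffled data is likely to be a bottleneck.
    reducers = config.get("mapred.reduce.tasks", 1)
    if reducers == 1 and profile["shuffle_bytes"] > GIB:
        findings.append(
            ("mapred.reduce.tasks", "one reducer handles all shuffled data")
        )

    return findings


if __name__ == "__main__":
    profile = {
        "map_output_records": 100,
        "map_spilled_records": 300,
        "shuffle_bytes": 5 * GIB,
    }
    config = {"io.sort.mb": 100, "mapred.reduce.tasks": 1}
    for parameter, reason in health_check(profile, config):
        print(f"{parameter}: {reason}")
```

Returning the parameter name together with a reason keeps the check usable both for a report shown to the user and as input to an automatic tuning step.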

IV. CONCLUSION AND FUTURE WORK

Hadoop is now a viable competitor to existing systems for big data analytics. While Hadoop currently trails existing systems in peak query performance, a number of research efforts are addressing this issue. Starfish fills a different void by enabling Hadoop users and applications to get good performance automatically throughout the data lifecycle in analytics, without any need on their part to understand and manipulate the many tuning knobs available. A system like Starfish is essential as Hadoop usage continues to grow beyond companies like Facebook and Yahoo! that have considerable expertise in Hadoop. New practitioners of big data analytics, like computational scientists and systems researchers, lack the expertise to tune Hadoop to get good performance.

Starfish’s tuning goals and solutions are related to projects like Hive, Manimal, MRShare, Nectar, Pig, Quincy, and Scope. The novelty in Starfish’s approach comes from how it focuses simultaneously on different workload granularities (overall workload, workflows, and jobs, both procedural and declarative) as well as across various decision points (provisioning, optimization, scheduling, and data layout). This approach enables Starfish to handle the significant interactions arising among choices made at different levels.

REFERENCES
[1] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, 2011, pp. 261–272.
[2] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah. PVLDB, 3(1), 2010.
[3] Hadoop MapReduce Tutorial. http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html.
[4] M. Ferguson. Architecting a Big Data Platform for Analytics. A whitepaper prepared for IBM, 2012. http://public.dhe.ibm.com/common/ssi/ecm/en/iml14333usen/IML14333USEN.PDF
[5] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1), 2009.
[6] Pluggable Block Placement Policies in HDFS. issues.apache.org/jira/browse/HDFS-385.
[7] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. PVLDB, 3(1), 2010.
[8] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD Skills: New Analysis Practices for Big Data. PVLDB, 2(2), 2009.
