CS2950-U Project Proposal: OLAP in MapReduce
Alex Kalinin
akalinin@cs.brown.edu
September 30, 2011

1 Introduction

Decision-support applications need to run ad-hoc queries over large amounts of data. One approach is to use On-Line Analytical Processing (OLAP) frameworks. We are particularly interested in Relational OLAP (ROLAP). In ROLAP, every tuple has a set of attributes called dimensions and another set of attributes called measures. The main idea is that measures can be aggregated over every possible subset of the dimension attributes. A common example involves a Sales table with the schema {Product, Customer, Supplier, Sales}. A user can explore this dataset by aggregating (e.g., SUM(Sales)) over any subset of the three dimension attributes. For example, (Product, Customer) gives the total sales for every product and customer pair, summed over all suppliers. Alternatively, a user can ask for the total sales of every product by picking the subset (Product). The number of possible aggregations is therefore 2^n, where n is the number of dimension attributes. The collection of all aggregations is called a cube [1].

The components of a cube (i.e., the aggregates over individual subsets of dimensions) form a lattice with a well-known structure. An important optimization consists of materializing some nodes of the lattice by precomputing the corresponding aggregates. Other nodes can then be computed from the materialized ones. In particular, a node can be computed from any of its ancestors, i.e., any node that groups by a superset of its dimensions. There are a number of cost models for picking the nodes to materialize. We, however, are interested in a simple one [2], where the only information needed is the approximate cardinality of each aggregate (node). Note that it is usually not possible to materialize all the aggregations because of the huge amounts of data involved.

Another important concept that comes from OLAP is slicing and dicing. When looking at a particular node of the cube, like (Product, Customer), the user can make a selection on one of the dimension attributes. For example, Product = "laptop" results in the sales of laptops for all customers.
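To make the cube, lattice, and slicing concepts concrete, the following is a minimal single-machine sketch in Python. The toy rows and the helper names (aggregate, cube, roll_up, slice_node) are assumptions made for illustration only and are not part of the proposed system. It computes all 2^n group-bys for the Sales example, derives a coarser node from one of its ancestors, and applies a selection.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("Product", "Customer", "Supplier")

# Hypothetical toy rows: (Product, Customer, Supplier, Sales).
SALES = [
    ("laptop", "alice", "acme", 1200),
    ("laptop", "bob", "acme", 1000),
    ("phone", "alice", "globex", 600),
    ("laptop", "alice", "globex", 1100),
]


def aggregate(rows, dims, group_by):
    """SUM the last column of each row, grouped by the attributes in group_by.

    dims names the dimension columns of rows, in order.
    """
    idx = [dims.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def cube(rows, dims):
    """Compute the aggregate for every subset of dims (2^n nodes)."""
    return {
        subset: aggregate(rows, dims, subset)
        for k in range(len(dims) + 1)
        for subset in combinations(dims, k)
    }


def roll_up(node, node_dims, group_by):
    """Compute a coarser node from an already computed ancestor node."""
    rows = [key + (total,) for key, total in node.items()]
    return aggregate(rows, node_dims, group_by)


def slice_node(node, node_dims, attr, value):
    """Keep only the groups of a node where attr equals value."""
    i = node_dims.index(attr)
    return {key: total for key, total in node.items() if key[i] == value}


full_cube = cube(SALES, DIMENSIONS)
by_product_customer = full_cube[("Product", "Customer")]

# (Product) derived from its ancestor (Product, Customer).
by_product = roll_up(by_product_customer, ("Product", "Customer"), ("Product",))

# Slice: Product = "laptop" on the (Product, Customer) node.
laptops = slice_node(by_product_customer, ("Product", "Customer"), "Product", "laptop")
```

Because SUM is distributive, rolling up from (Product, Customer) gives the same result as aggregating the base table directly, which is what makes materializing ancestors useful.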

2 Tentative Proposal

The main idea is to implement an application that takes input data and answers aggregate and selection queries using a parallel framework (e.g., MapReduce) as the backend. In particular, the application will be able to:

1. Take a specification and the input data as a set of tuples with dimension and measure attributes. The format of the input data is not important: CSV, relational table, etc.
2. Sample a fraction of the data using MapReduce to get approximate cardinalities for the nodes in the lattice. This makes it possible to determine the initial nodes to materialize.
3. Materialize some of the nodes of the cube lattice in parallel using MapReduce.
4. Answer queries of the form (d_1, ..., d_n, s_1, ..., s_n), where the d_i are the chosen dimensions and the s_i are selection attributes. The queries, again, are answered using MapReduce.
5. Keep usage statistics and, based on them, discard some materializations and compute new ones that are deemed more useful. This should happen in the background using MapReduce.

The optional part consists of choosing storage for the materialized results. The simplest option is to store them in HDFS files and keep track of the files in the application. We are going to investigate other, BigTable-like options (e.g., HBase) to see whether they bring any benefits.
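Steps 2 to 4 can be sketched as follows. This is only an illustration under stated assumptions: run_mapreduce is a hypothetical single-process stand-in for a real MapReduce job (it is not the Hadoop API), the records are toy data, and the function names are invented for this sketch. The sampled distinct count is a crude estimate that generally undercounts the true cardinality, and the query routine assumes the node over all dimensions is always materialized so that some candidate exists.

```python
import random
from collections import defaultdict

DIMS = ("Product", "Customer", "Supplier")

# Hypothetical toy input tuples.
RECORDS = [
    {"Product": "laptop", "Customer": "alice", "Supplier": "acme", "Sales": 1200},
    {"Product": "laptop", "Customer": "bob", "Supplier": "acme", "Sales": 1000},
    {"Product": "phone", "Customer": "alice", "Supplier": "globex", "Sales": 600},
    {"Product": "laptop", "Customer": "alice", "Supplier": "globex", "Sales": 1100},
]


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def materialize(records, node):
    """Step 3: SUM(Sales) grouped by the dimensions in node."""
    def mapper(record):
        yield tuple(record[d] for d in node), record["Sales"]
    return run_mapreduce(records, mapper, lambda key, values: sum(values))


def sampled_cardinality(records, node, fraction, seed=0):
    """Step 2: count the distinct groups of node in a random sample."""
    rng = random.Random(seed)
    sample = [r for r in records if rng.random() < fraction]
    def mapper(record):
        yield tuple(record[d] for d in node), 1
    return len(run_mapreduce(sample, mapper, lambda key, values: 1))


def answer(materialized, dims, selections):
    """Step 4: answer a query from the smallest materialized covering node."""
    needed = set(dims) | set(selections)
    candidates = [n for n in materialized if needed <= set(n)]
    source = min(candidates, key=lambda n: len(materialized[n]))
    # Turn the materialized groups back into records, then filter and re-aggregate.
    rows = [dict(zip(source, key), Sales=total)
            for key, total in materialized[source].items()]
    rows = [r for r in rows if all(r[a] == v for a, v in selections.items())]
    return materialize(rows, dims)


materialized = {
    DIMS: materialize(RECORDS, DIMS),
    ("Product", "Customer"): materialize(RECORDS, ("Product", "Customer")),
}

# Sales per customer, restricted to Product = "laptop".
laptop_sales = answer(materialized, ("Customer",), {"Product": "laptop"})
```

Here the query is served from the (Product, Customer) node rather than from the full-dimension node because it has fewer groups, which mirrors the cardinality-based reasoning behind choosing which nodes to materialize.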

3 Experiments

The data will probably come from the TPC-H benchmark lineitem table, possibly joined with other tables. We are not going to perform joins or any other fancy SQL operations in our implementation. The queries can also be taken from TPC-H, but in a simplified form: as a set of dimension attributes, selections, and measures. Ideally, we would compare the application with at least one existing parallel DBMS. However, since these systems are notoriously hard to set up properly and might not be available for free evaluation, the comparison might not be possible. As an alternative, we should be able to conduct experiments that measure scalability, the effect of the sampling factor, and the impact of materialization and dynamic materialization.

References

[1] Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proceedings of the Twelfth International Conference on Data Engineering, ICDE '96, pages 152–159, Washington, DC, USA, 1996. IEEE Computer Society.

[2] Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 205–216, New York, NY, USA, 1996. ACM.
