CS2950-U Project Proposal: OLAP in MapReduce

Alex Kalinin
[email protected]
September 30, 2011
1 Introduction
Decision-support applications need to run ad-hoc queries over large amounts of data. One approach is to use On-Line Analytical Processing (OLAP) frameworks. We are particularly interested in ROLAP (Relational OLAP). In ROLAP, every tuple has a number of attributes called dimensions and another set of attributes called measures. The main idea is that measures can be aggregated over every possible subset of the dimension attributes. A common example involves a Sales table with the schema {Product, Customer, Supplier, Sales}. A user can explore this dataset by aggregating (e.g., SUM(Sales)) over any subset of those three attributes, such as (Product, Customer), which gives the total sales for every customer and product bought from all suppliers. Alternatively, a user can just ask for total sales for every product by picking the subset (Product). It can be seen that the number of possible aggregations is 2^n, where n is the number of dimension attributes. The collection of all such aggregations is called a cube [1].

All components of a cube (i.e., the subsets of aggregates) form a lattice with a well-known structure. An important optimization consists of materializing some nodes of the lattice by precomputing the corresponding aggregates. Other nodes can then be computed from the materialized ones; in particular, a node can be computed from any of its ancestors. There are a number of cost models for picking the nodes to materialize. We, however, are interested in a simple one [2], where the only information needed is the approximate cardinality of every aggregate (node). Note that it is usually not possible to materialize all the aggregations, due to the huge amounts of data involved.

Another important concept that comes from OLAP is slicing and dicing. When looking at a particular node, such as (Product, Customer), the user can make a selection on one of the dimension attributes. For example, Product = "laptop" results in the sales of laptops for all customers.
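To make the cube concrete, the following is a minimal sketch (in Python, for brevity; the Sales rows are made up for illustration) that enumerates all 2^n nodes of the lattice and computes SUM(Sales) at each one:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical Sales tuples with three dimension attributes and one measure.
DIMS = ("Product", "Customer", "Supplier")
rows = [
    {"Product": "laptop", "Customer": "Ann", "Supplier": "Acme", "Sales": 1200},
    {"Product": "laptop", "Customer": "Bob", "Supplier": "Acme", "Sales": 900},
    {"Product": "phone",  "Customer": "Ann", "Supplier": "Best", "Sales": 600},
]

def cube(rows):
    """Compute SUM(Sales) for every subset of the dimension attributes.

    Returns {subset_of_dims: {group_key: total_sales}} -- one dict per node
    of the lattice, 2^n nodes for n dimensions."""
    result = {}
    for k in range(len(DIMS) + 1):
        for dims in combinations(DIMS, k):
            agg = defaultdict(int)
            for r in rows:
                key = tuple(r[d] for d in dims)
                agg[key] += r["Sales"]
            result[dims] = dict(agg)
    return result

c = cube(rows)
print(len(c))                      # 2^3 = 8 lattice nodes
print(c[("Product",)])             # total sales per product
print(c[("Product", "Customer")])  # total sales per (product, customer) pair
```

Of course, computing the whole cube eagerly like this is exactly what becomes infeasible at scale, which motivates materializing only some nodes.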
2 Tentative Proposal
The main idea is to implement an application that takes input data and answers aggregate and selection queries using a parallel framework (e.g., MapReduce) as the backend. In particular, the application will be able to:

1. Take the specification and the input data as a number of tuples with dimension and measure attributes. The format of the input data is not important: CSV, relational table, etc.

2. Sample a fraction of the data using MapReduce to get approximate cardinalities for the nodes in the lattice. This will allow us to determine the initial nodes to materialize.
3. Materialize some of the nodes of the cube lattice in parallel using MapReduce.

4. Answer queries of the form (d1, ..., dn, s1, ..., sn), where the di are chosen dimensions and the si are selection attributes. The queries, again, are answered using MapReduce.

5. By keeping statistics, destroy some materializations and compute new ones that are deemed to be more useful. This should happen in the background using MapReduce.

The optional part consists of choosing a storage backend for the materialized results. The simplest option would be to store them in HDFS files and keep track of the files in the application. We are going to investigate other, BigTable-like options (e.g., HBase) to see if they could bring any benefits.
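The materialization and query-answering steps map naturally onto a single MapReduce pattern. The following is a minimal sketch of that pattern (plain Python stand-ins for the map and reduce phases; in the real system these would run as Hadoop jobs over HDFS data, and the shuffle here is simulated by a dictionary):

```python
from collections import defaultdict

# Hypothetical input tuples; in the real system these would come from HDFS.
rows = [
    {"Product": "laptop", "Customer": "Ann", "Sales": 1200},
    {"Product": "laptop", "Customer": "Bob", "Sales": 900},
    {"Product": "phone",  "Customer": "Ann", "Sales": 600},
]

def map_phase(rows, dims, selections):
    """Mapper: apply slice/dice selections, then emit (dimension-key, measure)."""
    for r in rows:
        if all(r[d] == v for d, v in selections.items()):
            yield tuple(r[d] for d in dims), r["Sales"]

def reduce_phase(pairs):
    """Reducer: SUM the measure for each distinct dimension key."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

# Materialize the (Product, Customer) node of the lattice (no selections).
node = reduce_phase(map_phase(rows, ("Product", "Customer"), {}))

# Answer a sliced query: sales per customer, restricted to Product = "laptop".
answer = reduce_phase(map_phase(rows, ("Customer",), {"Product": "laptop"}))
print(answer)  # {('Ann',): 1200, ('Bob',): 900}
```

The same job answers a query either from the raw data or from a materialized ancestor node, since an ancestor's output tuples have the same (key, measure) shape as the input.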
3 Experiments
The data will probably come from the TPC-H benchmark lineitem table, possibly joined with other tables beforehand. We are not going to perform joins or any other fancy SQL operations in our implementation. The queries can also be taken from TPC-H, but in a simplified form: as a number of dimension attributes, selections, and measures. Ideally, it would be great to compare the application with at least one existing parallel DBMS. However, since these DBMSs are notoriously hard to set up properly and might not be available for free evaluation, the comparison might not be possible. Alternatively, we will conduct experiments that measure scalability, the effect of the sampling factor, and the impact of materialization and dynamic materialization.
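The sampling factor mentioned above controls the cardinality estimates from step 2 of the proposal. A minimal sketch of that estimation (Bernoulli sampling in plain Python; the fraction, seed, and row layout are illustrative assumptions, and a production system would likely need a proper distinct-value estimator, since raw distinct counts over a sample underestimate the true cardinality):

```python
import random
from itertools import combinations

DIMS = ("Product", "Customer", "Supplier")

def estimate_cardinalities(rows, fraction, seed=0):
    """Estimate the number of distinct groups at each lattice node by
    counting distinct dimension keys in a Bernoulli sample of the rows.

    NOTE: this is a naive lower-bound estimate, for illustration only."""
    rng = random.Random(seed)
    sample = [r for r in rows if rng.random() < fraction]
    estimates = {}
    for k in range(len(DIMS) + 1):
        for dims in combinations(DIMS, k):
            keys = {tuple(r[d] for d in dims) for r in sample}
            estimates[dims] = len(keys)
    return estimates

# Synthetic rows: 5 products, 7 customers, 3 suppliers.
rows = [{"Product": f"p{i % 5}", "Customer": f"c{i % 7}", "Supplier": f"s{i % 3}"}
        for i in range(1000)]
est = estimate_cardinalities(rows, fraction=0.1)
print(est[("Product", "Customer")])  # estimated distinct (product, customer) groups
```

Varying `fraction` in such an experiment would show how small a sample still yields cardinalities good enough to pick a reasonable set of nodes to materialize.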
References

[1] Jim Gray, Adam Bosworth, Andrew Layman, and Hamid Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proceedings of the Twelfth International Conference on Data Engineering, ICDE '96, pages 152–159, Washington, DC, USA, 1996. IEEE Computer Society.

[2] Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 205–216, New York, NY, USA, 1996. ACM.