Processing Big Data with Hive

What is Hive?


• A metadata service that projects tabular schemas over folders
• Enables the contents of folders to be queried as tables, using SQL-like query semantics
• Queries are translated into jobs – the execution engine can be Tez or MapReduce

set hive.execution.engine=mr;
set hive.execution.engine=tez;

[Diagram: the same SELECT query executed as a chain of separate Map and Reduce jobs on the MapReduce engine, versus a single DAG of Map and Reduce vertices on the Tez engine]
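For example, a minimal sketch (the table and column names are hypothetical) that runs the same aggregation on each engine:

  set hive.execution.engine=mr;
  SELECT col1, COUNT(*) AS cnt FROM mytable GROUP BY col1;  -- executed as MapReduce jobs

  set hive.execution.engine=tez;
  SELECT col1, COUNT(*) AS cnt FROM mytable GROUP BY col1;  -- executed as a Tez DAG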

Hive client tools include:
• Hive Shell
• Visual Studio
• Query Console (Hue)
• PowerShell
• Any ODBC client

How do I create and load Hive tables?

• Use the CREATE TABLE HiveQL statement
  – Defines schema metadata to be projected onto the data in a folder when the table is queried (not when it is created)

• Specify the file format and file location
  – Defaults to TEXTFILE format in a folder under /hive/warehouse
  – The default database is in /hive/warehouse; create additional databases using CREATE DATABASE (see the sketch after the table examples below)

• Create internal or external tables
  – Internal tables manage the lifetime of the underlying folders
  – External tables are managed independently from the folders

CREATE TABLE table1 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
-- Internal table in the default location (/hive/warehouse/table1);
-- the folder is deleted when the table is dropped

CREATE TABLE table2 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table2';
-- Stored in a custom folder, but still internal, so the folder
-- is deleted when the table is dropped

CREATE EXTERNAL TABLE table3 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table3';
-- External table: the folders and files are left intact in
-- Azure Blob Store when the table is dropped
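As a brief sketch of the CREATE DATABASE statement mentioned above (the database and table names are hypothetical):

  CREATE DATABASE salesdb;
  USE salesdb;
  CREATE TABLE orders (order_id INT, amount DOUBLE);
  -- salesdb.orders is stored under /hive/warehouse/salesdb.db/orders by default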

Hive data types:
• Numeric
  – Integers: TINYINT, SMALLINT, INT, BIGINT
  – Fractional: FLOAT, DOUBLE, DECIMAL

• Character – STRING, VARCHAR, CHAR

• Date/Time – TIMESTAMP – DATE

• Special – BOOLEAN, BINARY, ARRAY, MAP, STRUCT, UNIONTYPE
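As an illustrative sketch (the table, columns, and delimiters are hypothetical), the complex types can be combined with delimited storage:

  CREATE TABLE employees (
    name       STRING,
    salary     DECIMAL(10,2),
    skills     ARRAY<STRING>,
    properties MAP<STRING, STRING>,
    address    STRUCT<street:STRING, city:STRING>
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';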

• Save data files in the table's folder (or create the table on existing files!)
  PUT myfile.txt /data/table1

• Use the LOAD statement
  LOAD DATA [LOCAL] INPATH '/data/source' INTO TABLE MyTable;

• Use the INSERT statement
  INSERT INTO TABLE Table2
  SELECT Col1, UPPER(Col2) FROM Table1;

• Use a CREATE TABLE AS SELECT (CTAS) statement
  CREATE TABLE Table3
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/data/summarytable'
  AS
  SELECT Col1, SUM(Col2) AS Total
  FROM Table1
  GROUP BY Col1;
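A related sketch, reusing the tables above: INSERT INTO appends rows, while INSERT OVERWRITE replaces the table's existing contents:

  INSERT OVERWRITE TABLE Table2
  SELECT Col1, UPPER(Col2) FROM Table1;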

How do I query Hive tables?

• Query data using the SELECT HiveQL statement
  SELECT Col1, SUM(Col2) AS TotalCol2
  FROM MyTable
  WHERE Col3 = 'ABC' AND Col4 < 10
  GROUP BY Col1
  ORDER BY Col1;

• Hive translates the query into jobs and applies the table schema to the underlying data files
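To see the jobs a query will be translated into, you can prefix it with EXPLAIN (the output format depends on the execution engine):

  EXPLAIN
  SELECT Col1, SUM(Col2) AS TotalCol2
  FROM MyTable
  GROUP BY Col1;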

• Views are named queries that abstract underlying tables
  CREATE VIEW v_SummarizedData AS
  SELECT col1, SUM(col2) AS TotalCol2
  FROM mytable
  GROUP BY col1;

  SELECT col1, TotalCol2 FROM v_SummarizedData;

Partitioning, Skewing, and Clustering Tables

• Partitioned tables store each partition value in its own subfolder

  CREATE TABLE part_table (col1 INT, col2 STRING)
  PARTITIONED BY (col3 STRING);

  -- Static partition insert: load one partition at a time
  INSERT INTO TABLE part_table PARTITION(col3='A')
  SELECT col1, col2 FROM stg_table WHERE col3 = 'A';

  -- Dynamic partition insert: partitions are created from the data
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;

  INSERT INTO TABLE part_table PARTITION(col3)
  SELECT col1, col2, col3 FROM stg_table;

  [Diagram: part_table with one subfolder per partition value: col3='A', col3='B', col3='C']

• Skewed tables separate the rows with heavily skewed values

  CREATE TABLE skewed_table (col1 INT, col2 STRING, col3 STRING)
  SKEWED BY (col3) ON ('A') [STORED AS DIRECTORIES];

  INSERT INTO TABLE skewed_table
  SELECT col1, col2, col3 FROM stg_table;

  [Diagram: skewed_table splits the rows where col3='A' from the others]

• Clustered tables hash rows into a fixed number of buckets

  CREATE TABLE clust_table (col1 INT, col2 STRING, col3 STRING)
  CLUSTERED BY (col3) INTO 3 BUCKETS;

  INSERT INTO TABLE clust_table
  SELECT col1, col2, col3 FROM stg_table;

  [Diagram: clust_table with rows distributed across 3 buckets by col3]
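As a hedged usage sketch: filtering on the partition column prunes the query to a single folder, and on Hive versions before 2.0 bucketed inserts may additionally require hive.enforce.bucketing:

  SET hive.enforce.bucketing = true;  -- not needed on Hive 2.0+, where it is always on

  SELECT col1, col2
  FROM part_table
  WHERE col3 = 'A';  -- partition pruning: only the col3='A' subfolder is scanned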

How do I use Hive in Visual Studio?

• The Azure SDK for .NET includes HDInsight tools for Visual Studio
  – Visual Hive table designer
  – Hive query editor

How do I access Hive via ODBC?

1. Download and install the Hive ODBC Driver for HDInsight
   – 32-bit and 64-bit versions are available
2. Optionally, create a data source name (DSN) for your HDInsight cluster
3. Use an ODBC connection to query Hive tables
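For illustration only, a DSN-less connection string for the driver might resemble the following; the exact key names vary by driver version and the cluster name here is hypothetical, so consult the driver's documentation for the authoritative settings:

  Driver={Microsoft Hive ODBC Driver};Host=mycluster.azurehdinsight.net;Port=443;UID=admin;PWD=<password>;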

©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
