Graeme Malcolm | Snr Content Developer, Microsoft

• What is a Stream?
• What is Apache Storm?
• How is Storm Supported in Azure HDInsight?
• What is a Storm Topology?
• How is Event Data Defined?
• How Does Storm Distribute Stream Processing?
• How Does Storm Guarantee Message Processing?
• How Do I Aggregate Data in a Stream?

What is a Stream?

• An unbounded sequence of event data
• Stream processing is continuous
• Aggregation is based on temporal windows

What is Apache Storm?

• An event processor for data streams
• Defines a streaming topology that consists of:
  – Spouts: consume data sources and emit streams that contain tuples
  – Bolts: operate on tuples in streams
• Storm topologies run continuously on streams of data
  – Real-time monitoring
  – Event aggregation and logging

How is Storm Supported in Azure HDInsight?

• HDInsight supports a Storm cluster type
  – Choose the cluster type in the Azure Portal
• Storm clusters can be provisioned in a virtual network

DEMO Provisioning a Storm Cluster

What is a Storm Topology?

• Spouts emit tuples in streams
• Spouts can emit multiple streams
• Bolts process tuples
• Bolts can also emit tuples
• There can be multiple spouts and bolts in a topology
• Bolts can process multiple streams

How do I Create a Topology?

• Implement spout and bolt classes (sketched below)
  – The native language of Storm is Java
  – The Microsoft SCP.NET package enables development in C#
• Use a TopologyBuilder class to connect the components
• Build and package the code, and submit the topology to a Storm cluster

public class myspout { ... }
public class mybolt { ... }

TopologyBuilder tb = new TopologyBuilder();
tb.setSpout("spout", myspout, …);
tb.setBolt("bolt", mybolt, …).shuffleGrouping("spout");
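To make the structure concrete, here is a minimal, self-contained sketch of what those classes could look like in Java, assuming the pre-1.0 backtype.storm packages used by HDInsight Storm clusters of this era. The MyTopology, MySpout, and MyBolt names, the emitted sentence field, and the local-cluster submission are illustrative assumptions, not part of the course code.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class MyTopology {

    // Hypothetical spout: emits a constant sentence once per second as a one-field tuple.
    public static class MySpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello storm"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Hypothetical bolt: upper-cases the incoming sentence and re-emits it.
    public static class MyBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            collector.emit(new Values(input.getStringByField("sentence").toUpperCase()));
            collector.ack(input);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("shout"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Wire the components together and submit; LocalCluster is for local testing,
        // while a packaged JAR submitted to a Storm cluster uses StormSubmitter instead.
        TopologyBuilder tb = new TopologyBuilder();
        tb.setSpout("spout", new MySpout());
        tb.setBolt("bolt", new MyBolt()).shuffleGrouping("spout");
        new LocalCluster().submitTopology("my-topology", new Config(), tb.createTopology());
    }
}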

How is Event Data Defined?

• Declare a schema for each stream in each component (sketched below)
• The Java OutputFieldsDeclarer class defines the output schema for a stream
• Microsoft SCP.NET class templates include input and output schema declarations for spouts and bolts

[Diagram: a spout output stream with the schema Field1 (Integer), Field2 (String), Field3 (Integer), consumed as input by downstream bolts.]
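As a sketch of how the diagram's schema could be declared in a Java spout: the Field1/Field2/Field3 names and types come from the diagram, while the emitted values and the collector field are assumptions carried over from the earlier sketch.

// In the spout: declare the output schema shown in the diagram (Integer, String, Integer).
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("Field1", "Field2", "Field3"));
}

// Every emitted tuple must then match that schema, field for field.
public void nextTuple() {
    collector.emit(new Values(42, "some event text", 7));
}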

DEMO Creating a Storm Topology with C#

How Does Storm Distribute Stream Processing?

• The master node runs Nimbus
  – Assigns processing across the cluster
• Worker nodes run the Supervisor service
  – Manages processing on the node
• Cluster coordination is managed using Zookeeper
  – An Apache project for distributed coordination
• A topology has one or more worker processes
• A worker process spawns one or more executors (threads) per component
  – Set using a parallelism hint:

TopologyBuilder tb = new TopologyBuilder();
tb.setSpout("spout", myspout, 1, …);
tb.setBolt("bolt", mybolt, 3, …).shuffleGrouping("spout");

• Each executor runs one or more tasks (parallelism sketch below)

[Diagram: a topology runs in one or more worker processes; each worker process contains executors, and each executor runs one or more tasks.]
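A hedged sketch of how those three levels of parallelism could be set in code, reusing the hypothetical MySpout and MyBolt classes from the earlier sketch; the worker, executor, and task counts are illustrative assumptions.

Config conf = new Config();
conf.setNumWorkers(2);                    // 2 worker processes (JVMs) for the topology

TopologyBuilder tb = new TopologyBuilder();
tb.setSpout("spout", new MySpout(), 1);   // 1 executor (thread) for the spout
tb.setBolt("bolt", new MyBolt(), 3)       // 3 executors for the bolt...
  .setNumTasks(6)                         // ...running 6 tasks (2 tasks per executor)
  .shuffleGrouping("spout");

The parallelism hint sets the initial executor count, while the task count is fixed, which is what allows a running topology to be rebalanced across more or fewer executors later.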

• Use stream groupings to determine affinity between tasks
  – Shuffle grouping: tuples are distributed randomly across the bolt's tasks

TopologyBuilder tb = new TopologyBuilder();
tb.setSpout("spout", myspout, 1, …);
tb.setBolt("bolt", mybolt, 3, …).shuffleGrouping("spout");

  – Fields grouping: tuples with the same value for the named field(s) always go to the same task (see the word-count sketch below)

TopologyBuilder tb = new TopologyBuilder();
tb.setSpout("spout", myspout, 1, …);
tb.setBolt("bolt", mybolt, 3, …).fieldsGrouping("spout", new Fields("f1"));

  – Others: All, Global, …

[Diagram: with a fields grouping on f1, all tuples where f1=A go to one task, f1=B to another, and f1=C to another.]
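Why that routing guarantee matters: with a fields grouping, per-key state can live safely inside a single task. A minimal sketch under that assumption; the WordCountBolt name, the counting logic, and the output fields are illustrative, not course code.

import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    // Safe per-key state: a fields grouping on "f1" guarantees that all tuples with the
    // same f1 value arrive at this task, so no other task updates the same keys.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        String key = input.getStringByField("f1");
        Integer current = counts.get(key);
        int updated = (current == null ? 0 : current) + 1;
        counts.put(key, updated);
        collector.emit(new Values(key, updated));
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("f1", "count"));
    }
}

With a shuffle grouping, tuples for the same key would be scattered across the bolt's three tasks and the per-key counts would be split and inconsistent.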

DEMO Using the Parallelism Hint

How Does Storm Guarantee Message Processing?

• Non-transactional (no ack)
  – Enforces at-most-once semantics
  – Simplest programming model
  – Possible data loss
• Non-transactional (with ack)
  – Enforces at-least-once semantics: the spout emits each tuple with a sequence ID, and an acker task tracks it until it is acknowledged
  – Requires explicit retry logic (see the ack/fail sketch below)
• Transactional
  – Enforces exactly-once semantics
  – Works well for batches
  – Use TransactionalTopologyBuilder
  – Implement a committer bolt
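To illustrate the at-least-once model, a minimal sketch of a bolt that anchors its output to the input tuple and then acks or fails it; the ReliableBolt name and the upper-casing logic are assumptions, not course code.

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ReliableBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        try {
            // Anchoring (passing the input tuple to emit) links the new tuple to the
            // spout's original message, so a downstream failure triggers a replay.
            collector.emit(input, new Values(input.getString(0).toUpperCase()));
            collector.ack(input);   // tells the acker this tuple is fully processed
        } catch (Exception e) {
            collector.fail(input);  // asks the spout to replay the original tuple
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}

Because replays can deliver the same event more than once, downstream logic has to tolerate duplicates under this model.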



DEMO Implementing Guaranteed Message Processing

How Do I Aggregate Data in a Stream?

• Events are aggregated within temporal windows
• Use a tumbling window to aggregate events in a fixed timespan
  – For example: every hour, count the events in the preceding hour
• Use a sliding window to aggregate events in overlapping timespans
  – For example: every 10 minutes, count the events in the preceding hour
• Cache field values from each tuple
• Configure a tick tuple for the window duration (sketched below)
• On each tick, start a new window:
  – For a tumbling window:
    • Aggregate the cached fields
    • Delete all cached fields
  – For a sliding window:
    • Delete stale fields
    • Aggregate the remaining fields
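A minimal sketch of a tumbling-window count driven by a tick tuple, assuming a one-hour window and a hypothetical TumblingCountBolt class; a sliding window would cache timestamped values and delete only the stale ones on each tick instead of clearing everything.

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class TumblingCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    private long count;

    // Ask Storm to deliver a tick tuple to this bolt at the window boundary (every hour here).
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 3600);
        return conf;
    }

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        if (isTickTuple(input)) {
            // Tumbling window: emit the aggregate for the closed window, then reset the cache.
            collector.emit(new Values(count));
            count = 0;
        } else {
            // Cache/aggregate the incoming event (here, a simple running count).
            count++;
            collector.ack(input);
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("count"));
    }
}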

DEMO Implementing a Sliding Window

• What is a Stream?
• What is Apache Storm?
• How is Storm Supported in Azure HDInsight?
• What is a Storm Topology?
• How is Event Data Defined?
• How Does Storm Distribute Stream Processing?
• How Does Storm Guarantee Message Processing?
• How Do I Aggregate Data in a Stream?

©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
