Implementing Predictive Analytics with Spark in Azure HDInsight
Lab 3 – Evaluating Supervised Learning Models

Overview
In this lab, you will use Spark to evaluate classification and regression models. You will then tune parameters to optimize the performance of your models.

What You’ll Need
To complete the labs, you will need the following:

• A web browser
• A Microsoft account
• A Microsoft Azure subscription
• A Windows, Linux, or Mac OS X computer
• Azure Storage Explorer
• The lab files for this course
• A Spark 2.0 HDInsight cluster

Note: If you have not already done so, set up the required environment for the lab by following the instructions in the Setup document for this course. Then follow the instructions in Lab 1 to provision an HDInsight cluster.

Evaluating a Classification Model
First, you will evaluate a classification model that predicts whether or not a flight will be late.

Upload Source Data to Azure Storage
Note: If you have already uploaded the flights.csv data file to your Azure storage container, you can skip this procedure.

In this lab, you will build a model based on data about flights. Before you can do this, you must store the flight data files in the shared storage used by your cluster. The instructions here assume you will use Azure Storage Explorer to do this, but you can use any Azure Storage tool you prefer.

1. In the folder where you extracted the lab files for this course on your local computer, in the data folder, verify that the flights.csv file exists. This file contains flight data that has been cleaned and prepared for modeling.

2. Start Azure Storage Explorer, and if you are not already signed in, sign into your Azure subscription.
3. Expand your storage account and the Blob Containers folder, and then double-click the blob container for your HDInsight cluster.
4. In the Upload drop-down list, click Upload Files. Then upload flights.csv as a block blob to a folder named data in the root of the container.
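
The notebooks in this lab read flights.csv from the cluster's default storage. As a minimal sketch (not the code in the lab notebooks), the following Python snippet shows how a file uploaded to the data folder could be loaded into a Spark DataFrame; the wasb:/// path assumes the default blob container for your cluster.

from pyspark.sql import SparkSession

# Connect to the Spark session (created automatically in a Jupyter notebook
# on the cluster, but getOrCreate() also works in a standalone script).
spark = SparkSession.builder.getOrCreate()

# Load the CSV file that was uploaded to the data folder of the default container.
flights = spark.read.csv("wasb:///data/flights.csv", header=True, inferSchema=True)
flights.printSchema()
flights.show(5)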

Upload a Jupyter Notebook
You will use a Jupyter Notebook to create and evaluate your classification model. You can choose to work with Python or Scala.

1. In the Azure portal, in the blade for your HDInsight cluster, under Quick Links, click Cluster Dashboards.
2. Click Jupyter Notebook, and if prompted, log in using the cluster login name you specified when provisioning your cluster (be sure to use the login name for HTTP connections, not the SSH user name).
3. Click Upload, and browse to the Lab03 folder in the folder where you extracted the lab files. Then select either Python Classification Evaluation.ipynb or Scala Classification Evaluation.ipynb, depending on your preferred choice of language, and upload it.

Evaluate a Classification Model
Now that you have uploaded the notebook, you can use the code it contains to evaluate your model.

1. Open the notebook you uploaded, and then read the notes and run the code it contains to build and evaluate a classification model.
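
The following Python snippet is a minimal sketch of the kind of evaluation the notebook performs, assuming the flights DataFrame loaded earlier; the column names (DepDelay as a feature, Late as a 0/1 label) and the choice of logistic regression are illustrative assumptions rather than the notebook's actual code.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assemble the (assumed) feature column into a vector and keep the label.
assembler = VectorAssembler(inputCols=["DepDelay"], outputCol="features")
data = assembler.transform(flights).select("features", "Late")

# Hold back 30% of the data for testing.
train, test = data.randomSplit([0.7, 0.3], seed=0)

# Train a classifier on the training set and score the held-out test set.
model = LogisticRegression(featuresCol="features", labelCol="Late").fit(train)
predictions = model.transform(test)

# BinaryClassificationEvaluator reports area under the ROC curve by default;
# values closer to 1.0 indicate a better classifier.
evaluator = BinaryClassificationEvaluator(labelCol="Late")
print("AUC =", evaluator.evaluate(predictions))

Evaluating against data that was held out from training gives a more realistic estimate of how the model will perform on new flights than measuring its performance on the training data itself.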

Evaluating a Regression Model
Having evaluated a classification model that predicts whether or not a flight will be late, you will now evaluate a regression model that predicts how late (or early) flights will arrive.

Evaluate a Regression Model
You will use a Jupyter Notebook to create and evaluate your regression model. You can choose to work with Python or Scala.

1. From the Lab03 folder in the folder where you extracted the lab files, upload Python Regression Evaluation.ipynb or Scala Regression Evaluation.ipynb, depending on your preferred choice of language, to the Jupyter dashboard for your cluster.
2. Open the notebook you uploaded, and then read the notes and run the code it contains to build and evaluate a regression model.
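
As a minimal sketch of regression evaluation (again assuming the flights DataFrame and illustrative column names, with ArrDelay as the numeric label), the following Python snippet shows how a regression model can be scored with RegressionEvaluator:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Assemble the (assumed) feature column and keep the numeric label.
assembler = VectorAssembler(inputCols=["DepDelay"], outputCol="features")
data = assembler.transform(flights).select("features", "ArrDelay")

train, test = data.randomSplit([0.7, 0.3], seed=0)

# Train a linear regression model and predict arrival delay for the test set.
model = LinearRegression(featuresCol="features", labelCol="ArrDelay").fit(train)
predictions = model.transform(test)

# Root Mean Squared Error (RMSE) is reported in the same units as the label,
# so here it represents the typical error in minutes of arrival delay.
evaluator = RegressionEvaluator(labelCol="ArrDelay", metricName="rmse")
print("RMSE =", evaluator.evaluate(predictions))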

Tuning Parameters
You can optimize the performance of a model by tuning the parameters that you specify when training it. In this exercise, you will explore two common techniques for tuning parameters.

Tune Parameters using a Training/Validation Split
You will use a Jupyter Notebook to tune the parameters for your model. You can choose to work with Python or Scala.

1. From the Lab03 folder in the folder where you extracted the lab files, upload Python Parameter Tuning.ipynb or Scala Parameter Tuning.ipynb, depending on your preferred choice of language, to the Jupyter dashboard for your cluster.
2. Open the notebook you uploaded, and then read the notes and run the code it contains to build a classification model and use the TrainValidationSplit class to tune its parameters.
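
The snippet below is a minimal sketch of the training/validation split technique using Spark ML's TrainValidationSplit class. The estimator, the grid of candidate parameter values, and the column names are illustrative assumptions rather than the notebook's actual code, and the flights DataFrame is assumed to have been loaded as shown earlier.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare the data (assumed columns: DepDelay feature, Late label).
assembler = VectorAssembler(inputCols=["DepDelay"], outputCol="features")
data = assembler.transform(flights).select("features", "Late")
train, test = data.randomSplit([0.7, 0.3], seed=0)

lr = LogisticRegression(featuresCol="features", labelCol="Late")

# Candidate parameter values to try.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1, 0.5])
              .addGrid(lr.maxIter, [5, 10])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="Late")

# trainRatio=0.8 holds back 20% of the training data for validation; the
# best-scoring parameter combination is then refit on all the training data.
tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=param_grid,
                           evaluator=evaluator, trainRatio=0.8)
best_model = tvs.fit(train).bestModel

print("AUC =", evaluator.evaluate(best_model.transform(test)))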

Tune Parameters using Cross-Validation
You will use a Jupyter Notebook to create and tune your model. You can choose to work with Python or Scala.

1. From the Lab03 folder in the folder where you extracted the lab files, upload Python Cross Validation.ipynb or Scala Cross Validation.ipynb, depending on your preferred choice of language, to the Jupyter dashboard for your cluster.
2. Open the notebook you uploaded, and then read the notes and run the code it contains to build a classification model and use the CrossValidator class to tune its parameters.
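
The snippet below is a minimal sketch of the same tuning idea using Spark ML's CrossValidator class (k-fold cross-validation); as before, the estimator, grid values, and column names are illustrative assumptions rather than the notebook's actual code.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Prepare the data (assumed columns: DepDelay feature, Late label).
assembler = VectorAssembler(inputCols=["DepDelay"], outputCol="features")
data = assembler.transform(flights).select("features", "Late")
train, test = data.randomSplit([0.7, 0.3], seed=0)

lr = LogisticRegression(featuresCol="features", labelCol="Late")

param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1, 0.5])
              .build())

evaluator = BinaryClassificationEvaluator(labelCol="Late")

# numFolds=3 evaluates each parameter combination on three different
# train/validation partitions, which is more robust than a single split
# but requires proportionally more training runs.
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=3)
best_model = cv.fit(train).bestModel

print("AUC =", evaluator.evaluate(best_model.transform(test)))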

Clean Up
If you intend to proceed straight to the next lab, skip this section. Otherwise, follow the steps below to delete your cluster and avoid being charged for cluster resources when you are not using them.

Delete the Resource Group
1. Close the browser tab containing the Jupyter Notebooks dashboard if it is open.
2. In the Azure portal, view your Resource groups and select the resource group you created for your cluster. This resource group contains your cluster and the associated storage account.
3. In the blade for your resource group, click Delete. When prompted to confirm the deletion, enter the resource group name and click Delete.
4. Wait for a notification that your resource group has been deleted.
5. Close the browser.
