Processing Big Data with Azure Data Lake
Lab 5 – Final Challenge

Overview
This final lab is designed to test your understanding of the concepts and techniques discussed in the course by presenting you with some real challenges to solve for yourself. You will be given some source data and some high-level instructions, but it's up to you to figure out how to meet the required goals.

What You'll Need
To complete this lab, you will need the following:
- A web browser
- A Microsoft account
- A Microsoft Azure subscription
- A Windows, Linux, or Mac OS X computer
- The lab files for this course

Note: To set up the required environment for the lab, follow the instructions in the Setup document for this course. Then, if you have not already done so, provision an Azure Data Lake Analytics resource in your Azure subscription.

Challenge 1: Explore Stock Items Data
1. Examine stock.txt in the retail folder where you extracted the lab files for this course. This is a tab-delimited text file containing a list of stock items sold by a retailer. Each item has a numeric identifier and a name.
2. Upload stock.txt to your Azure Data Lake Store.
3. Create a database named retail in your Azure Data Lake Analytics instance. Then, in the database, create a schema named sales containing a table named stock. The stock table should contain the following columns, and an appropriate index:
   - id (int)
   - item (string)
4. Run a U-SQL job to load the data in stock.txt into the sales.stock table in the retail database.
5. Run a U-SQL job to query the sales.stock table and generate a comma-delimited text file that contains the id and item values of products where item contains the text "chocolate" in any case-format (for example, the results should include "Chocolate", "CHOCOLATE", and "chocolate").

   Hint: Review the documentation for the IndexOf, ToLower, and ToUpper methods of the C# String class at https://msdn.microsoft.com/en-us/library/system.string_methods(v=vs.110).aspx. These methods may be useful when filtering the data.
6. When the job has finished, note the number of rows that were output to the results file.
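As a reference point, the database, table, load, and query steps above can be sketched in U-SQL as follows. This is a minimal sketch, not the only valid solution: the store paths (/retail/stock.txt, /retail/chocolate-items.csv) and the index name are illustrative assumptions, and you may choose a different index or filter expression.

```sql
// Step 3: create the retail database, sales schema, and indexed stock table.
CREATE DATABASE IF NOT EXISTS retail;
USE DATABASE retail;
CREATE SCHEMA IF NOT EXISTS sales;
CREATE TABLE IF NOT EXISTS sales.stock
(
    id int,
    item string,
    INDEX idx_stock CLUSTERED (id)
    DISTRIBUTED BY HASH (id)   // a distribution scheme is required with the index
);

// Step 4: extract the tab-delimited source file and load it into the table.
@stock =
    EXTRACT id int,
            item string
    FROM "/retail/stock.txt"
    USING Extractors.Tsv();

INSERT INTO sales.stock
SELECT id, item FROM @stock;

// Step 5: case-insensitive filter using ToLower, output as CSV.
@chocolate =
    SELECT id, item
    FROM sales.stock
    WHERE item.ToLower().Contains("chocolate");

OUTPUT @chocolate
TO "/retail/chocolate-items.csv"
USING Outputters.Csv();
```

You can submit this as a single job, or split the DDL, load, and query into separate jobs as the steps describe.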

Challenge 2: Explore Orders Data
1. Examine the orders-XXXX.txt files in the retail\orders folder where you extracted the lab files for this course. These tab-delimited text files contain details of orders for the years 2013 to 2016. Each row represents an order line item, and includes a StockItemID field that references the id field in the stock.txt file you explored previously. A single order is identified by an OrderID field, so for orders that include multiple items, the same OrderID is referenced in multiple rows of the orders data file.
2. Upload the orders-XXXX.txt files to a folder in your Azure Data Lake Store.
3. Create a table named orders in the sales schema of the retail database, using the following schema with an index on the orderid field, and load the data from the orders-XXXX.txt files into the table:
   - orderid (int)
   - orderdate (string)
   - customername (string)
   - phonenumber (string)
   - deliveryaddressline1 (string)
   - deliveryaddressline2 (string)
   - cityname (string)
   - stockitemid (int)
   - quantity (int?)
   - unitprice (decimal?)
   Hint: The orders files are tab-delimited, and the first row contains the column names. Use the appropriate built-in extractor with a suitable value for the skipFirstNRows parameter to extract the data.
4. Write a query to return the year and the number of orders placed in each year. Output the results to a comma-delimited file in ascending order of year.
   Hint: The orderdate field is a string in the format YYYY-MM-DD, so you need to extract the first four characters of this to determine the year – take a look at the Substring method of the C# String class at https://msdn.microsoft.com/en-us/library/system.string_methods(v=vs.110).aspx. The challenge is to count the number of orders, not the number of order lines – so your query will need to count the number of DISTINCT orderid values for each year.
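A possible U-SQL shape for the steps above is sketched below. Assumptions to note: the upload folder /retail/orders and output path are illustrative, and the extract uses a file-set pattern (orders-{*}.txt) to read all four files in one statement – you could equally list the files individually.

```sql
USE DATABASE retail;

// Step 3: create the orders table with a clustered index on orderid.
CREATE TABLE IF NOT EXISTS sales.orders
(
    orderid int,
    orderdate string,
    customername string,
    phonenumber string,
    deliveryaddressline1 string,
    deliveryaddressline2 string,
    cityname string,
    stockitemid int,
    quantity int?,
    unitprice decimal?,
    INDEX idx_orders CLUSTERED (orderid)
    DISTRIBUTED BY HASH (orderid)
);

// Extract all orders-XXXX.txt files, skipping the header row in each.
@orders =
    EXTRACT orderid int,
            orderdate string,
            customername string,
            phonenumber string,
            deliveryaddressline1 string,
            deliveryaddressline2 string,
            cityname string,
            stockitemid int,
            quantity int?,
            unitprice decimal?
    FROM "/retail/orders/orders-{*}.txt"
    USING Extractors.Tsv(skipFirstNRows:1);

INSERT INTO sales.orders
SELECT * FROM @orders;

// Step 4: derive the year with Substring, then count DISTINCT orders per year.
@withyear =
    SELECT orderdate.Substring(0, 4) AS year,
           orderid
    FROM sales.orders;

@yearcounts =
    SELECT year,
           COUNT(DISTINCT orderid) AS orders
    FROM @withyear
    GROUP BY year;

OUTPUT @yearcounts
TO "/retail/orders-by-year.csv"
ORDER BY year ASC
USING Outputters.Csv();
```

Computing the year in an intermediate rowset (@withyear) keeps the GROUP BY clause simple; the ORDER BY on the OUTPUT statement satisfies the ascending-year requirement.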

Challenge 3: Query Stock and Orders
1. Write a query to return the item name from the stock table and the sum of quantity from the corresponding rows in the orders table, for the items that contain the word "chocolate" as identified in the previous exercise. Output the results to a comma-delimited file with the rows in descending order of total quantity sold, and then view the output file to identify the chocolate-based item for which the most units have been sold.
2. Write a query to return the customername from the orders table and the sum of quantity ordered, for the items that contain the word "chocolate" as identified in the previous exercise. Output the results to a comma-delimited file with the rows in descending order of total quantity sold, and then view the output file to identify the customer who has purchased the most chocolate-based items.

Note: You have now completed the labs in this course. Unless you want to experiment further, you can delete the resource group containing your Azure Data Lake Store and Azure Data Lake Analytics resources.
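Both queries above join the two tables on the stock item identifier and aggregate quantity; a sketch follows. The output paths are illustrative, and the null-coalescing ?? 0 is one way to handle the nullable quantity column – adapt as you see fit.

```sql
USE DATABASE retail;

// Step 1: total quantity sold per chocolate item, highest first.
@byitem =
    SELECT s.item,
           SUM(o.quantity ?? 0) AS totalquantity
    FROM sales.stock AS s
         INNER JOIN sales.orders AS o
         ON s.id == o.stockitemid
    WHERE s.item.ToLower().Contains("chocolate")
    GROUP BY s.item;

OUTPUT @byitem
TO "/retail/chocolate-by-item.csv"
ORDER BY totalquantity DESC
USING Outputters.Csv();

// Step 2: total chocolate quantity purchased per customer, highest first.
@bycustomer =
    SELECT o.customername,
           SUM(o.quantity ?? 0) AS totalquantity
    FROM sales.orders AS o
         INNER JOIN sales.stock AS s
         ON o.stockitemid == s.id
    WHERE s.item.ToLower().Contains("chocolate")
    GROUP BY o.customername;

OUTPUT @bycustomer
TO "/retail/chocolate-by-customer.csv"
ORDER BY totalquantity DESC
USING Outputters.Csv();
```

The first row of each output file then answers the respective question: the best-selling chocolate item, and the top chocolate-buying customer.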

Microsoft Learning Experiences - GitHub
