Lecture I: Course Overview, Intro to Data Science, and R
Data Science for Business Analytics

Thibault Vatter
Department of Statistics, Columbia University and HEC Lausanne, UNIL

26.02.2018

Outline

1 Course overview

2 Intro to data science

3 R

T. Vatter

Lecture I: Course Overview, Intro to Data Science, and R

26.02.2018

1 / 32

A little about me

- Born and raised in Geneva
- Education:
  - B.Sc. Physics (EPFL, ’10)
  - M.Sc. Physics with minor in Financial Engineering (EPFL, ’12)
  - Ph.D. Statistics (HEC Lausanne, ’16)
- Worked a bit as a quant in finance
- Currently:
  - Post-doctoral fellow at Columbia University
  - Live in New York City
- Hobbies:
  - Flying planes
  - Watching Bay Area teams (go 49ers and Warriors!)
  - Beers (formerly at Satellite, now in Brooklyn micro-breweries)

The basics

- Every two weeks: 02.26/03.12/03.26/04.09/04.30/05.14/05.28
- Lectures:
  - focus on introducing the concepts
  - 8:15-9:00am/9:15-10:00am + 1:15-3:00pm/3:15-4:00pm
  - classroom 237, Internef building
- Exercise sessions:
  - focus on the assignments and project
  - 10:15-11:00am/11:15-12:00pm + 3:15-4:00pm/4:15-5:00pm
  - lab room 143, Internef building
- TA: Natasha Tagasovska, [email protected]

Grading

- 4 assignments (50%) and one project (50%)
  - Detailed reports for each assignment and final project
  - Presentation during last lecture for the project
- Final grade
  - According to

    $$\mathrm{GRADE} = \frac{\sum_{i=1}^{4} \mathrm{HW}_i \cdot 12.5 + \mathrm{PR} \cdot 50}{100}$$

  - HW_i for i = 1, 2, 3, 4 and PR are from 0 to 100
  - GRADE will then be adjusted from 1 to 6
- Groups of 1 or 2 members
  - Email to Natasha with the group members
  - One email per group is enough
  - Deadline for group registration is March 19
- Grades based on academic performance only!
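The final-grade formula on this slide can be checked with a few lines of Python. This is only a sketch: the function name `final_grade` and the example scores are hypothetical, and the later adjustment to the 1 to 6 scale is not implemented because the slide does not say how it is done.

```python
def final_grade(hw_scores, project_score):
    """Weighted final grade on a 0-100 scale, following the slide's formula."""
    if len(hw_scores) != 4:
        raise ValueError("expected exactly 4 assignment scores")
    for score in [*hw_scores, project_score]:
        if not 0 <= score <= 100:
            raise ValueError("scores must lie between 0 and 100")
    # Each assignment has weight 12.5 and the project weight 50;
    # dividing by 100 brings the weighted sum back to the 0-100 range.
    return (sum(hw * 12.5 for hw in hw_scores) + project_score * 50) / 100


# Hypothetical scores, for illustration only.
example = final_grade([80, 90, 70, 100], 85)
print(example)  # 85.0
```

With four assignments at 12.5 each, the assignments and the project both contribute half of the total, which matches the 50%/50% split stated at the top of the slide.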


Learning outcomes

- Manage and analyze data
- Develop data products
- Use data science in a business context

source: r4ds.had.co.nz


Lectures

Date         Topic
02.26 (am)   Intro
02.26 (pm)   R workflow and RMarkdown
03.12 (am)   Wrangling (I)
03.12 (pm)   Visualization (I)
03.26 (am)   Wrangling (II)
03.26 (pm)   Visualization (II)
04.09 (am)   Modeling (I)
04.09 (pm)   Modeling (II)
04.30 (am)   Shiny
04.30 (pm)   Guest presentation
05.14 (am)   Big data
05.14 (pm)   Hadoop and Spark
05.28 (am)   Projects presentations
05.28 (pm)   Projects presentations

Lab sessions

Date         Topic                                      HW#
02.26 (am)   R Refresher
02.26 (pm)   Workflow, RMarkdown, data w. & v. (I)      HW1
03.12 (am)   Project
03.12 (pm)   Workflow, RMarkdown, data w. & v. (I)      HW1
03.26 (am)   Project
03.26 (pm)   Data w. & v. (II), modeling (I and II)     HW2
04.09 (am)   Project
04.09 (pm)   Data w. & v. (II), modeling (I and II)     HW2
04.30 (am)   Project
04.30 (pm)   Shiny app                                  HW3
05.14 (am)   Project
05.14 (pm)   Spark                                      HW4

Milestones

Date    Assignment
03.26   HW1
04.09   Project proposal
04.30   HW2
05.14   HW3
05.14   Project update
05.28   HW4
05.28   Project report

- To be submitted before midnight of the due date
- No late submission without medical certificate


Course website

All lecture notes, the syllabus, assignments, and additional resources are available at:

https://tvatter.github.io/dsfba_2018/


Resources

- R for data science
- The CRAN website
- RStudio cheat sheets
- Much more in the resources section of the course website


Best place to look for answers?


2 Intro to data science

What is Data Science?

- Wikipedia: “the extraction of knowledge from data”
- Precise definition a bit unclear and controversial...
- Practitioners “agree” on the components of data science:
  - database management
  - gathering and cleaning
  - exploratory analysis
  - predictive modeling
  - data summary and visualization

A brief (and opinionated) history

- 1960, Peter Naur publishes Datalogy: the science of data and its place in education
- 1962, John Tukey, The Future of Data Analysis:
  “... as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt... I have come to feel that my central interest is in data analysis...”
- 1974, Peter Naur, Concise Survey of Computer Methods:
  “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
- 1977, John Tukey publishes Exploratory Data Analysis

A brief (and opinionated) history

- 1989, first Knowledge Discovery in Databases (KDD) workshop (later maturing into the ACM SIGKDD Conference)
- 1994, Database marketing:
  “Companies are collecting mountains of information about you, crunching it to predict how likely you are to buy a product, and using that knowledge to craft a marketing message precisely calibrated to get you to do so.”
- 1996, the International Federation of Classification Societies (IFCS) meets in Tokyo (Data science, classification, and related methods)
- 1997, C. F. Jeff Wu’s inaugural lecture Statistics = Data Science? for his appointment to the H. C. Carver Professorship at the University of Michigan

A brief (and opinionated) history

- 1997, the journal Data Mining and Knowledge Discovery is launched
- 2001, William Cleveland (Bell Labs) publishes Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics:
  “[a plan] to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’”
- 2001, Leo Breiman, Statistical Modeling: The Two Cultures:
  “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.”

A brief (and opinionated) history

- 2002, creation of Data Science Journal (CS focus)
- 2003, creation of Journal of Data Science (stats focus)
- 2006, Hadoop
- 2008, DJ Patil (LinkedIn) and Jeff Hammerbacher (Facebook) coin the term data scientist to define their jobs
- 2009, reintroduction of the term NoSQL
- 2009, Hal Varian (chief economist at Google): “the sexy job in the next 10 years will be statisticians”
- 2012, Harvard Business Review publishes Data Scientist: The Sexiest Job of the 21st Century
- 2012, job listings for Data Scientists increased by 10,000%
- 2014, Bin Yu’s Let us own Data Science speech
- 2015, DJ Patil becomes the White House’s first Chief Data Scientist

Applications

- E-marketing
- Credit scoring
- Recommender systems
- E-commerce
- Sport analytics
- Government analysis
- Biotechnology
- Gaming
- Image or speech recognition
- Price comparisons
- Fraud and risk detection
- Airline route planning
- Social media
- Delivery logistics

Figure: some of the hiring partners of The Data Incubator

Data scientists?


The data science toolbox

source: datasciencecentral.com

Technology ecosystem

source: rosebt.com

Most popular?

source: kdnuggets.com

3 R

S and R

- S
  - A statistical programming language
  - First appeared in 1976
  - Developed by John Chambers and (in earlier versions) Rick Becker and Allan Wilks of Bell Labs
  - John Chambers: “[the aim is] to turn ideas into software, quickly and faithfully”
- R
  - Modern implementation of S
  - First appeared in 1993
  - Created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
  - Currently developed by the R Development Core Team

Some “technical” details about R

- Part of the GNU free software project
- Source code written primarily in C, Fortran, and R
- Available for Windows, macOS, and Linux
- Multi-paradigm: object-oriented, functional, procedural
- Dynamically typed
- Scripting language (interpreted)
- Wide variety of statistical and graphical techniques
- Easily extensible through functions and packages
- Read/write from/to various data sources


What about Excel?

source: fantasyfootballanalytics.net


Excel is great for certain things...

source: github.com/jdwilson4

...but not everything

R’s advantages:

- Easier automation
- Easier to find and fix errors
- Better reproducibility
- Free & open source
- Faster computation
- Advanced statistics capabilities
- Supports larger data sets
- State-of-the-art graphics
- Reads any type of data
- Runs on many platforms
- More powerful data manipulation capabilities
- Easier project organization
- Anyone can contribute packages to improve its functionality

Automation and reproducibility

source: trendct.org

How about Python?

source: python.org


CRAN

source: cran.r-project.org

RStudio

- An open-source integrated development environment (IDE)
- RStudio Desktop available for Windows, macOS, and Linux

source: rstudio.com
