TGen
June 19-20th, 2017
Instructors: Nick Banovich, Emily Davenport
Helpers: Christophe Legendre, Elizabeth Hutchins, Eric Alsop, Ryan Richholt

Goal: Learn core skills for doing data analysis effectively, efficiently, and reproducibly.
1. Interacting with your computer on command line (BASH/shell)
2. Programming fundamentals (R)
3. Version control (Git)

Do you suffer from any of the following?

- I usually manage data in Excel, but that’s caused some errors with dates and I want to learn a different way.
- My advisor insists that we store 50,000 barcodes in a spreadsheet, and something must be done about that.
- I’m having a hard time analyzing microarray, SNP, or multivariate data with Excel and Access.
- I want to use publicly available data, but it’s confusing to download it through command line.
- I’m interested in going into industry and companies are asking for data analysis experience.
- I’m trying to reboot my lab’s workflow to manage data and analysis in a more sustainable way.
- I’m re-entering data over and over again by hand and know there’s a better way.
- I’m tired of feeling out of my depth on computation and want to increase my confidence.
- I see other people’s figures and wonder if I could generate something like that with my data.

Notes before we start

- Website: https://erdavenport.github.io/2017-06-19-tgen/
- Etherpad: http://pad.software-carpentry.org/2017-06-19-tgen
- Can you see the screen?
- Bathrooms, breaks…
- Getting help: raising hand vs. stickies vs. Etherpad

Raise your hand for a question everyone would benefit from.

Sticky note when your code doesn’t work and you need a helper.

Etherpad for all of the above and for off topic questions.

Reproducible Research

- Well documented and repeatable science.
- Data analysis:
  - Data and analysis can be re-created by anyone, including you in the future!
  - Manages and analyzes:
    - Repeat analysis on updated data.
    - Repeat analysis on similar datasets.
- Scripted data management and analysis:
  - Provides a record of what was done
  - Easy to edit and re-run

[Workflow diagram] Raw Data → data cleaning script → Cleaned Data → summarizing script → Working Data → analysis script → Analysis Results → figure script → Figures, and table formatting script → Tables → Publication → Fame

data cleaning script:
• Find/replace values
• merge grouping labels
• re-code variables
• fix typos
• convert dates
• missing values

summarizing script:
• subset data for particular project
• transform variables
• average, min, max by group
• imputation

analysis script:
• linear models
• search for correlates
• general functions

figure script and table formatting script:
• plotting
• table making

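[Workflow diagram] Updated Raw Data replaces Raw Data (crossed out), and the same scripts are re-run: Updated Raw Data → data cleaning script → Cleaned Data → summarizing script → Working Data → analysis script → Analysis Results → figure script → Figures, and table formatting script → Tables → Publication → Fame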
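The scripted workflow in the diagrams can be sketched in a few lines of code. The example below is only an illustration, not the workshop's material: the workshop itself teaches R, and Python is used here purely for a compact, self-contained sketch. The data, the column names (`sample`, `group`, `expression`), and the `GROUP_FIXES` recoding table are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw data: inconsistent labels, a typo, and a missing value.
RAW_CSV = """sample,group,expression
s1,control,2.0
s2,Control,4.0
s3,treated,6.0
s4,treatd,8.0
s5,treated,NA
"""

# Hypothetical recoding table used by the cleaning step.
GROUP_FIXES = {"Control": "control", "treatd": "treated"}


def clean(raw_text):
    """Cleaning step: fix group labels, convert types, drop missing values."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        if row["expression"] == "NA":  # skip rows with a missing value
            continue
        group = GROUP_FIXES.get(row["group"], row["group"])
        rows.append({
            "sample": row["sample"],
            "group": group,
            "expression": float(row["expression"]),
        })
    return rows


def summarize(rows):
    """Summarizing step: mean, min, and max of expression by group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row["expression"])
    return {
        group: {"mean": mean(values), "min": min(values), "max": max(values)}
        for group, values in by_group.items()
    }


def run_pipeline(raw_text):
    """Raw data -> cleaned data -> working summary, with no manual steps."""
    return summarize(clean(raw_text))


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
    # Updated raw data arrives: re-run the same scripts unchanged.
    updated_csv = RAW_CSV + "s6,control,6.0\n"
    print(run_pipeline(updated_csv))
```

Because every step lives in a function rather than in manual spreadsheet edits, the updated data goes through exactly the same cleaning and summarizing as the original, and the script itself is the record of what was done.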

[Schedule diagram] Workshop sessions mapped onto the workflow (Raw Data → Cleaned Data → Working Data → Analysis Results → Figures and Tables):

- Monday morning: BASH/shell
- Monday afternoon: Intro to R; R: variables; R: data types; R: loading data; R: subsetting data; R: loops and functions
- Tuesday morning: git
- Tuesday afternoon: R: dplyr; R: ggplot2
