THE+ new #RDATATABLE PACKAGE developments in v1.9.7 Arun Srinivasan

SEP 3’16, BUDAPEST

@arun_sriniv

who am i? •

Bioinformatician / Comp. Biologist



data.table user, co-developer since late 2013



Previous: Data scientist @Open Analytics



Future: Lead engineer @investment mgmt. firm

most underrated package

August 2016

most underrated package

May 2015



Homepage: http://r-datatable.com



Since 2006 on CRAN, >30 releases so far



>5500 unit tests, ~89% coverage (using covr)



>260 packages import/depend/suggest data.table •

~12.6 packages per month since Sep’15



8th most starred R package on Github (METACRAN)



>4400 Q on StackOverflow. 3rd amongst R packages

Monthly data.table questions from 2012−2016 150

100 Year 2012

count

2013 2014 2015 2016

50

0

powerful

(With no intent on fuelling language wars)

great sadness

data.table data.table data.table

talk overview •



data.table’s philosophy •

concise + straightforward code



fast + memory efficient

New features and improvements in v1.9.7 •

fwrite, conditional joins, parallel sort & other optimisations

talk overview •



data.table’s philosophy •

concise + straightforward code



fast + memory efficient

New features and improvements in v1.9.7 •

fwrite, conditional joins, parallel sort & other optimisations

data frames •

are columnar data structures

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

data frames are columnar data structures

• •

2D — rows and columns

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

data frames are columnar data structures

• •

2D — rows and columns

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

data frames are columnar data structures

• • •

2D — rows and columns subset rows — X[X$id != “a”, ]

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

data frames are columnar data structures

• •

2D — rows and columns



subset rows — X[X$id != “a”, ]



select columns — X[, “val”]

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

data frames are columnar data structures

• •

2D — rows and columns



subset rows — X[X$id != “a”, ]



select columns — X[, “val”]



subset rows & select columns — X[X$id != “a”, “val”]

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

data frames are columnar data structures

• •

2D — rows and columns



subset rows — X[X$id != “a”, ]



select columns — X[, “val”]



subset rows & select columns — X[X$id != “a”, “val”]



that’s pretty much it…

X

id 1 b 2 a 3 a 4 c 5 c 6 b

val 4 2 3 1 5 6

2 column data.frame

1. how to compute on columns? DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code != “abd”, get sum(valA)

1.9

1. how to compute on columns? DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code != “abd”, sum(DF[DF$code != “abd”, “valA”]) get sum(valA)

1.9

2. Grouped aggregate DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code != “abd”, get sum(valA) and sum(valB) for each id

id

valA

valB

1

1

0.7

18

2

2

1.2

23

2. Grouped aggregate DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code != “abd”, aggregate(cbind(valA, valB) ~ id, get sum(valA) and sum(valB) DF[DF$code != “abd”, ], sum) for each id

id

valA

valB

1

1

0.7

18

2

2

1.2

23

3. Simple update DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code == “abd”, update valA with NA

3. Simple update DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd NA 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code == “abd”, update valA with NA

3. Simple update DF

id code valA valB 1 1

abc

0.1

11

2 1

abc 0.6

7

3 1 abd NA 1.5

5

4 2 apq 0.9

10

5 2 apq 0.3

13

For code == “abd”, update valA DF[DF$code == “abd”, “valA”] <- NA with NA

can we be more consistent? sum(DF[DF$code != “abd”, “valA”])

How to get sum of both valA and valB? Or sum of valA and valB combined?

aggregate(cbind(valA, valB) ~ id, DF[DF$code != “abd”, ], sum)

New function. Formula interface. Unwanted columns are subsetted. How to get sum(valA) and mean(valB)?

DF[DF$code == “abd”, “valA”] <- NA

Entire expression is now to the left of the “<-“ operator

Enhanced data frames •

Three main enhancements: 1. Allow column names to be seen as variables within […] 2. Since they’re variables, we can do computations on them directly, i.e, within […] 3. Additional argument by

data tables •

X

are columnar data structures as well

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

data tables X

are columnar data structures as well

• •

2D — rows and columns

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

data tables X

are columnar data structures as well

• •

2D — rows and columns

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

data tables • •

X

are columnar data structures as well



2D — rows and columns subset rows — X[id != “a”, ]

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

data tables X

are columnar data structures as well

• •

2D — rows and columns



subset rows — X[id != “a”, ]



select columns — X[, val]

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

data tables X

are columnar data structures as well

• •

2D — rows and columns



subset rows — X[id != “a”, ]



select columns — X[, val]



compute on columns — X[, mean(val)]

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

mean 3.5

data tables X

are columnar data structures as well

• •

2D — rows and columns



subset rows — X[id != “a”, ]



select columns — X[, val]



compute on columns — X[, mean(val)]



subset rows & select / compute on columns — X[id != “a”, mean(val)]

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

mean 4.0 3.5

data tables •

2D — rows and columns



subset rows — X[id != “a”, ]



select columns — X[, val]



compute on columns — X[, mean(val)]



subset rows & select / compute on columns — X[id != “a”, mean(val)]



X

are columnar data structures as well



virtual 3rd dimension — group by

1: 2: 3: 4: 5: 6:

id b a a c c b

val 4 2 3 1 5 6

2 column data.table

mean 4.0 3.5

data tables •

think in terms of basic units — rows, columns and groups



data.table syntax provides placeholder for each of them

General form:

On which rows

DT[i, j, by]

What to do?

Grouped by what?

Equivalent data table code sum(DF[DF$code != “abd”, “valA”])

DT[code != “abd”, sum(valA)]

aggregate(cbind(valA, valB) ~ id, DF[DF$code != “abd”, ], sum)

DT[code != “abd”, .(sum(valA), sum(valB)), by = id]

DF[DF$code == “abd”, “valA”] <- NA

DT[code == “abd”, valA := NA]

two tables A

B

id code valA valB

id code mul

1: 1

abc

0.1

11

1: 1

abd

2.0

2: 1

abc 0.6

7

2: 2

apq

0.5

3: 1 abd 1.5

5

3: 3

abc

1.7

4: 2 apq 0.9

10

5: 2 apq 0.3

13

Update valA with valA*mul while matching on id, code

two tables A

B

id code valA valB

id code mul

1: 1

abc

0.1

11

1: 1

abd

2.0

2: 1

abc 0.6

7

2: 2

apq

0.5

3: 1 abd 1.5

5

3: 3

abc

1.7

4: 2 apq 0.9

10

5: 2 apq 0.3

13

Update valA with valA*mul while matching on id, code

A[B, on = .(id, code), valA := valA * mul]

on which rows? what to do?

two tables A

B

id code valA valB

id code mul

1: 1

abc

0.1

11

1: 1

abd

2.0

2: 1

abc 0.6

7

2: 2

apq

0.5

3: 1 abd 3.0 1.5

5

3: 3

abc

1.7

4: 2 apq 0.45 0.9 10 5: 2 apq 0.15 0.3 13

Update valA with valA*mul while matching on id, code

A[B, on = .(id, code), valA := valA * mul]

on which rows? what to do?

talk overview •



data.table’s philosophy •

concise + straightforward code



fast + memory efficient

New features and improvements in v1.9.7 •

fwrite, conditional joins, parallel sort & other optimisations

fwrite - parallel file writer

SOURCE: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/

fsort - parallel sort length

size in RAM

threads

base R

v1.9.7

500m

3.8GB

8

65s

3.9s

1b

7.6GB

32

140m

3.5s

10b

76GB

32

25m

48s

SOURCE: https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf

parallel row subsets DT[sample(.N, .N/2)]

200e6 rows, 4 cols~4.6GB v1.9.6

20.0s

v1.9.7 (C, parallelised)

3.6s (16 threads)

run time

%between% x %between% c(2000, 20000)

length(x) = 500e6, int, ~1.9GB v1.9.6

15.7s

7.2GB

v1.9.7 (C, parallelised)

1.1s

3.8GB

(4 threads)

run time

peak memory

median 1e6 rows, 61 columns, ~460MB 10,000 unique groups

conditional operations A

B

id code valA valB 1: 1

abc

0.1

11

2: 1

abc 0.6

7

3: 1 abd 1.5

5

4: 2 apq 0.9

10

5: 2 apq 0.3

13

id begin end 1:

1

0.1

0.9

2: 2

0.6

0.8

Update valB with NA while matching on id, valA > begin, valA < end

conditional operations A

B

id code valA valB 1: 1

abc

0.1

11

2: 1

abc 0.6 NA 7

3: 1 abd 1.5

5

4: 2 apq 0.9

10

5: 2 apq 0.3

13

id begin end 1:

1

0.1

0.9

2: 2

0.6

0.8

Update valB with NA while matching on id, valA > begin, valA < end

A[B, on = .(id, valA>begin, valA
summary and future directions •

data.table allows for concise and straightforward code, and is fast and memory efficient



More efforts towards parallelisation in future



File backed data.tables would be great feature to have soon, #1336



Give data.table a go :-)

Thank you for your attention!

Questions?

data tables - GitHub

fwrite - parallel file writer. SOURCE: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/ ... SOURCE: https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf length.

2MB Sizes 4 Downloads 355 Views

Recommend Documents

Data 8R Tables and more Visualizations Summer 2017 1 ... - GitHub
Jul 11, 2017 - At the same time, the researcher also records the number of ... A business has graphed the proportion of outputs in each year as a bar chart.

Data 8R Tables and more Visualizations Summer 2017 1 ... - GitHub
number of colds each volunteer gets. Is this an observational ... questions about it. A business has graphed the proportion of outputs in each year as a bar chart.

Tables and Data Script.pdf
add constraint DCust_ID_FK foreign key (Customer_ID) references. PBTRAINING.XXPB_Customers(customer_id);. Page 3 of 10. Tables and Data Script.pdf.

Javascript Data Exploration - GitHub
Apr 20, 2016 - Designers. I'm a sort of. « social data scientist ». Paris. Sciences Po médialab. I just received a CSV. Let me grab my laptop ... Page 9 ...

Tabloid data set - GitHub
The Predictive Analytics team builds a model for the probability the customer responds given ... 3 Summary statistics .... Predictions are stored for later analysis.

RStudio Data Import - GitHub
“A data model in which the data is organized into a tree-like structure” - Wikipedia. Page 10. WHAT IS XML, HTML AND JSON? XML: Extensible Markup ...

Data Science - GitHub
Exploratory Data Analysis ... The Data Science Specialization covers the concepts and tools for ... a degree or official status at the Johns Hopkins University.

My precious data - GitHub
Open Science Course 2016 ... It's part of my contribution to science community ... Exports several formats (pdf, docx, csv, text, json, html, xml) ... http://dataverse.org/blog/scientific-data-now-recommends-harvard-dataverse-all-areas-s · cience ...

Open Data Canvas - GitHub
Top need for accessing data online. What data is most needed? Solution. How would you solve this problem? ... How big is the universe of users? Format/Use.

Reading in data - GitHub
... handles import from SPSS. Once installed, the package contents can be loaded into R (made available to the R system) with the function call. > library(Hmisc) ...

meteor's data layer - GitHub
Full-stack JavaScript Framework for both Web and. Mobile. □. Built on top of the NodeJs. □. Open Source. □ ... Meteor doesn't send HTML over the network. The server sends data ... All layers, from database to template, update themselves ...

High Blood Pressure Awareness in Colorado Appendix: data tables ...
High Blood Pressure Awareness in Colorado Appendix: data tables.pdf. High Blood Pressure Awareness in Colorado Appendix: data tables.pdf. Open. Extract.

Research Data Management Training - GitHub
Overview. Research Data management Training Working Group: Approach and. Methodology ... CC Australia ported licence) licence. ... http://www.griffith.edu.au/__data/assets/pdf_file/0009/528993/Best_Practice_Guidelines.pdf. University of ...

Hash Tables
0. 12. 15. 1. 2 ... 10. Hash. Function banana apple cantaloupe mango kiwi pear apple banana cantaloupe kiwi mango pear. Hash Tables ...

RN-171 Data Sheet - GitHub
Jan 27, 2012 - 171 is perfect for mobile wireless applications such as asset monitoring ... development of your application. ... sensor data to a web server.

Tables Graphs.pdf
There was a problem previewing this document. Retrying... Download. Connect more apps... Try one of the apps below to open or edit this item. Tables Graphs.

Physics Reference Tables
NCDPI Reference Tables for Physics (October 2006). Page 2. Mechanics. Energy x v t. Δ= Δ. = + f i x x vt. 2. 1 x x. 2 f i i vt at. = +. +. Δ= Δ v a t. 2. 2. 2 x.

Finding Related Tables - CiteSeerX
[25] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In CONLL ...

fusion tables 1page_01 Earth
tables.googlelabs.com. Google Fusion Tables. Share, visualize and analyze data with your team, your partners, your world. Google Fusion Tables makes it easy ...

Compressing Forwarding Tables
Abstract—With the rise of datacenter virtualization, the number of entries in forwarding tables is expected to scale from several thousands to several millions. Unfortunately, such forwarding table sizes can hardly be implemented today in on-chip m

Wolfthorne (tables).pdf
used as a weapon (d6) at a push. 13. Smart Arm - The fanciest weapon you. own is also intelligent, communicating. psychically to whoever is holding it. It.

Supplementary Materials 1. Supplementary Tables
3.87. 27. -75. -48. Orbitofrontal Cortex. 11. R. 76. 3.72. 30. 48. -15. Decreased DMN (Controls > FTD). Dorsomedial PFC. 32. L. 312. 3.98. -27. 51. 21. Anterior Cingulate. 24. -. 107. 3.83. -3. 21. 33 .... correlation pattern, corresponding to a Type