Data wrangling with R’s data.table package


Sara Colando & Erin Franke

Department of Statistics & Data Science
Carnegie Mellon University

Pre-Lecture Survey

Pre-Lecture Survey Link

Motivation

What is data.table?

  • A versatile R package that is a high-performance version of base R’s data.frame.

  • Benefits include:

    • Computational efficiency

    • Concise syntax

    • No dependencies

    • Tested against old versions of R

  • Uses: data wrangling, reading/writing files, handling large data, and much more!

Main verbs of data wrangling

The six main verbs

  1. Select: extract columns in data

  2. Mutate: create new variables (columns) in data

  3. Filter: extract rows in data that meet certain logical conditions

  4. Arrange: sort rows (observations) by variables

  5. Group by: group the data based on variable(s) in data

  6. Summarize: create summary statistics, usually on grouped data

data.table syntax

The six main verbs all follow the same three-part syntax: dt[i, j, by].
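
Schematically, following data.table’s own description (“take dt, subset or reorder rows using i, then calculate j, grouped by by”):

# dt[i, j, by]
#    │  │  └── grouped by what? (group by)
#    │  └──── what to do with the columns? (select, mutate, summarize)
#    └─────── on which rows? (filter, arrange)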

Example data

  • pets_dt is collected on students who each have one pet

    • pet: animal the student has

    • n_classes: number of classes the student is taking

    • hours_hw: hours of homework the student has

library(data.table)

pets_dt <- rowwiseDT(pet = , n_classes = , hours_hw = ,
                     "cat", 3, 18,
                     "dog", 4, 24,
                     "fish", 2, 10,
                     "dog", 3, 12,
                     "dog", 1, 8,
                     "cat", 1, 3,
                     "fish", 4, 16,
                     "fish", 1, 6)

Selecting columns

We can select columns to keep in our dataset
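
For example, keeping only the pet and hours_hw columns; .() is shorthand for list():

pets_dt[, .(pet, hours_hw)]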

Mutating variables

We can add (mutate) variables to the dataset, keeping the same number of rows as before
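
For example, adding homework hours per class; := creates the column by reference, modifying pets_dt in place:

pets_dt[, hw_per_class := hours_hw / n_classes]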

Filtering rows

We can filter for particular rows based on a logical statement
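
For example, keeping only the dog owners, or only students with more than 10 hours of homework:

pets_dt[pet == "dog"]
pets_dt[hours_hw > 10]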

Arranging rows

We can arrange the dataset by particular variable(s)
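
For example, sorting by homework hours; prefixing a variable with - sorts in descending order:

pets_dt[order(hours_hw)]   # ascending
pets_dt[order(-hours_hw)]  # descending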

Grouping by and summarizing

We can create summary statistics (summarize) by particular groups (group_by)
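
For example, computing the average homework hours and number of students for each type of pet; .N gives the per-group row count:

pets_dt[, .(mean_hw = mean(hours_hw), n_students = .N), by = pet]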

Let’s poll!

Slido Clicker Question

Joining datasets

We might also want to join two datasets, meaning we combine them based on key values they share.

In data.table, joins are supported by the base syntax:
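
A schematic sketch, where X and Y are data.tables that share a key column id:

X[Y, on = "id"]  # look up each row of Y in X; all rows of Y are kept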

Note that joins are right joins by default in data.table!

Equi joins

Equi joins: match rows whose key values are exactly equal in the two datasets

Inner Join Example:
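
A minimal sketch, joining pets_dt to a hypothetical lookup table food_dt:

food_dt <- rowwiseDT(pet = , food = ,
                     "cat", "kibble",
                     "dog", "biscuits",
                     "bird", "seed")

# nomatch = NULL drops unmatched rows, turning the default right join
# into an inner (equi) join: only cats and dogs appear in both tables
pets_dt[food_dt, on = "pet", nomatch = NULL]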

Advanced join types

  • Non-Equi Joins: match rows based on comparison operators other than strict equality

    • for example: >, <, >=, or <= operators

  • Overlapping Joins: match rows based on overlapping ranges between elements

  • Rolling Joins: match rows based on the nearest value in a sorted column (see the sketch below)

    • useful for time series data or for imputing missing values with nearest available data
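
A sketch of a rolling join, using two small hypothetical tables of timestamped values:

sensor <- data.table(time = c(1, 5, 10), reading = c(0.2, 0.5, 0.9))
events <- data.table(time = c(4, 7))

# For each event time, take the most recent sensor reading at or before
# it: roll = TRUE rolls the last observation forward
sensor[events, on = "time", roll = TRUE]

Overlapping joins are handled by foverlaps(), and non-equi joins use comparison operators inside on = .(...).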

Piping statements together

What if we wanted to chain several of the six main verbs together? For example, filter pets_dt to remove the cats and then arrange by hours_hw?

  • We could save the filtered dataset to no_cats_dt and then order that.
  • We could also pipe statements together!

Three equivalent ways to pipe
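
For example, filtering out the cats and then sorting by hours_hw, written three equivalent ways (a sketch; the native pipe placeholder _ requires R >= 4.3):

# 1. Chaining brackets
pets_dt[pet != "cat"][order(hours_hw)]

# 2. magrittr pipe with the . placeholder
library(magrittr)
pets_dt %>% .[pet != "cat"] %>% .[order(hours_hw)]

# 3. Native pipe with the _ placeholder (R >= 4.3)
pets_dt |> _[pet != "cat"] |> _[order(hours_hw)]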

Reading and writing files

fread() and fwrite()

To (a) read a file into R or (b) write a file to disk, fread() and fwrite() are efficient and support several file types.

pets_dt <- fread("../data/pets_dt.csv") 
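
fwrite() works the same way in the other direction, for example:

fwrite(pets_dt, "../data/pets_dt.csv")  # write pets_dt back out as a CSV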

Supports:

  • .csv, .tsv

  • Other delimited files (semicolon, colon, pipe)

  • Compressed files with .gz, .bz2 extensions

Nice arguments of fread()

  • nrows: number of rows to read in
fread("../data/pets_dt.csv", nrows = 3)
      pet n_classes hours_hw
   <char>     <int>    <int>
1:    cat         3       18
2:    dog         4       24
3:   fish         2       10
  • skip: row number or string match to start reading from
fread("../data/pets_dt.csv", skip = "fish") 
       V1    V2    V3
   <char> <int> <int>
1:   fish     2    10
2:    dog     3    12
3:    dog     1     8
4:    cat     1     3
5:   fish     4    16
6:   fish     1     6

Nice arguments of fread() (cont)

  • select: columns to keep
fread("../data/pets_dt.csv", select = "pet") 
      pet
   <char>
1:    cat
2:    dog
3:   fish
4:    dog
5:    dog
6:    cat
7:   fish
8:   fish
  • drop: columns to remove
fread("../data/pets_dt.csv", drop = c("pet", "n_classes")) 
   hours_hw
      <int>
1:       18
2:       24
3:       10
4:       12
5:        8
6:        3
7:       16
8:        6

Bonus features

Rowwise data.table object creation

  • Convenient and readable for creating small datasets

    • For instance, creating toy examples or summary tables
pets_dt <- rowwiseDT(pet = , n_classes = , hours_hw = ,
                     "cat", 3, 18,
                     "dog", 4, 24,
                     "fish", 2, 10)

library(gt)  # for rendering formatted display tables
gt(pets_dt)

pet    n_classes   hours_hw
cat            3         18
dog            4         24
fish           2         10

Visualizing by a grouping variable

Let’s say we are investigating the relationship between lot area and sale price and we want to see how it varies by driveway pavement.

library(ggplot2)

# Print one scatterplot of Lot.Area vs. SalePrice per level of Paved.Drive
# (house_prices: the Ames, Iowa housing data used on these slides)
house_prices[, print(
  ggplot(.SD, aes(x = Lot.Area, y = SalePrice)) +
    geom_point(alpha = 0.5) +
    theme_classic() +
    ggtitle(paste("Paved.Drive:", Paved.Drive[1]))
), by = .(Paved.Drive)]

Modeling by a grouping variable

With data.table, we can also fit a distinct model for each group. For example, we can fit a linear model for the sale price of homes in Ames, Iowa (SalePrice) for each neighborhood in the dataset:

# Fit one linear model per neighborhood, storing each fit in a list column
grouped_models <- house_prices[, .(mods = list(
  lm(SalePrice ~ Lot.Area + Bedroom.AbvGr + Full.Bath + Year.Built * Yr.Sold, data = .SD))),
  by = Neighborhood]

We can then cross-compare the coefficient estimates and p-values for the covariates in our model to see how they differ between neighborhoods:

Old Town, Ames, Iowa:

Characteristic          Beta         95% CI                  p-value
Lot.Area                5.0          (3.5, 6.4)              <0.001
Bedroom.AbvGr           6,895        (1,299, 12,490)         0.016
Full.Bath               17,015       (5,888, 28,143)         0.003
Year.Built              -279,362     (-651,881, 93,157)      0.14
Yr.Sold                 -268,748     (-625,248, 87,751)      0.14
Year.Built * Yr.Sold    139          (-46, 325)              0.14

Abbreviation: CI = Confidence Interval

Northridge Heights, Ames, Iowa:

Characteristic          Beta          95% CI                        p-value
Lot.Area                18            (15, 22)                      <0.001
Bedroom.AbvGr           -14,055       (-30,965, 2,855)              0.10
Full.Bath               5,483         (-36,505, 47,471)             0.8
Year.Built              -2,313,155    (-13,425,305, 8,798,994)      0.7
Yr.Sold                 -2,315,556    (-13,410,216, 8,779,105)      0.7
Year.Built * Yr.Sold    1,156         (-4,376, 6,689)               0.7

Abbreviation: CI = Confidence Interval
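
As a sketch, one stored model can be pulled back out of grouped_models and inspected (assuming Old Town is coded "OldTown", as in the Ames data):

# Extract the fitted lm object for a single neighborhood
old_town_fit <- grouped_models[Neighborhood == "OldTown", mods[[1]]]

# Coefficient estimates and p-values, as summarized in the tables above
summary(old_town_fit)$coefficients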

Connection to tidyverse

Weighing the options

  • Both data.table and tidyverse are great tools for wrangling data. Luckily, we are not confined to just one of them!
  • It all comes down to personal preference:
    • Comfort with syntax
    • Brevity of syntax
    • Consistency with collaborators
    • Computational efficiency

dtplyr

Perhaps you have some experience with data wrangling in the tidyverse, but you need the efficiency of data.table.

The dtplyr package allows for data.table-like efficiency while still writing code in the tidyverse style, with a few small modifications.

library(dplyr)   # dtplyr translates dplyr verbs into data.table operations
library(dtplyr)

house_prices2 <- lazy_dt(house_prices) # extra step needed to use dtplyr

house_prices2 %>%
  filter(Lot.Area >= 10000) %>%
  as_tibble() # extra step: collect the result back into a tibble

So, what do we mean when we say data.table is computationally efficient?

Interested in learning more about computational efficiency?

  • This Stack Overflow thread explains a bit about memory usage

  • Visit this vignette to learn about keys and this vignette to learn about indexing.

    • These features are super helpful when working with large data!

Thanks! Questions?

Post-Lecture Survey

Post-Lecture Survey Link