Try it Yourself Exercises for data.table R package

In this lab, we will be working with data on homes sold in Ames, Iowa, including their attributes and sale prices. This dataset was produced by the Ames, Iowa Assessor’s Office and made accessible by Dean De Cock, Truman State University. To learn more about the data and download it, visit the CMU data repository.

library(data.table)
library(tidyverse)
library(dtplyr)
homes <- fread("../../data/ames-housing.csv") 

Recalling the 6 main verbs of data.table

We will start by recalling the data.table syntax of the 6 main verbs. Run the following code and write a brief description of what each line of code is doing.

homes_sub <- homes[,.(Year.Built, Mo.Sold, Yr.Sold, SalePrice)]

ANSWER HERE:

homes_sub[, ym_sold := as.Date(paste(Yr.Sold, Mo.Sold, "01", sep = "-"), "%Y-%m-%d")] 

ANSWER HERE:

homes[Full.Bath == 4]

ANSWER HERE:

homes_sub[order(Year.Built)]

ANSWER HERE:

homes_sub[, .N, by=Yr.Sold]

ANSWER HERE:

Write code to for a prompt

In this section we are interested in investigating the sale price of homes during the Great Recession that occurred from December 2007 to June 2009. For each prompt, write the corresponding code.

Mutate a variable called sold_ym to the homes dataset that gives the month and year that the home was sold. Hint: see previous problem.

Select ym_sold and SalePrice and save this to new dataset called sales.

Filter sales to only include homes sold in the recession (December 2007-June 2009). Resave this dataset as sales.

Group_by the year/month the home was sold and summarize the average selling price in that month. Resave this dataset as sales.

Arrange the dataset by date.

Adding pipes

In the prompts above, we iteratively overwrote the sales dataset each step. This can get clunky—instead we could pipe together the above prompts. Give it a try!

Combine steps to answer a question

Now that we have a gotten some practice with the 6 main verbs, your challenge is to utilize them to answer the following questions.

  1. Start by finding the top 3 Neighborhoods with the most homes sold. For these neighborhoods, find their average Year.Built and SalePrice and the arrange the end dataset from oldest to most recent average year built.
  1. Are homes with larger garages (garages that fit 3 or more cars) typically built on larger lots (Lot.Area)? How does the average price for these homes compare to homes with 2 or fewer car garages?

Bonus Exercises

Joining

We will now get some practice with joining datasets together. Run the following code chunk in order to create the dataset expensive, which gives the 5 neighborhoods with the most expensive average selling price. We also have stores, which says how many grocery stores are in each neighborhood (this is made up).

expensive <- homes[,.(avg_selling = mean(SalePrice), N= .N), by=Neighborhood][
  N>10][
    order(-avg_selling)[1:5]]
stores <- rowwiseDT(Neighborhood =, Grocery =, 
                    "NoRidge", 3,
                    "StoneBr", 1,
                    "NridgHt", 3,
                    "Veenker", 2,
                    "Timber", 0)

Write code to join the two datasets so that the number of grocery stores is a column in the expensive dataset.

What if we wanted to get all the homes sold from those top 5 most expensive neighborhood? Instead of typing out each of these neighborhoods in a filter statement, we can use a join. Write code to do this.

If you want to learn more about joins, check out this great vignette!

Visualization by a grouping variable

Let’s visualize the relationship between above ground living area in square feet (Gr.Liv.Area) and SalePrice for three neighborhoods: NAmes, CollgCr, and OldTown. Fill in the missing parts of the code.

homes[Neighborhood %in% c("NAmes", "CollgCr", "OldTown")][, print(
  ggplot(#FILL, 
    aes(x=Gr.Liv.Area, y=SalePrice))+
    geom_point(alpha=0.5)+
    theme_classic()+
    ggtitle(paste(Neighborhood[1]))), by=#FILL
    ]

dtyplr

Finally, let’s get a little bit of practice using the dtplyr package. Suppose we know how to answer the Great Recession question with dplyr. Fill in the missing pieces to run this code. Remember, you would use something like this if you have a computationally intensive code chunk and need that under the hood data.table speed, but aren’t yet familiar with the syntax.

library(dtplyr)

homes2 <- #FILL

homes2 |>
  mutate(ym_sold := as.Date(paste(Yr.Sold, Mo.Sold, "01", sep = "-"), "%Y-%m-%d")) |>
  select(ym_sold, SalePrice) |>
  filter(ym_sold >= "2007-12-01" & ym_sold <= "2009-06-01") |>
  group_by(ym_sold) |>
  summarize(avg_price = mean(SalePrice)) |>
  arrange(ym_sold) |>
  #FILLL the extra step at the end needed to use dtplyr