library(data.table)
library(tidyverse)
library(dtplyr)
<- fread("../../data/ames-housing.csv") homes
Try it Yourself Exercises for data.table R package
In this lab, we will be working with data on homes sold in Ames, Iowa, including their attributes and sale prices. This dataset was produced by the Ames, Iowa Assessor’s Office and made accessible by Dean De Cock, Truman State University. To learn more about the data and download it, visit the CMU data repository.
Recalling the 6 main verbs of data.table
We will start by recalling the data.table syntax of the 6 main verbs. Run the following code and write a brief description of what each line of code is doing.
<- homes[,.(Year.Built, Mo.Sold, Yr.Sold, SalePrice)] homes_sub
ANSWER HERE:
:= as.Date(paste(Yr.Sold, Mo.Sold, "01", sep = "-"), "%Y-%m-%d")] homes_sub[, ym_sold
ANSWER HERE:
== 4] homes[Full.Bath
ANSWER HERE:
order(Year.Built)] homes_sub[
ANSWER HERE:
=Yr.Sold] homes_sub[, .N, by
ANSWER HERE:
Write code to for a prompt
In this section we are interested in investigating the sale price of homes during the Great Recession that occurred from December 2007 to June 2009. For each prompt, write the corresponding code.
Mutate a variable called sold_ym
to the homes
dataset that gives the month and year that the home was sold. Hint: see previous problem.
Select ym_sold
and SalePrice
and save this to new dataset called sales
.
Filter sales
to only include homes sold in the recession (December 2007-June 2009). Resave this dataset as sales
.
Group_by the year/month the home was sold and summarize the average selling price in that month. Resave this dataset as sales
.
Arrange the dataset by date.
Adding pipes
In the prompts above, we iteratively overwrote the sales
dataset each step. This can get clunky—instead we could pipe together the above prompts. Give it a try!
Combine steps to answer a question
Now that we have a gotten some practice with the 6 main verbs, your challenge is to utilize them to answer the following questions.
- Start by finding the top 3
Neighborhood
s with the most homes sold. For these neighborhoods, find their averageYear.Built
andSalePrice
and the arrange the end dataset from oldest to most recent average year built.
- Are homes with larger garages (garages that fit 3 or more cars) typically built on larger lots (
Lot.Area
)? How does the average price for these homes compare to homes with 2 or fewer car garages?
Bonus Exercises
Joining
We will now get some practice with joining datasets together. Run the following code chunk in order to create the dataset expensive
, which gives the 5 neighborhoods with the most expensive average selling price. We also have stores
, which says how many grocery stores are in each neighborhood (this is made up).
<- homes[,.(avg_selling = mean(SalePrice), N= .N), by=Neighborhood][
expensive >10][
Norder(-avg_selling)[1:5]]
<- rowwiseDT(Neighborhood =, Grocery =,
stores "NoRidge", 3,
"StoneBr", 1,
"NridgHt", 3,
"Veenker", 2,
"Timber", 0)
Write code to join the two datasets so that the number of grocery stores is a column in the expensive
dataset.
What if we wanted to get all the homes sold from those top 5 most expensive neighborhood? Instead of typing out each of these neighborhoods in a filter statement, we can use a join. Write code to do this.
If you want to learn more about joins, check out this great vignette!
Visualization by a grouping variable
Let’s visualize the relationship between above ground living area in square feet (Gr.Liv.Area
) and SalePrice
for three neighborhoods: NAmes, CollgCr, and OldTown. Fill in the missing parts of the code.
%in% c("NAmes", "CollgCr", "OldTown")][, print(
homes[Neighborhood ggplot(#FILL,
aes(x=Gr.Liv.Area, y=SalePrice))+
geom_point(alpha=0.5)+
theme_classic()+
ggtitle(paste(Neighborhood[1]))), by=#FILL
]
dtyplr
Finally, let’s get a little bit of practice using the dtplyr
package. Suppose we know how to answer the Great Recession question with dplyr. Fill in the missing pieces to run this code. Remember, you would use something like this if you have a computationally intensive code chunk and need that under the hood data.table speed, but aren’t yet familiar with the syntax.
library(dtplyr)
<- #FILL
homes2
|>
homes2 mutate(ym_sold := as.Date(paste(Yr.Sold, Mo.Sold, "01", sep = "-"), "%Y-%m-%d")) |>
select(ym_sold, SalePrice) |>
filter(ym_sold >= "2007-12-01" & ym_sold <= "2009-06-01") |>
group_by(ym_sold) |>
summarize(avg_price = mean(SalePrice)) |>
arrange(ym_sold) |>
#FILLL the extra step at the end needed to use dtplyr