Lab: data wrangling

Reading and previewing data

Our data are usually stored as .csv files. After loading a .csv file into R, we have a “data frame”: a rectangular table, similar to a matrix, in which each row is an observation and each column is a measurement or variable of interest recorded for every observation. Unlike a matrix, the columns of a data frame may have different types (for example, character and numeric). After loading the tidyverse suite of packages, we use the read_csv() function to load the 2024 NBA regular season stats dataset from yesterday:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

nba_stats <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/nba_stats.csv")

Rows: 657 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, position, team
dbl (20): age, games, games_started, minutes_played, field_goals, field_goal...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

By default, read_csv() reads in the dataset as a tbl (aka tibble) object instead of a plain data.frame object. You can read about the differences here, but they are not that meaningful for our purposes.

We can use the functions slice_head() and slice_tail() to view the first or last rows of the data. Use the slice_head() function to view the first 6 rows, then use the slice_tail() function to view the last 3 rows:

# INSERT CODE HERE
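
As a small illustration of how these two functions behave, here is a sketch on a made-up table. The `toy` tibble and its values are hypothetical and are not part of the NBA dataset; the `n` argument controls how many rows are returned.

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  player = c("A", "B", "C", "D", "E"),
  games  = c(10, 25, 40, 55, 70)
)

first_two <- slice_head(toy, n = 2)  # rows for players A and B
last_one  <- slice_tail(toy, n = 1)  # row for player E

first_two
last_one
```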

View the dimensions of the data with dim():

# INSERT CODE HERE

Quickly view summary statistics for all variables with the summary() function:

# Uncomment the following code by deleting the # at the front
# summary(nba_stats)

View the data structure types with str():

# str(nba_stats)

What’s the difference between the output from the two functions?

Data manipulation with `dplyr`

An easier way to manipulate a data frame is through the dplyr package, which is part of the tidyverse suite of packages. The operations we can do include selecting specific columns, filtering for rows, re-ordering rows, adding new columns, and summarizing data. dplyr also supports the “split-apply-combine” workflow: split the data into groups, apply a function to each group, and combine the results.

Selecting columns with `select()`

The function select() can be used to select certain columns by name. First create a new table called nba_stats_pg that only contains the player and games columns:

# INSERT CODE HERE

To select all columns except a specific column, use the - (subtraction) operator. For example, view the output from uncommenting the following line of code:

# select(nba_stats, -player)

To select a range of columns by name (that are in consecutive order), use the : (colon) operator. For example, view the output from uncommenting the following line of code:

# select(nba_stats, player:games)
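
The three selection styles above (by name, with -, and with :) can be compared side by side on a made-up table. This is only a sketch: the `toy` tibble and its columns are hypothetical stand-ins, not the lab's data.

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  player = c("A", "B"),
  team   = c("X", "Y"),
  age    = c(24, 31),
  games  = c(60, 72)
)

by_name  <- select(toy, player, age)  # keep only the named columns
dropped  <- select(toy, -team)        # every column except team
in_range <- select(toy, team:games)   # team, age, games (consecutive columns)

names(by_name)
names(dropped)
names(in_range)
```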

To select all columns whose names start with a certain character string, use the function starts_with(). Other matching options are:

ends_with(): select columns that end with a character string
contains(): select columns that contain a character string
matches(): select columns that match a regular expression
any_of(): select columns whose names appear in a given vector of names (this replaces the older one_of(), which is superseded)

# Uncomment the following lines of code
# select(nba_stats, starts_with("three"))
# select(nba_stats, contains("throw"))
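
To see how the matching helpers differ, here is a sketch on a made-up table. The column names in `toy` are hypothetical and chosen only to give each helper something to match; they are not necessarily the names used in nba_stats.

```r
library(dplyr)

# Hypothetical toy table (made-up column names and values)
toy <- tibble(
  player              = c("A", "B"),
  three_pointers      = c(50, 120),
  three_point_percent = c(0.33, 0.41),
  free_throws         = c(80, 95)
)

threes  <- select(toy, starts_with("three"))   # names beginning with "three"
throws  <- select(toy, contains("throw"))      # names containing "throw"
percent <- select(toy, ends_with("percent"))   # names ending in "percent"

# any_of() silently skips names that are not in the table ("steals" here)
wanted <- select(toy, any_of(c("player", "steals")))

names(threes)
names(wanted)
```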

Extracting rows using `filter()`

We can also extract the rows/observations that satisfy certain criteria. Try extracting the rows with more than 500 assists:

# INSERT CODE HERE

We can also filter on multiple criteria. Subset the rows where age is above 30 and the team is either “HOU” or “GSW”:

# INSERT CODE HERE
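
As a sketch of the general pattern, conditions passed to filter() as separate arguments are combined with "and", while %in% checks membership in a set of values. The `toy` tibble, its team codes, and the cutoffs below are hypothetical.

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  player = c("A", "B", "C", "D"),
  team   = c("X", "Y", "Z", "X"),
  age    = c(24, 31, 35, 28),
  points = c(120, 510, 300, 640)
)

# One condition: keep rows with more than 500 points
high_scorers <- filter(toy, points > 500)

# Two conditions, both required: older than 30 AND on team X or Y
older_xy <- filter(toy, age > 30, team %in% c("X", "Y"))

high_scorers
older_xy
```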

Arranging rows using `arrange()`

To arrange the rows of the data frame in a specific order, we use the function arrange(). The default is increasing order; wrapping a column in desc() gives decreasing order. First arrange the nba_stats table by personal_fouls in ascending order:

# INSERT CODE HERE

Next by descending order:

# INSERT CODE HERE
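
The difference between the default order and desc() can be seen in a minimal sketch on a made-up table (the `toy` tibble and its `points` column are hypothetical):

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  player = c("A", "B", "C"),
  points = c(150, 90, 210)
)

ascending  <- arrange(toy, points)        # smallest points first
descending <- arrange(toy, desc(points))  # largest points first

ascending
descending
```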

Try combining select(), filter(), and arrange() steps into a single pipeline with the |> operator by:

Selecting the player, team, age, and games columns,
Filtering to keep only rows with games above 50,
Sorting by age in descending order

# INSERT CODE HERE
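
A pipeline of this shape might look like the following sketch, written against a made-up table. The `toy` tibble, its columns, and the cutoff of 100 are hypothetical. Note that the native |> operator requires R 4.1 or later; each step receives the result of the previous step as its first argument.

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  player = c("A", "B", "C", "D"),
  team   = c("X", "Y", "Z", "X"),
  age    = c(24, 31, 35, 28),
  points = c(90, 300, 150, 220)
)

result <- toy |>
  select(player, team, points) |>  # keep three columns
  filter(points > 100) |>          # drop rows at or below the cutoff
  arrange(desc(points))            # largest points first

result
```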

Creating new columns using `mutate()`

Sometimes the data do not include the variable we are interested in, and we need to compute new variables from the existing ones and add them to the data frame. Create a new column fouls_per_game by dividing personal_fouls by games. Reassign the output to nba_stats, following the commented code chunk, so that the column is added to the table:

# nba_stats <- nba_stats |>
#   mutate(INSERT CODE HERE)
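
For reference, the general form is mutate(new_column = expression), where the expression can use existing columns. Here is a sketch on a made-up table; the `toy` tibble and the `points_per_game` column are hypothetical.

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  player = c("A", "B"),
  points = c(600, 1500),
  games  = c(60, 75)
)

# Reassigning to toy keeps the new column for later steps
toy <- toy |>
  mutate(points_per_game = points / games)

toy
```

Without the reassignment, mutate() would only print the modified table and `toy` itself would be unchanged.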

Creating summaries with `summarize()`

To compute summary statistics for a given column in the data frame, we can use the summarize() function. Compute the mean, min, and max number of assists:

# INSERT CODE HERE

The advantage of summarize() is more obvious when we combine it with group_by(), which groups the rows by the values of one or more columns. Since players at different positions tend to have very different statistics, first group_by() position and then compute the same summary statistics for assists:

# INSERT CODE HERE
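
The contrast between an overall summary and a grouped one can be sketched on a made-up table. The `toy` tibble, its position labels, and the summary column names are hypothetical. With group_by(), summarize() returns one row per group rather than one row for the whole table, which is the split-apply-combine pattern in practice.

```r
library(dplyr)

# Hypothetical toy table (made-up values, not the NBA data)
toy <- tibble(
  position = c("G", "G", "C", "C"),
  points   = c(10, 20, 30, 50)
)

# One row summarizing the whole table
overall <- summarize(
  toy,
  mean_points = mean(points),
  min_points  = min(points),
  max_points  = max(points)
)

# One row per position: split by group, apply the summaries, combine
by_position <- toy |>
  group_by(position) |>
  summarize(
    mean_points = mean(points),
    n_players   = n()
  )

overall
by_position
```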
