Lab: data wrangling

Reading and previewing data

Our data are usually stored as a .csv file and after loading a .csv file into RStudio, we will have a “data frame”. A data frame can be considered a special case of matrix where each column represents a measurement or variable of interest for each observation which correspond to the rows of the dataset. After loading the tidyverse suite of packages, we use the read_csv() function to load the heart_disease dataset from yesterday:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

heart_disease <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/heart_disease.csv")

Rows: 788 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Gender
dbl (9): Cost, Age, Interventions, Drugs, ERVisit, Complications, Comorbidit...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

By default, read_csv() reads in the dataset as a tbl (aka tibble) object instead of a data.frame object. You can read about the differences here, but it’s not that meaningful for purposes.

We can use the functions slice_head() and slice_tail() to view a sample of the data. Use the slice_head() function to view the first 6 rows, then use the slice_tail() function to view the last 3 rows:

# INSERT CODE HERE

View the dimensions of the data with dim():

# INSERT CODE HERE

Quickly view summary statistics for all variables with the summary() function:

# Uncomment the following code by deleting the # at the front
# summary(heart_disease)

View the data structure types with str():

# str(heart_disease)

What’s the difference between the output from the two functions?

You can find a description of the dataset here.

Data manipulation with `dplyr`

An easier way to manipulate the data frame is through the dplyr package, which is in the tidyverse suite of packages. The operations we can do include: selecting specific columns, filtering for rows, re-ordering rows, adding new columns and summarizing data. The “split-apply-combine” concept can be achieved by dplyr.

Selecting columns with `select()`

The function select() can be use to select certain column with the column names. First create a new table called heart_disease_ad that only contains the Age and Drugs columns:

# INSERT CODE HERE

To select all columns except a specific column, use the - (subtraction) operator. For example, view the output from uncommenting the following line of code:

# select(heart_disease, -Interventions)

To select a range of columns by name (that are in consecutive order), use the : (colon) operator. For example, view the output from uncommenting the following line of code:

# select(heart_disease, Drugs:Duration)

To select all columns that start with certain character strings, use the function starts_with(). Other matching options are:

ends_with(): select columns that end with a character string
contains(): select columns that contain a character string
matches(): select columns that match a regular expression
one_of(): select columns names that are from a group of names

# Uncomment the following lines of code
# select(heart_disease, starts_with("Com"))
# select(heart_disease, contains("er"))

Extracting rows using `filter()`

We can also extract the rows/observations that satisfy certain criteria. Try extracting the rows with Cost of more than 400:

# INSERT CODE HERE

We can also filter on multiple criteria. Subset the rows with Age above 60 and the gender is “Male”:

# INSERT CODE HERE

Arranging rows using `arrange()`

To arrange the data frame by a specific order we need to use the function arrange(). The default is by increasing order and the desc() will provide the decreasing order. First arrange the heart_disease table by Duration in ascending order:

# INSERT CODE HERE

Next by descending order:

# INSERT CODE HERE

Try combining a pipeline of select(), filter(), and arrange() steps together with the |> operator by:

Selecting the Age, Cost, ERVisit, and Duration columns,
Filter to select only rows with Age above 60,
Sort by Duration in descending order

# INSERT CODE HERE

Creating new columns using `mutate()`

Sometimes the data does not include the variable that we are interested in and we need to manipulate the current variables to add new variables into the data frame. Create a new column cost_per_day by taking the Cost and dividing by Duration (reassign this output to the heart_disease table following the commented code chunk so this column is added to the table):

# heart_disease <- heart_disease |>
#   mutate(INSERT CODE HERE)

Creating summaries with `summarize()`

To create summary statistics for a given column in the data frame, we can use summarize() function. Compute the mean, min, and max amount of Cost:

# INSERT CODE HERE

The advantage of summarize() is more obvious if we combine it with group_by(), the group operators. Try to group_by() the Gender column first and then compute the same summary statistics for Cost:

# INSERT CODE HERE

Reading and previewing data

Data manipulation with dplyr

Selecting columns with select()

Extracting rows using filter()

Arranging rows using arrange()