EDA project: WTA Grand Slam statistics


Overview

This project will be released on Thursday, June 6 and conclude with an 8-minute presentation on Tuesday, June 18 during lecture time.

Students will be randomly placed into groups of three and each group will be randomly assigned a dataset.

The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.

Deliverables

Each group is expected to make slides to accompany the 8-minute presentation.

The presentation should feature the following:

  • Overview of the structure of your dataset

  • Three questions/hypotheses you are interested in exploring

  • Three data visualizations exploring the questions, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization

  • One clustering analysis

  • Conclusions for the hypotheses based on your EDA and data visualizations
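
For the clustering requirement, k-means is one option among several. The sketch below uses simulated, made-up player summaries rather than the project data. The variable names (`aces_per_match`, `dfs_per_match`) and the choice of two clusters are assumptions made purely for illustration.

```r
set.seed(36)  # arbitrary seed so the toy example is reproducible

# Hypothetical per-player summaries with made-up values: two loose groups
toy_players <- data.frame(
  aces_per_match = c(rnorm(20, mean = 2, sd = 0.5), rnorm(20, mean = 6, sd = 0.5)),
  dfs_per_match  = c(rnorm(20, mean = 2, sd = 0.5), rnorm(20, mean = 5, sd = 0.5))
)

# Standardize so that no single variable dominates the Euclidean distances
toy_scaled <- scale(toy_players)

# centers = 2 is an assumption for this toy data, not a recommendation
toy_kmeans <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Attach the cluster labels and check the cluster sizes
toy_players$cluster <- factor(toy_kmeans$cluster)
table(toy_players$cluster)
```

With real data, the number of clusters and the variables to cluster on are choices your group should justify in the presentation.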

Timeline

There will be two submission deadlines:

Thursday, June 13 at 5pm ET - Each student will push their individual code for the project thus far to GitHub for review. We will then provide feedback.

Monday, June 17 at 5pm ET - Slides and full code must be completed and ready for presentation. Send your slides to Quang (quang@stat.cmu.edu). All code must be written in R, but the slides may be created in any software. Take advantage of examples from lectures, and feel free to explore relevant online resources. (Be sure to consult the R help documentation before googling around or asking ChatGPT.)

Data

This dataset contains all WTA Grand Slam matches between 2018 and 2023, courtesy of the famous tennis data repository maintained by Jeff Sackmann.

Each row in the data corresponds to a single WTA Grand Slam match. The columns contain general information about the matches:

  • tourney_name: name of the Grand Slam tournament
  • surface: type of court surface
  • tourney_date: eight digits, YYYYMMDD, usually the Monday of the tournament week
  • {winner, loser}_seed: seed of winning/losing player
  • {winner, loser}_name: name of the winning/losing player
  • {winner, loser}_hand: R: right, L: left, U: unknown (for ambidextrous players, this is their serving hand)
  • {winner, loser}_ht: height in centimeters, where available
  • {winner, loser}_ioc: three-character country code
  • {winner, loser}_age: age, in years, as of the tourney_date
  • score: final match score
  • round: tournament round
  • minutes: match length in minutes
  • {w, l}_ace: winner/loser’s number of aces
  • {w, l}_df: winner/loser’s number of double faults
  • {w, l}_svpt: winner/loser’s number of serve points
  • {w, l}_1stIn: winner/loser’s number of first serves made
  • {w, l}_1stWon: winner/loser’s number of first-serve points won
  • {w, l}_2ndWon: winner/loser’s number of second-serve points won
  • {w, l}_SvGms: winner/loser’s number of serve games
  • {w, l}_bpSaved: winner/loser’s number of break points saved
  • {w, l}_bpFaced: winner/loser’s number of break points faced
  • {winner, loser}_rank: winner/loser’s WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date

Note that a full glossary of the features available for match data can be found here.
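
Several of these columns are more useful after a small transformation. The sketch below parses the date and turns raw serve counts into rates. It runs on a toy table with made-up values. Only the column names come from the list above, and the derived names (`year`, `w_1st_in_pct`, `w_1st_won_pct`) are hypothetical.

```r
library(dplyr)

# Toy rows with made-up values; only the column names come from the data
toy_matches <- tibble(
  tourney_date = c(20180115, 20190527, 20230828),
  w_svpt   = c(60, 75, 82),
  w_1stIn  = c(40, 45, 50),
  w_1stWon = c(30, 33, 35)
)

toy_matches <- toy_matches |>
  mutate(
    # tourney_date is an eight-digit number, so convert via character
    tourney_date = as.Date(as.character(tourney_date), format = "%Y%m%d"),
    year = as.integer(format(tourney_date, "%Y")),
    # share of the winner's serve points on which the first serve went in
    w_1st_in_pct = w_1stIn / w_svpt,
    # share of first-serve points that the winner went on to win
    w_1st_won_pct = w_1stWon / w_1stIn
  )
```

Rates like these are easier to compare across matches of different lengths than raw counts are.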

Starter code

library(tidyverse)
wta_grand_slam_matches <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/wta_grand_slam_matches.csv")
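
A reasonable first step after loading the data is to check its structure: column types, counts per tournament, and missing values. The sketch below uses a small stand-in table with made-up values so that it runs offline. In practice you would apply the same calls to `wta_grand_slam_matches`.

```r
library(dplyr)

# Stand-in with made-up values; replace with wta_grand_slam_matches
stand_in_matches <- tibble(
  tourney_name = c("WIMBLEDON", "WIMBLEDON", "US OPEN", "US OPEN"),
  round        = c("F", "SF", "F", "SF"),
  minutes      = c(95, NA, 120, 88),
  winner_ht    = c(175, NA, NA, 182)
)

# Column names and types at a glance
glimpse(stand_in_matches)

# Number of matches per tournament
matches_per_tourney <- stand_in_matches |>
  count(tourney_name)

# Number of missing values in each column
missing_per_column <- stand_in_matches |>
  summarize(across(everything(), \(x) sum(is.na(x))))
```

Columns with many missing values (heights are only recorded "where available") deserve a note in your dataset overview.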

In case you’re curious, the code to build this dataset can be found below.

library(tidyverse)

# Download one season of WTA matches and keep only the Grand Slam rows
get_grand_slam_matches <- function(year) {
  year_url <- str_c(
    "https://raw.githubusercontent.com/JeffSackmann/tennis_wta/master/wta_matches_",
    year, ".csv"
  )
  year_matches <- year_url |>
    read_csv() |>
    # tourney_level "G" marks Grand Slam matches
    filter(tourney_level == "G") |>
    # seeds stored as character, which keeps the column type consistent across seasons
    mutate(winner_seed = as.character(winner_seed),
           loser_seed = as.character(loser_seed),
           tourney_name = str_to_upper(tourney_name)) |>
    # drop identifier and bookkeeping columns
    select(-tourney_id, -tourney_level, -best_of, -draw_size, -match_num, -winner_id,
           -winner_entry, -loser_id, -loser_entry, -winner_rank_points, -loser_rank_points)

  return(year_matches)
}

years <- 2018:2023
wta_grand_slam_matches <- years |> 
  map(get_grand_slam_matches) |> 
  bind_rows()