EDA project: NFL passing


Overview

This project will be released on Thursday, June 6 and conclude with an 8-minute presentation on Tuesday, June 18 during lecture time.

Students will be randomly placed into groups of three and each group will be randomly assigned a dataset.

The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.

Deliverables

Each group is expected to make slides to accompany the 8-minute presentation.

The presentation should feature the following:

  • Overview of the structure of your dataset

  • Three questions/hypotheses you are interested in exploring

  • Three data visualizations exploring the questions, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization

  • One clustering analysis

  • Conclusions for the hypotheses based on your EDA and data visualizations

Timeline

There will be two submission deadlines:

Thursday, June 13 at 5pm ET - Each student will push their individual code for the project thus far to GitHub for review. We will then provide feedback.

Monday, June 17 at 5pm ET - Slides and full code must be completed and ready for presentation. Send your slides to Quang (quang@stat.cmu.edu). All code must be written in R; but the slides may be created in any software. Take advantage of examples from lectures, but also feel free to explore online resources that may be relevant. (But be sure to always consult the R help documentation first before attempting to google around or ask ChatGPT.)

Data

This dataset contains all passing plays from the 2023 NFL regular season accessed via the nflfastR package (part of the nflverse).

Each row in the dataset corresponds to a single passing play (including sacks) and the columns are:

  • passer_player_name: name for the player that attempted the pass
  • passer_player_id: unique identifier for the player that attempted the pass
  • posteam: abbreviation for the team with possession
  • complete_pass: indicator denoting whether or not the pass was completed
  • interception: indicator denoting whether or not the pass was intercepted by the defense
  • yards_gained: yards gained (or lost) by the possessing team, excluding yards gained via fumble recoveries and laterals
  • touchdown: indicator denoting if the play resulted in a touchdown
  • pass_location: categorical location of pass
  • pass_length: categorical length of pass
  • air_yards: distance in yards perpendicular to the line of scrimmage at where the targeted receiver either caught or didn’t catch the ball
  • yards_after_catch: distance in yards perpendicular to the yard line where the receiver made the reception to where the play ended
  • epa: expected points added (EPA) by the posteam for the given play
  • wpa: win probability added (WPA) for the posteam
  • shotgun: indicator for whether or not the play was in shotgun formation
  • no_huddle: indicator for whether or not the play was in no_huddle formation
  • qb_dropback: indicator for whether or not the QB dropped back on the play (pass attempt, sack, or scrambled)
  • qb_hit: indicator if the QB was hit on the play
  • sack: indicator for if the play ended in a sack
  • receiver_player_name: name for the targeted receiver
  • receiver_player_id: unique identifier for the receiver that was targeted on the pass
  • defteam: abbreviation for the team on defense
  • posteam_type: indicating whether the posteam team is home or away
  • play_id: unique identifier for a single play
  • yardline_100: distance in the number of yards from the opponent’s endzone for the posteam
  • side_of_field: abbreviation for which team’s side of the field the team with possession is currently on
  • down: down for the given play
  • qtr: quarter of the game (5 is overtime)
  • play_clock: time on the playclock when the ball was snapped
  • half_seconds_remaining: seconds remaining in the half
  • game_half: indicating which half the play is in
  • game_id: ten digit identifier for NFL game
  • home_team: abbreviation for the home team
  • away_team: abbreviation for the away team
  • home_score: total points scored by the home team
  • away_score: total points scored by the away team
  • desc: detailed description for the given play

Note that a full glossary of the features available for NFL play-by-play data can be found here.

Starter code

library(tidyverse)
nfl_passing <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/nfl_passing.csv")

In case you’re curious, the code to build this dataset can be found below.

library(tidyverse)
library(nflfastR) # install.packages("nflverse")
nfl_pbp <- load_pbp(2023)
nfl_passing <- nfl_pbp |> 
  filter(play_type == "pass", season_type == "REG", 
         !is.na(epa), !is.na(posteam), posteam != "") |> 
  select(# player info attempting the pass
         passer_player_name, passer_player_id, posteam,
         # info about the pass:
         complete_pass, interception, yards_gained, touchdown,
         pass_location, pass_length, air_yards, yards_after_catch, epa, wpa,
         shotgun, no_huddle, qb_dropback, qb_hit, sack,
         # context about the receiver:
         receiver_player_name, receiver_player_id,
         # team context
         posteam, defteam, posteam_type,
         # play and game context
         play_id, yardline_100, side_of_field, down, qtr, play_clock,
         half_seconds_remaining, game_half, game_id,
         home_team, away_team, home_score, away_score,
         # description of play
         desc)