EDA project: NHL shooting


Overview

This project will be released on Thursday, June 6 and conclude with an 8-minute presentation on Tuesday, June 18 during lecture time.

Students will be randomly placed into groups of three and each group will be randomly assigned a dataset.

The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.

Deliverables

Each group is expected to make slides to accompany the 8-minute presentation.

The presentation should feature the following:

  • Overview of the structure of your dataset

  • Three questions/hypotheses you are interested in exploring

  • Three data visualizations exploring the questions, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization

  • One clustering analysis

  • Conclusions for the hypotheses based on your EDA and data visualizations

Timeline

There will be two submission deadlines:

Thursday, June 13 at 5pm ET - Each student will push their individual code for the project thus far to GitHub for review. We will then provide feedback.

Monday, June 17 at 5pm ET - Slides and full code must be completed and ready for presentation. Send your slides to Quang (quang@stat.cmu.edu). All code must be written in R; but the slides may be created in any software. Take advantage of examples from lectures, but also feel free to explore online resources that may be relevant. (But be sure to always consult the R help documentation first before attempting to google around or ask ChatGPT.)

Data

This dataset contains all shot attempts from the 2023 NHL playoffs, courtesy of MoneyPuck.com.

Each row in the dataset corresponds to a shot attempt and the columns are:

  • shooterPlayerId: player id of the skater taking the shot
  • shooterName: first and Last name of the player taking the shot
  • team: team taking the shot
  • shooterLeftRight: whether the shooter is a left or right shot
  • shooterTimeOnIce: playing time in seconds that have passed since the shooter started their shift
  • shooterTimeOnIceSinceFaceoff: minimum of the playing time in seconds since the last faceoff and the playing time that has passed since the shooter started their shift
  • event: whether the shot was a shot on goal (SHOT), goal, (GOAL), or missed the net (MISS)
  • location: the zone the shot took place in
  • shotType: type of shot
  • shotAngle: angle of the shot in degrees, positive if the shot is from the left side of the ice.
  • shotAnglePlusRebound: difference in angle between the previous shot and this shot if this shot is a rebound, is otherwise set to 0
  • shotDistance: distance from the net of the shot in feet, net is defined as being at the (89,0) coordinates
  • shotOnEmptyNet: whether the shot was on an empty net
  • shotRebound: whether the shot is a rebound, i.e., if the last event was a shot and within 3 seconds of this shot
  • shotRush: whether the shot was on a rush, i.e., ff the last event was in another zone and within 4 seconds
  • shotWasOnGoal: whether the shot was on net - either a goal or a goalie save
  • shotGeneratedRebound: whether the shot generated a rebound shot within 3 seconds of the this shot
  • shotGoalieFroze: whether the goalie froze the puck within 1 second of the shot
  • arenaAdjustedShotDistance: shot distance adjusted for arena recording bias - uses the same methodology as War On Ice proposed by Schuckers and Curro
  • arenaAdjustedXCord: x coordinate of the arena adjusted shot location, always a positive number
  • arenaAdjustedYCord: y coordinate of the arena adjusted shot location
  • goalieIdForShot: player id for the goalie the shot is on
  • goalieNameForShot: first and Last name of the goalie the shot is on
  • teamCode: team code of the shooting team
  • isHomeTeam: whether the shooting team is the home team
  • homeSkatersOnIce: number of skaters on ice for the home team (does not count the goalie)
  • awaySkatersOnIce: number of skaters on ice for the away team (does not count the goalie)
  • game_id: game id of the game the shot took place in
  • homeTeamCode: home team in the game
  • awayTeamCode: away team in the game
  • homeTeamGoals: home team goals before the shot took place
  • awayTeamGoals: away team goals before the shot took place
  • time: seconds into the game of the shot
  • period: period of the game

Note that a full glossary of the features available for NHL shot data can be found here.

Starter code

library(tidyverse)
nhl_shots <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/nhl_shots.csv")

In case you’re curious, the code to build this dataset can be found below. (Note that the data were originally downloaded from the MoneyPuck site.)

# download and unzip
# https://peter-tanner.com/moneypuck/downloads/shots_2022.zip
nhl_shots <- read_csv("shots_2022.csv") # might need to modify file path

nhl_shots <- nhl_shots |> 
  filter(isPlayoffGame == 1) |> 
  select(# shooter info
         shooterPlayerId, shooterName, team, shooterLeftRight, 
         shooterTimeOnIce, shooterTimeOnIceSinceFaceoff,
         # shot info
         event, location, shotType, shotAngle, shotAnglePlusRebound, 
         shotDistance, shotOnEmptyNet, shotRebound, shotRush, 
         shotWasOnGoal, shotGeneratedRebound, shotGoalieFroze,
         # arena-adjusted locations
         arenaAdjustedShotDistance, arenaAdjustedXCord, arenaAdjustedYCord,
         # goalie info
         goalieIdForShot, goalieNameForShot,
         # team context
         teamCode, isHomeTeam, homeSkatersOnIce, awaySkatersOnIce,
         # game context
         game_id, homeTeamCode, awayTeamCode, homeTeamGoals, awayTeamGoals,
         time, period)