library(tidyverse)
<- read_csv("https://raw.githubusercontent.com/36-SURE/2025/main/data/mlb_batted_balls.csv") mlb_batted_balls
EDA project: MLB batting
Overview
This project will be released on Thursday, June 5 and conclude with a 6-minute presentation on Tuesday, June 17 during lab.
Students will be randomly placed into groups of 2–3 and each group will be randomly assigned a dataset.
The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Deliverables
Each group is expected to make slides to accompany the 6-minute presentation. All group members must be present for the presentation and participate. No notes/scripts are allowed during the presentation.
The presentation should feature the following:
Overview of the structure of your dataset
Two questions/hypotheses you are interested in exploring
Two data visualizations exploring the questions—both must be multivariate (i.e., involving 2+ variables) and in different formats
One clustering analysis
Conclusions for the hypotheses based on your EDA and data visualizations
Timeline
There will be two submission deadlines:
Thursday, June 12 at 5pm ET - Each student will push their individual code for the project thus far to GitHub for review.
Tuesday, June 17 at 1pm ET - Slides and full code must be completed and ready for presentation. Send your slides to Quang (quang@stat.cmu.edu
). All code must be written in R
; but the slides may be created in any software. Take advantage of examples from lectures, but also feel free to explore online resources that may be relevant. (But be sure to always consult the R
help documentation first before attempting to google around or ask ChatGPT.)
Data
This dataset contains batted balls from the 2024 MLB regular season (from April 3, 2024 (first day of swing tracking data) onward), courtesy of Baseball Savant and accessed via the sabRmetrics
package.
Each row of the dataset corresponds to a batted ball and the columns are:
batter_name
: name of the batter (Last name, First name)batter_id
: unique identifier for the batterbatter_side
: handedness of hitter, eitherL
(left) orR
(right)bb_type
: batted ball typepitcher_id
: unique identifier for the pitcherpitch_number
: total pitch number of the plate appearancepitch_type
: type of pitch thrown (using Statcast’s algorithm)pitch_name
: full name of the pitch typepitch_hand
: handedness of pitcherevents
: categorical event denoting the outcome of the batted ballhit_coord_x
: horizontal coordinate of where a batted ball is first touched by a fielderhit_coord_y
: vertical coordinate of where a batted ball is first touched by a fielder (note: multiply this by -1 when plotting)hit_distance_sc
: distance of the batted ball in feet according to Statcastlaunch_speed
: exit velocity of the ball off the bat (mph)launch_angle
: the vertical angle of the ball off the bat measured from a line parallel to the groundbat_speed
: speed at the “sweet spot” of the bat, taking the average of the top 90% of a batter’s swings (mph)swing_length
: distance traveled by bat head from the start of a swing until the impact point (feet)arm_angle
: horizontal line extending from the location of the pitcher’s throwing shoulder and the location of ball at the time of the pitch (0° means perfectly horizontal to the flat ground (a sidearmer), while 90° means straight over the top)hit_location
: positional number of the player who fielded the ball, possible values are 1-9 for each player (see here)release_speed
: speed of the pitch measured when the ball is released (mph)effective_speed
: perceived velocity of the ball, i.e., the velocity of the pitch is adjusted for how close it is to home when it is released (mph)strike_zone_{top, bottom}
: top/bottom of the batter’s strike zone set by the operator when the ball is halfway to the platerelease_spin_rate
: total spin rate of pitch after it is released (rpm)extension
: how far from the rubber the ball is when it is released (feet)a{x, y, z}
: acceleration of pitch in{x, y, z}
dimensions, determined aty = 50
feet (ft/s\(^2\))v{x0, y0, z0}
: velocity of pitch in{x, y, z}
dimensions, determined aty = 50
feet (ft/s)release_pos_y
: release position of the ball from the catcher’s perspective (feet)release_pos_{x, z}
: horizontal/vertical release position of the ball from the catcher’s perspective (feet)plate_{x, z}
: horizontal/vertical position of the ball when it crosses home plate from the catcher’s perspectivepfx_{x, z}
: horizontal/vertical movement from the catcher’s perspective (feet)if_fielding_alignment
: type of infield shift by defenseof_fielding_alignment
: type of outfield shift by defensegame_date
: date of the game (mm/dd/yyyy)balls
: number of balls in the countstrikes
: number of strikes in the countouts
: number of outs when the batter is uppre_runner_1b_id
: unique identifier for a runner on first base (if there is one)pre_runner_2b_id
: unique identifier for a runner on second base (if there is one)pre_runner_3b_id
: unique identifier for a runner on third base (if there is one)inning
: the inning numberinning_topbot
: top or bottom of the inninghome_score
: home team score before batted ballaway_score
: away team score before batted ball
post_home_score
: home team score after batted ballpost_away_score
: away team score after batted ball
des
: description of the batted ball and play
Note that a full glossary of the features available from MLB Statcast data can be found here.
Starter code
In case you’re curious, the code to build this dataset can be found below.
# devtools::install_github(repo = "saberpowers/sabRmetrics")
library(tidyverse)
library(sabRmetrics)
<- download_baseballsavant(start_date = "2024-04-03",
mlb_batted_balls end_date = "2024-12-31")
|>
mlb_batted_balls filter(type == "X") |>
select(batter_name, batter_id, bat_side, bb_type,
pitcher_id, pitch_number, pitch_type, pitch_name, pitch_hand,
events, hit_coord_x, hit_coord_y, launch_speed, launch_angle,
bat_speed, swing_length, arm_angle, hit_location,
release_speed, effective_speed,
strike_zone_top, strike_zone_bottom,
release_spin_rate, extension,
ax, ay, az, vx0, vy0, vz0,
release_pos_x, release_pos_y, release_pos_z,
plate_x, plate_z, pfx_x, pfx_z,
if_fielding_alignment, of_fielding_alignment,
game_date, balls, strikes, outs, inning, inning_topbot,
pre_runner_1b_id, pre_runner_2b_id, pre_runner_3b_id, home_score, away_score, post_home_score, post_away_score, des)