EDA project: MLB batting


Overview

This project will be released on Thursday, June 5 and conclude with a 6-minute presentation on Tuesday, June 17 during lab.

Students will be randomly placed into groups of 2–3 and each group will be randomly assigned a dataset.

The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.

Deliverables

Each group is expected to make slides to accompany the 6-minute presentation. All group members must be present for the presentation and participate. No notes/scripts are allowed during the presentation.

The presentation should feature the following:

  • Overview of the structure of your dataset

  • Two questions/hypotheses you are interested in exploring

  • Two data visualizations exploring the questions—both must be multivariate (i.e., involving 2+ variables) and in different formats

  • One clustering analysis

  • Conclusions for the hypotheses based on your EDA and data visualizations

Timeline

There will be two submission deadlines:

Thursday, June 12 at 5pm ET - Each student will push their individual code for the project thus far to GitHub for review.

Tuesday, June 17 at 1pm ET - Slides and full code must be completed and ready for presentation. Send your slides to Quang (quang@stat.cmu.edu). All code must be written in R; but the slides may be created in any software. Take advantage of examples from lectures, but also feel free to explore online resources that may be relevant. (But be sure to always consult the R help documentation first before attempting to google around or ask ChatGPT.)

Data

This dataset contains batted balls from the 2024 MLB regular season (from April 3, 2024 (first day of swing tracking data) onward), courtesy of Baseball Savant and accessed via the sabRmetrics package.

Each row of the dataset corresponds to a batted ball and the columns are:

  • batter_name: name of the batter (Last name, First name)
  • batter_id: unique identifier for the batter
  • batter_side: handedness of hitter, either L (left) or R (right)
  • bb_type: batted ball type
  • pitcher_id: unique identifier for the pitcher
  • pitch_number: total pitch number of the plate appearance
  • pitch_type: type of pitch thrown (using Statcast’s algorithm)
  • pitch_name: full name of the pitch type
  • pitch_hand: handedness of pitcher
  • events: categorical event denoting the outcome of the batted ball
  • hit_coord_x: horizontal coordinate of where a batted ball is first touched by a fielder
  • hit_coord_y: vertical coordinate of where a batted ball is first touched by a fielder (note: multiply this by -1 when plotting)
  • hit_distance_sc: distance of the batted ball in feet according to Statcast
  • launch_speed: exit velocity of the ball off the bat (mph)
  • launch_angle: the vertical angle of the ball off the bat measured from a line parallel to the ground
  • bat_speed: speed at the “sweet spot” of the bat, taking the average of the top 90% of a batter’s swings (mph)
  • swing_length: distance traveled by bat head from the start of a swing until the impact point (feet)
  • arm_angle: horizontal line extending from the location of the pitcher’s throwing shoulder and the location of ball at the time of the pitch (0° means perfectly horizontal to the flat ground (a sidearmer), while 90° means straight over the top)
  • hit_location: positional number of the player who fielded the ball, possible values are 1-9 for each player (see here)
  • release_speed: speed of the pitch measured when the ball is released (mph)
  • effective_speed: perceived velocity of the ball, i.e., the velocity of the pitch is adjusted for how close it is to home when it is released (mph)
  • strike_zone_{top, bottom}: top/bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate
  • release_spin_rate: total spin rate of pitch after it is released (rpm)
  • extension: how far from the rubber the ball is when it is released (feet)
  • a{x, y, z}: acceleration of pitch in {x, y, z} dimensions, determined at y = 50 feet (ft/s\(^2\))
  • v{x0, y0, z0}: velocity of pitch in {x, y, z} dimensions, determined at y = 50 feet (ft/s)
  • release_pos_y: release position of the ball from the catcher’s perspective (feet)
  • release_pos_{x, z}: horizontal/vertical release position of the ball from the catcher’s perspective (feet)
  • plate_{x, z}: horizontal/vertical position of the ball when it crosses home plate from the catcher’s perspective
  • pfx_{x, z}: horizontal/vertical movement from the catcher’s perspective (feet)
  • if_fielding_alignment: type of infield shift by defense
  • of_fielding_alignment: type of outfield shift by defense
  • game_date: date of the game (mm/dd/yyyy)
  • balls: number of balls in the count
  • strikes: number of strikes in the count
  • outs: number of outs when the batter is up
  • pre_runner_1b_id: unique identifier for a runner on first base (if there is one)
  • pre_runner_2b_id: unique identifier for a runner on second base (if there is one)
  • pre_runner_3b_id: unique identifier for a runner on third base (if there is one)
  • inning: the inning number
  • inning_topbot: top or bottom of the inning
  • home_score: home team score before batted ball
  • away_score: away team score before batted ball
  • post_home_score: home team score after batted ball
  • post_away_score: away team score after batted ball
  • des: description of the batted ball and play

Note that a full glossary of the features available from MLB Statcast data can be found here.

Starter code

library(tidyverse)
mlb_batted_balls <- read_csv("https://raw.githubusercontent.com/36-SURE/2025/main/data/mlb_batted_balls.csv")

In case you’re curious, the code to build this dataset can be found below.

# devtools::install_github(repo = "saberpowers/sabRmetrics")
library(tidyverse)
library(sabRmetrics)
mlb_batted_balls <- download_baseballsavant(start_date = "2024-04-03",
                                            end_date = "2024-12-31")

mlb_batted_balls |> 
  filter(type == "X") |> 
  select(batter_name, batter_id, bat_side, bb_type,
         pitcher_id, pitch_number, pitch_type, pitch_name, pitch_hand,
         events, hit_coord_x, hit_coord_y, launch_speed, launch_angle,
         bat_speed, swing_length, arm_angle, hit_location, 
         release_speed, effective_speed,
         strike_zone_top, strike_zone_bottom,
         release_spin_rate, extension,
         ax, ay, az, vx0, vy0, vz0, 
         release_pos_x, release_pos_y, release_pos_z,
         plate_x, plate_z, pfx_x, pfx_z,
         if_fielding_alignment, of_fielding_alignment,
         game_date, balls, strikes, outs, inning, inning_topbot,
         pre_runner_1b_id, pre_runner_2b_id, pre_runner_3b_id,
         home_score, away_score, post_home_score, post_away_score, des)