MLB Data Tools

Baseball Analytics Library | 2024

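A Python library for baseball analytics that exposes data not available through other tools: career Statcast splits and percentile rankings scraped from Baseball Savant player pages, and per-play outs above average (OAA) from an undocumented API. It returns type-safe DataFrames and includes plotting utilities.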

The Problem

Baseball data comes from multiple sources with inconsistent schemas. pybaseball provides Statcast data but, when this library was written, lacked utility functions for common analytics workflows. More importantly, some of the most useful data on Baseball Savant isn't exposed through any official API.

Unique Data Access

  • Savant player pages: Scrapes embedded JSON for career Statcast splits and percentile rankings. No API exists for this.
  • Hidden OAA endpoint: Undocumented API for per-play outs above average, not just season aggregates.
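
The embedded-JSON approach can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the variable name `statcastData`, the sample markup, and the field names are all assumptions, and the real Savant page may be laid out differently. The sketch parses a local string rather than making a network request.

```python
import json
import re

# Hypothetical page fragment. The real Savant markup, variable name,
# and field names are assumptions made for this example.
SAMPLE_HTML = """
<html><body>
<script>
  var statcastData = [{"year": 2023, "xwoba": 0.372, "percentile_xwoba": 91}];
</script>
</body></html>
"""


def extract_embedded_json(html: str, var_name: str):
    """Return the JSON value assigned to `var_name` inside the page source."""
    # Find "<var_name> =" and consume the whitespace that follows it.
    match = re.search(rf"\b{re.escape(var_name)}\s*=\s*", html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found in page")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever comes after it (the ';' and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


rows = extract_embedded_json(SAMPLE_HTML, "statcastData")
print(rows[0]["percentile_xwoba"])
```

Using `raw_decode` instead of a regex that tries to match the whole array avoids breaking on nested brackets inside the payload.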

Technical Approach

Built a type-safe library for modern baseball analytics:

  • Unified DataFrame interfaces across Statcast, Baseball Reference, and FanGraphs data
  • Multi-source data fetching with automatic schema normalization
  • Plotting utilities for common visualizations (spray charts, heatmaps, timeline graphs)
  • pandas and polars backends for performance flexibility
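
As a rough sketch of what schema normalization across sources could look like with the pandas backend: the column maps, the canonical names, and the `normalize` helper below are illustrative assumptions rather than the library's real interface, and the rows are placeholder values.

```python
import pandas as pd

# Illustrative per-source column maps (assumed for this example, and far
# from complete): each maps a source's column name to a canonical name.
COLUMN_MAPS = {
    "bref": {"Name": "player_name", "Tm": "team", "BA": "avg"},
    "fangraphs": {"Name": "player_name", "Team": "team", "AVG": "avg"},
}

# Canonical schema that every normalized frame must satisfy.
CANONICAL_DTYPES = {"player_name": "string", "team": "string", "avg": "float64"}


def normalize(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename source-specific columns and coerce them to the canonical dtypes."""
    mapping = COLUMN_MAPS[source]
    missing = set(mapping) - set(df.columns)
    if missing:
        raise KeyError(f"{source} frame is missing columns: {sorted(missing)}")
    # Rename, keep only canonical columns in a fixed order, then coerce types.
    out = df.rename(columns=mapping)[list(CANONICAL_DTYPES)]
    return out.astype(CANONICAL_DTYPES)


# Placeholder rows, not real statistics.
bref = pd.DataFrame({"Name": ["Player A"], "Tm": ["AAA"], "BA": [0.300]})
fangraphs = pd.DataFrame({"Name": ["Player B"], "Team": ["BBB"], "AVG": [0.250]})

combined = pd.concat(
    [normalize(bref, "bref"), normalize(fangraphs, "fangraphs")],
    ignore_index=True,
)
print(combined.dtypes)
```

Failing loudly on a missing column, rather than filling it with nulls, keeps an upstream schema change from silently corrupting downstream analysis.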

Interesting Challenges

Data quality issues in baseball data are subtle. Batter handedness splits, park factors, and even basic stats like ERA require careful handling. The library enforces validation at data ingestion time.
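
One concrete example of why ERA needs care: box scores conventionally write innings pitched as "6.1" to mean six innings plus one out, not 6.1 decimal innings. The sketch below shows one way an ingestion-time check might handle this. The function names are hypothetical and this is not the library's actual validation code.

```python
def ip_to_outs(ip: str) -> int:
    """Convert box-score innings notation ('6.1' = 6 innings + 1 out) to outs."""
    whole, _, frac = ip.partition(".")
    frac = frac or "0"
    # Only .0, .1, and .2 are valid: an inning has three outs.
    if frac not in {"0", "1", "2"}:
        raise ValueError(f"invalid innings-pitched value: {ip!r}")
    return int(whole) * 3 + int(frac)


def era(earned_runs: int, ip: str) -> float:
    """Earned run average: 9 * ER / IP, computed from outs to avoid the decimal trap."""
    outs = ip_to_outs(ip)
    if outs == 0:
        raise ValueError("ERA is undefined with zero outs recorded")
    # 9 * ER / (outs / 3) simplifies to 27 * ER / outs.
    return 27 * earned_runs / outs


naive = 9 * 3 / float("6.1")  # treats '6.1' as a decimal, which is wrong
correct = era(3, "6.1")       # treats '6.1' as 19 outs
print(round(naive, 3), round(correct, 3))
```

Rejecting a value like "6.3" at ingestion catches rows where the innings column has already been converted to decimals, or corrupted, somewhere upstream.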

What I'd Do Differently

The library is useful but undermaintained. Modern alternatives like pybaseball have caught up on utility functions. The unique value now is primarily the hidden data access: the Savant player page scraping and undocumented OAA endpoint remain useful.

Tech

Python, pandas, polars, matplotlib