MLB Data Tools
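A Python library for baseball analytics that exposes data not available through any official API: career Statcast splits and percentile rankings from Baseball Savant player pages, and per-play outs above average (OAA) from an undocumented endpoint. It returns type-safe DataFrames and includes plotting utilities.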
The Problem
Baseball data comes from multiple sources with inconsistent schemas. pybaseball provides Statcast data but, when this library was written, lacked utility functions for common analytics workflows. More importantly, some of the most useful data on Baseball Savant isn't exposed through any official API.
Unique Data Access
- Savant player pages: Scrapes embedded JSON for career Statcast splits and percentile rankings. No API exists for this.
- Hidden OAA endpoint: Undocumented API for per-play outs above average, not just season aggregates.
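The embedded-JSON approach can be sketched as follows. This is a minimal illustration and not the library's actual code: the variable name `playerData`, the sample HTML, and the field names are all made up, and it assumes the page assigns a strictly valid JSON value to a JavaScript variable inside a script block.

```python
import json
import re


def extract_embedded_json(html: str, var_name: str):
    """Return the JSON value assigned to `var_name` somewhere in `html`.

    Finds `var_name = <json>` and lets the JSON decoder locate the end of
    the value, which is more robust than a regex over nested braces.
    """
    match = re.search(rf"\b{re.escape(var_name)}\s*=\s*", html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (here, the trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hypothetical page fragment; the variable name and fields are invented.
SAMPLE_HTML = """
<html><body>
<script>
  var playerData = {"name": "Example Player", "splits": [{"year": 2023, "xwoba": 0.345}]};
</script>
</body></html>
"""

data = extract_embedded_json(SAMPLE_HTML, "playerData")
print(data["splits"][0]["year"])
```

In practice the HTML would come from an HTTP request, and a page that embeds a JavaScript object literal rather than strict JSON would need a more forgiving parser.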
Technical Approach
Built a type-safe library for modern baseball analytics:
- Unified DataFrame interfaces across Statcast, Baseball Reference, and Fangraphs data
- Multi-source data fetching with automatic schema normalization
- Plotting utilities for common visualizations (spray charts, heatmaps, timeline graphs)
- pandas and polars backends for performance flexibility
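Schema normalization of this kind usually comes down to a per-source column map applied at fetch time. The sketch below shows the idea on plain dict records so that it runs without dependencies. The column maps and the unified names (`player`, `exit_velocity_mph`) are illustrative assumptions, not the library's real schema. With a DataFrame backend the same mapping could be passed to the backend's rename method.

```python
from typing import Any

# Hypothetical per-source maps: source column name -> unified column name.
COLUMN_MAPS: dict[str, dict[str, str]] = {
    "statcast": {"player_name": "player", "launch_speed": "exit_velocity_mph"},
    "fangraphs": {"Name": "player", "EV": "exit_velocity_mph"},
}


def normalize_records(source: str, rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Rename known columns to the unified schema and drop unmapped ones."""
    mapping = COLUMN_MAPS[source]  # KeyError for an unknown source is intentional
    return [
        {unified: row[raw] for raw, unified in mapping.items() if raw in row}
        for row in rows
    ]


statcast_rows = normalize_records(
    "statcast", [{"player_name": "A", "launch_speed": 101.2, "extra": 1}]
)
fangraphs_rows = normalize_records("fangraphs", [{"Name": "A", "EV": 101.2}])
print(statcast_rows == fangraphs_rows)
```

Keeping the maps as data rather than code means adding a source is a one-entry change, and downstream plotting code only ever sees the unified names.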
Interesting Challenges
Data quality issues in baseball data are subtle. Batter handedness splits, park factors, and even basic stats like ERA require careful handling. The library enforces validation at data ingestion time.
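ERA is a good example of why ingestion-time checks matter. Box scores write innings pitched as "6.1" to mean six and one-third innings, so treating that value as a decimal gives the wrong ERA. The sketch below shows one way such validation could look. The function names are hypothetical, and the handedness check assumes a per-pitch batting side coded as "L" or "R".

```python
VALID_BATTING_SIDES = {"L", "R"}  # assumed per-pitch coding, not confirmed by the source


def innings_to_outs(ip: str) -> int:
    """Convert box-score innings notation ("6.1" = 6 1/3 innings) to outs."""
    whole, _, frac = ip.partition(".")
    if frac not in ("", "0", "1", "2"):
        raise ValueError(f"invalid innings notation: {ip!r}")
    return int(whole) * 3 + int(frac or 0)


def era(earned_runs: int, ip: str) -> float:
    """Earned run average: 9 * ER / IP, computed from outs to avoid the decimal trap."""
    outs = innings_to_outs(ip)
    if outs == 0:
        raise ValueError("ERA is undefined with zero outs recorded")
    return 27 * earned_runs / outs


def validate_batting_side(side: str) -> str:
    """Reject anything other than the expected handedness codes."""
    if side not in VALID_BATTING_SIDES:
        raise ValueError(f"unexpected batting side: {side!r}")
    return side


# "6.1" innings is 19 outs, so two earned runs gives 27 * 2 / 19.
correct = era(2, "6.1")
naive = 9 * 2 / float("6.1")  # treats 6.1 as a decimal and overstates ERA
print(round(correct, 3), round(naive, 3))
```

Rejecting bad values at ingestion keeps the error close to its source instead of letting it surface later as a slightly wrong aggregate.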
What I'd Do Differently
The library is useful but undermaintained. Alternatives such as pybaseball have since caught up on utility functions, so its unique value is now primarily the hidden data access: the Savant player page scraping and the undocumented OAA endpoint.