mlbdatatools | Baseball Analytics Library
A Python library for baseball analytics: type-safe DataFrames, multi-source data fetching, and plotting utilities for Statcast analysis.
The Problem
Baseball data comes from multiple sources with inconsistent schemas. pybaseball provides access to Statcast data but, at the time, lacked utility functions for common analytics workflows. Copy-pasting helper code between projects to fill that gap was error-prone.
Technical Approach
Built a type-safe library for modern baseball analytics:
- Unified DataFrame interfaces across Statcast, Baseball Reference, and FanGraphs data
- Multi-source data fetching with automatic schema normalization
- Plotting utilities for common visualizations (spray charts, heatmaps, timeline graphs)
- pandas and polars backends for performance flexibility
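To make the schema-normalization idea concrete, here is a minimal sketch using plain Python records rather than the library's own API. The `COLUMN_MAPS` table, the `normalize_record` helper, and the target column names are all hypothetical, chosen only for illustration. The same mapping dict could be passed to pandas' `DataFrame.rename(columns=...)` when working with DataFrames.

```python
from typing import Any

# Hypothetical per-source column mappings onto one shared schema.
# The source and target names are illustrative, not the library's actual tables.
COLUMN_MAPS: dict[str, dict[str, str]] = {
    "statcast": {"player_name": "player", "launch_speed": "exit_velocity_mph"},
    "fangraphs": {"Name": "player", "EV": "exit_velocity_mph"},
}


def normalize_record(source: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename a raw record's keys to the shared schema, dropping unmapped keys."""
    try:
        mapping = COLUMN_MAPS[source]
    except KeyError:
        raise ValueError(f"unknown source: {source!r}") from None

    # Fail loudly if the source stopped providing a column we depend on.
    missing = [col for col in mapping if col not in record]
    if missing:
        raise KeyError(f"{source} record is missing columns: {missing}")

    return {target: record[raw] for raw, target in mapping.items()}


statcast_row = normalize_record(
    "statcast", {"player_name": "A", "launch_speed": 101.2, "extra": 1}
)
fangraphs_row = normalize_record("fangraphs", {"Name": "A", "EV": 101.2})
```

After normalization both rows share the same keys, so downstream code does not need to know which source a record came from.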
Interesting Challenges
Data quality issues in baseball data are subtle. Batter handedness splits, park factors, and even basic stats like ERA require careful handling. The library enforces validation at data ingestion time.
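As one example of why even ERA needs care: box scores write innings pitched as `6.1` for 6 1/3 innings, so treating that value as a decimal number skews the result. The sketch below shows one way validation at ingestion could look. The `innings_to_outs` helper and the `PitchingLine` class are hypothetical names for illustration, not the library's actual interface.

```python
from __future__ import annotations

from dataclasses import dataclass


def innings_to_outs(ip: str) -> int:
    """Convert box-score innings notation ('6.1' means 6 1/3 innings) to outs."""
    whole, _, frac = ip.partition(".")
    frac = frac or "0"
    # Only .0, .1, and .2 are valid: the digit counts outs, not tenths.
    if not whole.isdecimal() or frac not in {"0", "1", "2"}:
        raise ValueError(f"invalid innings pitched: {ip!r}")
    return int(whole) * 3 + int(frac)


@dataclass(frozen=True)
class PitchingLine:
    """Hypothetical validated record; rejects bad values when constructed."""

    earned_runs: int
    outs: int

    def __post_init__(self) -> None:
        if self.earned_runs < 0 or self.outs < 0:
            raise ValueError("earned_runs and outs must be non-negative")

    @property
    def era(self) -> float | None:
        """ERA = 9 * ER / IP, which equals 27 * ER / outs. None if no outs recorded."""
        if self.outs == 0:
            return None
        return 27 * self.earned_runs / self.outs


line = PitchingLine(earned_runs=2, outs=innings_to_outs("6.1"))
```

Here `line.era` is about 2.84, while the naive `9 * 2 / 6.1` gives about 2.95. Storing outs as an integer also avoids floating-point drift when innings are summed across games.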
What I'd Do Differently
The library is useful but undermaintained. Modern alternatives like pybaseball have caught up on utility functions. The value now is primarily in my personal workflow rather than as a public library.