← back

mlbdatatools | Baseball Analytics Library

2024Baseball Modeling

Baseball analytics Python library. Type-safe DataFrames, multi-source fetching, and plotting utilities for Statcast analysis.

The Problem

Baseball data comes from multiple sources with inconsistent schemas. pybaseball provides Statcast but lacks utility functions for common analytics workflows. Copy-pasting code between projects was error-prone.

Technical Approach

Built a type-safe library for modern baseball analytics:

- Unified DataFrame interfaces across Statcast, Baseball Reference, and Fangraphs data

- Multi-source data fetching with automatic schema normalization

- Plotting utilities for common visualizations (spray charts, heatmaps, timeline graphs)

- pandas and polars backends for performance flexibility

Interesting Challenges

Data quality issues in baseball data are subtle. Batter handedness splits, park factors, and even basic stats like ERA require careful handling. The library enforces validation at data ingestion time.

What I'd Do Differently

The library is useful but undermaintained. Modern alternatives like pybaseball have caught up on utility functions. The value now is primarily in my personal workflow rather than as a public library.

Tech Stack

Pythonpandaspolarsmatplotlib