Methodology

The math behind the predictions.

The Core: 25-State Markov Chains

Every at-bat in baseball starts from one of 24 possible game states — a combination of 8 base-runner configurations (nobody on, runner on first, runners on first and second, etc.) and 3 out counts (0, 1, or 2 outs). Plus one absorbing state: 3 outs, inning over.
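As an illustration, the 25 states can be enumerated in a few lines of Python. The bitmask encoding and the names `state_index` and `describe` are assumptions made for this sketch, not necessarily the indexing used in production.

```python
from itertools import product

# Assumed encoding: bases is a 3-bit mask (bit 0 = first, bit 1 = second,
# bit 2 = third), giving 8 base-runner configurations per out count.
BASES = range(8)
OUTS = range(3)
ABSORBING = 24  # 3 outs, inning over


def state_index(bases: int, outs: int) -> int:
    """Map (bases bitmask, outs) to an index 0..23; 3 outs maps to 24."""
    if outs >= 3:
        return ABSORBING
    return outs * 8 + bases


def describe(index: int) -> str:
    """Human-readable label for a state index."""
    if index == ABSORBING:
        return "3 outs, inning over"
    outs, bases = divmod(index, 8)
    occupied = [name for bit, name in enumerate(("1st", "2nd", "3rd"))
                if bases & (1 << bit)]
    runners = ", ".join(occupied) if occupied else "bases empty"
    return f"{runners}; {outs} out(s)"


# 24 live states plus the absorbing state.
ALL_STATES = [state_index(b, o) for o, b in product(OUTS, BASES)] + [ABSORBING]
```

With this layout, `describe(state_index(0b011, 1))` labels the state with runners on first and second and one out.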

Every batter and every pitcher in our database has their own 25×25 transition probability matrix: the probability of moving from any state to any other state in an at-bat involving that player. These matrices are built from 10 seasons of play-by-play data (2016-2025) covering over 1.7 million at-bats.

We call these matrices Markov Scoring Indices (MSIs).
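One plausible way to estimate such a matrix is to count observed state transitions for a player and normalize each row. The sketch below assumes the transitions have already been extracted as `(state_before, state_after)` index pairs; the function name and the handling of unobserved rows are illustrative choices, not a description of our pipeline.

```python
N_STATES = 25
ABSORBING = 24  # 3 outs, inning over


def build_transition_matrix(transitions):
    """Estimate a 25x25 row-stochastic matrix from observed transitions.

    transitions: iterable of (state_before, state_after) index pairs.
    """
    counts = [[0] * N_STATES for _ in range(N_STATES)]
    for before, after in transitions:
        counts[before][after] += 1

    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if i == ABSORBING:
            # The absorbing state always stays where it is.
            probs = [0.0] * N_STATES
            probs[ABSORBING] = 1.0
        elif total == 0:
            # No observations for this starting state. Left as zeros here;
            # a real pipeline would need a fallback such as a league-average
            # row (assumption).
            probs = [0.0] * N_STATES
        else:
            probs = [c / total for c in row]
        matrix.append(probs)
    return matrix
```

Players with few at-bats will have many sparse or empty rows, which is one reason a league-average baseline is useful when matrices are combined.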

Combining Batter + Pitcher

For each at-bat, we combine the batter's MSI and the pitcher's MSI into a single transition matrix. The combination weights each player's tendencies against league-average baselines, producing a matchup-specific probability distribution.

This is the heart of the simulation — every at-bat outcome is driven by the specific batter-pitcher matchup, not team-level averages.
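The exact combination formula is not spelled out here. As a hedged illustration only, the sketch below uses an odds-ratio style rule (multiply the batter and pitcher probabilities, divide by the league-average probability, then renormalize each row), which is one common way to weigh two players against a shared baseline. The function name and the fallback behavior are assumptions.

```python
def combine_matchup(batter, pitcher, league):
    """Combine two transition matrices against a league-average baseline.

    Illustrative rule (an assumption, not necessarily the production formula):
    combined[i][j] is proportional to batter[i][j] * pitcher[i][j] / league[i][j].
    """
    combined = []
    for b_row, p_row, l_row in zip(batter, pitcher, league):
        raw = [(b * p / l) if l > 0 else 0.0
               for b, p, l in zip(b_row, p_row, l_row)]
        total = sum(raw)
        if total > 0:
            combined.append([x / total for x in raw])
        else:
            # Nothing usable in this row: fall back to the league row.
            combined.append(list(l_row))
    return combined
```

A useful property of this rule is that a league-average pitcher leaves the batter's matrix unchanged, and vice versa.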

Monte Carlo Simulation

To predict a game, we simulate it 10,000 times. Each simulation walks through the full lineup for both teams, simulating every at-bat using the combined transition matrices. Runs score naturally from state transitions (e.g., moving from runner on third, 0 outs to runner on first, 0 outs means the runner on third scored).
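A minimal sketch of one simulated half-inning follows. It assumes states are indexed as `outs * 8 + bases` with a 3-bit base mask, that `matrices` holds one already-combined 25×25 matrix per lineup slot, and that no run is credited on the play that records the third out (a simplification of a 25-state model). All names are hypothetical.

```python
import random

N_STATES = 25
ABSORBING = 24  # 3 outs, inning over


def unpack(state):
    """Return (runners_on_base, outs) for a non-absorbing state index."""
    outs, bases = divmod(state, 8)
    return bin(bases).count("1"), outs


def runs_on_transition(before, after):
    """Runs implied by a transition: everyone who was on base, plus the
    batter, must end up on base, out, or across the plate."""
    if after == ABSORBING:
        return 0  # assumption: no run credited on the third out
    on_before, outs_before = unpack(before)
    on_after, outs_after = unpack(after)
    return max(0, on_before + 1 - on_after - (outs_after - outs_before))


def simulate_half_inning(matrices, lineup_slot, rng):
    """Walk the chain from bases empty, 0 outs until the absorbing state.

    Returns (runs_scored, next_lineup_slot).
    """
    state, runs = 0, 0
    while state != ABSORBING:
        row = matrices[lineup_slot][state]
        next_state = rng.choices(range(N_STATES), weights=row)[0]
        runs += runs_on_transition(state, next_state)
        state = next_state
        lineup_slot = (lineup_slot + 1) % 9
    return runs, lineup_slot
```

A full game would repeat this for each half-inning, alternating teams and carrying `lineup_slot` forward, and the whole game would then be repeated 10,000 times with a seeded `random.Random` for reproducibility.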

From 10,000 simulations, we get full probability distributions for:

  • Win probability for each team
  • Expected score and score distribution
  • Over/under probabilities at various lines
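The outputs listed above are simple summaries of the simulated final scores. The sketch below assumes each simulation yields a `(home_runs, away_runs)` pair with no ties (extra innings already played out); the function name and the default total lines are placeholders.

```python
from collections import Counter
from statistics import mean


def summarize(results, total_lines=(7.5, 8.5, 9.5)):
    """Summarize simulated games given as (home_runs, away_runs) pairs."""
    n = len(results)
    home = [h for h, _ in results]
    away = [a for _, a in results]
    totals = [h + a for h, a in results]
    home_counts = Counter(home)
    return {
        "home_win_prob": sum(h > a for h, a in results) / n,
        "away_win_prob": sum(a > h for h, a in results) / n,
        "expected_home_runs": mean(home),
        "expected_away_runs": mean(away),
        # Share of simulations in which the home team scored exactly s runs.
        "home_score_dist": {s: c / n for s, c in sorted(home_counts.items())},
        # Share of simulations whose combined score exceeded each line.
        "over_prob": {line: sum(t > line for t in totals) / n
                      for line in total_lines},
    }
```

The under probability at a half-run line is one minus the corresponding over probability.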

Elo Ratings

Pure Markov simulation captures individual matchups but misses team-level factors: momentum, bullpen depth, coaching, and intangibles. We supplement with an Elo rating system that updates after every game.
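For readers unfamiliar with Elo, the standard update looks like the sketch below. The 400-point scale is the conventional one; the K-factor default is a placeholder, and any home-field or margin-of-victory adjustment is omitted. None of these parameter values are claimed to match our production settings.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that team A beats team B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_ratings(rating_a, rating_b, a_won, k=4.0):
    """Return the new (rating_a, rating_b) after one game.

    k is a hypothetical K-factor; larger values react faster to results.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Because the update is zero-sum, the league-wide average rating stays constant from game to game.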

Our production model blends Markov and Elo predictions 50/50. Averaged over three seasons, this ensemble beats the Markov model alone by roughly two percentage points and narrowly edges out Elo alone:

3-season average (7,289 games):

  • Markov only: ~53.9%
  • Elo only: ~56.0%
  • Blend 50/50: ~56.1%
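The blend itself is a weighted average of the two win probabilities. The sketch below also shows one way such percentages could be scored, assuming they measure how often the predicted favorite won; that interpretation and both function names are assumptions.

```python
def blend(markov_prob, elo_prob, weight=0.5):
    """Weighted average of two home-win probabilities (0.5 = 50/50)."""
    return weight * markov_prob + (1.0 - weight) * elo_prob


def accuracy(home_win_probs, home_won):
    """Share of games in which the side given more than 50% actually won."""
    correct = sum((p > 0.5) == won
                  for p, won in zip(home_win_probs, home_won))
    return correct / len(home_win_probs)
```

For example, a Markov estimate of 0.60 and an Elo estimate of 0.50 blend to a 0.55 home-win probability.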

Data Sources

  • Retrosheet — Play-by-play data for MSI construction (2016-2025). 22,764 games, 2,513 batters, 2,450 pitchers.
  • MySportsFeeds — Daily schedules, lineups, and live game data during the season.

What's Next

  • 🔄 Recency weighting — Exponential decay for older seasons so recent performance matters more.
  • 🏟️ Park factors — Adjusting for ballpark-specific run environments (Coors Field isn't Petco Park).
  • 🌦️ Weather integration — Temperature, wind, and humidity affect ball flight and scoring.
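As a rough idea of what recency weighting could look like, the sketch below assigns each season a weight that halves every `half_life` seasons. Both the half-life parameterization and its default value are hypothetical; nothing here reflects a finalized design.

```python
def season_weights(seasons, half_life=3.0):
    """Exponential-decay weight per season; the latest season gets 1.0."""
    latest = max(seasons)
    return {s: 0.5 ** ((latest - s) / half_life) for s in seasons}
```

Each observed transition would then add its season's weight, rather than 1, to the count used when building a player's matrix.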