Deep Dive February 19, 2026

Everything We Tested (And What Actually Works)

We rebuilt DingerStats from scratch. Along the way, we tested every idea we could think of to squeeze more accuracy out of the model. Most of them failed. Here's the honest account.

The Baseline: 53.7%

Our core model is a 25-state Markov chain. Every batter and pitcher gets a 25×25 transition probability matrix (we call them MSIs — Markov Scoring Indices) built from 10 seasons of play-by-play data. To predict a game, we simulate it 10,000 times using the actual lineup's MSIs.
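
For concreteness, here's a minimal sketch of that simulation loop. It assumes the 25 states are the 24 base-out states plus an absorbing three-outs state, collapses the batter-by-batter lineup rotation into a single team-level MSI, and invents a runs_scored lookup for the runs credited on each transition; the production engine is more involved, but the shape is the same.

```python
import numpy as np

# Sketch of one simulated half-inning. Assumptions: states 0-23 are the 24
# base-out states (8 base configurations x 3 out counts), state 24 is the
# absorbing "three outs" state, msi is a 25x25 row-stochastic matrix, and
# runs_scored[i][j] holds the runs credited on an i -> j transition.
def simulate_half_inning(msi, runs_scored, rng):
    state, runs = 0, 0                      # start: bases empty, nobody out
    while state != 24:
        nxt = rng.choice(25, p=msi[state])
        runs += runs_scored[state][nxt]
        state = nxt
    return runs

def simulate_game(away_msi, home_msi, runs_scored, rng, innings=9):
    away = sum(simulate_half_inning(away_msi, runs_scored, rng) for _ in range(innings))
    home = sum(simulate_half_inning(home_msi, runs_scored, rng) for _ in range(innings))
    return away, home

# One prediction = 10,000 of these simulations:
# rng = np.random.default_rng(0)
# scores = [simulate_game(away_msi, home_msi, runs_scored, rng) for _ in range(10_000)]
```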

On the full 2023 season (2,430 games), the Markov chain alone picks the winner 53.7% of the time. That's better than a coin flip, but not by much.

The question: can we do better?

Attempt #1: Parameter Tuning 🔧

Result: ❌ No improvement

The Markov model has several parameters: anchor weights (how much to pull players toward league average), pitcher matchup weight, and home advantage. We ran a 59-configuration parameter sweep.
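
Conceptually the sweep is just a grid search over those parameters. A sketch, where the grid values and the evaluate callback are placeholders rather than our actual configuration set:

```python
import itertools

# Placeholder grid; the real sweep covered 59 configurations.
GRID = {
    "anchor_weight": [0.1, 0.2, 0.3],
    "pitcher_matchup_weight": [0.3, 0.5, 0.7],
    "home_advantage": [0.02, 0.04, 0.06],
}

def sweep(evaluate):
    """Return the configuration with the best backtest accuracy.

    `evaluate(config)` is assumed to run the Markov model with that config
    over a fixed sample of games and return win-prediction accuracy.
    """
    configs = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
    return max(configs, key=evaluate)
```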

On a 200-game sample, the best configuration hit 58.5%. We were ecstatic — until we validated on the full 2,430 games.

200-game sample: 58.5% ✨

Full 2023 validation: 53.7% 💀

Every single "winning" configuration regressed to the baseline. The model's architecture matters more than its parameters.

Lesson: Small samples lie. Always validate on full seasons. Baseball has too much variance for 200-game tests to mean anything.

Attempt #2: Elo Ratings 📊

Result: ✅ Major improvement

Instead of simulating individual at-bats, Elo assigns each team a dynamic rating that updates after every game. It captures things the Markov chain can't: defense, baserunning, coaching quality, bullpen depth, roster construction, and momentum.

Markov alone: 53.7%

Elo alone: 56.1%

50/50 Blend: 56.5%

The Elo system alone outperforms Markov by over 2 percentage points. Blending them 50/50 captures the best of both — Elo's team-level signal plus Markov's matchup-specific granularity.
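
The Elo machinery itself is the textbook version: a logistic expected score, a post-game rating update, and a simple average of the two systems' probabilities. In the sketch below, the K-factor is a placeholder rather than our tuned value, and home-field adjustment is omitted:

```python
def expected_score(rating_a, rating_b):
    # Probability team A beats team B under the standard logistic Elo curve
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=4.0):
    # k=4.0 is a placeholder value, not the K-factor we actually use
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b - k * (s_a - e_a)

def blended_home_win_prob(elo_prob, markov_prob, elo_weight=0.5):
    # The 50/50 blend reported above
    return elo_weight * elo_prob + (1.0 - elo_weight) * markov_prob
```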

Lesson: Team strength matters more than individual matchups for picking winners. The best hitter in baseball still makes an out 70% of the time.

Attempt #3: Bullpen Modeling 🐂

Result: ❌ No improvement

We built composite bullpen MSIs from 886,290 relief at-bats across all 30 teams. The idea: switch from starter to bullpen MSI in the 6th or 7th inning for more realistic late-game simulation.
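
In the simulation loop the change is small: pick which MSI to sample from based on the inning. A sketch that reuses simulate_half_inning from the baseline sketch above, with switch_inning standing in for whatever cutoff we test:

```python
def simulate_game_with_bullpen(starter_msi, bullpen_msi, runs_scored, rng,
                               innings=9, switch_inning=7):
    # Sample from the starter's MSI early, the composite bullpen MSI late
    total = 0
    for inning in range(1, innings + 1):
        msi = starter_msi if inning < switch_inning else bullpen_msi
        total += simulate_half_inning(msi, runs_scored, rng)
    return total
```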

Bullpen-adjusted Markov accuracy: 53.0% on a 500-game backtest. Worse than baseline.

Lesson: Composite bullpen MSIs average over too many different relievers and situations. A team's "average reliever" isn't who they actually use in high-leverage spots.

Attempt #4: Park Factors 🏟️

Result: ⚠️ Inconsistent

We calculated park factors from 7,292 games (2023-2025). The physics are real — Coors Field boosts scoring by 27%, T-Mobile Park suppresses it by 13%.
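
One standard way to compute a park factor is a home/road runs comparison. A simplified sketch, assuming plain lists of total runs per game and ignoring the schedule and interleague corrections a production calculation needs:

```python
def park_factor(home_game_run_totals, road_game_run_totals):
    # Runs per game at the park vs. the same team's road games
    home_rpg = sum(home_game_run_totals) / len(home_game_run_totals)
    road_rpg = sum(road_game_run_totals) / len(road_game_run_totals)
    return home_rpg / road_rpg

# A factor of 1.27 means roughly 27% more scoring than the team sees on the road.
```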

2023 backtest: +0.5% improvement

2024 backtest: -0.9% to -1.4% worse

Individual parks showed dramatic swings — Chicago gained 13.6% accuracy but the Mets' park lost 11.1%. The gains in one season vanished or reversed in the next.

Lesson: Park factors affect scoring (useful for over/under predictions) but don't reliably improve win prediction. The home team's park advantage is already captured by Elo's home-field factor.

Attempt #5: Recency Weighting ⏱️

Result: ❌ Made things worse

Intuition says a player's 2024 performance should matter more than their 2017 stats. We implemented exponential decay with configurable half-lives: 1, 2, 3, and 4 seasons.
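
The weighting itself is plain exponential decay: an at-bat's weight halves with every half-life of age. A sketch; effective_sample_size is included because the sum of the weights is what ends up mattering, as discussed below:

```python
def recency_weight(seasons_ago, half_life_years):
    # Exponential decay: an at-bat from `half_life_years` seasons ago counts half as much
    return 0.5 ** (seasons_ago / half_life_years)

def effective_sample_size(ab_counts_by_age, half_life_years):
    # Sum of the weights: how many "full-strength" at-bats the weighted data is worth
    return sum(recency_weight(age, half_life_years) * n
               for age, n in enumerate(ab_counts_by_age))
```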

A 50-game test showed "60% accuracy" with a 2-year half-life. We got excited. Then we ran the full 2,430-game validation:

Half-life 1.0 yr: 51.7% (worse)

Half-life 2.0 yr: 50.4% (much worse)

No decay (baseline): 53.7%

Recency weighting actively hurts the model. Why? Markov transition matrices need volume. A 25×25 matrix has 625 cells — you need thousands of at-bats to fill it reliably. Downweighting older data shrinks your effective sample size and makes the matrices noisier.

Lesson: More data beats recent data for transition probability estimation. The base-out state structure of baseball is stable enough that 2017 at-bats are still informative in 2023.

The Variance Problem 📉

The deepest issue with the Markov chain isn't accuracy — it's variance compression. The model predicts approximately 9 total runs for every single game. The standard deviation of predicted scores is 0.45 runs. In reality, it's 4.58 runs.

Predicted Score StDev: 0.45 runs

Actual Score StDev: 4.58 runs

Compression Ratio: 10:1

This happens because averaging 10,000 simulations converges on the expected value. Individual simulations have plenty of variance — 4-1 blowouts, 12-9 slugfests, 1-0 pitchers' duels — but the mean of 10,000 simulated games always lands near 4.5 runs per side.

This is actually correct math — the expected value IS about 4.5 runs per team. But it means the Markov chain's edge comes from the simulation distribution, not the mean. That's why we use median scores, percentile-based over/under calls, and full probability distributions instead of point estimates.
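
A toy illustration of the compression, using a Poisson run distribution as a stand-in for the simulation engine (purely an assumption for illustration; the real engine samples base-out transitions). Two matchups with slightly different run environments produce nearly identical point estimates, while each matchup's simulation distribution carries the real spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical matchups with slightly different expected run environments
for per_team_mean in (4.3, 4.7):
    totals = rng.poisson(per_team_mean, size=(10_000, 2)).sum(axis=1)
    print(f"mean total: {totals.mean():5.2f}   "
          f"per-sim std: {totals.std():4.2f}   "
          f"p25/p50/p75: {np.percentile(totals, [25, 50, 75])}")

# The point estimates cluster near 9 for both matchups; the spread lives in the
# simulation distribution, which is why we report medians and percentiles.
```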

What We Actually Learned

1. Baseball is fundamentally hard to predict. The theoretical ceiling is probably 58-62%. Even Vegas doesn't do much better than 57-58%. Our 56.5% blend is competitive.

2. Elo does the heavy lifting for win prediction. It captures team-level factors that per-player simulations miss. The Markov chain's real value is score distributions and prop-level predictions.

3. Small samples are the enemy. Every "breakthrough" on 50-200 games disappeared at full scale. Test on 2,000+ games or don't bother.

4. Volume trumps recency for Markov chains. Transition matrices need data density. Ten seasons of at-bats produce better matrices than two weighted seasons.

5. Physics (park factors) is real but noisy for prediction. Coors Field increases scoring 27% — that's a fact. But knowing that doesn't reliably tell you who wins.

What's Next

We're not done optimizing. Still on the roadmap:

  • Weather integration — Temperature, wind, and humidity affect ball flight. Needs game-time data.
  • Platoon splits — Separate MSIs for left/right-handed matchups. We built the handedness database; testing is next.
  • Starting pitcher Elo — A pitcher-specific Elo layer could capture ace vs. 5th starter differentiation.
  • Calibration tuning — Better mapping of simulation probabilities to real-world outcomes.

But we're also realistic: we're probably within 1-2% of our architectural ceiling. The bigger opportunity is in how we present the predictions — score distributions, confidence levels, trend analysis — not just "Team A wins."

The Numbers

Every claim in this article is backed by real backtests. Here are the receipts.

Approach                   Games   Sims/Game   Markov   Blend
Baseline                   2,430   10K         53.7%    56.5%
Parameter Tuning (best)    2,430   10K         53.7%
Bullpen Modeling           500     5K          53.0%
Park Factors (2023)        2,430   10K         54.2%
Park Factors (2024)        2,429   10K         52.3%
Recency (1yr half-life)    2,430   5K          51.7%    53.3%
Recency (2yr half-life)    2,430   5K          50.4%    53.9%