Baseball Logic
The Model Layer
How I structured the baseball logic underneath AiBS, what each modeling layer is responsible for, and why different outputs carry different levels of confidence.
By Colby Reichenbach
I built the model layer so the product could make baseball claims that are measurable, reviewable, and clear about their limits.
AiBS does not depend on one oversized model trying to explain everything about ABS. That approach makes it harder to understand what the system is actually doing, and it makes it easier to overstate what the outputs mean. The model layer is a stack of narrower pieces with different jobs.
That stack includes count-state value, run expectancy, win expectancy, called-pitch geometry, overturn probability, challenge evaluation, leverage, rubric layers, and controversy ranking. Some are empirical value models. Some are probabilistic. Some exist to help with interpretation and communication. They should not all be described the same way.
The strongest part of the current system is not a single metric. It is the combination of warehouse-first data handling, split-aware train and test control, held-out audits, model cards, and publication boundaries that reflect the actual strength of each layer.
Stack architecture
ABS is not one modeling problem, so I did not try to solve it with one model.
ABS looks simple when the public conversation reduces it to one question: was the pitch a strike or a ball? Once a challenge happens, the picture changes. The count can change, the base-out state matters, the inning and score matter, the number of remaining challenges matters, and the value of a state change depends on the full baseball context around the call.
The stack reflects that structure. Count-state value handles one job. Run expectancy handles another. Win expectancy handles another. Geometry and overturn probability address a different part of the chain. Challenge evaluation depends on how those upstream pieces interact. Leverage orders the pressure. Rubrics translate patterns into readable categories. Controversy ranks events for editorial surfaces.
Core value stack
The strongest work is in count-state value, run expectancy, and win expectancy.
These layers received the most disciplined rebuild. They sit on top of warehouse-first governance, split-aware train, validation, and test control, held-out audits, and model-card documentation. This is the part of the current system I would be most comfortable defending in front of a technical audience.
Count-state is the cleanest foundation because challenge analysis depends heavily on what the count means before and after a call changes. Run expectancy estimates expected runs to inning end from exact baseball state, so call reversals translate into run value. Win expectancy estimates batting-team win probability from the full game state, putting challenge consequences in terms of the game itself.
The RE state definition is RE(inning_bucket, outs, bases_state, count). The WE state definition is WE(inning, half_inning, outs, bases_state, score_diff_bucket, count). Those inputs capture the baseball context that determines how much any single call actually matters.
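One way to picture the two state definitions is as hashable keys into lookup tables. The sketch below is illustrative only: the class names, field types, bucket labels, base encoding, and the helper `re_delta` are assumptions made for the example, not the AiBS schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class REState:
    """Key for RE(inning_bucket, outs, bases_state, count)."""
    inning_bucket: str        # hypothetical label, e.g. "1-3", "4-6", "7+"
    outs: int                 # 0, 1, or 2
    bases_state: str          # assumed encoding, e.g. "100" = runner on first only
    count: Tuple[int, int]    # (balls, strikes)


@dataclass(frozen=True)
class WEState:
    """Key for WE(inning, half_inning, outs, bases_state, score_diff_bucket, count)."""
    inning: int
    half_inning: str          # "top" or "bottom"
    outs: int
    bases_state: str
    score_diff_bucket: str    # hypothetical label, e.g. "-1", "0", "+3 or more"
    count: Tuple[int, int]


def re_delta(table: Dict[REState, float], before: REState, after: REState) -> float:
    """Run value of a state change: RE(after) - RE(before)."""
    return table[after] - table[before]
```

Frozen dataclasses are hashable, so the same state always maps to the same table entry, and a call reversal becomes a difference between two lookups.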
- Count-state held-out wMAE: BA 1.74%, BB 0.36%, K 0.78%, POS 1.33%
- RE held-out test: MAE 0.105, RMSE 0.199, mean signed error 0.003
- RE test rows / states: 49,854 rows, 1,088 distinct states
- WE held-out test: MAE 15.5%, Brier 5.4%, log loss 0.4992
- WE MLB benchmark: 40,963 at-bats, mean gap 2.5%, median gap 2.2%
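For readers who want the metric definitions spelled out, here is a minimal sketch of the standard formulas behind MAE, RMSE, mean signed error, Brier score, and log loss. The function names are illustrative, and the sign convention for mean signed error (predicted minus actual) is an assumption; the audit's own convention and the weighting behind wMAE are not specified here.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def rmse(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))


def mean_signed_error(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Average of (predicted - actual); assumed sign convention."""
    return sum(p - a for p, a in zip(pred, actual)) / len(pred)


def brier(prob: Sequence[float], outcome: Sequence[int]) -> float:
    """Mean squared gap between a forecast probability and a 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(prob, outcome)) / len(prob)


def log_loss(prob: Sequence[float], outcome: Sequence[int], eps: float = 1e-15) -> float:
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(prob, outcome):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(prob)
```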
Geometry
Geometry now has one public contract, with diagnostics kept behind it.
The geometry layer matters because ABS begins with a strike-zone decision. Public baseball data alone cannot prove every operational detail of MLB's internal adjudication, so the product uses a single canonical public field: Savant edge distance when it is available, otherwise a radius-adjusted ABS edge calculation.
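The "one canonical field with a fallback" rule can be sketched in a few lines. This is an illustration under assumptions: the function name, the source labels, and the idea that both inputs share the same unit and sign convention are hypothetical, and the radius-adjusted value is taken as already computed elsewhere.

```python
import math
from typing import Optional, Tuple


def canonical_abs_margin(
    savant_edge_distance: Optional[float],
    radius_adjusted_edge: float,
) -> Tuple[float, str]:
    """Return (margin, source) for the single public geometry field.

    Prefers the Savant edge distance when it is present; otherwise falls
    back to the radius-adjusted ABS edge value. The source label is kept
    so diagnostics can tell which path produced the number.
    """
    if savant_edge_distance is not None and not math.isnan(savant_edge_distance):
        return savant_edge_distance, "savant"
    return radius_adjusted_edge, "radius_adjusted"
```

Returning the source alongside the margin keeps the public contract to a single number while still letting internal checks see which path was used.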
Center-only and raw radius-adjusted variants are still retained as internal diagnostics. They are useful for validation, but they should not leak into product language as competing public truths. The user-facing strike-zone and challenge-value surfaces should all speak through the canonical ABS margin.
Direction also matters independently. A called strike flipping to a ball is not the same baseball event as a called ball flipping to a strike. They move the count in opposite directions, create different state sequences, and produce different downstream value. The geometry layer is tied to challenge direction because the baseball consequences are direction-aware.
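The direction asymmetry is easy to see when the count transitions are written out. The sketch below uses hypothetical function names and plain (balls, strikes) tuples; it only covers called balls and called strikes, and it is not the product's transition code.

```python
from typing import Optional, Tuple

Count = Tuple[int, int]  # (balls, strikes) before the pitch


def count_after_call(count: Count, call: str) -> Count:
    """Apply a called ball or called strike to the pre-pitch count."""
    balls, strikes = count
    if call == "ball":
        return (balls + 1, strikes)
    if call == "strike":
        return (balls, strikes + 1)
    raise ValueError(f"unknown call: {call!r}")


def terminal_flag(count: Count) -> Optional[str]:
    """Label counts that end the plate appearance."""
    balls, strikes = count
    if balls >= 4:
        return "walk"
    if strikes >= 3:
        return "strikeout"
    return None


def overturn_transition(count_before: Count, original_call: str) -> Tuple[Count, Count]:
    """Return (count if the call stands, count if the call is overturned)."""
    upheld = count_after_call(count_before, original_call)
    flipped = "ball" if original_call == "strike" else "strike"
    overturned = count_after_call(count_before, flipped)
    return upheld, overturned


# A called strike on 3-2 ends the at-bat as a strikeout if it stands,
# and as a walk if it is overturned.
upheld, overturned = overturn_transition((3, 2), "strike")
```

The same pitch location therefore leads to two very different state sequences depending on which way the original call went.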
Overturn probability
Overturn probability is a real part of the stack, but I keep it in the product with qualified language.
Overturn probability addresses a straightforward question in the ABS challenge system: given the pitch location and challenge direction, how likely is the call to be overturned? The current model uses a tiered fallback structure. If an exact match exists for the challenge direction and edge bucket, it uses that. If not, it falls back to direction-only, then to a global baseline.
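The tiered fallback described above can be sketched as a chain of lookups. The table shapes, key names, tier labels, and every rate in the example are placeholders chosen for illustration; they are not the fitted values or the production data structures.

```python
from typing import Dict, Tuple


def overturn_probability(
    direction: str,
    edge_bucket: str,
    exact: Dict[Tuple[str, str], float],
    by_direction: Dict[str, float],
    global_rate: float,
) -> Tuple[float, str]:
    """Return (probability, tier) using exact -> direction-only -> global."""
    key = (direction, edge_bucket)
    if key in exact:
        return exact[key], "exact"
    if direction in by_direction:
        return by_direction[direction], "direction_only"
    return global_rate, "global"


# Placeholder tables, purely for illustration.
exact_rates = {("strike_to_ball", "0-1in"): 0.5}
direction_rates = {"strike_to_ball": 0.4}
baseline = 0.3

p, tier = overturn_probability(
    "strike_to_ball", "1-2in", exact_rates, direction_rates, baseline
)
# No exact match for that bucket, so the direction-only tier is used.
```

Returning the tier with the probability makes it possible to report how often the model is leaning on the coarser fallbacks.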
The validation path has improved substantially, but the layer is still geometry-sensitive and limited by sample size. The current audit covers 3,448 challenged rows, with 1,129 held out for test. The product default is canonical/radius-adjusted geometry: held-out Brier is 0.2556 and log loss is 0.7043. For scale, a constant 50% forecast on a binary outcome scores a Brier of 0.25 and a log loss of about 0.693, so these held-out figures sit close to that uninformative baseline. Center-only still slightly wins the tiny validation split, so the model remains qualified rather than declared settled.
That is enough to call the model promising and usable in context. It is not enough to describe it as final or club-grade.
Challenge evaluation
Challenge-now is much better engineered than it was, and I still will not oversell it.
The challenge-value decomposition is the most consequential layer in the stack. It combines all the upstream pieces into a single decision estimate: EV = P(overturn) × success_value + (1 - P(overturn)) × failed_challenge_value. The inventory cost is now charged only on the failed branch, where burning a challenge actually matters.
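As a worked illustration of the decomposition, the sketch below evaluates the formula directly. The function name, the way the failed branch is built from an inventory cost, and all numbers are assumptions for the example, not values from the model.

```python
def challenge_ev(
    p_overturn: float,
    success_value: float,
    failed_challenge_value: float,
) -> float:
    """EV = P(overturn) * success_value + (1 - P(overturn)) * failed_challenge_value."""
    if not 0.0 <= p_overturn <= 1.0:
        raise ValueError("p_overturn must be between 0 and 1")
    return p_overturn * success_value + (1.0 - p_overturn) * failed_challenge_value


# Illustrative numbers only. A failed challenge leaves the call unchanged
# (value 0.0 relative to the status quo) and pays the inventory cost, so the
# cost appears on the failed branch and nowhere else.
inventory_cost = 0.01                      # assumed cost of losing a challenge
failed_value = 0.0 - inventory_cost
ev = challenge_ev(p_overturn=0.5, success_value=0.04, failed_challenge_value=failed_value)
# ev is approximately 0.015: 0.5 * 0.04 + 0.5 * (-0.01)
```

Keeping the inventory term inside `failed_challenge_value` rather than subtracting it from the whole expression is what makes the cost conditional on the challenge actually failing.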
The inputs include exact base-out state, inning, score, count, challenge direction, canonical overturn probability, RE and WE value layers, terminal count-transition flags, and the inventory cost version. Terminal walk/strikeout branches are labeled and currently use the heuristic decision-value path until a dedicated post-plate-appearance (post-PA) terminal WE resolver is in place.
The current evidence does not support presenting it as org-grade live optimization. The held-out opportunity audit covers 43,297 opportunities and 1,129 challenged rows. Historical challenge share is 2.6%, while the current raw recommendation share is 10.9%. A stricter 1.0% threshold brings the validation challenge share to 3.2%, and a two-per-team-game budget envelope lands at 2.6%. That is much healthier, but it is still descriptive evidence rather than causal proof.
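To make the share comparisons concrete, here is a rough sketch of how a recommendation share could be computed with a threshold and with a per-team-game cap. It reflects one reading of "threshold" and "budget envelope": the tuple layout, the strict `>` comparison, and the cap logic are assumptions, not the audit's actual procedure.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (game_id, team, estimated challenge value); all fields are hypothetical.
Opportunity = Tuple[str, str, float]


def threshold_share(opps: Sequence[Opportunity], threshold: float) -> float:
    """Share of opportunities whose estimated value exceeds the threshold."""
    if not opps:
        return 0.0
    return sum(1 for _, _, ev in opps if ev > threshold) / len(opps)


def budgeted_share(opps: Sequence[Opportunity], threshold: float, budget: int = 2) -> float:
    """Same share, but counting at most `budget` recommendations per team-game."""
    if not opps:
        return 0.0
    per_team_game: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for game_id, team, ev in opps:
        if ev > threshold:
            per_team_game[(game_id, team)].append(ev)
    kept = sum(min(len(evs), budget) for evs in per_team_game.values())
    return kept / len(opps)


# Toy data, purely for illustration.
sample = [
    ("g1", "A", 0.02), ("g1", "A", 0.03), ("g1", "A", 0.015),
    ("g1", "B", 0.005), ("g2", "A", 0.0),
]
raw = threshold_share(sample, 0.01)      # 3 of 5 clear the threshold
capped = budgeted_share(sample, 0.01)    # team A in g1 is capped at 2
```

Comparing either share against the historical challenge rate is a descriptive sanity check on volume; it says nothing about whether the extra recommendations would have been good challenges.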
Live challenge-now is framed in the product as an experimental lens for discussion. Postgame challenge evaluation is substantially stronger and more credible for retrospective use. Letting one ambitious layer undermine the credibility of the rest of the stack is not a trade worth making.
Leverage, rubrics, and controversy
Some layers exist to order, describe, or translate. They are not predictive truth.
Leverage is useful because the product needs a pressure-ordering layer, but it is not win probability added under a different name. The current audit treats it as a heuristic pressure proxy. It shows a Pearson correlation of 0.190 to absolute WE swing, with higher mean swing in the high-leverage bucket (5.5%) than in the low-leverage bucket (2.2%). That justifies using it for ordering. It does not justify calling it a calibrated model.
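The two checks mentioned above, a Pearson correlation and a comparison of mean swing by bucket, can be sketched with the standard library. The function names, the cutoff that separates "high" from "low" leverage, and the toy numbers are assumptions; the audit's own bucket definitions are not reproduced here.

```python
import math
from typing import Dict, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def mean_swing_by_bucket(
    leverage: Sequence[float],
    abs_we_swing: Sequence[float],
    cutoff: float,
) -> Dict[str, Optional[float]]:
    """Mean absolute WE swing for leverage >= cutoff ("high") and below it ("low")."""
    high = [s for lev, s in zip(leverage, abs_we_swing) if lev >= cutoff]
    low = [s for lev, s in zip(leverage, abs_we_swing) if lev < cutoff]
    return {
        "high": sum(high) / len(high) if high else None,
        "low": sum(low) / len(low) if low else None,
    }


# Toy data, purely for illustration.
lev = [0.5, 1.5, 2.5, 0.2]
swing = [0.02, 0.04, 0.06, 0.02]
buckets = mean_swing_by_bucket(lev, swing, cutoff=1.0)
```

A weak positive correlation together with a clear gap between bucket means is the pattern that supports using a score for ordering without treating it as a calibrated estimate.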
Rubrics and controversy serve different purposes. Rubrics translate challenge patterns and outcomes into readable categories. Controversy ranks events for editorial surfaces. Both help the product communicate clearly, but neither should borrow the tone of the stronger quantitative layers. The audits keep them in descriptive and editorial territory, which is exactly where they belong.
Products lose trust when interpretive layers quietly sound like predictive ones.
Downstream summaries
Team and umpire summary pages are downstream of challenge-level state changes, not independent truths.
When the product shows average RE change, average WE change, high-leverage share, or similar summary metrics for teams, umpires, or event groups, those numbers are built from challenge-level state transitions. Aggregate pages inherit both the strengths and the limits of the layers underneath them.
That is especially relevant on umpire pages and smaller samples. A directional summary can still be useful, but it should not automatically become a reputational claim just because it is presented cleanly. The product stays explicit about thin samples and qualified reads rather than letting a clean layout imply more than the data supports.
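A minimal sketch of how such a summary could be rolled up from challenge-level rows while carrying an explicit thin-sample flag. The row keys, the function name, and the `min_n` cutoff of 30 are all assumptions for illustration; the product's actual sample thresholds are not stated here.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Sequence


def summarize(
    rows: Sequence[Dict[str, Any]],
    group_key: str,
    min_n: int = 30,  # assumed cutoff for flagging a thin sample
) -> Dict[Any, Dict[str, Any]]:
    """Aggregate challenge-level RE/WE changes by team, umpire, or other key."""
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    summary: Dict[Any, Dict[str, Any]] = {}
    for key, members in groups.items():
        summary[key] = {
            "n": len(members),
            "avg_re_change": mean(r["re_change"] for r in members),
            "avg_we_change": mean(r["we_change"] for r in members),
            "thin_sample": len(members) < min_n,
        }
    return summary


# Toy rows, purely for illustration.
rows = [
    {"umpire": "U1", "re_change": 0.25, "we_change": 0.5},
    {"umpire": "U1", "re_change": 0.75, "we_change": 1.5},
    {"umpire": "U2", "re_change": 0.5, "we_change": 1.0},
]
by_umpire = summarize(rows, "umpire", min_n=2)
```

Carrying `n` and `thin_sample` in the same record as the averages makes it harder for a display layer to show the number without its caveat.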
Current model status
Every layer has an explicit verdict from the dated audit process.
The model layer maintains a status for every layer, documented in model cards and a dated verdict file. The current standing is listed below.
- Count-state value: green. Held-out audited empirical baseline.
- Run expectancy: green. Held-out audited and split-governed.
- Win expectancy: green. Held-out audited and externally benchmarked against MLB public WE.
- Called-pitch geometry: yellow. Useful, but the geometry choice is still provisional.
- Overturn probability: yellow. Credible early model, still sample- and geometry-limited.
- Challenge-now: red for org-grade claims. Useful as an experimental live lens and for postgame review.
- Leverage: green as heuristic. Pressure proxy, not a calibrated model.
- Rubrics: green as descriptive. Translation layer, not predictive truth.
- Controversy: green as editorial. Ranking layer for editorial surfaces.
I want each layer to be used for what it actually is, not for what would sound best in a product description.
