Validation Workflow
Trust, Audits, and Model Monitoring
How I validate AiBS across models, data freshness, and the AI layer so the product stays honest about what is strong, what is provisional, and what should stay qualified.
By Colby Reichenbach
Trust comes from whether the system can show its work and admit its limits, not from polished pages or confident wording.
AiBS makes analytical claims about baseball. That means I need a way to separate what is well supported from what is still early. The trust layer exists so the product does not rely on vague confidence or a one-time check that looked good during development.
Trust in this system has to cover more than one dimension. A product can be wrong because a model is weak, because the data is stale, because a route is presenting something too aggressively, or because an AI response is overreaching. I wanted those different failure types to be identifiable and addressable separately rather than mixed into one vague review process.
The current trust layer rests on four things: held-out model evaluation with dated audit artifacts, data freshness and serving-state monitoring, AI generation observability, and explicit publication boundaries matched to the actual strength of each layer.
How audits work
Every audit leaves behind a runnable script, a dated report, a JSON artifact, and a clear recommendation.
The audit process is built to be repeatable rather than one-off. The repo contains a suite of audit scripts, each producing a dated markdown report, a dated JSON artifact, and a short written recommendation: no change, monitor, recalibrate, or rebuild. Reports and artifacts are stored in version control so any claim boundary can be traced back to the specific evidence that produced it.
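As a rough illustration of the report, artifact, and recommendation trio described above, here is a minimal sketch. The `AuditResult` class, the `write_audit_artifacts` function, and the file-naming scheme are hypothetical, not the repo's actual layout. Only the four recommendation values come from the process itself.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date
from pathlib import Path

# The four recommendations named in the audit process.
RECOMMENDATIONS = ("no change", "monitor", "recalibrate", "rebuild")


@dataclass
class AuditResult:
    """Hypothetical container for one audit run."""
    audit_name: str
    run_date: date
    metrics: dict
    recommendation: str

    def __post_init__(self):
        # Reject free-form verdicts so every audit ends in a known action.
        if self.recommendation not in RECOMMENDATIONS:
            raise ValueError(f"unknown recommendation: {self.recommendation!r}")


def write_audit_artifacts(result, out_dir):
    """Write a dated JSON artifact and a dated markdown report.

    Returns (markdown_path, json_path). The naming scheme
    <audit_name>_<YYYY-MM-DD> is an assumption made for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{result.audit_name}_{result.run_date.isoformat()}"

    payload = asdict(result)
    payload["run_date"] = result.run_date.isoformat()  # dates are not JSON-native
    json_path = out_dir / f"{stem}.json"
    json_path.write_text(json.dumps(payload, indent=2, sort_keys=True))

    lines = [
        f"# Audit: {result.audit_name}",
        f"Date: {payload['run_date']}",
        f"Recommendation: {result.recommendation}",
        "",
        "## Metrics",
    ]
    lines += [f"- {name}: {value}" for name, value in sorted(result.metrics.items())]
    md_path = out_dir / f"{stem}.md"
    md_path.write_text("\n".join(lines) + "\n")
    return md_path, json_path
```

Because both files share a dated stem, committing them together keeps the report and its machine-readable evidence traceable as a pair.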
The intended cadence runs at three levels. Post-refresh checks are designed to run after final games are available and flag material shifts in error, fallback usage, or benchmark gaps. Weekly reviews look at accumulated artifacts together, focusing on convergence, calibration drift, and distribution changes. Deeper audits are run after major data milestones: when the spring training sample expands significantly, after the first regular-season week, after the first full month, or whenever model logic changes.
The current suite covers 10 audit types: product QA, current-state validation, controversy ranking, decision-value composite, leverage benchmarking, run-expectancy (RE) benchmarking, win-expectancy (WE) benchmarking against MLB's public WE, overturn-probability calibration, rubric distribution, and zone-edge geometry.
What is strong vs. what stays qualified
Each layer has an explicit confidence boundary, and the audits are what set those boundaries.
The model layer maintains a traffic-light status for every component, documented in model cards and a dated verdict file. Count-state, RE, and WE are green: held-out audited and externally benchmarked. Called-pitch geometry and overturn probability are yellow: useful but still provisional. Challenge-now is red for org-grade claims: structurally improved but not ready to be presented as operational optimization. Leverage, rubrics, and controversy are green in their intended roles as heuristic, descriptive, and editorial layers.
The specific audit numbers behind those verdicts are in the Model Layer article. What matters here is that the verdicts are not based on feel. They come from the dated audit process and can move in either direction as the evidence develops. A layer can graduate from yellow to green, or a previously green layer can be downgraded if audit metrics start drifting.
Explicit classification by confidence level does more for credibility than polished wording ever could. A reader can look at the boundary and understand exactly what is and is not being claimed.
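One way to picture the traffic-light verdicts is as a small registry that routes can consult before wording a claim. The sketch below is an assumption about shape only: the layer keys, the qualifier labels, and both function names are hypothetical. The statuses themselves mirror the verdicts listed above.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# Statuses as described in this section; the key names are hypothetical.
LAYER_STATUS = {
    "count_state": Status.GREEN,
    "re": Status.GREEN,
    "we": Status.GREEN,
    "called_pitch_geometry": Status.YELLOW,
    "overturn_probability": Status.YELLOW,
    "challenge_now": Status.RED,
    "leverage": Status.GREEN,      # green in its heuristic role
    "rubrics": Status.GREEN,       # green in its descriptive role
    "controversy": Status.GREEN,   # green in its editorial role
}

# Hypothetical labels a route could attach to a claim.
QUALIFIER = {
    Status.GREEN: "supported",
    Status.YELLOW: "provisional",
    Status.RED: "withheld",
}


def qualifier_for(layer, registry=LAYER_STATUS):
    """Return the claim qualifier for a layer.

    Unknown layers fall back to the most restrictive status, so a new
    component cannot be published as strong by omission.
    """
    return QUALIFIER[registry.get(layer, Status.RED)]


def update_status(registry, layer, new_status, evidence_path):
    """Return a new registry with one status changed.

    A change in either direction must cite a dated audit artifact.
    """
    if not evidence_path:
        raise ValueError("a status change must cite a dated audit artifact")
    updated = dict(registry)
    updated[layer] = new_status
    return updated
```

Requiring an evidence path on every change is one way to keep upgrades and downgrades tied to the audit trail rather than to judgment alone.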
Data freshness
Trust also means the product knows when its own serving state is fresh and when it is not.
Model validation is only one dimension of trust. The product also has to trust its own data path. The polling workflow runs on a fixed ten-minute heartbeat with an Eastern Time (ET)-aware gate that decides whether real ingest work needs to happen. If the system has been stale for more than eight hours, it gets a bounded catch-up. If nothing relevant is on the schedule, the run exits quickly.
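The gate logic can be sketched roughly as follows. This is a simplified illustration, not the actual workflow: the function name, the action labels, and the idea of checking the current ET calendar date against a set of game dates are all assumptions. So is the choice to check staleness before the schedule. Only the eight-hour threshold and the ET awareness come from the description above.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

STALE_AFTER = timedelta(hours=8)  # staleness threshold from the description


def decide_action(now, last_success, game_dates_et, tz=None):
    """Decide what one heartbeat tick should do.

    now           -- timezone-aware datetime of the current tick
    last_success  -- aware datetime of the last successful ingest, or None
    game_dates_et -- set of datetime.date values (ET) with relevant games
    tz            -- tzinfo used for the ET conversion (defaults to New York)

    Returns "catch_up", "ingest", or "exit" (hypothetical labels).
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    tz = tz or ZoneInfo("America/New_York")

    # Assumption: staleness is checked first, so a long gap triggers the
    # bounded catch-up even when today's schedule is empty.
    if last_success is None or now - last_success > STALE_AFTER:
        return "catch_up"

    # Compare against the ET calendar date, not the UTC date, so a tick
    # just after midnight UTC still counts toward the same ET game day.
    if now.astimezone(tz).date() not in game_dates_et:
        return "exit"
    return "ingest"
```

A real gate would likely also handle games that run past midnight ET. This sketch ignores that case to keep the decision order visible.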
Live scoreboard serving reads from structured linescore state rather than depending on open-ended raw snapshot retention. Snapshot pruning keeps the serving environment from quietly accumulating archive-weight data. Those choices exist because stale or loosely shaped serving state can make the product look more certain than it should be.
Blurring the line between fresh structured state, stale operational state, and deeper archive material is a specific failure mode the data design is built to prevent.
AI observability
AI responses are reviewable because I cannot improve what I cannot trace.
The AI side of trust works differently from the model side, but it matters just as much. When the AI layer produces a response, the system persists the full conversation, individual messages, tool calls, safety events, cost events, usage ledger records, and generation metadata. That metadata includes surface, task family, prompt version, semantic tags, terminology bundle details, and structured output. The record makes it possible to inspect what actually happened rather than guess.
Knowing whether a weak response came from request routing, the selected surface, the prompt version, the terminology bundle, the available tool context, or the response structure is the difference between diagnosing a problem and replacing things at random.
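To make the diagnostic idea concrete, here is a minimal sketch of how persisted generation records could be grouped to see where weak responses cluster. The `GenerationRecord` fields are a reduced subset of the metadata listed above. The `flagged_weak` field and the `weak_counts_by` helper are hypothetical additions for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRecord:
    """Reduced, hypothetical view of one persisted AI generation."""
    conversation_id: str
    surface: str
    task_family: str
    prompt_version: str
    terminology_bundle: str
    semantic_tags: tuple[str, ...] = ()
    tool_calls: tuple[str, ...] = ()
    flagged_weak: bool = False  # assumed to be set during review


def weak_counts_by(records, attribute):
    """Count flagged responses grouped by one metadata attribute.

    Grouping by "prompt_version", "surface", or "terminology_bundle"
    shows whether weak responses concentrate in one place.
    """
    return Counter(getattr(r, attribute) for r in records if r.flagged_weak)
```

If most flagged responses share one prompt version while surfaces are evenly spread, the prompt is the first thing to inspect rather than the routing.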
The AI path also includes health checks that validate terminology seed completeness and run a test suite covering prompt behavior, telemetry correctness, and output structure. Prompt and response behavior can drift quietly while the product still appears to function. The tests exist to catch that drift.
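A terminology seed completeness check might look something like the sketch below. It assumes the seed is a mapping from term to definition text and that a list of required terms exists somewhere. Both the data shape and the `missing_terms` name are hypothetical.

```python
def missing_terms(required_terms, seed):
    """Return required terms that are absent from the seed or have an
    empty (whitespace-only) definition, in sorted order.

    required_terms -- iterable of term strings the AI layer depends on
    seed           -- dict mapping term -> definition text
    """
    return sorted(
        term for term in required_terms
        if not seed.get(term, "").strip()
    )
```

A health check could fail whenever this list is non-empty, which catches a partially seeded bundle before it reaches a prompt.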
Failure types
The product can be wrong in different ways, and each way needs its own detection path.
Trust in AiBS is organized around four failure types because they require different monitoring and different responses.
- Model trust: a layer is weak, miscalibrated, or overfit. Caught by held-out audits and dated artifacts.
- Data trust: serving state is stale, loosely shaped, or out of sync with the source. Caught by freshness monitoring and structured serving design.
- Product trust: a route presents information too aggressively or implies stronger confidence than the underlying layer supports. Caught by publication boundaries and claim-level review.
- AI trust: a generated response overreaches, drifts off-surface, or breaks wording consistency. Caught by generation records, health checks, and test suites.
I want the system to be able to tell me when its own confidence should go down.
