AI Systems
The AI Layer
How I designed the AI surfaces in AiBS, how context flows into each one, and how prompt behavior, usage, and generation details are tracked over time.
By Colby Reichenbach
I built the AI layer as application architecture, not as a chat box sitting on top of baseball data.
The AI layer in AiBS is not one assistant trying to handle every job. The product asks different kinds of questions, expects different response shapes, and has different failure modes depending on where the user is and what they are looking at. Three separate surfaces handle three separate responsibilities: copilot, chart insight, and visualizer.
Each surface receives different input, follows a different prompt contract, and returns a different kind of output. Copilot answers scoped baseball questions from server-side tool results. Chart insight explains one chart payload in a structured format. Visualizer returns a structured chart spec that the app can render, persist, and share. Keeping those jobs separate keeps response behavior close to what the user actually asked for.
The other half of the design is reviewability. I store more than the final response. Conversations, tool calls, safety events, cost records, usage ledger entries, prompt metadata, terminology selections, and generation details are all persisted so I can trace how any response was produced after the fact.
Surface contracts
Each surface has a distinct job, input shape, and output contract.
Treating every AI interaction as the same kind of request makes failures harder to diagnose. A chart explanation is a different job from a scoped baseball question, and a visual planning task is different again. Running all of those through one prompt and response path blurs behavior and makes the system harder to test.
Every inbound request declares its surface up front: copilot, chart_insight, or visualizer. From there, the server routes into a narrower path with its own prompt builder, surface runner, and output contract.
Copilot is the broadest surface but still bounded. It works from scoped context and server-side tool results. Chart insight only runs when the system has a structured chart payload, and it returns a structured interpretation, not loose prose. Visualizer returns a structured chart spec with axes, grouping, filters, signals, and caveats, and the product can persist that output into a shareable chart artifact. Planning a view is a separate product job from explaining one that already exists.
- Copilot: scoped baseball question answering from current tool results
- Chart insight: structured explanation of one chart payload
- Visualizer: structured chart-planning output for a baseball question
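A minimal sketch of the surface routing described above, assuming a plain dictionary request body. The three surface names come from the text; `AIRequest`, `route`, the runner functions, and the shapes they return are hypothetical stand-ins, not the actual AiBS code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional


class Surface(str, Enum):
    COPILOT = "copilot"
    CHART_INSIGHT = "chart_insight"
    VISUALIZER = "visualizer"


@dataclass
class AIRequest:
    surface: Surface
    message: str
    # Only chart insight requires a payload (assumed shape: a plain dict).
    chart_payload: Optional[Dict[str, Any]] = None


def run_copilot(req: AIRequest) -> Dict[str, Any]:
    # Placeholder: a real runner would answer from server-side tool results.
    return {"kind": "answer", "text": f"(answer to: {req.message})"}


def run_chart_insight(req: AIRequest) -> Dict[str, Any]:
    # Chart insight refuses to run without a structured chart payload.
    if req.chart_payload is None:
        raise ValueError("chart_insight requires a structured chart payload")
    return {"kind": "interpretation", "chart_type": req.chart_payload.get("type")}


def run_visualizer(req: AIRequest) -> Dict[str, Any]:
    # Placeholder chart spec with the fields named in the text.
    return {
        "kind": "chart_spec",
        "axes": {},
        "grouping": None,
        "filters": [],
        "signals": [],
        "caveats": [],
    }


RUNNERS: Dict[Surface, Callable[[AIRequest], Dict[str, Any]]] = {
    Surface.COPILOT: run_copilot,
    Surface.CHART_INSIGHT: run_chart_insight,
    Surface.VISUALIZER: run_visualizer,
}


def route(raw: Dict[str, Any]) -> Dict[str, Any]:
    # Surface(...) raises ValueError for any undeclared surface name.
    surface = Surface(raw["surface"])
    req = AIRequest(
        surface=surface,
        message=raw.get("message", ""),
        chart_payload=raw.get("chart_payload"),
    )
    return RUNNERS[surface](req)
```

Because the surface is resolved before anything else runs, an unknown surface or a chart insight request without a payload fails early instead of falling through to a generic prompt.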
Task routing
A task-family layer shapes responses by the kind of question, not just the surface name.
Surface type alone does not capture enough context. Within copilot, a game-summary question needs different treatment from an umpire profile explanation. A zone-map chart is different from an inventory deployment chart. A comparison plan differs from a timing-focused visual plan.
The task-family resolver classifies each request into a narrower family based on surface, route scope, chart type when present, and message heuristics. Copilot resolves into families like game summary, ABS explanation, anomaly diagnosis, team profile, umpire profile, and comparison. Chart insight resolves by chart type: decision brief, value timeline, umpire rhythm, zone map, scenario matrix, and inventory deployment. Visualizer resolves into question-to-visual, comparison, and timing-and-leverage plans. That family carries into prompt construction and execution telemetry.
Because the routing rules are explicit in code, they are easy to test. When a surface has a clear input and a clear output contract, I can validate whether it is behaving correctly without guessing at what it was supposed to do.
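The resolver can be pictured as a small pure function. The sketch below is illustrative only: the family names follow the text, but the keyword heuristics, the scope fallback, and the function name `resolve_task_family` are assumptions made for the example.

```python
import re
from typing import Optional

# Chart insight families named in the text, keyed by chart type.
CHART_FAMILIES = {
    "decision_brief",
    "value_timeline",
    "umpire_rhythm",
    "zone_map",
    "scenario_matrix",
    "inventory_deployment",
}

# Hypothetical mapping from route scope to a default copilot family.
SCOPE_DEFAULTS = {
    "game": "game_summary",
    "team": "team_profile",
    "umpire": "umpire_profile",
}


def resolve_task_family(
    surface: str,
    route_scope: str,
    message: str,
    chart_type: Optional[str] = None,
) -> str:
    # Tokenize into whole words so "when" never matches inside another word.
    words = set(re.findall(r"[a-z]+", message.lower()))

    if surface == "chart_insight":
        if chart_type not in CHART_FAMILIES:
            raise ValueError(f"unsupported chart type: {chart_type!r}")
        return chart_type

    if surface == "visualizer":
        if words & {"compare", "vs", "versus"}:
            return "comparison"
        if words & {"inning", "innings", "leverage", "timing", "late"}:
            return "timing_and_leverage"
        return "question_to_visual"

    if surface == "copilot":
        if words & {"compare", "vs", "versus"}:
            return "comparison"
        if words & {"why", "unusual", "odd"}:
            return "anomaly_diagnosis"
        if "abs" in words:
            return "abs_explanation"
        # Assumed fallback: route scope decides, then ABS explanation.
        return SCOPE_DEFAULTS.get(route_scope, "abs_explanation")

    raise ValueError(f"unknown surface: {surface!r}")
```

Since the function has no side effects, each routing rule can be pinned with a one-line test, for example asserting that a chart insight request with chart type `zone_map` resolves to the `zone_map` family.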
Prompt construction
Prompts are built from smaller versioned parts, not from one large file.
Each surface has its own prompt builder. Copilot prompts are structured around answering directly with evidence. Chart insight prompts follow a structured interpretation path. Visualizer prompts return a plan. Those differences live in separate files rather than being squeezed through one generic template.
Prompt changes are easier to track because the system records which prompt version a generation used. When response behavior shifts, I can trace it back to the specific prompt definition that produced it rather than treating prompt text as invisible glue between the request and the response.
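One way to make prompt versions traceable is to derive the recorded version from the parts that went into the prompt. The following is a sketch under that assumption; `PromptPart`, `BuiltPrompt`, `build_prompt`, and the `name@version` format are hypothetical and may differ from how AiBS labels its prompts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PromptPart:
    name: str
    version: str
    text: str


@dataclass(frozen=True)
class BuiltPrompt:
    text: str
    prompt_version: str  # stored with the generation for later tracing


def build_prompt(parts: Sequence[PromptPart], appendix: str = "") -> BuiltPrompt:
    # Join part texts in order; the optional appendix goes last.
    body = "\n\n".join(part.text for part in parts)
    if appendix:
        body = f"{body}\n\n{appendix}"
    # The version string names every part and its version, in order.
    version = "+".join(f"{part.name}@{part.version}" for part in parts)
    return BuiltPrompt(text=body, prompt_version=version)


# Example usage with made-up parts.
copilot_prompt = build_prompt(
    [
        PromptPart("copilot_base", "3", "Answer directly and cite the tool results."),
        PromptPart("evidence_rules", "1", "Do not state numbers absent from the context."),
    ],
    appendix="Terminology: use 'challenge' for an ABS challenge.",
)
```

With this shape, bumping the version of a single part changes the recorded `prompt_version`, so a behavior shift can be matched to the part that changed.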
Terminology
A file-backed terminology system keeps wording consistent without bloating the prompt.
I wanted the model to use ABS language consistently, but loading every request with a giant reference block tends to make responses worse. The current approach is more targeted.
Seed files for terminology cards, style packs, and surface rules live in the repository. At runtime, the server derives semantic tags from the request based on task family, message content, and chart type. It selects a bounded set of matching entries, compiles them into a short appendix, and injects that appendix into the prompt.
The selection is fully deterministic. Given the same surface, audience mode, task family, and request tags, the system picks the same wording guidance every time.
- Terminology cards: 61 (verified from seed file)
- Style packs: 6
- Surface rules: 16
- Selection method: deterministic, derived from task family, message content, and chart type
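Deterministic, bounded selection can be sketched as a score-and-sort over tagged cards. This is an illustration, not the AiBS selector: `TermCard`, the overlap scoring, the slug tie-break, and the cap of three cards are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class TermCard:
    slug: str
    tags: FrozenSet[str]
    text: str


def select_cards(
    cards: Iterable[TermCard],
    request_tags: FrozenSet[str],
    max_cards: int = 3,
) -> List[TermCard]:
    # Score each card by how many of its tags match the request tags.
    scored = [(len(card.tags & request_tags), card) for card in cards]
    matching = [(score, card) for score, card in scored if score > 0]
    # Sort by score descending, then slug ascending, so ties never depend
    # on input order. No randomness is involved anywhere.
    matching.sort(key=lambda pair: (-pair[0], pair[1].slug))
    return [card for _, card in matching[:max_cards]]


def compile_appendix(cards: Iterable[TermCard]) -> str:
    # One short line per selected card keeps the appendix small.
    return "\n".join(f"- {card.slug}: {card.text}" for card in cards)
```

The explicit tie-break is what makes the output repeatable: the same cards and the same request tags yield the same appendix regardless of the order in which the seed files were loaded.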
Context and controls
Context passing, usage limits, and request controls are wired into the AI system from the start.
Each surface gets the right amount of context scoped to its job. Chart insight gets a structured chart payload. Copilot gets route-aware tool results scoped to game, team, umpire, or global context. Visualizer works from the current scoped context rather than raw database access.
The request path is also wired into broader application controls: CSRF verification, user authentication, usage limits, request queueing, misuse detection, and rate-limit enforcement. Safety events and rate-limit events are recorded when they fire. That infrastructure shapes how the system behaves under real use, not just how a prompt reads in isolation.
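As a rough picture of how such controls can sit in front of a generation, the sketch below runs a request through an ordered list of gates and records an event when one rejects it. Only three of the controls are modeled, and the gate order, field names, and event shape are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class RequestContext:
    csrf_ok: bool
    user_id: Optional[str]
    requests_used: int
    daily_limit: int


# A gate returns None to admit the request, or a rejection reason.
Gate = Callable[[RequestContext], Optional[str]]


def check_csrf(ctx: RequestContext) -> Optional[str]:
    return None if ctx.csrf_ok else "csrf_failed"


def check_auth(ctx: RequestContext) -> Optional[str]:
    return None if ctx.user_id else "unauthenticated"


def check_usage(ctx: RequestContext) -> Optional[str]:
    return None if ctx.requests_used < ctx.daily_limit else "usage_limit_reached"


# Assumed order: cheapest and most fundamental checks first.
GATES: List[Gate] = [check_csrf, check_auth, check_usage]


def admit(ctx: RequestContext, events: List[Dict[str, Any]]) -> bool:
    for gate in GATES:
        reason = gate(ctx)
        if reason is not None:
            # Record the event so the rejection is reviewable later.
            events.append({"type": reason, "user_id": ctx.user_id})
            return False
    return True
```

The first failing gate stops the chain, so a request that fails authentication never consumes usage budget or reaches the model.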
Observability
I record more than the answer because the answer alone is not enough to debug or improve the system.
When a generation is stored, the telemetry record includes surface, task family, prompt version, semantic tags, terminology bundle details (style pack, card slugs, appendix character count), estimated prompt size, and structured output where relevant. That gives me a much clearer picture of system behavior than a message log alone.
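The fields listed above map naturally onto a single record type. The sketch below is a hypothetical shape, not the stored schema: the field names are guesses based on the text, and the four-characters-per-token estimate is a rough heuristic chosen for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class GenerationTelemetry:
    surface: str
    task_family: str
    prompt_version: str
    semantic_tags: List[str]
    style_pack: str
    card_slugs: List[str]
    appendix_chars: int
    estimated_prompt_tokens: int
    structured_output: Optional[Dict[str, Any]] = None


def estimate_tokens(prompt_text: str) -> int:
    # Assumption: roughly four characters per token. Not a real tokenizer.
    return max(1, len(prompt_text) // 4)


def build_telemetry(
    surface: str,
    task_family: str,
    prompt_version: str,
    semantic_tags: List[str],
    style_pack: str,
    card_slugs: List[str],
    prompt_text: str,
    appendix: str,
    structured_output: Optional[Dict[str, Any]] = None,
) -> GenerationTelemetry:
    return GenerationTelemetry(
        surface=surface,
        task_family=task_family,
        prompt_version=prompt_version,
        semantic_tags=sorted(semantic_tags),  # stable order for diffing
        style_pack=style_pack,
        card_slugs=list(card_slugs),
        appendix_chars=len(appendix),
        estimated_prompt_tokens=estimate_tokens(prompt_text),
        structured_output=structured_output,
    )


def to_json(record: GenerationTelemetry) -> str:
    # asdict converts the dataclass into plain dicts and lists for storage.
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the record flat and serializable makes it straightforward to filter generations by task family or prompt version when a behavior change needs investigating.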
The AI path also includes a health-check script that validates terminology seed completeness and runs a test and eval suite covering prompt behavior, telemetry correctness, terminology handling, chart insight output structure, and visualizer output structure. Prompt and response behavior can drift quietly while the product still looks like it is working: a surface starts answering the wrong kind of question, or the response shape wanders away from the contract. The tests exist to catch those shifts before anyone makes decisions based on bad output.
A narrower surface is a more testable surface. Once the input and output contracts are well defined, checking for good behavior is specific rather than approximate.
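A contract check for the visualizer output might look like the sketch below. The required keys follow the chart spec fields named earlier (axes, grouping, filters, signals, caveats); the expected types and the function name `validate_chart_spec` are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple, Type, Union

# Assumed types for each required field of a visualizer chart spec.
REQUIRED_SPEC_KEYS: Dict[str, Union[Type, Tuple[Type, ...]]] = {
    "axes": dict,
    "grouping": (str, type(None)),
    "filters": list,
    "signals": list,
    "caveats": list,
}


def validate_chart_spec(spec: Any) -> List[str]:
    """Return a list of contract problems; an empty list means the spec passes."""
    if not isinstance(spec, dict):
        return ["spec is not an object"]
    problems: List[str] = []
    for key, expected in REQUIRED_SPEC_KEYS.items():
        if key not in spec:
            problems.append(f"missing key: {key}")
        elif not isinstance(spec[key], expected):
            problems.append(f"wrong type for key: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one lets an eval run report every way a response has drifted from the contract in a single pass.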
I built the AI layer to behave like reviewed software, not like a black box that happens to return text.
