Data Engineering
The Data Layer
How I ingest baseball data, how I separate warehouse and serving responsibilities, and why the application reads from structured serving state rather than raw feed output.
ByColby Reichenbach
I built the data layer to turn public baseball feeds into stable product truth that the application can serve.
AiBS does not depend on raw baseball feeds arriving in exactly the shape the product needs. That assumption usually leads to one of two bad outcomes: the app becomes tightly coupled to source payloads that were never designed for a product, or every route starts doing too much transformation work at request time.
The data layer is built around a split instead. A warehouse path handles heavier ingest, historical working sets, and model development. A serving path holds the compact, structured state the product needs to render live pages, summaries, reports, and analytics. The database and ETL design reflect the full scope of what AiBS is: not just a stats viewer, but a system carrying baseball state, product state, AI state, editorial state, community state, and operational records all at once.
Pipeline architecture
Data flows through a defined sequence from public feeds to serving state.
The ingest pipeline pulls from MLB live game feeds and related historical baseball sources, including warehouse-side Statcast and Savant material. A Python ETL layer handles schedule data, live game feeds, play-by-play events, pitch-level detail, Statcast pitch history, and Savant ABS gamefeed data. The data is normalized, keyed to internal identifiers, and written to the database. The warehouse path holds raw snapshots and historical working sets. The serving path holds compact, product-shaped state that the Next.js application reads directly.
The ETL layer lives in Python rather than inside the application server. Ingest work and serving work have different lifecycles, different failure modes, and different schedules. Keeping them separate means a slow backfill job does not touch the application's request path, and a frontend deployment does not require the ETL to redeploy alongside it.
Warehouse and serving
The two sides of the data layer solve different problems.
The warehouse side handles heavier ingest, historical backfills, model development, and audit work. Raw baseball inputs, historical pitch-level data, and model-oriented tables live there without forcing the public-facing app to carry that weight in its normal serving path.
The serving side handles page-facing baseball state, product metadata, editorial records, community data, AI records, and operational bookkeeping. This is the data the routes actually read to render pages, along with the broader application data that makes AiBS more than a baseball feed viewer.
The product-facing system stays tighter because of that separation. The app does not need the full historical working set on every deployed path. It needs structured outputs, summaries, and compact serving tables that support the pages and workflows people actually use. In the current serving environment, that work is handled through SQL-first helpers and serving tables rather than a dedicated mart schema.
Database shape
The schema is organized by responsibility, not by baseball alone.
The baseball core includes tables for teams, games, officials, at-bats, play events, pitches, ABS challenges, and game-state snapshots. Those carry the product itself.
The database also serves the rest of the application. Product and identity tables handle users, profiles, roles, and organizations. Editorial tables track articles, revisions, generation runs, and contributors. AI tables store conversations, messages, tool calls, safety events, cost events, usage ledger entries, and generation records. Community tables cover threads, comments, reports, and moderation. Operational tables handle audit logs, rate limits, webhook deliveries, and job runs. Raw archival tables hold Statcast and Savant source data. The modeling namespace houses the called-pitch decision layer.
This structure matters because different parts of the application need different guarantees. Baseball data needs relational integrity. User state and AI usage need stable transactional behavior. Editorial history needs to be auditable. The namespace boundaries exist to keep those concerns clear rather than mixed together.
Total tables
61
Schema namespaces
8 active schemas: public, ops, product, editorial, community, ai, raw, modeling
Mart schema views
0 in the current serving environment
Live ingest
The scheduler is simple. The work gate is smarter.
The live polling path runs on a fixed ten-minute heartbeat. A simple cadence is easy to understand and reason about operationally. The scheduler wakes up on schedule, and the gate decides whether real work needs to happen.
The gate is Eastern Time aware. If no relevant games exist in the current ET window, the poller exits quickly. If games are scheduled but none are live, it exits again. If the system has been stale for more than eight hours, it runs a bounded catch-up across scheduled ET dates going back up to seven days. Each wake-up does the right amount of work, not a fixed amount.
Serving mode and archive mode are also separated. In serving mode, the hosted database stays focused on structured page-facing state. In archive mode, heavier raw material is preserved outside the normal serving footprint. Snapshot pruning keeps the serving environment from quietly accumulating data that belongs in an archive.
Live serving
Scoreboard data comes from structured linescore state, not open-ended snapshot retention.
The live scoreboard is one of the clearest examples of the serving design. The application reads from structured linescore state written into ops.game_linescores during ingest. That gives the product a compact, serving-friendly scoreboard source for preview, live, and final games.
The hosted serving environment does not need to behave like a long-term raw archive just to draw a scoreboard. It needs compact state that is easy to read, easy to refresh, and easy to reason about when something goes wrong. Recent operational snapshots can remain where they help the product, but serving should not be confused with archiving.
Application loading
The application is SQL-first and server-side on purpose.
Server components and route handlers read from shared helpers in src/lib. Analytics pages are built from page-model helpers rather than doing route-local math everywhere. Query logic and data-shaping logic live close to the server-side model of the product rather than scattered across individual routes.
Centralizing retrieval and shaping also helps with consistency. When the same summary or model output is needed in multiple places, there is one path to maintain rather than several slightly different versions. The browser is not treated as a trusted data or authorization layer. Product logic, data loading, and permission boundaries all hold on the server side.
Why this matters for modeling
The model layer gets stronger when the data layer makes state transitions explicit.
The model work in AiBS depends on more than pitch location. It depends on count state, base state, inning, score, team context, and the tracked review path that follows each challenged or unchanged call. None of that works unless the data layer preserves those transitions cleanly enough for both serving and audit purposes.
The warehouse side carries the strongest model layers. Count-state baselines, run expectancy, win expectancy, and challenge evaluation all sit on top of split-aware governance and held-out audits that trace back directly to the data layer. If the ingest is sloppy or the serving shapes are unclear, the modeling work becomes much harder to trust. A clean data layer is what lets the product carry stronger analytical claims without hand-waving.
I built the data layer so the application reads from product truth, not from whatever raw feed payloads happen to exist.
