Overview

Build systems in large game studios generate thousands of issues per changelist (CL) - failed shaders, missing audio cues, broken asset loads. Tracking which issues are new, which are resolved, and rolling up counts through a deep folder hierarchy (server → project → team → asset) at query time is expensive at scale.

This project models the problem properly. A synthetic data generator produces realistic build issue records across two projects and over 100 builds. A five-step Python pipeline ingests that data into a normalised PostgreSQL schema. A Streamlit dashboard makes the results explorable - treemaps, Gantt-style presence intervals, and team heatmaps.

Image description

No proprietary data is used. The full stack runs locally with a single docker compose up.

Stack: PostgreSQL 16, Python 3.10+, pandas, SQLAlchemy, Streamlit, Plotly, Docker


Architecture

Data flows through five discrete stages:

seed/generate_data.py
        │
        ▼
  issues_raw  (staging table)
        │
        ├── 00_populate_path_nodes.py  ──▶  path_node, path_closure
        ├── 01_ingest_changelists.py   ──▶  changelist
        ├── 02_ingest_issue_instances.py ─▶ issue_instance
        ├── 03_ingest_observations.py  ──▶  snapshot, issue_observation
        └── 04_build_presence_intervals.py ▶ issue_presence_interval
                                                    │
                                             03_metrics.sql ──▶ changelist_metrics
                                             04_views.sql   ──▶ v_treemap_cl_levels8_latest

The generator writes directly to issues_raw, so the pipeline runs unchanged against synthetic data. Each pipeline step is idempotent — re-running it produces identical state, not duplicates.

The schema has nine tables:

Table Purpose
project One row per project; all other tables FK here
path_node One row per unique path segment (server, team folder, asset)
path_closure Every ancestor–descendant pair with distance — enables O(1) hierarchy rollups
changelist One row per CL per project
snapshot One ingestion run per CL; decouples observation time from CL time
issue_instance Deduplicated issue identity, keyed by SHA-256 content hash
issue_observation Fact table: one row per (snapshot, issue, path location)
issue_presence_interval Derived: contiguous runs of CL presence per issue; NULL end = still open
changelist_metrics Pre-aggregated total/new/resolved counts per CL

Key Decisions

Closure table for the path hierarchy. The naive approach — recursive CTEs at query time — works for small trees but degrades badly at depth 8+ with millions of rows. A closure table pre-computes every ancestor–descendant pair once at ingest time. Hierarchy rollups become a single equi-join with no recursion at read time. The trade-off is write amplification: inserting one new path node writes O(depth) rows into path_closure. That is acceptable here because path nodes are written once and read constantly.

Snapshot model instead of current-state overwrite. Storing only the latest state of each issue is simpler, but it makes two things impossible: point-in-time queries ("what was open at CL 10045?") and safe re-ingestion (re-running the pipeline would corrupt history). The snapshot table separates when we observed data from what we observed. The same CL can be re-ingested without side effects, and any historical CL can be reconstructed exactly.

Pre-computed presence intervals. Determining whether an issue is new or recurring requires knowing whether it was present in the previous CL. Doing that at query time means scanning all prior observations for every issue — expensive with 80+ CLs and 650+ issues. issue_presence_interval pre-computes contiguous runs of CL presence. A gap of one or more CLs closes the current interval and opens a new one when the issue reappears. New/resolved state becomes a simple range filter, and recurring issues are a single HAVING COUNT(*) > 1.

Content hash identity for issues. Rather than assigning surrogate keys at ingest time and managing deduplication with a lookup round-trip, each issue is identified by a SHA-256 hash of its normalised text and asset path. The same issue reported in two different CLs resolves to the same issue_instance row. Upsert handles the rest. The trade-off: the normalisation function is load-bearing — changing it invalidates all existing hashes.