Volume 9 — Observability, Evaluation, and Verification
How to know whether the system is actually doing what it claims to do.
Reports
AI-ENG-Z — Strategic Telemetry: Traces, Tokens, Latency & Semantic Drift
Covers LLM observability as a first-class engineering discipline. Includes traces, spans, tokens, latency, errors, retries, tool calls, retrieval events, routing decisions, cost attribution, semantic drift, and the Green Dashboard Fallacy: mistaking healthy infrastructure metrics for evidence of answer quality.
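As a rough illustration of the Green Dashboard Fallacy, the sketch below summarizes a batch of spans twice: once on infrastructure signals (errors, latency) and once on an answer-quality score. The `SpanRecord` fields, the `summarize` helper, and every threshold are hypothetical names and values chosen for this example, not taken from the report or from any particular tracing library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanRecord:
    """One traced LLM call (hypothetical schema for illustration)."""
    name: str
    latency_ms: float
    total_tokens: int
    error: bool
    quality_score: Optional[float] = None  # e.g. an eval score in [0, 1]


def summarize(
    spans: List[SpanRecord],
    max_error_rate: float = 0.01,         # assumed threshold
    max_mean_latency_ms: float = 2000.0,  # assumed threshold
    min_mean_quality: float = 0.8,        # assumed threshold
) -> dict:
    """Report infrastructure health and answer quality as separate signals."""
    if not spans:
        raise ValueError("need at least one span")

    error_rate = sum(s.error for s in spans) / len(spans)
    mean_latency = sum(s.latency_ms for s in spans) / len(spans)

    # Only spans that were actually scored contribute to quality.
    scored = [s.quality_score for s in spans if s.quality_score is not None]
    mean_quality = sum(scored) / len(scored) if scored else None

    return {
        "infra_ok": error_rate <= max_error_rate
        and mean_latency <= max_mean_latency_ms,
        # Unscored traffic is treated as "quality unknown", not as healthy.
        "quality_ok": mean_quality is not None
        and mean_quality >= min_mean_quality,
        "total_tokens": sum(s.total_tokens for s in spans),
    }


spans = [
    SpanRecord("answer", 400.0, 900, False, 0.4),
    SpanRecord("answer", 600.0, 1100, False, 0.5),
]
summary = summarize(spans)
```

In this example the batch has no errors and low latency, so `infra_ok` is true while `quality_ok` is false. Keeping the two flags separate is the point: a dashboard built only on the first would show green.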
AI-ENG-AA — Evals Architecture: Ground Truth, Golden Sets & Regression Tests
Covers task evals, retrieval evals, grounding evals, citation evals, tool-use evals, adversarial evals, golden sets, synthetic test generation, human review, LLM-as-judge limits, inter-rater reliability, and regression gates before deployment.
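Two of the items above lend themselves to a small worked example: inter-rater reliability and regression gates. The sketch below computes Cohen's kappa, one common agreement statistic for two raters, and applies a simple threshold gate to a golden-set score. The function names and the `max_drop` tolerance are assumptions made for illustration; the report may use other statistics or gate policies.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def passes_regression_gate(
    baseline_score: float,
    candidate_score: float,
    max_drop: float = 0.02,  # assumed tolerance for this example
) -> bool:
    """Allow deployment only if the golden-set score has not dropped too far."""
    return candidate_score >= baseline_score - max_drop


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
```

Here the raters agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, so kappa is 0.5. Comparing a human reviewer against an LLM judge in this way is one hedge against over-trusting the judge.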
AI-ENG-AB — Verification Artifacts: Auditability, Reproducibility & Evidence Trails
Covers preserving the evidence needed to understand why a system answered or acted as it did. Includes prompt versions, model versions, retrieved chunks, tool calls, intermediate states, user edits, policy checks, output diffs, and replayable execution traces.
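One way to make such an evidence trail checkable is to store each execution as a structured record and derive a deterministic fingerprint from it, so that a replay can be compared against the original. The sketch below is a minimal version under assumed field names (`prompt_version`, `model_version`, `retrieved_chunk_ids`, `tool_calls`, `output`); the values are placeholders, and none of this reflects a schema defined by the report.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a SHA-256 digest of a canonical JSON encoding of the record."""
    # sort_keys and fixed separators make the encoding independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical evidence record for a single answer.
original = {
    "prompt_version": "v12",
    "model_version": "model-2024-06",
    "retrieved_chunk_ids": ["doc-3#2", "doc-7#1"],
    "tool_calls": [{"name": "search", "args": {"q": "refund policy"}}],
    "output": "Refunds are available within 30 days.",
}

# Same content, different key order: the fingerprint should not change.
reordered = dict(reversed(list(original.items())))

# A replay under a different prompt version should be detectable.
replayed = dict(original, prompt_version="v13")

same = fingerprint(original) == fingerprint(reordered)
changed = fingerprint(original) != fingerprint(replayed)
```

A fingerprint only shows that something differs. Locating what differs still requires the stored record itself, which is why the record is preserved and not just its hash.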