Volume 9 — Observability, Evaluation, and Verification

How to know whether the system is actually doing what it claims to do.

Reports

AI-ENG-Z — Strategic Telemetry: Traces, Tokens, Latency & Semantic Drift

Covers LLM observability as a first-class engineering discipline. Includes traces, spans, token counts, latency, errors, retries, tool calls, retrieval events, routing decisions, cost attribution, and semantic drift, plus the Green Dashboard Fallacy: treating healthy infrastructure metrics as evidence of answer quality.

AI-ENG-AA — Evals Architecture: Ground Truth, Golden Sets & Regression Tests

Covers task evals, retrieval evals, grounding evals, citation evals, tool-use evals, adversarial evals, golden sets, synthetic test generation, human review, LLM-as-judge limits, inter-rater reliability, and regression gates before deployment.

AI-ENG-AB — Verification Artifacts: Auditability, Reproducibility & Evidence Trails

Covers how to preserve the evidence needed to understand why a system answered or acted as it did. Includes prompt versions, model versions, retrieved chunks, tool calls, intermediate states, user edits, policy checks, output diffs, and replayable execution traces.
