AI-ENG-AB — Verification Artifacts - Auditability, Reproducibility & Evidence Trails

Doctrinal Foundations: The Evidence Architecture Paradigm

In high-dimensional artificial intelligence systems, production reliability cannot be governed or monitored by traditional application performance monitoring (APM) paradigms.1 While legacy web architectures treat system health as a binary or scalar infrastructure state defined by hardware allocation, network latency, and HTTP status codes, modern multi-agent execution loops and probabilistic large language models (LLMs) break these assumptions.1 A system can return an infrastructure-level successful HTTP 200 payload that contains a complete structural schema failure, an unauthorized tool execution, a cross-tenant data leak, or an ungrounded factual confabulation.1 Conversely, optimal hardware utilization and high throughput can coexist with red behavioral and meaning-level system states.2 This discrepancy establishes the Green Dashboard Fallacy: the diagnostic error of assuming an AI system is operating correctly simply because its underlying servers and APIs are online and available.1

To resolve this visibility deficit, modern systems architectures isolate monitoring concerns across distinct operational layers.1 While strategic telemetry (AI-ENG-Z) functions as the sensor network capturing execution spans 1, and evaluation frameworks (AI-ENG-AA) act as the regression testing pipeline 1, this document defines the architecture of verification artifacts (AI-ENG-AB).1 Verification artifacts are the durable, privacy-governed, and tamper-evident evidence objects required to reconstruct, audit, and reproduce the behavior of probabilistic systems after the fact.3

This architecture is governed by the core doctrine: every consequential AI output or action must leave behind a replayable evidence trail.1 This trail must link the exact prompt template, model weights snapshot, retrieval context, tool payload, policy execution state, human modification, and final output in an unbroken, cryptographically signed chain of custody.1 The fundamental engineering question shifts from “did we record a log entry?” to “can the organization reconstruct the exact decision path from user input to final output or action, including the evidence, versions, state transitions, checks, edits, and approvals that shaped it?”.1 Without this capability, compliance becomes guesswork, debugging degrades into digital archaeology, and the system cannot be safely governed.1

Logs-Traces-Artifacts-Evidence Taxonomy

To establish a clear hierarchy, the following taxonomy contrasts these diagnostic layers across operational, technical, and evidentiary boundaries, proving why generic logs are insufficient for AI auditability:

Layer Technical Scope & Composition Primary Operational Purpose System Proof Capability Architectural Limitations
Event Log Flat structured or unstructured events emitted by applications, services, or workers. Debugging, operational status, exception capture, and coarse audit hints. Shows that a code path emitted an event at a time. Does not prove full context, prompt state, evidence, policy checks, user edits, or side effects.
Metric Aggregated numeric counters, gauges, histograms, and time-series measurements. Capacity planning, alerting, scaling, and SLO monitoring. Shows rates, volumes, durations, and resource pressure. Blind to semantic correctness, grounding, authorization, and user-visible truth.
Execution Trace Parent-child span graph linking distributed work across model calls, retrieval, tools, queues, and services. Latency analysis, causal path reconstruction, and system debugging. Shows the causal execution path and timing relationships. Often redacted, sampled, or incomplete for bulky or sensitive payloads.
Span Atomic observable unit of work such as inference, retrieval, rerank, parser, tool call, guardrail, or review step. Local timing, status, error, and attribute capture. Shows what a specific execution step attempted and how it ended. Does not by itself prove global policy state, source lineage, or final user-facing transformation.
Verification Artifact Schema-validated evidence packet containing hashes, references, versions, decisions, state checks, and selected payload evidence. Auditing a specific decision boundary or execution step. Proves integrity of recorded inputs, outputs, parameters, and decisions to the degree captured. Cannot prove uncaptured state; raw payload capture must be governed by privacy policy.
Manifest Declarative version record for prompts, models, tools, dependencies, policies, routes, and environments. Pinning expected system state and baseline capability configuration. Shows what configuration was intended or eligible for execution. Does not prove which dynamic inputs arrived or whether runtime behavior matched intent.
Ledger Append-only record of actions, approvals, state transitions, tool calls, retries, and commit/verification status. Sequencing, accountability, and action-state audit. Shows logical progression and final known state of operations. Does not automatically preserve full prompts, retrieved evidence, or rendered UI output.
Snapshot Point-in-time capture or reference to context, retrieval state, memory, index version, tool state, or UI state. Preserving what information was available at execution time. Shows the evidence/context surface available to the system. Storage-intensive; may require deduplication, redaction, and secure references.
Diff Structural, textual, semantic, or field-level comparison between artifact versions. Tracking edits, redactions, transformations, overrides, and regressions. Shows exactly what changed between two captured states. Does not explain why the change occurred unless linked to trace, policy, or reviewer artifacts.
Evidence Trail Hash-linked sequence of artifacts, manifests, ledgers, snapshots, and diffs with retention and access policy. Forensic reconstruction, compliance proof, replay, and dispute resolution. Shows an integrity-protected chain of recorded evidence across a consequential workflow. Tamper-evident, not omniscient; evidence quality depends on capture coverage and governance.

Legacy logs are fundamentally insufficient for auditing probabilistic architectures.1 Recording that an API returned a response does not show the prompt template that generated it, the vector database coordinates that informed it, the safety policies that validated it, or the user edits that overrode it before execution.1 If an administrator cannot inspect these intermediate states, they cannot defend automated decisions to regulators, resolve disputes, or debug system drift.1

Replay and Reproducibility Typology

GPU Non-Determinism and the Concurrency Hypothesis

Achieving exact, bitwise reproducibility in large language model inference is exceptionally difficult.1 It is a common misconception that setting a fixed random number seed guarantees identical outputs across successive runs.18 In parallel computing environments, GPUs execute matrix multiplications and attention calculations across thousands of concurrent cores.19
Because floating-point addition is non-associative, the order of arithmetic reduction operations alters the least significant bits of the floating-point representations 18:
(a + b) + c is not equal to a + (b + c)
If the execution speed of concurrent GPU threads varies due to minor hardware latency fluctuations, the reduction order changes, introducing run-to-run non-determinism.19
However, technical analysis reveals that this “concurrency and floating-point” hypothesis is not the primary driver of non-determinism in production inference endpoints.18 Individual operations (kernels) inside the model’s forward pass are run-to-run deterministic when executed in isolation.20 The primary cause of output variance is dynamic batching.18 High-performance serving engines (such as vLLM) batch concurrent requests dynamically to maximize hardware utilization.20 Because system load changes constantly, an individual request is batched with varying sets of other user prompts.20
To optimize performance for different batch sizes, the engine dynamically selects different parallelization strategies, such as switching from a standard data-parallel matrix multiplication to a Split-K matmul, or modifying reduction trees in Split-Reduction RMSNorm.19 This batch-dependent kernel selection alters the reduction arithmetic, producing different outputs for the identical prompt.20

To enforce true bit-exact determinism, the model must be executed on an inference stack utilizing batch-invariant kernels.19 This strategy parallelizes operations entirely within individual tensor cores (data-parallel reductions) and avoids Split-K optimizations, preserving the arithmetic order regardless of batch changes.19 This exact reproducibility introduces a measurable performance tradeoff:

In empirical testing over a Qwen3-8B model, executing 1,000 runs under standard vLLM produced 80 unique completions, while forcing batch-invariant operations yielded a single, consistent completion across all runs, accompanied by a 61.5% execution latency penalty.18

The Replay and Reproducibility Typology

Because forcing bit-exact determinism across third-party cloud providers, variable GPU clusters, and dynamic databases is often impractical, systems must design for graded levels of verification 1:

Replay Mode Target Objective Execution Constraints State Preservation Requirement System Verifications
Exact Replay Recreate bitwise identical outputs for debugging.1 Requires identical hardware, batch-invariant kernels, pinned model snapshots, and matching seeds.20 Bit-exact saving of weights, inputs, and attention states.20 Recomputes forward passes to verify exact token matches.20
Constrained Replay Recreate comparable completions on standard engines.1 Executes on matching model versions with fixed sampling parameters and temperatures.5 Saves the compiled prompt and exact decoding configurations.2 Compares output tokens to ensure they fall within acceptable semantic ranges.1
Semantic Replay Verify that a rerun satisfies equivalent logical constraints.1 Allows the model, prompt, or database to vary within defined boundaries.1 Saves the evaluation rubric, task schema, and input goals.1 Evaluates whether the new completion satisfies identical Pydantic rules and NLI grounding.1
Counterfactual Replay Compare how a change to a single variable alters behavior.1 Modifies one parameter (e.g., prompt version, retrieved chunk) while keeping others constant.1 Saves the base run as a template and isolates the variable under test.1 Maps behavior changes to identify regressions or evaluate improvements.1
Forensic Replay Reconstruct the cause of a failure when re-execution is impossible.1 Does not execute the active model; relies entirely on recorded logs and states.1 Saves the complete execution trace, tool arguments, and environment hashes.3 Walks the recorded sequence to pinpoint the exact failure location.3
Audit-Only Prove compliance with policies without exposing raw data.22 Utilizes cryptographic hashes and signatures to verify states.1 Saves the metadata headers, content hashes, and digital signatures.4 Validates the signatures against root authorities to prove the run occurred as logged.15

Determinism Decay and Environment Drift

A primary engineering challenge is that reproducibility degrades over time due to external dependencies.1 Even with fixed seeds and invariant kernels, the surrounding execution environment shifts silently, causing “determinism decay”.1
The primary operational drivers of this decay are categorized below:

To address this, platforms must avoid false promises of bit-exact determinism, defining instead strict “reproducibility envelopes” that capture the complete state of the execution context.1

Verification Artifact Bundle Schema

The central evidence container of this architecture is the Verification Artifact Bundle. It is a structured, schema-validated evidence object that records the minimum sufficient information required to audit, replay, or explain a consequential AI output or action.

The bundle should support JSON or CBOR serialization depending on platform tooling. It may be wrapped in an in-toto / DSSE-style envelope or another signed attestation format when the artifact class requires cryptographic proof. The important architectural rule is not the file format fetish—tempting though that altar is—but the preservation of signed, versioned, replayable evidence.

Block Name Example Field Paths Capture Rule Purpose
Identity & Scope identity.request_id, identity.trace_id, identity.session_hash, identity.tenant_hash, identity.subject_hash, identity.workflow_id, identity.risk_class Store opaque IDs or scoped hashes by default; raw identifiers only in restricted audit stores. Correlates the bundle with traces, sessions, tenants, subjects, workflows, and risk policy.
Prompt Lineage prompt.template_id, prompt.template_version, prompt.compiler_hash, prompt.compiled_hash, prompt.payload_ref, prompt.token_count Store hash and secure reference for compiled prompt; raw prompt only under access-controlled registry. Proves which prompt template and compiled prompt shaped execution.
Model / Route Manifest model.requested_route, model.selected_route, model.provider, model.model_version_ref, model.decoding_settings_hash, model.routing_reason Store route/model manifest and provider-visible version; note when hosted provider cannot provide weight hash. Documents model selection, fallback, decoding, and route changes.
Retrieval Snapshot retrieval.index_id, retrieval.index_version, retrieval.query_hash, retrieval.rewrite_hashes, retrieval.chunk_ids, retrieval.source_version_hashes, retrieval.evidence_refs Store document IDs, chunk IDs, source hashes, scores, coordinates, and secure refs; avoid raw text in general bundle. Proves what evidence was searched, selected, omitted, cited, or unavailable.
Tool Execution tool.name, tool.schema_version, tool.action_class, tool.payload_hash, tool.idempotency_key_hash, tool.pre_state_hash, tool.post_state_hash, tool.verification_status Store payload hashes and redacted summaries; sensitive request/response bodies by secure reference. Proves tool intent, arguments, authorization, side-effect status, and verification outcome.
State Ledger state.plan_id, state.step_id, state.transition_log, state.loop_count, state.no_progress_count, state.fallback_state, state.termination_reason Store state transitions and hashes; never require hidden chain-of-thought capture. Reconstructs the workflow trajectory without exposing private reasoning.
User / Reviewer Edit edit.draft_hash, edit.final_hash, edit.diff_ref, edit.editor_hash, edit.edit_class, edit.approval_state Store diffs, hashes, and secure refs; redact sensitive text as required. Shows how user or reviewer modifications changed the artifact.
Output Lineage output.raw_hash, output.post_processed_hash, output.policy_filtered_hash, output.redacted_hash, output.rendered_hash, output.final_hash Store transformation hashes and secure refs; raw outputs only under policy. Tracks every transformation between model output and final displayed/submitted output.
Policy Enforcement policy.bundle_version, policy.rule_ids, policy.decisions, policy.enforcement_points, policy.override_state, policy.reviewer_hash Store deterministic rule results separately from classifier scores. Proves which policy checks ran, where, and with what outcome.
Human Approval approval.request_id, approval.status, approval.maker_hash, approval.checker_hash, approval.payload_hash, approval.expires_at, approval.signature_ref Store signed approval metadata and payload hash. Proves approval state, separation of duties, and freshness of authorization.
Audit & Integrity audit.bundle_hash, audit.parent_hash, audit.signature_ref, audit.created_at, audit.retention_class, audit.legal_hold, audit.redaction_class Sign bundle hash or envelope; store chain metadata and retention policy. Makes the evidence record tamper-evident and governed.

Risk-Class Tiering Policy

The bundle should adapt to risk class:

Risk Class Capture Profile
Low Metadata, hashes, route IDs, status, and short-retention trace links.
Medium Metadata plus selected redacted snippets, secure refs, policy decisions, and tool hashes.
High Complete-enough evidence trail with secure references, signatures, approval records, state verification, and stricter retention.
Critical / Regulated High-risk profile plus legal hold support, stronger signing, replay package linkage, reviewer identity controls, and access-audited payload vaults.

High risk does not mean “store everything raw forever.” It means preserve enough evidence to audit and replay the decision while minimizing raw sensitive payloads, enforcing access controls, and recording when data was intentionally not captured.

Reproducibility Envelope Model

Every consequential AI execution should run inside a Reproducibility Envelope. The envelope records the system state needed to recompile, inspect, simulate, or replay the run. The reproducibility unit is not “the model” alone; it is the model route, prompt compiler, retrieval surface, policy bundle, tool contracts, feature flags, runtime configuration, and evidence state.

REPRODUCIBILITY ENVELOPE

+----------------------------------------------------------+
| Prompt Assembly                                          
| template id/version | compiler hash | context policy     
+----------------------------------------------------------+
| Inference Route                                          
| provider route | model ref | tokenizer | decoding config 
+----------------------------------------------------------+
| Retrieval Surface                                        
| corpus version | index hash | embedding model | filters  
+----------------------------------------------------------+
| Tool and Action Surface                                  
| tool schemas | mock mode | credential policy | idempotency
+----------------------------------------------------------+
| Policy and Governance                                    
| policy bundle | approval flow | safety filters | retention
+----------------------------------------------------------+
| Runtime Environment                                      
| container image | dependency lock | feature flags | region 
+----------------------------------------------------------+
| Replay Constraints                                       
| exact | constrained | semantic | counterfactual | forensic
+----------------------------------------------------------+
Parameter Category Example Field Paths Validation Source Registry
Prompt Assembly env.prompt.template_id, env.prompt.template_version, env.prompt.compiler_hash, env.prompt.context_assembler_hash, env.prompt.memory_policy_version Prompt registry, compiler release manifest, context policy registry.
Inference Runtime env.model.requested_route, env.model.selected_route, env.model.provider_api_version, env.model.snapshot_ref, env.model.tokenizer_ref, env.model.decoding_settings_hash Model registry, provider manifest, gateway route manifest.
Serving Environment env.serving.container_digest, env.serving.runtime_version, env.serving.cuda_or_accelerator_version, env.serving.batch_policy, env.serving.determinism_mode Container registry, deployment manifest, serving-engine config.
Retrieval Engine env.retrieval.index_id, env.retrieval.index_version, env.retrieval.corpus_version, env.retrieval.embedding_model_ref, env.retrieval.chunker_version, env.retrieval.reranker_ref, env.retrieval.permission_policy_version Vector/index registry, corpus manifest, ingestion pipeline records.
Tool Integration env.tool.manifest_hash, env.tool.schema_versions, env.tool.server_digest, env.tool.mock_mode, env.tool.side_effect_mode, env.tool.credential_policy_version Tool registry, container registry, credential broker policy.
Policy and Safety env.policy.bundle_hash, env.policy.safety_filter_version, env.policy.redaction_policy_version, env.policy.approval_flow_version, env.policy.retention_class Policy registry, governance manifest, approval workflow registry.
System and UI env.ui.version, env.ui.renderer_hash, env.feature_flags.hash, env.deployment.hash, env.region, env.cache_state, env.degraded_mode_state Web client registry, feature-flag snapshot, deployment manifest.
Replay Controls env.replay.mode, env.replay.egress_policy, env.replay.payload_capture_class, env.replay.tool_response_source, env.replay.allowed_counterfactuals Replay policy registry and audit configuration.

This envelope does not guarantee bit-exact reproduction in every environment. It defines what must be pinned, captured, or declared unknown so auditors can distinguish exact replay, constrained replay, semantic replay, counterfactual replay, forensic replay, and audit-only verification.30

Specialized Evidence Records

The Prompt Lineage Record

To determine whether behavior changed because of template edits, context assembly, memory inclusion, or dynamic variable injection, the system preserves a Prompt Lineage Record.

Field Name Physical Data Type Capture Rule Operational Purpose
prompt.template_id String Metadata. Identifies the registered template.
prompt.template_version SemVer / release ID Metadata. Matches the active template registry version.
prompt.template_hash SHA-256 hash Hash. Verifies template integrity without exposing text.
prompt.template_ref Secure reference Restricted. Points to raw template in controlled registry if access is allowed.
prompt.compiler_version Version / commit SHA Metadata. Identifies prompt compiler behavior.
prompt.context_assembler_version Version / commit SHA Metadata. Identifies context assembly logic.
prompt.variable_manifest JSON object / hash Redacted or secure reference. Records which variables were injected and their capture classes.
prompt.memory_refs Array of hashes/refs Hash/reference. Identifies memory items included in context.
prompt.evidence_refs Array of chunk/source refs Hash/reference. Identifies retrieval chunks inserted into context.
prompt.tool_schema_refs Array of schema hashes Metadata/hash. Records tool definitions available to the model.
prompt.policy_refs Array of policy hashes Metadata/hash. Records dynamic safety/governance instructions.
prompt.message_boundaries Structured offsets / roles Metadata. Maps system, developer, user, tool, and context scopes.
prompt.compiled_hash SHA-256 hash Hash. Verifies exact compiled prompt.
prompt.compiled_ref Secure reference Restricted. Allows controlled replay of compiled prompt when permitted.
prompt.token_count Integer Metadata. Records input size for cost and replay.
prompt.version_diff_ref Secure diff reference Restricted. Links to template diff without dumping sensitive prompt text into logs.

Saving only raw templates is insufficient because compiled prompts include dynamic variables, memories, retrieved evidence, tool schemas, policies, and message boundaries. The lineage record proves what was assembled without requiring every observer to see the full prompt text.

The Model and Route Manifest

Model route decisions are critical audit artifacts. A silent downgrade, fallback, cache hit, provider change, or safety-route change can alter quality, cost, latency, and policy behavior.

Field Name Physical Data Type Capture Rule Description / Key Purpose
route.requested_route String Metadata. Captures requested capability profile.
route.selected_route String Metadata. Captures actual executed route.
route.provider_name String Metadata. Records model host or serving class.
route.provider_api_version String Metadata if available. Captures provider/API version surface.
route.model_id String Metadata. Records requested model identifier.
route.model_snapshot_ref String / secure ref Metadata/reference. Records model version or provider-supplied snapshot when available.
route.tokenizer_ref String / hash Metadata/hash. Captures tokenizer version for replay.
route.routing_reason Enum string Metadata. Examples: nominal, failover, quota, cost_policy, quality_floor, safety_policy, cache_hit.
route.fallback_path_id String Metadata. Identifies fallback chain step if used.
route.cached_response Boolean Metadata. Flags whether semantic, prefix, or response cache was used.
route.router_policy_hash SHA-256 hash Hash. Verifies active routing policy.
route.quality_floor_id String Metadata. Links route decision to required task quality floor.
route.safety_floor_id String Metadata. Links route decision to safety/policy floor.
route.cost_estimate Number / object Metadata. Records preflight cost estimate and confidence.
route.actual_cost_ref Reference / number Metadata/reference. Links to reconciled cost ledger.
route.latency_profile Object Metadata. Records expected/observed latency class.
route.context_headroom_tokens Integer Metadata. Verifies route could fit required context.
route.tool_capability_required Boolean Metadata. Flags if task needed tool-use support.
route.modality Array of strings Metadata. Records modalities processed, such as text, image, audio, or video.
route.tenant_tier_hash Hash/string Metadata/hash. Supports priority/cost analysis without exposing tenant identity.
route.budget_state_ref Secure reference Restricted. Links to budget ledger.
route.decoding_settings_hash SHA-256 hash Hash. Verifies temperature, top-p, max-token, and related decoding configuration.
route.decoding_settings_ref Secure reference Restricted. Allows controlled replay of decoding parameters.
route.stop_reason String Metadata. Records completion stop condition.
route.user_disclosure_required Boolean Metadata. Records whether user should have been told.
route.user_disclosure_shown Boolean Metadata. Records whether disclosure occurred.

The Retrieval Evidence Snapshot

GRC verification requires proving the context available before generation. The retrieval snapshot should show what was searched, what was authorized, what was selected, what was omitted, and how citations map to evidence.

Field Name Physical Data Type Capture Rule Description / Key Purpose
retrieval.raw_query_hash SHA-256 hash Hash. Verifies original query without exposing it.
retrieval.raw_query_ref Secure reference Restricted. Allows controlled access to raw query if needed.
retrieval.rewrite_hashes Array of hashes Hash. Records query reformulation steps.
retrieval.index_id String Metadata. Identifies searched index or partition.
retrieval.index_version String Metadata. Tracks index migrations and rebuilds.
retrieval.corpus_version String/hash Metadata/hash. Tracks source corpus state.
retrieval.embedding_model_ref String/hash Metadata/hash. Records query embedding model.
retrieval.permission_policy_version String/hash Metadata/hash. Records authorization policy used.
retrieval.tenant_scope_hash SHA-256 hash Hash. Confirms tenant scope without raw tenant leakage.
retrieval.permission_filter_ids Array of strings Metadata. Records RBAC/ABAC filters applied.
retrieval.metadata_filter_hash SHA-256 hash Hash. Verifies structural filters.
retrieval.candidate_ids Array Metadata. Logs candidate IDs and scores.
retrieval.selected_chunk_ids Array Metadata. Logs chunks admitted into context.
retrieval.omitted_chunk_ids Array Metadata. Logs relevant chunks pruned or omitted.
retrieval.rerank_score_summary Object Metadata. Records score distribution or secure score references.
retrieval.source_version_hashes Array Hash. Verifies parent document versions.
retrieval.evidence_refs Array of secure refs Restricted. Links to raw text or regions when access is allowed.
retrieval.coordinates Array of objects Metadata. Page, span, line, offset, or bounding boxes.
retrieval.citation_map_hash SHA-256 hash Hash. Verifies claim-to-evidence mapping.
retrieval.citation_map_ref Secure reference Restricted. Allows detailed citation replay.
retrieval.conflict_signal Boolean/object Metadata. Flags source contradiction or unresolved conflict.
retrieval.stale_flag Boolean/object Metadata. Flags source age or freshness concerns.
retrieval.inaccessible_flag Boolean Metadata. Flags deleted or unauthorized source references.
retrieval.claim_links Array of hashes/refs Hash/reference. Maps generated claims to chunk/source evidence.

This snapshot provides bounded verifiability: it can distinguish retrieval failure, stale-source failure, citation failure, and generation failure when capture coverage, source versioning, and permission metadata are complete.

The Tool Execution Record

A generated statement claiming an action succeeded is not evidence of success. Auditing requires verification of tool authorization, payload integrity, idempotency, execution status, and post-action state.

Field Name Physical Data Type Capture Rule Description / Key Purpose
tool.planned_name String Metadata. Tool the model/coordinator intended to invoke.
tool.executed_name String Metadata. Tool actually executed.
tool.schema_version String Metadata. Tracks API/tool contract version.
tool.action_class Enum string Metadata. Examples: read_only, idempotent_mutation, non_idempotent_mutation, high_impact_action.
tool.permission_decision Object/string Metadata. Records authorization outcome.
tool.credential_ref_hash SHA-256 hash Hash. Verifies scoped credential reference without storing token.
tool.argument_hash SHA-256 hash Hash. Verifies exact argument object.
tool.argument_summary Redacted object Redacted. Provides safe diagnostic summary.
tool.argument_ref Secure reference Restricted. Controlled access to full arguments if policy allows.
tool.user_edit_diff_ref Secure reference Restricted. Records human changes to tool arguments.
tool.idempotency_key_hash SHA-256 hash Hash. Supports duplicate-action prevention.
tool.request_payload_hash SHA-256 hash Hash. Verifies outbound request.
tool.request_payload_ref Secure reference Restricted. Controlled access to raw request.
tool.response_payload_hash SHA-256 hash Hash. Verifies inbound response.
tool.response_payload_ref Secure reference Restricted. Controlled access to raw response.
tool.error_status Enum string Metadata. Examples: success, timeout, provider_error, validation_error, unknown.
tool.retry_decision Enum string Metadata. Examples: none, backoff, abort, manual_review, reconcile.
tool.quota_consumed Number/object Metadata. Tracks external quota or cost usage.
tool.side_effect_summary Object Redacted metadata. Records expected and observed side-effect classes.
tool.state_pre_hash SHA-256 hash Hash. Proves source-of-record state before action.
tool.state_post_hash SHA-256 hash Hash. Proves source-of-record state after action, if known.
tool.verification_status Enum string Metadata. Examples: verified, failed, pending, unknown, not_applicable.
tool.compensation_supported Boolean Metadata. Records whether compensation is possible.
tool.compensation_plan_ref Secure reference Restricted. Links to compensation plan without leaking endpoint surface.
tool.user_confirmation_ref Secure reference/hash Restricted. Proves required consent or approval where applicable.

The Intermediate State Ledger

To trace execution trajectories without exposing hidden reasoning or private chain-of-thought, the system preserves an Intermediate State Ledger. It records state transitions, decisions, validation failures, and observable progress—not the model’s private scratchpad.

Field Name Physical Data Type Capture Rule Description / Key Purpose
state.planner_version Version/hash Metadata. Tracks planner/orchestrator implementation.
state.subtask_graph_hash SHA-256 hash Hash. Verifies task graph structure.
state.subtask_graph_ref Secure reference Restricted. Allows controlled inspection of task DAG.
state.transitions Array of objects Metadata. Logs state-machine transitions and events.
state.loop_count Integer Metadata. Tracks total loop steps against workflow profile.
state.loop_budget_profile String Metadata. Links loop limit to risk/workflow class.
state.memory_read_hashes Array of hashes Hash. Records memory items read.
state.memory_write_hashes Array of hashes Hash. Records memory items written.
state.retry_branches Array of objects Metadata. Logs retry and branch attempts.
state.repair_attempts Array of objects Metadata. Logs output/schema repair attempts.
state.validation_failures Array of objects Metadata/redacted. Logs parser/schema/business-rule failures.
state.no_progress_count Integer Metadata. Tracks repeated states without material progress.
state.no_progress_policy String Metadata. Links action to halt, replan, or escalation policy.
state.tool_result_hashes Array of hashes Hash. Correlates tool results with steps.
state.policy_blocks Array of rule IDs Metadata. Logs triggered safety/compliance blocks.
state.fallback_state Enum string Metadata. Examples: nominal, degraded_model, degraded_retrieval, cached, partial, fail_closed.
state.termination_reason Enum string Metadata. Examples: success, max_steps, budget_exhausted, user_cancelled, policy_block, unknown_state.

The User Edit and Output Diff Record

The final answer displayed or submitted is the last artifact in a chain of transformations. The system should record hashes, diffs, edit metadata, and secure references while avoiding uncontrolled raw output storage.

Field Name Physical Data Type Capture Rule Description / Key Purpose
edit.original_draft_hash SHA-256 hash Hash. Verifies model-generated draft.
edit.original_draft_ref Secure reference Restricted. Controlled access to raw draft if retained.
edit.final_text_hash SHA-256 hash Hash. Verifies final edited text.
edit.final_text_ref Secure reference Restricted. Controlled access to final text if retained.
edit.editor_hash SHA-256 hash Hash/metadata. Identifies editor without broad raw identity exposure.
edit.editor_role String Metadata. Captures role or authority class.
edit.edit_class Enum string Metadata. Examples: factual, stylistic, policy, tool_argument, redaction, citation, approval.
edit.timestamp ISO-8601 timestamp Metadata. Records edit time.
edit.target_field String Metadata. Identifies field or region altered.
edit.status Enum string Metadata. Examples: accepted, rejected, overridden, pending_review.
edit.diff_hash SHA-256 hash Hash. Verifies diff artifact.
edit.diff_ref Secure reference Restricted. Controlled access to unified or structured diff.
edit.updated_memory Boolean Metadata. Records if edit updated memory.
edit.entered_evals Boolean Metadata. Records if run was promoted to evals.
edit.changed_tool_args Boolean Metadata. Flags edits to tool parameters.
edit.preserved_in_final Boolean Metadata. Verifies whether edit survived final rendering/submission.

Output lineage should capture transformation hashes for each stage:

  1. output.raw_model_hash
  2. output.post_processed_hash
  3. output.policy_filtered_hash
  4. output.redacted_hash
  5. output.citation_injected_hash
  6. output.rendered_ui_hash
  7. output.human_edited_hash
  8. output.final_submitted_hash

Raw text for any stage should be stored only according to capture class and retention policy.

The Policy Check Record

Policy checks must function as executable verification artifacts, not passive annotations. Deterministic rules, classifiers, human approvals, and override paths should be recorded separately so auditors can tell what kind of control actually fired.

Field Name Physical Data Type Capture Rule Description / Key Purpose
policy.bundle_version Version/hash Metadata/hash. Tracks active policy bundle.
policy.rule_ids Array of strings Metadata. Identifies rules executed.
policy.rule_authority String/hash Metadata/hash. Documents policy source or owner.
policy.input_hash SHA-256 hash Hash. Verifies evaluated input.
policy.input_ref Secure reference Restricted. Controlled access to evaluated content.
policy.decision Enum string Metadata. Examples: allow, block, escalate, redact, degrade, require_approval.
policy.rule_type Enum string Metadata. Examples: deterministic, classifier, human_review, external_policy.
policy.threshold Number/object Metadata. Classifier threshold when applicable.
policy.score Number/object Metadata. Classifier score when applicable.
policy.enforcement_point Enum string Metadata. Examples: ingress, context_assembly, pre_action, tool_gateway, egress, ui_render.
policy.result_status Enum string Metadata. Examples: pass, fail, error, not_applicable.
policy.override_status Enum string Metadata. Examples: none, requested, authorized, denied, expired.
policy.exception_summary String/object Redacted. Logs parser or system exception safely.
policy.reviewer_hash SHA-256 hash Restricted metadata. Identifies reviewer where applicable.
policy.approval_ref Secure reference Restricted. Links to approval trail.
policy.escalation_ref Secure reference Restricted. Links to escalation queue or package.
policy.final_action_state Enum string Metadata. Examples: completed, blocked, pending, unknown, review_required.

Replay Package Model and Execution Safety

To audit, debug, or evaluate a transaction safely, the system compiles a Replay Package. A Replay Package is not merely a folder of logs; it is a controlled reconstruction environment designed to reproduce or simulate behavior without mutating production systems.

A replay package contains:

REPLAY PACKAGE CONTAINER

+----------------------------------------------------------+
| Replay Orchestrator                                      
|  loads trace graph, bundle, envelope, policy, rubric      
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Network and Credential Controls                          
|  no production credentials                                
|  egress blocked or allowlisted                            
|  production endpoints denied                              
|  secrets replaced with inert references                   
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Mocked / Simulated Dependencies                           
|  recorded LLM outputs                                     
|  signed tool responses                                    
|  retrieval snapshot                                       
|  sandbox database                                         
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Replay Result                                             
|  exact replay, constrained replay, semantic comparison,    
|  counterfactual analysis, or forensic timeline             
+----------------------------------------------------------+

These execution bounds are mapped across six replay modes:

Replay Mode Egress Policy Input Constraints Execution Layer Target Verification Side-Effect Prevention
Read-Only Replay Egress blocked except approved artifact stores. Inputs locked to original values. Recompiles prompts, layouts, UI rendering, and local transformations. Verifies rendering, formatting, redaction, and citation display. No tool credentials; no production write paths.
Mocked-Tool Replay Egress blocked. Inputs locked; tool responses loaded from signed mocks. Re-executes orchestration against recorded dependency responses. Evaluates route behavior, state transitions, and output variance. Tool gateway intercepts all calls and returns mocks.
Sandbox Replay Egress restricted to isolated test network. Inputs locked or copied into sandbox fixtures. Runs against test database, test tool servers, and inert credentials. Evaluates integration behavior under controlled conditions. Production endpoints denied; credentials scoped to sandbox only.
Counterfactual Replay Egress blocked or sandbox-only. Allows one declared variable modification at a time. Re-runs selected prompt/model/retrieval/policy variant. Attributes behavioral changes to the modified variable. Tools mocked unless explicit sandbox mutation is approved.
Semantic Replay Egress blocked or sandbox-only. Inputs locked; model may vary within declared route envelope. Evaluates logical constraints, grounding, citations, and policy outcomes. Verifies equivalent behavior rather than exact tokens. No production side effects; unknown states remain marked unknown.
Forensic Replay No execution egress. No active model/tool execution. Timeline analysis of recorded evidence. Identifies likely failure point from trace, ledger, and artifacts. Complete decoupling from execution systems.

Evidence Chain and Tamper-Evidence Infrastructure

To detect silent modification or deletion of verification artifacts, the platform should implement tamper-evident evidence chains. Each Verification Artifact Bundle is canonicalized, hashed, signed when required, and linked to prior artifacts or workflow checkpoints.

The hash-chain relation can be expressed as:

h_i = H(canonical(bundle_i)   h_(i-1))

where H is a collision-resistant hash function, canonical(bundle_i) is a deterministic serialization of the bundle, and h_(i-1) is the previous chain hash. This does not make evidence indestructible or magically true. It makes later alteration detectable when verification is performed.

The platform may use several signing and provenance mechanisms depending on artifact type:

  1. Signed Attestation Envelope: Verification bundles, model manifests, tool manifests, and replay packages can be wrapped in DSSE / in-toto-style signed envelopes.
  2. Sigstore / Fulcio / Rekor: CI/CD artifacts, container images, model bundles, and build outputs can use keyless signing and transparency logging where appropriate.
  3. C2PA Manifests: Images, charts, video, generated media, or datasets may carry embedded or sidecar C2PA manifests. Text-only or binary artifacts may bind to C2PA through sidecar manifests and asset references.
  4. Internal WORM / Append-Only Ledger: Regulated systems may store chain checkpoints in WORM object storage or append-only audit databases.
  5. External Anchoring: High-assurance systems may periodically anchor evidence-chain checkpoints externally to improve tamper detection.
TAMPER-EVIDENT EVIDENCE CHAIN

+----------------------------------------------------------+
| Artifact Created                                         
|  bundle, replay package, manifest, snapshot, or diff     
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Canonicalize and Hash                                    
|  deterministic serialization                             
|  content hash                                            
|  parent hash                                             
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Sign / Attest When Required                              
|  DSSE / in-toto envelope                                  
|  Sigstore or internal signing                             
|  policy-defined signer identity                           
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Store Evidence                                           
|  content-addressed storage                                
|  retention class                                          
|  access policy                                            
|  redaction status                                         
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Verify Chain                                             
|  validate hashes                                          
|  validate signatures                                      
|  validate timestamps                                      
|  detect missing or altered records                        
+----------------------------------------------------------+
Field Name JSON Field Key Path Data Type Validation Standard
Artifact ID chain.artifact_id String UUID or platform artifact ID.
Artifact Type chain.artifact_type Enum string Examples: verification_bundle, replay_package, manifest, snapshot, diff, policy_record.
Content Hash chain.content_hash String SHA-256 or approved hash algorithm.
Parent Hash chain.parent_hash String/null Previous chain hash or null for chain root.
Canonicalization Version chain.canonicalization_version String Defines deterministic serialization rules.
Created By chain.created_by_hash String Workload, service, or user hash.
Created At chain.created_at Timestamp ISO-8601 with timezone.
Signed By chain.signed_by String/reference Signing identity or certificate reference.
Signature Reference chain.signature_ref Secure reference Link to DSSE, Sigstore, or internal signature record.
Transparency Entry chain.transparency_log_ref Optional reference Rekor or internal transparency log entry.
Storage Location chain.storage_ref Secure reference Content-addressed object or archive reference.
Retention Class chain.retention_class Enum string Policy-defined retention profile.
Access Policy chain.access_policy_id String Policy governing read/write access.
Redaction Status chain.redaction_status Enum string Examples: metadata_only, redacted, reference_only, restricted_full_payload.
Deletion Eligibility chain.deletion_eligibility Timestamp/string Date or policy state, subject to legal hold.
Legal Hold chain.legal_hold Boolean True when deletion is suspended.
Verification Status chain.verification_status Enum string Examples: verified, failed, degraded, not_checked.

If a historical record is altered, hash-chain verification fails and the system can trigger investigation, containment, and recovery workflows. That is the sober version. The cartoon version where cryptography tackles the attacker through a window is, regrettably, not in scope.

Artifact Privacy, Redaction, and Retention Model

Verification artifacts contain sensitive operational data: proprietary documents, customer records, payment details, credentials, policy decisions, reviewer comments, tool payloads, and source-of-record state. Evidence trails must therefore follow a data-minimization model. Auditability is not permission to build a beautifully indexed breach warehouse.

Redaction should be layered. Regex and NER scans are useful but insufficient alone. The system should combine structured field suppression, secret scanning, policy-aware capture classes, secure payload references, access logging, retention limits, and legal-hold rules.

PRIVACY REDACTION WORKFLOW

+----------------------------------------------------------+
| Artifact Capture Point                                   
|  prompt, retrieval, tool, output, review, policy, state   
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Capture-Class Decision                                   
|  never capture                                             
|  hash only                                                 
|  metadata only                                             
|  redacted snippet                                          
|  reference only                                            
|  restricted full payload                                   
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Redaction and Secret Controls                            
|  structured denylist                                       
|  regex / NER / classifier scan                             
|  source-aware policy                                       
|  credential stripping                                      
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Storage and Access Policy                                
|  content-addressed storage                                 
|  secure payload vault                                      
|  WORM archive when needed                                  
|  access logs and retention                                 
+----------------------------------------------------------+
Capture Class Technical Processing Rule Physical Storage Substrate Access Control Level Typical Data Target
0. Never Capture Strip or block before any persistent write. No raw storage. None for raw value; security event may record detection metadata. Passwords, API tokens, private keys, CVVs, raw auth headers.
1. Hash Only Store digest and capture metadata; discard raw payload. Metadata database or audit ledger. Restricted audit/SRE as policy allows. Large payload identity, file equality checks, prompt/output fingerprints.
2. Metadata Only Store size, schema, type, source ID, status, version, and timing. Trace store or artifact metadata index. Standard operational access if non-sensitive. Parser status, route metadata, model IDs, schema names.
3. Redacted Snippet Store minimal masked excerpt sufficient for debugging. Secure artifact store with retention limit. Purpose-bound operator/auditor access. Conversation excerpts, validation errors, reviewer notes.
4. Reference Only Store raw payload in secure vault; artifact stores only hash and reference. Encrypted object store or payload vault. Role-gated and access-logged. Customer PDFs, full prompts, tool request/response bodies, raw completions.
5. Restricted Full Payload Retain full payload only when legally or operationally required. Segregated encrypted vault or WORM storage. Strict audit/security/legal roles; all access logged. Regulated evidence, high-impact transaction records, legal holds.
6. Incident Mode Temporarily escalate capture under declared incident policy. Forensic vault with short TTL or legal hold. Incident/security team with approval trail. Active attack traces, compromised tool sessions, outage forensics.

Metadata can itself be sensitive. Tenant hashes, source IDs, route choices, policy decisions, and timing records can reveal business activity. Access rules should therefore apply to both payloads and metadata.

GRC Audit Query Playbook

To support audits, investigations, dispute resolution, and regulatory inspections, the verification framework should map common audit questions to the artifacts and replay modes required to answer them.

Audit Question Required Verification Artifacts Target Trace Fields / Metadata Optimal Replay Mode Technical Validation Path
Why did the model cite this source? Retrieval Snapshot, Output Lineage, Model/Route Manifest. retrieval.selected_chunk_ids, retrieval.citation_map_ref, output.citation_injected_hash. Forensic Replay. Verify claim-to-source mapping, source version, and evidence support.
Was the source document actually retrieved? Retrieval Snapshot. retrieval.index_id, retrieval.selected_chunk_ids, retrieval.evidence_refs. Audit-Only. Match chunk IDs and source hashes against retrieval trace and access log.
Was the source version current? Retrieval Snapshot, Reproducibility Envelope. retrieval.source_version_hashes, retrieval.stale_flag, env.retrieval.corpus_version. Audit-Only. Compare source hashes and freshness metadata against source registry.
Which prompt version produced this answer? Prompt Lineage Record, Reproducibility Envelope. prompt.template_id, prompt.template_version, prompt.compiled_hash. Constrained Replay. Recompile template under recorded envelope and compare compiled prompt hash.
Did routing silently downgrade the model? Model and Route Manifest. route.requested_route, route.selected_route, route.routing_reason, route.user_disclosure_shown. Audit-Only. Compare requested vs selected route and disclosure requirement.
Which tool call failed, and did it succeed later? Tool Execution Record, State Ledger. tool.error_status, tool.verification_status, state.transitions. Forensic Replay. Walk action ledger and post-action verification state.
Did the tool actually execute, or did the model hallucinate success? Tool Execution Record, Output Lineage. tool.state_post_hash, tool.verification_status, output.final_hash. Forensic Replay. Verify source-of-record state and compare completion claim to verified state.
Was confirmation shown before transaction execution? Tool Execution Record, Policy Check Record, Human Approval Record. tool.user_confirmation_ref, policy.decision, approval.status. Forensic Replay. Confirm approval/consent existed before high-impact action boundary.
Did the user or reviewer edit the value? User Edit and Output Diff Record. edit.diff_ref, edit.editor_hash, edit.edit_class, edit.preserved_in_final. Counterfactual Replay. Compare draft hash, diff hash, and final output hash.
Did the model overwrite a correction? User Edit Record, Intermediate State Ledger, Output Lineage. edit.preserved_in_final, state.transitions, output.final_submitted_hash. Counterfactual Replay. Compare subsequent outputs and state transitions against edited artifact.
Which policy rule fired? Policy Check Record. policy.rule_ids, policy.bundle_version, policy.decision, policy.enforcement_point. Audit-Only. Verify rule execution parameters and policy bundle version.
Who approved the override? Policy Check Record, Human Approval Record, Evidence Chain. policy.override_status, policy.reviewer_hash, approval.signature_ref. Audit-Only. Verify approval signature or audit record against trusted identity registry.
Did the final UI hide a warning? Output Lineage, Policy Check Record, UI/Render Manifest. output.rendered_hash, policy.rule_ids, route.user_disclosure_shown. Forensic Replay. Re-render under recorded UI version and verify warning/disclosure presence.
Would a different model have behaved differently? Verification Bundle, Reproducibility Envelope, Evaluation Rubric. route.selected_route, env.model.snapshot_ref, prompt.compiled_hash. Counterfactual Replay. Change one declared model/route variable and evaluate behavior under same rubric.
Was sensitive data over-captured? Privacy Capture Record, Evidence Chain, Access Logs. chain.redaction_status, capture_class, access_policy_id. Audit-Only. Verify capture class, retention class, and access events against privacy policy.

Artifact Lifecycle and Storage Architecture

Verification artifacts must be retained according to risk, regulatory obligations, incident status, tenant contract, and operational debugging value. Storage architecture should therefore be policy-based rather than hardcoded to one fixed age ladder. A low-risk FAQ trace, a failed high-impact payment action, a legal hold, and an active security incident should not share the same retention path just because the calendar was feeling democratic.

EVIDENCE STORAGE LIFECYCLE

+----------------------------------------------------------+
| Artifact Created                                         
|  bundle, trace, snapshot, manifest, diff, replay package  
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Classification                                            
|  risk class                                               
|  capture class                                            
|  retention class                                          
|  legal hold                                               
|  tenant / regulatory policy                               
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Hot Operational Store                                     
|  short debugging window                                   
|  fast lookup                                              
|  redacted or reference-first payloads                     
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Warm Audit Store                                          
|  content-addressed objects                                
|  signed manifests                                         
|  scoped metadata indexes                                  
|  access logging                                           
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Cold Compliance / Legal Hold                              
|  WORM or immutable archive where required                 
|  slower retrieval                                         
|  legal-hold controls                                      
|  deletion review                                          
+--------------------------+-------------------------------+
                           |
                           v
+----------------------------------------------------------+
| Expiry / Deletion / Preservation Decision                 
|  purge                                                    
|  retain under legal hold                                  
|  anonymize / aggregate                                    
|  preserve hash-only evidence                              
+----------------------------------------------------------+

The operational profiles, retrieval expectations, and redundancy targets of these storage tiers are specified below:

Storage Tier Retention Window Retrieval Expectation Storage Pattern Encryption and Access Typical Contents
Hot Operational Short operational window defined by retention class. Fast enough for debugging and incident triage. Operational database, trace store, short-TTL object store, or managed equivalent. Tenant-scoped encryption, role-based access, audit logs. Metadata, hashes, redacted snippets, recent traces, pending action state.
Warm Audit Medium retention window defined by audit and product policy. Hours to day-scale retrieval depending on policy. Content-addressed object storage, signed bundle registry, audit index. Strong access control, access logging, scoped evidence views. Verification bundles, manifests, secure payload references, signed hashes, reviewer artifacts.
Cold Compliance Long retention window for regulated, contractual, or legal-hold artifacts. Slower retrieval acceptable. WORM storage, immutable archive, Glacier-like storage, or compliant managed archive. Legal/compliance access controls, key management, deletion controls. Signed evidence chains, legal-hold bundles, high-impact action records, compliance artifacts.
Forensic Vault Incident-scoped retention window or legal-hold period. Fast during incident, policy-controlled afterward. Segregated forensic storage with restricted access and enhanced logging. Security/legal approval, immutable access logs, strict retention review. Attack traces, compromised sessions, break-glass records, raw payloads when justified.
Hash-Only Archive Long-lived integrity reference. Lightweight lookup. Metadata ledger or content-addressed hash registry. Lower exposure than payload stores but still access controlled. Content hashes, chain hashes, artifact IDs, deletion tombstones, proof-of-existence records.

Evidence indexes should be searchable through authorized metadata views across keys such as trace_id, artifact_id, workflow_id, prompt_version, model_route, document_id, tool_call_id, approval_request_id, and final_output_hash. Raw tenant_id, session_id, and user identifiers should be avoided in broad indexes; use scoped hashes or access-controlled identity joins.

Storage policy must also support:

Lifecycle Control Requirement
Legal Hold Suspends deletion and records authority, scope, and reason.
Deletion Eligibility Determines when payloads, metadata, hashes, or references may be purged.
Payload Minimization Allows raw payload deletion while preserving hash-bound proof where legally permitted.
Access Logging Records every access to sensitive payload references and restricted bundles.
Key Management Supports tenant-scoped keys, rotation, revocation, and recovery procedures.
Redaction Reprocessing Allows artifacts to be reclassified or re-redacted when policy changes.
Integrity Verification Periodically verifies hash chains, signatures, manifests, and storage references.

Cross-Canon Handoff Map

The Verification Artifacts architecture integrates across the entire AI Systems Engineering Canon because every major subsystem eventually needs evidence for audit, replay, incident response, evaluation, governance, and dispute resolution.

Target Report ID Target Report Domain Verification Artifact Handoff Dependency / Engineering Integration Rules
AI-ENG-B Context and State Governance Context snapshots, memory refs, state hashes, compaction records. Preserve enough context lineage to prove what was available, included, omitted, or summarized.
AI-ENG-D Corpus Engineering Source provenance, corpus manifests, document version hashes, ingestion records. Evidence artifacts must bind claims to source lineage and corpus version.
AI-ENG-E Retrieval Pipeline Retrieval snapshots, query hashes, source refs, citation maps, permission filters. Retrieval evidence must prove authorization, source version, selected chunks, omitted chunks, and citation bindings.
AI-ENG-F Freshness and Conflict Detection Freshness flags, conflict signals, source-age metadata, stale-cache records. Audit trails must show whether outputs relied on stale, conflicting, or unresolved evidence.
AI-ENG-G Model Selection Model/route manifest, requested route, selected route, fallback reason. Silent downgrades, provider shifts, and capability changes must be replayable.
AI-ENG-H Model Adaptation Adapter hashes, fine-tune manifests, calibration artifacts, regression records. Adapted model behavior must be traceable to training/adaptation lineage.
AI-ENG-I Regression Control Baseline artifacts, failed-run bundles, diff records, gate outcomes. Regression investigations require preserved inputs, outputs, routes, and evaluation artifacts.
AI-ENG-J Throughput Mechanics Token metrics, latency spans, serving-envelope data, queue and batch policy. Performance regressions must be tied to reproducibility envelopes and serving conditions.
AI-ENG-K Weight Dynamics Quantization/compression manifests, runtime kernel settings, model snapshots. Optimization-induced behavior changes require route and model-state evidence.
AI-ENG-L Serving Architecture Provider manifest, route manifest, cache state, fallback path, deployment hash. Serving decisions must be reconstructable from route and environment artifacts.
AI-ENG-M Agentic Orchestration State ledger, plan graph refs, loop counters, branch/retry records. Agent failures require trajectory evidence without exposing hidden reasoning.
AI-ENG-N Tool Contracts Tool schema versions, payload hashes, authorization records, idempotency hashes. Tool calls must be auditable from planned call through verified post-action state.
AI-ENG-O Action Verification Pre/post state hashes, action ledger, verification status, compensation refs. Completion claims require state-backed verification artifacts.
AI-ENG-P Multimodal Understanding Parser manifests, OCR/layout refs, coordinates, media hashes, evidence regions. Visual/document claims must bind to source coordinates or secure evidence refs.
AI-ENG-Q Speech and Realtime Interaction Transcript hashes, confirmation records, voice fallback state, endpointing metadata. High-impact voice actions need replayable confirmation and transcript evidence.
AI-ENG-R UI Agents UI state snapshots, DOM/screenshot hashes, action refs, post-action verification. UI automation must preserve observe-act-verify evidence.
AI-ENG-S Production Pathologies Malformed output records, repair-loop artifacts, false-success traces. Pathologies require typed evidence for debugging and regression tests.
AI-ENG-T Boundary Defense Tenant hashes, authorization decisions, prompt-injection evidence, redaction records. Security boundaries must be provable without leaking the protected payload.
AI-ENG-U Supply Chain Security Signed manifests, dependency locks, parser/tool/container hashes. Evidence trails must bind behavior to approved supply-chain artifacts.
AI-ENG-V Resource Abuse Cost ledgers, token-burn records, retry artifacts, quota state. Cost bombs and runaway loops need replayable consumption evidence.
AI-ENG-W UX Resilience Degraded-mode state, fallback disclosures, partial-answer records, saved-state refs. User-visible degraded states must be backed by evidence of preserved state and disclosure.
AI-ENG-X User Trust and Transparency Disclosure records, citation UX state, contestability records, user correction diffs. Trust claims must be tied to what the user actually saw and could contest.
AI-ENG-Y High-Impact Workflow Design Approval packets, maker/checker records, escalation packages, break-glass artifacts. High-impact decisions require signed approval, reviewer, and escalation evidence.
AI-ENG-Z Strategic Telemetry Trace IDs, span IDs, token metrics, latency records, semantic drift signals. Telemetry provides sensor data; verification artifacts preserve durable evidence.
AI-ENG-AA Evals Architecture Golden traces, case versions, labels, rubrics, regression gate artifacts. Production failures and corrected outputs should become reproducible evaluation artifacts where appropriate.
AI-ENG-AC Incident Response and Remediation Forensic bundles, containment timeline, compromised artifact refs, recovery evidence. Incidents require evidence chains for scope, timeline, root cause, and remediation.
AI-ENG-AD Governance and Compliance Retention class, policy bundle, approval authority, audit signatures, access logs. Governance defines capture obligations, retention, deletion, legal hold, and exception approvals.
AI-ENG-AJ Reference Architectures Verification bundle schema, replay package pattern, evidence store, audit ledger. Reference architectures should include evidence trails as first-class infrastructure.

Durable Principles of Verification Artifacts

  1. Logs Are Not Evidence Trails
    Logs show that something happened somewhere. Verification artifacts preserve enough structured evidence to reconstruct what happened, why it happened, what data shaped it, and what state resulted.

  2. Auditability Requires Scope, Not Hoarding
    More raw data is not automatically better auditability. Better auditability comes from scoped hashes, secure references, source versions, policy decisions, state checks, and signed manifests.

  3. Tamper-Evident Does Not Mean True
    A signed artifact proves integrity of the recorded evidence. It does not prove the underlying decision was correct unless the evidence, checks, and policies were also adequate.

  4. Replay Has Levels
    Exact replay, constrained replay, semantic replay, counterfactual replay, forensic replay, and audit-only verification serve different purposes. Do not promise bit-exact reproduction when the environment cannot support it.

  5. The Reproducibility Unit Is the Whole Runtime
    Model ID alone is not enough. Prompt compiler, retrieval index, policy bundle, route manifest, tool schema, feature flags, cache state, and deployment environment all shape behavior.

  6. Hidden Reasoning Is Not an Audit Primitive
    Preserve decisions, state transitions, prompts, evidence, policy checks, tool calls, and outputs. Do not require private chain-of-thought capture to explain system behavior.

  7. Tool Claims Require State Evidence
    “The model said it sent the invoice” proves nothing. Tool records must preserve authorization, payload hash, idempotency, execution status, and post-action verification.

  8. Sensitive Payloads Need Capture Classes
    Never-capture, hash-only, metadata-only, redacted snippet, reference-only, restricted full payload, and incident-mode capture should be explicit policy states.

  9. Evidence Storage Must Follow Risk and Law
    Retention windows should be driven by risk class, contract, regulation, incident state, legal hold, and operational need—not by vibes or one universal timer.

  10. Every Artifact Needs Ownership
    Prompt records, route manifests, retrieval snapshots, policy checks, approval records, and replay packages need owners, version history, and access rules.

  11. Evidence Must Support Dispute Resolution
    A user correction, reviewer override, citation dispute, model downgrade, or failed action should be traceable through artifacts rather than reconstructed from memory and regret.

  12. Verification Infrastructure Is Governance Infrastructure
    If the system cannot prove what happened, it cannot be governed. If it cannot be governed, it should not be trusted with consequential autonomy.

Works cited

  1. AI-ENG-AA — Evals Architecture: Ground Truth, Golden Sets & Regression Tests
  2. AI-ENG-Z — Strategic Telemetry - Traces, Tokens, Latency & Semantic Drift
  3. Recon.AI — Trust Accounting Platform for AI systems, accessed June 11, 2026, https://reconai.net/
  4. Content-Addressed Evidence Storage & Preservation … - SEC.gov, accessed June 11, 2026, https://www.sec.gov/files/ctf-written-fcck-pilot-evidence-02-16-2026.pdf
  5. [KNOWLEDGE] - AI Engineering - Volume 8. W-Y Resilience, Degraded Modes, and Human Trust
  6. Provenance only gets attention after a messy document case : r/AI_Agents - Reddit, accessed June 11, 2026, https://www.reddit.com/r/AI_Agents/comments/1sco94r/provenance_only_gets_attention_after_a_messy/
  7. Model Lineage and Reproducibility: Tracking Provenance Across the ML Lifecycle, accessed June 11, 2026, https://agility-at-scale.com/ai/governance/model-lineage-and-reproducibility/
  8. Trace concepts - Azure Databricks Microsoft Learn, accessed June 11, 2026, https://learn.microsoft.com/en-us/azure/databricks/mlflow3/genai/tracing/tracing-101
  9. OpenInference Specification - GitHub Pages, accessed June 11, 2026, https://arize-ai.github.io/openinference/spec/
  10. Trace Concepts MLflow AI Platform, accessed June 11, 2026, https://mlflow.org/docs/latest/genai/concepts/trace/
  11. Propagation - Python - OpenTelemetry, accessed June 11, 2026, https://opentelemetry.io/docs/languages/python/propagation/
  12. What is a C2PA Manifest? Structure, Assertions, and Verification, accessed June 11, 2026, https://c2paviewer.com/articles/what-is-c2pa-manifest
  13. In-toto Attestations in Secure Pipelines - Emergent Mind, accessed June 11, 2026, https://www.emergentmind.com/topics/in-toto-attestations
  14. Provenance Driven Identity Trust Architecture, accessed June 11, 2026, https://identitymanagementinstitute.org/provenance-driven-identity-trust-architecture/
  15. Technology - Ar.io, accessed June 11, 2026, https://ar.io/technology/
  16. C2PA Implementation Guidance, accessed June 11, 2026, https://spec.c2pa.org/specifications/specifications/2.4/guidance/Guidance.html
  17. Blockchain-enforced data lineage architectures with formal verification workflows, accessed June 11, 2026, https://ijsra.net/sites/default/files/fulltext_pdf/IJSRA-2023-0559.pdf
  18. Defeating Nondeterminism in LLM Inference - Simon Willison’s Weblog, accessed June 11, 2026, https://simonwillison.net/2025/Sep/11/defeating-nondeterminism/
  19. Defeating Nondeterminism in LLM Inference - Thinking Machines Lab, accessed June 11, 2026, https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
  20. Defeating Nondeterminism in LLM Inference - OpenAI Developer Community, accessed June 11, 2026, https://community.openai.com/t/defeating-nondeterminism-in-llm-inference/1358623
  21. Openinference Semantic Conventions - Arize AX Docs, accessed June 11, 2026, https://arize.com/docs/ax/observe/tracing-concepts/openinference-semantic-conventions
  22. Poisoning Attacks in Federated Learning: An Accountability-Oriented Survey with Centralized Learning as Baseline - Preprints.org, accessed June 11, 2026, https://www.preprints.org/manuscript/202605.1674
  23. C2PA for Developers: Implementation Guide, accessed June 11, 2026, https://c2pa.ai/for-developers
  24. An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats - arXiv, accessed June 11, 2026, https://arxiv.org/html/2606.09686v1
  25. Fighting Misinformation with Authenticated C2PA Provenance Metadata - IPTC, accessed June 11, 2026, https://iptc.org/std/MediaProvenance/Documents/NAB%20Paper%20-%20Formatted%20Version.pdf
  26. envelope.md - in-toto/attestation - GitHub, accessed June 11, 2026, https://github.com/in-toto/attestation/blob/main/spec/v1/envelope.md
  27. mlflow.tracing, accessed June 11, 2026, https://mlflow.org/docs/latest/api_reference/python_api/mlflow.tracing.html
  28. Attribute Mapping Reference MLflow AI Platform, accessed June 11, 2026, https://mlflow.org/docs/latest/genai/tracing/opentelemetry/attribute-mapping/
  29. Agent Trace Evaluation with TruLens Scorers in MLflow, accessed June 11, 2026, https://mlflow.org/blog/mlflow-trulens-evaluation/
  30. From Attack Simulation to SIEM Rule: Deterministic Detection-as-Code Synthesis with Probe-Level Traceability - arXiv, accessed June 11, 2026, https://arxiv.org/pdf/2606.05252
  31. [KNOWLEDGE] - AI Engineering - Volume 7. S-V Failure, Security, and Hostile Environments
  32. What do you use for tamper-evident audit logs? Looking for approaches beyond “ship to S3”, accessed June 11, 2026, https://www.reddit.com/r/Observability/comments/1rwm9vh/what_do_you_use_for_tamperevident_audit_logs/
  33. QCIVET: A Quantum–Classical Pipeline Integrity Framework with Contract-Based Subtype Verification and Hash-Chained Audit Traces - arXiv, accessed June 11, 2026, https://arxiv.org/html/2605.13109v1
  34. Cosign Projects - AI Tinkerers - Bogotá, accessed June 11, 2026, https://bogota.aitinkerers.org/technologies/cosign
  35. What Is Sigstore? Keyless Signing for the Software Supply Chain - Sbomify, accessed June 11, 2026, https://sbomify.com/2024/08/12/what-is-sigstore/
  36. New SIG proposal: E2E Model Provenance · Issue #40 · ossf/ai-ml-security - GitHub, accessed June 11, 2026, https://github.com/ossf/ai-ml-security/issues/40
  37. Docker Image signing and attestation with Cosign - AugmentedMind.de, accessed June 11, 2026, https://www.augmentedmind.de/2025/03/02/docker-image-signing-with-cosign/
  38. Guidance for Artificial Intelligence and Machine Learning - C2PA Specifications, accessed June 11, 2026, https://spec.c2pa.org/specifications/specifications/2.4/ai-ml/ai_ml.html
  39. A Study on Broker-Assisted Blockchain Trust Chains for Provenance and Integrity Verification of Generative Media Using Watermarking, Semantic Fingerprinting, and C2PA - MDPI, accessed June 11, 2026, https://www.mdpi.com/2076-3417/16/7/3391
  40. Evidentia™ T1 — Institutional Evidence Infrastructure, accessed June 11, 2026, https://evidentiat1.com/
  41. (PDF) From Policy to Pipeline: A Governance Framework for AI Development and Operations Pipelines - ResearchGate, accessed June 11, 2026, https://www.researchgate.net/publication/399025393_From_Policy_to_Pipeline_A_Governance_Framework_for_AI_Development_and_Operations_Pipelines

Attribution

Part of Stunspot’s Guide to AI SystemsThe AI Engineering Systems Canon.

Created by Sam “stunspot” Walker / Collaborative Dynamics.

Repository: https://github.com/Stunspot/stunspots-guide-to-ai-systems
Stunspot: https://stunspot.com
Collaborative Dynamics: https://www.collaborative-dynamics.com
Discord: https://discord.gg/stunspot

Licensed under CC BY-NC-SA 4.0 unless otherwise stated.
Commercial use, resale, paid redistribution, inclusion in commercial training products, or incorporation into paid knowledge-base products requires prior written permission.

← Back to Canon Map