AI-ENG-AI — Contract Thinking - Deterministic Edges Around Probabilistic Cores
Doctrinal Introduction & The Contract Thinking Doctrine
In high-dimensional artificial intelligence engineering, the primary challenge is not the elimination of model non-determinism, but rather the containment of that non-determinism within safe, verifiable boundaries. This paradigm, formulated as probabilistic containment architecture, recognizes that the cognitive flexibility of large language models is highly valuable during internal inference, but presents severe risks when allowed to interface directly with stateful, authoritative, or public-facing systems. Good systems design does not attempt to force the entire model into absolute predictability; instead, it establishes a rigid, deterministic envelope around the model’s probabilistic core. The model may execute complex synthesis, logical deduction, and creative planning within this bounded space, but every boundary crossing—or seam—must conform to a strict, typed, and enforceable contract.
This methodology generalizes Bertrand Meyer’s Design by Contract (DbC) framework, originally formulated in 1986 to improve traditional software reliability.1 Meyer identified correctness—the ability of software to perform its exact tasks as defined by its specification—as the supreme quality of software, supplemented by robustness, efficiency, extendibility, reusability, and compatibility.3 While defensive programming suggests adding redundant checks everywhere, it leads to code complexity—the primary obstacle to software quality.2 DbC advocates an “offensive programming” posture: establish clear preconditions, postconditions, and invariants, and “fail hard” immediately when assertions are violated, simplifying debugging and ensuring correctness.1
Applying the Liskov Substitution Principle to AI contract boundaries establishes clear constraints on function execution 4:
- Preconditions cannot be strengthened: An implementation must not accept a narrower range of input than the interface specification dictates.4
- Postconditions cannot be weakened: An implementation must not return a wider, more ambiguous range of output than the specification guarantees.4
- Invariants cannot be weakened: An implementation must maintain core system state within specified tolerances throughout execution.4
In AI engineering, the model is inherently probabilistic. We cannot make the model itself fully deterministic. The useful question is: Can we make every boundary around the model explicit, enforceable, observable, testable, and safe to breach-handle? Inside the box, the model operates with probabilistic freedom; at the edges, it must sign deterministic paperwork.
To establish this discipline, systems architects must maintain clean distinctions across multiple operational dimensions:
- Flexibility versus Authority: The generative core is permitted to reason, draft, and propose, but it holds zero administrative authority. Any execution of state changes, data mutation, or outbound communication must be mediated by a deterministic runtime that validates permissions and rules before execution.5
- Schema Validity versus Truth: A model output conforming perfectly to a JSON schema is structurally valid, but syntactic correctness does not imply factual accuracy. Structuring constraints ensure format compliance; separate semantic and factual verifications are required to confirm truth.7
- Prompt Instruction versus Enforceable Boundary: Instructions embedded within prompts (e.g., “never reveal system keys”) are instructions, not security controls.6 Actual boundaries must be enforced by isolated software filters, input-scrubbing routines, and policy-as-code sidecars that are completely outside the model’s context window.
- Model Capability versus System Permission: A model’s computational ability to generate a tool invocation, generate SQL, or draft a financial transaction must never be confused with systemic authorization to perform that action.5
- Generated Action versus Authorized Action: Actions drafted by the model are treated as unauthenticated proposals. They are only translated into physical side effects after passing through a deterministic authorization engine (such as Open Policy Agent or Cedar).6
- Grounding versus Citation-Shaped Decoration: The presence of inline footnotes and citation links in a generated output often serves as a persuasive UX element rather than factual proof.8 Rigid grounding contracts must programmatically verify that claims are supported by retrieved sources, with proper temporal and logical qualifiers.7
- Memory versus Surveillance: Persistent agentic memory must not become an unconstrained data sink. It requires strict tenant isolation, local pseudonymization, automatic retention rules, and user-controlled deletion paths to prevent liability and privacy leakage.9
- Eval Score versus Release Contract: An eval is a formal release gate, not a subjective vibe check.14
- Logs versus Compliance Evidence: Raw diagnostic logs track errors; immutable, cryptographically signed events serve as compliance evidence.11
- User Trust versus User Expectation Contract: UX must cultivate calibrated trust, ensuring users rely on the system when correct and override it when it fails.17
- Vendor Route versus Model-Route Contract: Gateway abstraction maintains system stability across changing provider deployments.20
- Fallback versus Contract Downgrade: Fault tolerance must preserve safety properties even when dropping back to less capable local models.20
Conceptual Glossary
The following standardized terms form the core vocabulary of the contract thinking doctrine and are defined for uniform engineering application:
| Term | Doctrinal Definition |
|---|---|
| Contract Thinking | The engineering practice of placing deterministic, typed, versioned, enforceable, observable, and testable interfaces around probabilistic model behavior at every architectural seam. |
| Probabilistic Core | The internal non-deterministic execution domain of a generative model where token generation, reasoning, and conceptual synthesis occur. |
| Deterministic Edge | The rigid, software-enforced boundary surrounding a probabilistic core that validates inputs, filters outputs, enforces budgets, checks permissions, and handles failures. |
| Contract Surface | Any distinct architectural seam where model outputs or inputs interface with structured systems, human users, external APIs, or storage. |
| Contract Stack | The layered hierarchy of active agreements (from user expectations to deployment manifests) that must remain aligned to prevent silent system failures. |
| Schema Contract | A programmatically enforced definition of output structure, typing, and constraints, typically implemented via constrained decoding at the token level.22 |
| Prompt Contract | A versioned, testable deployment configuration that formalizes the task boundary, instruction hierarchy, and input/output expectations of a model step. |
| Retrieval Contract | An agreement governing context injection, specifying source authority, permission filtering, freshness tolerances, and citation requirements.14 |
| Tool Contract | A highly specific API interface that maps natural language intents to deterministic code execution, enforcing schema validation, auth, and idempotency.5 |
| Permission Contract | A zero-trust authorization boundary that separates model-generated proposals from system-level action execution based on the underlying user’s identity.6 |
| Resource Contract | A set of hard runtime constraints governing execution envelopes, including token budgets, call limits, latency ceilings, and cost budgets.16 |
| Memory Contract | Rules governing long-term writable state, defining write validation, retention periods, read scoping, and on-device privacy mapping.12 |
| Model Route Contract | The operational envelope defining which task classes and risk tiers are mapped to specific models, providers, and fallbacks.21 |
| Eval Contract | The quantifiable release gate specifying the performance benchmarks, regression thresholds, and behavioral tests a configuration must pass to ship. |
| Deployment Contract | A cryptographically signed or version-pinned bundle containing prompts, schemas, routing policies, and verification artifacts. |
| Observability Contract | The schema defining tracing spans, telemetry events, redacted logs, and compliance evidence emitted during system execution.16 |
| User Expectation Contract | The user interface patterns and disclosure frameworks that align human trust with actual system competence, preventing automation bias.17 |
| Breach Behavior | The deterministic error-handling pipeline executed immediately when any contract boundary is violated at runtime. |
| Contract Drift | A progressive misalignment between layered contracts (e.g., a product promise exceeding model capability) that leads to silent, systemic failure. |
The Contract Stack Model & Drift Diagnostic
AI systems fail when contract layers silently diverge. A product promise may imply current policy compliance, while retrieval pulls stale documents. A schema may require citations, while the citation verifier only checks that a field exists. A UI may show an action as complete, while the downstream transaction failed. Contract thinking prevents these failures by aligning every layer from user expectation to deployment evidence.
CONTRACT STACK
[ User Expectation Contract ]
|
[ Product / Workflow Contract ]
|
[ Policy / Permission Contract ]
|
[ Prompt / Context Contract ]
|
[ Retrieval / Memory Contract ]
|
[ Model Route Contract ]
|
[ Schema / Output Contract ]
|
[ Tool / Action Contract ]
|
[ Eval / Verification Contract ]
|
[ Deployment / Observability Contract ]
Each layer must be supported by the layers beneath it. The user interface must not promise a capability the product workflow cannot verify. The prompt must not request evidence the retrieval system cannot supply. The model route must not be assigned to a task class it has not passed. The tool contract must not execute an action the permission contract has not authorized.
Contract Drift Diagnostic
| Drift Pattern | Detection Signal | Safe Runtime Response | Structural Owner | Preventive Practice |
|---|---|---|---|---|
| User Promise vs. Product Capability | Users complain that the system did not do what the interface implied. | Downgrade claim, add disclosure, route to human or safer workflow. | Product Owner | Product promises must map to tested capabilities and known limits. |
| Product Workflow vs. Model Route | Complex/high-risk tasks routed to low-capability or low-cost model. | Re-route to approved model class or block with explanation. | Product Architect / Platform Owner | Route manifests tied to task class, risk tier, and eval evidence. |
| Prompt vs. Retrieval | Prompt asks for current/grounded answer but retrieval lacks current or authoritative sources. | Refuse, ask for source, or disclose unsupported status. | Prompt / Retrieval Owner | Prompt context requirements must be checked against corpus metadata. |
| Retrieval vs. Freshness | Retrieved source is stale, superseded, or outside time scope. | Remove stale source, rerun retrieval, disclose conflict or absence. | Corpus / Retrieval Owner | Freshness metadata, source-of-record hierarchy, index lifecycle controls. |
| Schema vs. Truth | Output passes schema validation but fails citation, math, or policy checks. | Block downstream use; send to factual/semantic repair or human review. | Eval / Verification Owner | Multi-layer validation beyond JSON shape. |
| Schema vs. Tool Contract | Model emits structurally valid arguments that violate tool preconditions. | Deny tool call; return structured breach reason. | Tool Owner | Shared typed schemas, preconditions, and postcondition tests. |
| Permission vs. Tool Execution | Tool call authorized by prompt language but not by user identity, role, tenant, or resource policy. | Block before side effect; log policy decision. | Security / Governance Owner | External policy engine, ABAC/RBAC, user-bound delegation. |
| Fallback vs. Risk Tier | Failover route has lower capability, weaker policy, or different data handling. | Degrade authority, disable actions, or route to human. | Platform SRE | Fallbacks must preserve safety properties, not just availability. |
| Memory vs. Authority | Memory from prior sessions influences a task outside active scope. | Restrict memory read, clear active memory, or require user confirmation. | Memory / Privacy Owner | Principal-scoped memory, provenance tags, retention and deletion rules. |
| Eval vs. Production Reality | Offline evals pass but production corrections, incidents, or complaints rise. | Freeze rollout, sample production failures, update eval set. | Evaluation Owner | Production-like eval data and continuous drift monitoring. |
| Cost vs. Loop Design | Spend spike from retries, tool loops, long context, or agent recursion. | Terminate loop, return partial result, alert owner. | Platform / FinOps Owner | Hard budgets, retry caps, loop limits, task-level cost telemetry. |
| Observability vs. Evidence | Logs exist but cannot prove what happened for audit or incident review. | Preserve scoped evidence package; open evidence-quality issue. | Observability / Compliance Owner | Separate telemetry schema from audit-evidence schema. |
Drift Response Rule
When drift is detected, the system should prefer:
block unsafe action
downgrade authority
disclose uncertainty
reroute only to approved routes
preserve scoped evidence
open review when drift repeats
It should not silently “repair” contract drift in ways that hide misalignment. A patched-looking response can be worse than a clean refusal.
Contract Surface Inventory
A contract surface is any seam where probabilistic behavior touches structured systems, users, tools, memory, retrieval, policy, deployment, or evidence. Every surface needs an owner, enforcer, validation method, breach behavior, and evidence boundary.
| Contract Surface | Purpose | Allowed Inputs | Expected Outputs | Owner | Enforcer | Validation | Default Breach Behavior | Evidence Boundary |
|---|---|---|---|---|---|---|---|---|
| Prompt Contract | Defines task, instruction hierarchy, allowed behavior, refusal behavior. | Typed variables, task state, approved context blocks. | Model request package and expected response mode. | Prompt / Product Owner | Prompt compiler / Gateway | Prompt unit tests, eval linkage, version checks. | Use last stable prompt or block route. | Prompt version, hash, route ID, input references. |
| Context Contract | Controls what state enters the model. | User input, session state, retrieved context, system variables. | Permission-filtered context bundle. | Context Owner | Gateway / Context Builder | ACL/RLS, redaction, source allowlist, token budget. | Remove unauthorized context or abort. | Context manifest, source IDs, redaction record. |
| Schema Contract | Makes output machine-readable. | Model output stream or structured decoding result. | Typed object or validation failure. | API / Schema Owner | Parser / Validator | JSON/schema validation, enum checks, required fields. | Retry once if safe; otherwise fail closed. | Schema version, validation result, error class. |
| Semantic Contract | Checks whether field values make business sense. | Parsed object. | Semantically valid object or violation. | Domain Owner | Business-rule validator | Ranges, invariants, cross-field checks. | Reject or route to human. | Rule version, failed invariant, secure payload reference. |
| Retrieval Contract | Defines evidence entering generation. | Query, user permissions, task scope. | Ranked, permissioned, fresh source set. | Retrieval / Corpus Owner | Retriever / Permission Filter | Source authority, freshness, dedupe, relevance, conflict checks. | Refuse grounded answer or disclose unsupported state. | Source IDs, timestamps, retrieval manifest. |
| Grounding Contract | Verifies claims against evidence. | Generated claims, cited sources. | Supported, unsupported, or conflicting claim labels. | Evaluation / RAG Owner | Citation / Claim Verifier | Claim-to-evidence, qualifier, time-scope, contradiction checks. | Block or mark unsupported claims. | Claim IDs, source refs, verifier result. |
| Memory Contract | Governs persistent writable state. | User-approved memory candidates, feedback, preferences. | Scoped memory write/read or rejection. | Memory / Privacy Owner | Memory Controller | Consent, provenance, scope, retention, conflict, delete rules. | Reject memory write or restrict read. | Memory event hash, provenance, retention class. |
| Tool Contract | Converts proposed actions into deterministic calls. | Typed tool arguments and actor context. | Execution result, denial, or error. | Tool Owner | Execution Harness | Preconditions, schema, idempotency, postconditions. | Deny before side effect or compensate if needed. | Action ID, idempotency key, result status. |
| Permission Contract | Determines whether a subject may perform an action. | Subject, tenant, resource, action, arguments, environment. | Allow, deny, require approval, or escalate. | Security / Governance Owner | Policy Engine | RBAC/ABAC, tenant, risk, approval, separation-of-duty rules. | Deny by default. | Policy version, decision, reason codes. |
| Resource Contract | Limits spend, latency, concurrency, retries, loops, and context. | Request, route, session, tenant, budget state. | Continue, throttle, degrade, or terminate. | Platform / FinOps Owner | Gateway / Runtime Monitor | Token, time, cost, queue, retry, loop limits. | Stop loop, return partial, or route to human. | Budget event, usage counters, route ID. |
| Model Route Contract | Maps task class and risk tier to model/provider/fallback. | Task class, data class, risk tier, latency/cost constraints. | Approved route or refusal. | Platform Owner | Model Gateway | Route manifest, eval pass, policy compatibility. | Use approved fallback or fail closed. | Route manifest, provider snapshot, eval status. |
| Eval Contract | Defines release and regression gate. | Candidate prompt/schema/route/tool package. | Pass, fail, conditional pass, or block. | Eval Owner | CI/CD / Eval Harness | Task metrics, safety tests, regression thresholds. | Block deployment or rollback. | Eval report, dataset version, manifest hash. |
| Deployment Contract | Packages active production configuration. | Prompts, schemas, policies, model routes, tools, eval artifacts. | Signed deployment manifest. | SRE / Release Owner | CI/CD Pipeline | Hashes, signatures, approval checks, canary health. | Halt rollout or rollback. | Manifest, approval, rollout status. |
| Observability Contract | Defines runtime telemetry. | Spans, metrics, errors, redacted event fields. | Structured telemetry stream. | Observability Owner | Telemetry Pipeline | Schema validation, redaction, exporter health. | Spool locally or alert. | Trace IDs, metrics, redacted spans. |
| Audit Evidence Contract | Preserves proof for compliance or incident review. | Scoped evidence records. | Tamper-evident evidence package. | Compliance / Security Owner | Evidence Store | Hashing, signatures, retention, access controls. | Preserve minimal required evidence; restrict access. | Evidence package ID, hashes, secure references. |
| User Expectation Contract | Aligns UI claims with system capability. | Rendered output, evidence, controls, disclosures. | Calibrated user action or review. | Product / UX Owner | UI Runtime | Disclosure checks, evidence display, action-state integrity. | Show uncertainty, block action, or require confirmation. | UI state, action confirmation, user decision event. |
| Vendor / Sourcing Contract | Governs external dependency. | Outbound request, data class, provider route. | Service result under contract terms. | Procurement / Legal Owner | Gateway / Vendor Manager | DPA, retention, SLA, no-training, subprocessor controls. | Failover, block route, or open vendor incident. | Vendor route event, contract reference, SLA record. |
The inventory should be maintained as a registry, not as prose buried in a design doc. If a surface has no owner, it has no contract.
Prompt Contract Specification
A prompt is not decorative prose or a venue for model coaxing; it is production code and must be engineered as versioned, testable, and deterministic configuration. Treating the model as the reliability layer is an architectural mistake.5 Instead, prompts must be written inside structured files (such as YAML, JSON, or TOML) that clearly outline the precise runtime conditions under which the model will execute.
The system context must also conform to a context contract. This contract specifies exactly what state enters the model, from where, and under what systemic authority.6 To prevent context bleed—which degrades model reasoning capacity by up to 90.2% when irrelevant data is carried across distinct tasks—sub-agents must be isolated by job function.5 Context isolation must be structurally enforced, ensuring that user data does not leak across session boundaries and system instructions remain shielded from malicious user manipulation.
The following Prompt Contract Template defines the schema and execution rules for a production-grade prompt configuration:
meta:
name: "InvoiceExtractionContract"
owner: "FinanceEngineeringGroup"
version: "2.4.1"
last_modified: "2026-03-15T09:30:00Z"
linked_eval_suite: "eval_invoice_extraction_v2"
rollback_target: "2.4.0"
change_log: "Added tax identifier extraction field and updated failure escalation rules."
task_boundary:
description: "Extract line-item billing information from raw unstructured text files."
allowed_languages: ["en", "de", "fr"]
max_input_length_characters: 50000
instruction_hierarchy:
system_directives: 1 # Highest priority, protected from user override
formatting_rules: 2
context_data: 3
user_inputs: 4 # Evaluated as untrusted payload only
input_assumptions:
expected_variables:
- name: "raw_invoice_text"
type: "string"
required: true
- name: "client_id"
type: "uuid"
required: true
context_sources:
- name: "organization_billing_rules"
origin: "postgres_db"
freshness_tolerance_seconds: 86400
behavioral_constraints:
allowed_behavior:
- "Extract merchant name, total, subtotal, tax amounts, line-items, and invoice date."
- "Perform currency code standardization using ISO-4217."
forbidden_behavior:
- "Do not calculate or infer missing values; report them as null."
- "Do not append conversational text, explanations, or markdown commentary outside the structured schema."
uncertainty_behavior:
low_confidence_threshold: 0.85
action: "Set extraction_confidence metric and populate low_confidence_fields array; do not omit the field."
refusal_and_escalation:
condition: "Text does not resemble an invoice or is written in an unsupported language."
response_mode: "explicit_refusal"
refusal_payload:
error_code: "UNSUPPORTED_DOCUMENT_TYPE"
message: "The provided document could not be identified as a valid invoice."
escalate_to: "manual_review_queue"
output_requirements:
format: "json_schema"
schema_reference: "invoice_extraction_schema_v3.json"
evidence_and_citation:
require_grounding: true
citation_scope: "line_level"
minimum_grounding_score: 0.90
Schema Contract Model & Multi-Layer Validation
Schemas turn model output into typed artifacts that software can inspect. A schema contract proves structure, not truth. A perfectly valid JSON object can still contain false claims, unsupported citations, unsafe recommendations, or unauthorized actions.
Structured-output mechanisms and constrained decoding can improve schema adherence, but production systems must still validate outputs at multiple layers.
SCHEMA VALIDATION STACK
[ Probabilistic Core ]
|
v
[ Structured Output / Constrained Decoding Where Supported ]
|
v
[ Serialization Validation ]
valid JSON / XML / CSV / function call envelope
|
v
[ Schema Validation ]
required fields | types | enums | object shape
|
v
[ Semantic Validation ]
domain rules | ranges | invariants | cross-field logic
|
v
[ Factual Validation ]
evidence | math | source-of-record | citations
|
v
[ Policy Validation ]
safety | privacy | tenant | compliance | risk tier
|
v
[ Action Validation ]
preconditions | authorization | idempotency | postconditions
|
v
[ Accepted Artifact or Breach Behavior ]
Validation Layers
| Layer | What It Proves | What It Does Not Prove | Enforcement Point | Breach Behavior |
|---|---|---|---|---|
| Serialization Validation | Output can be parsed. | Correct schema, truth, safety. | Parser. | Reject or request one safe repair. |
| Schema Validation | Output matches required shape and types. | Values are meaningful or factual. | JSON Schema, Pydantic, Zod, Protobuf, typed SDK. | Reject, retry once, or route to human depending on risk. |
| Semantic Validation | Values obey business rules and invariants. | Claims are sourced or policy-safe. | Business-rule validators. | Reject invalid fields or block artifact. |
| Factual Validation | Claims match evidence, math, or source-of-record data. | User should act or system has permission. | Grounding verifier, calculation engine, database check. | Mark unsupported, remove, refuse, or escalate. |
| Policy Validation | Output is allowed under safety, privacy, tenant, and compliance policy. | Output is useful or complete. | Policy engine / gateway. | Block or redact. |
| Action Validation | Proposed side effect satisfies preconditions, authorization, and postconditions. | User expectation is calibrated. | Tool execution harness. | Deny, compensate, rollback, or escalate. |
Schema Contract Requirements
| Requirement | Purpose |
|---|---|
| Schema version | Allows output changes to be tracked and tested. |
| Required fields | Prevents missing data from being silently ignored. |
| Closed object policy where appropriate | Prevents extra unreviewed fields from entering downstream code. |
| Explicit nullable fields | Distinguishes missing, unknown, and not applicable. |
| Enum constraints | Prevents invented status labels. |
| Numeric and string constraints | Enforces ranges and length where supported; always revalidate client-side. |
| Cross-field validators | Catches contradictions such as subtotal greater than total. |
| Evidence fields | Links claims to source IDs or verification references. |
| Breach behavior | Defines reject, repair, clarify, or escalate. |
Important Rule
Do not treat schema adherence as success.
valid JSON ≠ correct answer
valid schema ≠ grounded answer
grounded answer ≠ authorized action
authorized action ≠ completed action
completed action ≠ user understood the result
Each equality must be earned by a separate contract.
Retrieval Contract Model & RAG Grounding Verification
Retrieval determines what reality enters the model. The retrieval contract defines which sources may be used, who may access them, how fresh they must be, how conflicts are handled, and what level of evidence is required before the model may make a claim.
The core doctrine is:
No evidence, no grounded claim.
Weak evidence, weak claim.
Conflicting evidence, disclosed conflict.
Stale evidence, time-bounded claim.
Unauthorized evidence, no context injection.
Retrieval Contract Specification
| Contract Element | Required Definition |
|---|---|
| Source Authority | Which sources are authoritative for the task, and in what order. |
| Permission Filtering | Which user, tenant, role, or workflow may access each source. |
| Freshness Requirement | How current evidence must be for the task. |
| Time Scope | Whether the answer is valid as of a date, version, or policy period. |
| Deduplication Rule | How duplicate or near-duplicate sources are removed or clustered. |
| Conflict Policy | Whether conflicts are disclosed, escalated, or resolved by source hierarchy. |
| Citation Granularity | Document, page, paragraph, line, field, or record-level evidence. |
| Null-Evidence Behavior | Refuse, ask clarification, or answer explicitly as ungrounded. |
| Source Exclusion Disclosure | Whether omitted sources must be disclosed to users. |
| Grounding Verification | How claims are checked against cited evidence. |
| Retrieval Telemetry | Source IDs, rank, freshness, permission result, and verifier status. |
Grounding Verification Checks
| Check | Purpose |
|---|---|
| Claim-to-Evidence Alignment | Each atomic claim must be supported by cited evidence. |
| Citation Specificity | Citation points to the exact passage, record, page, or field used. |
| Qualifier Preservation | “Except,” “unless,” “only,” “as of,” and thresholds are preserved. |
| Temporal Validity | Source date/version matches the user’s time scope. |
| Source Authority | Claim uses the correct source-of-record, not merely a keyword match. |
| Conflict Disclosure | Conflicting sources are surfaced rather than flattened. |
| Cross-Chunk Synthesis | Logical links between chunks are themselves supported. |
| Null-Evidence Refusal | Plausible but unsupported questions do not trigger hallucinated answers. |
| Paraphrase Fidelity | Summary preserves obligations, exclusions, numbers, and conditions. |
| Answer-Context Sensitivity | Removing or changing evidence changes the answer appropriately. |
| Human Replayability | A reviewer can verify the claim from the cited evidence without archaeology. |
Retrieval Breach Behaviors
| Breach | Runtime Behavior |
|---|---|
| Source unauthorized | Remove source and rerun retrieval; log permission denial. |
| Source stale | Exclude or mark as stale; ask whether historical answer is acceptable. |
| Evidence missing | Refuse grounded answer or ask for source. |
| Evidence weak | Use uncertainty language or require human review. |
| Evidence conflicting | Disclose conflict and source hierarchy. |
| Citation unsupported | Remove claim or block response. |
| Corpus poisoning suspected | Quarantine source and open corpus review. |
A retrieval pipeline that cannot refuse unsupported claims is not grounded. It is just hallucination with footnotes.
Tool and Action Contract Model
Language becomes operationally dangerous at the tool seam. A model may propose an action, but only deterministic code may authorize, execute, verify, and record that action.
A tool contract defines the complete boundary between a model proposal and a real side effect.
Tool Contract Structure
| Contract Field | Meaning |
|---|---|
| tool_id | Stable identifier for the tool. |
| owner | Team responsible for behavior, safety, and maintenance. |
| allowed_callers | Which routes, agents, users, or services may request this tool. |
| input_schema | Typed arguments accepted by the tool. |
| preconditions | Required state before execution. |
| authorization_policy | Permission rules evaluated outside the model. |
| idempotency_policy | How retries avoid duplicate side effects. |
| execution_timeout | Maximum runtime. |
| postconditions | Required state after execution. |
| compensation_policy | How to reverse, repair, or reconcile failure. |
| evidence_policy | What evidence is retained and for how long. |
| breach_behavior | Deny, retry, compensate, escalate, or open incident. |
Idempotency Key Doctrine
An idempotency key must identify the same logical operation across retries. It should not depend on attempt-specific IDs that change when the framework retries, unless those IDs are guaranteed stable for the logical operation.
idempotency_key =
hash(
actor_id
+ operation_id
+ tool_id
+ target_resource_id
+ normalized_arguments
+ authorization_context_hash
)
| Key Component | Purpose |
|---|---|
| actor_id | Binds action to user/service principal. |
| operation_id | Stable ID for the intended logical operation. |
| tool_id | Prevents collisions across tools. |
| target_resource_id | Binds key to affected object. |
| normalized_arguments | Ensures equivalent retries deduplicate. |
| authorization_context_hash | Prevents reuse under changed permission conditions. |
Action State Machine
ACTION CONTRACT STATE MACHINE
[ Proposed ]
|
v
[ Parsed and Schema-Valid ]
|
v
[ Permission Checked ]
|
+-- deny --> [ Denied ]
|
v
[ Preconditions Verified ]
|
+-- fail --> [ Blocked / Needs Clarification ]
|
v
[ Idempotency Key Reserved ]
|
+-- duplicate --> [ Return Prior Result ]
|
v
[ Submitted ]
|
v
[ Confirmed by Source of Record ]
|
+-- fail --> [ Compensate / Escalate ]
|
v
[ Completed ]
Postcondition Verification
A tool call is not complete when the API returns. It is complete when the source of record confirms the expected state.
| Action Type | Required Postcondition |
|---|---|
| Create | New object exists with expected fields and owner. |
| Update | Target object version changed and fields match requested delta. |
| Delete | Object removed or marked deleted according to policy. |
| Send | Message accepted by delivery system with recipient and content hash. |
| Payment / Transfer | Transaction ID confirmed and amount/recipient match approved payload. |
| Booking / Reservation | Reservation exists with correct time, party, price, and cancellation terms. |
| Permission Change | Policy state reflects intended access and no unintended grants. |
Tool Breach Behavior
| Breach | Behavior |
|---|---|
| Invalid arguments | Reject before execution. |
| Unauthorized actor | Deny and log policy decision. |
| Failed precondition | Ask clarification or route to human. |
| Duplicate retry | Return cached result for same logical operation. |
| Partial side effect | Compensate or reconcile. |
| Unknown final state | Query source of record; escalate if unresolved. |
| High-impact action | Require human approval before submission. |
The model proposes. The tool contract disposes.
Permission and Security Contract Model
A model’s ability to generate an action is not permission to execute it. Permission must be evaluated by deterministic policy outside the model, using the authenticated subject, tenant, resource, action, arguments, environment, risk tier, and approval state.
The default decision is deny.
Permission Decision Inputs
| Input | Meaning |
|---|---|
| subject | Human user, service account, agent identity, or delegated actor. |
| tenant | Organizational boundary for data and authority. |
| resource | Object, account, document, record, system, or environment being affected. |
| action | Read, write, delete, send, approve, execute, export, etc. |
| tool | Tool or API being invoked. |
| arguments | Normalized proposed payload. |
| risk_tier | Consequence class of the action. |
| environment | Production, staging, sandbox, region, network, time, device posture. |
| approval_state | Whether required maker-checker, human approval, or dual control exists. |
| policy_version | Active policy bundle used to decide. |
Permission Contract Outcomes
| Outcome | Meaning |
|---|---|
| allow | Action may proceed to precondition and idempotency checks. |
| deny | Action is blocked. |
| require_approval | Human or multi-party approval needed. |
| require_clarification | Action intent or target is ambiguous. |
| degrade_authority | System may draft or review but not execute. |
| escalate | Route to security, compliance, or workflow owner. |
Policy Contract Pattern
{
"decision_request": {
"subject": {
"id": "user_123",
"role": "finance_reviewer",
"tenant": "tenant_a",
"scopes": ["invoice:read", "invoice:review"]
},
"resource": {
"type": "invoice",
"id": "inv_789",
"tenant": "tenant_a",
"classification": "confidential"
},
"tool": {
"id": "invoice_approval_api",
"action": "approve_payment"
},
"arguments": {
"amount": "1250.00",
"currency": "USD",
"payee_id": "vendor_456"
},
"context": {
"risk_tier": "high",
"environment": "production",
"approval_state": "maker_submitted_checker_pending",
"route_id": "finance_review_governed"
}
}
}
Security Contract Controls
| Control | Purpose |
|---|---|
| External Policy Engine | Keeps authorization outside model context. |
| Least Privilege | Agents inherit only the user’s approved scopes, not platform-wide authority. |
| Tenant Isolation | Subject, resource, memory, retrieval, and tool scopes must align. |
| Argument-Level Policy | Authorization checks the proposed payload, not just the tool name. |
| Separation of Duties | Maker and checker roles must be distinct for high-impact actions. |
| Approval Binding | Approval must bind to the exact payload hash, not a vague intent. |
| Time and Environment Conditions | Sensitive actions may depend on maintenance windows, regions, or environment. |
| Policy Reason Codes | Denials should be explainable to system owners and users where safe. |
| Policy Versioning | Every decision references the active policy bundle. |
Anti-Patterns
| Anti-Pattern | Why It Fails |
|---|---|
| Prompt says “only do authorized actions.” | Prompt text is not authorization. |
| Tool name is allowed, so all payloads are allowed. | Argument-level abuse bypasses intent checks. |
| String blacklist blocks dangerous commands. | Attackers route around brittle keyword filters. |
| Agent runs under service-admin identity. | Confused-deputy failure. |
| Approval is recorded before payload finalization. | User approved an intent, not the executed action. |
| Fallback route skips policy sidecar. | Availability destroys security. |
Security contracts must be enforced before side effects, not explained after them.
Resource, Memory, and Model Route Contracts
Resource, memory, and model-route contracts define the runtime envelope around model behavior. They prevent runaway loops, privacy leakage, stale personalization, unsafe fallback, and provider-specific dependency drift.
Resource Contract Model
A resource contract constrains the amount of time, money, context, concurrency, and retry effort a model workflow may consume.
| Resource Limit | Purpose | Breach Behavior |
|---|---|---|
| max_input_tokens | Prevents unbounded context growth. | Compress, retrieve less, ask user, or block. |
| max_output_tokens | Prevents verbose or runaway generation. | Stop generation and mark truncated. |
| max_total_tokens | Caps full session cost. | Return partial result or escalate. |
| max_loop_iterations | Prevents agent recursion. | Stop loop; return current state. |
| max_tool_calls | Prevents tool abuse and cost bombs. | Block further tool use. |
| max_retries | Prevents retry storms. | Fail with structured error. |
| latency_ceiling_ms | Protects user experience and workflow SLO. | Timeout, fallback, or queue. |
| cost_budget | Controls spend by user, tenant, route, or workflow. | Throttle, degrade, or require approval. |
| concurrency_limit | Prevents overload and rate-limit exhaustion. | Queue or reject. |
| queue_deadline | Prevents stale work. | Expire or revalidate task. |
Resource Contract Template
resource_contract:
contract_id: "<resource_contract_id>"
owner: "<platform_or_finops_owner>"
applies_to_routes:
- "<route_id>"
limits:
max_input_tokens: 0
max_output_tokens: 0
max_total_tokens: 0
max_loop_iterations: 0
max_tool_calls: 0
max_retries: 0
latency_ceiling_ms: 0
cost_budget_usd: 0
concurrency_limit: 0
breach_behavior:
token_limit: "compress_or_refuse"
loop_limit: "return_partial_and_escalate"
cost_limit: "require_approval"
latency_timeout: "fallback_or_queue"
Memory Contract Model
Memory is writable state. It must be scoped, permissioned, reviewable, and deletable. Memory should not become a surveillance landfill or a prompt-injection persistence layer.
| Memory Rule | Purpose |
|---|---|
| Write Validation | Prevents poisoned, false, sensitive, or unauthorized memory writes. |
| User / Principal Scope | Prevents memory from leaking across users, tenants, roles, or workflows. |
| Provenance Tagging | Records where the memory came from and when. |
| Confidence and Status | Distinguishes user-confirmed memory from inferred memory. |
| Retention Class | Defines when memory expires or must be reviewed. |
| Read Authorization | Checks whether memory may be used in the current task. |
| Conflict Resolution | Handles contradictory memories. |
| User Control | Provides inspect, edit, disable, and delete where appropriate. |
| Rollback / Quarantine | Allows poisoned or harmful memory to be removed. |
Memory Contract Template
memory_contract:
contract_id: "<memory_contract_id>"
owner: "<memory_owner>"
allowed_memory_types:
- preference
- workflow_context
- user_confirmed_fact
prohibited_memory_types:
- secrets
- unsupported_inferences
- sensitive_data_without_policy_basis
write_policy:
require_user_confirmation: true
require_provenance: true
injection_scan_required: true
read_policy:
scope: "user | tenant | workflow | role"
require_active_task_relevance: true
require_permission_check: true
retention:
default_ttl_days: 0
review_required: true
user_controls:
inspect: true
edit: true
delete: true
disable: true
breach_behavior:
suspected_poisoning: "quarantine_memory_and_open_review"
unauthorized_read: "deny_and_log"
conflict: "ask_user_or_disclose_conflict"
Model Route Contract Model
A model route contract maps task classes and risk tiers to approved execution routes. Routes should be expressed as capability profiles, not as brittle vendor catalogs.
| Route Field | Meaning |
|---|---|
| route_id | Stable internal route name. |
| task_classes | Workloads approved for this route. |
| risk_tiers | Maximum risk tier allowed. |
| data_classes | Data types allowed to enter this route. |
| capability_profile | Required reasoning, modality, context, tool, or latency capability. |
| provider_profile | Managed API, hosted model, self-hosted, local, deterministic fallback. |
| eval_gate | Eval suite and minimum pass condition. |
| fallback_chain | Approved fallbacks that preserve safety contract. |
| degraded_authority | What actions are disabled under fallback. |
| observability | Required telemetry emitted by route. |
| cost_budget | Cost limits for route use. |
Model Route Contract Template
model_route_contract:
route_id: "<route_id>"
owner: "<platform_owner>"
approved_task_classes:
- "<task_class>"
maximum_risk_tier: "low | medium | high | regulated"
allowed_data_classes:
- "public"
- "internal"
capability_profile:
context_window: "small | medium | large"
modalities: ["text"]
tool_use: "none | read_only | write_with_approval"
latency_class: "interactive | batch | async"
execution_profile:
primary_route: "<provider_or_runtime_alias>"
fallback_routes:
- route: "<fallback_route_id>"
authority_downgrade: "disable_tool_execution"
reason: "primary_unavailable"
eval_gate:
required_suite: "<eval_suite_id>"
required_status: "pass"
observability:
emit_trace: true
emit_cost: true
emit_quality_sample: true
breach_behavior:
eval_expired: "block_route"
provider_unavailable: "use_approved_fallback"
fallback_not_safe: "fail_closed"
A fallback route must preserve the safety contract. If the fallback cannot meet the same permission, privacy, grounding, or action-verification requirements, it is not a fallback. It is a contract downgrade and must reduce authority.
Evaluation, Deployment, and Observability Contracts
Evaluation, deployment, and observability contracts turn AI behavior from a demo into an operated system. The eval contract decides whether a configuration may ship. The deployment contract defines exactly what shipped. The observability contract proves what happened at runtime.
Evaluation Contract Model
An evaluation contract defines the test suite, dataset, metrics, thresholds, owners, and release decision for a model route, prompt, schema, retrieval pipeline, tool, or full workflow.
| Eval Contract Element | Required Definition |
|---|---|
| eval_suite_id | Stable identifier for the evaluation suite. |
| scope | Prompt, route, retrieval, schema, tool, workflow, or full system. |
| risk_tier | Determines required evidence and threshold strictness. |
| dataset_version | Golden set, adversarial set, production sample, synthetic set. |
| metrics | Task-specific success, failure, safety, cost, and latency metrics. |
| thresholds | Pass/fail or conditional release thresholds. |
| regression_window | Allowed delta from prior stable manifest. |
| human_review_required | Whether expert review is required before release. |
| failure_behavior | Block release, allow canary, require mitigation, or rollback. |
| owner | Person/team accountable for eval validity. |
Risk-Tiered Evaluation Gates
| Risk Tier | Evaluation Posture |
|---|---|
| Low | Lightweight task tests, schema validation, basic safety and latency checks. |
| Medium | Golden set, regression checks, sampled human review, cost/latency gates. |
| High | Strong golden set, adversarial tests, grounding/policy checks, human sign-off. |
| Regulated / Critical | Formal evidence package, independent review, replayability, approval record. |
Avoid universal thresholds like “1,000 cases” or “100% refusal” unless the workflow justifies them. Thresholds must match task consequence, data quality, and measurement confidence.
Deployment Contract Model
A deployment contract is the signed manifest of the active AI system configuration.
deployment_manifest:
manifest_id: "<manifest_id>"
release_version: "<version>"
owner: "<release_owner>"
created_at: "<iso_datetime>"
applies_to:
workflow: "<workflow_id>"
environment: "staging | production"
components:
prompt_contract: "<prompt_contract_id>@<version>"
schema_contract: "<schema_contract_id>@<version>"
retrieval_contract: "<retrieval_contract_id>@<version>"
memory_contract: "<memory_contract_id>@<version>"
tool_contracts:
- "<tool_contract_id>@<version>"
permission_policy_bundle: "<policy_bundle_hash>"
model_route_contract: "<route_contract_id>@<version>"
resource_contract: "<resource_contract_id>@<version>"
eval_gate:
eval_suite: "<eval_suite_id>"
status: "pass | conditional | fail"
report_hash: "<hash>"
approvals:
product_owner: "<approval_ref>"
security_owner: "<approval_ref>"
eval_owner: "<approval_ref>"
rollback:
previous_manifest_id: "<manifest_id>"
rollback_conditions:
- "schema_error_spike"
- "policy_denial_spike"
- "eval_regression"
Any change to prompt, schema, route, retrieval, memory, tool, policy, eval threshold, or resource envelope is a deployment event.
Observability Contract Model
The observability contract defines runtime telemetry. It should record enough to debug, evaluate, and govern the system without overcollecting sensitive data.
| Signal | Purpose |
|---|---|
| trace_id / workflow_id | Correlate steps in a request or task. |
| manifest_id | Identify active deployment package. |
| route_id | Identify model/provider/runtime path. |
| contract versions | Link runtime behavior to exact contract stack. |
| validation results | Schema, semantic, factual, policy, and action checks. |
| resource usage | Tokens, cost, latency, retries, tool calls, queueing. |
| breach events | Contract violations and breach behavior. |
| user decisions | Accept, reject, edit, approve, override, escalate. |
| redaction status | Whether sensitive fields were removed or referenced securely. |
Telemetry vs. Audit Evidence
| Dimension | Telemetry | Audit Evidence |
|---|---|---|
| Purpose | Debugging, monitoring, optimization, drift detection. | Proving policy, compliance, approval, or incident facts. |
| Content | Redacted spans, metrics, errors, counters. | Minimal structured records, hashes, approvals, policy decisions, secure refs. |
| Retention | Shorter, operationally scoped. | Risk/legal/compliance scoped. |
| Mutability | Rotated and managed as operational data. | Tamper-evident where required. |
| Access | Engineering and operations access by role. | Strictly controlled access by legal/security/compliance need. |
Audit evidence should not be raw logs with a fancy hat. It should be scoped, structured, minimized, and defensible.
User Expectation & Trust Calibration Contract
The user interface is a contract boundary. It tells the user what the system can do, what it cannot do, what evidence it used, what was excluded, what authority it has, and when the user must verify or decide.
Trust calibration means user reliance matches system competence. Overtrust creates rubber-stamping. Undertrust creates abandonment. The contract must make the system’s role legible.
Expectation Contract Elements
| Element | User-Facing Question |
|---|---|
| System Role | Is the AI drafting, reviewing, recommending, deciding, or acting? |
| Authority Boundary | Can the AI execute, or only propose? |
| Evidence Used | What sources support this output? |
| Evidence Excluded | What sources were unavailable, unauthorized, stale, or omitted? |
| Uncertainty State | What is known, uncertain, unsupported, or conflicting? |
| User Responsibility | What must the user review or approve? |
| Action State | Is this a draft, pending approval, submitted, confirmed, failed, or rolled back? |
| Correction Path | How can the user reject, edit, appeal, override, or report an issue? |
| Fallback State | Is the system in degraded, reduced-fidelity, or fallback mode? |
Trust Calibration Controls
| Control | Purpose |
|---|---|
| Role Label | Shows whether AI is assistant, reviewer, router, or executor. |
| Evidence Panel | Makes source support visible and replayable. |
| Unsupported Claim Marker | Prevents unsupported text from appearing equally authoritative. |
| Conflict Banner | Surfaces unresolved disagreement among sources. |
| Omitted Source Notice | Tells users when search or retrieval was incomplete. |
| Action Proximity Warning | Places risk and approval requirements near execution controls. |
| Draft / Final State Separation | Prevents users from mistaking generated text for committed action. |
| Human Approval Gate | Requires explicit decision for high-impact actions. |
| Undo / Appeal / Correction Path | Supports trust repair and contestability. |
| Degraded Mode Indicator | Shows when capability or evidence has been reduced. |
Confidence Display Rule
Raw model confidence can mislead users. Prefer concrete status labels tied to verification:
| Weak Display | Better Display |
|---|---|
| “95% confident” | “Cited source supports this claim.” |
| “High confidence” | “Verified against source of record.” |
| “Probably correct” | “Needs human review: conflicting evidence.” |
| “Low confidence” | “Missing required evidence.” |
| “AI completed this” | “Submitted; awaiting source-of-record confirmation.” |
Expectation Breach Examples
| Breach | Example | Required Response |
|---|---|---|
| Capability Overclaim | UI promises “policy compliant” when retrieval is incomplete. | Change language, block claim, or show unsupported status. |
| Authority Confusion | User thinks draft was sent. | Separate draft, approval, submitted, confirmed states. |
| Evidence Illusion | Citation exists but does not support claim. | Remove claim or mark unsupported. |
| Fallback Concealment | System silently uses weaker fallback model. | Show degraded mode and reduce authority. |
| Automation Bias | User can one-click approve high-impact output without review. | Add evidence gate or independent judgment step. |
The UI should not ask users to trust the model. It should give users enough context to trust, verify, correct, or refuse the system appropriately.
Contract Breach Playbook
A contract breach occurs when any deterministic edge rejects, cannot verify, or cannot safely process model behavior, context, retrieval, memory, tool execution, policy, route, deployment, or user expectation.
Breach handling must be deterministic. The system should not improvise safety.
CONTRACT BREACH FLOW
[ Breach Detected ]
|
v
[ Classify Breach ]
schema | semantic | factual | policy | permission | tool | resource
memory | retrieval | route | deployment | observability | user expectation
|
v
[ Determine Severity ]
low | medium | high | security/compliance incident
|
v
[ Contain ]
block action | stop loop | freeze state | remove source | restrict memory | fail closed
|
v
[ Resolve or Escalate ]
repair | retry | clarify | reroute | degrade | human review | incident
|
v
[ Preserve Scoped Evidence ]
|
v
[ Notify / Recover / Review ]
Breach Response Matrix
| Breach Type | Default Response | Notes |
|---|---|---|
| Serialization Failure | Reject or safe single repair attempt. | Do not regex-hack high-risk outputs into existence. |
| Schema Failure | Retry once if low-risk; otherwise reject or escalate. | Preserve validation error class. |
| Semantic Failure | Reject field/object or route to human. | Business invariants beat model fluency. |
| Factual / Grounding Failure | Remove unsupported claim, refuse, or request evidence. | Do not hide weak grounding behind citations. |
| Policy Failure | Block and log policy decision. | No model retry should bypass policy. |
| Permission Failure | Deny before side effect. | Return reason where safe. |
| Tool Precondition Failure | Ask clarification, block, or route to human. | Never execute with ambiguous target. |
| Tool Postcondition Failure | Query source of record; compensate or escalate. | API return is not proof of completion. |
| Resource Breach | Stop loop, throttle, degrade, or require approval. | Cost and loop breaches can become incidents. |
| Memory Breach | Quarantine memory, restrict read, or delete per policy. | Preserve provenance and review path. |
| Retrieval Breach | Exclude source, rerun retrieval, disclose conflict, or refuse. | Poisoned or stale sources require corpus review. |
| Route Breach | Use approved fallback or fail closed. | Fallback must preserve safety properties. |
| Deployment Breach | Halt rollout or rollback manifest. | Treat prompt/policy/schema changes as deploys. |
| Observability Breach | Spool locally, alert owner, or block high-risk route. | Systems without evidence may be unsafe to operate. |
| User Expectation Breach | Update UI state, disclose limitation, require confirmation. | Trust repair is part of breach handling. |
Evidence Preservation Rule
Preserve enough evidence to investigate, prove, and repair the breach without creating an uncontrolled sensitive-data archive.
| Evidence Type | Preferred Form |
|---|---|
| Prompt / schema / policy / route versions | Hashes and manifest IDs. |
| Input/output payloads | Secure references, redacted excerpts, or hashes by default. |
| Tool/action details | Action ID, idempotency key, target resource, status, payload hash. |
| Policy decision | Policy version, decision, reason code, subject/resource references. |
| Retrieval evidence | Source IDs, timestamps, rank, verifier status. |
| Memory event | Memory ID, provenance, retention class, operation type. |
| User approval | Approval ID, approver role, payload hash, timestamp. |
Raw payload capture should be reserved for incident classes and environments where policy, legal basis, access control, and retention are explicitly defined.
Incident Escalation Triggers
| Trigger | Escalation |
|---|---|
| Repeated breach of same contract | Open owner review. |
| Breach affects high-risk workflow | Escalate to human/governance owner. |
| Unauthorized data exposure | Security/privacy incident. |
| Tool action executed incorrectly | Operations incident. |
| Vendor route drift causes regression | Vendor/platform incident. |
| Breach evidence cannot be preserved | Observability/compliance incident. |
| Users were misled by UI state | Product/trust incident. |
A breach is not just an error. It is a signal that a contract boundary either worked, failed, or was missing.
Contract Review and Lifecycle Model
Contracts are living system artifacts. They must be created, approved, tested, released, monitored, reviewed, migrated, deprecated, and retired. A contract that exists only in prose is not a contract; it is a wish with headers.
Contract Lifecycle
| Stage | Required Activity | Output |
|---|---|---|
| 1. Creation | Define purpose, owner, inputs, outputs, validation, breach behavior, evidence boundary. | Draft contract artifact. |
| 2. Review | Product, security, governance, eval, platform, and domain review as appropriate. | Approval or required changes. |
| 3. Testing | Run unit tests, schema tests, evals, policy tests, route tests, and breach tests. | Test report. |
| 4. Release | Bundle into deployment manifest with hashes, versions, and rollback target. | Signed manifest. |
| 5. Monitoring | Track validation errors, breaches, cost, latency, user overrides, drift, incidents. | Runtime telemetry and evidence. |
| 6. Drift Review | Compare production behavior to contract assumptions. | Contract update, route change, or incident. |
| 7. Migration | Move to new model, provider, schema, policy, corpus, tool, or UI surface. | Migration plan and eval comparison. |
| 8. Deprecation | Mark old contract as no longer preferred; route traffic away. | Deprecation notice and timeline. |
| 9. Retirement | Disable execution and preserve required evidence. | Archived contract and retirement record. |
Review Triggers
| Trigger | Contracts Affected |
|---|---|
| Model/provider change | Model route, eval, prompt, schema, observability. |
| Prompt update | Prompt, eval, schema, user expectation. |
| Schema/API change | Schema, tool, deployment, eval. |
| New tool side effect | Tool, permission, action verification, evidence. |
| New data class | Context, retrieval, memory, permission, privacy. |
| New user population | User expectation, adoption, permission, support. |
| New jurisdiction or regulation | Policy, memory, evidence, vendor, retention. |
| Vendor term or subprocessor change | Vendor, data rights, route, procurement. |
| Retrieval corpus/index rebuild | Retrieval, grounding, eval, observability. |
| Embedding model change | Retrieval, index, eval, route, exit plan. |
| Eval regression | Eval, deployment, route, prompt, retrieval. |
| Cost anomaly | Resource, route, tool, observability. |
| Security incident | Permission, tool, memory, evidence, deployment. |
| User trust/adoption failure | User expectation, workflow, prompt, evaluation. |
| Repeated contract breach | Affected contract and upstream/downstream layers. |
Contract Registry Fields
| Field | Purpose |
|---|---|
| contract_id | Stable identifier. |
| contract_type | Prompt, schema, tool, memory, route, etc. |
| owner | Accountable team/person. |
| status | Draft, active, deprecated, retired. |
| version | Semantic or manifest version. |
| linked_contracts | Related surfaces in the stack. |
| linked_eval_suite | Required eval gate. |
| risk_tier | Determines approval and evidence requirements. |
| deployment_manifest | Active release bundle. |
| evidence_policy | Retention and evidence requirements. |
| breach_policy | Deterministic failure behavior. |
| last_reviewed | Review timestamp. |
| next_review_trigger | Time or event-based trigger. |
Contract lifecycle management is the maintenance discipline that keeps probabilistic systems from becoming folklore.
Cross-Canon Handoff Map
AI-ENG-AI defines the seam discipline that unifies the AI Engineering Systems Canon. Contract thinking turns prompts, schemas, retrieval, memory, tools, permissions, resources, model routes, evaluations, deployments, observability, vendor dependencies, and user expectations into enforceable boundaries around probabilistic cores.
| Canon Report | Handoff Into AI | Contract-Thinking Integration |
|---|---|---|
| AI-ENG-B — Context Architecture | Context windows, state, instruction hierarchy, authority. | Context Contract and Prompt Contract define what enters the model and under what priority. |
| AI-ENG-D — Corpus Engineering | Source ownership, lineage, metadata, lifecycle. | Retrieval Contract requires source authority, provenance, and corpus lifecycle state. |
| AI-ENG-E — Retrieval Pipeline | Chunking, ranking, reranking, citation, retrieval telemetry. | Retrieval and Grounding Contracts define evidence admission and claim support. |
| AI-ENG-F — Freshness and Conflict Detection | Source freshness, conflict packets, source-of-record logic. | Retrieval Contract requires time scope, freshness, and conflict behavior. |
| AI-ENG-J — Throughput Mechanics | Latency, batching, queues, prefill/decode constraints. | Resource and Model Route Contracts define runtime envelopes and latency ceilings. |
| AI-ENG-K — Weight Dynamics | Quantization, adapters, model behavior shifts. | Model Route and Eval Contracts bind model variants to tested task classes. |
| AI-ENG-L — Serving Architecture | Gateways, routing, failover, deployment patterns. | Model Route and Deployment Contracts control route manifests and fallback safety. |
| AI-ENG-M — Agentic Orchestration | Planner/executor loops, agent roles, multi-step tasks. | Resource, Tool, Permission, and Memory Contracts bound agent behavior. |
| AI-ENG-N — Tool Contracts | Tool schemas, idempotency, execution interfaces. | AI-ENG-AI generalizes tool boundaries into full contract-stack discipline. |
| AI-ENG-O — Action Verification | Source-of-record checks, postconditions, false-success prevention. | Tool and Action Contracts require preconditions, postconditions, and confirmation. |
| AI-ENG-P — Multimodal Understanding | Image/audio/video input uncertainty and evidence. | Schema, Context, and User Expectation Contracts define modality-specific confidence and review. |
| AI-ENG-Q — Speech, Voice, and Real-Time Systems | Latency, interruption, streaming, voice UX. | Resource and User Expectation Contracts define realtime boundaries and confirmation gates. |
| AI-ENG-R — UI Agents | Browser/UI actions, screen state, user authority. | Tool, Permission, and User Expectation Contracts govern UI-side effects. |
| AI-ENG-S — Production Pathologies | Failure modes, brittleness, hallucination, drift. | Contract Drift Diagnostic maps pathologies to breached seams. |
| AI-ENG-T — Boundary Defense | Tenant isolation, prompt injection, egress, policy hierarchy. | Permission, Context, Retrieval, and Tool Contracts enforce boundaries outside the model. |
| AI-ENG-U — Supply Chain Security | SBOM, AI-BOM, provenance, artifact trust. | Deployment and Vendor Contracts require signed artifacts and dependency evidence. |
| AI-ENG-V — Resource Abuse | Loop abuse, denial-of-wallet, cost bombs. | Resource Contract enforces budgets, loop caps, retry caps, and concurrency limits. |
| AI-ENG-W — UX Resilience | Degraded modes, fallback UX, continuity. | Breach Playbook and User Expectation Contract define safe degradation. |
| AI-ENG-X — User Trust | Transparency, contestability, disclosure, trust repair. | User Expectation Contract formalizes trust calibration and user authority. |
| AI-ENG-Y — Human Review | Maker-checker, approval queues, reviewer burden. | Permission and Tool Contracts bind approvals to payloads and action states. |
| AI-ENG-Z — Strategic Telemetry | Traces, metrics, behavior observability. | Observability Contract defines runtime emission and redaction. |
| AI-ENG-AA — Evaluation Architecture | Golden sets, rubrics, regression gates. | Eval Contract defines release gates and drift response. |
| AI-ENG-AB — Verification Artifacts | Evidence packages, replay, audit references. | Audit Evidence Contract defines what proof is retained. |
| AI-ENG-AC — AI Operations | Incidents, rollback, runbooks, containment. | Breach Playbook and Contract Lifecycle Model define operational response. |
| AI-ENG-AD — Governance Architecture | Policy, audit, compliance, accountability. | Permission, Evidence, Vendor, and Deployment Contracts make governance executable. |
| AI-ENG-AE — Sustainable AI | Cost, energy, routing, lifecycle efficiency. | Resource and Model Route Contracts enforce cost/resource envelopes. |
| AI-ENG-AF — Product Architecture | Use-case fit, workflow value, product surface. | User Expectation and Product/Workflow Contracts bind promises to capability. |
| AI-ENG-AG — Adoption Systems | Training, feedback loops, incentives, change. | User Expectation and Observability Contracts feed adoption telemetry and correction loops. |
| AI-ENG-AH — Sourcing and Vendor Strategy | Build/buy/open/vendor decisions, exit plans. | Vendor and Model Route Contracts prevent sourcing decisions from becoming invisible lock-in. |
| AI-ENG-AJ — Reference Architectures | Reusable implementation patterns. | AI-ENG-AI supplies the contract surfaces each reference architecture must instantiate. |
Core Canon Rule
Every probabilistic capability must cross deterministic boundaries before it can affect users, systems, memory, tools, money, permissions, evidence, or production state.
The model may be probabilistic. The system must not be.
Works cited
- Design by contract - Wikipedia, accessed June 15, 2026, https://en.wikipedia.org/wiki/Design_by_contract
- Applying ‘design by contract’ - Michigan Technological University, accessed June 15, 2026, https://pages.mtu.edu/~aebnenas/teaching/spring2010/cs3141/readings/meyerPDF.pdf
- Design By Contract: A Missing Link In The Quest For Quality Software, accessed June 15, 2026, https://wstomv.win.tue.nl/edu/2ip30/references/design-by-contract/index.html
-
API Security through Contract-Driven Programming CMU Software Engineering Institute, accessed June 15, 2026, https://www.sei.cmu.edu/blog/api-security-through-contract-driven-programming/ - AI Agent Tool Use Best Practices for Practitioners - MLflow, accessed June 15, 2026, https://mlflow.org/articles/ai-agent-tool-use-best-practices-for-practitioners/
- Why Open Policy Agent is the Missing Guardrail for Your AI Agents, accessed June 15, 2026, https://codilime.com/blog/why-use-open-policy-agent-for-your-ai-agents/
- Groundedness detection in Azure AI Content Safety - Microsoft Learn, accessed June 15, 2026, https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/groundedness
-
RAG Grounding: 11 Tests That Expose Fake Citations by Nexumo …, accessed June 15, 2026, https://medium.com/@Nexumo_/rag-grounding-11-tests-that-expose-fake-citations-30d84140831a - LLMs and Data Privacy: How to Protect Sensitive Information - Duality Technologies, accessed June 15, 2026, https://dualitytech.com/blog/llm-data-privacy/
- OPA Guardrails - TrueFoundry Docs, accessed June 15, 2026, https://www.truefoundry.com/docs/ai-gateway/opa-guardrails
- Enforce least-privilege authorization in multi-agent AI delegation chains using Cedar on AWS - GitHub, accessed June 15, 2026, https://github.com/aws-samples/sample-cedar-agentic-ai-authorization
- A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle - arXiv, accessed June 15, 2026, https://arxiv.org/html/2604.16548v2
- MemPrivacy is a privacy-preserving personalized memory management framework for edge-cloud agents. - GitHub, accessed June 15, 2026, https://github.com/MemTensor/MemPrivacy
- RAG Evals: Retrieval Relevance, Grounding, and Citation Fidelity - Vikas Goyal, accessed June 15, 2026, https://vikasgoyal.github.io/agentic/observe/rag-evals.html
- RAG Testing — Validating Retrieval Accuracy, Grounding, and Context Leakage - Medium, accessed June 15, 2026, https://medium.com/@gunashekarr11/rag-testing-validating-retrieval-accuracy-grounding-and-context-leakage-b3145e3a7b26
-
Building Idempotent Tools for Long-Running Agents PADISO Blog, accessed June 15, 2026, https://www.padiso.co/blog/building-idempotent-tools-for-long-running-agents/ -
Designing for Trust Calibration: Why AI Tools Need to Stop Pretending to Be Certain by Lena C Bootcamp Medium, accessed June 15, 2026, https://medium.com/design-bootcamp/designing-for-trust-calibration-why-ai-tools-need-to-stop-pretending-to-be-certain-0e7b74d285be - Design Patterns For Building Trust, accessed June 15, 2026, https://smart-interface-design-patterns.com/articles/the-trust-calibration-spectrum-in-ux/
- From Trust in Automation to Trust in AI in Healthcare: A 30-Year Longitudinal Review and an Interdisciplinary Framework - PMC, accessed June 15, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12562135/
- Best LLM Router and AI Gateway (2026) - Inworld AI, accessed June 15, 2026, https://inworld.ai/resources/best-llm-router-ai-gateway
- What Is an AI Gateway? Why Your Enterprise LLM Infrastructure Needs One - LiteLLM, accessed June 15, 2026, https://www.litellm.ai/blog/what-is-an-ai-gateway
- Structured outputs with OpenAI and Pydantic - dida.do, accessed June 15, 2026, https://dida.do/blog/structured-outputs-with-openai-and-pydantic
-
Agent Idempotency: Build Tool Calls That Are Safe to Retry Chanl …, accessed June 15, 2026, https://www.channel.tel/blog/idempotent-tool-calls-agent-retry-safety - Explainable recommendation: when design meets trust calibration - PMC, accessed June 15, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC8327305/
-
Memory scaling for AI agents Databricks Blog, accessed June 15, 2026, https://www.databricks.com/blog/memory-scaling-ai-agents -
Structured model outputs OpenAI API, accessed June 15, 2026, https://developers.openai.com/api/docs/guides/structured-outputs - Stop Parsing JSON by Hand: Structured LLM Outputs With Pydantic - DEV Community, accessed June 15, 2026, https://dev.to/klement_gunndu/stop-parsing-json-by-hand-structured-llm-outputs-with-pydantic-1pg0
- NeurIPS Poster SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG, accessed June 15, 2026, https://neurips.cc/virtual/2025/poster/115589
- LLM Gateway Comparison 2025 - what I learned testing 5 options in production : r/AIQuality, accessed June 15, 2026, https://www.reddit.com/r/AIQuality/comments/1q57lfc/llm_gateway_comparison_2025_what_i_learned/
- The Complete Guide to Using Pydantic for Validating LLM Outputs, accessed June 15, 2026, https://machinelearningmastery.com/the-complete-guide-to-using-pydantic-for-validating-llm-outputs/
Attribution
Part of Stunspot’s Guide to AI Systems — The AI Engineering Systems Canon.
Created by Sam “stunspot” Walker / Collaborative Dynamics.
Repository: https://github.com/Stunspot/stunspots-guide-to-ai-systems
Stunspot: https://stunspot.com
Collaborative Dynamics: https://www.collaborative-dynamics.com
Discord: https://discord.gg/stunspot
Licensed under CC BY-NC-SA 4.0 unless otherwise stated.
Commercial use, resale, paid redistribution, inclusion in commercial training products, or incorporation into paid knowledge-base products requires prior written permission.