The mature artificial intelligence enterprise does not treat its data as an unmanaged document landfill, a collection of flat PDFs, or a basic vector database filled with disconnected text chunks. Instead, the corpus is modeled as an engineered, governed knowledge substrate.1 This report defines the architectural discipline of corpus engineering as the end-to-end knowledge supply-chain management that transforms raw, unstructured organizational information into clean, versioned, deduplicated, permission-aware, canonical, conflict-aware, and lifecycle-governed knowledge assets.1 These assets are designed to support retrieval-augmented generation (RAG), multi-agent memory structures, verifiably accurate citations, and model-facing context compilation.4
+------------------------------------------------------------------------------------------------+
|                              KNOWLEDGE SUPPLY CHAIN CONTROL MODEL                              |
+------------------------------------------------------------------------------------------------+
|
|  Rule: provenance, permission, authority, hygiene, and lifecycle checks happen before semantic
|  retrieval. Relevance is not allowed to outrank legitimacy.
|
|  +------------------------------------------------------------------------------------------+
|  |  Raw Organizational Sources                                                              |
|  |                                                                                          |
|  |  PDFs | Wikis | SQL records | APIs | tickets | emails | manuals | policies | uploads     |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Discovery, Registration, and Source Lock                                                |
|  |                                                                                          |
|  |  discover asset -> assign canonical_source_id -> capture source_uri -> lock raw replica  |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Provenance and Permission Preflight                                                     |
|  |                                                                                          |
|  |  verify origin | hydrate ACLs | assign tenant scope | capture owner/steward/jurisdiction |
|  |  reject assets with missing source authority or unverifiable access boundaries           |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Parsing, Normalization, and Hygiene                                                     |
|  |                                                                                          |
|  |  parse/OCR -> normalize structure -> remove boilerplate -> redact PII -> deduplicate     |
|  |  preserve layout, tables, headings, timestamps, units, locale, and citation coordinates  |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Metadata, Authority, and Lifecycle Gates                                                |
|  |                                                                                          |
|  |  source_authority | version_state | valid_from/until | retention_class | conflict_status |
|  |  supersession links | archival state | legal hold | lineage trail | sensitivity class    |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Governed Corpus Object Store                                                            |
|  |                                                                                          |
|  |  Corpus Objects become parent assets for chunks, embeddings, claims, summaries, entities,|
|  |  and citation anchors. All derivatives inherit provenance, permissions, and sensitivity. |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Secure Retrieval Substrate                                                              |
|  |                                                                                          |
|  |  vector index | keyword index | graph index | claim store | citation map | lineage ledger|
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Pre-Retrieval Eligibility Filter                                                        |
|  |                                                                                          |
|  |  enforce tenant, role, group, jurisdiction, version, archival, legal-hold, and freshness |
|  |  constraints before ANN/vector similarity scoring.                                       |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  Retrieval and Context Assembly                                                          |
|  |                                                                                          |
|  |  candidate search -> rerank -> citation verification -> conflict checks -> Context Object|
|  |  assembly -> model-facing context window                                                 |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|                                       [ LLM Synthesizer ]
|
+------------------------------------------------------------------------------------------------+
| Doctrine: the corpus substrate is not a pile of documents. It is a governed supply chain that  |
| decides which knowledge is eligible to become model-facing context.                            |
+------------------------------------------------------------------------------------------------+
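The ordering rule in the control model can be sketched in a few lines of Python. This is a minimal illustration rather than a production retriever: the `Chunk` fields, the checks inside `is_eligible`, and the in-memory list standing in for a vector index are all assumptions made for the example. It shows only the sequencing: eligibility is decided first, and similarity is computed over whatever survives.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Chunk:
    # Hypothetical, simplified record; real chunks carry far more metadata.
    chunk_id: str
    tenant: str
    allowed_groups: frozenset
    archived: bool
    vector: tuple


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_eligible(chunk, tenant, user_groups):
    """Legitimacy check: same tenant, not archived, and at least one shared group."""
    return (
        chunk.tenant == tenant
        and not chunk.archived
        and bool(chunk.allowed_groups & user_groups)
    )


def retrieve(chunks, query_vector, tenant, user_groups, k=3):
    """Filter first, then rank only the eligible chunks by similarity."""
    eligible = [c for c in chunks if is_eligible(c, tenant, user_groups)]
    ranked = sorted(eligible, key=lambda c: cosine(c.vector, query_vector), reverse=True)
    return [c.chunk_id for c in ranked[:k]]


CHUNKS = [
    Chunk("hr-policy-v3", "acme", frozenset({"hr"}), False, (0.9, 0.1)),
    Chunk("hr-policy-v1", "acme", frozenset({"hr"}), True, (1.0, 0.0)),      # archived
    Chunk("finance-memo", "acme", frozenset({"finance"}), False, (1.0, 0.0)),  # wrong group
    Chunk("other-tenant", "globex", frozenset({"hr"}), False, (1.0, 0.0)),   # wrong tenant
]
```

For a query vector of `(1.0, 0.0)` issued by a member of the `hr` group in tenant `acme`, `retrieve` returns only `hr-policy-v3`, even though the three ineligible chunks match the query more closely.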
Corpus engineering is architecturally distinct from standard data ingestion, document management, RAG preprocessing, retrieval engineering, and memory architecture. The following comparative matrix outlines the boundaries between these layers:
| Functional Layer | Principal Focus | Operational Scope | Typical Failure Modes |
|---|---|---|---|
| Data Ingestion | Bulk transport of raw bytes from point A to point B. | ETL pipelines, cloud sync, replication. | Ignores document layout, parses non-printable characters, lacks semantic indexing.2 |
| Document Management | Storage, indexing, and human-facing access control of files. | Folder structures, metadata cards, search indexes. | Fails to segment text, lacks vector representations, ignores model compatibility. |
| RAG Preprocessing | Ad-hoc text splitting and bulk embedding generation. | Blind character-limit chunking, basic vector loading. | Creates fragmented text, severs logical relations, leaks confidential data.7 |
| Corpus Engineering | Upstream governance, provenance validation, and lifecycle control.3 | Ingestion-time security, deduplication, canonical ID assignment, source authority.8 | Vector DB bloat, citation laundering, stale index retrieval.6 |
| Retrieval Engineering | Dynamic search execution, similarity scoring, and reranking. | Hybrid search, query expansion, vector distance checks. | Evaluates unauthorized or outdated document chunks.7 |
| Memory Architecture | Read/write management of state during agent execution. | Short-term prompt caching, long-term state databases. | Writes hallucinated data back into permanent storage. |
Before an external retrieval system, semantic memory, or context compiler can safely expose source data to a large language model (LLM), specific criteria must be strictly verified. The retrieval engine must not operate under the assumption that semantic similarity translates to factual legitimacy. A text passage can match a user’s query with near-perfect cosine similarity while remaining factually obsolete, unapproved, duplicated, poisoned, misparsed, or entirely restricted under organizational access control policies.7
To build a secure and accurate retrieval system, the corpus substrate must enforce the following rule:
Provenance before relevance: no knowledge object should enter retrieval, memory, citation, or model-facing context unless the system knows where it came from, who owns it, what authority it carries, what scope it applies to, what transformations it has undergone, and how its lifecycle is governed.
Corpus engineering operates as the upstream control system that prevents retrieval frameworks from becoming very fast machinery for finding the wrong information with absolute confidence.10
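One way to read the rule is as an admission check that runs before anything is indexed. The sketch below assumes a flattened dictionary per knowledge object and a hypothetical `REQUIRED_PROVENANCE` tuple that maps each clause of the rule to a single field. Treating an empty value as missing is also an assumption of this example.

```python
# Each entry pairs one clause of the rule with one (assumed) metadata field.
REQUIRED_PROVENANCE = (
    "source_uri",              # where it came from
    "owner",                   # who owns it
    "source_authority",        # what authority it carries
    "tenant_scope",            # what scope it applies to
    "transformation_history",  # what transformations it has undergone
    "version_state",           # how its lifecycle is governed
)


def missing_provenance(obj):
    """Return the required fields that are absent or empty on this object."""
    return [f for f in REQUIRED_PROVENANCE if obj.get(f) in (None, "", [], {})]


def admit(objects):
    """Split objects into an admitted list and a dict of rejected IDs with their gaps."""
    admitted, rejected = [], {}
    for obj in objects:
        gaps = missing_provenance(obj)
        if gaps:
            rejected[obj["object_id"]] = gaps
        else:
            admitted.append(obj)
    return admitted, rejected
```

An object that is rejected here never reaches the similarity stage, so a strong semantic match cannot compensate for unknown origin or ownership.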
Raw organizational information is transformed through several discrete states before it is loaded into an LLM’s active context window. This architectural progression is governed by the Epistemic Hierarchy, which maps raw source files to structured, model-ready context objects.
+------------------------------------------------------------------------------------------------+
|                            EPISTEMIC HIERARCHY OF CORPUS ARTIFACTS                             |
+------------------------------------------------------------------------------------------------+
|
|  Goal: track how raw organizational information becomes structured, governed, retrievable,
|  citeable, and finally eligible for model-facing context.
|
|  +------------------------------------------------------------------------------------------+
|  |  1. Raw Source                                                                           |
|  |                                                                                          |
|  |  Native artifact: PDF, DOCX, HTML, SQL row, API payload, ticket, wiki page, scanned file.|
|  |  Stored in system of record; not directly trusted as model context.                      |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  2. Normalized Document                                                                  |
|  |                                                                                          |
|  |  Parsed and normalized representation: Markdown, extracted tables, OCR text, section     |
|  |  boundaries, page coordinates, timestamps, locale, units, and structural layout.         |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  3. Corpus Object                                                                        |
|  |  Canonical governed asset with object_id, source_uri, source authority, lifecycle state, |
|  |  ACLs, tenant scope, sensitivity class, provenance,                                      |
|  |  retention rules, and lineage pointers.                                                  |
|  +------------+-----------------------------+--------------------------------+--------------+
|               |                             |                                |
|               v                             v                                v
|  +-------------------------+    +-----------------------+     +-----------------------------+
|  | 4A. Canonical Chunks    |    | 4B. Extracted Claims  |     | 4C. Embedding Records       |
|  |                         |    |                       |     |                             |
|  | Structure-aware text    |    | Atomic propositions   |     | Vector representations tied |
|  | segments with parent    |    | derived from source   |     | to chunk IDs, model version,|
|  | pointers, section IDs,  |    | material for conflict |     | source object, and ACLs.    |
|  | and security metadata.  |    | checks and grounding. |     |                             |
|  +------------+------------+    +-----------+-----------+     +--------------+--------------+
|               |                             |                                |
|               v                             v                                v
|  +-------------------------+    +-----------------------+     +-----------------------------+
|  | 4D. Generated Summaries |    | 4E. Entity Records    |     | 4F. Citation Anchors        |
|  |                         |    |                       |     |                             |
|  | Corpus-level or cluster |    | Canonical graph nodes |     | Stable coordinates mapping  |
|  | overviews used for      |    | with aliases, IDs,    |     | generated claims back to    |
|  | broad discovery.        |    | relationships, scope. |     | source, page, section, span.|
|  +------------+------------+    +-----------+-----------+     +--------------+--------------+
|               |                             |                                |
|               +-----------------------------+--------------------------------+
|                                             |
|                                             v
|  +------------------------------------------------------------------------------------------+
|  |  5. Retrieval Candidate Set                                                              |
|  |                                                                                          |
|  |  Permission-filtered, freshness-aware, authority-weighted candidate objects selected from|
|  |  vector, keyword, graph, claim, and citation indexes.                                    |
|  +---------------------------------------------+--------------------------------------------+
|                                                |
|                                                v
|  +------------------------------------------------------------------------------------------+
|  |  6. Context Object                                                                       |
|  |                                                                                          |
|  |  Transient, validated, permission-safe, task-relevant model-facing block compiled into   |
|  |  the active context window. Deleted or archived after the generation session.            |
|  +------------------------------------------------------------------------------------------+
|
+------------------------------------------------------------------------------------------------+
| Rule: each lower artifact must preserve lineage to its parent. Chunks, embeddings, summaries,  |
| claims, and citations do not get to wander off wearing fake mustaches.                         |
+------------------------------------------------------------------------------------------------+
The table below defines the structural boundaries, ingestion mechanics, and operational lifespans of each artifact within the epistemic hierarchy:
| Artifact State | Structural Definition | Ingestion/Transformation Mechanic | Lifespan and Operational Scope |
|---|---|---|---|
| Raw Source | Unprocessed file stored in its native format (e.g., PDF, DOCX, HTML, SQL record).4 | File discovery, system connection polling, web scraping.4 | Permanent storage inside the enterprise system of record (e.g., SharePoint, Confluence).13 |
| Normalized Document | Structured Markdown text representing the structural elements of the raw source file.4 | Structural layout analysis, OCR processing, and table cell formatting.4 | Ephemeral, middle-tier object. Deleted or archived after downstream parsing completes.1 |
| Corpus Object | The primary, metadata-anchored unit of governed source knowledge.3 | Identity assignment, security metadata mapping, and source authority evaluation.8 | Permanent. Serves as the primary parent asset in the knowledge database.1 |
| Canonical Chunk | A semantically cohesive segment of text derived from a parent Corpus Object.1 | Recursive or structure-aware text splitting with parent-child metadata pointers.2 | Stored permanently in the vector database alongside its embeddings.1 |
| Extracted Claim | An atomic, model-verifiable assertion derived from a document (e.g., a single fact).5 | Multi-agent schema extraction, logic isolation, and confidence scoring.4 | Stored in relational tables or graph databases for factual contradiction checks.12 |
| Embedding Record | High-dimensional vector representation of a Canonical Chunk.1 | Vectorization via an embedding model (e.g., text-embedding-3-large).1 | Kept in sync with its corresponding chunk. Deleted immediately if the chunk is deprecated. |
| Generated Summary | An LLM-synthesized overview of a Corpus Object or cluster.5 | Summarization routines run during document ingestion.5 | Stored in the metadata index to accelerate broad, top-down discovery queries.1 |
| Entity Record | A unique node in the enterprise knowledge graph mapping to a real-world concept.9 | Named Entity Recognition (NER), entity clustering, and alias table mapping.9 | Permanent. Updated dynamically as new documents reference the target entity.17 |
| Citation | A structural pointer mapping generated text back to its exact document coordinates.11 | Real-time context alignment and source-evidence verification during model generation.4 | Session-bound. Stored in system telemetry logs for audit trail compliance.13 |
| Context Object | A managed, validated, and permission-filtered context block loaded into active memory. | Retrieval execution, security pre-filtering, and token-budget context compilation.8 | Transient. Deleted immediately upon completion of the generation session.7 |
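Lineage preservation between a Corpus Object and its Canonical Chunks could look roughly like the following. The class names, the blank-line splitting rule, and the `object_id#pN` identifier format are illustrative assumptions; the only property the sketch is meant to demonstrate is that every derived chunk carries a parent pointer and the parent's security metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusObject:
    object_id: str
    source_uri: str
    sensitivity_class: str
    allowed_groups: frozenset
    text: str


@dataclass(frozen=True)
class CanonicalChunk:
    chunk_id: str
    parent_id: str           # lineage pointer back to the Corpus Object
    source_uri: str          # inherited provenance
    sensitivity_class: str   # inherited, never downgraded by the splitter
    allowed_groups: frozenset  # inherited ACL
    text: str


def derive_chunks(parent):
    """Split on blank lines and copy the parent's governance metadata onto each chunk."""
    paragraphs = [p.strip() for p in parent.text.split("\n\n") if p.strip()]
    return [
        CanonicalChunk(
            chunk_id=f"{parent.object_id}#p{i}",
            parent_id=parent.object_id,
            source_uri=parent.source_uri,
            sensitivity_class=parent.sensitivity_class,
            allowed_groups=parent.allowed_groups,
            text=paragraph,
        )
        for i, paragraph in enumerate(paragraphs)
    ]
```

Because the chunk constructor requires the inherited fields, a chunk cannot be created without them; the same pattern would extend to embeddings, summaries, and claims.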
To establish a standard terminology across the engineering canon, the following table defines the critical concepts of corpus design and knowledge supply chain control:
| Term | Technical Definition | Operational Implication |
|---|---|---|
| Corpus Engineering | The systematic discipline of designing, constructing, and maintaining the end-to-end organizational knowledge supply chain.1 | Shifts focus from ad-hoc chunking to structured, metadata-rich, lineage-tracked asset management. |
| Corpus Object | The canonical, uniquely identified unit of governed source knowledge containing both structural content and administrative metadata.1 | Serves as the primary parent asset from which chunks, embeddings, and summaries are derived. |
| Knowledge Supply Chain | The multi-stage pipeline that ingests raw data, enforces hygiene, extracts metadata, evaluates authority, and indexes objects safely.4 | Ensures every step of document transformation is deterministic, validated, and auditable. |
| Data Provenance | Detailed documentation tracking the origin, authorship, ownership, and chain of custody of a knowledge asset.3 | Enables downstream models to cite sources with verifiable structural authority.6 |
| Data Lineage | The complete history of structural transformations, parsers, and processing steps applied to an asset.3 | Allows tracing of downstream artifacts (such as chunks or embeddings) back to exact source versions. |
| Source Authority | A quantitative or categorical metric representing the structural weight, trust level, and compliance status of a source system.10 | Prevents low-authority draft documents from overriding high-authority systems of record during retrieval. |
| Knowledge Hygiene | The operational practice of extracting noise, resolving OCR/parsing errors, and deleting duplicate information.1 | Eliminates context-window bloating, redundant embedding costs, and model distraction.2 |
| Canonical ID | A persistent, globally unique identifier assigned to an entity, document, or concept within the enterprise knowledge graph.9 | Prevents duplicate records by establishing a single source of truth across disparate source files. |
| Alias Table | A mapping schema that associates surface-name variations, historical acronyms, and abbreviations with a single Canonical ID.9 | Enables the retrieval system to resolve lexical variants to the same underlying entity profile.16 |
| Deduplication | The removal of exact, near-exact, or semantic duplicates from the active index using lexical hashes or semantic clustering.1 | Lowers overall storage footprint, reduces inference latency, and prevents bias in generation.22 |
| Normalization | The standardization of document formats, metadata fields, timestamps, schemas, and locales across all ingested documents.1 | Ensures uniform query filtering and reliable metadata-based partitioning in the database. |
| Sensitivity Inheritance | The rule forcing derived elements (chunks, embeddings, summaries) to carry the classification of their parent source.8 | Prevents the accidental laundering of classified or sensitive information into open context windows.7 |
| Permission-Aware Indexing | The injection of access control lists (ACLs) directly into vector indexes and metadata schemas at ingestion time.8 | Evaluates user and group authorization prior to vector scoring, preventing side-channel leakage.7 |
| Supersession | A directed relationship linking an outdated version of a document to its valid, current counterpart. | Directs active retrievals to the current standard while preserving old versions for audit trails. |
| Retention Class | A metadata attribute defining the legally mandated lifespan of a document based on regulatory compliance.14 | Automates the safe, defensible deletion of expired records from physical storage and vector indexes.14 |
| Archival State | The lifecycle stage indicating if an object is active, cold-stored, under a legal hold, or marked for permanent deletion.14 | Segregates active business context from historical, compliance-only records.14 |
| Redaction Propagation | The programmatic removal of sensitive terms or PII across all downstream chunks, summaries, and embeddings. | Guarantees that deleting or masking a term in a source document instantly sanitizes all derivative objects. |
| Conflict Status | A metadata flag indicating that an object contains claims that contradict another active knowledge source.10 | Triggers specialized multi-step reasoning models to isolate and resolve epistemic differences.12 |
| Citation Fidelity | The mathematical and structural match between the generated answer text and the referenced source coordinates.11 | Detects and blocks “citation laundering,” where irrelevant files are cited to support hallucinated claims.11 |
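The relationship between an Alias Table and Canonical IDs can be illustrated with a small lookup. The entity names, the `ent-` identifiers, and the normalization rule (lowercase, strip periods, collapse whitespace) are hypothetical choices for this sketch; production entity resolution typically needs fuzzier matching.

```python
# Hypothetical alias table: normalized surface form -> Canonical ID.
ALIAS_TABLE = {
    "acme corp": "ent-001",
    "acme corporation": "ent-001",
    "acme": "ent-001",
    "globex": "ent-002",
}


def normalize(surface):
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(surface.lower().replace(".", "").split())


def resolve(surface, alias_table=ALIAS_TABLE):
    """Return the Canonical ID for a surface form, or None if it is unknown."""
    return alias_table.get(normalize(surface))
```

Under these assumptions, `resolve("ACME Corp.")` and `resolve("  Acme   Corporation")` both map to `ent-001`, while an unregistered name returns `None` and can be routed to review instead of silently creating a duplicate entity.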
A robust, enterprise-grade metadata payload must be bound to every Corpus Object at ingestion. This metadata allows downstream retrieval engines, security checkers, and context compilers to exclude out-of-scope, invalid, or unauthorized material before any search algorithm runs.7
┌────────────────────────────────────────────────────────────────────────┐
│                          CORPUS OBJECT SCHEMA                          │
├────────────────────────────────────────────────────────────────────────┤
│ IDENTITY      : object_id (UUID), canonical_source_id (UUID)           │
│ ORIGIN        : source_system (Str), source_uri (Str), jurisdiction    │
│ PROVENANCE    : creator (Str), owner (Str), steward (Str)              │
│ LIFECYCLE     : version_id (Str), valid_from, valid_until (Timestamps) │
│ STATE         : version_state (Enum), archival_state (Enum)            │
│ RELATIONSHIPS : supersedes (UUID), superseded_by (UUID), lineage_parent│
│ SECURITY      : sensitivity_class (Enum), permission_scope (JSON/ACL)  │
│ COMPLIANCE    : retention_class (Enum), legal_hold_status (Bool)       │
│ EPISTEMIC     : source_authority (Float), conflict_status (Enum)       │
└────────────────────────────────────────────────────────────────────────┘
The fields below define the technical metadata schema of the standard Corpus Object:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CorpusObjectSchema",
  "type": "object",
  "properties": {
    "identity": {
      "type": "object",
      "properties": {
        "object_id": { "type": "string", "format": "uuid" },
        "object_type": { "type": "string", "enum": ["document", "chunk", "claim", "table_row", "entity_profile", "summary"] },
        "canonical_source_id": { "type": "string", "format": "uuid" }
      },
      "required": ["object_id", "object_type", "canonical_source_id"]
    },
    "origin": {
      "type": "object",
      "properties": {
        "source_system": { "type": "string" },
        "source_uri": { "type": "string", "format": "uri" },
        "jurisdiction": { "type": "string" },
        "product_scope": { "type": "array", "items": { "type": "string" } }
      },
      "required": ["source_system", "source_uri"]
    },
    "provenance": {
      "type": "object",
      "properties": {
        "creator": { "type": "string" },
        "owner": { "type": "string" },
        "steward": { "type": "string" },
        "creation_date": { "type": "string", "format": "date-time" },
        "ingestion_date": { "type": "string", "format": "date-time" },
        "observed_date": { "type": "string", "format": "date-time" }
      },
      "required": ["creator", "owner", "steward", "creation_date", "ingestion_date"]
    },
    "lifecycle": {
      "type": "object",
      "properties": {
        "valid_from": { "type": "string", "format": "date-time" },
        "valid_until": { "type": "string", "format": "date-time" },
        "version_id": { "type": "string" },
        "version_state": { "type": "string", "enum": ["draft", "approved", "active", "deprecated", "superseded", "archived", "deleted"] },
        "supersedes": { "type": "string", "format": "uuid" },
        "superseded_by": { "type": "string", "format": "uuid" },
        "lineage_parent": { "type": "string", "format": "uuid" },
        "lineage_children": { "type": "array", "items": { "type": "string", "format": "uuid" } },
        "transformation_history": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "activity_id": { "type": "string", "format": "uuid" },
              "parser_method": { "type": "string" },
              "normalization_method": { "type": "string" },
              "timestamp": { "type": "string", "format": "date-time" }
            }
          }
        }
      },
      "required": ["valid_from", "version_id", "version_state"]
    },
    "security": {
      "type": "object",
      "properties": {
        "sensitivity_class": { "type": "string", "enum": ["public", "internal", "confidential", "secret"] },
        "redaction_status": { "type": "string", "enum": ["unredacted", "pii_masked", "privilege_cleared"] },
        "permission_scope": {
          "type": "object",
          "properties": {
            "allowed_users": { "type": "array", "items": { "type": "string" } },
            "allowed_roles": { "type": "array", "items": { "type": "string" } },
            "allowed_groups": { "type": "array", "items": { "type": "string" } },
            "row_level_security_tags": { "type": "array", "items": { "type": "string" } }
          }
        },
        "tenant_scope": { "type": "string" }
      },
      "required": ["sensitivity_class", "redaction_status", "permission_scope", "tenant_scope"]
    },
    "compliance": {
      "type": "object",
      "properties": {
        "retention_class": { "type": "string" },
        "archival_state": { "type": "string", "enum": ["active", "cold_archive", "legal_hold", "disposed"] },
        "legal_hold_status": { "type": "boolean" },
        "deletion_timestamp": { "type": "string", "format": "date-time" },
        "review_trigger": {
          "type": "object",
          "properties": {
            "trigger_type": { "type": "string", "enum": ["temporal", "event_driven"] },
            "trigger_rule": { "type": "string" },
            "next_review_date": { "type": "string", "format": "date-time" }
          }
        }
      },
      "required": ["retention_class", "archival_state", "legal_hold_status"]
    },
    "epistemic": {
      "type": "object",
      "properties": {
        "canonical_entity_ids": { "type": "array", "items": { "type": "string", "format": "uuid" } },
        "aliases": { "type": "array", "items": { "type": "string" } },
        "extracted_claims": { "type": "array", "items": { "type": "object" } },
        "source_authority": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
        "confidence_score": { "type": "number", "minimum": 0.0, "maximum": 1.0 },
        "citation_label": { "type": "string" },
        "conflict_status": { "type": "string", "enum": ["resolved", "disputed", "uncontested"] }
      },
      "required": ["source_authority", "citation_label", "conflict_status"]
    }
  },
  "required": ["identity", "origin", "provenance", "lifecycle", "security", "compliance", "epistemic"]
}
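The lifecycle and compliance fields of the schema lend themselves to a simple eligibility check and a supersession walk. The sketch below uses the schema's field names but makes two policy assumptions that the schema itself does not state: only `approved` and `active` versions are retrievable, and any `archival_state` other than `active` excludes the object. Timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime, timezone

# Assumed policy: which version states may be served to a model.
RETRIEVABLE_STATES = {"approved", "active"}


def is_retrievable(obj, now):
    """Check version state, archival state, and the validity window at time `now`."""
    lifecycle, compliance = obj["lifecycle"], obj["compliance"]
    if lifecycle["version_state"] not in RETRIEVABLE_STATES:
        return False
    if compliance["archival_state"] != "active":
        return False
    if now < datetime.fromisoformat(lifecycle["valid_from"]):
        return False
    valid_until = lifecycle.get("valid_until")  # optional in the schema
    if valid_until is not None and now >= datetime.fromisoformat(valid_until):
        return False
    return True


def resolve_current(object_id, store):
    """Follow superseded_by links to the newest version, guarding against cycles."""
    seen = set()
    while True:
        if object_id in seen:
            raise ValueError(f"supersession cycle at {object_id}")
        seen.add(object_id)
        successor = store[object_id]["lifecycle"].get("superseded_by")
        if successor is None:
            return object_id
        object_id = successor
```

Here `store` is a hypothetical mapping from `object_id` to the stored Corpus Object; a retrieval layer would call `resolve_current` on a hit and then confirm the result with `is_retrievable` before assembling context.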
The transformation of raw source systems into governed, context-ready Corpus Objects requires a strict, multi-stage processing pipeline.1 The pipeline runs continuously, using micro-batches and event-driven updates rather than episodic bulk ingestion, so that the corpus stays synchronized with changes in the source systems.6
                     RAW SOURCE SYSTEMS
                              │
      ┌───────────────────────┼───────────────────────┐
      ▼                       ▼                       ▼
 SharePoint              Confluence             SQL Databases
      │                       │                       │
      └───────────────────────┬───────────────────────┘
                              │
              STAGE 1: Discovery & Registration
                              │
                STAGE 2: Permission Preflight
                              │
                  STAGE 3: Ingestion & Lock
                              │
                   STAGE 4: Parsing & OCR
                              │
                STAGE 5: Format Normalization
                              │
                STAGE 6: Boilerplate Removal
                              │
              STAGE 7: Redaction & PII Masking
                              │
                   STAGE 8: Deduplication
                              │
                 STAGE 9: Entity Resolution
                              │
                 STAGE 10: Metadata Tagging
                              │
                 STAGE 11: Authority Scoring
                              │
                 STAGE 12: Lineage Recording
                              │
                   STAGE 13: Index Loading
                              │
                              ▼
                   GOVERNED CORPUS OBJECTS
Each stage in the knowledge supply chain operates with a distinct architectural goal, processing raw data deterministically and documenting the execution metrics at every transition point:
| Stage | Process Name | Operational Execution Mechanics | Failure Impact on Retrieval Quality |
|---|---|---|---|
| 1 | Discovery & Registration | Continuous scanning of enterprise data system connectors; generates a unique, persistent asset ID for discovered files.4 | Unregistered files create blind spots in the enterprise search space. |
| 2 | Permission Preflight | Inherits and caches access control lists (ACLs) from the source system prior to parsing the file content.23 | Skipped or stale ACL capture lets unauthorized files into the index and introduces permission drift.6 |
| 3 | Ingestion & Lock | Pulls raw binary bytes; creates a write-locked, immutable replica inside a temporary landing bucket. | Modifications to the source file during processing can corrupt downstream parser outputs. |
| 4 | Parsing & OCR | Runs layout analysis, separates text from figures, and processes scanned documents using OCR.4 | OCR errors misinterpret numeric data, corrupting downstream generation.15 |
| 5 | Format Normalization | Translates parsed elements into a structured, strict UTF-8 Markdown representation.4 | Unnormalized syntax structure breaks consistent chunking algorithms.2 |
| 6 | Boilerplate Removal | Strips headers, footers, licensing text, and web navigation markers.2 | Boilerplate noise pollutes vector search spaces, inflating context costs.1 |
| 7 | Redaction & PII Masking | Runs regex and NER classifiers to redact or replace credentials and customer PII.2 | Missing redactions expose confidential records to external processing endpoints.7 |
| 8 | Deduplication | Runs exact byte hashes and MinHash LSH queries to identify and drop duplicate files.1 | Redundant files inflate index sizes and cause bias in retrieved content.22 |
| 9 | Entity Resolution | Maps extracted text entities to unique Graph IDs using alias tables.9 | Ambiguous entities lead to retrieval errors across similar projects or users.17 |
| 10 | Metadata Tagging | Evaluates document structural headers, mapping attributes to the target JSON schema.1 | Missing metadata tags prevent effective query-time pre-filtering.1 |
| 11 | Authority Scoring | Computes a quantitative authority score based on origin system and review status.10 | Without a score, retrieval treats unapproved drafts as equal to systems of record. |
| 12 | Lineage Recording | Logs the activities, agents, and processing steps in a PROV-O-compliant ledger.18 | Inability to audit the transformation history limits forensic debugging. |
| 13 | Index Loading | Loads the finalized, partitioned vectors, keywords, and graph records into active storage.1 | Out-of-sync indexes serve stale or deleted knowledge assets.6 |
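As an illustration of Stage 8, the sketch below combines an exact SHA-256 content hash with a near-duplicate check. Note the substitution: it uses plain Jaccard similarity over word shingles with a pairwise comparison, which is a simpler stand-in for the MinHash LSH approach named in the table and would not scale to a large corpus. The shingle size and the 0.8 threshold are arbitrary assumptions.

```python
import hashlib


def content_hash(text):
    """Exact-duplicate fingerprint of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of k-word shingles from the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep the first occurrence; report later exact and near duplicates.

    `docs` maps document ID to text. Comparison is pairwise, so this is O(n^2).
    """
    kept, seen_hashes, dropped = {}, set(), {}
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if digest in seen_hashes:
            dropped[doc_id] = "exact"
            continue
        signature = shingles(text)
        if any(jaccard(signature, other) >= threshold for other in kept.values()):
            dropped[doc_id] = "near"
            continue
        seen_hashes.add(digest)
        kept[doc_id] = signature
    return list(kept), dropped
```

Returning the dropped IDs with a reason, rather than discarding them silently, keeps the decision auditable in the lineage record.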
The parser selected for Stage 4 dictates the structural quality ceiling of the entire corpus.2 If the parser misaligns tables, merges distinct columns, or ignores layout hierarchies, downstream semantic retrieval is fundamentally limited.15
The following evaluation table compares three industry-standard document parsing engines based on empirical evaluation data 15:
| Metric / Capability | Docling (IBM Research) | Unstructured (Unstructured.io) | LlamaParse (LlamaIndex) |
|---|---|---|---|
| Core Structural Engine | Layout analysis via DocLayNet; table cell parsing via TableFormer.15 | Transformer-based layout analysis and OCR pipeline.15 | Agentic OCR with expert model orchestration.4 |
| Table Cell Extraction Accuracy | 97.9% accuracy on multi-level, nested, and complex table structures.15 | 100% on simple tables, but drops to 75% on complex, multi-row structures.15 | 100% on simple tables, but fails on complex layouts with column shifts.15 |
| Text Extraction Fidelity | High. Preserves paragraph breaks and technical phrasing.15 | Inconsistent line breaks; merges paragraph breaks; prone to over-extraction.15 | Inconsistent with complex layouts; prone to word merging and content hallucinations.15 |
| Single-Page Parse Latency | 6.28 seconds.15 | 51.06 seconds (very slow; highly inefficient for large pipelines).15 | ~6.00 seconds.15 |
| 50-Page Parse Latency | 65.12 seconds (demonstrates predictable, linear scaling).15 | 141.02 seconds (non-linear scaling issues).15 | ~6.00 seconds (highly optimized cloud extraction).15 |
| Markdown Structural Output | Good. Clear section headers, though nested headings are sometimes flattened to ##.15 | Basic structure extraction; struggles with multi-level section hierarchies.15 | Struggles with deep layout hierarchies; merges section boundaries into blocks.15 |
| Downstream Architectural Fit | Best for processing technical documents and dense financial reports.15 | Suitable for batch-oriented ETL data normalization but restricted by latency limitations.15 | Best for high-velocity, lightweight agent workflows where speed is prioritized over table accuracy.15 |
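The two latency rows can be combined into a rough cost model. The sketch below assumes, purely for illustration, that parse time is a fixed overhead plus a constant per-page cost, and fits that line through the single-page and 50-page figures reported in the table. Two measurements cannot confirm or rule out non-linear behavior, so the result is a reading aid rather than a benchmark.

```python
# (single-page seconds, 50-page seconds) as reported in the table above.
REPORTED_SECONDS = {
    "Docling": (6.28, 65.12),
    "Unstructured": (51.06, 141.02),
}


def fit_fixed_plus_per_page(t_single, t_many, pages_many):
    """Fit t(p) = overhead + per_page * p through (1, t_single) and (pages_many, t_many)."""
    per_page = (t_many - t_single) / (pages_many - 1)
    overhead = t_single - per_page
    return overhead, per_page


def estimate_seconds(parser, pages):
    """Projected parse time under the assumed fixed-plus-per-page model."""
    overhead, per_page = fit_fixed_plus_per_page(*REPORTED_SECONDS[parser], 50)
    return overhead + per_page * pages
```

Under this two-point fit, Docling's marginal cost is roughly 1.2 seconds per page and Unstructured's roughly 1.8 seconds per page, with most of Unstructured's single-page latency attributable to fixed overhead. LlamaParse is omitted because its two reported figures are both approximately 6 seconds.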
A mature enterprise corpus must account for the reality that organizational documents carry varying levels of authority.10 Treating a casual conversation or an unapproved draft with the same weight as a validated system-of-record entry introduces serious compliance and factual risks.10
To address this, the system assigns a quantitative Source Authority Score (denoted As, in the range [0.0, 1.0]) to every ingested resource.10 At query time, this score weights each candidate during vector retrieval scoring.
The table below defines the authority tiers of enterprise knowledge sources:
| Source Class | Target System | As | Volatility | Ownership and Stewards | Compliance Auditing | Appropriate Context Use Case |
|---|---|---|---|---|---|---|
| Tier 1: Enterprise System of Record | Production PostgreSQL DB, Core SAP, Salesforce.6 | 1.00 | Extremely Low | IT Database Administrators, Security Officers. | Complete ACID transaction logs and security reviews. | Verifying customer transaction values, current product stock, and billing amounts. |
| Tier 2: Certified Corporate Policy | Legal DMS, Approved HR Handbooks, Regulatory Manuals.14 | 0.90 | Low | Legal Counsel, Department Stewards.14 | Annual reviews and formal revision control.14 | Authoritative regulatory compliance answers and security protocols. |
| Tier 3: Product Specifications | GitHub Releases, Jira Epic specs, Product databases. | 0.75 | Moderate | Product Managers, Engineering Leads. | Code repository history and commit signatures. | Resolving current product hardware dimensions and API schemas. |
| Tier 4: Dynamic Team Wikis | Confluence, Notion Teams, Shared Google Docs.13 | 0.50 | High | Collaborative team creators. | Basic document modification history. | Finding internal team operational guides and system deployment tips. |
| Tier 5: External Web Scrapes | Scraped partner APIs, public vendor blogs.4 | 0.30 | Very High | Automated crawl agents. | Excluded from primary compliance boundaries. | Competitor pricing trends and industry news aggregation.4 |
| Tier 6: User-Generated Uploads | Active chat uploads, unclassified local PDFs.13 | 0.20 | Dynamic | End-user during active session.13 | Sandbox scanning and DLP inspection logs.13 | Dynamic analysis and summarization of local user files.13 |
| Tier 7: Generated Summaries | Background summaries, model claim extractions.5 | 0.10 | Instantaneous | Background AI processors, verification agents. | Complete PROV-O lineage tracking.18 | Accelerating broad topic discovery and semantic link indexing. |
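As an illustration of how the authority score could act as a weight at query time, the sketch below multiplies each candidate's vector similarity by the A_s value of its tier. The multiplicative formula, the `Candidate` class, and the sample chunk IDs are assumptions made for this example; the report states only that A_s is used as a weight factor, not how it is combined with similarity.

```python
from dataclasses import dataclass
from typing import List

# A_s values per tier, copied from the authority table above.
AUTHORITY_BY_TIER = {1: 1.00, 2: 0.90, 3: 0.75, 4: 0.50, 5: 0.30, 6: 0.20, 7: 0.10}


@dataclass
class Candidate:
    chunk_id: str      # hypothetical identifier
    similarity: float  # vector similarity, assumed to lie in [0, 1]
    tier: int          # source tier from the table (1-7)


def weighted_score(candidate: Candidate) -> float:
    # Assumption: authority scales similarity multiplicatively.
    return candidate.similarity * AUTHORITY_BY_TIER[candidate.tier]


def rank(candidates: List[Candidate]) -> List[Candidate]:
    # Highest authority-weighted score first.
    return sorted(candidates, key=weighted_score, reverse=True)


candidates = [
    Candidate("wiki-tip", similarity=0.92, tier=4),
    Candidate("hr-policy", similarity=0.80, tier=2),
    Candidate("summary", similarity=0.95, tier=7),
]
ranked = rank(candidates)
print([c.chunk_id for c in ranked])
```

In this sample the Tier 2 policy chunk outranks both the wiki page and the generated summary even though its raw similarity is the lowest of the three.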
To guarantee complete auditability, the corpus platform must map every step of document discovery, parsing, translation, and chunking using the W3C Provenance Ontology (PROV-O).18 PROV-O uses a tripartite structure: Entities (assets, chunks, or claims), Activities (parser runs, chunk splits, or translation steps), and Agents (software systems or curators).29
The RDF Turtle block below illustrates a complete lineage trail from a raw PDF document to a retrieved citation 18:
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix corp: <http://enterprise.ai/corpus/> .
@prefix agent: <http://enterprise.ai/agents/> .
@prefix act: <http://enterprise.ai/activities/> .

# 1. Raw Source Entity representation
corp:raw_hr_policy_pdf
    a prov:Entity ;
    prov:wasAttributedTo agent:hr_policy_director ;
    prov:generatedAtTime "2026-01-15T09:00:00Z"^^xsd:dateTime .

# 2. Parsing Activity
act:docling_parser_run_452
    a prov:Activity ;
    prov:startedAtTime "2026-06-06T10:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime "2026-06-06T10:00:06Z"^^xsd:dateTime ;
    prov:wasAssociatedWith agent:docling_parser_engine .

# 3. Normalized Document Entity
corp:normalized_hr_policy_md
    a prov:Entity ;
    prov:wasDerivedFrom corp:raw_hr_policy_pdf ;
    prov:wasGeneratedBy act:docling_parser_run_452 .

# 4. Chunking Activity
act:recursive_splitter_run_893
    a prov:Activity ;
    prov:startedAtTime "2026-06-06T10:01:00Z"^^xsd:dateTime ;
    prov:wasAssociatedWith agent:chunking_service .

# 5. Canonical Chunk Entity
corp:canonical_chunk_segment_12
    a prov:Entity ;
    prov:wasDerivedFrom corp:normalized_hr_policy_md ;
    prov:wasGeneratedBy act:recursive_splitter_run_893 .

# 6. RAG Retrieval Activity
act:vector_retrieval_query_901
    a prov:Activity ;
    prov:startedAtTime "2026-06-06T13:45:00Z"^^xsd:dateTime ;
    prov:wasAssociatedWith agent:retrieval_orchestrator .

# 7. Context Object Entity
corp:context_object_claim_901
    a prov:Entity ;
    prov:wasDerivedFrom corp:canonical_chunk_segment_12 ;
    prov:wasGeneratedBy act:vector_retrieval_query_901 .

# Agents
agent:hr_policy_director a prov:Agent .
agent:docling_parser_engine a prov:Agent ; prov:actedOnBehalfOf agent:it_data_platform_steward .
agent:chunking_service a prov:Agent .
agent:retrieval_orchestrator a prov:Agent .
agent:it_data_platform_steward a prov:Agent .
Using this structural ledger, a downstream model or auditor can perform recursive graph queries to trace a generated claim back to its raw source.18 If a model cites context_object_claim_901, the system can trace the chain back through canonical_chunk_segment_12 and normalized_hr_policy_md to confirm that the source was raw_hr_policy_pdf, which was created by hr_policy_director.18
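The trace described above can be sketched without a triple store. The example below keeps the `prov:wasDerivedFrom` and `prov:wasAttributedTo` links from the Turtle listing in plain dictionaries and walks them back to the raw source. The dictionary layout and the `trace_lineage` helper are assumptions for illustration; a production system would more plausibly run a SPARQL property-path query over the RDF graph.

```python
from typing import Dict, List

# prov:wasDerivedFrom edges taken from the Turtle listing above (child -> parent).
WAS_DERIVED_FROM: Dict[str, str] = {
    "corp:context_object_claim_901": "corp:canonical_chunk_segment_12",
    "corp:canonical_chunk_segment_12": "corp:normalized_hr_policy_md",
    "corp:normalized_hr_policy_md": "corp:raw_hr_policy_pdf",
}

# prov:wasAttributedTo edges taken from the same listing.
WAS_ATTRIBUTED_TO: Dict[str, str] = {
    "corp:raw_hr_policy_pdf": "agent:hr_policy_director",
}


def trace_lineage(entity: str) -> List[str]:
    """Follow wasDerivedFrom links until an entity with no parent is reached."""
    chain = [entity]
    seen = {entity}
    while chain[-1] in WAS_DERIVED_FROM:
        parent = WAS_DERIVED_FROM[chain[-1]]
        if parent in seen:
            # A cycle would indicate a corrupted ledger; stop rather than loop.
            raise ValueError(f"lineage cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


chain = trace_lineage("corp:context_object_claim_901")
root_agent = WAS_ATTRIBUTED_TO.get(chain[-1])
print(" -> ".join(chain), "| attributed to", root_agent)
```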
Unstructured enterprise information frequently contains layout differences, formatting noise, and duplicate content.20 If ignored, this noise pollutes semantic indexes and causes retrieved results to contain redundant or conflicting sections.1
+-------------------------------------------------------------------------------------------------+
| NORMALIZATION, DEDUPLICATION, AND CANONICALIZATION |
+-------------------------------------------------------------------------------------------------+
|
| Goal: clean raw source material, remove duplicate knowledge, retain the highest-authority
| version, and map entity variants to canonical graph records before retrieval.
|
| [ Raw Sources ]
| |
| v
| +------------------------------------------------------------------------------------------+
| | 1. Format Normalization |
| | |
| | UTF-8 cleanup | Markdown structure | ISO 8601 timestamps | SI units | locale tags |
| | section boundaries | table structure | citation coordinates | schema-normalized fields |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 2. Noise and Safety Cleanup |
| | |
| | remove boilerplate | strip headers/footers/nav | redact PII/secrets | repair OCR errors |
| | preserve source coordinates and transformation lineage |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 3. Duplicate Detection |
| | |
| | exact byte hash -> lexical hash -> MinHash LSH -> semantic similarity clustering |
| | |
| | near-duplicate threshold example: Jaccard >= 0.80 |
| +---------------------------------------------+--------------------------------------------+
| |
| [ Duplicate or Near-Duplicate? ]
| / \
| No / \ Yes
| v v
| [ Continue to Entity Pass ] +----------------------------------+
| | 4. Duplicate Resolution |
| | |
| | Same version/timestamp: discard |
| | Minor version difference: retain |
| | current or higher-authority copy |
| | System-of-record conflict: |
| | retain stronger source and flag |
| | weaker source for review |
| +----------------+-----------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 5. Supersession and Lineage Update |
| | |
| | mark old object superseded | link superseded_by | preserve audit trail | update indexes |
| | ensure derived chunks, embeddings, summaries, and claims follow the retained parent |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 6. Entity Extraction and Alias Matching |
| | |
| | extract entity mentions -> match aliases -> compare graph neighborhoods -> score match |
| | examples: "Sony", "Sony Interactive", "SIE" -> canonical entity candidate |
| +---------------------------------------------+--------------------------------------------+
| |
| [ Canonical Entity Match? ]
| / | \
| / | \
| v v v
| High Confidence Ambiguous Match No Match
| | | |
| v v v
| [ Merge to Canonical ID ] [ Review Queue ] [ Create New Entity ]
| | | |
| +----------------+----------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 7. Governed Corpus Update |
| | |
| | write canonical IDs, aliases, parent pointers, duplicate decisions, authority choices, |
| | redaction state, and lineage events to the corpus object store and graph index. |
| +------------------------------------------------------------------------------------------+
|
+-------------------------------------------------------------------------------------------------+
| Rule: normalize before splitting, deduplicate before indexing, and canonicalize entities before |
| writing graph memory. Otherwise the retrieval system learns your filing cabinet’s bad habits. |
+-------------------------------------------------------------------------------------------------+
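The ingestion pipeline enforces strict normalization standards to ensure uniform metadata filtering and clean text formatting: UTF-8 text cleanup, Markdown structure with preserved section and table boundaries, ISO 8601 timestamps, SI units, locale tags, citation coordinates, and schema-normalized fields.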
The pipeline uses MinHash Locality-Sensitive Hashing (LSH) to identify near-duplicate documents and chunks, preventing redundant files from cluttering the index.1
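For document token sets A and B, the system calculates their Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
MinHash approximates this similarity without comparing full token sets. Each hash function simulates a random permutation of the tokens, and a document's signature records the minimum hash value under each function.22 For any single hash function, the probability that two documents produce the same minimum equals their Jaccard similarity:
$$P(\mathrm{MinHash}(A) = \mathrm{MinHash}(B)) = J(A, B)$$
By building a signature matrix from k = 128 independent hash functions and splitting its rows into b bands of r rows each (so that b · r = k), the probability that a document pair hashes to the same bucket in at least one band is:
$$P_{\mathrm{match}} = 1 - (1 - s^r)^b$$
where s = J(A, B).22 The system uses b = 16 and r = 8 together with a Jaccard similarity threshold T = 0.80.1 With these band settings the candidate curve rises most steeply near s ≈ (1/b)^(1/r) ≈ 0.71, and a pair with s = 0.80 becomes a candidate with a probability of about 0.95. Because banding only nominates candidate pairs, a candidate is flagged as a duplicate when its estimated Jaccard score meets or exceeds T.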
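To make the banding arithmetic concrete, the sketch below computes an exact Jaccard score for two short token sets and evaluates the band-collision formula for b = 16 and r = 8. The function names and sample sentences are assumptions for this example, and `approximate_threshold` uses the common (1/b)^(1/r) rule of thumb for where the candidate curve rises most steeply.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def band_match_probability(s: float, bands: int, rows: int) -> float:
    """P_match = 1 - (1 - s^r)^b: chance that a pair shares at least one band bucket."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Rule-of-thumb similarity at which the candidate curve rises most steeply."""
    return (1.0 / bands) ** (1.0 / rows)


doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over the lazy dog".split())

s = jaccard(doc_a, doc_b)  # 7 shared tokens out of 9 distinct tokens
print(f"J(A, B) = {s:.3f}")
print(f"P_match at s = 0.80: {band_match_probability(0.80, bands=16, rows=8):.3f}")
print(f"steepest rise near s = {approximate_threshold(16, 8):.3f}")
```

The output shows that the band settings nominate most pairs at or above 0.80, but also some pairs below it, which is why the exact or estimated Jaccard score of each candidate is still checked against the threshold.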
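The system resolves duplicate flags using the following logic: copies with the same version and timestamp are discarded; where copies differ by a minor version, the current or higher-authority copy is retained; and where a copy conflicts with a system of record, the stronger source is retained and the weaker source is flagged for review.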
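The resolution rules shown in the Duplicate Resolution box of the diagram above can be sketched as a small decision function. The `DocVersion` fields, the action labels, and the choice to treat any authority difference as a system-of-record conflict are assumptions made for this example. Timestamps are compared as strings, which is only safe when every value uses the same ISO 8601 format in UTC.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DocVersion:
    doc_id: str
    version: str
    modified_at: str  # ISO 8601, UTC, e.g. "2026-01-15T09:00:00Z"
    authority: float  # source authority score A_s


def resolve_duplicate(a: DocVersion, b: DocVersion) -> Tuple[str, str, str]:
    """Return (retained_id, displaced_id, action_for_displaced)."""
    # Same version and timestamp: the second copy carries no new information.
    if a.version == b.version and a.modified_at == b.modified_at:
        return a.doc_id, b.doc_id, "discard"
    # Different authority: keep the stronger source, send the weaker one to review.
    if a.authority != b.authority:
        keep, drop = (a, b) if a.authority > b.authority else (b, a)
        return keep.doc_id, drop.doc_id, "flag_for_review"
    # Equal authority, different version: keep the more recent copy.
    keep, drop = (a, b) if a.modified_at >= b.modified_at else (b, a)
    return keep.doc_id, drop.doc_id, "mark_superseded"


wiki_v1 = DocVersion("wiki-v1", "1.0", "2026-01-01T00:00:00Z", 0.50)
wiki_copy = DocVersion("wiki-copy", "1.0", "2026-01-01T00:00:00Z", 0.50)
wiki_v2 = DocVersion("wiki-v2", "1.1", "2026-02-01T00:00:00Z", 0.50)
record = DocVersion("sor-record", "2.0", "2025-01-01T00:00:00Z", 1.00)

print(resolve_duplicate(wiki_v1, wiki_copy))
print(resolve_duplicate(wiki_v1, wiki_v2))
print(resolve_duplicate(wiki_v2, record))
```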
When documents refer to a single entity using different names (e.g., “Sony”, “Sony Interactive”, “SIE”), standard vector searches return fragmented contexts.9 To prevent this, systems must implement Proxy-Pointer RAG.9
+------------------------------------------------------------------------------------------------+
| CANONICAL ENTITY AND ALIAS MODEL |
+------------------------------------------------------------------------------------------------+
|
| Goal: prevent entity sprawl by resolving names, aliases, acronyms, and local references to
| stable canonical IDs before writing chunks, claims, or graph edges.
|
| [ Ingested Document / Chunk / Claim ]
| |
| v
| +------------------------------------------------------------------------------------------+
| | 1. Entity Profile Construction |
| | |
| | Extract: surface name, entity type, local context, relationships, document section, |
| | tenant/project scope, source authority, language, and nearby entities. |
| | |
| | Example profile: |
| | surface_name = "Sony" |
| | related_terms = ["PlayStation", "SIE", "trademark", "console division"] |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 2. Multi-Track Query Decomposition |
| | |
| | Generate targeted checks: |
| | |
| | - raw name search: "Sony" |
| | - alias search: "Sony Interactive", "SIE" |
| | - relationship search: "owns PlayStation trademark" |
| | - scoped search: active tenant, project, corpus, jurisdiction |
| | - graph search: nearby nodes and existing canonical IDs |
| +----------------------+----------------------+----------------------+---------------------+
| | | |
| v v v
| +-------------------------+ +-----------------------+ +-----------------------------+
| | Vector / Hybrid Search | | Alias Table Lookup | | Knowledge Graph Lookup |
| | | | | | |
| | finds candidate chunks | | finds known spelling, | | finds existing nodes, |
| | and parent sections | | acronym, old-name | | edges, parent entities, |
| | | | variants | | and relationship patterns |
| +-----------+-------------+ +-----------+-----------+ +-------------+---------------+
| | | |
| +----------------------------+----------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 3. Parent-Section Loading |
| | |
| | Do not reconcile from isolated chunks alone. Load the structurally intact parent section|
| | so the reconciler can see local definitions, headings, tables, and relationship context.|
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 4. Reconciliation Scoring |
| | |
| | Compare candidate entities using: |
| | |
| | name similarity | alias match | entity type | relationship match | source authority |
| | tenant/project scope | graph-neighborhood fit | temporal validity | confidence score |
| +---------------------------------------------+--------------------------------------------+
| |
| [ Reconciliation Decision ]
| / | \
| / | \
| v v v
| Strong Existing Ambiguous Existing No Existing
| Match Match Match
| | | |
| v v v
| +-------------------------+ +--------------------+ +---------------------------+
| | Merge to Canonical ID | | Pending Entity | | Create Canonical Entity |
| | | | Review | | |
| | attach alias, update | | hold graph write, | | assign new ID, store |
| | confidence, preserve | | request resolver, | | aliases, provenance, |
| | lineage and source | | avoid false merge | | scope, and initial edges |
| +------------+------------+ +---------+----------+ +-------------+-------------+
| | | |
| +------------------------+--------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | 5. Graph and Corpus Write |
| | |
| | Write canonical_entity_id to chunks, claims, summaries, embeddings, citations, and |
| | Corpus Object metadata. Store alias history and reconciliation evidence for audit. |
| +------------------------------------------------------------------------------------------+
|
+------------------------------------------------------------------------------------------------+
| Rule: entity aliases are resolved at the database layer, not left for the model to guess during|
| answer generation. Retrieval should return canonical knowledge, not name soup. |
+------------------------------------------------------------------------------------------------+
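Under this model, the system processes entities in five stages: it builds an entity profile from the surface name, entity type, and local context; decomposes the lookup into raw-name, alias, relationship, scoped, and graph searches; loads the structurally intact parent section rather than reconciling from isolated chunks; scores each candidate against existing canonical entities; and then merges the mention, queues it for review, or creates a new canonical entity before writing the canonical_entity_id to the graph and corpus.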
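The reconciliation decision at the end of the diagram above can be sketched as a score followed by a three-way branch. The feature names come from the diagram's scoring box, while the weights, the two thresholds, and the `ent:sony` identifier are hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical weights and thresholds; a real system would calibrate these.
WEIGHTS = {"name_similarity": 0.4, "alias_match": 0.3, "relationship_match": 0.3}
MERGE_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.50


@dataclass
class Decision:
    action: str                  # "merge", "review", or "create"
    canonical_id: Optional[str]  # set only when the mention is merged


def score_candidate(features: Dict[str, float]) -> float:
    """Weighted sum of match features, each expected in [0, 1]."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def decide(candidate_id: Optional[str], score: float) -> Decision:
    if candidate_id is not None and score >= MERGE_THRESHOLD:
        return Decision("merge", candidate_id)
    if candidate_id is not None and score >= REVIEW_THRESHOLD:
        # Ambiguous match: hold the graph write to avoid a false merge.
        return Decision("review", None)
    return Decision("create", None)


strong = score_candidate({"name_similarity": 1.0, "alias_match": 1.0, "relationship_match": 0.8})
ambiguous = score_candidate({"name_similarity": 0.9, "alias_match": 0.0, "relationship_match": 0.8})

print(decide("ent:sony", strong))
print(decide("ent:sony", ambiguous))
print(decide(None, 0.0))
```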
A major vulnerability in enterprise AI search is data exposure through retrieved context.7 Unmanaged vector searches can pull sensitive text (such as salary tables or private communications) and pass them to unauthorized users simply because their semantic vectors match the query.7
To resolve this, the system must enforce access control checks before vector similarity scoring takes place.8 Applying permissions after retrieval is insecure: it allows unauthorized data to be fetched into memory and exposes side channels, such as differences in vector search latency or rerank scores computed over unauthorized chunks, from which private text content can be inferred.8
User Query + JWT Token
│
├──> Native Token-Based Query (Entra ID) ──> Extract claims (Roles, Groups)
│ ──> Match index security metadata
│ ──> Filter out unauthorized vectors BEFORE ANN search
│
└──> Security Filter Query Plan (Cerbos) ──> Evaluate ABAC/RBAC policy
──> Generate native SQL/Vector pre-filters
──> Execute secure, constrained retrieval
To prevent security gaps, any downstream summary, metadata field, extracted claim, evaluation dataset, prompt cache, or memory block derived from a parent source inherits the identical security classification and ACL constraints of that source unless explicitly downgraded by a certified administrative workflow.7
If a parent document is classified as confidential, all derived chunks and summaries inherit the confidential tag. This sensitivity propagation ensures that restricted data cannot be accidentally laundered through summaries, embeddings, vector indexes, or memory writes.
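A minimal sketch of this inheritance rule follows, assuming a hypothetical `CorpusObject` record and `derive` helper. Every derived object copies the classification and ACL of its parent, and the sketch offers no downgrade path, since the text reserves downgrades for a certified administrative workflow.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CorpusObject:
    object_id: str
    classification: str
    allowed_roles: FrozenSet[str]
    parent_id: str = ""  # empty for raw sources


def derive(parent: CorpusObject, object_id: str) -> CorpusObject:
    """Create a derived object (chunk, summary, embedding) that inherits the
    parent's classification and ACL unchanged."""
    return CorpusObject(
        object_id=object_id,
        classification=parent.classification,
        allowed_roles=parent.allowed_roles,
        parent_id=parent.object_id,
    )


source = CorpusObject("doc-salary-table", "confidential", frozenset({"hr_admin"}))
summary = derive(source, "summary-001")
embedding = derive(summary, "embedding-001")

# The tag survives two derivation hops.
print(embedding.classification, sorted(embedding.allowed_roles), embedding.parent_id)
```

Making the record immutable (`frozen=True`) is a design choice for the sketch: a derived object cannot have its classification edited in place after creation.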
The following Python implementation patterns demonstrate how to enforce permission-aware indexing and secure query filtering:
At ingestion time, every vector chunk is tagged with its tenant, allowed security groups, and classifications.8 The search engine enforces RBAC prior to approximate nearest neighbor (ANN) matching.8
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SecureChunk:
    """A vector chunk tagged with access-control metadata at ingestion time."""
    id: str
    text: str
    tenant_id: str
    allowed_roles: List[str]
    department: str
    classification: str
    doc_version: str


class SecureRetrievalEngine:
    def __init__(self, vector_store_client: Any):
        # Any client exposing search(vector=..., filter=..., limit=...).
        self.client = vector_store_client

    def rbac_prefilter(self, tenant_id: str, user_roles: List[str]) -> Dict[str, Any]:
        """
        Generates database-native pre-filtering conditions.
        Enforces that the tenant matches and the user possesses at least one matching role.
        """
        return {
            "and": [
                {"field": "tenant_id", "equals": tenant_id},
                {"field": "allowed_roles", "in": user_roles},
            ]
        }

    def secure_search(
        self,
        query_vector: List[float],
        tenant_id: str,
        user_roles: List[str],
        top_k: int = 10,
    ) -> List[Any]:
        # Generate the pre-filter block
        filter_conditions = self.rbac_prefilter(tenant_id, user_roles)
        # Execute the ANN vector search with pre-filtering enforced natively
        results = self.client.search(
            vector=query_vector,
            filter=filter_conditions,
            limit=top_k,
        )
        return results
For fine-grained, dynamic rules (e.g., checking user geography, department match, and clearance level simultaneously), the system utilizes an ABAC filter.8
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class UserContext:
    identity_id: str
    tenant_id: str
    departments: List[str]
    clearance_level: str  # e.g., 'public', 'internal', 'confidential'


class ABACEnforcer:
    @staticmethod
    def construct_abac_filter(user: UserContext) -> Dict[str, Any]:
        """
        Ensures users can only retrieve chunks matching their department,
        their exact tenant, and at or below their active clearance level.
        """
        clearance_hierarchy = {
            "public": ["public"],
            "internal": ["public", "internal"],
            "confidential": ["public", "internal", "confidential"],
        }
        # Unknown clearance levels fail closed to public-only access.
        allowed_clearances = clearance_hierarchy.get(user.clearance_level, ["public"])
        return {
            "and": [
                {"field": "tenant_id", "equals": user.tenant_id},
                {"field": "department", "in": user.departments},
                {"field": "classification", "in": allowed_clearances},
            ]
        }
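The filter dictionaries built above are written in a generic shape rather than the syntax of a specific vector database. To show the intended semantics, the sketch below evaluates that shape against in-memory chunks and applies it before ranking. The `matches` and `prefiltered_search` helpers and the sample data are assumptions for illustration. In particular, `in` on a list-valued field such as `allowed_roles` is read here as "any overlap".

```python
from typing import Any, Dict, List

Chunk = Dict[str, Any]


def matches(condition: Dict[str, Any], chunk: Chunk) -> bool:
    """Evaluate the illustrative filter shape against one chunk's metadata."""
    if "and" in condition:
        return all(matches(sub, chunk) for sub in condition["and"])
    value = chunk.get(condition["field"])
    if "equals" in condition:
        return value == condition["equals"]
    if "in" in condition:
        allowed = set(condition["in"])
        if isinstance(value, (list, set, tuple)):
            return bool(allowed & set(value))  # any shared element passes
        return value in allowed
    return False  # unknown operators fail closed


def prefiltered_search(chunks: List[Chunk], condition: Dict[str, Any], top_k: int = 10) -> List[Chunk]:
    # Ineligible chunks are removed before any similarity ranking happens.
    eligible = [c for c in chunks if matches(condition, c)]
    return sorted(eligible, key=lambda c: c["score"], reverse=True)[:top_k]


CHUNKS = [
    {"id": "runbook", "tenant_id": "t1", "allowed_roles": ["engineer", "sre"], "score": 0.71},
    {"id": "salary-table", "tenant_id": "t1", "allowed_roles": ["hr_admin"], "score": 0.99},
    {"id": "other-tenant", "tenant_id": "t2", "allowed_roles": ["engineer"], "score": 0.95},
]
rbac_filter = {
    "and": [
        {"field": "tenant_id", "equals": "t1"},
        {"field": "allowed_roles", "in": ["engineer"]},
    ]
}
hits = prefiltered_search(CHUNKS, rbac_filter)
print([c["id"] for c in hits])
```

The two highest-scoring chunks never enter the ranked set, which is the property that post-retrieval filtering cannot guarantee.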
For agents executing structured SQL queries via Text2SQL, PostgreSQL Row-Level Security must be enforced on the connection session to prevent query injection from leaking cross-tenant data.8
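-- Enable RLS on the core agent memory table
ALTER TABLE agent_memory_store ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS by default; FORCE applies the policy to them as well.
-- Superusers and roles with BYPASSRLS are still exempt, so agents must not connect as either.
ALTER TABLE agent_memory_store FORCE ROW LEVEL SECURITY;
-- Establish the tenant isolation policy using session variables
CREATE POLICY tenant_isolation_policy
    ON agent_memory_store
    FOR ALL
    USING (tenant_id = NULLIF(current_setting('app.current_tenant', true), ''));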
The database adapter execution pattern maps the authenticated user principal to this session variable prior to query processing 8:
from typing import Any, Dict, List

from sqlalchemy import text
from sqlalchemy.orm import Session


class SecurityException(Exception):
    """Raised when an agent query fails or is blocked by a database policy."""


# UserContext is the dataclass defined in the ABAC example above.
def execute_agent_query(db_session: Session, user: UserContext, generated_sql: str) -> List[Dict[str, Any]]:
    """
    Executes an LLM-generated SQL query safely.
    Enforces Row-Level Security by setting the session variable before execution.
    """
    try:
        # PostgreSQL's SET statement does not accept bind parameters, so the tenant
        # is set with set_config(); is_local = true scopes it to this transaction.
        db_session.execute(
            text("SELECT set_config('app.current_tenant', :tenant, true)"),
            {"tenant": user.tenant_id},
        )
        # Execute the generated SQL query. RLS is enforced natively at the database engine level.
        result = db_session.execute(text(generated_sql))
        return [dict(row) for row in result.mappings()]
    except Exception as e:
        db_session.rollback()
        raise SecurityException(f"SQL execution failed or blocked by policy engine: {e}") from e
Enterprise document repositories are dynamic environments.6 A document’s legal authority changes as laws adapt, product lines deprecate, and retention schedules expire.14
+------------------------------------------------------------------------------------------------+
| VERSIONING, RETENTION, REDACTION, AND ARCHIVAL POLICY MAP |
+------------------------------------------------------------------------------------------------+
|
| Goal: manage document authority, retrieval eligibility, retention obligations, redaction
| paths, supersession links, archival movement, legal holds, and defensible deletion.
|
| [ New or Updated Corpus Object ]
| |
| v
| +------------------------------------------------------------------------------------------+
| | Lifecycle Classification |
| | |
| | determine version_state, retention_class, sensitivity_class, legal_hold_status, |
| | valid_from/valid_until, owner, steward, supersession links, and retrieval eligibility. |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Document Version States |
| +----------------------+----------------------+----------------------+---------------------+
| | | |
| v v v
| +-------------------------+ +-----------------------+ +-----------------------------+
| | Draft | | Approved | | Active |
| | | | | | |
| | excluded from retrieval | | queued for validation | | eligible for standard |
| | stored for authors only | | and index loading | | retrieval and citation |
| +-----------+-------------+ +-----------+-----------+ +-------------+---------------+
| | | |
| | v v
| | +-----------------------+ +-----------------------------+
| | | Deprecated | | Superseded |
| | | | | |
| | | valid_until expired; | | replaced by newer version; |
| | | historical retrieval | | active queries reroute to |
| | | only | | superseded_by target |
| | +-----------+-----------+ +-------------+---------------+
| | | |
| +----------------------------+----------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Security and Redaction Branch |
| | |
| | Sensitive source? |
| | | |
| | +-- No -> retain inherited classification and continue lifecycle processing. |
| | | |
| | +-- Yes -> create Redacted Derivative if lower-clearance use is allowed. |
| | Mask PII/secrets, assign derivative ID, preserve parent lineage, |
| | inherit or narrow ACLs, and mark redaction_status. |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Retention and Legal Hold Gate |
| | |
| | Retention trigger reached? |
| | | |
| | +-- No -> maintain current archival_state and schedule next review. |
| | | |
| | +-- Yes -> check legal_hold_status. |
| | | |
| | +-- Legal hold active -> freeze deletion, lock record, restrict |
| | | mutation, preserve all derivatives. |
| | | |
| | +-- No legal hold -> archive or delete according to policy. |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Archival / Disposal Outcomes |
| +----------------------+----------------------+----------------------+---------------------+
| | | |
| v v v
| +-------------------------+ +-----------------------+ +-----------------------------+
| | Archived | | Legal Hold | | Deleted / Disposed |
| | | | | | |
| | cold storage, excluded | | immutable preservation| | cryptographic erasure from |
| | from standard indexes, | | blocks deletion and | | storage, vector indexes, |
| | available for audit | | retention expiry | | caches, and derivatives |
| +-----------+-------------+ +-----------+-----------+ +-------------+---------------+
| | | |
| +----------------------------+----------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Index and Lineage Propagation |
| | |
| | Update vector index, keyword index, graph records, summaries, extracted claims, citation|
| | anchors, prompt caches, audit logs, and downstream context-eligibility filters. |
| +------------------------------------------------------------------------------------------+
|
+------------------------------------------------------------------------------------------------+
| Rule: document lifecycle is not cosmetic metadata. It determines whether a knowledge asset can |
| be retrieved, cited, redacted, archived, frozen, superseded, or permanently destroyed. |
+------------------------------------------------------------------------------------------------+
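The corpus lifecycle engine enforces distinct state transitions to manage the valid lifespan of documents. A Draft is excluded from retrieval and stored for its authors only. An Approved document is queued for validation and index loading. An Active document is eligible for standard retrieval and citation. A Deprecated document has passed its valid_until date and is available for historical retrieval only. A Superseded document has been replaced by a newer version, and active queries are rerouted to its superseded_by target.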
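The version states in the diagram above can be sketched as a small state machine. The state names and retrieval behaviors come from the diagram. The allowed transitions are an assumption, since the diagram does not spell out every edge, and the mode labels are hypothetical.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    SUPERSEDED = "superseded"


# Assumed transition table; adjust to the organization's actual policy.
ALLOWED_TRANSITIONS = {
    State.DRAFT: {State.APPROVED},
    State.APPROVED: {State.ACTIVE},
    State.ACTIVE: {State.DEPRECATED, State.SUPERSEDED},
    State.DEPRECATED: set(),
    State.SUPERSEDED: set(),
}

# Retrieval behavior per state, following the diagram's descriptions.
RETRIEVAL_MODE = {
    State.DRAFT: "excluded",
    State.APPROVED: "queued_for_indexing",
    State.ACTIVE: "standard",
    State.DEPRECATED: "historical_only",
    State.SUPERSEDED: "reroute_to_superseded_by",
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not permitted."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = transition(State.DRAFT, State.APPROVED)
state = transition(state, State.ACTIVE)
print(state.value, RETRIEVAL_MODE[state])
```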
To support defensible deletion and structural archival, the corpus lifecycle engine must integrate standards derived from the United States Department of Defense records management strategies 14:
The policy map below defines the retention and redaction rules for enterprise knowledge classes:
| Ingestion Category | Retention Trigger Type | Retention Lifespan | Redaction Requirement | Archival State Destination |
|---|---|---|---|---|
| Financial Ledgers | Event-Based: Fiscal Year Closure.14 | 7 Years.14 | Mask individual account IDs and customer PII.13 | Encrypted database table; excluded from active RAG indexes. |
| HR Employee Files | Metadata-Based: Employee Termination. | 10 Years. | Complete redaction of SSNs, home addresses, and private medical histories.13 | Placed under read-only schema lock; restricted to authorized HR admins.8 |
| Legal Contracts | Time-Based: Expiration of terms. | 20 Years. | Redaction of third-party business names and trade secrets before vector load.13 | Retained on active index under strict legal-access scopes.8 |
| Product Manuals | Version-Based: Product End-of-Life. | Indefinite. | Low-risk; zero redaction required. | Moved to “Legacy Archive” index; matching queries receive a “Superseded” banner. |
| Customer Tickets | Metadata-Based: Resolution Status. | 1 Year. | Automatic extraction and deletion of customer credentials and billing tokens. | Cleared for automatic deletion from index unless marked with a Legal_Hold tag.14 |
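The retention and legal-hold gate can be sketched as follows. The retention periods are taken from the policy map above. The category keys, the action labels, and the `add_years` helper are assumptions for illustration, and the sketch collapses "archive or delete according to policy" into a single outcome.

```python
from datetime import date
from typing import Optional

# Retention lifespans in years, from the policy map above; None means indefinite.
RETENTION_YEARS = {
    "financial_ledger": 7,
    "hr_employee_file": 10,
    "legal_contract": 20,
    "product_manual": None,
    "customer_ticket": 1,
}


def add_years(start: date, years: int) -> date:
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return start.replace(year=start.year + years, month=2, day=28)


def retention_action(category: str, trigger_date: date, today: date, legal_hold: bool) -> str:
    years: Optional[int] = RETENTION_YEARS[category]
    if years is None or today < add_years(trigger_date, years):
        return "retain"  # trigger not reached; schedule the next review
    if legal_hold:
        return "freeze"  # a legal hold blocks deletion and retention expiry
    return "archive_or_delete"


today = date(2025, 6, 1)
print(retention_action("customer_ticket", date(2024, 1, 15), today, legal_hold=False))
print(retention_action("customer_ticket", date(2024, 1, 15), today, legal_hold=True))
print(retention_action("financial_ledger", date(2020, 12, 31), today, legal_hold=False))
```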
When retrieval systems ingest conflicting inputs (such as an old API manual and a newer specification, or contradictory policy guidelines), standard RAG blindly merges the text.12 This forces the model into a Context-Compliance Regime, where it outputs incorrect or hallucinated claims because the retrieved context dominates its parametric memory.12
+------------------------------------------------------------------------------------------------+
| CONFLICT RESOLUTION AND EPISTEMIC BELIEF REVISION |
+------------------------------------------------------------------------------------------------+
|
| Goal: prevent retrieval from flattening contradictory sources into a single confident answer.
| Conflicts must be isolated, scoped, resolved, logged, or surfaced as unresolved.
|
| [ Retrieval Candidate Set ]
| |
| v
| +------------------------------------------------------------------------------------------+
| | Step 1: Contextual Extraction |
| | |
| | Extract a_ctx: the answer asserted strictly by the retrieved context. |
| | Ignore parametric memory and do not reconcile yet. |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Step 2: Parametric Extraction |
| | |
| | Extract a_param: the answer produced from model knowledge without retrieved context. |
| | Treat this as a comparison signal, not as authoritative truth. |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Step 3: Divergence Check |
| | |
| | Compare a_ctx and a_param. Also compare retrieved sources against each other. |
| | |
| | Check: entity mismatch | date mismatch | jurisdiction mismatch | product/customer scope |
| | source authority | version state | supersession links | conflict_status |
| +---------------------------------------------+--------------------------------------------+
| |
| [ Divergence Detected? ]
| / \
| No / \ Yes
| v v
| +-------------------------------------------+ +--------------------------------------------+
| | Output grounded answer with normal | | Step 4: Premise Isolation |
| | citations and standard confidence. | | |
| | | | Extract the discrete premises causing |
| | Log: no conflict detected. | | the conflict and bind each to source, |
| +----------------------+--------------------+ | citation span, authority, scope, and time. |
| | +-------------------+------------------------+
| | |
| | v
| | +--------------------------------------------+
| | | Step 5: Scoped Variation Test |
| | | |
| | | Are the claims both valid under different |
| | | temporal, jurisdictional, product, tenant, |
| | | or customer scopes? |
| | +-------------------+------------------------+
| | |
| | +-------------+-------------+
| | | |
| | Yes No
| | | |
| | v v
| | +-------------------------------+ +--------------------+
| | | Scoped Variation Resolution | | True Contradiction |
| | | | | Resolution |
| | | Preserve both claims, attach | | |
| | | scope conditions, and answer | | Compare authority, |
| | | conditionally. | | recency, version, |
| | | | | provenance, and |
| | | Example: "For EU accounts..." | | trust score. |
| | +---------------+---------------+ +----------+---------+
| | | |
| | | v
| | | +----------------------------+
| | | | Resolved? |
| | | +-------------+--------------+
| | | / \
| | | Yes / \ No
| | | v v
| | | +----------------+ +-------------+
| | | | Select winning | | Quarantine |
| | | | claim; mark | | conflicting |
| | | | displaced claim| | claims; do |
| | | | superseded or | | not compile |
| | | | overridden. | | as truth. |
| | | +-------+--------+ +------+------+
| | | | |
| +-----------------------------+----------------+------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Conflict-Aware Output |
| | |
| | If resolved: answer with selected claim, citation, and conflict log. |
| | If scoped: answer conditionally by region/time/product/customer scope. |
| | If unresolved: disclose uncertainty, exclude disputed claims from hard conclusions, |
| | and trigger human review or system-of-record lookup. |
| +---------------------------------------------+--------------------------------------------+
| |
| v
| +------------------------------------------------------------------------------------------+
| | Audit and Belief Store Update |
| | |
| | Store: a_ctx, a_param, premises, source spans, authority comparison, scope decision, |
| | selected claim, quarantined claim IDs, confidence, and follow-up resolver action. |
| +------------------------------------------------------------------------------------------+
|
+------------------------------------------------------------------------------------------------+
| Rule: conflict is not noise to hide. It is evidence to structure. The system must distinguish |
| true contradiction from scoped variation before the answer reaches the user. |
+------------------------------------------------------------------------------------------------+
A contradiction is not always a quality failure. Two conflicting statements can both be true if their valid operational domains differ. The corpus engine must evaluate three core variables to determine whether a conflict represents a structural error or a scoped variation:
If these scopes overlap entirely, a true contradiction exists, and the system must flag the objects as disputed.10
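A minimal sketch of the scope test follows, assuming each claim carries a validity window, a set of jurisdictions, and a set of products. These three dimensions are drawn from the scopes named in the conflict-resolution diagram and are an assumption about which variables the engine evaluates. The sketch treats two claims as a true contradiction only when they overlap on every dimension.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class Scope:
    valid_from: date
    valid_until: date
    jurisdictions: FrozenSet[str]
    products: FrozenSet[str]


def scopes_overlap(a: Scope, b: Scope) -> bool:
    """True when the two scopes intersect on time, jurisdiction, and product."""
    time_overlap = a.valid_from <= b.valid_until and b.valid_from <= a.valid_until
    return (
        time_overlap
        and bool(a.jurisdictions & b.jurisdictions)
        and bool(a.products & b.products)
    )


def classify_conflict(a: Scope, b: Scope) -> str:
    # Disjoint on any dimension: both claims can hold, each within its own scope.
    return "true_contradiction" if scopes_overlap(a, b) else "scoped_variation"


eu_policy = Scope(date(2025, 1, 1), date(2025, 12, 31), frozenset({"EU"}), frozenset({"billing"}))
us_policy = Scope(date(2025, 1, 1), date(2025, 12, 31), frozenset({"US"}), frozenset({"billing"}))
eu_draft = Scope(date(2025, 6, 1), date(2026, 5, 31), frozenset({"EU"}), frozenset({"billing"}))

print(classify_conflict(eu_policy, us_policy))
print(classify_conflict(eu_policy, eu_draft))
```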
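To resolve these epistemic knowledge conflicts, systems must implement the Context-Driven Decomposition (CDD) belief-revision framework.12 CDD structures the inference process into five distinct steps to detect and arbitrate factual divergence: contextual extraction of the answer asserted by the retrieved context, parametric extraction of the answer produced without that context, a divergence check between the two answers and among the retrieved sources, premise isolation that binds each conflicting premise to its source, and a scoped variation test that separates scoped variation from true contradiction.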
The CDD framework significantly improves RAG accuracy under knowledge stress tests compared to standard retrieval baselines 12:
| Evaluation Setting | Standard RAG Accuracy | CDD Accuracy | Mistake-Injection Causal Sensitivity |
|---|---|---|---|
| Entity Swap Attacks | 58.4% 12 | 88.0% 12 | Corrects malicious entity substitutions. |
| Logical Contradictions | 56.0% 12 | 83.2% 12 | Identifies logical inconsistencies. |
| Temporal Shifts (Drift) | 68.8% 12 | 71.3% 11 | Restores accuracy under historical drifts. |
| Noisy Distractor Context | 68.8% 12 | 69.9% 11 | Ignores irrelevancies and out-of-scope files. |
| TruthfulQA Misconceptions | 15.0% 11 | 78.1% 12 | Blocks model compliance with falsehoods. |
By treating knowledge conflict as an inference-time observable, CDD keeps the trace of factual dispute visible to compliance teams, ensuring that the final answer is auditable.10
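One way to keep that trace auditable is to persist a single record per arbitrated conflict. The sketch below mirrors the fields listed in the "Audit and Belief Store Update" stage and the three output branches of the "Conflict-Aware Output" stage. It is illustrative only: it assumes `a_ctx` and `a_param` denote the context-grounded and parametric answers, and the class name, function names, and string labels are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional
import json

DECISIONS = {"resolved", "scoped", "unresolved"}


@dataclass
class ConflictAuditRecord:
    """Hypothetical audit record; field names follow the diagram's Store list."""
    a_ctx: str                      # answer grounded in retrieved context (assumed meaning)
    a_param: str                    # answer from model parameters (assumed meaning)
    premises: List[str]
    source_spans: List[str]
    authority_comparison: str
    scope_decision: str             # one of DECISIONS
    selected_claim: Optional[str]
    quarantined_claim_ids: List[str] = field(default_factory=list)
    confidence: float = 0.0
    resolver_action: str = "none"


def render_answer(rec: ConflictAuditRecord) -> str:
    """Choose the output branch that matches the recorded scope decision."""
    if rec.scope_decision not in DECISIONS:
        raise ValueError(f"unknown scope decision: {rec.scope_decision}")
    if rec.scope_decision == "resolved":
        return f"{rec.selected_claim} [sources: {', '.join(rec.source_spans)}]"
    if rec.scope_decision == "scoped":
        return f"Depends on scope: {rec.selected_claim}"
    # Unresolved: disclose uncertainty and name the follow-up action.
    return f"Sources disagree; disputed claims are excluded pending {rec.resolver_action}."


def to_audit_json(rec: ConflictAuditRecord) -> str:
    """Serialize the record for an append-only audit store."""
    return json.dumps(asdict(rec), sort_keys=True)
```

Writing the record before the answer is returned means the stored decision and the rendered output cannot drift apart, because both are derived from the same object.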
Managing a continuous ingestion pipeline requires strict observability. Data drift, parser regressions, and metadata omissions will corrupt retrieval indexing if left unmonitored.2
The corpus platform must monitor and evaluate its data quality against these fifteen critical metrics:
| Metric Name | Mathematical Definition / Formulation | Target SLA | Failure Mitigation Strategy |
|---|---|---|---|
| 1. Provenance Completeness | (Objects with Valid source_uri) / (Total Ingested Objects) 3 | 100.00% | Reject ingestion of files missing verifiable parent URIs.3 |
| 2. Metadata Coverage | (Objects with non-empty Mandatory Keys) / (Total Ingested Objects) | >99.90% | Route files with incomplete tags to human metadata-tagging queues.4 |
| 3. Exact Duplicate Rate | (Identical Lexical Hash Matches) / (Total Ingested Documents) 21 | <1.00% | Discard identical files during Stage 8 pipeline execution.1 |
| 4. Near-Duplicate Rate | (MinHash Jaccard Matches at T >= 0.85) / (Total Ingested Documents) 1 | <3.00% | Merge similar drafts; retain only the highest-authority version.1 |
| 5. Entity Resolution Accuracy | (Correct Entity merges to Canonical ID) / (Total Extracted Entities) 9 | >95.00% | Refine entity profile builder templates; adjust LLM merge logic.9 |
| 6. Alias Coverage Rate | (Resolved Entity Aliases) / (Total Entity Synonyms Extracted) 9 | >90.00% | Run background batch entity extraction and alias table updates.16 |
| 7. Permission Leakage Rate | (Chunks retrieved without valid Role match) / (Total Evaluated Retrievals) 8 | 0.00% | Enforce security checks before vector scoring is calculated.8 |
| 8. Redaction Propagation | (Derived elements correctly redacting term t) / (Total derived elements from parent D) | 100.00% | Cascade parent redactions instantly using parent-child lineage paths. |
| 9. Citation Fidelity | (Citations Verified Against Cited Source) / (Total Citations Generated) 11 | >98.00% | Block output if the cited context does not support the claim.11 |
| 10. Parse Error Rate | (Pages with mangled paragraph or layout output) / (Total Parsed Pages) 2 | <2.00% | Upgrade parsing pipelines to structure-aware layout engines.2 |
| 11. OCR/Table Read Accuracy | (Correctly extracted cells) / (Total cells in evaluation set) 15 | >95.00% | Route documents with multi-row tables to Docling/TableFormer.15 |
| 12. Stale-Source Rate | (Active Objects past valid_until date) / (Total Active Ingested Objects) | <0.50% | Trigger automated deprecation workflows; notify content owner. |
| 13. Conflict Detection Accuracy | (True Contradictions Identified) / (Total Contradictions in System) 10 | >95.00% | Implement Context-Driven Decomposition (CDD) at query time.12 |
| 14. Lineage Completeness Rate | (Objects with complete PROV-O trail) / (Total Ingested Objects) 18 | 100.00% | Block ingestion of objects with incomplete activity or agent steps. |
| 15. Retrieval Eligibility | (Candidate objects matching active scopes) / (Total Candidate set returned by index) | 100.00% | Reject candidates that violate geographic, version, or tenant scopes. |
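
Three of these ratios can be computed directly from object metadata. The sketch below is a minimal illustration, assuming a simplified in-memory `CorpusObject` record (a hypothetical type, not the report's schema). It treats any non-empty `source_uri` as valid and counts every repeat of an already-seen SHA-256 digest as an exact duplicate. The SLA targets are copied from the table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional
import hashlib


@dataclass
class CorpusObject:
    """Hypothetical ingested object carrying only the fields used below."""
    content: bytes
    source_uri: Optional[str]
    valid_until: Optional[date]
    active: bool = True


def provenance_completeness(objs: List[CorpusObject]) -> float:
    # Metric 1: share of objects with a source_uri (non-empty is assumed valid).
    return sum(1 for o in objs if o.source_uri) / len(objs)


def exact_duplicate_rate(objs: List[CorpusObject]) -> float:
    # Metric 3: share of objects whose content hash was already seen.
    seen, dupes = set(), 0
    for o in objs:
        digest = hashlib.sha256(o.content).hexdigest()
        if digest in seen:
            dupes += 1
        else:
            seen.add(digest)
    return dupes / len(objs)


def stale_source_rate(objs: List[CorpusObject], today: date) -> float:
    # Metric 12: share of active objects past their valid_until date.
    active = [o for o in objs if o.active]
    stale = [o for o in active if o.valid_until and o.valid_until < today]
    return len(stale) / len(active) if active else 0.0


# (target, comparison) pairs taken from the Target SLA column.
SLA = {
    "provenance_completeness": (1.0, ">="),
    "exact_duplicate_rate": (0.01, "<"),
    "stale_source_rate": (0.005, "<"),
}


def sla_breaches(measured: Dict[str, float]) -> List[str]:
    """Return the names of metrics that miss their SLA target."""
    breaches = []
    for name, value in measured.items():
        target, op = SLA[name]
        ok = value >= target if op == ">=" else value < target
        if not ok:
            breaches.append(name)
    return breaches
```

A scheduled job could call `sla_breaches` on each ingestion batch and route any non-empty result to the mitigation listed in the corresponding table row.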
Data preparation sets the quality ceiling for all downstream retrieval and generation processes.2 Ingestion teams must run this operational audit checklist prior to moving any document into active search indexes:
| Hygiene Step | Audit Criteria & Validation Rule | Operational System Trigger | Verification State |
|---|---|---|---|
| Parser Validation | Inspect 20 representative parsing outputs.2 Confirm no multi-column text merges horizontally.15 | Run parser comparison test.2 | [ ] Pending |
| OCR/Table Check | Verify multi-row table cells map to correct parents.15 | TableFormer structural analysis run.15 | [ ] Pending |
| Encoding Cleanup | Strip non-printable characters; repair broken UTF-8 bytes. | Regex syntax validation script. | [ ] Pending |
| Boilerplate Removal | Confirm headers, footers, and side navs are completely removed.2 | Sentence frequency analysis pass.1 | [ ] Pending |
| Metadata Verification | Enforce non-empty tags for mandatory schema keys.1 | Schema validator engine pass.1 | [ ] Pending |
| Deduplication Pass | Run exact byte hashing and MinHash duplicate checking.1 | Jaccard threshold evaluation pass.2 | [ ] Pending |
| Alias Resolution | Confirm entity variants map to stable Canonical IDs.9 | Reconciler LLM check.9 | [ ] Pending |
| Redaction Execution | Verify PII and credentials are masked or removed.2 | Automated DLP scanner check.13 | [ ] Pending |
| Source Authority | Verify authority tiers are assigned correctly (As).10 | Connector source authority mapping match.10 | [ ] Pending |
| Version Verification | Confirm document is marked active and has not expired.14 | Temporal validity window verification check.14 | [ ] Pending |
| Permission Check | Confirm SharePoint/NTFS ACLs match database parameters.23 | Entra ID query test run.8 | [ ] Pending |
| Citation Registration | Verify document carries a valid citation format.3 | Schema citation tag check.11 | [ ] Pending |
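
The deduplication pass in the checklist combines exact byte hashing with a MinHash estimate of Jaccard similarity. The sketch below uses only the standard library and is a teaching sketch, not a production recipe. The 5-token word shingles, 128 hash functions, and seeded BLAKE2b hashing are assumptions chosen for illustration, and the 0.85 threshold comes from the Near-Duplicate Rate metric. A production pipeline would normally add locality-sensitive hashing so that documents are not compared pairwise.

```python
import hashlib

NUM_HASHES = 128   # assumed signature length
THRESHOLD = 0.85   # near-duplicate threshold from the metrics table


def shingles(text: str, k: int = 5) -> set:
    """Split text into overlapping k-token word shingles."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _seeded_hash(shingle: str, seed: int) -> int:
    # One 64-bit hash per (seed, shingle) pair stands in for a random permutation.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str) -> list:
    """Keep the minimum hash value over all shingles for each seed."""
    sh = shingles(text)
    return [min(_seeded_hash(s, seed) for s in sh) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def classify_pair(doc_a: str, doc_b: str) -> str:
    """Return "exact_duplicate", "near_duplicate", or "distinct"."""
    hash_a = hashlib.sha256(doc_a.encode("utf-8")).digest()
    hash_b = hashlib.sha256(doc_b.encode("utf-8")).digest()
    if hash_a == hash_b:
        return "exact_duplicate"
    similarity = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
    return "near_duplicate" if similarity >= THRESHOLD else "distinct"
```

Running the exact hash first is the cheaper check, so identical files never reach the MinHash stage.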
Corpus engineering defines the governed data substrate beneath retrieval mechanics, freshness policies, context design, and compliance systems. The following table establishes how this report coordinates with other components of the AI systems canon.
| Upstream Report Reference | Downstream Target Report | System-Level Interface Mechanics | Canon Integration Objective |
|---|---|---|---|
| AI-ENG-B — Context Architecture | AI-ENG-D — Corpus Engineering | Context-side objects inherit the temporal validity, tenant scope, and owner fields defined in the corpus schema. | Connects context-compilation boundaries directly to source data governance. |
| AI-ENG-C — Inference Economics | AI-ENG-D — Corpus Engineering | Minimizes chunk bloat, duplicate files, and boilerplate text at the ingestion layer to control context costs.1 | Protects system margin by removing expensive, redundant context tokens.20 |
| AI-ENG-D — Corpus Engineering | AI-ENG-E — Retrieval Mechanics | Delivers normalized, permission-aware, and deduplicated candidate sets to the retrieval engine.8 | Establishes the secure dataset that query-writing and embedding models search.8 |
| AI-ENG-D — Corpus Engineering | AI-ENG-F — Knowledge Freshness | Provides version, supersession, and review triggers to control knowledge-base deprecation and freshness.14 | Drives the background routines that rebuild stale chunks and clear deprecated vectors. |
| AI-ENG-D — Corpus Engineering | AI-ENG-T — Tenant Isolation | Maps the explicit tenant and partition boundaries required to prevent multi-tenant data leaks.8 | Natively isolates user access before vector searches touch physical memory.8 |
| AI-ENG-D — Corpus Engineering | AI-ENG-U — Supply-Chain Security | Enforces sensitivity inheritance and redaction on derived assets to prevent data laundering.7 | Prevents high-classification data from leaking into lower-security models. |
| AI-ENG-D — Corpus Engineering | AI-ENG-Z — Systems Telemetry | Records execution timings, PROV-O traces, and parser performance to systems telemetry dashboards.7 | Monitors pipeline latency, extraction failures, and near-duplicate bloat in real time. |
| AI-ENG-D — Corpus Engineering | AI-ENG-AA — Retrieval Evals | Supplies ground-truth citation links and verified claims for retrieval and grounding evaluations.5 | Evaluates citation accuracy against direct source document coordinates.11 |
| AI-ENG-D — Corpus Engineering | AI-ENG-AB — Auditability Trails | Maintains a complete, immutable lineage record from raw source to generated output.18 | Satisfies legal compliance audits and forensic query reconstruction.14 |
| AI-ENG-D — Corpus Engineering | AI-ENG-AD — Data Governance | Enforces organizational compliance, retention rules, and DoD-grade records management standards.14 | Aligns the AI platform with corporate risk, legal discovery, and legal compliance policies. |
| AI-ENG-D — Corpus Engineering | AI-ENG-AH — Build/Buy Strategy | Compares parsing speed, table accuracy, and costs (such as Docling vs. LlamaParse APIs).15 | Guides resource allocation between local systems and vendor platforms. |
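
The retrieval and tenant-isolation interfaces in the table above both depend on filtering candidates before any similarity score is computed. The sketch below shows that ordering, assuming a simplified in-memory index. The `Chunk` type, its fields, and `retrieve` are hypothetical names, and a real deployment would push the same filter into the vector store's query rather than filter in application code.

```python
from dataclasses import dataclass
from typing import List
import math


@dataclass(frozen=True)
class Chunk:
    """Hypothetical indexed chunk with its inherited access metadata."""
    chunk_id: str
    tenant_id: str
    allowed_roles: frozenset
    embedding: tuple


def cosine(a: tuple, b: tuple) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: tuple, chunks: List[Chunk], tenant_id: str,
             roles: frozenset, k: int = 3) -> List[Chunk]:
    """Filter by tenant and role first, then rank only the eligible chunks."""
    eligible = [
        c for c in chunks
        if c.tenant_id == tenant_id and c.allowed_roles & roles
    ]
    # Similarity is computed only for chunks the caller is allowed to see.
    ranked = sorted(eligible, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:k]
```

Because ineligible chunks are removed before ranking, a highly relevant chunk from another tenant can never displace an authorized result or appear in the candidate set.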
| Build an unstructured data pipeline for RAG | Databricks on AWS, accessed June 6, 2026, https://docs.databricks.com/aws/en/generative-ai/tutorials/ai-cookbook/quality-data-pipeline-rag |
| RAG Data Preparation: The Foundation That Makes or Breaks Your AI System | sph.sh, accessed June 6, 2026, https://sph.sh/en/posts/rag-data-preparation/ |
| Document-Level Access Control - Azure AI Search | Microsoft Learn, accessed June 6, 2026, https://learn.microsoft.com/en-us/azure/search/search-document-level-access-overview |