Volume 6 — Multimodal and Interface-Controlling Systems

How AI engineering changes when the system reads, sees, hears, speaks, and acts through interfaces.

Reports

AI-ENG-P — Multimodal Understanding: Documents, Images, Tables, Charts & Video

Covers OCR, layout-aware document parsing, table extraction, form understanding, chart interpretation, image retrieval, video sampling, visual grounding, multimodal embeddings, and evidence selection. Focuses on whether the system inspected the right evidence, not merely whether it produced plausible text.

AI-ENG-Q — Speech, Voice, and Real-Time Interaction Systems

Covers speech-to-text, text-to-speech, turn-taking, interruption handling, latency budgets, streaming generation, conversational repair, voice identity, accessibility, and the UX consequences of real-time AI behavior.

AI-ENG-R — UI Agents: Browser Control, Desktop Automation & Visual State

Covers agents that operate software interfaces. Includes DOM inspection, screenshot reasoning, visual state tracking, action planning, click/type verification, form submission safety, browser sandboxing, and recovery from interface drift.
