Volume 4 — Runtime Architecture and Inference Mechanics
How AI systems actually execute under physical, computational, and operational constraints.
Reports
AI-ENG-J — Throughput Mechanics: Memory Pressure, KV Cache & Hardware Realities
Covers the brutal physics of scaling inference. Includes prefill vs. decode latency, KV cache management, eviction policies, prompt caching, semantic caching, continuous batching, paged attention, GPU memory pressure, CPU/GPU tradeoffs, and capacity planning under real workloads.
AI-ENG-K — Weight Dynamics: Quantization, Compression & Decoding Strategy
Covers INT8, INT4, FP8, AWQ, GPTQ, pruning, speculative decoding, draft models, constrained decoding, sampling behavior, and quality degradation boundaries. Teaches how to make models faster and cheaper without quietly destroying their reasoning, formatting, or instruction-following reliability.
AI-ENG-L — Model Serving Architecture: Routing, Load Balancing & Deployment Topologies
Covers inference servers, API gateways, local deployment, edge deployment, cloud deployment, multi-model routing, tenant-aware capacity, queueing, backpressure, autoscaling, cold starts, rate limits, and high-availability serving patterns.