ai-engineering-systems-canon

AI-ENG-K — Weight Dynamics - Quantization, Compression & Decoding Strategy

Architectural Foundations of Weight Dynamics

Conceptual Definitions and Scope

In high-dimensional artificial intelligence system architecture, weight dynamics is defined as the disciplined, behavior-aware optimization of model representations, numerical formats, and token-generation paths.1 It forms a core pillar of runtime execution engineering, operating at the boundary where neural networks interact with physical hardware constraints.3
To establish clear operational boundaries, weight dynamics must be distinguished from related but distinct runtime optimization vectors:

During autoregressive decoding, models execute in a memory-bandwidth-bound state, where the speed of transferring billions of parameters from High-Bandwidth Memory (HBM) to the processor’s registers dictates generation latency.3 Weight dynamics directly addresses this bottleneck by compressing static weights, managing dynamic activation ranges, and pruning redundant pathways.1 However, these structural changes introduce numerical approximation errors.10
The core doctrine of weight dynamics states: Compression must be behavior-preserving, not merely benchmark-preserving. The goal is not only to reduce bits or increase throughput, but to preserve task-critical reasoning, formatting, grounding, tool-use, safety, and instruction-following reliability inside production workflows.10 A faster model that silently degrades a workflow’s critical behavior has not been optimized; it has been damaged efficiently.14

Conceptual Glossary

Representation and Decoding Optimization Map

+------------------------------------------------------------------------------------------------+
|                              REPRESENTATION AND DECODING OPTIMIZATION MAP                      |
+------------------------------------------------------------------------------------------------+
|                                                                                                             
|                                      [ Core Inference Engine ]                                              
|                                                 |                                                           
|              +----------------------------------+----------------------------------+                        
|              |                                  |                                  |                        
|              v                                  v                                  v                        
|   +------------------------+          +------------------------+          +--------------------+            
|   | Representation & State |          | Decoding & Output Path |          | Acceleration Layer |            
|   +-----------+------------+          +-----------+------------+          +----------+---------+            
|               |                                   |                                  |                      
|      +--------+--------+                 +--------+--------+                 +-------+-------+              
|      |        |        |                 |        |        |                 |       |       |              
|      v        v        v                 v        v        v                 v       v       v              
| +---------+ +----------+ +----------+ +--------+ +----------+ +-----------+ +------+ +------+ +----------+  
| | Static  | | Active   | | KV Cache | | Logits | | Token    | | Output    | |Draft | |Runtime| | Serving |  
| | Weights | | Activat. | | State    | | Path   | |Selection | |Constraints| |Verify| |Kernels| |Scheduler|  
| +----+----+ +----+-----+ +----+-----+ +---+----+ +----+-----+ +-----+-----+ +--+---+ +--+---+ +----+-----+  
|      |           |            |          |           |             |          |        |          |         
|      v           v            v          v           v             v          v        v          v         
| INT4 / FP4   INT8 / FP8   INT8 / INT4  FP16 /    Min-P /      CFG / FSM   EAGLE-3  Marlin /  RadixAttn      
| EXL2 / FP8   E4M3 Act.    FP8 /        BF16      Top-P /      TagDispatch P-EAGLE  cuBLASLt  Paged          
|              Tensor Core  TurboQuant   FP32 acc. Top-K /      llguidance  HiSpec   Sparse    Block Mgr.     
| HBM -> L2    SM/Register  VRAM blocks  ALU       Logit Bias   Logit Mask  Medusa   Kernels   Scheduler      
| bandwidth    execution    DRAM fallback softmax   filtering    enforcement verify   dequant   policy        
|                                                                                                             
+------------------------------------------------------------------------------------------------+
| Optimization doctrine: reduce movement, preserve behavior, and validate every compression path.|
+------------------------------------------------------------------------------------------------+
Optimization Target Active Precision Formats Physical Hardware Element Operational Speed/Memory Target Behavioral Regression Risks
Static Weights INT4, FP4 (OCP MX), EXL2, FP8 30 High-Bandwidth Memory (HBM) to L2 Cache 1 4x reduction in weight transfer volume; bypasses HBM bandwidth bottlenecks.1 Severe reasoning degradation, loss of mathematical precision, syntax errors.13
Activations INT8, FP8 (E4M3 format) 18 GPU Streaming Multiprocessors (SM) & Registers 1 Enables native low-precision Tensor Core execution; 2x TFLOPS boost.20 Outlier saturation; clipping-induced hallucination; safety alignment drift.9
KV Cache Tiered INT8/INT4, FP8, TurboQuant 10 VRAM Block Allocation; Host System RAM 10 Bypasses linear capacity footprint growth; expands maximum concurrency.10 “Needle-in-a-haystack” retrieval failure; long-context attention decay.10
Logits FP16, BF16 (accumulated in FP32) 37 Vector ALUs & Thread Registers 1 Prevents dynamic range underflow before softmax normalizations.29 NaN failures; dynamic probability distribution collapse.19
Token Selection Min-P, Top-P, Top-K, Logit Bias 28 Thread Warp execution blocks 1 Eliminates low-probability garbage tokens without restricting creative vocabulary.28 Repetitive output loops; structural tag truncation; failure of safety refusals.28
Output Constraints CFG, FSM, TagDispatch, llguidance CPU host scheduler, grammar compiler, GPU/CPU logit processor Strong structural adherence with overhead determined by grammar complexity, compilation caching, per-token mask cost, and runtime integration. Parser deadlock, forced invalid semantics, masked correct tokens, schema-valid but meaning-invalid outputs, retry loops under malformed grammars.
Draft Verification EAGLE-3, P-EAGLE, HiSpec, Medusa 23 Parallel Target Verification Thread Blocks 23 2x to 3x throughput speedups by exploiting idle compute capacity.11 Rejection-loop latency tax; higher VRAM and HBM footprint during generation.25
Runtime Kernels Marlin, Sparse-Marlin, cuBLASLt 1 GPU Shared Memory (SMEM) registers 1 Dynamic dequantization outside the execution critical path.1 Memory misalignment; bank conflicts; register spilling.1
Serving Scheduler SGLang RadixAttention 46 Paged Memory Block Managers 46 5x throughput improvement on high-overlap multi-turn workflows.47 Queue starvation; context evictions under memory pressure.48

Quantization and Numerical Representation

Precision Formats and Hardware Architecture

Quantization scales down model representations to resolve the physical limitations of hardware serving environments.1 During the autoregressive decode phase, LLM execution is highly memory-bandwidth bound.3 Every generated token requires loading the model’s entire weight matrix from HBM to the chip’s processing registers, yielding an arithmetic intensity (FLOPs per byte) that is orders of magnitude lower than the prefill phase.3
To bypass this memory bus bottleneck, quantization formats map high-precision floating-point parameters to narrower bit-widths, reducing the volume of data that must be transferred 1:

 ---> (Compressed Quantized Weights Stream) --->  
                                                |  
                                 (Dynamic Dequantization to FP16)  
                                                v  
          <--- (High-Precision Accumulation) <---

Modern high-performance architectures leverage specialized floating-point and integer formats to manage this memory-compute boundary:

FP8 E4M3 Element: -> High Precision (Weights & Forward Activations)
FP8 E5M2 Element: -> High Range (Gradients & Training Backward)

Quantization Methodologies

Post-training quantization (PTQ) frameworks employ different optimization strategies to reduce representational errors 9:

1. SmoothQuant

SmoothQuant addresses the challenge of activation outlier channels, which frequently exhibit values 10 to 100 times larger than median channels in models exceeding 6.7B parameters.9 Standard per-tensor quantization collapses non-outlier channels into a few quantization bins, introducing severe representation errors.9
SmoothQuant resolves this by migrating quantization difficulty from activations to weights through a channel-wise re-parameterization of linear layers.7 For an activation matrix X (represented as a real-valued matrix of size N by C_in) and weight matrix W (represented as a real-valued matrix of size C_in by C_out), the forward multiplication Y = XW is mathematically transformed 9:
Y = (X * diag(s)^-1) * (diag(s) * W) = X_hat * W_hat
where s in R^(C_in) is a positive, channel-wise scaling vector.9 The optimal scaling factor s_j for each channel j is calculated using a hyperparameter alpha (migration strength) 9:
s_j = a_j^alpha / w_j^(1-alpha)
where a_j = max(|X_j|) represents the activation channel maxima profiled from a calibration dataset, and w_j = max(|W_j|) represents the weight channel maxima.9 Selecting alpha approximately equal to 0.5 balances the quantization difficulty equally between weights and activations, enabling stable W8A8 INT8 execution across linear layers.9 The scaling factors can be fused into preceding normalization layers (such as RMSNorm or LayerNorm) offline to eliminate runtime scaling overhead.7

2. AWQ (Activation-aware Weight Quantization)

AWQ is based on the insight that model parameters are not equally important; protecting only the top 1% of “salient” weights (specifically those corresponding to large-magnitude activations) drastically reduces reconstruction errors.16 AWQ identifies these salient weight channels using a calibration dataset and protects them via a scaling transformation 16:
Y = (X * diag(s)^-1) * (diag(s) * W)
The scale factor s is optimized automatically to minimize quantization error on the salient channels, allowing the remaining 99% of parameters to be quantized to 4-bit weights while activations remain in high-precision FP16 or BF16 format.16 This weight-only approach avoids the computational complexity of activation quantization.16

3. GPTQ

GPTQ is an offline, layer-wise post-training compression framework that minimizes reconstruction error by solving a joint optimization problem.13 The algorithm processes weights column-by-column, using approximate second-order Hessian information to update remaining unquantized parameters and compensate for the rounding error introduced by quantizing preceding columns.13
GPTQ achieves near-lossless 4-bit weight compression on massive models, but the second-order Hessian calculations are computationally intensive and can require several hours of offline compilation.13

Hardware and Kernel Dependencies

A quantized format only improves serving economics when the runtime engine can exploit it.1 The Marlin INT4 kernel addresses this by executing high-performance mixed-precision matrix multiplications (W4A16) directly on GPU Tensor Cores.1 Marlin uses several hardware-level optimizations 1:

Marlin-GPTQ on Qwen2.5-32B achieves 712 tokens per second (tok/s) compared to standard GPTQ’s 276 tok/s (a 2.6x speedup), while Marlin-AWQ achieves 741 tok/s compared to standard AWQ’s 68 tok/s (a 10.9x speedup).1
These optimizations deliver up to a 3.9x throughput speedup at low batch sizes (batch size 16-32) where memory bandwidth is the primary constraint.44 As batch sizes scale up to 128, the workload becomes compute-bound, and Marlin’s relative speedup decreases to 1.5x.44

Quantization Method Memory Savings Profile Hardware / Runtime Fit Calibration Requirements Behavioral Risks Ideal Workload
FP8 W8A8 / E4M3 Forward Inference About 50% reduction versus BF16/FP16 for supported weights and activations Best on Hopper, Ada Lovelace, and Blackwell-class accelerators with native FP8 paths; strong fit for vLLM/TRT-LLM style runtimes Low to moderate; requires representative activation samples and scale selection Rare-token instability, high-entropy reasoning degradation, activation outlier saturation High-throughput serving on modern accelerators
INT8 W8A8 / SmoothQuant About 50% reduction versus BF16/FP16 for weights and activations Strong fit for Ampere, Hopper, Blackwell, and mixed fleets lacking uniform native FP8 support High-quality activation profiling; must capture channel outliers, schemas, code, math, and long contexts Outlier clipping, syntax drift, safety/refusal boundary shifts Legacy or mixed GPU fleets needing robust W8A8 execution
AWQ INT4 / W4A16 Weight-Only About 75% reduction in static weight storage; activations remain FP16/BF16 Broad CUDA compatibility, but speedups require optimized kernels such as Marlin-AWQ Moderate; needs task-balanced calibration exposing salient activation channels Salient-channel miss, long-context degradation, brittle schema/tool behavior High-concurrency serving where weight bandwidth is the bottleneck
GPTQ INT4 / W4A16 Weight-Only About 75% reduction in static weight storage; activations remain FP16/BF16 Practical throughput depends on Marlin-GPTQ or equivalent kernels High offline cost; layer-wise reconstruction over representative calibration data Quantization noise on rare tokens, code/math precision loss, fragile formatting Stable single-model deployments and local inference
MXFP4 / NVFP4 Block-Scaled FP4 About 75% reduction versus BF16/FP16 for supported tensors Blackwell-native path; depends on FP4 Tensor Core support and block scaling Moderate to high; requires block-scaling strategy and strict regression validation Complex reasoning, code, math, and long-context recall can degrade sharply Ultra-high-throughput Blackwell deployments after heavy validation

Calibration Data Design Model

The behavioral preservation of post-training quantized models is heavily dependent on the design of their calibration datasets.18 Naive calibration on generic corpora (such as raw web crawls) fails to capture the activation outliers and dynamic ranges triggered by specialized tasks, causing catastrophic quality degradation in production.18

+--------------------------------------------------------------------------------------+
|                            CALIBRATION DATA DESIGN MODEL                             |
+--------------------------------------------------------------------------------------+
|                                                                                      
|  Goal: build a profiling corpus that preserves behavior after quantization.          
|                                                                                      
|                         [ Production Workload Traces ]                               
|                                      |                                               
|                                      v                                               
|  +--------------------------------------------------------------------------------+  
|  |                         REPRESENTATIVE CALIBRATION CORPUS                      |  
|  |                                                                                |  
|  |  +-----------------------------+   +-----------------------------------------+ |  
|  |  | Workload Representativeness |   | Domain & Language Coverage              | |  
|  |  |                             |   |                                         | |  
|  |  | - Real user prompt shapes   |   | - Code, math, reasoning                 | |  
|  |  | - Real task mix             |   | - Multilingual inputs                   | |  
|  |  | - Real template patterns    |   | - Rare-token / outlier channels         | |  
|  |  +-----------------------------+   +-----------------------------------------+ |  
|  |                                                                                |  
|  |  +-----------------------------+   +-----------------------------------------+ |  
|  |  | Structural Adherence        |   | Context-Length Coverage                 | |  
|  |  |                             |   |                                         | |  
|  |  | - JSON schemas              |   | - Short, medium, long sequences         | |  
|  |  | - XML wrappers              |   | - Target production context windows     | |  
|  |  | - Tool definitions          |   | - 2K -> 32K+ activation profiles        | |  
|  |  +-----------------------------+   +-----------------------------------------+ |  
|  |                                                                                |  
|  |  +--------------------------------------------------------------------------+  |  
|  |  | Governance & Privacy                                                     |  |  
|  |  |                                                                          |  |  
|  |  | - PII anonymization                                                      |  |  
|  |  | - Provenance and version tracking                                        |  |  
|  |  | - Separation from validation / evaluation test sets                      |  |  
|  |  | - Release-linked calibration manifests                                   |  |  
|  |  +--------------------------------------------------------------------------+  |  
|  +--------------------------------------------------------------------------------+  
|                                      |                                               
|                                      v                                               
|  +--------------------------------------------------------------------------------+  
|  |                         QUANTIZATION CALIBRATION PASS                          |  
|  |                                                                                |  
|  |  - Profile activation ranges and outlier channels                              |  
|  |  - Compute weight / activation scale factors                                   |  
|  |  - Preserve fragile behaviors: reasoning, schemas, tools, safety, recall       |  
|  +--------------------------------------------------------------------------------+  
|                                      |                                               
|                                      v                                               
|  +--------------------------------------------------------------------------------+  
|  |                            OPTIMIZED MODEL CANDIDATE                           |  
|  |                                                                                |  
|  |  Candidate proceeds only after regression testing against held-out eval sets.  |  
|  +--------------------------------------------------------------------------------+  
|                                                                                      
+--------------------------------------------------------------------------------------+

Pruning, Sparsity, and Architectural Compression

Pruning Mechanics and Structured Sparsity

Pruning is a model compression technique that sets specific parameters to zero, seeking to reduce both storage footprint and the computational complexity of matrix operations.12 It is classified into two primary categories:

NVIDIA’s 2:4 Structured Sparsity is the dominant hardware-supported structured pruning standard.12 It dictates that within every contiguous block of four weights, exactly two elements must be zeroed.12

Original Dense Weights (1x4 block):  [  0.85 |  0.12  | -0.64  |  0.03  ]  
                                             |  
                                  (Magnitude-based Pruning)  
                                             v  
2:4 Structured Sparse Layout:        [  0.85 |   0    | -0.64  |   0    ]  
                                             |  
                               (Hardware-level Compression)  
                                             v  
Compressed Storage Format: Weights: [ 0.85 | -0.64 ]   Metadata: [ 00 | 10 ] (2-bit indices)

NVIDIA Ampere, Hopper, and Blackwell GPUs feature dedicated Sparse Tensor Core logic designed specifically to exploit this 2:4 pattern.12 The hardware compresses the weight matrix by storing only the non-zero parameters along with a highly compact 2-bit index metadata block per group of four, reducing static weight storage requirements and HBM memory transfer overhead by 50%.8 During matrix multiplication, the hardware uses this metadata to skip zero-valued operands entirely, doubling GEMM computational throughput.8

Pruning Algorithms and Arbitrary Sparsity Adaptation

Enforcing structured sparsity on a pre-trained model introduces severe information loss.14 Two prominent post-training pruning frameworks address this:

  1. SparseGPT: An offline, one-shot post-training pruning framework that treats pruning as a global reconstruction error minimization problem.12 Operating layer-by-layer, SparseGPT uses second-order information from the loss landscape to compute the inverse Hessian of the weights against a calibration dataset.12 It then calculates which parameters to remove and dynamically updates the remaining non-zero weights to compensate for the lost information.12 While highly effective at retaining model quality, computing the inverse Hessian (which is a d x d matrix for weight dimension d) is extremely VRAM-intensive and slow, requiring up to 75 minutes on an H100 to prune a 70B model.12
  2. Wanda (Pruning by Weights and Activations): Wanda addresses the computational bottleneck of SparseGPT by completely bypassing the inverse Hessian calculation.12 It relies on a simpler, local ranking heuristic: for every weight, it computes the product of its absolute magnitude and the L2 norm of its corresponding input activation vector 12:

S_ij = |W_ij| * ||X_j||_2
Weights with the lowest scores are zeroed out while satisfying the structured 2:4 constraint.12 Wanda only requires a single fast forward sweep over calibration data, allowing a 70B model to be pruned in under 10 minutes using half the peak VRAM of SparseGPT with only a minor (0.1 to 0.2) perplexity increase on standard benchmarks.12

The Reasoning Collapse and SlideSparse Adaptation

Despite hardware-level efficiency, forcing a strict 50% pruning ratio via 2:4 structured sparsity often causes unrecoverable accuracy degradation in highly optimized frontier models.14 For instance, testing Qwen3 under a 50% 2:4 structured pruning regime resulted in its reasoning benchmark performance collapsing from a dense baseline of 54.0% down to 15.3%.14 Conversely, a moderate 25% structured pruning pattern (6:8 sparsity) preserved near-dense accuracy (51.6%), but because modern Tensor Cores do not natively support 6:8 execution, inference engines must expand the sparse weights back to dense form, yielding zero serving acceleration.14

Original Sparsity (2:8 pattern - 25% sparse):
W_source: (length L_source = 8, Z_source = 2, N = 6 non-zeros)
|
(SlideSparse Sliding Window)
v
Compressed Hardware-Conforming Layout (Target 2:4 Pattern):
W_target_0: Metadata: [ 00 | 10 ]
W_target_1: Metadata: [ 00 | 10 ]
W_target_2: Metadata: [ 00 | 10 ]

To bridge this deployment gap, SlideSparse introduces an overlapping sliding window transformation that maps arbitrary Z:L sparsity patterns to the 2:4 format.56 SlideSparse defines a generalized sparsity format Z:L, where L is the source window size and Z is the minimum number of zeros within that window.56
Through a greedy residual allocation strategy, SlideSparse duplicates and shifts non-zero elements—a process called K-expansion—to construct an expanded matrix that structurally conforms to the hardware’s 2:4 format requirements.56 This allows models with lower, accuracy-preserving sparsity profiles (e.g., 25% sparsity) to execute directly on Sparse Tensor Cores, unlocking proportional speedups (e.g., 1.33x speedup for 2:8 patterns) without suffering the severe cognitive collapse triggered by aggressive 50% pruning.56

Compression Approach Underpinning Mechanics Hardware Dependency Acceleration Mechanism Behavioral Risk
Unstructured Pruning Individual low-importance weights are zeroed without enforcing a physical layout General GPU compatible, but not directly accelerated by standard Tensor Core paths Usually none in dense execution; zero-valued weights may still be loaded and processed Low to moderate; can preserve behavior at modest sparsity but offers little serving benefit alone
2:4 Structured Sparsity Exactly two weights are zeroed in every contiguous group of four Native support on NVIDIA Ampere, Hopper, and Blackwell Sparse Tensor Core paths Sparse Tensor Cores skip zero operands and store compact metadata for non-zero positions High; forced 50% pruning can collapse reasoning, code accuracy, and rare-token competence
SlideSparse Adaptation Lower-risk sparsity patterns such as 2:8 are transformed into 2:4-compatible layouts through expansion Requires hardware capable of accelerating 2:4 sparse layouts after transformation Maps accuracy-preserving lower sparsity into hardware-conforming sparse execution Low to moderate; preserves more behavior than strict 2:4 but adds expansion/layout complexity
Low-Rank Compression Dense matrices are replaced with lower-rank projections, such as truncated SVD General linear algebra kernels; benefit depends on rank target and kernel support Reduces effective matrix dimensions, projection cost, and memory traffic Moderate; may damage retrieval, attention detail, and fine-grained reasoning

Practical fit: structured sparsity only matters when the sparse representation survives all the way into hardware execution. If the runtime expands sparse weights back into dense form, the model has merely been compressed on paper.

KV Cache Quantization and State Compression

Dynamic Memory Pressures of the Key-Value Cache

The Key-Value (KV) cache stores attention states from previous tokens to avoid redundant calculations during autoregressive decoding.3 While effective at reducing computational latency, the KV cache grows linearly with context length and batch size, rapidly outpacing static weight footprints to become the dominant VRAM bottleneck in high-concurrency systems.10 At a 128K context window, a 70B model’s uncompressed BF16 KV cache alone consumes approximately 40 GB of memory—nearly double the headroom available on two H100 GPUs after loading model weights.34
Quantizing this dynamic state reduces VRAM pressure, but keys and values exhibit highly irregular, non-normal spatial distributions that are sensitive to precision loss.10 Keys carry structural positional encodings (such as RoPE), while values contain high-magnitude outlier activations that scale-up reconstruction errors.34
Applying naive low-bit quantization causes catastrophic failure in “needle-in-a-haystack” retrieval tasks and long-context processing.10 To resolve this, specialized compression frameworks have been developed:

1. Runtime-Certified Bounded-Error Quantized Attention

This tiered KV cache architecture treats compression as a dynamic, runtime-verified computation rather than a static approximation.10 It stores keys as per-channel INT8 tensors and values as per-group INT4 tensors in VRAM (Tier-1) to achieve a 4x reduction in cache memory footprint.10 Crucially, the original, uncompressed FP16 copies are retained in system RAM (Tier-2).10
At every decode step, the GPU-based attention kernel computes a two-term error bound that independently measures the output perturbation caused by key and value compression.10 This error bound is calculated using real-time values including query norms, quantization scales, and value ranges.10 If the estimated error margin for a specific attention block exceeds a strict quality threshold, the system triggers a fast Host-to-Device (H2D) page-in from System RAM to device memory, executing a fallback to uncompressed FP16 attention for that block.10 This prevents the silent degradation of retrieval accuracy while keeping 99% of attention blocks compressed in VRAM.10

2. TurboQuant (Google Research)

TurboQuant compresses the KV cache up to 6x and accelerates attention computations by 8x without requiring a calibration dataset.34 It uses a two-stage approach:

3. ShadowKV

ShadowKV uses a low-rank approximation approach, applying a truncated Singular Value Decomposition (SVD) to compress high-dimensional key vectors down to low-rank subspaces (typically rank 160 or 256).36 Because the value projections are less sensitive to low-rank transformations, ShadowKV quantizes values to low-bit formats (e.g., FP8 or NVFP4) while maintaining full-precision key projections, preserving context-intensive retrieval capabilities with minimal latency overhead.36

Compression Scheme Active Representation VRAM Allocation vs. BF16 Context Scaling Envelope Attention Degradation Risk
FP8 KV Cache Keys and values stored in FP8 instead of BF16/FP16 About 50% reduction Roughly 2× effective context capacity or concurrency before memory pressure Low to moderate; can degrade positional precision and long-range recall
Tiered INT8 / INT4 Bounded Fallback Keys stored as per-channel INT8; values stored as per-group INT4 in VRAM; FP16 originals retained in host DRAM Up to about 4× reduction while blocks remain compressed About 3× to 4× context or concurrency expansion depending on H2D budget Bounded by runtime error checks; risk shifts from silent error to fallback latency
TurboQuant Rotated low-bit key/value states using PolarQuant smoothing plus QJL error correction Up to about 6× reduction depending on workload and implementation About 5× to 6× context expansion if CUDA path is optimized Moderate; custom kernels become load-bearing, but rotation/projection reduce outlier damage
ShadowKV / Low-Rank KV Compression Keys compressed through low-rank approximation; values quantized to FP8, NVFP4, or similar formats About 4× reduction depending on target rank and value format About 4× context expansion when workload tolerates approximation Low to moderate for context-heavy tasks if keys preserve enough structure; higher risk in dense multi-turn attention
Scheme System Interoperability
FP8 KV Cache Best fit for runtimes with native FP8 KV support, such as modern vLLM/SGLang-style deployments
Tiered INT8 / INT4 Requires custom attention kernels, pinned host memory, and fast H2D fallback paths
TurboQuant Requires custom CUDA kernels and validation against target context workloads
ShadowKV / Low-Rank KV Fits paged-attention-style engines if block tables and compression metadata stay scheduler-compatible

Operational doctrine: compress KV cache only with long-context regression tests. NIAH/RULER-style recall failures are the canary.

Speculative Decoding and Draft-Model Engineering

The Draft-and-Verify Paradigm

Speculative decoding addresses the memory-bandwidth bottleneck of autoregressive generation by exploiting underutilized parallel compute capacity on modern accelerators.23 Standard decoding generates one token per forward pass, requiring the GPU to transfer billions of parameters from HBM to registers to perform a single vector calculation.3 Speculative decoding breaks this one-token-per-pass limit using a draft-and-verify paradigm 11:

+-----------------------------------------------------------------------------------------+  
|                                  SPECULATIVE DECODING TIMELINE                          |  
|                                                                                         |  
|     ====> Draft Engine (EAGLE-3 / HiSpec) cheaply proposes K tokens.                    |  
|                             (E.g., K = 5 proposed tokens:)                              |  
|                                                                                         |  
|    ====> Target LLM runs a single parallel forward pass over all K.                     |  
|                             Computes verification probabilities simultaneously.         |  
|                                                                                         |  
|    ====> Accept first non-divergent tokens (E.g., T1, T2, T3).                          |  
|                             Reject T4, resample next token, discard T5.                 |  
+-----------------------------------------------------------------------------------------+
  1. Drafting Phase: A smaller, cheaper draft engine (or adapter head) quickly and autoregressively generates K candidate tokens.11
  2. Verification Phase: The larger, highly accurate target model takes those K candidate tokens and processes them in a single parallel forward pass.23 Modern GPUs can verify K tokens in nearly the same time they take to generate a single token because the target model’s weights only need to be transferred from HBM to registers once for the entire sequence.11
  3. Acceptance and Correction: The target model compares its token probability distribution against the draft model’s proposals.11 It accepts tokens that align with its distribution, rejects the first divergent token, and generates a fresh correction token.11 If the draft model matches the target well, the system accepts multiple tokens per step, dramatically accelerating generation without changing output quality.11

For an acceptance probability alpha and a proposed draft length K, the expected number of tokens advanced per target verification pass is:

E_tokens_advanced = (1 - alpha^(K + 1)) / (1 - alpha)

This expression counts the progress made by a verification step, including the correction token produced after the first rejected draft token. It should not be described as the expected number of accepted draft tokens. When alpha = 0, no draft tokens are accepted, but the target still advances by one corrected token. When alpha approaches 1, most or all drafted tokens are accepted, and the verification pass advances close to K + 1 tokens.

Speculative Architectures

The design of the draft engine directly impacts speculative decoding performance.23 Platform teams choose from three primary structural approaches:

  1. Vanilla (Independent Model) —> Target: 70B, Draft: 1B-3B Model

  2. Medusa (Multi-Head Token) —> Target Model Hidden States —> Multi-Head Classifiers (T+1, T+2…)

  3. EAGLE-3 (Feature Autoregression) -> Target Hidden States —> Extrapolator —> LM Head (Dynamic Tree)

  4. Vanilla Speculative Decoding: Uses an independent, smaller model from the same architecture family as the draft engine (for example, pairing a Llama 3.2 1B draft with a Llama 3 70B target).11 This approach requires zero target model modification, but the draft model must fit in VRAM alongside the target, and tokenizer differences or misaligned training corpora can degrade the token acceptance rate.11
  5. Medusa (Multi-Head Token Prediction): Medusa adds multiple feed-forward classification heads on top of the target model’s final hidden state layer.23 Instead of executing an independent model, these heads predict tokens at subsequent offsets (T+1, T+2,…) in parallel.23 This architecture avoids VRAM overhead and tokenizer mismatches, but training the specialized classification heads is computationally expensive and difficult to scale.23
  6. EAGLE-3 (Feature Autoregression): The current industry standard for speculative decoding, supported natively in vLLM, SGLang, and TensorRT-LLM.4 Unlike vanilla draft models that predict tokens at the surface vocabulary layer, EAGLE-3 performs feature-level autoregressive predictions.59 It captures the target model’s final-layer hidden states and feeds them into a lightweight, 4-layer Transformer extrapolator to predict subsequent hidden states.59 These predicted states are then passed through the target model’s language model head to generate tokens.59

EAGLE-3 further incorporates a Training-Time Test (TTT) procedure that simulates multi-step error accumulation during draft training, allowing it to maintain a high acceptance rate (approximately 80%) on complex reasoning, mathematical, and coding tasks.4 On Llama 3.3 70B, EAGLE-3 delivers up to a 4.79x latency speedup without degrading output quality.4
P-EAGLE (Parallel Drafting): P-EAGLE addresses the sequential processing bottleneck of standard speculative drafting.41 Instead of requiring K sequential forward passes through the draft model to propose K tokens, P-EAGLE uses a lightweight, parallel-capable draft model that generates all K candidate tokens in a single forward pass, constructing verification trees that are verified by the target model with minimal latency overhead.41

Hierarchical Speculative Decoding (HiSpec)

While draft generation is fast, verifying a long sequence of proposed tokens against a massive target model (like a 70B or 405B parameter model) still incurs substantial verification latency, taking up to 6.9x longer than draft generation.25 HiSpec addresses this verification bottleneck by using low-overhead Early-Exit (EE) models as intermediate verifiers.25
Rather than executing a full forward pass through all layers of the target model for every verification step, HiSpec routes candidate tokens through an intermediate verifier that exits at an early layer (typically one-fourth of the target model’s total depth).25 This intermediate verifier rejects inaccurate candidate tokens early in the execution path, reducing the volume of tokens that must pass through the remaining expensive layers of the target model.25

Speculative Pipeline Draft Engine Architecture Target Verification Path Acceptance Profile Throughput Gain Resource Demands
Vanilla Speculation Independent small model, usually 1B–3B parameters, paired with same-family target Full target model verifies drafted token sequence in parallel Moderate; strong only when tokenizer and distributions align About 1.8×–2.3× when draft and target align well Adds draft-model VRAM; requires tokenizer/vocab alignment
Medusa Multiple lightweight prediction heads attached to target hidden states Target hidden states are extended through trained multi-token heads Moderate to good; best on predictable continuations About 1.5×–2.1×; lower ceiling than EAGLE-style drafting Minimal added VRAM, but requires custom head training and integration
EAGLE-3 Lightweight feature-level autoregressive extrapolator predicts future hidden states Target model verifies tokens generated from predicted hidden-state trajectories High; often around 80% on aligned workloads About 2.5×–3.5× in production-like batch workloads Requires draft training and runtime support in vLLM/SGLang/TRT-LLM
P-EAGLE Parallel draft model proposes multiple candidate tokens in one forward pass Verification tree is processed by the target with minimal sequential drafting High if mask and placeholder layout remain stable About 2.8×–3.6× when parallel drafting avoids draft tax Rebuilds batch metadata slots; needs specialized runtime support
HiSpec Early-exit verifier filters candidates before full target depth Hierarchical verification: weak candidates exit early; strong candidates continue Good when verification cost dominates Up to about 4× when early rejection saves target work Requires reusable hidden states, KV-cache coordination, and early-exit verifier integration

Fit rule: speculation is useful only when accepted draft tokens outnumber the cost of drafting, verification, and rejected-token recovery. If acceptance falls below the operational threshold, fall back to standard autoregressive decoding.

Draft Model Selection Framework

Implementing speculative decoding in production requires careful engineering when selecting and managing draft models.4 Platform teams should evaluate draft models using a systematic framework:

  1. Structural and Tokenizer Alignment: The draft model must use the exact same tokenizer and vocabulary layout as the target model.11 Tokenizer mismatches require custom runtime translation layers that add latency and degrade token alignment.11
  2. Dynamic VRAM Footprint Splitting: Both the draft and target models must fit in GPU memory simultaneously.57 To prevent out-of-memory (OOM) errors, platform teams should configure serving engines (like vLLM) with a strict VRAM allocation split (for example, raising –gpu-memory-utilization to 0.94 and pinning the draft model to a single GPU via –speculative-draft-tensor-parallel-size 1 while tensor-paralleling the target model).57
  3. Workload-Aware Speculative Depth Tuning: The draft length K (how many tokens are proposed per step) must be tuned to match workload characteristics.11 High-entropy tasks like creative writing or reasoning benefit from a short draft length (e.g., K = 2 or 3), which minimizes the latency penalty of rejected tokens.11 Conversely, highly structured or predictable tasks like code completion or JSON formatting can use a longer draft length (e.g., K = 5 or 8) to maximize throughput.11
  4. Graceful Runtime Fallbacks: Production engines must monitor token acceptance rates in real time.42 If the average acceptance rate drops below a defined threshold (e.g., 50%) due to a shift in user prompts, the engine should gracefully fallback to standard autoregressive decoding to avoid speculation overhead.11

Constrained Decoding and Structured Generation

Mathematical Mechanics of Logit Masking

Constrained decoding mathematically guarantees that model outputs comply with structured formatting rules—such as JSON schemas, XML templates, regular expressions, or custom domain-specific grammars—by modifying the model’s token probability distribution at every generation step.2
An unconstrained model generates a token by converting its output logits L (of dimension equal to vocabulary size |V|) into a probability distribution via a standard softmax operation.26 Constrained decoding inserts a specialized logit processor between the model’s output layer and the sampling step.2 This processor queries a grammar-compiling state machine to determine which tokens are structurally valid next steps.2 Any token that violates the structural rules is masked by setting its logit to negative infinity 2:
L_i^(constrained) = L_i if i is in V_valid, else -infinity if i is not in V_valid
The remaining valid tokens are then passed to the softmax function, ensuring that the model can only select structurally valid outputs.2

                 [ Model Output Logits ] (V = 32,000)  
                            |  
            +---------------+---------------+  
            |                               |  
            v                               v  
   [ Context-Independent ]          
   - Static string literals        - Numbers, Dynamic keys  
   - FSM-tracked tokens            - CFG Stack-tracked tokens  
            |                               |  
     (O(1) Hash Map Lookup)       (PDA/Earley Stack Inspection)  
            |                               |  
            +---------------+---------------+  
                            |  
                            v  
                [ Logit Masking Processor ]  
                - Set invalid logits to -inf  
                            |  
                            v  

To enforce these rules efficiently, logit processors compile constraints into formal automata 26:

Advanced Constrained Runtimes

While logit masking guarantees structural compliance, naive constraint engines introduce severe processing bottlenecks, adding up to several seconds of compilation overhead and slowing down generation speeds.39 Advanced structured generation engines like XGrammar-2 and llguidance address these limitations using three primary runtime optimizations 2:

  1. TagDispatch (TriggeredTags): Agentic workflows often mix free-form reasoning text with structured tool calls.26 Forcing the entire output into a single, massive JSON schema is highly inefficient.39 TagDispatch addresses this by keeping the model in a lightweight “free-text” scanning mode by default, utilizing a fast Aho-Corasick automaton to monitor outputs.54 Once the model generates a specific trigger tag (such as <|DSML|tool_calls>), the engine dynamically dispatches and enforces the corresponding structured grammar.39 This defers compilation overhead and minimizes cache usage.54
  2. Cross-Grammar Cache via FSM Hashing: Preprocessing and compiling unique grammars for dozens of active tools during a session can introduce significant latency spikes. Different tool schemas often reuse identical substructures, such as standard string fields, enum wrappers, nullable arrays, object-property separators, or repeated JSON fragments. Cross-grammar caching hashes these reusable grammar fragments or FSM subgraphs so that common schema components compile once and can be reused across many tools. Instead of treating every tool schema as a completely new grammar, the runtime decomposes schemas into shared automata components, caches their compiled states, and links them into the active grammar graph at dispatch time. This reduces first-use compilation overhead, improves cache locality, and prevents large tool catalogs from turning schema enforcement into a hidden prefill-latency tax.
  3. Repetition State Compression: Many JSON schemas enforce repetition constraints, such as validating arrays with a high item limit.39 Naive engines compile these patterns by duplicating the state representation proportionally, resulting in an O(repetition_count) complexity that degrades performance.39 XGrammar-2 compresses this by introducing a dedicated “repetition” grammar primitive that maintains a constant O(1) state size regardless of the allowed repetition limit, speeding up array compilation times by up to 100x.39

Differentiating Structure from Semantics

An important engineering doctrine in constrained decoding is that valid structure is not correct meaning.2 Constrained decoding guarantees syntactic compliance—ensuring that braces balance, required fields appear, and data types match the schema.2
However, a syntactically perfect JSON object can still contain incorrect values, such as selecting the wrong enum field, hallucinating tool arguments, or failing downstream verification checks.2 Therefore, constrained decoding must be paired with robust semantic validation, tool-result verification, and automated retry loops to protect downstream systems.40

Framework / Mode Enforcement Mechanism Compilation Overhead Per-Token Overhead Structural Guarantee
JSON Mode Provider-guided formatting bias and generic JSON-shape steering Near-zero Near-zero Weak to moderate; improves JSON likelihood but may fail complex schemas
Provider-Native Strict Structured Outputs API-managed constrained decoding against declared schema/grammar Provider-managed; may include first-schema setup or cache cost Low to minimal; hidden inside serving runtime Strong; enforces declared schema paths before token selection
XGrammar-2 Hybrid FSM/PDA grammar masking, TagDispatch, repetition compression, dynamic grammar dispatch Low after optimization via cross-grammar cache, FSM hashing, and O(1) repetition primitives Near-zero with grammar caching and compressed repetition state Strong; handles complex nested structures, dynamic tools, and mixed prose/structured flow
llguidance CFG/regex/token-indexed constrained generation with optimized parser state Moderate; requires grammar preprocessing and parser state construction Low when precomputed Strong; good fit for recursive schemas, XML, and regex-heavy formats

Doctrine: constrained decoding guarantees valid structure, not truthful content. Use it for syntax; use validators, tools, and grounded checks for semantics.

Sampling Dynamics and Serving Policies

Mathematical Logit Transformation and Sampling Controls

Sampling parameters shape how a language model selects final tokens from its output probability distribution.28 These controls do not alter the underlying model weights; instead, they change how the system evaluates predictions at every generation step 29:

+--------------------------------------------------------------------------------------+
|                                  SAMPLING PIPELINE                                   |
+--------------------------------------------------------------------------------------+
|                                                                                      
|  [ Model Forward Pass ]                                                              
|          |                                                                           
|          v                                                                           
|  [ Raw Logits ]                                                                      
|          |                                                                           
|          |  Vector of unnormalized token scores across the vocabulary                
|          v                                                                           
|  +--------------------------------------------------------------------------------+  
|  |                           LOGIT PROCESSORS                                     |  
|  |                                                                                |  
|  |  - Logit bias: raise or lower specific token scores                            |  
|  |  - Repetition penalty: adjust scores for previously used tokens                |  
|  |  - Constraint mask: set structurally invalid tokens to -infinity               |  
|  +-----------------------------------+--------------------------------------------+  
|                                      |                                               
|                                      v                                               
|  [ Temperature Scaling ]                                                             
|          |                                                                           
|          |  L_i_scaled = L_i / T                                                     
|          |                                                                           
|          |  Lower T -> sharper, more deterministic distribution                      
|          |  Higher T -> flatter, more diverse distribution                           
|          v                                                                           
|  [ Softmax Normalization ]                                                           
|          |                                                                           
|          |  Converts scaled logits into token probabilities                          
|          v                                                                           
|  +--------------------------------------------------------------------------------+  
|  |                         CANDIDATE SET FILTERING                                |  
|  |                                                                                |  
|  |  - Top-K: keep only the K highest-probability tokens                           |  
|  |  - Top-P: keep smallest set whose cumulative probability exceeds p             |  
|  |  - Min-P: keep tokens above threshold scaled by the top token probability      |  
|  +-----------------------------------+--------------------------------------------+  
|                                      |                                               
|                                      v                                               
|  [ Probability Renormalization ]                                                     
|          |                                                                           
|          |  Redistribute probability mass across surviving candidates                
|          v                                                                           
|  [ Token Selection ]                                                                 
|          |                                                                           
|          |  - Greedy / argmax when deterministic                                     
|          |  - Random sampling when stochastic                                        
|          |  - Seeded sampling when reproducibility is required                       
|          v                                                                           
|  [ Selected Next Token ]                                                             
|          |                                                                           
|          v                                                                           
|  [ Append Token to Context and Continue Decode Loop ]                                
|                                                                                      
+--------------------------------------------------------------------------------------+
| Doctrine: sampling changes token choice, not model knowledge. It tunes expression,   |
| diversity, determinism, and structural stability at the final decoding boundary.     |
+--------------------------------------------------------------------------------------+

1. Temperature Scaling

Temperature (T) acts as a sharpness control for the probability distribution.28 Before applying the softmax function, the engine divides all raw logits (L_i) by T 29:
P(x_i) = e^(L_i / T) / sum_j(e^(L_j / T))

2. Top-K and Top-P Truncation

3. Min-P Dynamic Truncation

Min-P addresses the quality-diversity trade-off by scaling the truncation threshold based on the model’s confidence at each step.28 Rather than using an absolute cumulative cutoff, Min-P filters out any token whose probability falls below a dynamic threshold scaled by the top token’s probability (p_max) 28:
p_scaled = p_base * p_max

Example Scenario: base_p = 0.1

Case A: Highly Confident Model (p_max = 0.8)

Case B: Uncertain Model (p_max = 0.2)

By adapting dynamically, Min-P preserves coherence on structured tasks while allowing creative variations at high temperatures without generating incoherent output.38

Decoding Parameter Core Mathematical Impact Typical Range / Envelope Output Stability Risk Serving Impact
Temperature Divides logits before softmax: L_i_scaled = L_i / T; lower values sharpen probabilities, higher values flatten them 0.0–2.0; common production range 0.0–1.0 High values increase incoherence, syntax drift, and schema fragility Negligible compute impact; mostly changes token selection behavior
Top-K Restricts candidate pool to the K highest-probability tokens K=1 gives greedy/argmax; K=20–100 common Too small: repetitive or brittle output; too large: little filtering Low to moderate overhead depending on ranking implementation
Top-P / Nucleus Sampling Keeps the smallest ranked set whose cumulative probability exceeds threshold p 0.0–1.0; common range 0.8–0.95 High p admits long-tail tokens at high temperature; low p may over-truncate valid alternatives Moderate sorting/cumulative-probability overhead at large batches
Min-P Keeps tokens whose probability exceeds base_p * p_max 0.0–1.0; common range 0.05–0.10 Usually more stable than Top-P under high confidence; excessive values over-prune useful tokens Lightweight filtering after probability scores are available
Repetition Penalty Adjusts logits for tokens that already appear in context; values above 1 discourage reuse Usually 1.0–2.0; 1.0 means disabled High values can corrupt fixed phrases, code, XML, JSON keys, and required structural tags Negligible overhead; implemented as a logit processor
Logit Bias Adds or subtracts offsets to selected token logits before softmax Engine-specific positive/negative bias values applied to token IDs or strings Strong bias can force unnatural phrasing, block required terms, or distort refusal behavior Negligible overhead; useful as a nudge, dangerous as a policy substitute
Decoding Parameter Reproducibility Profile Best Fit Bad Fit Operational Note
Temperature Deterministic only at T=0 or with seeded stochastic sampling under stable runtime conditions Low-temperature factual QA, extraction, structured tasks; high-temperature ideation Creative writing at T=0; strict schemas at very high temperature Pair low temperature with structural constraints when format matters
Top-K Deterministic only when K=1 or when sampling is seeded and runtime behavior is stable Autocomplete, bounded generation, low-entropy completion Open-ended generation needing broad lexical variety Useful when candidate pool must be hard-capped
Top-P Non-deterministic unless seeded and engine behavior remains stable General chat, creative drafting, open-ended natural language Strict schema generation unless paired with grammar or schema enforcement Tune jointly with temperature; high T plus high p can admit low-probability incoherent tokens
Min-P Non-deterministic unless seeded and runtime behavior remains stable Creative but coherent output, high-temperature sampling with reduced garbage tokens Tasks needing maximal determinism or exhaustive low-probability exploration Often a better high-temperature guardrail than Top-P alone
Repetition Penalty Deterministic if paired with deterministic decoding Reducing loops, boilerplate repetition, degenerate repeated phrases Code, schemas, poetry, legal text, XML/JSON, or required repeated labels Keep mild when exact token reuse is required
Logit Bias Deterministic if paired with deterministic decoding Encouraging/discouraging specific vocabulary, formats, controlled terminology Replacing validation, policy enforcement, or semantic verification Treat as steering, not correctness

Doctrine: sampling policies shape expression and candidate selection; they do not fix knowledge, reasoning, grounding, or semantics.

Quality Verification, Regression, and Lifecycle Management

Quality Degradation Boundary Framework

Optimizing model representations through quantization or pruning introduces mathematical approximation errors.10 While a compressed model may appear fluent on the surface, specific task-critical behaviors are often the first to degrade.13 To prevent silent regressions in production, platform teams must establish a structured Quality Degradation Boundary Framework:

+--------------------------------------------------------------------------------------+
|                            QUALITY DEGRADATION FRAMEWORK                             |
+--------------------------------------------------------------------------------------+
|                                                                                      
|  Goal: prove that an optimized model is behavior-preserving, not merely faster.      
|                                                                                      
|  [ Full-Precision Baseline ]                                                         
|          |                                                                           
|          |  Establish control metrics for behavior, latency, cost, and safety        
|          v                                                                           
|  +--------------------------------------------------------------------------------+  
|  |                         BASELINE CONTROL PROFILE                               |  
|  |                                                                                |  
|  |  - Task success rate                                                           |  
|  |  - Reasoning / code / arithmetic accuracy                                      |  
|  |  - Schema and tool-call compliance                                             |  
|  |  - Long-context retrieval accuracy                                             |  
|  |  - Safety, refusal, and alignment behavior                                     |  
|  |  - TTFT, ITL, throughput, and cost-per-success                                 |  
|  +-----------------------------------+--------------------------------------------+  
|                                      |                                               
|                                      v                                               
|  [ Candidate Optimization ]                                                          
|          |                                                                           
|          |  Examples: FP8, AWQ INT4, GPTQ INT4, KV-cache quantization, pruning,      
|          |  speculative decoding, constrained generation, runtime-kernel swap        
|          v                                                                           
|  +--------------------------------------------------------------------------------+  
|  |                        TARGETED REGRESSION EVALUATION                          |  
|  |                                                                                |  
|  |  +-----------------------------+   +----------------------------------------+  |  
|  |  | Logical Depth               |   | Schema / Tool Compliance               |  |  
|  |  |                             |   |                                        |  |  
|  |  | - Multi-step reasoning      |   | - JSON / XML validity                  |  |  
|  |  | - Math and code execution   |   | - Required fields and enum choices     |  |  
|  |  | - Symbolic synthesis        |   | - Tool argument names, types, values   |  |  
|  |  +-----------------------------+   +----------------------------------------+  |  
|  |                                                                                |  
|  |  +-----------------------------+   +----------------------------------------+  |  
|  |  | Grounding / Retrieval       |   | Safety / Alignment Coherence           |  |  
|  |  |                             |   |                                        |  |  
|  |  | - NIAH / RULER recall       |   | - Refusal consistency                  |  |  
|  |  | - Long-context fidelity     |   | - Jailbreak resistance                 |  |  
|  |  | - Citation / evidence use   |   | - False-positive refusal drift         |  |  
|  |  +-----------------------------+   +----------------------------------------+  |  
|  |                                                                                |  
|  |  +--------------------------------------------------------------------------+  |  
|  |  | Runtime / Economics                                                      |  |  
|  |  |                                                                          |  |  
|  |  | - TTFT and ITL under realistic batch loads                               |  |  
|  |  | - Tokens per second and requests per second                              |  |  
|  |  | - VRAM / HBM footprint                                                   |  |  
|  |  | - Cost per successful task, not just cost per generated token            |  |  
|  |  +--------------------------------------------------------------------------+  |  
|  +-----------------------------------+--------------------------------------------+  
|                                      |                                               
|                                      v                                               
|  [ Item-Level Baseline Comparison ]                                                  
|          |                                                                           
|          | Compare candidate outputs against the full-precision control item-by-item 
|          | to catch fluent but damaged behavior.                                     
|          |                                                                           
|          v                                                                           
|     +----+---------------------------------------------+                             
|     |                                                  |                             
|     v                                                  v                             
|  [ All Gates Pass ]                              [ Any Critical Gate Fails ]         
|     |                                                  |                             
|     |                                                  +--> Roll back candidate      
|     |                                                  +--> Raise precision          
|     |                                                  +--> Recalibrate / retune     
|     |                                                  +--> Change compression path  
|     |                                                  |                             
|     v                                                  |                             
|  [ Promote Optimized Artifact ]                 [ Re-test Against Baseline ]         
|     |                                                  ^                             
|     |                                                  |                             
|     +---------------------> [ Production Monitoring ] ----------------------+
|                                                                                      
+--------------------------------------------------------------------------------------+
| Doctrine: fluency is not proof of preservation. Compression passes only when fragile |
| behaviors survive item-level regression against the dense baseline.                  |
+--------------------------------------------------------------------------------------+
  1. Logical Processing and Multi-Step Synthesis: Evaluates complex deduction, mathematical processing, and symbolic execution.18 Quantization often damages multi-step logic pathways before affecting conversational fluency.14
  2. Schema Compliance and Syntax Constraints: Monitors the model’s ability to output valid JSON structures, conform to strict tool definitions, and generate correct closing tags.26
  3. Grounding and Retrieval Integrity: Tests long-context information retrieval using RULER and “Needle-in-a-haystack” benchmarks.10 This is critical when evaluating KV cache compression, which can cause silent information retrieval failures.10
  4. Safety, Refusals, and Alignment Coherence: Verifies that optimization interventions do not alter model safety alignments, trigger false-positive refusals, or make the model vulnerable to prompt injections.9
+-----------------------------------------------------------------------------------------+  
|                               BEHAVIOR PRESERVATION ENVELOPE                            |  
|                                                                                         |  
|                                                                                         |  
|  - Task Success Rate      : Match dense baseline within 1.0% margin.                    |  
|  - Reasoning Coherence    : GSM8K and code execution accuracy gates.                    |  
|  - Schema Adherence       : Zero syntax parser failures.                                |  
|  - Safety Integrity       : Zero false-refusal drift on standard benchmarks.            |  
|  - Long-Context Recall    : 100% NIAH accuracy up to operational limit.                 |  
|  - Latency Constraints    : TTFT and ITL reductions verified under batch loads.         |  
+-----------------------------------------------------------------------------------------+

Optimization Regression Suite

Before deploying an optimized artifact (such as a quantized weight matrix, compressed KV cache configuration, or new speculative draft model) to production, platform teams must evaluate it against an Optimization Regression Suite:

Testing Domain Baseline Control Benchmark Instrument Item-Level Verification Failure Gate Condition
Logical Depth Full BF16/FP16 dense model GSM8K, GPQA, code execution, symbolic reasoning tasks Compare reasoning traces, arithmetic steps, code outputs, and final answers Accuracy drop exceeds allowed margin; critical reasoning item fails; code no longer compiles or executes
Schema & Tool Use Full schema prompt with dense model Multi-tool schema validation, function-call test sets Verify JSON/XML validity, required fields, enum choices, names, types, and tool argument values Any parser failure, malformed structure, missing required field, wrong argument type, or unsafe tool-call construction
Retrieval & Recall Uncompressed KV cache and full context precision NIAH, RULER, long-context recall probes Check exact recovery of buried facts, citations, and long-range references Retrieval accuracy falls below threshold; cited evidence is lost, displaced, or hallucinated
Safety Alignment Baseline refusal and compliance profile Adversarial prompts, policy probes, threat logs, red-team sets Audit refusal consistency, false refusals, unsafe compliance, and boundary behavior Any jailbreak increase, unsafe compliance, false-positive refusal drift, or alignment regression
Latency & Cost Baseline serving profile under realistic traffic Concurrency load tests, trace replay, hardware telemetry Measure TTFT, ITL, TPS, RPS, VRAM footprint, preemption, and cost per successful task Candidate is slower, exceeds VRAM limits, raises preemption, or improves cost per token while worsening cost per successful task
Kernel & Runtime Compatibility Known-good runtime engine and hardware configuration Engine compatibility tests, canary deploys, production replicas Confirm quantization format, scheduler behavior, grammar parser, and draft verifier compatibility Kernel mismatch, unsupported GPU path, dequantization in critical path, or runtime crash under batch load

Promotion rule: an optimization passes only if it improves the intended latency, memory, or cost target without crossing any behavioral regression gate. A faster model that fails reasoning, schemas, retrieval, safety, or runtime compatibility is not optimized; it is damaged efficiently.

Optimization Failure Mode Map

Altering model weights or runtime decoding execution can trigger specialized system failures.1 Platform teams should use this Failure Mode Map to identify, diagnose, and resolve optimization regressions:

Failure Syndrome Root-Cause Mechanism Behavioral Presentation System Detection Path Smallest Effective Corrective Move Rollback Strategy
Fluent Degradation Over-quantization, aggressive pruning, or lossy KV compression preserves surface fluency while damaging internal reasoning paths Output sounds plausible but contains wrong math, bad code, weak reasoning, or subtle contradictions Item-level reasoning evals, code execution tests, arithmetic checks, dense-baseline comparison Raise precision, reduce pruning ratio, recalibrate with reasoning-heavy examples Revert to dense or higher-bit artifact
Outlier Clipping Per-tensor or poorly calibrated scales clip rare activation outliers Syntax breaks on rare structures, code/math errors increase, safety boundaries shift Activation-range telemetry, outlier-channel inspection, targeted rare-token tests Use SmoothQuant, group-wise scaling, microscaling, or broader calibration data Roll back quantized activation path
Calibration-Set Mismatch Calibration corpus fails to represent production prompt shapes, schemas, languages, or context lengths Model works on generic tests but fails real workloads Production trace replay, workload-distribution comparison, schema/tool evals Rebuild calibration set from representative traces, isolate eval set, recalibrate Freeze current artifact; reissue optimized build
Dequantization in Critical Path Runtime upcasts compressed weights before execution or uses inefficient layout Disk/VRAM footprint improves but latency does not; ITL remains high Kernel profiling, memory-bandwidth counters, dequantization trace spans Switch to layout-aware kernels such as Marlin-compatible paths Revert to known-good runtime/kernel pair
Kernel Incompatibility Quantization format, sparse layout, GPU generation, or runtime engine do not match Worker crashes, illegal memory access, silent fallback to slow dense path Runtime compatibility tests, canary crashes, kernel logs Pin supported engine/hardware pair; rebuild artifact for target kernel Route to compatible hardware pool or dense fallback
Speculation Overhead Draft model mismatch, low acceptance rate, bad speculative depth, or weak verifier integration Generation slows down despite speculation; rejection loops dominate Draft acceptance rate, verification latency, rejected-token ratio Lower draft length, switch draft model, or disable speculation below threshold Fall back to standard autoregressive decoding
Speculative Parity Drift Approximate verifier, runtime bug, or draft integration changes target distribution Candidate output differs from target model beyond accepted tolerance Target-output parity tests, seeded replay, canary divergence Tighten verification, disable approximate path, fix runtime integration Disable speculation and use target-only decoding
Parser Deadlock Grammar compiler, FSM/PDA state, or constrained decoder enters invalid recursive state Generation stalls, times out, or loops under strict schema Parser timeout alarms, grammar-state traces, retry-loop metrics Use TagDispatch, simplify grammar, cache compiled fragments, add escape path Fall back to downstream validation/retry rather than token-level constraints
Semantic-Validity Gap Constrained decoding enforces syntax but not truth JSON is valid but fields are wrong, enum choice is unsafe, tool arguments hallucinated Semantic validators, tool dry-runs, grounded claim checks Add external validation and tool-result verification Disable action execution until semantic validation passes
Long-Context Divergence KV-cache compression or low-rank state approximation damages attention over distant context Model loses buried facts, citations, or earlier tool state NIAH/RULER, long-context trace replay, citation-retention tests Raise KV precision, enable bounded fallback, reduce compression on keys Disable compressed KV path for long-context workloads
Pruning Collapse Structured sparsity removes behavior-critical parameters Reasoning, code, or rare-token performance collapses Dense-vs-sparse evals, item-level negative flips, perplexity jump Reduce sparsity target, use SlideSparse/Wanda, retrain or recalibrate Restore dense checkpoint or lower-pruned artifact
Safety Boundary Shift Quantization/pruning/preference interactions alter refusal or policy behavior False refusals rise or unsafe compliance increases Safety evals, jailbreak tests, refusal-classifier drift Raise precision on safety-sensitive routes, adjust safety layer, recalibrate Route safety-sensitive traffic to dense baseline
Cost-Per-Success Regression Optimization lowers token cost but increases retries, failures, or human overrides Cost per generated token improves while total task cost worsens Cost-per-success telemetry, retry rate, override rate Optimize for successful task completion, not raw token cost Roll back optimization for affected workflow
Fallback Failure Runtime lacks automatic exit from failed optimization path Bad artifact continues serving after threshold breach Alert threshold breach without route change, canary incident Add automatic precision/fallback thresholds Route all traffic to verified baseline manifest

Deployment Compatibility Model

Before deploying any optimized artifact, platform teams must verify that it is fully compatible with all layers of the serving infrastructure 5:

+--------------------------------------------------------------------------------
| DEPLOYMENT COMPATIBILITY MODEL
+--------------------------------------------------------------------------------
|
| Goal:
|   Verify that the optimized artifact can actually run safely and efficiently
|   in the target serving environment.
|
| [ Optimized Artifact ]
|       |
|       v
| +-------------------------+
| | Model Layer             |
| | architecture, tokenizer |
| | base checkpoint, heads  |
| +-----------+-------------+
|             |
|             v
| +-------------------------+
| | Quantization Format     |
| | AWQ, GPTQ, FP8, FP4,    |
| | EXL2, sparse layout     |
| +-----------+-------------+
|             |
|             v
| +-------------------------+
| | Runtime Kernel Layer    |
| | Marlin, cuBLASLt,       |
| | TRT-LLM, vLLM, SGLang   |
| +-----------+-------------+
|             |
|             v
| +-------------------------+
| | Hardware Layer          |
| | GPU generation, memory, |
| | Tensor Core support     |
| +-----------+-------------+
|             |
|             v
| +-------------------------+
| | Decoding Runtime Layer  |
| | grammar engine, draft   |
| | verifier, sampler, FSM  |
| +-----------+-------------+
|             |
|             v
| +-------------------------+
| | Serving Policy Layer    |
| | scheduler, batching,    |
| | KV cache, fallback      |
| +-----------+-------------+
|             |
|             v
| +-------------------------+
| | Validation Gate         |
| | behavior, latency, cost,|
| | safety, rollback        |
| +-------------------------+
|
+--------------------------------------------------------------------------------
| Rule:
|   An optimized checkpoint is not deployable merely because it loads. It must
|   match the model architecture, runtime kernel path, hardware instruction set,
|   decoding stack, scheduler policy, and behavioral validation gates.
+--------------------------------------------------------------------------------
  1. Model Layer: Verifies that the base architecture, parameter configuration, and tokenizer layout are fully supported.11
  2. Quantization Format: Confirms that the optimization format (such as AWQ, GPTQ, or EXL2) is compatible with the target runtime engine’s kernels.1
  3. Hardware Accelerator Layer: Verifies that the target GPUs support the compiled formats (for example, native FP8 computation requires Blackwell, Hopper, or Ada Lovelace architectures with a Compute Capability greater than or equal to 8.9).33
  4. Decoding and Speculation Layer: Confirms that the serving scheduler natively supports the speculative draft model, schema parsers, and caching strategies.24

Evaluation and Observability Guidance

To monitor optimized artifacts in production, platform teams should track several key performance indicators:

Compression Decision Ladder

This Compression Decision Ladder guides platform teams through a systematic sequence of optimization interventions, prioritizing low-risk steps before moving to aggressive compression 1:

+---------------------------------------------------------------------------------------+
|                              COMPRESSION DECISION LADDER                              |
+---------------------------------------------------------------------------------------+
|                                                                                       
|  Goal: reduce latency, memory pressure, or cost without crossing the quality          
|  degradation boundary. Climb only as far as the bottleneck requires.                  
|                                                                                       
|  [ Diagnose Bottleneck ]                                                              
|          |                                                                            
|          |  Is the problem TTFT, ITL, VRAM pressure, HBM bandwidth, schema overhead,  
|          |  long-context capacity, or cost-per-success?                               
|          v                                                                            
|                                                                                       
|  Level 1: Serving Engine Tuning                                                       
|  +--------------------------------------------------------------------------------+   
|  | Apply prefix caching / RadixAttention, prompt-root stabilization, batching,    |   
|  | scheduler tuning, and cache-hit instrumentation.                               |   
|  |                                                                                |   
|  | Savings: no static weight reduction; reduces redundant prefill work and KV use.|   
|  | Risk: lowest. No model behavior is directly changed.                           |   
|  | Gate: prefix hit rate, TTFT, queue wait, and cache eviction telemetry.         |   
|  +-----------------------------------+--------------------------------------------+   
|                                      |                                                
|                                      v                                                
|                                                                                       
|  Level 2: Native Low-Precision Runtime                                                
|  +--------------------------------------------------------------------------------+   
|  | Move to native FP8 / W8A8 execution where hardware and serving kernels support |   
|  | it cleanly.                                                                    |   
|  |                                                                                |   
|  | Savings: ~50% weight / activation footprint versus BF16 / FP16 paths.          |   
|  | Risk: low to moderate on modern Hopper / Blackwell-class systems.              |   
|  | Gate: reasoning, schema, tool-use, safety, and activation outlier tests.       |   
|  +-----------------------------------+--------------------------------------------+   
|                                      |                                                
|                                      v                                                
|                                                                                       
|  Level 3: KV Cache Compression                                                        
|  +--------------------------------------------------------------------------------+   
|  | Quantize or compress KV cache using FP8, tiered INT8 / INT4, bounded fallback, |   
|  | TurboQuant-style schemes, or low-rank KV compression.                          |   
|  |                                                                                |   
|  | Savings: expands context length and concurrency by shrinking dynamic state.    |   
|  | Risk: moderate; long-context recall can silently degrade.                      |   
|  | Gate: NIAH / RULER recall, citation fidelity, tool-state retention, H2D latency|   
|  +-----------------------------------+--------------------------------------------+   
|                                      |                                                
|                                      v                                                
|                                                                                       
|  Level 4: Speculative Acceleration                                                    
|  +--------------------------------------------------------------------------------+   
|  | Deploy EAGLE-3, P-EAGLE, Medusa, vanilla draft models, or HiSpec-style early   |   
|  | verification to accelerate decode.                                             |   
|  |                                                                                |   
|  | Savings: no weight savings; may add draft-model VRAM. Improves tokens/sec if   |   
|  | draft acceptance exceeds verification overhead.                                |   
|  | Risk: low to moderate when fallback to standard decoding is automatic.         |   
|  | Gate: draft acceptance rate, rejection-loop latency, VRAM split, output parity.|   
|  +-----------------------------------+--------------------------------------------+   
|                                      |                                                
|                                      v                                                
|                                                                                       
|  Level 5: INT4 Weight-Only Quantization                                               
|  +----------------------------------------------------------------------------------+ 
|  | Apply AWQ, GPTQ, EXL2, or equivalent INT4 weight-only compression with optimized | 
|  | kernels such as Marlin.                                                          | 
|  |                                                                                  | 
|  | Savings: ~75% static weight footprint reduction.                                 | 
|  | Risk: high; reasoning, code, math, formatting, and rare-token behavior may fail. | 
|  | Gate: item-level regression suite against full-precision baseline.               | 
|  +-----------------------------------+----------------------------------------------+ 
|                                      |                                                
|                                      v                                                
|                                                                                       
|  Level 6: Structured Sparsity / Pruning                                               
|  +--------------------------------------------------------------------------------+   
|  | Apply 2:4 structured sparsity, SlideSparse adaptation, Wanda, SparseGPT, or    |   
|  | other pruning paths only after safer interventions are exhausted.              |   
|  |                                                                                |   
|  | Savings: potential physical weight and GEMM acceleration when hardware kernels |   
|  | execute the sparse layout directly.                                            |   
|  | Risk: highest; aggressive pruning can cause cognitive collapse.                |   
|  | Gate: full behavioral regression, canary rollout, rollback-ready release plan. |   
|  +--------------------------------------------------------------------------------+   
|                                                                                       
+---------------------------------------------------------------------------------------+
| Ladder rule: stop climbing when the bottleneck is solved. Every higher level buys     |
| capacity by taking on more behavioral and deployment risk.                            |
+---------------------------------------------------------------------------------------+
Level Optimization Intervention Primary Savings Performance Impact Behavioral Regression Risk Required Validation Suite
1 Serving Engine Tuning — prefix caching, RadixAttention, prompt-root stabilization, batching, scheduler tuning No static weight savings; reduces redundant prefill work and improves KV reuse Can strongly improve TTFT and throughput on prefix-heavy or multi-turn workloads Very low; model behavior is not directly altered Prefix hit rate, TTFT, queue wait, cache eviction telemetry, prompt-layout stability tests
2 Native Low-Precision Runtime — FP8 / W8A8 where hardware and kernels support it About 50% reduction versus BF16/FP16 paths for supported tensors Improves memory movement and Tensor Core utilization on compatible accelerators Low to moderate; activation outliers and rare-token behavior can degrade Reasoning, schema, tool-use, safety, activation-outlier, and long-context tests
3 KV Cache Compression — FP8 KV, tiered INT8/INT4, bounded fallback, TurboQuant, low-rank KV Shrinks dynamic KV state; expands context length and concurrency Reduces memory pressure under long-context or high-concurrency loads Moderate; long-context recall and citation fidelity can silently degrade NIAH/RULER recall, citation fidelity, tool-state retention, H2D fallback latency, long-context regression
4 Speculative Acceleration — EAGLE-3, P-EAGLE, Medusa, HiSpec, vanilla draft model No weight savings; may add draft-model or verifier overhead Improves tokens/sec when accepted draft tokens exceed drafting and verification cost Low to moderate; exact verification should preserve target distribution, but implementation bugs and fallback behavior still need parity tests Draft acceptance rate, target-output parity, rejection-loop latency, VRAM split, fallback monitoring
5 INT4 Weight-Only Quantization — AWQ, GPTQ, EXL2, Marlin-compatible paths About 75% static weight footprint reduction Can improve decode speed when optimized kernels keep dequantization off the critical path High; reasoning, code, math, formatting, tool-use, and rare-token behavior may degrade Item-level regression suite against dense baseline, code/math evals, schema/tool tests, safety checks
6 Structured Sparsity / Pruning — 2:4 sparsity, SlideSparse, Wanda, SparseGPT Potential physical weight and GEMM savings only when sparse layout reaches hardware execution Can accelerate compatible sparse Tensor Core paths; no benefit if expanded back to dense Highest; aggressive pruning can cause cognitive collapse Full behavioral regression, pruning-specific evals, canary rollout, rollback-ready release plan

Strategic Integration and Canon Continuity

Cross-Canon Handoff Map

The optimizations defined in AI-ENG-K directly impact downstream deployment configurations, routing architectures, and validation gates across the engineering canon:

Destination Report Shared Artifact Bundle Primary Operational Touchpoint Downstream Risk Trigger
AI-ENG-G — Model Selection Precision degradation metrics; cost/performance profiles; dense vs. optimized benchmark deltas Feeds model choice decisions when compression changes quality, cost, latency, or deployment feasibility Aggressive quantization makes a previously acceptable model fail target tasks or safety boundaries
AI-ENG-H — Model Adaptation Quantization-aware training requirements; adapter compatibility notes; calibration corpus constraints Determines whether adaptation artifacts can survive quantization, pruning, or runtime-kernel changes LoRA adapters, fine-tuned weights, or preference-tuned models degrade after optimization
AI-ENG-I — Regression Control Quantized checkpoints; model metadata; optimization manifests; parity-evaluation signatures Registers compressed artifacts into release manifests and gates them through replay, shadow, and canary workflows Missing tokenizer pins, kernel mismatches, or behavioral regressions during rollout
AI-ENG-J — Throughput Mechanics KV-cache quantization parameters; weight footprint; decoding latency profile; memory-bandwidth assumptions Configures dynamic memory budgets, batch capacity, HBM pressure limits, and scheduler policies Long-context workloads trigger KV exhaustion, preemption, or degraded ITL
AI-ENG-L — Model Serving Architecture Specialized dequantization kernels; serving-engine compatibility; GPU generation requirements Deploys optimized artifacts to target hardware pools and runtime engines Kernel compilation failure, unsupported hardware path, early dequantization, or runtime crash
AI-ENG-M — Agent Orchestration Speculative decoding behavior; tool-call formatting stability; schema adherence results Determines whether optimized models remain reliable inside multi-step agent workflows Compressed models fail planning, tool routing, or multi-turn state handling
AI-ENG-N — Tool Contracts Grammar schema state; constrained decoding engines; parser behavior; tool-call eval results Enforces structural logit masking and validates tool arguments at the decoding boundary Parser deadlocks, malformed tool calls, correct schema with wrong values
AI-ENG-O — Action Verification Semantic validation requirements; post-constraint verification; tool-result checking Pairs constrained decoding with external correctness checks so valid structure does not masquerade as valid action Syntactically valid outputs contain hallucinated, unsafe, or semantically wrong action parameters
AI-ENG-W — Fallbacks & Resilience Precision fallback tiers; dense baseline routes; speculation-disable thresholds; rollback policies Routes around failed optimization paths by raising precision, disabling speculation, or falling back to dense models Optimization improves speed but crosses a quality or safety boundary under live load
AI-ENG-Z — Telemetry & APM Draft acceptance rate; parser rejection rate; KV escalation rate; quantized artifact latency; cost-per-success Monitors optimized artifacts after deployment and detects runtime degradation Acceptance collapse, parser-retry storms, KV fallback spikes, or hidden quality loss
AI-ENG-AA — Evaluation Architecture Dense baseline comparisons; optimization regression suites; long-context recall tests; item-level deltas Supplies the eval harnesses required to prove compression is behavior-preserving Benchmark averages improve while critical item-level behaviors regress
AI-ENG-AB — Auditability Optimization manifest; calibration dataset hash; kernel version; hardware target; eval traces Preserves reproducibility and explainability for optimized model artifacts No one can reconstruct why a compressed model behaved differently from the dense baseline
AI-ENG-AC — Incident Response Optimization failure modes; rollback triggers; canary failures; hardware incompatibility signals Converts runtime optimization regressions into operational playbooks Latency, quality, or safety incident requires rapid dense-model rollback
AI-ENG-AD — Governance Risk classification; safety evaluations; deployment approvals; data-handling constraints Ensures compression, pruning, and constrained decoding changes respect governance boundaries Unsafe capability shift, degraded refusal behavior, or noncompliant artifact promotion

Doctrinal Principles for Weight Dynamics and Decoding Strategy

This research report establishes four memorably phrased, durable principles to guide systems engineering across weight dynamics and decoding strategy:

  1. Compression is a Behavioral Intervention, Not a Storage Optimization: Model parameters are not arbitrary numbers; they encode fragile behavioral capacities.13 A quantized model that runs twice as fast but silently forgets how to format, reason, or refuse has not been optimized—it has been damaged efficiently.10
  2. Valid Structure is Not Correct Meaning: Enforcing valid formatting through constrained logit masking does not guarantee correct reasoning.2 Syntactically perfect JSON objects can still contain hallucinated values, requiring downstream validation gates.2
  3. The Memory Bottleneck Must Be Resolved at the Register, Not the HBM: Reducing weight footprint on disk is useless if the serving engine upcasts values back to high-precision formats in the GPU’s critical path.1 True acceleration requires dequantization to be integrated directly into hardware execution kernels.1
  4. Verification is the True Speculative Bottleneck: Speculative decoding is only beneficial when the draft engine aligns with the target model’s output distribution.11 If the target model rejects the majority of proposed draft tokens, speculation acts as a computational tax that degrades serving throughput.25

Works cited

  1. The Complete Guide to LLM Quantization with vLLM … - Jarvislabs.ai, accessed June 8, 2026, https://jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks
  2. How Structured Outputs and Constrained Decoding Work Let’s Data Science, accessed June 8, 2026, https://letsdatascience.com/blog/structured-outputs-making-llms-return-reliable-json
  3. Ultimate Guide to Local LLMs in 2026 - Agent Native, accessed June 8, 2026, https://www.agentnative.dev/ultimate-guide-to-local-llms
  4. 𝚂𝚙𝚎𝚌𝙵𝚘𝚛𝚐𝚎: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding - arXiv, accessed June 8, 2026, https://arxiv.org/html/2603.18567v1
  5. TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production - Cloud, accessed June 8, 2026, https://acethecloud.com/blog/tensorrt-llm-vs-vllm-vs-sglang-production-inference/
  6. Accelerating LLM inference with post-training weight and activation using AWQ and GPTQ on Amazon SageMaker AI Artificial Intelligence, accessed June 8, 2026, https://aws.amazon.com/blogs/machine-learning/accelerating-llm-inference-with-post-training-weight-and-activation-using-awq-and-gptq-on-amazon-sagemaker-ai/
  7. Quantization Algorithms - TheStage AI documentation!, accessed June 8, 2026, https://docs.thestage.ai/qlip/docs/source/qlip_algorithms_api.html
  8. Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines, accessed June 8, 2026, https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/
  9. SmoothQuant: Efficient PTQ for Large Models - Emergent Mind, accessed June 8, 2026, https://www.emergentmind.com/topics/smoothquant
  10. KV cache quantization [1, 2, 3, 4] addresses this by storing keys and values in reduced-precision formats—typically INT8, INT4, or lower. The savings are substantial - arXiv, accessed June 8, 2026, https://arxiv.org/html/2605.20868v1
  11. The Evolution of LLM Inference: Decoding algorithms — Part 1 - Towards AI, accessed June 8, 2026, https://pub.towardsai.net/the-evolution-of-llm-inference-decoding-algorithms-part-1-13ba81396cf7
  12. LLM Pruning on GPU Cloud: SparseGPT and Wanda for 50% Model …, accessed June 8, 2026, https://www.spheron.network/blog/llm-pruning-sparsegpt-wanda-gpu-cloud/
  13. A practical guide to INT4 quantization for SLMs: GPTQ vs AWQ, Olive, and real‑world results, accessed June 8, 2026, https://medium.com/data-science-at-microsoft/a-practical-guide-to-int4-quantization-for-slms-gptq-vs-awq-olive-and-real-world-results-2f63d6963d1d
  14. SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity - arXiv, accessed June 8, 2026, https://arxiv.org/html/2603.05232v1
  15. GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models, accessed June 8, 2026, https://www.researchgate.net/publication/399595793_GPU-Accelerated_INT8_Quantization_for_KV_Cache_Compression_in_Large_Language_Models
  16. GPTQ vs AWQ vs NF4: Choosing the Right LLM Quantization Pipeline - Abstract Algorithms, accessed June 8, 2026, https://www.abstractalgorithms.dev/gptq-vs-awq-vs-nf4-choosing-the-right-llm-quantization-pipeline
  17. 5 Essential LLM Quantization Techniques Explained - ApX Machine Learning, accessed June 8, 2026, https://apxml.com/posts/llm-quantization-techniques-explained
  18. aiha-lab/qllm-infer: Quantization Framework for LLM Inferences - GitHub, accessed June 8, 2026, https://github.com/aiha-lab/qllm-infer
  19. Using FP8 with Transformer Engine - NVIDIA Documentation Hub, accessed June 8, 2026, https://docs.nvidia.com/deeplearning/transformer-engine-releases/release-2.4/user-guide/examples/fp8_primer.html
  20. What is FP8 Quantization? AI Inference Performance, Accuracy, and Hardware Support Explained (2026) Spheron Blog, accessed June 8, 2026, https://www.spheron.network/blog/fp8-quantization-inference-performance-hardware-explained/
  21. How Quantization Reduces LLM Latency - Latitude.so, accessed June 8, 2026, https://latitude.so/blog/how-quantization-reduces-llm-latency
  22. Advanced Quantization Techniques: The Ultimate Guide to Unlocking LLM Potential, accessed June 8, 2026, https://ai.plainenglish.io/advanced-quantization-techniques-the-ultimate-guide-to-unlocking-llm-potential-3590cc8638d2
  23. CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs - NIPS, accessed June 8, 2026, https://proceedings.neurips.cc/paper_files/paper/2025/file/5e27f5d8ecd30ac6235892ef97676e78-Paper-Conference.pdf
  24. Cross-Attention Speculative Decoding - arXiv, accessed June 8, 2026, https://arxiv.org/html/2505.24544v3
  25. HiSpec: Hierarchical Speculative Decoding for LLMs - arXiv, accessed June 8, 2026, https://arxiv.org/html/2510.01336v2
  26. XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs - arXiv, accessed June 8, 2026, https://arxiv.org/html/2601.04426v2
  27. Structured Decoding in vLLM: a gentle introduction, accessed June 8, 2026, https://vllm.ai/blog/2025-01-14-struct-decode-intro
  28. LLM Settings & Parameters Guide PromptOps Ecosystem, accessed June 8, 2026, https://justprompting.in/settings
  29. LLM Temperature, Top-P, and Top-K Explained — With Python Simulations, accessed June 8, 2026, https://machinelearningplus.com/gen-ai/llm-temperature-top-p-top-k-explained/
  30. An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats - arXiv, accessed June 8, 2026, https://arxiv.org/html/2606.09686v1
  31. vuiseng9/fp4-training: mxfp8/nvfp4 training - from concept to implementation (cuBLASLt + Microxcaling). - GitHub, accessed June 8, 2026, https://github.com/vuiseng9/fp4-training
  32. GitHub - turboderp-org/exllamav2: A fast inference library for running LLMs locally on modern consumer-class GPUs, accessed June 8, 2026, https://github.com/turboderp-org/exllamav2
  33. FP8 W8A8 - vLLM, accessed June 8, 2026, https://docs.vllm.ai/en/stable/features/quantization/fp8/
  34. Google TurboQuant: 6x KV Cache Compression for LLM Inference Spheron Blog, accessed June 8, 2026, https://www.spheron.network/blog/google-turboquant-llm-compression-gpu-cloud/
  35. Runtime-Certified Bounded-Error Quantized Attention - arXiv, accessed June 8, 2026, https://arxiv.org/pdf/2605.20868
  36. KV Cache Offloading for Context-Intensive Tasks - arXiv, accessed June 8, 2026, https://arxiv.org/html/2604.08426v1
  37. NVIDIA Transformer Engine: FP8 Mixed Precision on H100 and H200 (Setup, Code, Benchmarks 2026) Spheron Blog, accessed June 8, 2026, https://www.spheron.network/blog/nvidia-transformer-engine-h100-h200-fp8/
  38. Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs - arXiv, accessed June 8, 2026, https://arxiv.org/html/2407.01082v8
  39. XGrammar-2: Fast and Customizable Structured Generation … - MLC, accessed June 8, 2026, https://blog.mlc.ai/2026/05/04/xgrammar-2-fast-customizable-structured-generation
  40. Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip, accessed June 8, 2026, https://tianpan.co/blog/2026-04-16-grammar-constrained-generation-output-reliability
  41. P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM - AWS, accessed June 8, 2026, https://aws.amazon.com/blogs/machine-learning/p-eagle-faster-llm-inference-with-parallel-speculative-decoding-in-vllm/
  42. Speculative Decoding: Achieving 2-3x LLM Inference Speedup Introl Blog, accessed June 8, 2026, https://introl.com/blog/speculative-decoding-llm-inference-speedup-guide-2025
  43. Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding - arXiv, accessed June 8, 2026, https://arxiv.org/html/2606.01019v1
  44. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models, accessed June 8, 2026, https://www.research-collection.ethz.ch/bitstreams/b4908ef6-b203-4218-8bb5-e46d7d4f0dca/download
  45. vLLM on Jetson Orin — pre-built wheel with Marlin GPTQ support (3.8x prefill speedup) : r/LocalLLaMA - Reddit, accessed June 8, 2026, https://www.reddit.com/r/LocalLLaMA/comments/1rtswjx/vllm_on_jetson_orin_prebuilt_wheel_with_marlin/
  46. SGLang Production Deployment Guide: RadixAttention and Multi-Turn Inference on GPU Cloud (2026) Spheron Blog, accessed June 8, 2026, https://www.spheron.network/blog/sglang-production-deployment-guide/
  47. Inside SGLang: Anatomy of a High-Performance Structured LLM Inference System, accessed June 8, 2026, https://blog.sugiv.fyi/inside-sglang-anatomy-high-performance-structured-llm-inference-system
  48. SGLang vs vLLM in 2026: Benchmarks, Architecture, and When to Use Each, accessed June 8, 2026, https://particula.tech/blog/sglang-vs-vllm-inference-engine-comparison
  49. SGLang: The Complete Guide to High-Performance LLM Inference, accessed June 8, 2026, https://inference.net/content/sglang-complete-guide/
  50. An Empirical Study of Microscaling Formats for Low-Precision LLM Training - ARITH 2025, accessed June 8, 2026, https://arith2025.org/proceedings/215900a001.pdf
  51. How low-bit inference enables efficient AI - Dropbox Tech Blog, accessed June 8, 2026, https://dropbox.tech/machine-learning/how-low-bit-inference-enables-efficient-ai
  52. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models - arXiv, accessed June 8, 2026, https://arxiv.org/html/2408.11743v1
  53. Guide to choosing quants and engines : r/LocalLLaMA - Reddit, accessed June 8, 2026, https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/guide_to_choosing_quants_and_engines/
  54. XGrammar 2: High-Performance Grammar Systems - Emergent Mind, accessed June 8, 2026, https://www.emergentmind.com/topics/xgrammar-2
  55. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT NVIDIA Technical Blog, accessed June 8, 2026, https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/
  56. bcacdwk/vllmbench: End to end benchmark using Slidesparse GEMM kernels - GitHub, accessed June 8, 2026, https://github.com/bcacdwk/vllmbench
  57. Speculative Decoding Production Guide: 2-5x Faster LLM Inference on GPU Cloud, accessed June 8, 2026, https://www.spheron.network/blog/speculative-decoding-production-guide/
  58. SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification - arXiv, accessed June 8, 2026, https://arxiv.org/html/2512.02337v1
  59. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test - arXiv, accessed June 8, 2026, https://arxiv.org/html/2503.01840v1
  60. [width=0.06]./figs/logo EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty - arXiv, accessed June 8, 2026, https://arxiv.org/pdf/2401.15077
  61. Performance improvements with speculative decoding in vLLM for gpt-oss, accessed June 8, 2026, https://developers.redhat.com/articles/2026/04/16/performance-improvements-speculative-decoding-vllm-gpt-oss
  62. Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs - arXiv, accessed June 8, 2026, https://arxiv.org/pdf/2407.1082
  63. How to Deploy an LLM On-Premise (2026): GPU Sizing & vLLM - Iternal Technologies, accessed June 8, 2026, https://iternal.ai/how-to-deploy-llm-on-premise
  64. SGLang in Production: A Developer’s Guide to Structured Generation, RadixAttention, and Multi-Step LLM Pipelines - Runpod, accessed June 8, 2026, https://www.runpod.io/articles/guides/blog-sglang-production-llm-pipelines

← Back to Canon Map