vllm.v1.attention.ops.triton_unified_attention ¶
_get_tile_size ¶
Select tile size with Gemma3-specific optimization.
For Gemma3, use 32 for both prefill and decode to better utilize the larger head dimension (128/256). For other models, use the default vLLM behavior.
Source code in vllm/v1/attention/ops/triton_unified_attention.py
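The policy above can be sketched in plain Python. This is a hypothetical illustration, not the actual vLLM source: the default tile sizes (64 for prefill, 16 for decode) are placeholder assumptions, and the Gemma3 check is passed in as a flag.

```python
def get_tile_size(is_gemma3: bool, is_prefill: bool) -> int:
    """Sketch of the tile-size selection described above (assumed defaults)."""
    if is_gemma3:
        # Gemma3: 32 for both prefill and decode, to better utilize
        # the larger head dimension (128/256).
        return 32
    # Default vLLM behavior; these values are placeholders for illustration.
    return 64 if is_prefill else 16
```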
_is_gemma3_attention ¶
Detect Gemma3 models via unique (head_size, sliding_window) signature.
Gemma3 models are the only ones using sliding_window=1024 with head_size 128 (27B) or 256 (1B, 4B, 12B). Other SWA models use different window sizes (Mistral=4096, Phi-3=2047).
Source code in vllm/v1/attention/ops/triton_unified_attention.py
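The signature check described above reduces to a simple predicate; a minimal sketch, with the model/window pairings taken from the docstring:

```python
def is_gemma3_attention(head_size: int, sliding_window: int) -> bool:
    """Detect Gemma3 via its unique (head_size, sliding_window) signature."""
    # Gemma3 is the only family combining sliding_window=1024 with
    # head_size 128 (27B) or 256 (1B, 4B, 12B); other SWA models use
    # different windows (Mistral=4096, Phi-3=2047).
    return sliding_window == 1024 and head_size in (128, 256)
```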
_prepare_kv_tile ¶
```python
_prepare_kv_tile(
    data,
    Q,
    tensor_scale,
    scale_cache_ptr,
    physical_block_idx,
    seq_offset,
    kv_head_idx,
    stride_s_blk,
    stride_s_slot,
    stride_s_head,
    tile_mask,
    BLOCK_SIZE: constexpr,
    KV_QUANT_MODE: constexpr,
)
```
Prepare a loaded KV tile for attention computation.
Casts the raw KV data to Q's dtype and loads per-token-head scales when applicable:
- `KV_QUANT_MODE == 0`: cast only (no-op for bf16/fp16).
- `KV_QUANT_MODE == 1` (FP8 per-tensor): dequantize inline using the tensor-wide scale.
- `KV_QUANT_MODE >= 2` (per-token-head int8/fp8): cast to Q's dtype and return per-head scales separately; the caller applies them after the dot product for better numerical efficiency.
Returns (data, token_head_scales). token_head_scales is only meaningful when KV_QUANT_MODE >= 2; callers gate its use on the same constexpr so the compiler eliminates dead code.
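The mode dispatch can be sketched in plain NumPy. The real kernel branches on a Triton `constexpr` so dead paths are compiled out; here ordinary `if` statements stand in, and the function name, argument list, and dtype handling are illustrative assumptions rather than the kernel's actual code.

```python
import numpy as np

def prepare_kv_tile(data, q_dtype, tensor_scale, token_head_scales, kv_quant_mode):
    """Sketch of the (data, token_head_scales) contract described above."""
    if kv_quant_mode == 0:
        # Cast only: a no-op when the cache already holds bf16/fp16.
        return data.astype(q_dtype), None
    if kv_quant_mode == 1:
        # FP8 per-tensor: dequantize inline with the tensor-wide scale.
        return data.astype(q_dtype) * tensor_scale, None
    # Per-token-head int8/fp8: cast now and hand the per-head scales back
    # so the caller can apply them after the Q.K dot product.
    return data.astype(q_dtype), token_head_scales
```

Callers gate their use of the second return value on the same mode constant, mirroring how the kernel lets the compiler eliminate the unused path.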