vllm.model_executor.models.config ¶
Gemma4Config ¶
Bases: VerifyAndUpdateConfig
Source code in vllm/model_executor/models/config.py
verify_and_update_config staticmethod ¶
verify_and_update_config(vllm_config: VllmConfig) -> None
Force a unified attention backend for models with heterogeneous head dimensions.
Some Gemma4 variants use different head dimensions for sliding window (head_dim) vs full attention (global_head_dim) layers. When global_head_dim > 256, FlashAttention rejects those layers (head_size <= 256 kernel limit), causing vLLM to select a different backend for each layer type. This mixed-backend execution produces numerical divergence and output corruption.
The fix detects heterogeneous head dimensions from the model config and forces TRITON_ATTN (which has no head_size ceiling) for all layers when the user hasn't explicitly chosen a backend.
TODO: Heterogeneous head_sizes (head_dim != global_head_dim) require NixlConnector changes to support per-layer KV transfer with different head dimensions for prefill-decode disaggregation.
Source code in vllm/model_executor/models/config.py
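The detection-and-override logic described above can be sketched as follows. The `head_dim`/`global_head_dim` attribute names and the 256 FlashAttention limit come from the docstring; the `pick_backend` function, its signature, and the `SimpleNamespace` stand-in for the HF config are illustrative assumptions, not the real vLLM API, which mutates `VllmConfig` in place.

```python
from types import SimpleNamespace

# FlashAttention's head_size kernel limit, per the docstring above.
FLASH_ATTN_MAX_HEAD_SIZE = 256

def pick_backend(hf_config, user_backend=None):
    """Return a forced backend name when head dimensions are heterogeneous.

    Hedged sketch only: the real logic lives in
    vllm/model_executor/models/config.py.
    """
    head_dim = getattr(hf_config, "head_dim", None)
    global_head_dim = getattr(hf_config, "global_head_dim", head_dim)
    if user_backend is not None:
        # Respect an explicitly chosen backend.
        return user_backend
    if global_head_dim != head_dim and global_head_dim > FLASH_ATTN_MAX_HEAD_SIZE:
        # Mixed head sizes would otherwise split layers across backends;
        # TRITON_ATTN has no head_size ceiling, so force it for all layers.
        return "TRITON_ATTN"
    return None  # leave selection to the usual backend heuristics

# Example: sliding-window layers use 128, full-attention layers use 288.
heterogeneous = SimpleNamespace(head_dim=128, global_head_dim=288)
homogeneous = SimpleNamespace(head_dim=128, global_head_dim=128)
```

With the heterogeneous config, `pick_backend` returns `"TRITON_ATTN"`; with the homogeneous one it returns `None`, and an explicit `user_backend` is always passed through unchanged.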
HybridAttentionMambaModelConfig ¶
Bases: VerifyAndUpdateConfig
Source code in vllm/model_executor/models/config.py
verify_and_update_config classmethod ¶
verify_and_update_config(vllm_config: VllmConfig) -> None
Perform early validation and setup for hybrid attention/mamba models.
Block size alignment with mamba page sizes is handled later by Platform.update_block_size_for_backend(), which runs after model layers are constructed and the attention backend is known.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vllm_config` | `VllmConfig` | vLLM Config | required |
Source code in vllm/model_executor/models/config.py
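The deferred alignment that `Platform.update_block_size_for_backend()` performs amounts to choosing an attention block size whose page size is compatible with the mamba state page size. A minimal sketch of that kind of computation, assuming per-token attention page bytes and a fixed mamba page size (the function name and signature are illustrative, not the real vLLM API):

```python
import math

def align_block_size(attn_bytes_per_token: int, mamba_page_bytes: int) -> int:
    """Smallest attention block size (in tokens) whose page size is a
    multiple of the mamba state page size.

    Illustrative sketch only; the real logic in
    Platform.update_block_size_for_backend() also accounts for
    backend-specific block-size constraints.
    """
    # Smallest T with T * attn_bytes_per_token divisible by mamba_page_bytes.
    lcm = math.lcm(attn_bytes_per_token, mamba_page_bytes)
    return lcm // attn_bytes_per_token
```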
LlamaNemotronVLConfig ¶
Bases: VerifyAndUpdateConfig
Config handler for LlamaNemotronVL embedding models.
Source code in vllm/model_executor/models/config.py
MambaModelConfig ¶
Bases: VerifyAndUpdateConfig
Source code in vllm/model_executor/models/config.py
verify_and_update_config classmethod ¶
verify_and_update_config(vllm_config: VllmConfig) -> None
Enable FULL_AND_PIECEWISE cuda graph mode by default (required to get good performance for mamba layers in V1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `vllm_config` | `VllmConfig` | vLLM Config | required |
Source code in vllm/model_executor/models/config.py
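The default-setting behavior can be sketched as follows. The attribute name `cudagraph_mode` and the `SimpleNamespace` stand-in are assumptions for illustration; the real field lives on vLLM's compilation config.

```python
from types import SimpleNamespace

def apply_mamba_cudagraph_default(compilation_config) -> None:
    """Default to FULL_AND_PIECEWISE cudagraph mode unless already set.

    Sketch only: per the docstring above, this mode is required for good
    mamba-layer performance in V1.
    """
    if getattr(compilation_config, "cudagraph_mode", None) is None:
        compilation_config.cudagraph_mode = "FULL_AND_PIECEWISE"

# Unset config gets the default; an explicit choice is left alone.
default_cfg = SimpleNamespace(cudagraph_mode=None)
apply_mamba_cudagraph_default(default_cfg)
user_cfg = SimpleNamespace(cudagraph_mode="PIECEWISE")
apply_mamba_cudagraph_default(user_cfg)
```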
NemotronHForCausalLMConfig ¶
Bases: VerifyAndUpdateConfig
Source code in vllm/model_executor/models/config.py
verify_and_update_config staticmethod ¶
verify_and_update_config(vllm_config: VllmConfig) -> None
When mamba_ssm_cache_dtype is 'auto' (or not explicitly set), update it for NemotronH models to the value specified in the HF config, falling back to float16 if the HF config does not specify one.
Source code in vllm/model_executor/models/config.py
Qwen3_5ForConditionalGenerationConfig ¶
Bases: VerifyAndUpdateConfig
Source code in vllm/model_executor/models/config.py
verify_and_update_config staticmethod ¶
verify_and_update_config(vllm_config: VllmConfig) -> None
When mamba_ssm_cache_dtype is 'auto' (or not explicitly set), update it for Qwen3.5 models to the value of the HF config's mamba_ssm_dtype field. Warn if the user explicitly overrides it to a different value.
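The resolve-or-warn behavior can be sketched as follows. The `mamba_ssm_dtype` field name comes from the docstring; the function name, its signature, and the use of `warnings.warn` are assumptions for illustration (vLLM logs through its own logger):

```python
import warnings

def resolve_qwen35_cache_dtype(cache_dtype: str, hf_mamba_ssm_dtype: str) -> str:
    """Resolve 'auto' to the HF config's mamba_ssm_dtype, warning when the
    user explicitly overrides it with a different value.

    Hedged sketch of Qwen3_5ForConditionalGenerationConfig's behavior.
    """
    if cache_dtype == "auto":
        return hf_mamba_ssm_dtype
    if cache_dtype != hf_mamba_ssm_dtype:
        # Explicit override still wins, but flag the mismatch.
        warnings.warn(
            f"mamba_ssm_cache_dtype={cache_dtype!r} differs from the HF "
            f"config's mamba_ssm_dtype={hf_mamba_ssm_dtype!r}"
        )
    return cache_dtype
```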