vllm.config.attention ¶
AttentionConfig ¶
Configuration for attention mechanisms in vLLM.
Source code in vllm/config/attention.py
backend class-attribute instance-attribute ¶
backend: AttentionBackendEnum | None = None
Attention backend to use. Use "auto" or None for automatic selection.
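A minimal usage sketch, assuming AttentionConfig can be constructed directly with keyword arguments and that string values are parsed by the validate_backend_before validator described below (FLASH_ATTN as a member name is also an assumption):

```python
from vllm.config.attention import AttentionConfig

# "auto" (or leaving the field unset) requests automatic backend selection;
# a concrete name such as "FLASH_ATTN" is parsed into AttentionBackendEnum.
auto_cfg = AttentionConfig(backend="auto")          # stored as None -> auto-select
forced_cfg = AttentionConfig(backend="FLASH_ATTN")  # pin a specific backend
```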
disable_flashinfer_prefill class-attribute instance-attribute ¶
disable_flashinfer_prefill: bool = True
Whether to disable flashinfer prefill.
disable_flashinfer_q_quantization class-attribute instance-attribute ¶
disable_flashinfer_q_quantization: bool = False
If set, when using fp8 kv, do not quantize Q to fp8.
flash_attn_max_num_splits_for_cuda_graph class-attribute instance-attribute ¶
flash_attn_max_num_splits_for_cuda_graph: int = 32
Maximum number of Flash Attention splits for CUDA graph decode.
flash_attn_version class-attribute instance-attribute ¶
flash_attn_version: Literal[2, 3, 4] | None = None
Force vLLM to use a specific flash-attention version (2, 3, or 4). Only valid when using the flash-attention backend.
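A hedged sketch of pinning the flash-attention version; this only takes effect when the flash-attention backend is selected (the backend name used here is an assumption):

```python
from vllm.config.attention import AttentionConfig

# Force FlashAttention v3; ignored unless the flash-attention backend is in use.
cfg = AttentionConfig(backend="FLASH_ATTN", flash_attn_version=3)
```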
mla_prefill_backend class-attribute instance-attribute ¶
mla_prefill_backend: MLAPrefillBackendEnum | None = None
MLA prefill backend to use. If None, will be selected automatically. Valid options: FLASH_ATTN, FLASHINFER, CUDNN, TRTLLM_RAGGED. This option supersedes use_cudnn_prefill, use_trtllm_ragged_deepseek_prefill, and disable_flashinfer_prefill which are deprecated.
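A sketch of the preferred way to pick the MLA prefill backend, assuming the string form is accepted by validate_mla_prefill_backend_before (see below), rather than toggling the deprecated boolean flags:

```python
from vllm.config.attention import AttentionConfig

# Preferred over the deprecated use_cudnn_prefill,
# use_trtllm_ragged_deepseek_prefill and disable_flashinfer_prefill flags.
cfg = AttentionConfig(mla_prefill_backend="FLASHINFER")
```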
use_cudnn_prefill class-attribute instance-attribute ¶
use_cudnn_prefill: bool = False
Whether to use cudnn prefill.
use_prefill_decode_attention class-attribute instance-attribute ¶
use_prefill_decode_attention: bool = False
Use separate prefill and decode kernels for attention instead of the unified triton kernel.
use_prefill_query_quantization class-attribute instance-attribute ¶
use_prefill_query_quantization: bool = False
If set, quantize query for attention in prefill.
use_trtllm_attention class-attribute instance-attribute ¶
use_trtllm_attention: bool | None = None
If set to True or False, force enabling or disabling the TRTLLM attention backend in flashinfer. If None, auto-detect which attention backend to use in flashinfer.
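Because this is a tri-state flag, the three cases can be spelled out explicitly (a sketch; keyword construction is assumed):

```python
from vllm.config.attention import AttentionConfig

force_on = AttentionConfig(use_trtllm_attention=True)    # always use TRTLLM in flashinfer
force_off = AttentionConfig(use_trtllm_attention=False)  # never use TRTLLM in flashinfer
auto = AttentionConfig(use_trtllm_attention=None)        # default: auto-detect
```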
use_trtllm_ragged_deepseek_prefill class-attribute instance-attribute ¶
use_trtllm_ragged_deepseek_prefill: bool = True
Whether to use TRTLLM ragged deepseek prefill.
_migrate_deprecated_mla_prefill_flags ¶
Migrate deprecated MLA prefill flags to mla_prefill_backend.
Source code in vllm/config/attention.py
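A rough sketch of the migration this method performs, based only on the flag-to-backend mapping implied by the mla_prefill_backend docstring above; the actual precedence rules in vllm/config/attention.py may differ:

```python
def migrated_mla_prefill_backend(cfg) -> str | None:
    """Hypothetical helper mirroring the documented flag -> backend mapping."""
    if cfg.mla_prefill_backend is not None:
        return cfg.mla_prefill_backend        # explicit setting wins
    if cfg.use_cudnn_prefill:
        return "CUDNN"
    if cfg.use_trtllm_ragged_deepseek_prefill:
        return "TRTLLM_RAGGED"
    if cfg.disable_flashinfer_prefill:
        return "FLASH_ATTN"                   # assumed fallback with flashinfer prefill disabled
    return None                               # None keeps automatic selection
```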
compute_hash ¶
compute_hash() -> str
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
Source code in vllm/config/attention.py
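vLLM config classes typically build this hash by collecting the graph-affecting fields into a list of factors and digesting it; a simplified sketch (the field selection here is illustrative, not the actual implementation):

```python
import hashlib

def compute_hash_sketch(cfg) -> str:
    # Only fields that can change the compiled computation graph contribute;
    # purely runtime knobs are excluded.
    factors = [cfg.backend, cfg.flash_attn_version, cfg.mla_prefill_backend]
    return hashlib.sha256(str(factors).encode()).hexdigest()
```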
validate_backend_before classmethod ¶
Enable parsing of the backend enum type from string.
The special value "auto" is treated as None, which triggers automatic backend selection.
Source code in vllm/config/attention.py
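A minimal sketch of how such a before-validator typically behaves, assuming a pydantic-style validator and name-based enum lookup (the import path for AttentionBackendEnum is an assumption):

```python
from vllm.attention.backends.registry import AttentionBackendEnum  # import path assumed

def validate_backend_before_sketch(value):
    """Hypothetical stand-in: "auto"/None -> None, strings -> enum members."""
    if value is None or (isinstance(value, str) and value.lower() == "auto"):
        return None  # None triggers automatic backend selection
    if isinstance(value, str):
        return AttentionBackendEnum[value.upper()]  # e.g. "flash_attn" -> FLASH_ATTN
    return value  # already an enum member
```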
validate_mla_prefill_backend_before classmethod ¶
Enable parsing of the mla_prefill_backend enum type from string.