vllm.v1.attention.backends.mla.prefill.base ¶
Abstract base classes for MLA prefill backends.
This module defines the interface for MLA prefill backends, enabling priority-based selection similar to how MLA decode backends work.
MLAPrefillBackend ¶
Bases: ABC
Abstract base class for MLA prefill backends.
Each prefill backend declares its capabilities (supported dtypes, compute capabilities, etc.) and provides a factory method for creating the implementation class.
Source code in vllm/v1/attention/backends/mla/prefill/base.py
create_builder_state classmethod ¶
create_builder_state(
vllm_config: VllmConfig,
kv_cache_spec: AttentionSpec,
layer_names: list[str],
device: device,
) -> MLAPrefillBuilderState
Create backend-specific state for the metadata builder.
This is called once when the metadata builder is initialized. Override to allocate workspaces, create wrappers, etc.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vllm_config | VllmConfig | The vLLM configuration. | required |
| kv_cache_spec | AttentionSpec | The attention specification. | required |
| layer_names | list[str] | Names of attention layers. | required |
| device | device | The device to allocate tensors on. | required |
Returns:
| Type | Description |
|---|---|
| MLAPrefillBuilderState | A state object containing backend-specific resources. |
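As a sketch of the intended pattern (with a simplified, hypothetical signature; the real method receives vllm_config, kv_cache_spec, layer_names, and device), a backend might allocate a persistent workspace once and return it in a builder-state subclass:

```python
from dataclasses import dataclass, field


@dataclass
class StubBuilderState:
    # Hypothetical stand-in for MLAPrefillBuilderState: holds resources
    # that persist across metadata build calls.
    workspace: list[float] = field(default_factory=list)


class StubPrefillBackend:
    # Hypothetical backend; a real backend would allocate torch tensors
    # on the given device rather than a plain Python list.
    @classmethod
    def create_builder_state(cls, workspace_elems: int) -> StubBuilderState:
        # Called once at metadata-builder initialization; the returned
        # state is reused by every subsequent build call.
        return StubBuilderState(workspace=[0.0] * workspace_elems)
```

The key design point is that allocation happens once at builder initialization, not per build call.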
finalize_attention_metadata classmethod ¶
finalize_attention_metadata(
attn_metadata: Any,
builder_state: MLAPrefillBuilderState,
num_prefills: int,
num_heads: int,
kv_cache_spec: AttentionSpec,
mla_dims: Any,
model_config: Any,
) -> None
Finalize the attention metadata after all components are built.
This is called after the full attention metadata is constructed. Use this for any final processing (e.g., building FlashInfer wrappers).
get_chunked_context_metadata_cls staticmethod ¶
get_chunked_context_metadata_cls() -> type
Return the ChunkedContextMetadata class for this backend.
Override if the backend needs a specialized ChunkedContextMetadata.
get_prefill_impl_cls abstractmethod staticmethod ¶
get_prefill_impl_cls() -> type[MLAPrefillImpl]
Return the implementation class for this prefill backend.
get_prefill_metadata_cls staticmethod ¶
get_prefill_metadata_cls() -> type[
MLACommonPrefillMetadata
]
Return the metadata class for this prefill backend.
Override this method if the backend requires a specialized metadata class (e.g., FlashInferPrefillMetadata).
is_available classmethod ¶
is_available() -> bool
Check if this backend's dependencies are available.
Override this method to check for required libraries/imports.
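A minimal sketch of such an override, assuming a hypothetical required_module attribute (the dependency name is illustrative):

```python
import importlib.util


class StubPrefillBackend:
    # Hypothetical optional dependency this backend needs.
    required_module = "flashinfer"

    @classmethod
    def is_available(cls) -> bool:
        # Probe for the module without importing it
        # (cheap and free of import side effects).
        return importlib.util.find_spec(cls.required_module) is not None
```

Using find_spec rather than a try/except import keeps availability checks fast even when the dependency is heavy.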
post_process_prefill_metadata classmethod ¶
post_process_prefill_metadata(
prefill_metadata: MLACommonPrefillMetadata,
builder_state: MLAPrefillBuilderState,
prefill_query_start_loc: Tensor,
) -> None
Post-process the prefill metadata after creation.
This is called after the prefill metadata is created but before it's attached to the attention metadata. Use this to set backend-specific fields on the metadata.
supports_compute_capability classmethod ¶
supports_compute_capability(
device_capability: DeviceCapability,
) -> bool
Check if this backend supports the given compute capability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device_capability | DeviceCapability | The device's compute capability. | required |
Override this method if the backend has specific hardware requirements.
supports_dtype classmethod ¶
Check if this backend supports the given dtype.
validate_configuration classmethod ¶
validate_configuration(
device_capability: DeviceCapability,
selector_config: MLAPrefillSelectorConfig,
) -> list[str]
Validate if this backend can be used with the given configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device_capability | DeviceCapability | The device's compute capability. | required |
| selector_config | MLAPrefillSelectorConfig | Hashable configuration for backend selection. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | A list of reasons the configuration is invalid; empty if the configuration is valid. |
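The reasons-list contract can be sketched as follows (the capability threshold, dtype names, and simplified signature are hypothetical, not vLLM's actual values):

```python
class StubPrefillBackend:
    # Hypothetical capability declarations for this backend.
    supported_dtypes = {"bfloat16", "float16"}
    min_capability = (8, 0)  # e.g. Ampere and newer

    @classmethod
    def validate_configuration(
        cls, device_capability: tuple[int, int], dtype: str
    ) -> list[str]:
        reasons: list[str] = []
        if device_capability < cls.min_capability:
            reasons.append(
                f"requires compute capability >= {cls.min_capability}, "
                f"got {device_capability}"
            )
        if dtype not in cls.supported_dtypes:
            reasons.append(f"unsupported dtype: {dtype}")
        # An empty list means the backend is usable; non-empty reasons
        # let the selector explain why a backend was skipped.
        return reasons
```

Accumulating every failure (rather than returning on the first) gives the selector a complete diagnostic when no backend qualifies.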
MLAPrefillBuilderState dataclass ¶
State created by a prefill backend for use during metadata building.
This class holds backend-specific resources (workspaces, wrappers, etc.) that persist across metadata build calls. Backends can subclass this to add their own state.
MLAPrefillImpl ¶
Bases: ABC
Abstract base class for MLA prefill implementations.
Each implementation provides the actual prefill attention computation for new tokens (causal) and context chunks (non-causal).
__init__ ¶
__init__(
num_heads: int,
scale: float,
kv_lora_rank: int,
qk_nope_head_dim: int,
qk_rope_head_dim: int,
v_head_dim: int,
vllm_config: VllmConfig,
device: device,
) -> None
Initialize the prefill implementation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_heads | int | Number of attention heads. | required |
| scale | float | Softmax scale factor. | required |
| kv_lora_rank | int | Latent dimension for KV. | required |
| qk_nope_head_dim | int | QK head dimension without RoPE. | required |
| qk_rope_head_dim | int | QK head dimension with RoPE. | required |
| v_head_dim | int | Value head dimension. | required |
| vllm_config | VllmConfig | vLLM configuration. | required |
| device | device | Device to use for computation. | required |
run_prefill_context_chunk abstractmethod ¶
run_prefill_context_chunk(
prefill_metadata: MLACommonPrefillMetadata,
chunk_idx: int,
q: Tensor,
k: Tensor,
v: Tensor,
) -> tuple[Tensor, Tensor]
Run prefill attention for context chunks (non-causal).
This is used for chunked prefill where we process cached context in chunks to manage memory usage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prefill_metadata | MLACommonPrefillMetadata | Metadata for the prefill operation. | required |
| chunk_idx | int | Index of the current context chunk. | required |
| q | Tensor | Query tensor of shape [num_tokens, num_heads, qk_head_dim]. | required |
| k | Tensor | Key tensor of shape [chunk_tokens, num_heads, qk_head_dim]. | required |
| v | Tensor | Value tensor of shape [chunk_tokens, num_heads, v_head_dim]. | required |
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | A tuple (output, lse), where output has shape [num_tokens, num_heads, v_head_dim] and lse has shape [num_heads, num_tokens]. |
run_prefill_new_tokens abstractmethod ¶
run_prefill_new_tokens(
prefill_metadata: MLACommonPrefillMetadata,
q: Tensor,
k: Tensor,
v: Tensor,
return_softmax_lse: bool,
) -> Tensor | tuple[Tensor, Tensor]
Run prefill attention for new tokens (causal).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prefill_metadata | MLACommonPrefillMetadata | Metadata for the prefill operation. | required |
| q | Tensor | Query tensor of shape [num_tokens, num_heads, qk_head_dim]. | required |
| k | Tensor | Key tensor of shape [num_tokens, num_heads, qk_head_dim]. | required |
| v | Tensor | Value tensor of shape [num_tokens, num_heads, v_head_dim]. | required |
| return_softmax_lse | bool | Whether to return log-sum-exp values. | required |
return_softmax_lse | bool | Whether to return log-sum-exp values. | required |
Returns:
| Type | Description |
|---|---|
| Tensor \| tuple[Tensor, Tensor] | If return_softmax_lse is False: output tensor of shape [num_tokens, num_heads, v_head_dim]. If True: a tuple (output, lse) where lse has shape [num_heads, num_tokens]. |
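To make the shape contract concrete, here is a naive pure-Python reference for the causal case (illustrative only: plain nested lists stand in for tensors, and real implementations use fused kernels rather than this O(n^2) loop):

```python
import math


def naive_causal_prefill(q, k, v, scale):
    """Naive single-sequence causal attention over nested Python lists.

    q, k: [num_tokens][num_heads][qk_head_dim]
    v:    [num_tokens][num_heads][v_head_dim]
    Returns (output, lse):
      output: [num_tokens][num_heads][v_head_dim]
      lse:    [num_heads][num_tokens] (log-sum-exp of the scaled scores)
    """
    num_tokens = len(q)
    num_heads = len(q[0])
    v_dim = len(v[0][0])
    output = [[[0.0] * v_dim for _ in range(num_heads)]
              for _ in range(num_tokens)]
    lse = [[0.0] * num_tokens for _ in range(num_heads)]
    for h in range(num_heads):
        for i in range(num_tokens):
            # Causal mask: token i attends only to keys 0..i.
            scores = [scale * sum(a * b for a, b in zip(q[i][h], k[j][h]))
                      for j in range(i + 1)]
            m = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            denom = sum(exps)
            lse[h][i] = m + math.log(denom)
            for j, e in enumerate(exps):
                w = e / denom
                for d in range(v_dim):
                    output[i][h][d] += w * v[j][h][d]
    return output, lse
```

The lse values are what allow chunked outputs (from run_prefill_context_chunk) to be merged with the new-token output in a numerically exact way.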