vllm.reasoning.gemma4_utils ¶
Gemma4 thinking/reasoning output parsing utilities for offline inference.
Standalone functions that parse decoded model text to extract structured thinking content from Gemma4 models. These are pure-Python utilities with zero heavy dependencies — they work on raw decoded strings from any inference backend (vLLM, HuggingFace, TGI, etc.).
For the OpenAI-compatible API reasoning parser (streaming + non-streaming), see vllm.reasoning.gemma4_reasoning_parser. For tool call parsing, see vllm.tool_parsers.gemma4_utils.
Usage with vLLM offline inference::

    from vllm import LLM, SamplingParams
    from vllm.reasoning.gemma4_utils import parse_thinking_output

    llm = LLM(model="google/gemma-4-it")
    outputs = llm.generate(prompt, SamplingParams(...))
    text = tokenizer.decode(outputs[0].outputs[0].token_ids, skip_special_tokens=False)

    # Extract thinking / answer (works with or without enable_thinking)
    result = parse_thinking_output(text)
    print(result["thinking"])  # chain-of-thought or None
    print(result["answer"])    # final answer
Ported from transformers.models.gemma4.utils_gemma4 so that vLLM users do not need a transformers dependency for output parsing.
_clean_answer ¶
Clean trailing sentinel tokens from the answer text.
Strips `<turn|>`, `<eos>`, and surrounding whitespace that the model appends at the end of its response.
Source code in vllm/reasoning/gemma4_utils.py
_strip_thought_label ¶
Strip the spurious `thought\n` label from the start of text.
Only strips when `thought` appears as the very first word followed by a newline, preserving the word `thought` in any other context.
Source code in vllm/reasoning/gemma4_utils.py
parse_thinking_output ¶
Parse decoded Gemma4 model output.
Use this on all Gemma4 output regardless of whether thinking mode was enabled. It handles three cases:
- Thinking enabled, tags present — splits on `<|channel>`/`<channel|>` to separate the chain-of-thought from the answer and strips the `thought\n` role label.
- Thinking disabled, spurious label — strips the bare `thought\n` prefix that some Gemma4 models emit even without thinking mode.
- Clean output — returns the text unchanged.
The answer text is always cleaned of trailing sentinel tokens (<turn|>, <eos>, etc.).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Decoded model output text (from `tokenizer.decode` with `skip_special_tokens=False`). | *required* |
Returns:
| Type | Description |
|---|---|
| `dict[str, str \| None]` | A dict with keys: `thinking` (chain-of-thought text, or `None` if absent) and `answer` (the final answer text). |
Example::
>>> from vllm.reasoning.gemma4_utils import parse_thinking_output
>>> output_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
>>> result = parse_thinking_output(output_text)
>>> print(result["thinking"]) # chain-of-thought reasoning or None
>>> print(result["answer"]) # final answer