Troubleshooting the flash_attn_varlen_func() 'num_splits' TypeError with Qwen3 GGUF in vLLM
This article addresses a bug encountered while serving a GGUF version of the Qwen3 model with vLLM. The error manifests as a TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'num_splits'
during inference. This guide explores the issue, analyzes the error log, and offers potential solutions.
Understanding the Bug
The core issue lies in the interaction between the vLLM library, the FlashAttention backend, and the Qwen3 model when running in GGUF format. The traceback clearly shows that flash_attn_varlen_func is receiving an unexpected keyword argument, num_splits. This suggests an incompatibility or a misconfiguration in how FlashAttention is called from vLLM's attention mechanism when processing the Qwen3 model.
Delving into the Error Log
To effectively troubleshoot this bug, let's break down the error log provided:
TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'num_splits'
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 579, in run_engine_core
engine_core.run_busy_loop()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 606, in run_busy_loop
self._process_engine_step()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 631, in _process_engine_step
outputs, model_executed = self.step_fn()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 235, in step
model_output = self.execute_model(scheduler_output)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 221, in execute_model
raise err
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 212, in execute_model
return self.model_executor.execute_model(scheduler_output)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
output = self.collective_rpc("execute_model",
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2736, in run_method
return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 308, in execute_model
output = self.model_runner.execute_model(scheduler_output,
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1370, in execute_model
model_output = self.model(
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 303, in forward
hidden_states = self.model(input_ids, positions, intermediate_tensors,
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 246, in __call__
model_output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 337, in forward
def forward(
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
raise e
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.58", line 240, in forward
submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
raise e
File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.2", line 5, in forward
unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'model.layers.0.self_attn.attn'); query_2 = key_2 = value = output_1 = unified_attention_with_output = None
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
return self._op(*args, **(kwargs or {}))
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 451, in unified_attention_with_output
self.impl.forward(self,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 539, in forward
flash_attn_varlen_func(
TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'num_splits'
Key observations from the log include:
- The error originates in vllm/v1/attention/backends/flash_attn.py, specifically in the forward function of the flash_attn backend.
- The unified_attention_with_output function in vllm/attention/layer.py is involved in the call to FlashAttention.
- The traceback traverses several layers of vLLM's architecture, including the engine core, model executor, GPU worker, and model runner, highlighting how deep in the stack the failure occurs.
Analyzing the Environment
The provided environment information sheds light on the setup used when the bug occurred. Key aspects to consider are:
- vLLM Version: 0.9.2.dev226+g9a3b88328.d20250705, a development build. Development builds may contain bugs or unoptimized code paths.
- Model: Qwen3-0.6B-Q4_1.gguf, a GGUF-quantized version of the Qwen3 model.
- Hardware: A CUDA-enabled platform, indicating GPU inference.
- Configuration:
  - max_model_len: 16000, a large context window.
  - gpu_memory_utilization: 0.6, capping vLLM at 60% of GPU memory.
  - swap_space: 8.0 GiB, a significant amount of swap space.
  - quantization: gguf, confirming GGUF quantization is in use.
Potential Causes and Solutions
Based on the error message and environment details, several potential causes and solutions can be explored:
1. Incompatible FlashAttention Version
- Cause: The installed FlashAttention library might not be compatible with this vLLM version or with the way vLLM calls it. The num_splits argument might have been introduced in a later FlashAttention release or removed in an earlier one.
- Solution:
  - Verify the FlashAttention version: Check the installed version with pip show flash-attn or pip list (see the version-check sketch below).
  - Update/downgrade FlashAttention: Try updating or downgrading FlashAttention to a version known to be compatible with vLLM 0.9.2. Consult vLLM's documentation or release notes for recommended FlashAttention versions. For instance:
    pip install flash-attn==<compatible_version>
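The sketch below is one way to confirm which versions are actually installed in the environment that runs the server. It uses the standard importlib.metadata module and assumes the usual PyPI distribution names (vllm, flash-attn, torch); adjust the names if your setup uses a different build.

```python
# Minimal version check (not vLLM's own tooling): print the versions of the
# relevant packages installed in the current Python environment.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "flash-attn", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

Running this inside the same environment (or container) that launches vLLM rules out the common case of mixed environments reporting different versions than the server actually imports.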
2. vLLM Version Bug
- Cause: As the vLLM version is a development build, it might contain bugs related to FlashAttention integration, particularly with GGUF models.
- Solution:
- Try a Stable vLLM Release: Consider switching to a stable release of vLLM if available. Stable releases undergo more testing and are less likely to have critical bugs.
pip install vllm==<stable_version>
- Report the Issue: If the bug persists on a stable release or you need to use the development version, report the issue on the vLLM GitHub repository. Provide detailed information, including the error log, environment details, and steps to reproduce the bug.
3. GGUF Quantization Incompatibility
- Cause: There might be an incompatibility between FlashAttention and GGUF quantized models in vLLM. Quantization can sometimes introduce numerical instability or require specific handling in attention mechanisms.
- Solution:
- Try a Non-Quantized Model: As a test, run vLLM with a non-quantized version of the Qwen3 model (e.g., FP16 or BF16); a minimal sketch follows this list. If the error disappears, it points to a quantization-related issue.
- Check vLLM Documentation: Consult vLLM's documentation for specific instructions or limitations regarding GGUF quantization and FlashAttention.
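One way to run the non-quantized comparison is through vLLM's Python API, as in the sketch below. The model ID Qwen/Qwen3-0.6B, the dtype, and the limits are assumptions; substitute the checkpoint and settings you actually use.

```python
# Sanity-check sketch: load a non-quantized Qwen3 checkpoint with the same
# limits used for the GGUF file and run a single short generation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-0.6B",     # non-quantized checkpoint (assumption)
    dtype="bfloat16",
    max_model_len=16000,
    gpu_memory_utilization=0.6,
)
out = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

If this configuration runs cleanly while the GGUF file fails with the same settings, the GGUF/quantization path becomes the prime suspect.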
4. Incorrect Configuration
- Cause: While less likely, an incorrect configuration parameter might be causing the issue. For instance, max_model_len or other attention-related parameters might be triggering the bug.
- Solution:
  - Review the configuration: Double-check all configuration parameters passed to vLLM, especially those related to memory, attention, and quantization. Ensure they are within the recommended ranges for your hardware and model.
  - Experiment with max_model_len: Try reducing max_model_len to see if the error goes away; a very large context window can sometimes expose issues in attention implementations (see the reduced-context sketch below).
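A minimal reproduction with a much smaller context window might look like the sketch below. The GGUF path and the tokenizer repo are placeholders for whatever you pass on your own command line; passing a Hugging Face tokenizer alongside a GGUF file is a common pattern, but verify it against the vLLM GGUF documentation for your version.

```python
# Minimal-repro sketch: same GGUF model, drastically reduced context window.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Qwen3-0.6B-Q4_1.gguf",  # local GGUF file (placeholder path)
    tokenizer="Qwen/Qwen3-0.6B",           # HF tokenizer repo (assumption)
    max_model_len=2048,                    # much smaller than 16000
    gpu_memory_utilization=0.6,
)
print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```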
5. FlashAttention Installation Issues
- Cause: FlashAttention might not be installed correctly or might have been compiled with incompatible CUDA versions.
- Solution:
- Reinstall FlashAttention: Try reinstalling FlashAttention from scratch:
pip uninstall flash-attn
pip cache purge  # optional: clear the pip cache
pip install flash-attn
- Check CUDA Compatibility: Ensure that the FlashAttention version is compatible with your installed CUDA version (a quick check is sketched after this list). Refer to FlashAttention's documentation for CUDA version requirements.
- Compile from Source: If you encounter persistent issues, consider compiling FlashAttention from source, following the instructions in its repository. This can sometimes resolve compatibility problems.
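The following sketch is an informal sanity check, not an official tool: it reports the CUDA version PyTorch was built against, confirms the GPU is visible, and verifies that the flash_attn extension imports cleanly in the same environment.

```python
# Informal CUDA/FlashAttention environment check.
import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Compute capability:", torch.cuda.get_device_capability(0))

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as exc:  # missing or broken build
    print("flash_attn import failed:", exc)
```

An import failure or a CUDA mismatch here is a strong hint that the FlashAttention wheel was built for a different toolchain than the one in use.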
6. Memory Issues
- Cause: Although the logs don't point directly to a memory problem, a large max_model_len (16000) combined with a relatively low gpu_memory_utilization (0.6) and GGUF quantization might be pushing the limits of available memory, leading to unexpected errors.
- Solution:
  - Reduce max_model_len: Try reducing the maximum sequence length to free up memory.
  - Increase gpu_memory_utilization (with caution): If your system has sufficient GPU memory, you could try increasing gpu_memory_utilization, but do not set it too high, as that can lead to out-of-memory errors.
  - Monitor memory usage: Use tools like nvidia-smi to monitor GPU memory usage during inference and identify whether memory exhaustion is a contributing factor (a simple polling sketch follows this list).
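A rough polling sketch using nvidia-smi's query interface is shown below; run it in a separate terminal or process while the server handles requests. The query flags are standard nvidia-smi options, but the polling interval and duration are arbitrary choices.

```python
# Poll GPU memory usage once per second for ~30 seconds via nvidia-smi.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

for _ in range(30):
    # First line corresponds to GPU 0; values are reported in MiB.
    used, total = subprocess.check_output(QUERY, text=True).split("\n")[0].split(", ")
    print(f"GPU memory: {used} MiB / {total} MiB")
    time.sleep(1)
```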
Detailed Troubleshooting Steps
To effectively resolve this issue, follow these detailed troubleshooting steps:
- Isolate the Problem:
  - Try running vLLM with a smaller, non-quantized model to see if the error persists.
  - Run vLLM with a very short prompt and a low max_model_len.
- Check FlashAttention:
  - Verify the installed FlashAttention version using pip show flash-attn.
  - Consult FlashAttention's documentation for compatibility with your CUDA version and vLLM.
  - Try reinstalling FlashAttention or compiling it from source.
- Examine vLLM Configuration:
  - Review all command-line arguments and configuration parameters.
  - Experiment with different values for max_model_len, gpu_memory_utilization, and other relevant parameters.
- Test Different vLLM Versions:
  - If using a development version, switch to a stable release.
  - If the error persists, try a slightly older stable release.
- Monitor Memory Usage:
  - Use nvidia-smi to monitor GPU memory usage during inference.
  - Check system RAM usage as well, since swap space is allocated.
- Report the Bug:
- If you cannot resolve the issue, create a detailed bug report on the vLLM GitHub repository.
- Include the error log, environment information, steps to reproduce the bug, and any troubleshooting steps you have already taken.
Conclusion
The TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'num_splits' error in vLLM when serving a Qwen3 GGUF model most likely stems from an incompatibility between FlashAttention, vLLM, and the GGUF quantization path. By systematically checking the environment, the FlashAttention installation, the vLLM configuration, and the versions involved, you can pinpoint the root cause and apply the appropriate fix. Consult the official vLLM and FlashAttention documentation for the most accurate and up-to-date guidance, prefer stable releases for production environments, and report any bugs you encounter so the community can address them.