Minimal working setup for SGLang (and vLLM?)

#2
by rocca - opened

I tried this on 4xMI300X:

# lmsysorg/sglang:v0.5.3.post3-rocm700-mi35x-srt
source /opt/venv/bin/activate
SGLANG_USE_AITER=1 python -m sglang.launch_server --model-path amd/DeepSeek-R1-0528-MXFP4-ASQ --tp 4 --port 3000 --attention-backend aiter

But got:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 3484, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 3133, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
  File "<string>", line 258, in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 522, in __post_init__
    self._handle_gpu_memory_settings(gpu_mem)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 763, in _handle_gpu_memory_settings
    model_config = ModelConfig.from_server_args(self)
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 208, in from_server_args
    return ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 188, in __init__
    self._verify_quantization()
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 625, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: quark. Must be one of ['fp8', 'blockwise_int8', 'modelopt_fp8', 'modelopt_fp4', 'w8a8_int8', 'w8a8_fp8', 'awq', 'awq_marlin', 'gptq', 'gptq_marlin', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8', 'petit_nvfp4', 'fbgemm_fp8', 'aqlm', 'deepspeedfp', 'tpu_int8', 'marlin', 'gguf', 'gptq_marlin_24', 'bitsandbytes', 'qqq', 'experts_int8'].
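The error means SGLang's config verifier doesn't recognize the quantization method the checkpoint declares. As a sanity check, the declared method can be read from the model's config.json; the snippet below uses an illustrative stub of the relevant section (for a real check, load the downloaded checkpoint's config.json instead):

```python
import json

# Illustrative stub of the relevant part of a checkpoint's config.json;
# the real file ships with the downloaded model.
config = json.loads("""
{
  "model_type": "deepseek_v3",
  "quantization_config": {
    "quant_method": "quark"
  }
}
""")

# This is the value SGLang's _verify_quantization rejects above.
quant_method = config.get("quantization_config", {}).get("quant_method")
print(quant_method)
```

If this prints a method not in the supported list from the error message, the container's SGLang build simply doesn't handle that quantization scheme yet.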

I also tried this with vLLM (based on this pull request):

# rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
VLLM_DISABLE_COMPILE_CACHE=1 AMDGCN_USE_BUFFER_OPS=1 VLLM_ROCM_USE_AITER=1 VLLM_TRITON_FP4_GEMM_USE_ASM=0 VLLM_ROCM_USE_AITER_MLA=1 VLLM_ROCM_USE_TRITON_ROPE=1 VLLM_ROCM_USE_CK_MXFP4_MOE=1 vllm serve amd/DeepSeek-R1-0528-MXFP4-ASQ --host 0.0.0.0 --port 3000 --swap-space 64 --dtype auto --max-model-len 8192 --tensor-parallel-size 4 --max-num-seqs 1024 --trust-remote-code --block-size 1 --gpu-memory-utilization 0.90 --max-num-batched-tokens 131072 --async-scheduling --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}'

But got:

(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2629, in load_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     model = initialize_model(vllm_config=vllm_config,
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 820, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.model = DeepseekV2Model(vllm_config=vllm_config,
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 201, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 748, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                                                     ^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 627, in make_layers
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 750, in <lambda>
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     lambda prefix: DeepseekV2DecoderLayer(vllm_config, prefix),
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 639, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.self_attn = attn_cls(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                      ^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 592, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.mla_attn = MultiHeadLatentAttention(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mla.py", line 86, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.mla_attn = Attention(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                     ^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 195, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 192, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     assert (num_heads == 16 or num_heads == 128), (
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597] AssertionError: Aiter MLA only supports 16 or 128 number of heads.
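If I read the assertion right, the Aiter MLA backend checks the per-rank head count, and DeepSeek-R1 has 128 attention heads (assumption based on its config), so whether the assert fires depends on the tensor-parallel size:

```python
# Per-rank attention head count for different tensor-parallel sizes,
# assuming DeepSeek-R1's 128 total attention heads (from its config.json).
TOTAL_HEADS = 128

results = {}
for tp in (1, 2, 4, 8):
    heads_per_rank = TOTAL_HEADS // tp
    # The Aiter MLA assertion above only accepts 16 or 128 heads.
    ok = heads_per_rank in (16, 128)
    results[tp] = (heads_per_rank, ok)
    print(f"tp={tp}: {heads_per_rank} heads/rank -> "
          f"{'ok' if ok else 'fails assert'}")
```

So with --tensor-parallel-size 4 each rank gets 32 heads, which trips the assert; on 4xMI300X that would seem to leave TP=1 or turning off the Aiter MLA path (VLLM_ROCM_USE_AITER_MLA=0), though I haven't verified either works.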

I guess that issue has already been reported here:


Wondering if anyone has a minimal working setup for this model with SGLang or vLLM on 4xMI300X?
