Minimal working setup for SGLang (and vLLM?)

#2
by rocca - opened

I tried this on 4xMI300X:

# lmsysorg/sglang:v0.5.3.post3-rocm700-mi35x-srt
source /opt/venv/bin/activate
SGLANG_USE_AITER=1 python -m sglang.launch_server --model-path amd/DeepSeek-R1-0528-MXFP4-ASQ --tp 4 --port 3000 --attention-backend aiter

But got:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 3484, in prepare_server_args
    return ServerArgs.from_cli_args(raw_args)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 3133, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
  File "<string>", line 258, in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 522, in __post_init__
    self._handle_gpu_memory_settings(gpu_mem)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 763, in _handle_gpu_memory_settings
    model_config = ModelConfig.from_server_args(self)
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 208, in from_server_args
    return ModelConfig(
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 188, in __init__
    self._verify_quantization()
  File "/sgl-workspace/sglang/python/sglang/srt/configs/model_config.py", line 625, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: quark. Must be one of ['fp8', 'blockwise_int8', 'modelopt_fp8', 'modelopt_fp4', 'w8a8_int8', 'w8a8_fp8', 'awq', 'awq_marlin', 'gptq', 'gptq_marlin', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8', 'petit_nvfp4', 'fbgemm_fp8', 'aqlm', 'deepspeedfp', 'tpu_int8', 'marlin', 'gguf', 'gptq_marlin_24', 'bitsandbytes', 'qqq', 'experts_int8'].
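The error means SGLang's config verifier doesn't recognize the quantization method the checkpoint declares. As a sanity check, the declared method can be read from the model's config.json; the snippet below uses an illustrative stub of the relevant section (for a real check, load the downloaded checkpoint's config.json instead):

```python
import json

# Illustrative stub of the relevant part of a checkpoint's config.json;
# the real file ships with the downloaded model.
config = json.loads("""
{
  "model_type": "deepseek_v3",
  "quantization_config": {
    "quant_method": "quark"
  }
}
""")

# This is the value SGLang's _verify_quantization rejects above.
quant_method = config.get("quantization_config", {}).get("quant_method")
print(quant_method)
```

If this prints a method not in the supported list from the error message, the container's SGLang build simply doesn't handle that quantization scheme yet.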

I also tried this with vLLM (based on this pull request):

# rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006
VLLM_DISABLE_COMPILE_CACHE=1 AMDGCN_USE_BUFFER_OPS=1 VLLM_ROCM_USE_AITER=1 VLLM_TRITON_FP4_GEMM_USE_ASM=0 VLLM_ROCM_USE_AITER_MLA=1 VLLM_ROCM_USE_TRITON_ROPE=1 VLLM_ROCM_USE_CK_MXFP4_MOE=1 vllm serve amd/DeepSeek-R1-0528-MXFP4-ASQ --host 0.0.0.0 --port 3000 --swap-space 64 --dtype auto --max-model-len 8192 --tensor-parallel-size 4 --max-num-seqs 1024 --trust-remote-code --block-size 1 --gpu-memory-utilization 0.90 --max-num-batched-tokens 131072 --async-scheduling --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}'

But got:

(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2629, in load_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     model = initialize_model(vllm_config=vllm_config,
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 820, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.model = DeepseekV2Model(vllm_config=vllm_config,
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 201, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 748, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                                                     ^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 627, in make_layers
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 750, in <lambda>
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     lambda prefix: DeepseekV2DecoderLayer(vllm_config, prefix),
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 639, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.self_attn = attn_cls(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                      ^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 592, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.mla_attn = MultiHeadLatentAttention(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mla.py", line 86, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.mla_attn = Attention(
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                     ^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 195, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 192, in __init__
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]     assert (num_heads == 16 or num_heads == 128), (
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5867) ERROR 10-29 19:04:50 [multiproc_executor.py:597] AssertionError: Aiter MLA only supports 16 or 128 number of heads.
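If I read the assertion right, the Aiter MLA backend checks the per-rank head count, and DeepSeek-R1 has 128 attention heads (assumption based on its config), so whether the assert fires depends on the tensor-parallel size:

```python
# Per-rank attention head count for different tensor-parallel sizes,
# assuming DeepSeek-R1's 128 total attention heads (from its config.json).
TOTAL_HEADS = 128

results = {}
for tp in (1, 2, 4, 8):
    heads_per_rank = TOTAL_HEADS // tp
    # The Aiter MLA assertion above only accepts 16 or 128 heads.
    ok = heads_per_rank in (16, 128)
    results[tp] = (heads_per_rank, ok)
    print(f"tp={tp}: {heads_per_rank} heads/rank -> "
          f"{'ok' if ok else 'fails assert'}")
```

So with --tensor-parallel-size 4 each rank gets 32 heads, which trips the assert; on 4xMI300X that would seem to leave TP=1 or turning off the Aiter MLA path (VLLM_ROCM_USE_AITER_MLA=0), though I haven't verified either works.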

I guess that issue has already been reported here:


Wondering if anyone has a minimal working setup for this model with SGLang or vLLM on 4xMI300X?
