feihu.hf committed
Commit: 4baa2f7
Parent(s): 51c9cad
update README
README.md CHANGED
@@ -229,6 +229,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+```bash
+export MODELNAME=Qwen3-30B-A3B-Thinking-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
+
 #### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
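The snippet added above downloads the checkpoint and swaps in the 1M config. As a quick sanity check before serving, you can grep the swapped file for long-context fields; the key names below are the usual transformers ones and are an assumption here, since the diff does not show the contents of `config_1m.json`:

```bash
# Optional sanity check: the swapped config should now carry the
# long-context / sparse-attention settings. Key names are assumed
# (standard transformers fields); config_1m.json is not shown above.
grep -E '"max_position_embeddings"|"rope_scaling"|"sparse"' \
  Qwen3-30B-A3B-Thinking-2507/config.json
```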
@@ -247,8 +254,8 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve
---tensor-parallel-size
+vllm serve ./Qwen3-30B-A3B-Thinking-2507 \
+--tensor-parallel-size 4 \
 --max-model-len 1010000 \
 --enable-chunked-prefill \
 --max-num-batched-tokens 131072 \
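Once the vLLM command above is up, the server can be smoke-tested through its OpenAI-compatible endpoint. A minimal sketch, assuming vLLM's default port 8000 and that the served model name defaults to the path passed to `vllm serve` (neither is stated in this diff):

```bash
# Minimal smoke test against vLLM's OpenAI-compatible API.
# Assumes the default port (8000) and the default served model name
# (the path given to `vllm serve`); adjust if you override either.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Thinking-2507",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```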
@@ -284,11 +291,11 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
---model-path
+--model-path ./Qwen3-30B-A3B-Thinking-2507 \
 --context-length 1010000 \
 --mem-frac 0.75 \
 --attention-backend dual_chunk_flash_attn \
---tp
+--tp 4 \
 --chunked-prefill-size 131072 \
 --reasoning-parser deepseek-r1
 ```
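The SGLang launch above likewise exposes an OpenAI-compatible API. A minimal reachability check, assuming SGLang's default port 30000 (not specified in this diff; pass `--port` to change it):

```bash
# Quick reachability check against SGLang's OpenAI-compatible API.
# Assumes the default port (30000).
curl -s http://localhost:30000/v1/models
```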
@@ -300,7 +307,7 @@ python3 -m sglang.launch_server \
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
 | `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
-| `--tp
+| `--tp 4` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
 
 #### Troubleshooting:
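Per the out-of-memory notes in the table above, it helps to watch GPU memory while the server handles its first long prompt; if usage climbs toward the limit, lower `--mem-frac` (SGLang) or `--max-num-batched-tokens` (vLLM). A simple way to watch it:

```bash
# Watch per-GPU memory while serving a long prompt; if it approaches the
# limit, lower --mem-frac (SGLang) or --max-num-batched-tokens (vLLM).
watch -n 2 nvidia-smi
```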