feihu.hf committed
Commit: 4baa2f7
Parent(s): 51c9cad
update README
README.md CHANGED
@@ -229,6 +229,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+```bash
+export MODELNAME=Qwen3-30B-A3B-Thinking-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
+
 #### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
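The snippet added above downloads the checkpoint and swaps in the 1M config. As a quick sanity check before serving, you can grep the swapped file for long-context fields; the key names below are the usual transformers ones and are an assumption here, since the diff does not show the contents of `config_1m.json`:

```bash
# Optional sanity check: the swapped config should now carry the
# long-context / sparse-attention settings. Key names are assumed
# (standard transformers fields); config_1m.json is not shown above.
grep -E '"max_position_embeddings"|"rope_scaling"|"sparse"' \
  Qwen3-30B-A3B-Thinking-2507/config.json
```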
@@ -247,8 +254,8 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve
---tensor-parallel-size
+vllm serve ./Qwen3-30B-A3B-Thinking-2507 \
+--tensor-parallel-size 4 \
 --max-model-len 1010000 \
 --enable-chunked-prefill \
 --max-num-batched-tokens 131072 \
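Once the vLLM command above is up, the server can be smoke-tested through its OpenAI-compatible endpoint. A minimal sketch, assuming vLLM's default port 8000 and that the served model name defaults to the path passed to `vllm serve` (neither is stated in this diff):

```bash
# Minimal smoke test against vLLM's OpenAI-compatible API.
# Assumes the default port (8000) and the default served model name
# (the path given to `vllm serve`); adjust if you override either.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Thinking-2507",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```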
@@ -284,11 +291,11 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
---model-path
+--model-path ./Qwen3-30B-A3B-Thinking-2507 \
 --context-length 1010000 \
 --mem-frac 0.75 \
 --attention-backend dual_chunk_flash_attn \
---tp
+--tp 4 \
 --chunked-prefill-size 131072 \
 --reasoning-parser deepseek-r1
 ```
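The SGLang launch above likewise exposes an OpenAI-compatible API. A minimal reachability check, assuming SGLang's default port 30000 (not specified in this diff; pass `--port` to change it):

```bash
# Quick reachability check against SGLang's OpenAI-compatible API.
# Assumes the default port (30000).
curl -s http://localhost:30000/v1/models
```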
@@ -300,7 +307,7 @@ python3 -m sglang.launch_server \
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
 | `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
-| `--tp
+| `--tp 4` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
 
 #### Troubleshooting:
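Per the out-of-memory notes in the table above, it helps to watch GPU memory while the server handles its first long prompt; if usage climbs toward the limit, lower `--mem-frac` (SGLang) or `--max-num-batched-tokens` (vLLM). A simple way to watch it:

```bash
# Watch per-GPU memory while serving a long prompt; if it approaches the
# limit, lower --mem-frac (SGLang) or --max-num-batched-tokens (vLLM).
watch -n 2 nvidia-smi
```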