feihu.hf committed
Commit 4baa2f7 · Parent: 51c9cad

update README

Files changed (1)
  1. README.md +12 -5
README.md CHANGED
@@ -229,6 +229,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.

 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.

+```bash
+export MODELNAME=Qwen3-30B-A3B-Thinking-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
+
 #### Step 2: Launch Model Server

 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
@@ -247,8 +254,8 @@ Then launch the server with Dual Chunk Flash Attention enabled:

 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
-  --tensor-parallel-size 8 \
+vllm serve ./Qwen3-30B-A3B-Thinking-2507 \
+  --tensor-parallel-size 4 \
   --max-model-len 1010000 \
   --enable-chunked-prefill \
   --max-num-batched-tokens 131072 \
@@ -284,11 +291,11 @@ Launch the server with DCA support:

 ```bash
 python3 -m sglang.launch_server \
-    --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
+    --model-path ./Qwen3-30B-A3B-Thinking-2507 \
     --context-length 1010000 \
     --mem-frac 0.75 \
     --attention-backend dual_chunk_flash_attn \
-    --tp 8 \
+    --tp 4 \
     --chunked-prefill-size 131072 \
     --reasoning-parser deepseek-r1
 ```
@@ -300,7 +307,7 @@ python3 -m sglang.launch_server \
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
 | `--mem-frac 0.75` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
-| `--tp 8` | Tensor parallelism size (matches model sharding) |
+| `--tp 4` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |

 #### Troubleshooting:
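Two hypothetical follow-up snippets, not part of the commit above, that may help when applying the committed instructions; paths, the port, and the served model name are assumptions based on the commands in the diff.

First, a sanity check for Step 1: after swapping the config files, print the context window recorded in the active `config.json`. The key `max_position_embeddings` is the standard Transformers config field; the exact value shipped in `config_1m.json` is not shown in the diff, so treat the printed number only as confirmation that the 1M variant is in place.

```bash
# Hypothetical sanity check: confirm the swapped-in config is the 1M variant.
# Assumes the Step 1 commands were run in the current working directory.
python3 -c 'import json; cfg = json.load(open("Qwen3-30B-A3B-Thinking-2507/config.json")); print(cfg.get("max_position_embeddings"))'
```

Second, a minimal smoke test for Step 2: both vLLM and SGLang expose an OpenAI-compatible HTTP API, so a single chat-completions request verifies the server is up. The port below assumes vLLM's default (8000); SGLang defaults to 30000, and the `model` field should match the path passed at launch unless `--served-model-name` was set.

```bash
# Hypothetical smoke test against the OpenAI-compatible endpoint (vLLM default port assumed).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-30B-A3B-Thinking-2507",
        "messages": [{"role": "user", "content": "Give a one-sentence summary of dual chunk attention."}],
        "max_tokens": 256
      }'
```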