guide for running this on 12 GB VRAM and 180 GB RAM with dual CPU in vLLM at 0.5 to 0.6 t/s


#install vllm

pip install uv
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install bitsandbytes
MAX_JOBS=16 uv pip install git+https://github.com/vllm-project/vllm.git
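
Optional sanity check (my addition, not part of the original steps): confirm the nightly vLLM build imports cleanly and that the 12 GB GPU is visible before patching anything.

python - <<'PY'
import torch, vllm
# prints the installed vllm version and whether CUDA sees the GPU
print("vllm", vllm.__version__, "| cuda available:", torch.cuda.is_available())
PY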

#patch vllm (remove the cpu-offload assert in the v1 gpu model runner)

# run this inside the same venv you start vllm from
python - <<'PY'
import os, re, site

# locate the installed vllm package and the gpu model runner file
site_pkg = site.getsitepackages()[0]
f = os.path.join(site_pkg, "vllm", "v1", "worker", "gpu_model_runner.py")
with open(f) as fh:
    s = fh.read()
# strip the "assert self.cache_config.cpu_offload_gb == 0, ..." check
s = re.sub(r'assert self\.cache_config\.cpu_offload_gb == 0,.*?\)', '', s, flags=re.S)
with open(f, "w") as fh:
    fh.write(s)
print("patched ->", f)
PY
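
To verify the patch took (a quick sketch I added, not from the original post), check that the assert string is gone from the file before starting the server:

python - <<'PY'
import os, site
f = os.path.join(site.getsitepackages()[0], "vllm", "v1", "worker", "gpu_model_runner.py")
src = open(f).read()
# after the patch, the cpu_offload_gb assert should no longer appear
print("assert removed:", "cpu_offload_gb == 0" not in src)
PY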

#run this model
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve \
  Qwen/Qwen3-Next-80B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --dtype float16 \
  --cpu-offload-gb 180 \
  --enforce-eager \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4096
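
Once the server is up it exposes the usual OpenAI-compatible API on port 8000. Here is a minimal client sketch I added (the prompt and max_tokens are arbitrary, and it assumes the non-streaming response carries the usage field) that also estimates tokens per second:

python - <<'PY'
import json, time, urllib.request

payload = {
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
t0 = time.time()
with urllib.request.urlopen(req) as resp:
    out = json.load(resp)
elapsed = time.time() - t0
tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["message"]["content"])
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.2f} t/s")
PY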

#result
On 12 GB VRAM and 256 GB RAM I am getting around 0.6 t/s.

Try using llama.cpp with CPU only, you can get ~10 t/s.

No GGUF yet, btw. I am currently trying vLLM CPU + speculative decoding with vLLM CPU+GPU.
