
Deployment Guide of openPangu Embedded 1B Based on vllm-ascend

Deployment Environment Description

A single card of the Atlas 800T A2 (64 GB) is sufficient to deploy openPangu Embedded 1B (bf16). The deployment uses the vllm-ascend community image v0.9.1-dev, which needs to be pulled on every node involved in the deployment.

docker pull quay.io/ascend/vllm-ascend:v0.9.1-dev

Docker Startup and Inference Code

Perform the following operations on all nodes.

Run the following command to start the Docker container:

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1-dev  # Use correct image id
export NAME=vllm-ascend  # Custom docker name

# Run the container using the defined variables
# Note: if you are running Docker with a bridge network, expose the ports needed for multi-node communication in advance.
# To prevent device interference from other Docker containers, add the argument "--privileged"
docker run --rm \
--name $NAME \
--network host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
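Once the container is running, it can be worth confirming that the Ascend devices are actually visible inside it. The check below is a sketch: it relies on the `npu-smi` binary being available through the `-v /usr/local/bin/npu-smi` bind mount in the `docker run` command above.

```shell
# Optional check: npu-smi is mounted into the container from the host
# (see the -v /usr/local/bin/npu-smi mount in the docker run command).
if command -v npu-smi >/dev/null 2>&1; then
    npu-smi info  # lists the visible Ascend devices
else
    echo "npu-smi not found - verify the -v mounts in the docker run command"
fi
```

If no devices are listed, re-check the `--device /dev/davinci*` arguments before proceeding.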

If you are not already inside the container, enter it as the root user:

docker exec -itu root $NAME /bin/bash

Install vLLM v0.9.2 to replace the image's built-in vllm code:

pip install --no-deps vllm==0.9.2 pybase64==1.4.1
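An optional sanity check (a sketch) to confirm that both pinned packages are now importable from the container's Python environment:

```shell
# Optional sanity check: report whether the pinned packages are importable.
python3 - <<'EOF'
import importlib.util

for pkg in ("vllm", "pybase64"):
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg}: {'found' if spec else 'MISSING'}")
EOF
```

Both lines should report `found` before you continue to the vllm-ascend replacement step.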

Download vllm-ascend (v0.9.2rc1) and replace the built-in vllm-ascend code in the image (/vllm-workspace/vllm-ascend/). For example, download the Source code (tar.gz) from the release Assets to get v0.9.2rc1.tar.gz, then extract it over the built-in copy:

tar -zxvf vllm-ascend-0.9.2rc1.tar.gz -C /vllm-workspace/vllm-ascend/ --strip-components=1
export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
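To verify that the export took effect and the replacement tree will shadow the image's built-in package, you can check PYTHONPATH from Python:

```shell
# Verify the replacement tree is on PYTHONPATH so it shadows the built-in copy.
export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
python3 -c 'import os; print("/vllm-workspace/vllm-ascend/" in os.environ["PYTHONPATH"])'  # prints True
```

Note that the export only applies to the current shell; add it to the shell profile if you open new sessions in the container.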

Use the Pangu model-adapted vllm-ascend code from the current repository to replace parts of the code in /vllm-workspace/vllm-ascend/vllm_ascend/:

yes | cp -r inference/vllm_ascend/* /vllm-workspace/vllm-ascend/vllm_ascend/
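A quick way to confirm the copy landed (a sketch; the exact file list depends on the contents of the repository's inference/vllm_ascend directory):

```shell
# Confirm the adapted files are present in the target package directory.
ls /vllm-workspace/vllm-ascend/vllm_ascend/ 2>/dev/null \
    || echo "target directory missing - rerun the copy step above"
```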

openPangu Embedded Inference

Perform the following operations on all nodes.

Configuration:

export VLLM_USE_V1=1
# Specifying HOST=127.0.0.1 (localhost) means the server can only be accessed from the master device.
# Specifying HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network or even from the internet, provided proper network configuration (e.g., firewall rules, port forwarding) is in place.
HOST=xxx.xxx.xxx.xxx
PORT=8080

openPangu Embedded 1B running command:

export ASCEND_RT_VISIBLE_DEVICES=0
LOCAL_CKPT_DIR=/root/.cache/pangu_embedded_1b  # The pangu_embedded_1b bf16 weight
SERVED_MODEL_NAME=pangu_embedded_1b

vllm serve $LOCAL_CKPT_DIR \
    --served-model-name $SERVED_MODEL_NAME \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --host $HOST \
    --port $PORT \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --max-num-batched-tokens 4096 \
    --tokenizer-mode "slow" \
    --dtype bfloat16 \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.93 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill
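Model loading can take a while, so it is convenient to poll the server before sending real traffic. The probe below is a sketch against the standard OpenAI-compatible /v1/models endpoint; `RETRIES` and `SLEEP` are illustrative knobs (not vLLM options), and HOST/PORT come from the configuration above.

```shell
# Poll the OpenAI-compatible /v1/models endpoint until the server answers.
# RETRIES and SLEEP are illustrative defaults, not vLLM options.
RETRIES=${RETRIES:-30}
for i in $(seq 1 "$RETRIES"); do
    if curl -sf --max-time 5 "http://${HOST}:${PORT}/v1/models" >/dev/null; then
        echo "server is up"
        break
    fi
    sleep "${SLEEP:-5}"
done
```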

Test Request

After the server has launched, send a test request from the master node or another node:

MASTER_NODE_IP=xxx.xxx.xxx.xxx  # server node IP
PORT=8080                       # must match the port used when starting the server
SERVED_MODEL_NAME=pangu_embedded_1b
curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "'$SERVED_MODEL_NAME'",
        "messages": [
            {
                "role": "user",
                "content": "Who are you?"
            }
        ],
        "max_tokens": 512,
        "temperature": 0
    }'
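The request body is plain OpenAI-style JSON, so validating it locally before sending can catch quoting mistakes early. The variant below is a sketch: it stores the payload in a variable, checks it with `python -m json.tool`, and adds `"stream": true`, which is the standard OpenAI-compatible flag for token-by-token streaming.

```shell
# Build the payload in a variable, validate it locally, then send it.
# "stream": true requests token-by-token streaming (OpenAI-compatible API).
PAYLOAD='{
    "model": "pangu_embedded_1b",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "max_tokens": 512,
    "temperature": 0,
    "stream": true
}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"

curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
```

With streaming enabled the response arrives as server-sent events rather than a single JSON object.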