## Deployment Guide of openPangu Embedded 1B Based on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

### Deployment Environment Description

The Atlas 800T A2 (64 GB) supports single-card deployment of openPangu Embedded 1B (bf16). The vllm-ascend community image v0.9.1-dev is used and needs to be pulled on every node:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.1-dev
```

### Docker Startup and Inference Code

Perform the following operations on all nodes. Run the following command to start the container:

```bash
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1-dev  # Use the correct image
export NAME=vllm-ascend                             # Custom container name
# Run the container using the variables defined above.
# Note: if you are using docker bridge networking, please expose the ports needed
# for multi-node communication in advance.
# To prevent device interference from other docker containers, add the "--privileged" flag.
docker run --rm \
  --name $NAME \
  --network host \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /mnt/sfs_turbo/.cache:/root/.cache \
  -it $IMAGE bash
```

If not already inside the container, enter it as the root user:

```bash
docker exec -itu root $NAME /bin/bash
```

Download vllm (v0.9.2) to replace the built-in vllm code of the image.
```bash
pip install --no-deps vllm==0.9.2 pybase64==1.4.1
```

Download [vllm-ascend (v0.9.2rc1)](https://github.com/vllm-project/vllm-ascend/releases/tag/v0.9.2rc1) and replace the built-in vllm-ascend code in the image (`/vllm-workspace/vllm-ascend/`). For example, download [Source code (tar.gz)](https://github.com/vllm-project/vllm-ascend/archive/refs/tags/v0.9.2rc1.tar.gz) from the release assets to get v0.9.2rc1.tar.gz, then extract and replace:

```bash
tar -zxvf vllm-ascend-0.9.2rc1.tar.gz -C /vllm-workspace/vllm-ascend/ --strip-components=1
export PYTHONPATH=/vllm-workspace/vllm-ascend/:${PYTHONPATH}
```

Use the Pangu model-adapted vllm-ascend code from the current repository to replace parts of the code in `/vllm-workspace/vllm-ascend/vllm_ascend/`:

```bash
yes | cp -r inference/vllm_ascend/* /vllm-workspace/vllm-ascend/vllm_ascend/
```

### openPangu Embedded Inference

Perform the following operations on all nodes. Configuration:

```bash
export VLLM_USE_V1=1
# Specifying HOST=127.0.0.1 (localhost) means the server can only be accessed from the master device.
# Specifying HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network
# or even from the internet, provided proper network configuration (e.g., firewall rules,
# port forwarding) is in place.
HOST=xxx.xxx.xxx.xxx
PORT=8080
```

openPangu Embedded 1B run command:

```bash
export ASCEND_RT_VISIBLE_DEVICES=0
LOCAL_CKPT_DIR=/root/.cache/pangu_embedded_1b  # Path to the pangu_embedded_1b bf16 weights
SERVED_MODEL_NAME=pangu_embedded_1b
vllm serve $LOCAL_CKPT_DIR \
  --served-model-name $SERVED_MODEL_NAME \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --host $HOST \
  --port $PORT \
  --max-num-seqs 32 \
  --max-model-len 32768 \
  --max-num-batched-tokens 4096 \
  --tokenizer-mode "slow" \
  --dtype bfloat16 \
  --distributed-executor-backend mp \
  --gpu-memory-utilization 0.93 \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill
```

### Test Request

After the server is launched, send a test request from the master node or any other node (`PORT` and `SERVED_MODEL_NAME` must be set as in the configuration above):

```bash
MASTER_NODE_IP=xxx.xxx.xxx.xxx  # server node IP
curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'$SERVED_MODEL_NAME'",
    "messages": [
      {
        "role": "user",
        "content": "Who are you?"
      }
    ],
    "max_tokens": 512,
    "temperature": 0
  }'
```
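The same test request can also be issued from Python. The sketch below uses only the standard library against vLLM's OpenAI-compatible `/v1/chat/completions` endpoint; the base URL and model name are placeholders you should replace with your actual `MASTER_NODE_IP`, `PORT`, and `SERVED_MODEL_NAME`:

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_tokens=512, temperature=0.0):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(base_url, model, prompt):
    """POST a chat request to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The OpenAI-compatible response carries the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires the server above to be running; placeholder address):
# print(chat("http://xxx.xxx.xxx.xxx:8080", "pangu_embedded_1b", "Who are you?"))
```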