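vLLM and Qwen1.5

Last updated on the morning of June 19, 2024

Notes on deploying Qwen1.5 with vLLM.

1 Environment setup

My environment is listed below; the installation steps that follow are based on it: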

Ubuntu 20.04
CUDA 11.8
Python 3.10

Using conda:

`conda create -n vllm python=3.10 -y`
`conda activate vllm`

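1.1 Installing vLLM

Wheels are published on the releases page. Pick the vLLM build that matches your CUDA and Python versions. The command below installs vLLM 0.3.3:

`pip install https://github.com/vllm-project/vllm/releases/download/v0.3.3/vllm-0.3.3+cu118-cp310-cp310-manylinux1_x86_64.whl`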
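The wheel file name encodes the vLLM version, the CUDA tag and the Python tag, so the URL can be assembled from those three values. The sketch below only mirrors the naming pattern of the URL used in this post; the helper name `vllm_wheel_url` is made up for illustration, and not every combination is guaranteed to exist on the releases page, so check there before installing.

```python
import sys
from typing import Optional


def vllm_wheel_url(vllm_version: str, cuda_tag: str = "cu118",
                   python_tag: Optional[str] = None) -> str:
    """Build a release-wheel URL following the naming pattern used in this post."""
    if python_tag is None:
        # Derive the tag from the running interpreter, e.g. Python 3.10 -> "310".
        python_tag = f"{sys.version_info.major}{sys.version_info.minor}"
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{vllm_version}/vllm-{vllm_version}+{cuda_tag}"
        f"-cp{python_tag}-cp{python_tag}-manylinux1_x86_64.whl"
    )


# Reproduces the URL installed above (vLLM 0.3.3, CUDA 11.8, Python 3.10).
print(vllm_wheel_url("0.3.3", cuda_tag="cu118", python_tag="310"))
```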

Install torch. Each vLLM release pins a specific torch version, so a torch that was installed earlier may produce an error like this:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.3.3+cu118 requires torch==2.1.2, but you have torch 2.2.1+cu118 which is incompatible.
vllm 0.3.3+cu118 requires xformers==0.0.23.post1, but you have xformers 0.0.25+cu118 which is incompatible.

Uninstall it first, then install the pinned version:

`pip uninstall torch -y`
`pip install torch==2.1.2 --upgrade --index-url https://download.pytorch.org/whl/cu118`

Install xformers:

pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118
pip install  xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118
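After reinstalling, it can help to confirm that the installed versions really match the pins from the pip error message. The following is a minimal sketch using only the standard library; the pinned versions are copied from the error shown earlier, and the helper names are illustrative rather than part of vLLM or pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins reported by pip for vllm 0.3.3+cu118 in the error message above.
REQUIRED = {"torch": "2.1.2", "xformers": "0.0.23.post1"}


def base_version(v: str) -> str:
    """Drop a local build suffix such as '+cu118'."""
    return v.split("+", 1)[0]


def find_mismatches(installed: dict) -> dict:
    """Return {package: (required, installed)} for every pin that is not met."""
    problems = {}
    for name, required in REQUIRED.items():
        found = installed.get(name)
        if found is None or base_version(found) != required:
            problems[name] = (required, found)
    return problems


def installed_versions() -> dict:
    """Look up the installed versions (None when a package is missing)."""
    result = {}
    for name in REQUIRED:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    print(find_mismatches(installed_versions()) or "all pins satisfied")
```

Run inside the `vllm` conda environment, an empty result means both pins are satisfied.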

Environment for Qwen2

Qwen2 was released recently and requires vllm>=0.4.0, so here is the setup for it as well:

# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.4.0
export PYTHON_VERSION=310
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

For distributed deployment:

`pip install ray`

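2 Usage (OpenAI Compatible Server)

Deploying Qwen1.5 is used as the example here.

2.1 Command-line arguments

Arguments of the vLLM OpenAI-compatible server (partial list; see the official documentation for the rest):
* --port: port the server listens on
* --uvicorn-log-level: log level, default "info"
* --api-key: API key; when set, requests must send this key in the request header
* --served-model-name: model name that clients use in requests
* --model: model name or path
* --download-dir: directory for downloading and loading weights
* --dtype: weight precision; one of auto, half, float16, bfloat16, float, float32
* --max-model-len: model context length
* --gpu-memory-utilization: fraction of GPU memory to use, from 0 to 1, default 0.9
* --tensor-parallel-size: number of tensor-parallel replicas (GPUs the model is split across), default 1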
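When the server is launched from a script, the arguments above can be kept in a dictionary and turned into a command line. This is only a sketch: `build_server_command` is a hypothetical helper, and it simply maps `snake_case` keys to the `--kebab-case` flags listed above without validating them.

```python
import shlex
import sys


def build_server_command(options: dict) -> list:
    """Turn {'max_model_len': 2048}-style options into an argv list."""
    argv = [sys.executable, "-m", "vllm.entrypoints.openai.api_server"]
    for key, value in options.items():
        # max_model_len -> --max-model-len
        argv += ["--" + key.replace("_", "-"), str(value)]
    return argv


command = build_server_command({
    "model": "Qwen/Qwen1.5-7B-Chat",
    "served_model_name": "Qwen1.5-7B-Chat",
    "max_model_len": 2048,
    "gpu_memory_utilization": 0.8,
    "port": 1234,
})

# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(command))
# To actually launch it: subprocess.run(command, check=True)
```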

2.2 Deploying Qwen

The command below deploys Qwen1.5-7B-Chat. The model's maximum context length is 32768, but there is not enough GPU memory for that, so --max-model-len 2048 is set:

`python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-7B-Chat --model Qwen/Qwen1.5-7B-Chat --max-model-len 2048 --port 1234`
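A rough back-of-the-envelope estimate shows why the full context does not fit. The KV cache stores one key and one value vector per layer for every token. The model shape below (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) is an assumption for Qwen1.5-7B-Chat and should be checked against the model's config.json; the result also ignores the memory taken by the weights and by activations.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of key + value cache that one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Qwen1.5-7B-Chat shape; confirm against the model's config.json.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

for context_len in (2048, 8192, 32768):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>6} tokens -> {gib:.1f} GiB of KV cache per sequence")
```

Under these assumptions a single 32768-token sequence would need about 16 GiB of cache on top of the weights, while 2048 tokens need about 1 GiB, which is consistent with having to lower --max-model-len on a smaller card.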

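By default the model is downloaded from Hugging Face. You can also download it from Qwen yourself and pass the local path. If Hugging Face is not reachable without a proxy, set VLLM_USE_MODELSCOPE=True to download from ModelScope instead, or point Hugging Face at a mirror inside China, either in Python or on the command line:

`os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"`
`HF_ENDPOINT=https://hf-mirror.com python xxx.py`

The following example deploys through the mirror. The argument --gpu-memory-utilization 0.8 sets the fraction of GPU memory that vLLM may use; without it, vLLM falls back to the default of 0.9 and takes almost all of the memory: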

HF_ENDPOINT=https://hf-mirror.com python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-7B-Chat --model Qwen/Qwen1.5-7B-Chat --max-model-len 8192 --port 23238  --gpu-memory-utilization 0.8

For multi-GPU distributed deployment, use the --tensor-parallel-size argument:

HF_ENDPOINT=https://hf-mirror.com CUDA_VISIBLE_DEVICES=0,2  python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-7B-Chat --model Qwen/Qwen1.5-7B-Chat --max-model-len 8192 --port 23238 --tensor-parallel-size 2 --gpu-memory-utilization 0.8
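The value of --tensor-parallel-size has to be consistent with the GPUs exposed through CUDA_VISIBLE_DEVICES, and vLLM also expects the number of attention heads to be divisible by it. A small pre-flight check could look like the sketch below; the function names are illustrative, and the default of 32 attention heads is an assumption for Qwen1.5-7B-Chat.

```python
def visible_gpu_ids(cuda_visible_devices: str) -> list:
    """Split a CUDA_VISIBLE_DEVICES value such as '0,2' into ['0', '2']."""
    return [d.strip() for d in cuda_visible_devices.split(",") if d.strip()]


def check_tensor_parallel(cuda_visible_devices: str, tensor_parallel_size: int,
                          num_attention_heads: int = 32) -> None:
    """Raise ValueError if the tensor-parallel size cannot work with this setup."""
    gpus = visible_gpu_ids(cuda_visible_devices)
    if tensor_parallel_size > len(gpus):
        raise ValueError(
            f"tensor-parallel-size {tensor_parallel_size} needs that many GPUs, "
            f"but only {len(gpus)} are visible: {gpus}"
        )
    # 32 heads is an assumed value; read it from the model's config.json.
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"{num_attention_heads} attention heads cannot be split evenly "
            f"across {tensor_parallel_size} GPUs"
        )


# Matches the command above: GPUs 0 and 2, tensor-parallel size 2.
check_tensor_parallel("0,2", tensor_parallel_size=2)
print("tensor-parallel configuration looks consistent")
```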

Qwen2

The command is the same as above; only the model differs (the models are on Hugging Face). For example:

`python -m vllm.entrypoints.openai.api_server --model /home/server/AI/models/Qwen2-7B-Instruct-AWQ --served-model-name Qwen2-7b-awq`

2.3 Using quantized models

pip install autoawq
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-7B-Chat-AWQ \
    --quantization awq

pip install auto-gptq optimum
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-7B-Chat-GPTQ-Int8 \
    --quantization gptq

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-7B-Chat-GPTQ-Int8 \
    --quantization gptq \
    --kv-cache-dtype fp8_e5m2

2.4 Sending requests

The request format is unchanged: everything uses the OpenAI API format.

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
    ]
    }'

Python:

from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://172.1.1.1:1234/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen1.5-7B-Chat",
    messages=[
        {"role": "system", "content": "你的名字叫久远。来自图斯库尔"},
        {"role": "user", "content": "你是谁"},
    ]
)

print("Chat response:", chat_response)
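For environments where the `openai` package is not installed, the same request as the curl example can be built with the standard library alone. This is a sketch rather than a full client: `build_chat_request` and `extract_reply` are illustrative helpers, the URL and model name are the ones from the curl example, and the `Authorization: Bearer ...` header is only needed when the server was started with --api-key.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None):
    """Build a POST request for the /chat/completions endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # The key passed to --api-key is sent as a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body, headers=headers, method="POST"
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant's text out of a chat completion response."""
    return response_json["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen/Qwen1.5-7B-Chat",
    [{"role": "user", "content": "Tell me something about large language models."}],
)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as resp:
#     print(extract_reply(json.load(resp)))
```

With the `openai` client shown above, the equivalent of `extract_reply` is `chat_response.choices[0].message.content`.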

2.5 Docker

Single-GPU deployment

docker run --runtime nvidia --gpus all \
    -e CUDA_VISIBLE_DEVICES=1 \
    -e HF_ENDPOINT=https://hf-mirror.com \
    -v /home/server/AI/models:/root/.cache/huggingface \
    -p 8000:8000 \
    -d \
    --name Qwen1.5-7B-Chat-G1 \
    vllm/vllm-openai:v0.4.0 --served-model-name Qwen1.5-7B-Chat --model Qwen/Qwen1.5-7B-Chat --max-model-len 8192 --port 8000  --gpu-memory-utilization 0.9

Dual-GPU deployment

docker run --runtime nvidia --gpus all \
    -e CUDA_VISIBLE_DEVICES=0,1 \
    -e HF_ENDPOINT=https://hf-mirror.com \
    -v /home/server/AI/models:/root/.cache/huggingface \
    -p 8000:8000 \
    -d \
    --name Qwen1.5-7B-Chat-G01 \
    vllm/vllm-openai:v0.4.0 --served-model-name Qwen1.5-14B-Chat --model Qwen/Qwen1.5-14B-Chat --max-model-len 4096 --port 8000 --tensor-parallel-size 2 --gpu-memory-utilization 0.9

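Docker restart policies:

no: the default policy; the container is not restarted when it exits.
on-failure: the container is restarted only when it exits abnormally (non-zero exit status).
always: the container is always restarted, regardless of the exit status.

`docker run -d --restart=always your_image`

Check the restart policy of a running container:

`sudo docker inspect -f "{{ .HostConfig.RestartPolicy }}" your_container`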
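The three policies can be summarised as a small decision function. The sketch below assumes the policy is read as JSON with `docker inspect -f "{{json .HostConfig.RestartPolicy}}" your_container`, which prints an object with a `Name` field; it models only the three policies described here and ignores details such as retry limits or manually stopped containers.

```python
import json


def parse_restart_policy(inspect_json: str) -> str:
    """Return the policy name from the JSON printed by docker inspect."""
    name = json.loads(inspect_json).get("Name", "")
    # Treat an empty name as the default policy "no".
    return name or "no"


def will_restart(policy: str, exit_code: int) -> bool:
    """Simplified model of the three restart policies described above."""
    if policy == "always":
        return True
    if policy == "on-failure":
        return exit_code != 0
    if policy == "no":
        return False
    raise ValueError(f"policy not covered by this sketch: {policy}")


# Sample output for a container started with --restart=always.
sample = '{"Name": "always", "MaximumRetryCount": 0}'
policy = parse_restart_policy(sample)
print(policy, will_restart(policy, exit_code=0))
```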

Troubleshooting

libnccl2 cannot be found:
sudo apt install libnccl2 libnccl-dev

KeyError: 'qwen2'
The official documentation says to upgrade to `transformers>=4.37.0`, but I actually ended up upgrading to transformers 4.38.2.

https://blog.kala.love/posts/e99e83d8/

Author: 久远·卡拉

Published on March 27, 2024
