vLLM
vLLM can be deployed using a Docker image we provide, or directly from the Python package.
If you are deploying a given model for the first time, you first need to visit the model's card page on the Hugging Face website and accept its conditions of access. This is a one-time operation per model and does not affect the model's license terms.
With Docker
On a GPU-enabled host, you can run the Mistral AI LLM Inference image with one of the following commands to download the model from Hugging Face and start the server:
- Mistral-7B:

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2

- Mixtral-8X7B:

# Adapt --tensor-parallel-size to the number of GPUs available.
# --load-format pt is needed since both `pt` and `safetensors` weights are available.
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --load-format pt

- Mixtral-8X22B:

# Adapt --tensor-parallel-size to the number of GPUs available.
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --tensor-parallel-size 4
Where HF_TOKEN is an environment variable containing your Hugging Face user access token.
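For example, you can set it in your shell before running the command above (replace the placeholder with a token generated in your Hugging Face account settings):

export HF_TOKEN=<your_access_token>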
This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the API section.
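As a quick sanity check once the container is up, you can send a chat completion request to the endpoint. This is a minimal sketch assuming the Mistral-7B server above is running locally on port 8000:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }'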
If your GPU has CUDA capabilities below 8.0, you will see the error ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0. In that case, you need to pass the parameter --dtype half to the Docker command line.
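For example, here is the Mistral-7B command from above with that flag appended:

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype half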
The Dockerfile for this image can be found in our reference implementation repository on GitHub.
Without Docker
Alternatively, you can directly spawn a vLLM server on a GPU-enabled host with CUDA 11.8.
Install vLLM
First, you need to install vLLM (or use conda install vllm if you are using Anaconda):
pip install vllm
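If you want to keep dependencies isolated, you can install the package inside a fresh virtual environment instead, for example:

python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm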
Log in to the Hugging Face Hub
You will also need to log in to the Hugging Face Hub using:
huggingface-cli login
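This command prompts for your token interactively. Recent versions of huggingface-cli also accept the token directly as a flag, which is convenient in scripts (check that your installed version supports it):

huggingface-cli login --token $HF_TOKEN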
Run the OpenAI-compatible inference endpoint
You can then start the server with one of the following commands:
- Mistral-7B:

python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2

- Mixtral-8X7B:

# Adapt --tensor-parallel-size to the number of GPUs available.
# --load-format pt is needed since both `pt` and `safetensors` weights are available.
python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --load-format pt

- Mixtral-8X22B:

# Adapt --tensor-parallel-size to the number of GPUs available.
python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --tensor-parallel-size 4
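Once the server is running, you can query it the same way as the Docker deployment. Besides the chat completions endpoint shown earlier, the server also exposes the text completions endpoint; this is a minimal sketch assuming the Mistral-7B server above is listening on localhost:8000:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "San Francisco is a",
        "max_tokens": 16
    }'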