vLLM
vLLM can be deployed using a Docker image we provide, or directly from the Python package.
If you are deploying a given model for the first time, you first need to visit the model's card page on the Hugging Face website and accept its conditions of access. This is a one-time operation per model and does not affect the model's license terms.
With Docker
On a GPU-enabled host, you can run the Mistral AI LLM Inference image with one of the following commands to download the model from Hugging Face and start the server:
- Mistral-7B:

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2

- Mixtral-8X7B:

# Adapt --tensor-parallel-size to the number of GPUs available.
# --load-format pt is needed since both `pt` and `safetensors` weights are available.
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --load-format pt

- Mixtral-8X22B:

# Adapt --tensor-parallel-size to the number of GPUs available.
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --tensor-parallel-size 4
Where HF_TOKEN is an environment variable containing your Hugging Face user access token.
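For example, you can set it in your shell before running the command above (replace the placeholder with a token generated in your Hugging Face account settings):

export HF_TOKEN=<your_access_token>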
This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the API section.
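As a quick sanity check once the container is up, you can send a chat completion request to the endpoint. This is a minimal sketch assuming the Mistral-7B server above is running locally on port 8000:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }'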
If your GPU has CUDA capabilities below 8.0, you will see the error ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0. In that case, you need to pass the parameter --dtype half to the Docker command line.
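For example, here is the Mistral-7B command from above with that flag appended:

docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype half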
The Dockerfile for this image can be found in our reference implementation repository on GitHub.
Without Docker
Alternatively, you can directly spawn a vLLM server on a GPU-enabled host with CUDA 11.8.
Install vLLM
First, you need to install vLLM (or use conda install vllm if you are using Anaconda):
pip install vllm
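If you want to keep dependencies isolated, you can install the package inside a fresh virtual environment instead, for example:

python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm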
Log in to the Hugging Face Hub
You will also need to log in to the Hugging Face Hub using:
huggingface-cli login
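This command prompts for your token interactively. Recent versions of huggingface-cli also accept the token directly as a flag, which is convenient in scripts (check that your installed version supports it):

huggingface-cli login --token $HF_TOKEN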
Run the OpenAI-compatible inference endpoint
You can then start the server with one of the following commands:
- Mistral-7B:

python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2

- Mixtral-8X7B:

# Adapt --tensor-parallel-size to the number of GPUs available.
# --load-format pt is needed since both `pt` and `safetensors` weights are available.
python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --load-format pt

- Mixtral-8X22B:

# Adapt --tensor-parallel-size to the number of GPUs available.
python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --tensor-parallel-size 4
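Once the server is running, you can query it the same way as the Docker deployment. Besides the chat completions endpoint shown earlier, the server also exposes the text completions endpoint; this is a minimal sketch assuming the Mistral-7B server above is listening on localhost:8000:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "San Francisco is a",
        "max_tokens": 16
    }'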