
💥 Evaluate LLMs - OpenAI Proxy Server

A simple, fast, and lightweight OpenAI-compatible server to call 100+ LLM APIs.

LiteLLM Server supports:

  • Call Huggingface/Bedrock/TogetherAI/etc. in the OpenAI ChatCompletions format
  • Set custom prompt templates + model-specific configs (temperature, max_tokens, etc.)
  • Caching (In-memory + Redis)


info

We want to learn how we can make the server better! Meet the founders or join our discord

Quick Start​

$ litellm --model huggingface/bigcode/starcoder

OpenAI Proxy running on http://0.0.0.0:8000

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'

This will now automatically route any requests for gpt-3.5-turbo to bigcode/starcoder, hosted on Huggingface Inference Endpoints.
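
The same request can also be made from Python by pointing api_base at the proxy - a minimal sketch using the pre-v1 openai client that the examples later on this page use (the api_key is just a placeholder, since provider keys are configured on the server):

import openai

openai.api_base = "http://0.0.0.0:8000"  # point the OpenAI SDK at the litellm proxy
openai.api_key = "anything"              # placeholder - provider keys are set via the server's environment

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # routed by the proxy to huggingface/bigcode/starcoder
    messages=[{"role": "user", "content": "Say this is a test!"}],
    temperature=0.7,
)
print(completion.choices[0].message.content)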

Other supported models:​

$ export AWS_ACCESS_KEY_ID=""
$ export AWS_REGION_NAME="" # e.g. us-west-2
$ export AWS_SECRET_ACCESS_KEY=""
$ litellm --model bedrock/anthropic.claude-v2


[TUTORIAL] LM-Evaluation Harness with TGI

Evaluate LLMs 20x faster with TGI via litellm proxy's /completions endpoint.

This tutorial assumes you're using lm-evaluation-harness

Step 1: Start the local proxy

$ litellm --model huggingface/bigcode/starcoder

OpenAI Compatible Endpoint at http://0.0.0.0:8000

Step 2: Set OpenAI API Base

$ export OPENAI_API_BASE="http://0.0.0.0:8000"

Step 3: Run LM-Eval-Harness

$ python3 main.py \
--model gpt3 \
--model_args engine=huggingface/bigcode/starcoder \
--tasks hellaswag

Endpoints:​

  • /chat/completions - chat completions endpoint to call 100+ LLMs
  • /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
  • /models - available models on the server (see the example below)
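
For example, here's a minimal sketch (assuming the proxy from the Quick Start is running on port 8000) that queries the /models endpoint with the requests library:

import requests

# Ask the proxy which models are available
resp = requests.get("http://0.0.0.0:8000/models")
resp.raise_for_status()
print(resp.json())  # list of models configured on the server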

Set Custom Prompt Templates​

LiteLLM by default checks if a model has a prompt template and applies it (e.g. if a huggingface model has a saved chat template in its tokenizer_config.json). However, you can also set a custom prompt template on your proxy in the config.yaml:

Step 1: Save your prompt template in a config.yaml

# Model-specific parameters
model_list:
  - model_name: mistral-7b # model alias
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1"
      api_base: "<your-api-base>"
      api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
      initial_prompt_value: "\n"
      roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
      final_prompt_value: "\n"
      bos_token: "<s>"
      eos_token: "</s>"
      max_tokens: 4096

Step 2: Start server with config

$ litellm --config /path/to/config.yaml
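
To make the template above concrete, here is an illustrative sketch (not litellm's internal code) of how the roles, initial_prompt_value, and final_prompt_value from the config wrap a message list into a single prompt string:

# Illustrative only - mirrors the template fields from the config.yaml above
roles = {
    "system":    {"pre_message": "<|im_start|>system\n",    "post_message": "<|im_end|>"},
    "user":      {"pre_message": "<|im_start|>user\n",      "post_message": "<|im_end|>"},
    "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"},
}
initial_prompt_value, final_prompt_value = "\n", "\n"

def render_prompt(messages):
    prompt = initial_prompt_value
    for m in messages:
        role = roles[m["role"]]
        prompt += role["pre_message"] + m["content"] + role["post_message"]
    return prompt + final_prompt_value

print(render_prompt([{"role": "user", "content": "Say this is a test!"}]))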

Multiple Models​

If you have 1 model running on a local GPU and another that's hosted (e.g. on Runpod), you can call both via the same litellm server by listing them in your config.yaml.

model_list:
  - model_name: zephyr-alpha
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: huggingface/HuggingFaceH4/zephyr-7b-alpha
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: https://<my-hosted-endpoint>

$ litellm --config /path/to/config.yaml

Evaluate model​

If your repo lets you set the model name, you can call a specific model by passing in that model's name:

import openai
openai.api_base = "http://0.0.0.0:8000" # point the OpenAI SDK at the litellm proxy

completion = openai.ChatCompletion.create(model="zephyr-alpha", messages=[{"role": "user", "content": "Hello world"}])
print(completion.choices[0].message.content)

If your repo only lets you specify the API base, you can add the model name to the API base passed in:

import openai 
openai.api_base = "http://0.0.0.0:8000/openai/deployments/zephyr-alpha/chat/completions" # zephyr-alpha will be used

completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])
print(completion.choices[0].message.content)

Save Model-specific params (API Base, API Keys, Temperature, etc.)​

Use the router_config_template.yaml to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

Step 1: Create a config.yaml file

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2 # azure/<your-deployment-name>
      api_key: your_azure_api_key
      api_version: your_azure_api_version
      api_base: your_azure_api_base
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base

Step 2: Start server with config

$ litellm --config /path/to/config.yaml
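
With this config, both aliases can be called through the same proxy - a rough sketch, again using the pre-v1 openai client (the proxy maps gpt-3.5-turbo to the Azure deployment and mistral-7b to Ollama):

import openai

openai.api_base = "http://0.0.0.0:8000"  # the litellm proxy started above
openai.api_key = "anything"              # placeholder - real provider keys live in config.yaml

# Each alias is routed to the deployment defined in config.yaml
for model in ["gpt-3.5-turbo", "mistral-7b"]:
    completion = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": "Hello world"}],
    )
    print(model, "->", completion.choices[0].message.content)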

Model Alias​

Set a model alias for your deployments.

In the config.yaml the model_name parameter is the user-facing name to use for your deployment.

E.g.: if we want to expose a Huggingface TGI Mistral-7b deployment as 'mistral-7b' to our users, we might save it as:

model_list:
  - model_name: mistral-7b # ALIAS
    litellm_params:
      model: huggingface/mistralai/Mistral-7B-Instruct-v0.1 # ACTUAL NAME
      api_key: your_huggingface_api_key # [OPTIONAL] if deployed on huggingface inference endpoints
      api_base: your_api_base # url where model is deployed

Caching​

Add Redis Caching to your server via environment variables

### REDIS
REDIS_HOST = ""
REDIS_PORT = ""
REDIS_PASSWORD = ""

Docker command:

docker run -e REDIS_HOST=<your-redis-host> -e REDIS_PORT=<your-redis-port> -e REDIS_PASSWORD=<your-redis-password> -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest

Logging​

  1. Debug Logs - print the input/output params by setting SET_VERBOSE = "True".

Docker command:

docker run -e SET_VERBOSE="True" -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest

  2. Add Langfuse logging to your server via environment variables

### LANGFUSE
LANGFUSE_PUBLIC_KEY = ""
LANGFUSE_SECRET_KEY = ""
# Optional, defaults to https://cloud.langfuse.com
LANGFUSE_HOST = "" # optional

Docker command:

docker run -e LANGFUSE_PUBLIC_KEY=<your-public-key> -e LANGFUSE_SECRET_KEY=<your-secret-key> -e LANGFUSE_HOST=<your-langfuse-host> -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest

Local Usage​

$ git clone https://github.com/BerriAI/litellm.git
$ cd ./litellm/litellm_server
$ uvicorn main:app --host 0.0.0.0 --port 8000

Setting LLM API keys​

This server allows two ways of passing API keys to litellm

  • Environment Variables - This server by default assumes the LLM API Keys are stored in the environment variables
  • Dynamic Variables passed to /chat/completions
    • Set AUTH_STRATEGY=DYNAMIC in the Environment
    • Pass the required auth params api_key, api_base, api_version with the request params (see the sketch below)
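
As a rough sketch of the dynamic option (assuming AUTH_STRATEGY=DYNAMIC is set on the server and the auth fields are accepted as extra keys in the request body), the auth params travel with each request:

import requests

# With AUTH_STRATEGY=DYNAMIC, pass provider auth per request instead of via server env vars
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "api_key": "<your-provider-api-key>",
    "api_base": "<your-provider-api-base>",   # optional, e.g. for Azure
    "api_version": "<your-api-version>",      # optional, e.g. for Azure
}
resp = requests.post("http://0.0.0.0:8000/chat/completions", json=payload)
print(resp.json())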

Deploy on Google Cloud Run​

Click the button to deploy to Google Cloud Run


On a successful deploy, your Cloud Run shell will show the URL of the deployed service.

Testing your deployed server​

Assuming the required keys are set as Environment Variables

https://litellm-7yjrj3ha2q-uc.a.run.app is our example server; substitute it with your deployed Cloud Run app's URL

curl https://litellm-7yjrj3ha2q-uc.a.run.app/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'

Set LLM API Keys​

Environment Variables​

More info here

  1. In the Google Cloud console, go to Cloud Run

  2. Click on the litellm service

  3. Click Edit and Deploy New Revision

  4. Enter your environment variables, e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY

Advanced​

Caching - Completion() and Embedding() Responses​

Enable caching by adding the following credentials to your server environment

REDIS_HOST = ""       # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = "" # REDIS_PORT='18841'
REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'

Test Caching​

Send the same request twice:

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'

curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "write a poem about litellm!"}],
"temperature": 0.7
}'
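
To confirm the cache is being hit, here's a small sketch that times both calls with the requests library - if Redis caching is configured, the second call should return noticeably faster:

import time
import requests

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
}

# Send the identical request twice and compare latency;
# the second response should come from the cache.
for attempt in range(2):
    start = time.time()
    resp = requests.post("http://0.0.0.0:8000/v1/chat/completions", json=payload)
    print(f"call {attempt + 1}: {time.time() - start:.2f}s, status {resp.status_code}")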

Control caching per completion request​

Caching can be switched on/off per /chat/completions request

  • Caching on for completion - pass caching=True:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": true
    }'
  • Caching off for completion - pass caching=False:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "write a poem about litellm!"}],
    "temperature": 0.7,
    "caching": false
    }'
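
The same flag can be sent from Python as well - a rough sketch using the pre-v1 openai client, which forwards extra keyword arguments into the request body (treat that forwarding as an assumption and verify it with your client version):

import openai

openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "anything"  # placeholder - provider keys are configured on the server

# caching=True (or False) is forwarded in the request body to the proxy
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "write a poem about litellm!"}],
    temperature=0.7,
    caching=True,
)
print(completion.choices[0].message.content)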

Tutorials (Chat-UI, NeMO-Guardrails, PromptTools, Phoenix ArizeAI, Langchain, ragas, LlamaIndex, etc.)​

Start server:

`docker run -e PORT=8000 -p 8000:8000 ghcr.io/berriai/litellm:latest`

The server is now live on http://0.0.0.0:8000

Here's the docker-compose.yml for running LiteLLM Server with Mckay Wrigley's Chat-UI:

version: '3'
services:
  container1:
    image: ghcr.io/berriai/litellm:latest
    ports:
      - '8000:8000'
    environment:
      - PORT=8000
      - OPENAI_API_KEY=<your-openai-key>

  container2:
    image: ghcr.io/mckaywrigley/chatbot-ui:main
    ports:
      - '3000:3000'
    environment:
      - OPENAI_API_KEY=my-fake-key
      - OPENAI_API_HOST=http://container1:8000

Run this via:

docker-compose up