Query Endpoints

Once you've created an endpoint, you can call it through several different API styles depending on your needs.

Viewing Usage Examples

To see code examples for your endpoint, navigate to the Endpoints list and click either the Use button or the endpoint name itself. This opens a modal with comprehensive usage examples tailored to your specific endpoint.

Usage Modal

The usage modal organizes examples into two categories: unified APIs that work across any provider, and passthrough APIs that expose provider-specific features.

Unified APIs

Unified APIs provide a consistent interface regardless of the underlying model provider. These APIs make it easy to switch between different models or providers without changing your application code.

MLflow Invocations API

The MLflow Invocations API is the native interface for calling gateway endpoints. It handles model switching transparently and supports advanced routing features such as traffic splitting and fallbacks:

bash
curl -X POST http://localhost:5000/gateway/my-endpoint/mlflow/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

API Specification

The MLflow Invocations API supports both OpenAI-style chat completions and embeddings endpoints.

Endpoint URL Pattern:

text
POST /gateway/{endpoint_name}/mlflow/invocations

Chat Completions Request Body:

The request body follows the OpenAI chat completions format and supports the parameters below. See the OpenAI Chat Completions API Reference for complete documentation.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| messages | array | Yes | Array of message objects with role and content fields. |
| temperature | number | No | Sampling temperature between 0 and 2. Higher values make output more random. |
| max_tokens | integer | No | Maximum number of tokens to generate. |
| top_p | number | No | Nucleus sampling parameter between 0 and 1. Alternative to temperature. |
| n | integer | No | Number of completions to generate. Default is 1. |
| stream | boolean | No | Whether to stream responses. |
| stream_options | object | No | Options for streaming responses. |
| stop | array | No | List of sequences where the API will stop generating tokens. |
| presence_penalty | number | No | Penalizes new tokens based on presence in text so far. Range: -2.0 to 2.0. |
| frequency_penalty | number | No | Penalizes new tokens based on frequency in text so far. Range: -2.0 to 2.0. |
| tools | array | No | List of tools the model can call. Each tool includes a type and a function with name, description, and parameters. |
| response_format | object | No | Format for the model output. Can specify "text", "json_object", or "json_schema" with a schema definition. |
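
If the underlying model supports tool calling, pass tool definitions through the tools parameter. As a minimal sketch, the request below defines a hypothetical get_weather function (the tool schema itself follows the OpenAI format described above):

bash
# get_weather is a hypothetical tool definition, used here for illustration
curl -X POST http://localhost:5000/gateway/my-endpoint/mlflow/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'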

Response Format:

json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-5",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

Streaming Responses:

When stream: true is set, the response is sent as Server-Sent Events (SSE):

text
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-5","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-5","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: [DONE]
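
To produce this stream, set stream to true in the request body. For example, the earlier invocations request becomes:

bash
# -N (--no-buffer) lets curl print SSE chunks as they arrive
curl -N -X POST http://localhost:5000/gateway/my-endpoint/mlflow/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'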

Embeddings Request Body:

For embeddings endpoints, the request body follows the OpenAI embeddings format. See OpenAI Embeddings API Reference for complete documentation.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input | string or array | Yes | Input text(s) to embed. Can be a single string or an array of strings. |
| encoding_format | string | No | Format to return embeddings in. Options: "float" (default) or "base64". |
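
For example, an embeddings request uses the same invocations URL pattern (my-embeddings-endpoint below is a placeholder for an endpoint backed by an embeddings model):

bash
# my-embeddings-endpoint is a placeholder; substitute your own endpoint name
curl -X POST http://localhost:5000/gateway/my-embeddings-endpoint/mlflow/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The quick brown fox"
  }'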

Embeddings Response Format:

json
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "embedding": [0.0023064255, -0.009327292, ...],
    "index": 0
  }],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}

OpenAI-Compatible Chat Completions API

For teams already using OpenAI-style chat completions APIs, the gateway provides an OpenAI-compatible interface. Simply point your OpenAI client at the gateway's base URL and use your endpoint name as the model parameter. This lets you reuse existing OpenAI-based code while gaining the gateway's routing capabilities.

See OpenAI Chat Completions API Reference for complete documentation.

bash
curl -X POST http://localhost:5000/gateway/mlflow/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-endpoint",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Passthrough APIs

Passthrough APIs relay requests to the provider's LLM endpoint in the provider's native format, so you can use the provider's native client SDKs with the MLflow Gateway. While unified APIs cover most use cases, passthrough APIs give you full access to provider-specific features that may not be available through the unified interface.

For detailed information on passthrough APIs for each provider, see Model Providers.

OpenAI Passthrough

The OpenAI passthrough API exposes the full OpenAI API including Chat Completions, Embeddings, and Responses endpoints. See OpenAI API Reference for complete documentation.

bash
curl -X POST http://localhost:5000/gateway/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-endpoint",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
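
Other OpenAI routes follow the same URL pattern. As a sketch, an embeddings request would look like the following, assuming the embeddings route mirrors OpenAI's path layout under the same prefix:

bash
# Sketch: assumes /gateway/openai/v1/embeddings mirrors OpenAI's path layout
curl -X POST http://localhost:5000/gateway/openai/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-endpoint",
    "input": "The quick brown fox"
  }'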

Anthropic Passthrough

Access Anthropic's Messages API directly through the gateway. See Anthropic API Reference for complete documentation.

bash
curl -X POST http://localhost:5000/gateway/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-endpoint",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Google Gemini Passthrough

The Gemini passthrough API follows Google's API structure. See Google Gemini API Reference for complete documentation.

bash
curl -X POST http://localhost:5000/gateway/gemini/v1beta/models/my-endpoint:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{
      "parts": [{"text": "Hello!"}]
    }]
  }'