Evaluating Prompts
Prompts are the core components of GenAI applications. However, iterating on prompts is challenging because it is hard to tell whether a new prompt actually performs better than the old one. MLflow provides a framework to systematically evaluate prompt templates and track performance over time.

Workflow
1. Create prompt template(s): Define and register your prompt templates in the MLflow Prompt Registry for version control and easy access.
2. Prepare evaluation dataset: Create test cases with inputs and expected outcomes to systematically evaluate prompt performance.
3. Define a wrapper function to generate responses: Wrap your prompt in a function that takes dataset inputs and generates responses using your model.
4. Define evaluation scorers: Set up built-in and custom scorers to measure quality, accuracy, and task-specific criteria.
5. Run evaluation: Execute the evaluation and review the results in the MLflow UI to analyze performance and iterate.
Example: Evaluating a Prompt Template
Prerequisites
First, install the required packages by running the following command:
pip install --upgrade "mlflow>=3.3" openai
MLflow stores evaluation results in a tracking server. Connect your local environment to a tracking server using one of the following methods:
- Local (pip)
- Local (docker)
- Remote MLflow Server
- Databricks
Local (pip)
For the fastest setup, you can install the mlflow Python package and run MLflow locally:
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
This will start the server on port 5000 of your local machine. Connect your notebook or IDE to the server by setting the tracking URI:
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
You can also browse the MLflow UI at http://localhost:5000.
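If you want the evaluation runs grouped together, you can also set a dedicated experiment. This is a minimal sketch; the experiment name "prompt-evaluation" is just an example and the experiment is created if it does not exist.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
# Group the evaluation runs under a dedicated experiment (created if it does not exist)
mlflow.set_experiment("prompt-evaluation")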
Local (docker)
MLflow provides a Docker Compose file to start a local MLflow server with a PostgreSQL database and a MinIO server.
git clone https://github.com/mlflow/mlflow.git
cd mlflow/docker-compose
cp .env.dev.example .env
docker compose up -d
This will start the server on port 5000 of your local machine. Connect your notebook or IDE to the server by setting the tracking URI. You can also browse the MLflow UI at http://localhost:5000.
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
Refer to the instructions in the MLflow repository for more details, e.g., how to override the default environment variables.
Remote MLflow Server
If you have a remote MLflow tracking server, configure the connection:
import os
import mlflow
# Set your MLflow tracking URI
os.environ["MLFLOW_TRACKING_URI"] = "http://your-mlflow-server:5000"
# Or directly in code
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
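If the remote server is protected with basic authentication, MLflow reads the credentials from environment variables. A minimal sketch with placeholder values:
import os

# Credentials for a tracking server protected by basic auth (values are placeholders)
os.environ["MLFLOW_TRACKING_USERNAME"] = "your-username"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "your-password"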
Databricks
If you have a Databricks account, configure the connection:
import mlflow
mlflow.login()
This will prompt you for your configuration details (the Databricks host URL and a personal access token).
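If you prefer a non-interactive setup (for example in CI), you can configure the same connection through environment variables instead. A sketch with placeholder values:
import os
import mlflow

# Workspace URL and personal access token (placeholder values)
os.environ["DATABRICKS_HOST"] = "https://your-workspace.cloud.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "your-personal-access-token"

# Point MLflow at the Databricks-hosted tracking server
mlflow.set_tracking_uri("databricks")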
If you are unsure about how to set up an MLflow tracking server, you can start with the cloud-based MLflow powered by Databricks: Sign up for free →
Step 1: Create prompt templates
Let's define a simple prompt template to evaluate. We use the MLflow Prompt Registry to save and version-control the prompt, though this is optional for evaluation.
import mlflow

# Define prompt templates. MLflow supports both text and chat format prompt templates.
PROMPT_V1 = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the following question.",
    },
    {
        "role": "user",
        # Use double curly braces to indicate variables.
        "content": "Question: {{question}}",
    },
]

# Register the prompt template to the MLflow Prompt Registry for version control
# and convenience of loading the prompt template. This is optional.
mlflow.genai.register_prompt(
    name="qa_prompt",
    template=PROMPT_V1,
    commit_message="Initial prompt",
)
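As a quick sanity check, you can load the registered template back from the registry and render it with a sample value. This sketch assumes the registration above produced version 1 of the prompt.
# Load the registered prompt back and render it with a sample question
prompt = mlflow.genai.load_prompt("prompts:/qa_prompt/1")
print(prompt.format(question="What causes rain?"))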
Step 2: Create evaluation dataset
The evaluation dataset is defined as a list of dictionaries, each with an inputs field and an expectations field.
eval_dataset = [
    {
        "inputs": {"question": "What causes rain?"},
        "expectations": {
            "key_concepts": ["evaporation", "condensation", "precipitation"]
        },
    },
    {
        "inputs": {"question": "Explain the difference between AI and ML"},
        "expectations": {
            "key_concepts": ["artificial intelligence", "machine learning", "subset"]
        },
    },
    {
        "inputs": {"question": "How do vaccines work?"},
        "expectations": {"key_concepts": ["immune", "antibodies", "protection"]},
    },
]
Step 3: Create prediction function
Now wrap the prompt template in a simple function that takes a question and generates a response using the prompt template. IMPORTANT: The function must accept keyword arguments matching the keys of the inputs field in the dataset, so we use question as the argument name here.
from openai import OpenAI

client = OpenAI()


@mlflow.trace
def predict_fn(question: str) -> str:
    # Load the prompt template from the registry and fill in the question variable.
    prompt = mlflow.genai.load_prompt("prompts:/qa_prompt/latest")
    rendered_prompt = prompt.format(question=question)
    response = client.chat.completions.create(
        model="gpt-4.1-mini", messages=rendered_prompt
    )
    return response.choices[0].message.content
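Before running the full evaluation, it can help to smoke-test the function on a single question. Because of the @mlflow.trace decorator, this call also logs a trace you can inspect in the UI.
# Quick manual check of the prediction function
print(predict_fn(question="What causes rain?"))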
Step 4: Define task-specific scorers
Finally, let's define a few scorers that determine the evaluation criteria. Here we use two types of scorers:
- Built-in LLM scorers for evaluating the qualitative aspects of the response.
- Custom heuristic scorer for evaluating the coverage of the key concepts.
from mlflow.entities import Feedback
from mlflow.genai import scorer
from mlflow.genai.scorers import Guidelines

# Define LLM scorers
is_concise = Guidelines(
    name="is_concise", guidelines="The response should be concise and to the point."
)
is_professional = Guidelines(
    name="is_professional", guidelines="The response should be in professional tone."
)


# Evaluate the coverage of the key concepts using a custom scorer
@scorer
def concept_coverage(outputs: str, expectations: dict) -> Feedback:
    concepts = set(expectations.get("key_concepts", []))
    included = {c for c in concepts if c.lower() in outputs.lower()}
    return Feedback(
        value=len(included) / len(concepts),
        rationale=(
            f"Included {len(included)} out of {len(concepts)} concepts. Missing: {concepts - included}"
        ),
    )
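It can be useful to sanity-check the heuristic on a hand-written response before wiring it into the evaluation. The sketch below assumes the @scorer-decorated object can still be called directly with the same keyword arguments; if that is not the case in your MLflow version, test the logic before decorating instead.
# Assumption: the decorated scorer can be invoked directly with keyword arguments.
feedback = concept_coverage(
    outputs="Rain is caused by evaporation of water, condensation into clouds, and precipitation.",
    expectations={"key_concepts": ["evaporation", "condensation", "precipitation"]},
)
print(feedback.value, feedback.rationale)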
LLM scorers use OpenAI's GPT-4.1-mini by default. You can use different models by passing the model parameter to the scorer constructor.
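For example, you could run the guideline judge against a different model. This sketch assumes MLflow's provider URI convention ("<provider>:/<model>") and an OpenAI model name; adjust it to the model identifiers available in your setup.
# Use a different judge model for this scorer (the model URI format is an assumption)
is_concise_alt = Guidelines(
    name="is_concise",
    guidelines="The response should be concise and to the point.",
    model="openai:/gpt-4o-mini",
)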
Step 5: Run evaluation
Now we are ready to run the evaluation!
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[is_concise, is_professional, concept_coverage],
)
Once the evaluation is done, open the MLflow UI in your browser and navigate to the experiment page. You should see that MLflow has created a new run and logged the evaluation results.

By clicking on each row in the results, you can open the trace and see the detailed scores and rationales.

Iterating on Prompts
Prompt evaluation is an iterative process. You can register a new prompt version, run the same evaluation again, and compare the results. The Prompt Registry keeps track of version changes and the lineage between prompt versions and evaluation results.
# Define V2 prompt template
PROMPT_V2 = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the following question in three sentences.",
    },
    {"role": "user", "content": "Question: {{question}}"},
]

mlflow.genai.register_prompt(name="qa_prompt", template=PROMPT_V2)

# Run the same evaluation again. The new version is picked up automatically
# because predict_fn loads the prompt via "prompts:/qa_prompt/latest".
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[is_concise, is_professional, concept_coverage],
)
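If you want to evaluate a specific registered version rather than whatever is newest, you can pin the version inside the prediction function instead of using the latest URI. A minimal variant of predict_fn is sketched below; the version number 1 is an example.
@mlflow.trace
def predict_fn_v1(question: str) -> str:
    # Pin the prompt to a specific registered version instead of the latest one
    prompt = mlflow.genai.load_prompt("prompts:/qa_prompt/1")
    rendered_prompt = prompt.format(question=question)
    response = client.chat.completions.create(
        model="gpt-4.1-mini", messages=rendered_prompt
    )
    return response.choices[0].message.content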
Compare Evaluation Results
Once you have multiple evaluation runs, you can compare the results side by side to analyze performance changes. To see the comparison view, open the evaluation result page for one of the runs and pick another run to compare from the dropdown at the top.

MLflow will load the evaluation results for the two runs and display the comparison view. In this example, you can see that the overall is_concise score improved by 33%, while concept coverage dropped by 11%. The small ↗️/↘️ arrow in each row indicates where the change is coming from.
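You can also pull the aggregated metrics of the evaluation runs programmatically and compare them outside the UI, for example with mlflow.search_runs. In this sketch the experiment name is a placeholder, and the exact metric column names depend on your scorers.
import mlflow

# Fetch the runs of the experiment as a pandas DataFrame; metric columns are prefixed with "metrics."
runs = mlflow.search_runs(experiment_names=["prompt-evaluation"])
print(runs.filter(regex="metrics\\.").head())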

Next steps
Customize Scorers
Build specialized evaluation metrics for your specific use cases and requirements.
Evaluate Agents
Evaluate complex AI agents with tool calling and multi-step workflows.
Optimize Prompts
Use automated optimization techniques to systematically improve your prompts.