mlflow.genai
- class mlflow.genai.Agent(agent: _Agent)[source]
Bases: object
The agent configuration, used for generating responses in the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.LabelingSession(*, name: str, assigned_users: list[str], agent: str | None, label_schemas: list[str], labeling_session_id: str, mlflow_run_id: str, review_app_id: str, experiment_id: str, url: str, enable_multi_turn_chat: bool, custom_inputs: dict[str, typing.Any] | None)[source]
Bases: object
A session for labeling items in the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- add_dataset(dataset_name: str, record_ids: Optional[list[str]] = None) mlflow.genai.labeling.labeling.LabelingSession[source]
Add a dataset to the labeling session.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
dataset_name – The name of the dataset.
record_ids – Optional. The individual record ids to be added to the session. If not provided, all records in the dataset will be added.
- Returns
The updated labeling session.
- Return type
LabelingSession
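A minimal usage sketch (not part of the reference itself), assuming a Databricks workspace with mlflow[databricks] installed; the run ID and UC table name below are placeholders:

import mlflow.genai

session = mlflow.genai.get_labeling_session(run_id="<labeling-session-run-id>")
session = session.add_dataset(
    dataset_name="catalog.schema.eval_dataset",
    record_ids=["rec-001", "rec-002"],  # omit record_ids to add every record in the dataset
)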
- add_traces(traces: Union[Iterable[Trace], Iterable[str], pd.DataFrame]) LabelingSession[source]
Add traces to the labeling session.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
traces – Can be either: a) a pandas DataFrame with a ‘trace’ column. The ‘trace’ column should contain either mlflow.entities.Trace objects or their json string representations. b) an iterable of mlflow.entities.Trace objects. c) an iterable of json string representations of mlflow.entities.Trace objects.
- Returns
The updated labeling session.
- Return type
LabelingSession
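A hedged sketch of adding recent traces from an experiment (the experiment ID and run ID are placeholders):

import mlflow
import mlflow.genai

session = mlflow.genai.get_labeling_session(run_id="<labeling-session-run-id>")

# Fetch Trace objects from an experiment and add them to the session
traces = mlflow.search_traces(experiment_ids=["<experiment-id>"], return_type="list")
session = session.add_traces(traces)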
- set_assigned_users(assigned_users: list[str]) mlflow.genai.labeling.labeling.LabelingSession[source]
Set the assigned users for the labeling session.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
assigned_users – The list of users to assign to the session.
- Returns
The updated labeling session.
- Return type
LabelingSession
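Illustrative sketch (the run ID and user names are placeholders):

import mlflow.genai

session = mlflow.genai.get_labeling_session(run_id="<labeling-session-run-id>")
session = session.set_assigned_users(["alice@example.com", "bob@example.com"])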
- sync(to_dataset: str) None[source]
Sync the traces and expectations from the labeling session to a dataset.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
to_dataset – The name of the dataset to sync traces and expectations to.
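Illustrative sketch (the run ID and UC table name are placeholders):

import mlflow.genai

session = mlflow.genai.get_labeling_session(run_id="<labeling-session-run-id>")

# Write the session's traces and collected expectations to a dataset
session.sync(to_dataset="catalog.schema.labeled_eval_dataset")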
- class mlflow.genai.ReviewApp(app: _ReviewApp)[source]
Bases: object
A review app is used to collect feedback from stakeholders for a given experiment.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- add_agent(*, agent_name: str, model_serving_endpoint: str, overwrite: bool = False) mlflow.genai.labeling.labeling.ReviewApp[source]
Add an agent to the review app to be used to generate responses.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
agent_name – The name of the agent.
model_serving_endpoint – The model serving endpoint to be used by the agent.
overwrite – Whether to overwrite an existing agent with the same name.
- Returns
The updated review app.
- Return type
ReviewApp
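Illustrative sketch (the experiment ID, agent name, and serving endpoint name are placeholders):

import mlflow.genai

review_app = mlflow.genai.get_review_app(experiment_id="<experiment-id>")
review_app = review_app.add_agent(
    agent_name="support_bot",
    model_serving_endpoint="support-bot-endpoint",
    overwrite=False,
)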
- property agents: list[mlflow.genai.labeling.labeling.Agent]
The agents to be used to generate responses.
- remove_agent(agent_name: str) mlflow.genai.labeling.labeling.ReviewApp[source]
Remove an agent from the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
agent_name – The name of the agent to remove.
- Returns
The updated review app.
- Return type
ReviewApp
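Illustrative sketch (the experiment ID and agent name are placeholders):

import mlflow.genai

review_app = mlflow.genai.get_review_app(experiment_id="<experiment-id>")
review_app = review_app.remove_agent("support_bot")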
- class mlflow.genai.Scorer(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases: pydantic.main.BaseModel
- aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None
- property is_session_level_scorer: bool
Get whether this scorer is a session-level scorer.
Defaults to False. Child classes can override this property to return True or compute the value dynamically based on their configuration.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_dump(**kwargs) dict[str, typing.Any][source]
Override model_dump to include source code.
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters
self – The BaseModel instance.
context – The context.
- classmethod model_validate(obj: Any) Scorer[source]
Override model_validate to reconstruct scorer from source code.
- register(*, name: Optional[str] = None, experiment_id: Optional[str] = None) Scorer[source]
Register this scorer with the MLflow server.
This method registers the scorer for use with automatic trace evaluation in the specified experiment. Once registered, the scorer can be started to begin evaluating traces automatically.
- Parameters
name – Optional registered name for the scorer. If not provided, the current name property value will be used as a registered name.
experiment_id – The ID of the MLflow experiment to register the scorer for. If None, uses the currently active experiment.
- Returns
A new Scorer instance with server registration information.
Example
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

# Register a built-in scorer
mlflow.set_experiment("my_genai_app")
registered_scorer = RelevanceToQuery().register(name="relevance_scorer")
print(f"Registered scorer: {registered_scorer.name}")

# Register a custom scorer
from mlflow.genai.scorers import scorer


@scorer
def custom_length_check(outputs) -> bool:
    return len(outputs) > 100


registered_custom = custom_length_check.register(
    name="output_length_checker", experiment_id="12345"
)
- run(*, inputs=None, outputs=None, expectations=None, trace=None, session=None)[source]
- property sample_rate: float | None
Get the sample rate for this scorer. Available when registered for monitoring.
- start(*, name: Optional[str] = None, experiment_id: Optional[str] = None, sampling_config: mlflow.genai.scorers.base.ScorerSamplingConfig) Scorer[source]
Start registered scoring with the specified sampling configuration.
This method activates automatic trace evaluation for the scorer. The scorer will evaluate traces based on the provided sampling configuration, including the sample rate and optional filter criteria.
- Parameters
name – Optional scorer name. If not provided, uses the scorer’s registered name or default name.
experiment_id – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment.
sampling_config – Configuration object containing:
- sample_rate: Fraction of traces to evaluate (0.0 to 1.0). Required.
- filter_string: Optional MLflow search_traces compatible filter string.
- Returns
A new Scorer instance with updated sampling configuration.
Example
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, ScorerSamplingConfig

# Start scorer with 50% sampling rate
mlflow.set_experiment("my_genai_app")
scorer = RelevanceToQuery().register()
active_scorer = scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
print(f"Scorer is evaluating {active_scorer.sample_rate * 100}% of traces")

# Start scorer with filter to only evaluate specific traces
filtered_scorer = scorer.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=1.0, filter_string="YOUR_FILTER_STRING"
    )
)
- property status: mlflow.genai.scorers.base.ScorerStatus
Get the status of this scorer, using only the local state.
- stop(*, name: Optional[str] = None, experiment_id: Optional[str] = None) Scorer[source]
Stop registered scoring by setting sample rate to 0.
This method deactivates automatic trace evaluation for the scorer while keeping the scorer registered. The scorer can be restarted later using the start() method.
- Parameters
name – Optional scorer name. If not provided, uses the scorer’s registered name or default name.
experiment_id – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment.
- Returns
A new Scorer instance with sample rate set to 0.
Example
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, ScorerSamplingConfig

# Start and then stop a scorer
mlflow.set_experiment("my_genai_app")
scorer = RelevanceToQuery().register()
active_scorer = scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
print(f"Scorer is active: {active_scorer.sample_rate > 0}")

# Stop the scorer
stopped_scorer = active_scorer.stop()
print(f"Scorer is active: {stopped_scorer.sample_rate > 0}")

# The scorer remains registered and can be restarted later
restarted_scorer = stopped_scorer.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.3)
)
- update(*, name: Optional[str] = None, experiment_id: Optional[str] = None, sampling_config: mlflow.genai.scorers.base.ScorerSamplingConfig) Scorer[source]
Update the sampling configuration for this scorer.
This method modifies the sampling rate and/or filter criteria for an already registered scorer. It can be used to dynamically adjust how many traces are evaluated or change the filtering criteria without stopping and restarting the scorer.
- Parameters
name – Optional scorer name. If not provided, uses the scorer’s registered name or default name.
experiment_id – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment.
sampling_config – Configuration object containing:
- sample_rate: New fraction of traces to evaluate (0.0 to 1.0). Optional.
- filter_string: New MLflow search_traces compatible filter string. Optional.
- Returns
A new Scorer instance with updated configuration.
Example
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, ScorerSamplingConfig

# Start scorer with initial configuration
mlflow.set_experiment("my_genai_app")
scorer = RelevanceToQuery().register()
active_scorer = scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))

# Update to increase sampling rate during high traffic
updated_scorer = active_scorer.update(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5)
)
print(f"Updated sample rate: {updated_scorer.sample_rate}")

# Update to add filtering criteria
filtered_scorer = updated_scorer.update(
    sampling_config=ScorerSamplingConfig(filter_string="YOUR_FILTER_STRING")
)
print(f"Added filter: {filtered_scorer.filter_string}")
- class mlflow.genai.ScorerScheduleConfig(scorer: Scorer, scheduled_scorer_name: str, sample_rate: float, filter_string: Optional[str] = None)[source]
Bases:
objectA scheduled scorer configuration for automated monitoring of generative AI applications.
Scheduled scorers are used to automatically evaluate traces logged to MLflow experiments by production applications. They are part of Databricks Lakehouse Monitoring for GenAI, which helps track quality metrics like groundedness, safety, and guideline adherence alongside operational metrics like volume, latency, and cost.
When configured, scheduled scorers run automatically in the background to evaluate a sample of traces based on the specified sampling rate and filter criteria. The Assessments are displayed in the Traces tab of the MLflow experiment and can be used to identify quality issues in production.
- Parameters
scorer – The scorer function to run on sampled traces. Must be either a built-in scorer (e.g., Safety, Correctness) or a function decorated with @scorer. Subclasses of Scorer are not supported.
scheduled_scorer_name – The name for this scheduled scorer configuration within the experiment. This name must be unique among all scheduled scorers in the same experiment. We recommend using the scorer’s name (e.g., scorer.name) for consistency.
sample_rate – The fraction of traces to evaluate, between 0.0 and 1.0. For example, 0.1 means 10% of traces will be randomly selected for evaluation.
filter_string – An optional MLflow search_traces compatible filter string to apply before sampling traces. Only traces matching this filter will be considered for evaluation. Uses the same syntax as mlflow.search_traces().
Example
from mlflow.genai.scorers import Safety, scorer
from mlflow.genai.scheduled_scorers import ScorerScheduleConfig

# Using a built-in scorer
safety_config = ScorerScheduleConfig(
    scorer=Safety(),
    scheduled_scorer_name="production_safety",
    sample_rate=0.2,  # Evaluate 20% of traces
    filter_string="trace.status = 'OK'",
)


# Using a custom scorer
@scorer
def response_length(outputs):
    return len(str(outputs)) > 100


length_config = ScorerScheduleConfig(
    scorer=response_length,
    scheduled_scorer_name="adequate_length",
    sample_rate=0.1,  # Evaluate 10% of traces
    filter_string="trace.status = 'OK'",
)
Note
Scheduled scorers are executed automatically by Databricks and do not need to be manually triggered. The Assessments appear in the Traces tab of the MLflow experiment. Only traces logged directly to the experiment are monitored; traces logged to individual runs within the experiment are not evaluated.
Warning
This API is in Beta and may change or be removed in a future release without warning.
- scorer: Scorer
- mlflow.genai.create_dataset(name: str | None = None, experiment_id: str | list[str] | None = None, tags: dict[str, typing.Any] | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Note
Parameter uc_table_name is deprecated. Use name instead.
Create a dataset with the given name and associate it with the given experiment.
- Parameters
name – The name of the dataset. In Databricks, this is the UC table name.
experiment_id – The ID of the experiment(s) to associate the dataset with. If not provided, the current experiment is inferred from the environment.
tags – Dictionary of tags to apply to the dataset. Not supported in Databricks.
- Returns
An EvaluationDataset object representing the created dataset.
Examples
from mlflow.genai.datasets import create_dataset

# Create a dataset with a single experiment
dataset = create_dataset(
    name="customer_support_qa_v1",
    experiment_id="0",  # Default experiment
    tags={
        "version": "1.0",
        "purpose": "regression_testing",
        "model": "gpt-4",
        "team": "ml-platform",
    },
)
print(f"Created dataset: {dataset.dataset_id}")
# Output: Created dataset: d-1a2b3c4d5e6f7890abcdef1234567890

# Create a dataset linked to multiple experiments
multi_exp_dataset = create_dataset(
    name="cross_team_eval_dataset",
    experiment_id=["1", "2", "5"],  # Multiple experiment IDs
    tags={
        "coverage": "comprehensive",
        "status": "development",
    },
)

# Create a dataset without tags (minimal example)
simple_dataset = create_dataset(
    name="quick_test_dataset",
    experiment_id="3",  # Specific experiment
)
- mlflow.genai.create_labeling_session(name: str, *, assigned_users: Optional[list[str]] = None, agent: Optional[str] = None, label_schemas: Optional[list[str]] = None, enable_multi_turn_chat: bool = False, custom_inputs: Optional[dict[str, typing.Any]] = None) mlflow.genai.labeling.labeling.LabelingSession[source]
Create a new labeling session in the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
name – The name of the labeling session.
assigned_users – The users that will be assigned to label items in the session.
agent – The agent to be used to generate responses for the items in the session.
label_schemas – The label schemas to be used in the session.
enable_multi_turn_chat – Whether to enable multi-turn chat labeling for the session.
custom_inputs – Optional. Custom inputs to be used in the session.
- Returns
The created labeling session.
- Return type
LabelingSession
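A minimal sketch (the session name, user names, and label schema name are placeholders):

import mlflow.genai

session = mlflow.genai.create_labeling_session(
    name="chatbot_quality_review",
    assigned_users=["alice@example.com", "bob@example.com"],
    label_schemas=["expected_facts"],
    enable_multi_turn_chat=False,
)
print(session.url)  # share this URL with the assigned users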
- mlflow.genai.delete_dataset(name: str | None = None, dataset_id: str | None = None) None[source]
Note
Parameter uc_table_name is deprecated. Use name instead.
Delete a dataset.
- Parameters
name – The name of the dataset (Databricks only). In Databricks, this is the UC table name.
dataset_id – The ID of the dataset.
Note
In Databricks environments: Use ‘name’ to specify the dataset.
Outside of Databricks: Use ‘dataset_id’ to specify the dataset.
Examples
from mlflow.genai.datasets import delete_dataset, search_datasets

# Delete a specific dataset by ID (non-Databricks)
delete_dataset(dataset_id="d-4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e")

# Clean up old test datasets
test_datasets = search_datasets(
    filter_string="name LIKE 'test_%' AND tags.environment = 'development'",
    order_by=["created_time ASC"],
)

# Delete datasets older than the most recent 5
if len(test_datasets) > 5:
    for dataset in test_datasets[:-5]:  # Keep the 5 most recent
        print(f"Deleting old test dataset: {dataset.name}")
        delete_dataset(dataset_id=dataset.dataset_id)

# Delete datasets with specific criteria
deprecated_datasets = search_datasets(filter_string="tags.status = 'deprecated'")
for dataset in deprecated_datasets:
    delete_dataset(dataset_id=dataset.dataset_id)
    print(f"Deleted deprecated dataset: {dataset.name}")
Warning
Deleting a dataset is permanent and cannot be undone. All associated records, tags, and metadata will be permanently removed.
- mlflow.genai.delete_dataset_tag(dataset_id: str, key: str) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Delete a tag from a dataset.
- Parameters
dataset_id – The ID of the dataset.
key – The tag key to delete.
Examples
from mlflow.genai.datasets import delete_dataset_tag, get_dataset

# Get your dataset
dataset = get_dataset(dataset_id="d-9e8f7c6b5a4d3e2f1a0b9c8d7e6f5a4b")

# Remove a single tag
delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")

# Remove outdated tags during cleanup
outdated_tags = ["old_version", "temp_flag", "development_only"]
for tag_key in outdated_tags:
    delete_dataset_tag(dataset_id=dataset.dataset_id, key=tag_key)

# Check remaining tags
updated_dataset = get_dataset(dataset_id=dataset.dataset_id)
print(f"Remaining tags: {updated_dataset.tags}")
Note
This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog.
- mlflow.genai.delete_labeling_session(labeling_session: mlflow.genai.labeling.labeling.LabelingSession)[source]
Delete a labeling session from the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
labeling_session – The labeling session to delete.
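Illustrative sketch that deletes sessions matching a naming convention (the "tmp_" prefix is hypothetical):

import mlflow.genai

for session in mlflow.genai.get_labeling_sessions():
    if session.name.startswith("tmp_"):
        mlflow.genai.delete_labeling_session(session)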
- mlflow.genai.delete_prompt_alias(name: str, alias: str) None[source]
Delete an alias for a Prompt in the MLflow Prompt Registry.
- Parameters
name – The name of the prompt.
alias – The alias to delete for the prompt.
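Illustrative sketch (the prompt name and alias are placeholders):

import mlflow

# Remove the "production" alias; the prompt versions themselves are kept
mlflow.genai.delete_prompt_alias(name="my_prompt", alias="production")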
- mlflow.genai.delete_prompt_model_config(name: str, version: str | int) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Delete the model configuration from a specific prompt version.
- Args:
name: The name of the prompt.
version: The version of the prompt.
Example:
import mlflow

# Remove model config from a prompt version
mlflow.genai.delete_prompt_model_config(name="my-prompt", version=1)

# Verify the config was removed
prompt = mlflow.genai.load_prompt("my-prompt", version=1)
assert prompt.model_config is None
- mlflow.genai.delete_prompt_tag(name: str, key: str) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Delete a tag from a prompt in the MLflow Prompt Registry.
- Args:
name: The name of the prompt.
key: The key of the tag.
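Illustrative sketch (the prompt name and tag key are placeholders):

import mlflow

# Remove a prompt-level tag (applies to the prompt, not an individual version)
mlflow.genai.delete_prompt_tag(name="my_prompt", key="owner")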
- mlflow.genai.delete_prompt_version_tag(name: str, version: str | int, key: str) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Delete a tag from a prompt version in the MLflow Prompt Registry.
- Args:
name: The name of the prompt.
version: The version of the prompt.
key: The key of the tag.
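Illustrative sketch (the prompt name, version, and tag key are placeholders):

import mlflow

# Remove a tag from version 2 of the prompt only
mlflow.genai.delete_prompt_version_tag(name="my_prompt", version=2, key="reviewed")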
- mlflow.genai.disable_git_model_versioning() None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Disable Git-based model versioning and clear the active model context.
This function stops automatic Git-based version tracking and clears any active LoggedModel context. After calling this, traces will no longer be automatically linked to Git-based versions.
This is automatically called when exiting a context manager created with enable_git_model_versioning().
Example:
import mlflow.genai

# Enable versioning
context = mlflow.genai.enable_git_model_versioning()

# ... do work with versioning enabled ...

# Disable versioning
mlflow.genai.disable_git_model_versioning()

# Traces are no longer linked to Git versions
- mlflow.genai.enable_git_model_versioning(remote_name: str = 'origin') mlflow.genai.git_versioning.GitContext[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Enable automatic Git-based model versioning for MLflow traces.
This function enables automatic version tracking based on your Git repository state. When enabled, MLflow will:
- Detect the current Git branch, commit hash, and dirty state
- Create or reuse a LoggedModel matching this exact Git state
- Link all subsequent traces to this LoggedModel version
- Capture uncommitted changes as diffs when the repository is dirty
- Parameters
remote_name – The name of the git remote to use for repository URL detection. Defaults to “origin”.
- Returns
A GitContext instance containing:
info – GitInfo object with branch, commit, dirty state, and diff information
active_model – The active LoggedModel linked to current Git state
- Return type
mlflow.genai.git_versioning.GitContext
Example:
import mlflow.genai

# Enable Git-based versioning
context = mlflow.genai.enable_git_model_versioning()
print(f"Branch: {context.info.branch}, Commit: {context.info.commit[:8]}")
# Output: Branch: main, Commit: abc12345


# All traces are now automatically linked to this Git version
@mlflow.trace
def my_app():
    return "result"


# Can also use as a context manager
with mlflow.genai.enable_git_model_versioning() as context:
    # Traces within this block are linked to the Git version
    result = my_app()
Note
If Git is not available or the current directory is not a Git repository, a warning is issued and versioning is disabled (context.info will be None).
- mlflow.genai.evaluate(data: EvaluationDatasetTypes, scorers: list[Scorer], predict_fn: Optional[Callable[[...], Any]] = None, model_id: str | None = None) mlflow.models.evaluation.base.EvaluationResult[source]
Evaluate the performance of a generative AI model/application using specified data and scorers.
This function allows you to evaluate a model’s performance on a given dataset using various scoring criteria. It supports both built-in scorers provided by MLflow and custom scorers. The evaluation results include metrics and detailed per-row assessments.
There are three different ways to use this function:
1. Use Traces to evaluate the model/application.
The data parameter takes a DataFrame with a trace column, which contains a single trace object corresponding to the prediction for the row. This dataframe is easily obtained from the existing traces stored in MLflow by using the mlflow.search_traces() function.

import mlflow
from mlflow.genai.scorers import Correctness, Safety
import pandas as pd

# model_id is a string starting with "m-", e.g. "m-074689226d3b40bfbbdf4c3ff35832cd"
trace_df = mlflow.search_traces(model_id="<my-model-id>")

mlflow.genai.evaluate(
    data=trace_df,
    scorers=[Correctness(), Safety()],
)
Built-in scorers will understand the model inputs, outputs, and other intermediate information, e.g. retrieved context, from the trace object. You can also access the trace object from the custom scorer function by using the trace parameter.
from mlflow.genai.scorers import scorer


@scorer
def faster_than_one_second(inputs, outputs, trace):
    return trace.info.execution_duration < 1000
2. Use DataFrame or dictionary with “inputs”, “outputs”, “expectations” columns.
Alternatively, you can pass inputs, outputs, and expectations (ground truth) as a column in the dataframe (or equivalent list of dictionaries).
import mlflow
from mlflow.genai.scorers import Correctness
import pandas as pd

data = pd.DataFrame(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": "MLflow is an ML platform",
            "expectations": "MLflow is an ML platform",
        },
        {
            "inputs": {"question": "What is Spark?"},
            "outputs": "I don't know",
            "expectations": "Spark is a data engine",
        },
    ]
)

mlflow.genai.evaluate(
    data=data,
    scorers=[Correctness()],
)
3. Pass `predict_fn` and input samples (and optionally expectations).
If you want to generate the outputs and traces on-the-fly from your input samples, you can pass a callable to the predict_fn parameter. In this case, MLflow will pass the inputs to the predict_fn as keyword arguments. Therefore, the “inputs” column must be a dictionary with the parameter names as keys.
import mlflow
from mlflow.genai.scorers import Correctness, Safety
import openai
import pandas as pd

# Create a dataframe with input samples
data = pd.DataFrame(
    [
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "What is Spark?"}},
    ]
)


# Define a predict function to evaluate. The "inputs" column will be
# passed to the prediction function as keyword arguments.
def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[Correctness(), Safety()],
)
- Parameters
data –
Dataset for the evaluation. Must be one of the following formats:
An EvaluationDataset entity
Pandas DataFrame
Spark DataFrame
List of dictionaries
List of Trace objects
The dataset must include either of the following columns:
- A trace column that contains a single trace object corresponding to the prediction for the row.
If this column is present, MLflow extracts inputs, outputs, assessments, and other intermediate information (e.g. retrieved context) from the trace object and uses them for scoring. When this column is present, the predict_fn parameter must not be provided.
- inputs, outputs, and expectations columns.
Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the dataframe (or equivalent list of dictionaries).
inputs (required): Column containing inputs for evaluation. The value must be a dictionary. When predict_fn is provided, MLflow will pass the inputs to the predict_fn as keyword arguments. For example:
predict_fn: def predict_fn(question: str, context: str) -> str
inputs: {"question": "What is MLflow?", "context": "MLflow is an ML platform"}
predict_fn will receive "What is MLflow?" as the first argument (question) and "MLflow is an ML platform" as the second argument (context).
outputs (optional): Column containing model or app outputs. If this column is present, predict_fn must not be provided.
expectations (optional): Column containing a dictionary of ground truths.
For a list of dictionaries, each dict should follow the above schema.
- Optional columns:
tags (optional): Column containing a dictionary of tags. The tags will be logged to the respective traces.
scorers – A list of Scorer objects that produces evaluation scores from inputs, outputs, and other additional contexts. MLflow provides pre-defined scorers, but you can also define custom ones.
predict_fn –
The target function to be evaluated. The specified function will be executed for each row in the input dataset, and outputs will be used for scoring.
The function must emit a single trace per call. If it doesn't, decorate the function with the @mlflow.trace decorator to ensure a trace is emitted.
Both synchronous and asynchronous (async def) functions are supported. Async functions are automatically detected and wrapped to run synchronously with a configurable timeout (default: 300 seconds). Set the timeout using the MLFLOW_GENAI_EVAL_ASYNC_TIMEOUT environment variable.
model_id – Optional model identifier (e.g. "m-074689226d3b40bfbbdf4c3ff35832cd") to associate with the evaluation results. Can also be set globally via the mlflow.set_active_model() function.
- Returns
An mlflow.models.EvaluationResult object.
Note
Certain advanced features of this function are only supported on Databricks. The tracking URI must be set to Databricks to use these features.
Warning
This function is not thread-safe. Please do not use it in multi-threaded environments.
- mlflow.genai.get_dataset(name: str | None = None, dataset_id: str | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Note
Parameter uc_table_name is deprecated. Use name instead.
Get the dataset with the given name or ID.
- Parameters
name – The name of the dataset (Databricks only). In Databricks, this is the UC table name.
dataset_id – The ID of the dataset.
- Returns
An EvaluationDataset object representing the retrieved dataset.
Note
In Databricks environments: Use ‘name’ to specify the dataset.
Outside of Databricks: Use ‘dataset_id’ to specify the dataset.
Examples
from mlflow.genai.datasets import get_dataset

# Get a dataset by ID (non-Databricks)
dataset = get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f6a7b8c9d0e1f2a3b")

# Access dataset properties
print(f"Dataset name: {dataset.name}")
print(f"Tags: {dataset.tags}")
print(f"Created by: {dataset.created_by}")

# Work with the dataset
df = dataset.to_df()  # Convert to pandas DataFrame
schema = dataset.schema  # Get auto-computed schema
profile = dataset.profile  # Get dataset statistics

# Add new records to the dataset
new_test_cases = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"accuracy": 0.95, "contains_tracking": True},
    }
]
dataset.merge_records(new_test_cases)
- mlflow.genai.get_labeling_session(run_id: str) mlflow.genai.labeling.labeling.LabelingSession[source]
Get a labeling session from the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
run_id – The mlflow run ID of the labeling session to get.
- Returns
The labeling session.
- Return type
LabelingSession
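Illustrative sketch (the run ID is a placeholder):

import mlflow.genai

session = mlflow.genai.get_labeling_session(run_id="<labeling-session-run-id>")
print(session.name, session.assigned_users, session.url)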
- mlflow.genai.get_labeling_sessions() list[mlflow.genai.labeling.labeling.LabelingSession][source]
Get all labeling sessions from the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Returns
The list of labeling sessions.
- Return type
list[LabelingSession]
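Illustrative sketch listing the sessions in the current review app:

import mlflow.genai

for session in mlflow.genai.get_labeling_sessions():
    print(f"{session.name}: {len(session.assigned_users)} assigned users - {session.url}")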
- mlflow.genai.get_prompt_tags(name: str) Prompt[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Get a prompt’s metadata from the MLflow Prompt Registry.
- Args:
name: The name of the prompt.
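Illustrative sketch (the prompt name is a placeholder; the returned Prompt object is assumed to expose its tags as a dict-like attribute):

import mlflow

prompt = mlflow.genai.get_prompt_tags(name="my_prompt")
print(prompt.tags)  # assumption: prompt-level tags are available via `tags`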
- mlflow.genai.get_review_app(experiment_id: Optional[str] = None) mlflow.genai.labeling.labeling.ReviewApp[source]
Gets or creates (if it doesn’t exist) the review app for the given experiment ID.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
experiment_id – Optional. The experiment ID for which to get the review app. If not provided, the experiment ID is inferred from the current active environment.
- Returns
The review app.
- Return type
ReviewApp
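Illustrative sketch (the experiment ID is a placeholder):

import mlflow.genai

review_app = mlflow.genai.get_review_app(experiment_id="<experiment-id>")
print(review_app.agents)  # agents currently configured to generate responses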
- mlflow.genai.load_prompt(name_or_uri: str, version: str | int | None = None, allow_missing: bool = False, link_to_model: bool = True, model_id: str | None = None, cache_ttl_seconds: float | None = None) PromptVersion[source]
Load a Prompt from the MLflow Prompt Registry. The prompt can be specified by name and version, or by URI.
- Parameters
name_or_uri – The name of the prompt, or the URI in the format “prompts:/name/version”.
version – The version of the prompt (required when using name, not allowed when using URI).
allow_missing – If True, return None instead of raising Exception if the specified prompt is not found.
link_to_model – If True, link the prompt to the model.
model_id – The ID of the model to link the prompt to. Only used if link_to_model is True.
cache_ttl_seconds – Time-to-live in seconds for the cached prompt. If not specified, uses the value from MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS environment variable for alias-based prompts (default 60), and the value from MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS environment variable for version-based prompts (default None, no TTL). Set to 0 to bypass the cache and always fetch from the server.
Example:
import mlflow

# Load the latest version of the prompt
prompt = mlflow.genai.load_prompt("my_prompt")

# Load a specific version of the prompt
prompt = mlflow.genai.load_prompt("my_prompt", version=1)

# Load a specific version of the prompt by URI
prompt = mlflow.genai.load_prompt("prompts:/my_prompt/1")

# Load a prompt version with an alias "production"
prompt = mlflow.genai.load_prompt("prompts:/my_prompt@production")

# Load the latest version of the prompt by URI
prompt = mlflow.genai.load_prompt("prompts:/my_prompt@latest")

# Load with custom cache TTL (5 minutes)
prompt = mlflow.genai.load_prompt("my_prompt", version=1, cache_ttl_seconds=300)

# Bypass cache entirely
prompt = mlflow.genai.load_prompt("my_prompt", version=1, cache_ttl_seconds=0)
- mlflow.genai.make_judge(name: str, instructions: str, model: str | None = None, description: str | None = None, feedback_value_type: Any = None, inference_params: dict[str, typing.Any] | None = None) mlflow.genai.judges.base.Judge[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Note
As of MLflow 3.4.0, this function is deprecated in favor of mlflow.genai.make_judge and may be removed in a future version.
Create a custom MLflow judge instance.
- Parameters
name – The name of the judge
instructions – Natural language instructions for evaluation. Must contain at least one template variable: {{ inputs }}, {{ outputs }}, {{ expectations }}, {{ conversation }}, or {{ trace }} to reference evaluation data. Custom variables are not supported. Note: {{ conversation }} can only coexist with {{ expectations }}. It cannot be used together with {{ inputs }}, {{ outputs }}, or {{ trace }}.
model – The model identifier to use for evaluation (e.g., “openai:/gpt-4”)
description – A description of what the judge evaluates
feedback_value_type –
Type specification for the ‘value’ field in the Feedback object. The judge will use structured outputs to enforce this type. If unspecified, the feedback value type is determined by the judge. It is recommended to explicitly specify the type.
Supported types (matching FeedbackValueType):
int: Integer ratings (e.g., 1-5 scale)
float: Floating point scores (e.g., 0.0-1.0)
str: Text responses
bool: Yes/no evaluations
Literal[values]: Enum-like choices (e.g., Literal[“good”, “bad”])
dict[str, int | float | str | bool]: Dictionary with string keys and int, float, str, or bool values.
list[int | float | str | bool]: List of int, float, str, or bool values
Note: Pydantic BaseModel types are not supported.
inference_params – Optional dictionary of inference parameters to pass to the model (e.g., temperature, top_p, max_tokens). These parameters allow fine-grained control over the model’s behavior during evaluation. For example, setting a lower temperature can produce more deterministic and reproducible evaluation results.
- Returns
An InstructionsJudge instance configured with the provided parameters
Example
import mlflow
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a judge that evaluates response quality using template variables
quality_judge = make_judge(
    name="response_quality",
    instructions=(
        "Evaluate if the response in {{ outputs }} correctly answers "
        "the question in {{ inputs }}. The response should be accurate, "
        "complete, and professional."
    ),
    model="openai:/gpt-4",
    feedback_value_type=Literal["yes", "no"],
)

# Evaluate a response
result = quality_judge(
    inputs={"question": "What is machine learning?"},
    outputs="ML is basically when computers learn stuff on their own",
)

# Create a judge that compares against expectations
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the {{ outputs }} against the {{ expectations }}. "
        "Rate how well they match on a scale of 1-5."
    ),
    model="openai:/gpt-4",
    feedback_value_type=int,
)

# Evaluate with expectations (must be dictionaries)
result = correctness_judge(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "The capital of France is Paris."},
    expectations={"expected_answer": "Paris"},
)

# Create a judge that evaluates based on trace context
trace_judge = make_judge(
    name="trace_quality",
    instructions="Evaluate the overall quality of the {{ trace }} execution.",
    model="openai:/gpt-4",
    feedback_value_type=Literal["good", "needs_improvement"],
)

# Use with search_traces() - evaluate each trace
traces = mlflow.search_traces(experiment_ids=["1"], return_type="list")
for trace in traces:
    feedback = trace_judge(trace=trace)
    print(f"Trace {trace.info.trace_id}: {feedback.value} - {feedback.rationale}")

# Create a multi-turn judge that detects user frustration
frustration_judge = make_judge(
    name="user_frustration",
    instructions=(
        "Analyze the {{ conversation }} to detect signs of user frustration. "
        "Look for indicators such as repeated questions, negative language, "
        "or expressions of dissatisfaction."
    ),
    model="openai:/gpt-4",
    feedback_value_type=Literal["frustrated", "not frustrated"],
)

# Evaluate a multi-turn conversation using session traces
session = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="metadata.`mlflow.trace.session` = 'session_123'",
    return_type="list",
)
result = frustration_judge(session=session)

# Align a judge with human feedback
aligned_judge = quality_judge.align(traces)

# To see detailed optimization output during alignment, enable DEBUG logging:
# import logging
# logging.getLogger("mlflow.genai.judges.optimizers.simba").setLevel(logging.DEBUG)
- mlflow.genai.optimize_prompt(*args, **kwargs)[source]
- mlflow.genai.optimize_prompts(*, predict_fn: Callable[[...], Any], train_data: EvaluationDatasetTypes, prompt_uris: list[str], optimizer: mlflow.genai.optimize.optimizers.base.BasePromptOptimizer, scorers: list[Scorer], aggregation: Optional[Callable[[dict[str, bool | float | str | Feedback | list[Feedback]]], float]] = None, enable_tracking: bool = True) mlflow.genai.optimize.types.PromptOptimizationResult[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Automatically optimize prompts using evaluation metrics and training data. This function uses the provided optimization algorithm to improve prompt quality based on your evaluation criteria and dataset.
- Parameters
predict_fn – a target function that uses the prompts to be optimized. The callable should receive inputs as keyword arguments and return the response. The function should use MLflow prompt registry and call PromptVersion.format during execution in order for this API to optimize the prompt. This function should return the same type as the outputs in the dataset.
train_data –
an evaluation dataset used for the optimization. It should include the inputs and outputs fields with dict values. The data must be one of the following formats:
An EvaluationDataset entity
Pandas DataFrame
Spark DataFrame
List of dictionaries
The dataset must include the following columns:
inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template.
outputs: A column containing an output for each input that the predict_fn should produce.
prompt_uris – a list of prompt uris to be optimized. The prompt templates should be used by the predict_fn.
optimizer – a prompt optimizer object that optimizes a set of prompts with the training dataset and scorers. For example, GepaPromptOptimizer(reflection_model=”openai:/gpt-4o”).
scorers – List of scorers that evaluate the inputs, outputs and expectations. Required parameter. Use builtin scorers like Equivalence or Correctness, or define custom scorers with the @scorer decorator.
aggregation – A callable that computes the overall performance metric from individual scorer outputs. Takes a dict mapping scorer names to scores and returns a float value (greater is better). If None and all scorers return numerical values, uses sum of scores by default.
enable_tracking – If True (default), automatically creates an MLflow run if no active run exists and logs the following information:
- The optimization scores (initial, final, improvement)
- Links to the optimized prompt versions
- The optimizer name and parameters
- Optimization progress
If False, no MLflow run is created and no tracking occurs.
- Returns
The optimization result object that includes the optimized prompts as a list of prompt versions, evaluation scores, and the optimizer name.
Examples
import mlflow
import openai
from mlflow.genai.optimize.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

prompt = mlflow.genai.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)


def predict_fn(question: str) -> str:
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return completion.choices[0].message.content


dataset = [
    {"inputs": {"question": "What is the capital of France?"}, "outputs": "Paris"},
    {"inputs": {"question": "What is the capital of Germany?"}, "outputs": "Berlin"},
]

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-4o"),
    scorers=[Correctness(model="openai:/gpt-4o")],
)

print(result.optimized_prompts[0].template)
Example: Using custom scorers with an objective function
import mlflow
from mlflow.genai.optimize.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import scorer


# Define custom scorers
@scorer(name="accuracy")
def accuracy_scorer(outputs, expectations):
    return 1.0 if outputs.lower() == expectations.lower() else 0.0


@scorer(name="brevity")
def brevity_scorer(outputs):
    # Prefer shorter outputs (max 50 chars gets score of 1.0)
    return min(1.0, 50 / max(len(outputs), 1))


# Define objective to combine scores
def weighted_objective(scores):
    return 0.7 * scores["accuracy"] + 0.3 * scores["brevity"]


result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-4o"),
    scorers=[accuracy_scorer, brevity_scorer],
    aggregation=weighted_objective,
)
- mlflow.genai.register_prompt(name: str, template: str | list[dict[str, typing.Any]], commit_message: str | None = None, tags: dict[str, str] | None = None, response_format: type[pydantic.main.BaseModel] | dict[str, typing.Any] | None = None, model_config: PromptModelConfig | dict[str, typing.Any] | None = None) PromptVersion[source]
Register a new Prompt in the MLflow Prompt Registry.
A Prompt is, at minimum, a pair of a name and template content. With the MLflow Prompt Registry, you can create, manage, and version control prompts with MLflow's robust model tracking framework.
If there is no registered prompt with the given name, a new prompt will be created. Otherwise, a new version of the existing prompt will be created.
- Parameters
name – The name of the prompt.
template –
The template content of the prompt. Can be either:
A string containing text with variables enclosed in double curly braces, e.g. {{variable}}, which will be replaced with actual values by the format method.
A list of dictionaries representing chat messages, where each message has ‘role’ and ‘content’ keys (e.g., [{“role”: “user”, “content”: “Hello {{name}}”}])
Note
If you want to use the prompt with a framework that uses single curly braces, e.g. LangChain, you can use the to_single_brace_format method to convert the loaded prompt to a format that uses single curly braces.
prompt = client.load_prompt("my_prompt")
langchain_format = prompt.to_single_brace_format()
commit_message – A message describing the changes made to the prompt, similar to a Git commit message. Optional.
tags – A dictionary of tags associated with the prompt version. This is useful for storing version-specific information, such as the author of the changes. Optional.
response_format – Optional Pydantic class or dictionary defining the expected response structure. This can be used to specify the schema for structured outputs from LLM calls.
model_config – Optional PromptModelConfig instance or dictionary containing model-specific configuration, including the model name and settings like temperature, top_p, and max_tokens. Using PromptModelConfig provides validation and type safety for common parameters.
Example (dict): {"model_name": "gpt-4", "temperature": 0.7}
Example (PromptModelConfig): PromptModelConfig(model_name="gpt-4", temperature=0.7)
- Returns
A Prompt object that was created.
Example:
import mlflow

# Register a text prompt
mlflow.genai.register_prompt(
    name="greeting_prompt",
    template="Respond to the user's message as a {{style}} AI.",
)

# Register a chat prompt with multiple messages
mlflow.genai.register_prompt(
    name="assistant_prompt",
    template=[
        {"role": "system", "content": "You are a helpful {{style}} assistant."},
        {"role": "user", "content": "{{question}}"},
    ],
    response_format={"type": "object", "properties": {"answer": {"type": "string"}}},
)

# Load and use the prompt
prompt = mlflow.genai.load_prompt("greeting_prompt")

# Use the prompt in your application
import openai

openai_client = openai.OpenAI()
openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt.format(style="friendly")},
        {"role": "user", "content": "Hello, how are you?"},
    ],
)

# Update the prompt with a new version
prompt = mlflow.genai.register_prompt(
    name="greeting_prompt",
    template="Respond to the user's message as a {{style}} AI. {{greeting}}",
    commit_message="Add a greeting to the prompt.",
    tags={"author": "Bob"},
)
- mlflow.genai.scorer(func=None, *, name: Optional[str] = None, description: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]]] = None)[source]
A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().
The scorer function should take in a subset of the following parameters:
- inputs
A single input to the target model/app.
Source: Derived from either the dataset or the trace. When the dataset contains an inputs column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).
- outputs
A single output from the target model/app.
Source: Derived from either the dataset, the trace, or the output of predict_fn. When the dataset contains an outputs column, the value will be passed as is. When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs. When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).
- expectations
Ground truth or any expectation for each prediction, e.g. expected retrieved docs.
Source: Derived from either the dataset or the trace. When the dataset contains an expectations column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format [assessment name]: [assessment value].
- trace
A trace object corresponding to the prediction for the row.
Source: Specified as a trace column in the dataset, or generated during the prediction.
The scorer function should return one of the following:
- A boolean value
- An integer value
- A float value
- A string value
- A single Feedback object
- A list of Feedback objects
Note
The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.
- Parameters
func – The scorer function to be decorated.
name – The name of the scorer.
description – A description of what the scorer evaluates.
aggregations –
A list of aggregation functions to apply to the scorer’s output. The aggregation functions can be either a string or a callable.
If a string, it must be one of [“min”, “max”, “mean”, “median”, “variance”, “p90”].
If a callable, it must take a list of values and return a single value.
By default, “mean” is used as the aggregation function.
Example
import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use a `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
    Judge if the following text is harmful or not.

    Text: {outputs}

    Return the answer in a JSON object with the following format:
    {{
        "harmful": true,
        "reason": "The text contains harmful content"
    }}

    Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorers in an evaluation
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
- mlflow.genai.search_datasets(experiment_ids: Optional[Union[str, list[str]]] = None, filter_string: Optional[str] = None, max_results: Optional[int] = None, order_by: Optional[list[str]] = None) list[mlflow.genai.datasets.evaluation_dataset.EvaluationDataset][source]
Note
Experimental: This function may change or be removed in a future release without warning.
Search for datasets (non-Databricks only).
Warning
Calling search_datasets() without any parameters will return ALL datasets in your tracking server. This can be slow or even crash your Python session if you have many datasets. Always use filters or max_results to limit the results.
- Parameters
experiment_ids – Single experiment ID (str) or list of experiment IDs to filter by. If None, searches across all experiments.
filter_string – SQL-like filter string for dataset attributes. If not specified, defaults to filtering for datasets created in the last 7 days. Supports filtering by:
- name: Dataset name
- created_by: User who created the dataset
- last_updated_by: User who last updated the dataset
- created_time: Creation timestamp (milliseconds since epoch)
- tags.<key>: Tag values
max_results – Maximum number of results. If not specified, returns all datasets.
order_by – List of columns to order by. Each entry can include an optional "DESC" or "ASC" suffix (default is "ASC"). If not specified, defaults to ["created_time DESC"]. Supported columns:
- name
- created_time
- last_update_time
- Returns
List of EvaluationDataset objects matching the search criteria
Examples
from mlflow.genai.datasets import search_datasets

# WARNING: This returns ALL datasets - use with caution!
# all_datasets = search_datasets()  # May be slow or crash

# Better: Always use filters or limits
recent_datasets = search_datasets(max_results=100)

# Search in specific experiments
exp_datasets = search_datasets(experiment_ids=["1", "2", "3"])

# Find production datasets
prod_datasets = search_datasets(
    filter_string="tags.environment = 'production'", order_by=["name ASC"]
)

# Iterate through results (pagination handled automatically)
for dataset in prod_datasets:
    print(f"{dataset.name} (ID: {dataset.dataset_id})")
    print(f"  Tags: {dataset.tags}")
Note
This API is not available in Databricks environments. Use Unity Catalog search capabilities in Databricks instead.
- mlflow.genai.search_prompts(filter_string: str | None = None, max_results: int | None = None) PagedList[Prompt][source]
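A hedged sketch; the filter syntax below assumes registry-style name filtering and is illustrative only:

import mlflow

prompts = mlflow.genai.search_prompts(filter_string="name LIKE 'qa_%'", max_results=50)
for prompt in prompts:
    print(prompt.name)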
- mlflow.genai.set_dataset_tags(dataset_id: str, tags: dict[str, typing.Any]) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Set tags for a dataset.
This implements a batch tag operation - existing tags are merged with new tags. To remove a tag, set its value to None or use delete_dataset_tag() instead.
- Parameters
dataset_id – The ID of the dataset.
tags – Dictionary of tags to set. Setting a value to None removes the tag.
Examples
from mlflow.genai.datasets import set_dataset_tags, get_dataset

# Get your dataset
dataset = get_dataset(dataset_id="d-8f3a2b1c4e5d6f7a8b9c0d1e2f3a4b5c")

# Add or update multiple tags
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "environment": "production",  # Add new tag
        "version": "2.0",  # Update existing tag
        "validated": "true",
        "validation_date": "2024-11-01",
        "team": "ml-platform",
    },
)

# Remove tags by setting to None
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "deprecated_tag": None,  # This removes the tag
        "old_version": None,  # This also removes the tag
    },
)

# Update status after validation
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "status": "production_ready",
        "coverage": "comprehensive",
        "last_review": "2024-11-01",
        "approved_by": "data_science_lead@company.com",
    },
)
Note
This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog.
- mlflow.genai.set_prompt_alias(name: str, alias: str, version: int) None[source]
Set an alias for a Prompt in the MLflow Prompt Registry.
- Parameters
name – The name of the prompt.
alias – The alias to set for the prompt.
version – The version of the prompt.
Example:
import mlflow

# Set an alias for the prompt
mlflow.genai.set_prompt_alias(name="my_prompt", version=1, alias="production")

# Load the prompt by alias (use "@" to specify the alias)
prompt = mlflow.genai.load_prompt("prompts:/my_prompt@production")

# Switch the alias to a new version of the prompt
mlflow.genai.set_prompt_alias(name="my_prompt", version=2, alias="production")

# Delete the alias
mlflow.genai.delete_prompt_alias(name="my_prompt", alias="production")
- mlflow.genai.set_prompt_model_config(name: str, version: str | int, model_config: PromptModelConfig | dict[str, typing.Any]) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Set or update the model configuration for a specific prompt version.
Model configuration includes model-specific settings such as model name, temperature, max_tokens, and other inference parameters. Unlike the prompt template, model configuration is mutable and can be updated after a prompt version is created.
- Args:
name: The name of the prompt.
version: The version of the prompt.
model_config: A PromptModelConfig or dict with model settings like model_name and temperature.
Example:
import mlflow
from mlflow.entities.model_registry import PromptModelConfig

# Set model config using a dictionary
mlflow.genai.set_prompt_model_config(
    name="my-prompt",
    version=1,
    model_config={"model_name": "gpt-4", "temperature": 0.7, "max_tokens": 1000},
)

# Set model config using PromptModelConfig for validation
config = PromptModelConfig(
    model_name="gpt-4-turbo",
    temperature=0.5,
    max_tokens=2000,
    top_p=0.95,
)
mlflow.genai.set_prompt_model_config(
    name="my-prompt",
    version=1,
    model_config=config,
)

# Load and verify the config was set
prompt = mlflow.genai.load_prompt("my-prompt", version=1)
print(prompt.model_config)
- mlflow.genai.set_prompt_tag(name: str, key: str, value: str) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Set a tag on a prompt in the MLflow Prompt Registry.
- Args:
name: The name of the prompt.
key: The key of the tag.
value: The value of the tag for the key.
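Illustrative sketch (the prompt name, key, and value are placeholders):

import mlflow

# Tag the prompt itself (prompt-level, not version-level)
mlflow.genai.set_prompt_tag(name="my_prompt", key="owner", value="ml-platform-team")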
- mlflow.genai.set_prompt_version_tag(name: str, version: str | int, key: str, value: str) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Set a tag on a prompt version in the MLflow Prompt Registry.
- Args:
name: The name of the prompt.
version: The version of the prompt.
key: The key of the tag.
value: The value of the tag for the key.
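Example (a minimal sketch; the prompt name, version, and tag below are illustrative):

import mlflow

# Mark a specific prompt version as validated
mlflow.genai.set_prompt_version_tag(
    name="my_prompt", version=2, key="status", value="validated"
)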
- mlflow.genai.to_predict_fn(endpoint_uri: str) Callable[[...], Any][source]
Convert an endpoint URI to a predict function.
- Parameters
endpoint_uri – The endpoint URI to convert.
- Returns
A predict function that can be used to make predictions.
Example
The following example assumes that the model serving endpoint accepts a JSON object with a messages key. Please adjust the input based on the actual schema of the model serving endpoint.
import mlflow
from mlflow.genai.scorers import get_all_scorers

data = [
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is MLflow?"},
            ]
        }
    },
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is Spark?"},
            ]
        }
    },
]
predict_fn = mlflow.genai.to_predict_fn("endpoints:/chat")
mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=get_all_scorers(),
)
You can also directly invoke the function to validate if the endpoint works properly with your input schema.
predict_fn(**data[0]["inputs"])
- class mlflow.genai.scorers.Completeness(*, name: str = 'completeness', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether the assistant fully addresses all user questions in a single turn.', required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Completeness evaluates whether an AI assistant fully addresses all user questions in a single user prompt.
For evaluating the completeness of a conversation, use the ConversationCompleteness scorer instead.
This scorer analyzes a single turn of interaction (user input and AI response) to determine if the AI successfully answered all questions and provided all requested information. It returns “yes” or “no”.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “completeness”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Completeness

assessment = Completeness(name="my_completeness_check")(
    inputs={"question": "What is MLflow and what are its main features?"},
    outputs="MLflow is an open-source platform for managing the ML lifecycle.",
)
print(assessment)  # Feedback with value "yes" or "no"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Completeness

data = [
    {
        "inputs": {"question": "What is MLflow and what are its main features?"},
        "outputs": "MLflow is an open-source platform.",
    },
]
result = mlflow.genai.evaluate(data=data, scorers=[Completeness()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for this judge.
- Returns
List of JudgeField objects defining the input fields.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.ConversationCompleteness(*, name: str = 'conversation_completeness', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether the assistant fully addresses all user requests by the end of the conversation.', required_columns: set[str] = {'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInSessionLevelScorer
Note
Experimental: This class may change or be removed in a future release without warning.
ConversationCompleteness evaluates whether an AI assistant fully addresses all user requests by the end of the conversation.
For evaluating the completeness of a single user prompt, use the Completeness scorer instead.
This scorer analyzes a complete conversation (represented as a list of traces) to determine if the assistant successfully addressed all the user’s requests in a conversation. It returns “yes” or “no”.
You can invoke the scorer directly with a session for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “conversation_completeness”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import ConversationCompleteness

# Retrieve a list of traces with the same session ID
session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
assessment = ConversationCompleteness(name="my_completion_check")(session=session)
print(assessment)  # Feedback with value "yes" or "no"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import ConversationCompleteness

session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
result = mlflow.genai.evaluate(data=session, scorers=[ConversationCompleteness()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.ConversationalRoleAdherence(*, name: str = 'conversational_role_adherence', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether an AI assistant maintains its assigned role throughout a conversation, checking for persona consistency and boundary violations.', required_columns: set[str] = {'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInSessionLevelScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Conversational role adherence evaluates whether an AI assistant maintains its assigned role throughout a conversation.
This scorer analyzes the complete conversation to evaluate whether the assistant adheres to its defined role as specified in the system message, or implicitly maintains a consistent persona throughout the interaction.
You can invoke the scorer directly with a session for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “conversational_role_adherence”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import ConversationalRoleAdherence

# Retrieve a list of traces with the same session ID
session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
assessment = ConversationalRoleAdherence()(session=session)
print(assessment)  # Feedback with value "yes" or "no"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import ConversationalRoleAdherence

session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
result = mlflow.genai.evaluate(data=session, scorers=[ConversationalRoleAdherence()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.ConversationalSafety(*, name: str = 'conversational_safety', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Evaluate whether the assistant's responses in a conversation are safe, checking for harmful content and safety guideline failures.", required_columns: set[str] = {'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInSessionLevelScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Conversational safety evaluates whether the assistant’s responses in a conversation are safe.
This scorer analyzes the assistant’s responses for safety concerns including:
Harmful content in assistant responses (hate speech, harassment, violence, illegal acts)
Cases where the assistant fails to maintain safety when faced with manipulation attempts
Assistant responses that enable or encourage harmful behavior
Note: User messages containing harmful content or manipulation attempts do not make a conversation unsafe. Only the assistant’s actual responses are evaluated.
You can invoke the scorer directly with a session for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “conversational_safety”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import ConversationalSafety

# Retrieve a list of traces with the same session ID
session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
assessment = ConversationalSafety()(session=session)
print(assessment)  # Feedback with value "yes" or "no"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import ConversationalSafety

session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
result = mlflow.genai.evaluate(data=session, scorers=[ConversationalSafety()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.ConversationalToolCallEfficiency(*, name: str = 'conversational_tool_call_efficiency', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether tool usage across a multi-turn conversation session was efficient, checking for redundant calls, unnecessary calls, and poor tool selection.', required_columns: set[str] = {'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInSessionLevelScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Conversational tool call efficiency evaluates whether tool usage across a multi-turn conversation session was optimized.
This scorer analyzes the complete conversation and tool call history to identify inefficiencies such as redundant calls, unnecessary invocations, or missed optimization opportunities.
You can invoke the scorer directly with a session for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “conversational_tool_call_efficiency”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import ConversationalToolCallEfficiency

# Retrieve a list of traces with the same session ID
session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
assessment = ConversationalToolCallEfficiency()(session=session)
print(assessment)  # Feedback with value "yes" or "no"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import ConversationalToolCallEfficiency

session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
result = mlflow.genai.evaluate(
    data=session, scorers=[ConversationalToolCallEfficiency()]
)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.Correctness(*, name: str = 'correctness', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Check whether the expected facts (from expected_response or expected_facts) are supported by the model's response.", required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Correctness evaluates whether the model’s response supports the expected facts or response.
This scorer checks if the facts specified in expected_response or expected_facts are supported by the model’s output. It answers the question: “Does the model’s response contain or support all the expected facts?”
Note
This scorer checks if expected facts are supported by the output, not whether the output is equivalent to the expected response. For direct equivalence comparison, use the Equivalence scorer instead.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “correctness”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Correctness

assessment = Correctness(name="my_correctness")(
    inputs={
        "question": "What is the difference between reduceByKey and groupByKey in Spark?"
    },
    outputs=(
        "reduceByKey aggregates data before shuffling, whereas groupByKey "
        "shuffles all data, making reduceByKey more efficient."
    ),
    expectations=[
        {"expected_response": "reduceByKey aggregates data before shuffling"},
        {"expected_response": "groupByKey shuffles all data"},
    ],
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Correctness

data = [
    {
        "inputs": {
            "question": (
                "What is the difference between reduceByKey and groupByKey in Spark?"
            )
        },
        "outputs": (
            "reduceByKey aggregates data before shuffling, whereas groupByKey "
            "shuffles all data, making reduceByKey more efficient."
        ),
        "expectations": {
            "expected_response": (
                "reduceByKey aggregates data before shuffling. "
                "groupByKey shuffles all data"
            ),
        },
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Correctness()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the Correctness judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters
self – The BaseModel instance.
context – The context.
- validate_columns(columns: set[str]) None[source]
- class mlflow.genai.scorers.Equivalence(*, name: str = 'equivalence', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Compare outputs against expected outputs for semantic equivalence.', required_columns: set[str] = {'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Equivalence compares outputs against expected outputs for semantic equivalence.
This scorer uses exact matching for numerical types (int, float, bool) and an LLM judge for text outputs to determine if they are semantically equivalent in both content and format.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate or mlflow.genai.optimize_prompts for evaluation.
- Parameters
name – The name of the scorer. Defaults to “equivalence”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Equivalence

# Numerical equivalence
assessment = Equivalence()(
    outputs=42,
    expectations={"expected_response": 42},
)
print(assessment)  # value: CategoricalRating.YES, rationale: 'Exact numerical match'

# Text equivalence
assessment = Equivalence()(
    outputs="The capital is Paris",
    expectations={"expected_response": "Paris is the capital"},
)
print(assessment)  # value: CategoricalRating.YES (semantically equivalent)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Equivalence

data = [
    {
        "outputs": "The capital is Paris",
        "expectations": {"expected_response": "Paris"},
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Equivalence()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the Equivalence scorer.
- Returns
List of JudgeField objects defining the input fields.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters
self – The BaseModel instance.
context – The context.
- validate_columns(columns: set[str]) None[source]
- class mlflow.genai.scorers.ExpectationsGuidelines(*, name: str = 'expectations_guidelines', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Evaluate whether the agent's response follows specific constraints or instructions provided for each row in the input dataset.", required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
This scorer evaluates whether the agent’s response follows specific constraints or instructions provided for each row in the input dataset. It is useful when you have a different set of guidelines for each example.
To use this scorer, the input dataset should contain the expectations column with the guidelines field. Then pass this scorer to mlflow.genai.evaluate for running full evaluation on the input dataset.
- Parameters
name – The name of the scorer. Defaults to “expectations_guidelines”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example:
In this example, the guidelines specified in the guidelines field of the expectations column will be applied to each example individually. The evaluation result will contain a single “expectations_guidelines” score.
import mlflow
from mlflow.genai.scorers import ExpectationsGuidelines

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {
            "guidelines": ["The response must be factual and concise"],
        },
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {
            "guidelines": ["The response must be helpful and encouraging"],
        },
    },
]
mlflow.genai.evaluate(data=data, scorers=[ExpectationsGuidelines()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the ExpectationsGuidelines judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters
self – The BaseModel instance.
context – The context.
- validate_columns(columns: set[str]) None[source]
- class mlflow.genai.scorers.Guidelines(*, name: str = 'guidelines', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Evaluate whether the agent's response follows specific constraints or instructions provided in the guidelines.", required_columns: set[str] = {'inputs', 'outputs'}, guidelines: str | list[str], model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “guidelines”.
guidelines – A single guideline text or a list of guidelines.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Guidelines

english = Guidelines(
    name="english_guidelines",
    guidelines=["The response must be in English"],
)
feedback = english(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(feedback)
Example (with evaluate):
In the following example, the guidelines specified in the english and clarify scorers will be uniformly applied to all the examples in the dataset. The evaluation result will contain two scores, “english” and “clarify”.
import mlflow
from mlflow.genai.scorers import Guidelines

english = Guidelines(
    name="english",
    guidelines=["The response must be in English"],
)
clarify = Guidelines(
    name="clarify",
    guidelines=["The response must be clear, coherent, and concise"],
)
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    },
    {
        "inputs": {"question": "What is the capital of Germany?"},
        "outputs": "The capital of Germany is Berlin.",
    },
]
mlflow.genai.evaluate(data=data, scorers=[english, clarify])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the Guidelines judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.RelevanceToQuery(*, name: str = 'relevance_to_query', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Ensure that the agent's response directly addresses the user's input without deviating into unrelated topics.", required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “relevance_to_query”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

assessment = RelevanceToQuery(name="my_relevance_to_query")(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RelevanceToQuery

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[RelevanceToQuery()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the RelevanceToQuery judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.RetrievalGroundedness(*, name: str = 'retrieval_groundedness', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Assess whether the facts in the response are implied by the information in the last retrieval step, i.e., hallucinations do not occur.', required_columns: set[str] = {'inputs', 'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
RetrievalGroundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “retrieval_groundedness”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

trace = mlflow.get_trace("<your-trace-id>")
feedback = RetrievalGroundedness(name="my_retrieval_groundedness")(trace=trace)
print(feedback)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RetrievalGroundedness

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalGroundedness()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the RetrievalGroundedness judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.RetrievalRelevance(*, name: str = 'retrieval_relevance', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether each retrieved context chunk is relevant to the input request.', required_columns: set[str] = {'inputs', 'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Retrieval relevance measures whether each chunk is relevant to the input request.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “retrieval_relevance”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

trace = mlflow.get_trace("<your-trace-id>")
feedbacks = RetrievalRelevance(name="my_retrieval_relevance")(trace=trace)
print(feedbacks)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RetrievalRelevance

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalRelevance()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the RetrievalRelevance judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.RetrievalSufficiency(*, name: str = 'retrieval_sufficiency', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether the information in the last retrieval is sufficient to generate the facts in expected_response or expected_facts.', required_columns: set[str] = {'inputs', 'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Retrieval sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “retrieval_sufficiency”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

trace = mlflow.get_trace("<your-trace-id>")
feedback = RetrievalSufficiency(name="my_retrieval_sufficiency")(trace=trace)
print(feedback)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[RetrievalSufficiency()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the RetrievalSufficiency judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters
self – The BaseModel instance.
context – The context.
- validate_columns(columns: set[str]) None[source]
- class mlflow.genai.scorers.Safety(*, name: str = 'safety', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Ensure that the agent's responses do not contain harmful, offensive, or toxic content.", required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “safety”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Safety

assessment = Safety(name="my_safety")(outputs="The capital of France is Paris.")
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Safety

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]
result = mlflow.genai.evaluate(data=data, scorers=[Safety()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for the Safety judge.
- Returns
List of JudgeField objects defining the input fields based on the __call__ method.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.ScorerSamplingConfig(sample_rate: Optional[float] = None, filter_string: Optional[str] = None)[source]
Bases: object
Configuration for registered scorer sampling.
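Example (construction only, based on the signature above; the sample rate and filter expression are illustrative, and how the config is consumed depends on your scorer registration workflow):

from mlflow.genai.scorers import ScorerSamplingConfig

# Score roughly 20% of traces, restricted to those matching the filter
config = ScorerSamplingConfig(
    sample_rate=0.2,
    filter_string="attributes.status = 'OK'",
)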
- class mlflow.genai.scorers.Summarization(*, name: str = 'summarization', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = 'Evaluate whether the summarization output is factually correct based on the input and does not make any assumptions not in the input, with a focus on faithfulness, coverage, and conciseness.', required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Summarization evaluates whether a summarization output is factually correct, grounded in the input, and provides reasonably good coverage of the input.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “summarization”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import Summarization

assessment = Summarization(name="my_summarization_check")(
    inputs={"text": "MLflow is an open-source platform for managing ML workflows..."},
    outputs="MLflow is an ML platform.",
)
print(assessment)  # Feedback with value "yes" or "no"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import Summarization

data = [
    {
        "inputs": {
            "text": "MLflow is an open-source platform for managing ML workflows..."
        },
        "outputs": "MLflow is an ML platform.",
    },
]
result = mlflow.genai.evaluate(data=data, scorers=[Summarization()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for this judge.
- Returns
List of JudgeField objects defining the input fields.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.ToolCallEfficiency(*, name: str = 'tool_call_efficiency', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Evaluate the agent's trajectory for redundancy in tool usage, such as tool calls with the same or similar arguments.", required_columns: set[str] = {'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInScorer
Note
Experimental: This class may change or be removed in a future release without warning.
ToolCallEfficiency evaluates the agent’s trajectory for redundancy in tool usage, such as tool calls with the same or similar arguments.
This scorer analyzes whether the agent makes redundant tool calls during execution. It checks for duplicate or near-duplicate tool invocations that could be avoided for more efficient task completion.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “tool_call_efficiency”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import ToolCallEfficiency

trace = mlflow.get_trace("<your-trace-id>")
feedback = ToolCallEfficiency(name="my_tool_call_efficiency")(trace=trace)
print(feedback)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import ToolCallEfficiency

data = mlflow.search_traces(...)
result = mlflow.genai.evaluate(data=data, scorers=[ToolCallEfficiency()])
- get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for this judge.
- Returns
List of JudgeField objects defining the input fields.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlflow.genai.scorers.UserFrustration(*, name: str = 'user_frustration', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str = "Evaluate the user's frustration state throughout the conversation.", required_columns: set[str] = {'trace'}, model: str | None = None)[source]
Bases: mlflow.genai.scorers.builtin_scorers.BuiltInSessionLevelScorer
Note
Experimental: This class may change or be removed in a future release without warning.
UserFrustration evaluates the user’s frustration state throughout the conversation with the AI assistant based on a conversation session.
This scorer analyzes a session of conversation (represented as a list of traces) to determine if the user shows explicit or implicit frustration directed at the AI. It evaluates the entire conversation and returns one of three values:
“no_frustration”: user not frustrated at any point in the conversation
“frustration_resolved”: user is frustrated at some point in the conversation, but leaves the conversation satisfied
“frustration_not_resolved”: user is still frustrated at the end of the conversation
You can invoke the scorer directly with a session for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
- Parameters
name – The name of the scorer. Defaults to “user_frustration”.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import UserFrustration

# Retrieve a list of traces with the same session ID
session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
assessment = UserFrustration(name="my_user_frustration_judge")(session=session)
print(assessment)
# Feedback with value "no_frustration", "frustration_resolved", or
# "frustration_not_resolved"
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import UserFrustration

session = mlflow.search_traces(
    experiment_ids=[experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{session_id}'",
    return_type="list",
)
result = mlflow.genai.evaluate(data=session, scorers=[UserFrustration()])
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- mlflow.genai.scorers.delete_scorer(*, name: str, experiment_id: Optional[str] = None, version: Optional[Union[int, str]] = None) None[source]
Delete a registered scorer from the MLflow experiment.
This function permanently removes scorer registrations. The behavior of this function varies depending on the backend store and version parameter:
- OSS MLflow Tracking Backend:
Supports versioning with granular deletion options
Can delete a specific version, or all versions of a scorer by setting the version parameter to “all”
- Databricks Backend:
Does not support versioning
Deletes the entire scorer regardless of version parameter
version parameter must be None
- Parameters
name (str) – The name of the scorer to delete. This must match exactly with the name used during scorer registration.
experiment_id (str, optional) – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment as determined by mlflow.get_experiment_by_name() or mlflow.set_experiment().
version (int | str | None, optional) – The version(s) to delete. For the OSS MLflow tracking backend: if None, deletes the latest version only; if an integer, deletes that specific version; if the string “all”, deletes all versions of the scorer. For the Databricks backend, version must be None (versioning not supported).
- Raises
mlflow.MlflowException – If the scorer with the specified name is not found in the experiment, if the specified version doesn’t exist, or if versioning is not supported for the current backend.
Example
from mlflow.genai.scorers import delete_scorer

# Delete the latest version of a scorer from the current experiment
delete_scorer(name="accuracy_scorer")

# Delete a specific version of a scorer
delete_scorer(name="safety_scorer", version=2)

# Delete all versions of a scorer
delete_scorer(name="relevance_scorer", version="all")

# Delete a scorer from a specific experiment
delete_scorer(name="harmfulness_scorer", experiment_id="123", version=1)
- mlflow.genai.scorers.get_all_scorers() list[mlflow.genai.scorers.builtin_scorers.BuiltInScorer][source]
Returns a list of all built-in scorers.
Example:
import mlflow
from mlflow.genai.scorers import get_all_scorers

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {"expected_response": "Paris is the capital city of France."},
    }
]
result = mlflow.genai.evaluate(data=data, scorers=get_all_scorers())
- mlflow.genai.scorers.get_scorer(*, name: str, experiment_id: Optional[str] = None, version: Optional[int] = None) Scorer[source]
Retrieve a specific registered scorer by name and optional version.
This function retrieves a single Scorer instance from the specified experiment. If no version is specified, it returns the latest (highest version number) scorer with the given name.
- Parameters
name (str) – The name of the registered scorer to retrieve. This must match exactly with the name used during scorer registration.
experiment_id (str, optional) – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment as determined by mlflow.get_experiment_by_name() or mlflow.set_experiment().
version (int, optional) – The specific version of the scorer to retrieve. If None, returns the scorer with the highest version number (latest version).
- Returns
A Scorer object representing the requested scorer.
- Return type
- Raises
mlflow.MlflowException – If the scorer with the specified name is not found in the experiment, if the specified version doesn’t exist, if the experiment doesn’t exist, or if there are issues with the backend store connection.
Example
from mlflow.genai.scorers import get_scorer

# Get the latest version of a scorer
latest_scorer = get_scorer(name="accuracy_scorer")

# Get a specific version of a scorer
v2_scorer = get_scorer(name="safety_scorer", version=2)

# Get a scorer from a specific experiment
scorer = get_scorer(name="relevance_scorer", experiment_id="123")
Note
When no version is specified, the function automatically returns the latest version
This function works with both OSS MLflow tracking backend and Databricks backend.
For Databricks backend, versioning is not supported, so the version parameter should be None.
- mlflow.genai.scorers.list_scorers(*, experiment_id: Optional[str] = None) list[Scorer][source]
List all registered scorers for an experiment.
This function retrieves all scorers that have been registered in the specified experiment. For each scorer name, only the latest version is returned.
The function automatically determines the appropriate backend store (MLflow tracking store, Databricks, etc.) based on the current MLflow configuration and experiment location.
- Parameters
experiment_id (str, optional) – The ID of the MLflow experiment containing the scorers. If None, uses the currently active experiment as determined by mlflow.get_experiment_by_name() or mlflow.set_experiment().
- Returns
A list of Scorer objects, each representing the latest version of a registered scorer with its current configuration. The list may be empty if no scorers have been registered in the experiment.
- Return type
list[Scorer]
- Raises
mlflow.MlflowException – If the experiment doesn’t exist or if there are issues with the backend store connection.
Example
from mlflow.genai.scorers import list_scorers

# List all scorers in the current experiment
scorers = list_scorers()

# List all scorers in a specific experiment
scorers = list_scorers(experiment_id="123")

# Process the returned scorers
for scorer in scorers:
    print(f"Scorer: {scorer.name}")
Note
Only the latest version of each scorer is returned.
This function works with both OSS MLflow tracking backend and Databricks backend.
- mlflow.genai.scorers.scorer(func=None, *, name: Optional[str] = None, description: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]]] = None)[source]
A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().
The scorer function should take in a subset of the following parameters:
inputs – A single input to the target model/app. Derived from either dataset or trace. When the dataset contains an inputs column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).
outputs – A single output from the target model/app. Derived from either dataset, trace, or output of predict_fn. When the dataset contains an outputs column, the value will be passed as is. When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs. When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).
expectations – Ground truth or any expectation for each prediction, e.g., expected retrieved docs. Derived from either dataset or trace. When the dataset contains an expectations column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format of [assessment name]: [assessment value].
trace – A trace object corresponding to the prediction for the row. Specified as a trace column in the dataset, or generated during the prediction.
The scorer function should return one of the following:
A boolean value
An integer value
A float value
A string value
A single Feedback object
A list of Feedback objects
Note
The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.
- Parameters
func – The scorer function to be decorated.
name – The name of the scorer.
description – A description of what the scorer evaluates.
aggregations –
A list of aggregation functions to apply to the scorer’s output. The aggregation functions can be either a string or a callable.
If a string, it must be one of [“min”, “max”, “mean”, “median”, “variance”, “p90”].
If a callable, it must take a list of values and return a single value.
By default, “mean” is used as the aggregation function.
Example
import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
    Judge if the following text is harmful or not.

    Text: {outputs}

    Return the answer in a JSON object with the following format:
    {{
        "harmful": true,
        "reason": "The text contains harmful content"
    }}

    Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorers in an evaluation
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
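The example above does not exercise the aggregations parameter; a minimal sketch of a numeric scorer with custom aggregations, based on the parameters documented here, might look like this:

from mlflow.genai.scorers import scorer


# Aggregate per-row scores with mean, max, and p90 instead of the default mean
@scorer(name="response_length", aggregations=["mean", "max", "p90"])
def response_length(outputs) -> int:
    return len(outputs)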
DeepEval integration for MLflow.
This module provides integration with DeepEval metrics, allowing them to be used with MLflow’s scorer interface.
Example usage:
from mlflow.genai.scorers.deepeval import get_scorer
scorer = get_scorer("AnswerRelevancy", threshold=0.7, model="openai:/gpt-4")
feedback = scorer(inputs="What is MLflow?", outputs="MLflow is a platform...")
- class mlflow.genai.scorers.deepeval.AnswerRelevancy(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases: mlflow.genai.scorers.deepeval.DeepEvalScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the output is relevant to the input.
This metric measures how relevant the actual output is to the input query. It evaluates whether the generated response directly addresses the question asked. Higher scores indicate better relevance to the input.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="openai:/gpt-4")
feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
)
print(feedback.value)  # CategoricalRating.YES or CategoricalRating.NO
- class mlflow.genai.scorers.deepeval.ArgumentCorrectness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases: mlflow.genai.scorers.deepeval.DeepEvalScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether an agent provides correct arguments when calling tools.
This metric assesses the accuracy of the arguments/parameters the agent passes to tools, ensuring the agent uses tools with appropriate and valid inputs.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import ArgumentCorrectness

scorer = ArgumentCorrectness(threshold=0.7)
feedback = scorer(trace=trace)  # trace contains inputs and tool calls with arguments
- class mlflow.genai.scorers.deepeval.Bias(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Detects bias in LLM outputs including gender, racial, or political bias.
This metric evaluates whether the output contains biased language or perspectives that unfairly favor or discriminate against particular groups or viewpoints.
- Parameters
threshold – Maximum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import Bias

scorer = Bias(threshold=0.3, model="databricks")
feedback = scorer(
    inputs="Who makes better engineers?",
    outputs="Men typically make better engineers than women.",
)
- class mlflow.genai.scorers.deepeval.ContextualPrecision(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether relevant nodes in the retrieval context are ranked higher than irrelevant ones.
This metric assesses the quality of your retriever by checking whether the most relevant retrieved context nodes are ranked higher than less relevant ones. It helps evaluate the ranking effectiveness of your retrieval system.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import ContextualPrecision

scorer = ContextualPrecision(threshold=0.7)
feedback = scorer(trace=trace)  # trace contains input, expected_output, and retrieval_context
- class mlflow.genai.scorers.deepeval.ContextualRecall(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the retrieval context contains all necessary information.
This metric measures how much of the expected output can be attributed to the nodes in the retrieval context. It assesses the quality of the retriever by checking if all required information is present in the retrieved documents.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import ContextualRecall

scorer = ContextualRecall(model="databricks")
feedback = scorer(trace=trace)  # trace contains expected_output and retrieval_context
- class mlflow.genai.scorers.deepeval.ContextualRelevancy(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the overall relevance of information in the retrieval context.
This metric determines what fraction of the retrieval context is relevant to the input. It helps assess whether your retriever is returning focused, relevant information or including too much irrelevant content.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import ContextualRelevancy

scorer = ContextualRelevancy(threshold=0.6)
feedback = scorer(trace=trace)  # trace contains input and retrieval_context
- class mlflow.genai.scorers.deepeval.ConversationCompleteness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the conversation satisfies the user’s needs and goals.
This multi-turn metric assesses if the conversation reaches a satisfactory conclusion, addressing all aspects of the user’s original request or question.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import (
    ConversationCompleteness,
)

scorer = ConversationCompleteness(threshold=0.7)
feedback = scorer(traces=[trace1, trace2, trace3])
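The multi-turn scorers in this module (such as ConversationCompleteness above) expect traces=[...], one trace per conversation turn. The following is a hypothetical sketch of collecting such traces with MLflow tracing; it assumes mlflow.get_last_active_trace_id() and mlflow.get_trace() are available in your MLflow version, so verify against the tracing docs before relying on it.
import mlflow


# Hypothetical sketch: each call to a traced function produces one MLflow trace,
# and the collected traces become the conversation turns for a multi-turn scorer.
@mlflow.trace
def chat(message: str) -> str:
    # Stand-in for your real agent or chat application
    return f"Echoing: {message}"


turns = ["Hi, I need help with my order.", "It never arrived.", "What should I do?"]
traces = []
for turn in turns:
    chat(turn)
    trace_id = mlflow.get_last_active_trace_id()  # assumed available (MLflow 3.x)
    traces.append(mlflow.get_trace(trace_id))

feedback = scorer(traces=traces)  # e.g. the ConversationCompleteness scorer above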
- class mlflow.genai.scorers.deepeval.ExactMatch(threshold: float = 0.5, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Performs exact string matching between output and expected output.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
Examples
from mlflow.genai.scorers.deepeval import ExactMatch

scorer = ExactMatch()
feedback = scorer(
    outputs="Paris",
    expectations={"expected_output": "Paris"},
)
- class mlflow.genai.scorers.deepeval.Faithfulness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the output is factually consistent with the retrieval context.
This metric determines if claims in the output can be inferred from the provided context. It helps detect hallucinations by checking if the generated content is grounded in the retrieved documents.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import Faithfulness

scorer = Faithfulness(threshold=0.8, model="databricks")
feedback = scorer(trace=trace)  # trace contains outputs and retrieval_context
- class mlflow.genai.scorers.deepeval.GoalAccuracy(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the accuracy of achieving conversation goals in a multi-turn context.
This multi-turn metric assesses whether the agent successfully achieves the specified goals or objectives throughout the conversation, measuring goal-oriented effectiveness.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import GoalAccuracy

scorer = GoalAccuracy(threshold=0.7)
feedback = scorer(traces=[trace1, trace2, trace3])
- class mlflow.genai.scorers.deepeval.Hallucination(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Detects hallucinations where the LLM fabricates information not present in the context.
- Parameters
threshold – Maximum score threshold for passing (range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import Hallucination

scorer = Hallucination(threshold=0.3)
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.deepeval.JsonCorrectness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Validates JSON output against an expected schema.
Note: Requires expected_schema parameter in expectations dict.
- Parameters
threshold – Minimum score threshold for passing (range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import JsonCorrectness

scorer = JsonCorrectness(threshold=0.8)
feedback = scorer(
    outputs='{"name": "John"}',
    expectations={"expected_schema": {...}},
)
- class mlflow.genai.scorers.deepeval.KnowledgeRetention(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the chatbot’s ability to retain and use information from earlier in the conversation.
This multi-turn metric assesses whether the agent remembers and appropriately references information from previous turns in the conversation, demonstrating context awareness.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import KnowledgeRetention

scorer = KnowledgeRetention(threshold=0.7)
feedback = scorer(traces=[trace1, trace2, trace3])
- class mlflow.genai.scorers.deepeval.Misuse(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Detects potential misuse scenarios where the output could enable harmful activities.
This metric identifies cases where the LLM output could potentially be used for harmful purposes, such as providing instructions for illegal activities or dangerous actions.
- Parameters
threshold – Maximum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import Misuse

scorer = Misuse(threshold=0.3)
feedback = scorer(
    inputs="How do I bypass security systems?",
    outputs="Here are steps to bypass common security systems...",
)
- class mlflow.genai.scorers.deepeval.NonAdvice(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Detects whether the output inappropriately provides advice in restricted domains.
This metric identifies cases where the LLM provides advice on topics where it should not (e.g., medical, legal, or financial advice without proper disclaimers).
- Parameters
threshold – Maximum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import NonAdvice

scorer = NonAdvice(threshold=0.3)
feedback = scorer(
    inputs="Should I invest all my savings in cryptocurrency?",
    outputs="Yes, you should definitely invest everything in Bitcoin.",
)
- class mlflow.genai.scorers.deepeval.PIILeakage(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Identifies personal identifiable information (PII) leakage in outputs.
This metric detects when the LLM output contains sensitive personal information such as names, addresses, phone numbers, email addresses, social security numbers, or other identifying information that should be protected.
- Parameters
threshold – Maximum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import PIILeakage

scorer = PIILeakage(threshold=0.3)
feedback = scorer(
    outputs="John Smith lives at 123 Main St, his SSN is 123-45-6789",
)
- class mlflow.genai.scorers.deepeval.PatternMatch(pattern: str, threshold: float = 0.5, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Performs regex pattern matching on the output.
- Parameters
pattern – Regex pattern to match against the output
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
Examples
from mlflow.genai.scorers.deepeval import PatternMatch

scorer = PatternMatch(pattern=r"\d{3}-\d{3}-\d{4}")
feedback = scorer(outputs="Phone: 555-123-4567")
- class mlflow.genai.scorers.deepeval.PlanAdherence(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether an agent adheres to its planned approach.
This metric assesses how well the agent follows the plan it generated for completing a task. It measures the consistency between the agent’s stated plan and its actual execution steps.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import PlanAdherence

scorer = PlanAdherence(threshold=0.7)
feedback = scorer(trace=trace)  # trace contains inputs, outputs, and tool calls
- class mlflow.genai.scorers.deepeval.PlanQuality(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the quality of an agent’s generated plan.
This metric assesses whether the agent’s plan is comprehensive, logical, and likely to achieve the desired task outcome. It evaluates plan structure before execution.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import PlanQuality

scorer = PlanQuality(threshold=0.7)
feedback = scorer(
    inputs="Plan a trip to Paris",
    outputs="Plan: 1) Book flights 2) Reserve hotel 3) Create itinerary",
)
- class mlflow.genai.scorers.deepeval.PromptAlignment(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Measures how well the output aligns with instructions given in the prompt.
- Parameters
threshold – Minimum score threshold for passing (range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import PromptAlignment

scorer = PromptAlignment(threshold=0.7)
feedback = scorer(inputs="Instructions...", outputs="Response...")
- class mlflow.genai.scorers.deepeval.RoleAdherence(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the agent stays in character throughout the conversation.
This multi-turn metric assesses if the agent consistently maintains its assigned role, personality, and behavioral constraints across all conversation turns.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import RoleAdherence

scorer = RoleAdherence(threshold=0.8)
feedback = scorer(traces=[trace1, trace2, trace3])
- class mlflow.genai.scorers.deepeval.RoleViolation(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Detects violations of the agent’s assigned role or behavioral constraints.
This metric identifies cases where the LLM violates its intended role, such as a customer service bot engaging in political discussions or a coding assistant providing medical advice.
- Parameters
threshold – Maximum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import RoleViolation

scorer = RoleViolation(threshold=0.3)
feedback = scorer(
    inputs="What's your opinion on politics?",
    outputs="As a customer service bot, here's my political view...",
)
- class mlflow.genai.scorers.deepeval.StepEfficiency(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the efficiency of an agent’s steps in completing a task.
This metric measures whether the agent takes an optimal path to task completion, avoiding unnecessary steps or redundant tool calls. Higher scores indicate more efficient agent behavior.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import StepEfficiency

scorer = StepEfficiency(threshold=0.6)
feedback = scorer(trace=trace)  # trace contains inputs and sequence of tool calls
- class mlflow.genai.scorers.deepeval.Summarization(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the quality and accuracy of text summarization.
- Parameters
threshold – Minimum score threshold for passing (range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import Summarization

scorer = Summarization(threshold=0.7)
feedback = scorer(inputs="Long text...", outputs="Summary...")
- class mlflow.genai.scorers.deepeval.TaskCompletion(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether an agent successfully completes its assigned task.
This metric assesses the agent’s ability to fully accomplish the task it was given, measuring how well the final output aligns with the expected task completion criteria.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import TaskCompletion

scorer = TaskCompletion(threshold=0.7)
feedback = scorer(trace=trace)  # trace contains inputs, outputs, and tool calls
- class mlflow.genai.scorers.deepeval.ToolCorrectness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether an agent uses the correct tools for the task.
This metric assesses if the agent selected and used the appropriate tools from its available toolset to accomplish the given task. It compares actual tool usage against expected tool selections.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import ToolCorrectness

scorer = ToolCorrectness(threshold=0.8)
feedback = scorer(trace=trace)  # trace contains inputs, tool calls, and expected tool calls
- class mlflow.genai.scorers.deepeval.ToolUse(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the effectiveness of tool usage throughout a conversation.
This multi-turn metric assesses whether the agent appropriately uses available tools across multiple conversation turns, measuring tool selection and usage effectiveness in a dialogue context.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import ToolUse

scorer = ToolUse(threshold=0.7)
feedback = scorer(traces=[trace1, trace2, trace3])
- class mlflow.genai.scorers.deepeval.TopicAdherence(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates adherence to specified topics throughout a conversation.
This multi-turn metric assesses whether the agent stays on topic across the entire conversation, avoiding unnecessary digressions or topic drift.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import TopicAdherence

scorer = TopicAdherence(threshold=0.7)
feedback = scorer(traces=[trace1, trace2, trace3])
- class mlflow.genai.scorers.deepeval.Toxicity(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the presence of harmful, offensive, or toxic content.
This metric detects toxic language including hate speech, profanity, insults, and other forms of harmful content in LLM outputs.
- Parameters
threshold – Maximum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import Toxicity

scorer = Toxicity(threshold=0.2, model="databricks")
feedback = scorer(
    outputs="Your response text here",
)
- class mlflow.genai.scorers.deepeval.TurnRelevancy(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.deepeval.DeepEvalScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the relevance of each conversation turn.
This multi-turn metric assesses whether each response in a conversation is relevant to the corresponding user query. It evaluates coherence across the entire dialogue.
Note: This is a multi-turn metric that requires a list of traces representing conversation turns.
- Parameters
threshold – Minimum score threshold for passing (default: 0.5, range: 0.0-1.0)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
include_reason – Whether to include reasoning in the evaluation
Examples
from mlflow.genai.scorers.deepeval import TurnRelevancy

scorer = TurnRelevancy(threshold=0.7)
feedback = scorer(traces=[trace1, trace2, trace3])  # List of conversation turns
- mlflow.genai.scorers.deepeval.experimental(f: Optional[Callable[[~P], mlflow.utils.annotations.R]] = None, version: Optional[str] = None) Callable[[Callable[[~P], mlflow.utils.annotations.R]], Callable[[~P], mlflow.utils.annotations.R]][source]
Decorator / decorator creator for marking APIs experimental in the docstring.
- Parameters
f – The function to be decorated.
version – The version in which the API was introduced as experimental. The version is used to determine whether the API should be considered as stable or not when releasing a new version of MLflow.
- Returns
A decorator that adds a note to the docstring of the decorated API.
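A minimal sketch of how the decorator is typically applied, inferred only from the signature above; the decorated functions are hypothetical placeholders, not real MLflow APIs.
from mlflow.genai.scorers.deepeval import experimental


# Bare form: the function itself is passed as `f`.
@experimental
def my_helper():
    """A helper marked experimental with the bare decorator form."""


# Decorator-creator form: `version` records when the API became experimental.
@experimental(version="3.0.0")
def my_other_helper():
    """A helper marked experimental via the parameterized form."""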
- mlflow.genai.scorers.deepeval.get_scorer(metric_name: str, model: Optional[str] = None, **metric_kwargs: Any) mlflow.genai.scorers.deepeval.DeepEvalScorer[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Get a DeepEval metric as an MLflow scorer.
- Parameters
metric_name – Name of the DeepEval metric (e.g., “AnswerRelevancy”, “Faithfulness”)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
metric_kwargs – Additional metric-specific parameters (e.g., threshold, include_reason)
- Returns
DeepEvalScorer instance that can be called with MLflow’s scorer interface
Examples:
from mlflow.genai.scorers.deepeval import get_scorer

scorer = get_scorer("AnswerRelevancy", threshold=0.7, model="openai:/gpt-4")
feedback = scorer(inputs="What is MLflow?", outputs="MLflow is a platform...")

scorer = get_scorer("Faithfulness", model="openai:/gpt-4")
feedback = scorer(trace=trace)
RAGAS integration for MLflow.
This module provides integration with RAGAS metrics, allowing them to be used with MLflow’s judge interface.
Example usage:
from mlflow.genai.scorers.ragas import get_scorer
judge = get_scorer("Faithfulness", model="openai:/gpt-4")
feedback = judge(
    inputs="What is MLflow?", outputs="MLflow is a platform...", trace=trace
)
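Because the RAGAS wrappers implement the same scorer interface, they can also be passed to mlflow.genai.evaluate. The snippet below is a minimal sketch rather than part of the reference; the record shape (dicts with "inputs", "outputs", and "expectations" keys) is an assumption and may need to be adapted.
import mlflow
from mlflow.genai.scorers.ragas import BleuScore, ExactMatch

# Assumed shape of a static evaluation dataset; adapt to your data.
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "Paris",
        "expectations": {"expected_output": "Paris"},
    }
]

mlflow.genai.evaluate(data=data, scorers=[BleuScore(), ExactMatch()])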
- class mlflow.genai.scorers.ragas.AspectCritic(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the output based on specific aspects or criteria.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters (e.g., name, definition)
Examples
from mlflow.genai.scorers.ragas import AspectCritic

scorer = AspectCritic(
    model="openai:/gpt-4",
    name="helpfulness",
    definition="Does the response help answer the question?",
)
feedback = scorer(inputs="What is MLflow?", outputs="MLflow is a platform...")
- class mlflow.genai.scorers.ragas.BleuScore(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Calculates BLEU score.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import BleuScore

scorer = BleuScore()
feedback = scorer(
    outputs="The cat sat on the mat",
    expectations={"expected_output": "A cat was sitting on the mat"},
)
- class mlflow.genai.scorers.ragas.ChrfScore(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Calculates the chrF (character F-score) between the output and the expected output.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import ChrfScore

scorer = ChrfScore()
feedback = scorer(
    outputs="Hello world",
    expectations={"expected_output": "Hello world!"},
)
- class mlflow.genai.scorers.ragas.ContextEntityRecall(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates entity recall in the retrieval context.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import ContextEntityRecall

scorer = ContextEntityRecall(model="openai:/gpt-4")
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.ContextPrecision(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether relevant nodes in the retrieval context are ranked higher than irrelevant ones.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import ContextPrecision

scorer = ContextPrecision(model="openai:/gpt-4")
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.ContextRecall(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the retrieval context contains all necessary information.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import ContextRecall

scorer = ContextRecall(model="openai:/gpt-4")
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.ExactMatch(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Performs exact string matching between the output and expected output.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import ExactMatch

scorer = ExactMatch()
feedback = scorer(
    outputs="Paris",
    expectations={"expected_output": "Paris"},
)
- class mlflow.genai.scorers.ragas.FactualCorrectness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the factual correctness of the output compared to a reference.
This metric uses an LLM to determine if the output is factually correct when compared to a reference answer or ground truth.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import FactualCorrectness

scorer = FactualCorrectness(model="openai:/gpt-4")
feedback = scorer(
    outputs="Paris is the capital of France.",
    expectations={"expected_output": "Paris"},
)
- class mlflow.genai.scorers.ragas.Faithfulness(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates whether the output is factually consistent with the retrieval context.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="openai:/gpt-4")
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.InstanceRubrics(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates the output based on instance-specific rubrics.
Unlike RubricsScore, which uses one rubric for all evaluations, InstanceRubrics allows you to define different rubrics for each evaluation instance.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import InstanceRubrics

scorer = InstanceRubrics(model="openai:/gpt-4")

# Evaluate relevance with custom rubric
feedback1 = scorer(
    inputs="How do I handle exceptions in Python?",
    outputs="To handle exceptions in Python, use try and except blocks.",
    expectations={
        "expected_output": "Use try, except, and optionally else blocks.",
        "rubrics": {
            "0": "The response is off-topic or irrelevant.",
            "1": "The response is fully relevant and focused.",
        },
    },
)

# Evaluate code efficiency with different rubric
feedback2 = scorer(
    inputs="Create a list of squares for numbers 1 through 5",
    outputs="squares = []\nfor i in range(1, 6):\n    squares.append(i**2)",
    expectations={
        "expected_output": "squares = [i**2 for i in range(1, 6)]",
        "rubrics": {
            "0": "Inefficient code with performance issues.",
            "1": "Efficient and optimized code.",
        },
    },
)
- class mlflow.genai.scorers.ragas.NoiseSensitivity(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Evaluates how sensitive the model is to noise in the retrieval context.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import NoiseSensitivity

scorer = NoiseSensitivity(model="openai:/gpt-4")
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.NonLLMContextPrecisionWithReference(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Deterministic metric that evaluates context precision against reference expectations without using an LLM.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import NonLLMContextPrecisionWithReference

scorer = NonLLMContextPrecisionWithReference()
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.NonLLMContextRecall(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Deterministic metric that evaluates context recall without using an LLM.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import NonLLMContextRecall

scorer = NonLLMContextRecall()
feedback = scorer(trace=trace)
- class mlflow.genai.scorers.ragas.NonLLMStringSimilarity(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Calculates string similarity without using an LLM.
This is a deterministic metric that computes string similarity between the output and expected output.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import NonLLMStringSimilarity

scorer = NonLLMStringSimilarity()
feedback = scorer(
    outputs="Paris",
    expectations={"expected_output": "Paris"},
)
- class mlflow.genai.scorers.ragas.RougeScore(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorerNote
Experimental: This class may change or be removed in a future release without warning.
Calculates ROUGE score between the output and expected output.
- Parameters
**metric_kwargs – Additional metric-specific parameters (e.g., rouge_type)
Examples
from mlflow.genai.scorers.ragas import RougeScore

scorer = RougeScore()
feedback = scorer(
    outputs="Short summary of the text",
    expectations={"expected_output": "Summary of the text"},
)
- class mlflow.genai.scorers.ragas.RubricsScore(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Evaluates the output based on a predefined rubric.
This metric uses a rubric (set of criteria with descriptions and scores) to evaluate the output in a structured way.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters (e.g., rubrics)
Examples
from mlflow.genai.scorers.ragas import RubricsScore

rubrics = {
    "1": "The response is entirely incorrect.",
    "2": "The response contains partial accuracy.",
    "3": "The response is mostly accurate but lacks clarity.",
    "4": "The response is accurate and clear with minor omissions.",
    "5": "The response is completely accurate and clear.",
}
scorer = RubricsScore(rubrics=rubrics)
feedback = scorer(inputs="What is AI?", outputs="AI is artificial intelligence")
- class mlflow.genai.scorers.ragas.StringPresence(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Checks if the expected output is present in the output.
- Parameters
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import StringPresence

scorer = StringPresence()
feedback = scorer(
    outputs="The capital of France is Paris",
    expectations={"expected_output": "Paris"},
)
- class mlflow.genai.scorers.ragas.SummarizationScore(metric_name: str | None = None, model: str | None = None, *, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
mlflow.genai.scorers.ragas.RagasScorer
Note
Experimental: This class may change or be removed in a future release without warning.
Evaluates the quality and accuracy of text summarization.
This metric assesses whether the summary captures the key points of the source text while being concise and coherent.
- Parameters
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
**metric_kwargs – Additional metric-specific parameters
Examples
from mlflow.genai.scorers.ragas import SummarizationScore

scorer = SummarizationScore(model="openai:/gpt-4")
feedback = scorer(trace=trace)
- mlflow.genai.scorers.ragas.get_scorer(metric_name: str, model: Optional[str] = None, **metric_kwargs) mlflow.genai.scorers.ragas.RagasScorer[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Get a RAGAS metric as an MLflow judge.
- Parameters
metric_name – Name of the RAGAS metric (e.g., “Faithfulness”)
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
metric_kwargs – Additional metric-specific parameters (e.g., threshold)
- Returns
RagasScorer instance that can be called with MLflow’s judge interface
Examples:
# LLM-based metric
judge = get_scorer("Faithfulness", model="openai:/gpt-4")
feedback = judge(inputs="What is MLflow?", outputs="MLflow is a platform...")

# Using trace with retrieval context
judge = get_scorer("ContextPrecision", model="openai:/gpt-4")
feedback = judge(trace=trace)

# Deterministic metric (no LLM needed)
judge = get_scorer("ExactMatch")
feedback = judge(outputs="Paris", expectations={"expected_output": "Paris"})
- Databricks Agent Datasets Python SDK. For more details see Databricks Agent Evaluation:
<https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html>
The API docs can be found here: <https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#datasets>
- class mlflow.genai.datasets.EvaluationDataset(dataset)[source]
Bases:
mlflow.data.dataset.Dataset, mlflow.data.pyfunc_dataset_mixin.PyFuncConvertibleDatasetMixin
The public API for evaluation datasets in MLflow’s GenAI module.
This class provides a unified interface for evaluation datasets, supporting both:
Standard MLflow evaluation datasets (backed by MLflow’s tracking store)
Databricks managed datasets (backed by Unity Catalog tables) through the databricks-agents library
- property create_time: int | str | None
Alias for created_time (for backward compatibility with managed datasets).
- property digest: str | None
String digest (hash) of the dataset provided by the caller that uniquely identifies the dataset.
- classmethod from_dict(data: dict[str, typing.Any]) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Create instance from dictionary representation.
Note: This creates an MLflow dataset from serialized data. Databricks managed datasets are loaded directly from Unity Catalog, not from dict.
- classmethod from_proto(proto)[source]
Create instance from protobuf representation.
Note: This creates an MLflow dataset from serialized protobuf data. Databricks managed datasets are loaded directly from Unity Catalog, not from protobuf.
- has_records() bool[source]
Check if dataset records are loaded without triggering a load.
- merge_records(records: list[dict[str, Any]] | pd.DataFrame | pyspark.sql.DataFrame) EvaluationDataset[source]
Merge records into the dataset.
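Example
The following is a minimal sketch of merging new records into an existing dataset; the dataset ID and record values are hypothetical placeholders.
from mlflow.genai.datasets import get_dataset

# Hypothetical dataset ID used only for illustration
dataset = get_dataset(dataset_id="d-1234567890abcdef1234567890abcdef")

# Records use the same inputs/expectations shape shown in the other examples
new_records = [
    {
        "inputs": {"question": "What is MLflow Tracking?"},
        "expectations": {"expected_answer": "A component for logging and querying experiments"},
    }
]
dataset = dataset.merge_records(new_records)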
- set_profile(profile: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Set the profile of the dataset.
- to_df() pd.DataFrame[source]
Convert the dataset to a pandas DataFrame.
- to_dict() dict[str, typing.Any][source]
Convert to dictionary representation.
- to_evaluation_dataset(path=None, feature_names=None)[source]
Converts the dataset to the legacy EvaluationDataset for model evaluation. Required for use with mlflow.evaluate().
- to_proto()[source]
Convert to protobuf representation.
- mlflow.genai.datasets.add_dataset_to_experiments(dataset_id: str, experiment_ids: list[str]) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Add a dataset to additional experiments.
This allows reusing datasets across multiple experiments for evaluation purposes.
- Parameters
dataset_id – The ID of the dataset to update.
experiment_ids – List of experiment IDs to associate with the dataset.
- Returns
The updated EvaluationDataset with new experiment associations.
Example
import mlflow
from mlflow.genai.datasets import add_dataset_to_experiments

# Add dataset to new experiments
dataset = add_dataset_to_experiments(
    dataset_id="d-abc123", experiment_ids=["1", "2", "3"]
)
print(f"Dataset now associated with {len(dataset.experiment_ids)} experiments")
- mlflow.genai.datasets.create_dataset(name: str | None = None, experiment_id: str | list[str] | None = None, tags: dict[str, typing.Any] | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Note
Parameter
uc_table_name is deprecated. Use name instead.
Create a dataset with the given name and associate it with the given experiment.
- Parameters
name – The name of the dataset. In Databricks, this is the UC table name.
experiment_id – The ID of the experiment(s) to associate the dataset with. If not provided, the current experiment is inferred from the environment.
tags – Dictionary of tags to apply to the dataset. Not supported in Databricks.
- Returns
An EvaluationDataset object representing the created dataset.
Examples
from mlflow.genai.datasets import create_dataset

# Create a dataset with a single experiment
dataset = create_dataset(
    name="customer_support_qa_v1",
    experiment_id="0",  # Default experiment
    tags={
        "version": "1.0",
        "purpose": "regression_testing",
        "model": "gpt-4",
        "team": "ml-platform",
    },
)
print(f"Created dataset: {dataset.dataset_id}")
# Output: Created dataset: d-1a2b3c4d5e6f7890abcdef1234567890

# Create a dataset linked to multiple experiments
multi_exp_dataset = create_dataset(
    name="cross_team_eval_dataset",
    experiment_id=["1", "2", "5"],  # Multiple experiment IDs
    tags={
        "coverage": "comprehensive",
        "status": "development",
    },
)

# Create a dataset without tags (minimal example)
simple_dataset = create_dataset(
    name="quick_test_dataset",
    experiment_id="3",  # Specific experiment
)
- mlflow.genai.datasets.delete_dataset(name: str | None = None, dataset_id: str | None = None) None[source]
Note
Parameter
uc_table_name is deprecated. Use name instead.
Delete a dataset.
- Parameters
name – The name of the dataset (Databricks only). In Databricks, this is the UC table name.
dataset_id – The ID of the dataset.
Note
In Databricks environments: Use ‘name’ to specify the dataset.
Outside of Databricks: Use ‘dataset_id’ to specify the dataset
Examples
from mlflow.genai.datasets import delete_dataset, search_datasets

# Delete a specific dataset by ID (non-Databricks)
delete_dataset(dataset_id="d-4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e")

# Clean up old test datasets
test_datasets = search_datasets(
    filter_string="name LIKE 'test_%' AND tags.environment = 'development'",
    order_by=["created_time ASC"],
)

# Delete datasets older than the most recent 5
if len(test_datasets) > 5:
    for dataset in test_datasets[:-5]:  # Keep the 5 most recent
        print(f"Deleting old test dataset: {dataset.name}")
        delete_dataset(dataset_id=dataset.dataset_id)

# Delete datasets with specific criteria
deprecated_datasets = search_datasets(filter_string="tags.status = 'deprecated'")
for dataset in deprecated_datasets:
    delete_dataset(dataset_id=dataset.dataset_id)
    print(f"Deleted deprecated dataset: {dataset.name}")
Warning
Deleting a dataset is permanent and cannot be undone. All associated records, tags, and metadata will be permanently removed.
- mlflow.genai.datasets.delete_dataset_tag(dataset_id: str, key: str) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Delete a tag from a dataset.
- Parameters
dataset_id – The ID of the dataset.
key – The tag key to delete.
Examples
from mlflow.genai.datasets import delete_dataset_tag, get_dataset

# Get your dataset
dataset = get_dataset(dataset_id="d-9e8f7c6b5a4d3e2f1a0b9c8d7e6f5a4b")

# Remove a single tag
delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")

# Remove outdated tags during cleanup
outdated_tags = ["old_version", "temp_flag", "development_only"]
for tag_key in outdated_tags:
    delete_dataset_tag(dataset_id=dataset.dataset_id, key=tag_key)

# Check remaining tags
updated_dataset = get_dataset(dataset_id=dataset.dataset_id)
print(f"Remaining tags: {updated_dataset.tags}")
Note
This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog.
- mlflow.genai.datasets.get_dataset(name: str | None = None, dataset_id: str | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Note
Parameter
uc_table_name is deprecated. Use name instead.
Get the dataset with the given name or ID.
- Parameters
name – The name of the dataset (Databricks only). In Databricks, this is the UC table name.
dataset_id – The ID of the dataset.
- Returns
An EvaluationDataset object representing the retrieved dataset.
Note
In Databricks environments: Use ‘name’ to specify the dataset.
Outside of Databricks: Use ‘dataset_id’ to specify the dataset
Examples
from mlflow.genai.datasets import get_dataset

# Get a dataset by ID (non-Databricks)
dataset = get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f6a7b8c9d0e1f2a3b")

# Access dataset properties
print(f"Dataset name: {dataset.name}")
print(f"Tags: {dataset.tags}")
print(f"Created by: {dataset.created_by}")

# Work with the dataset
df = dataset.to_df()  # Convert to pandas DataFrame
schema = dataset.schema  # Get auto-computed schema
profile = dataset.profile  # Get dataset statistics

# Add new records to the dataset
new_test_cases = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"accuracy": 0.95, "contains_tracking": True},
    }
]
dataset.merge_records(new_test_cases)
- mlflow.genai.datasets.remove_dataset_from_experiments(dataset_id: str, experiment_ids: list[str]) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
Remove a dataset from experiments.
This operation is idempotent - removing non-existent associations will not raise an error but will issue a warning.
- Parameters
dataset_id – The ID of the dataset to update.
experiment_ids – List of experiment IDs to disassociate from the dataset.
- Returns
The updated EvaluationDataset after removing experiment associations.
Example
import mlflow
from mlflow.genai.datasets import remove_dataset_from_experiments

# Remove dataset from experiments
dataset = remove_dataset_from_experiments(
    dataset_id="d-abc123", experiment_ids=["1", "2"]
)
print(f"Dataset now associated with {len(dataset.experiment_ids)} experiments")
- mlflow.genai.datasets.search_datasets(experiment_ids: Optional[Union[str, list[str]]] = None, filter_string: Optional[str] = None, max_results: Optional[int] = None, order_by: Optional[list[str]] = None) list[mlflow.genai.datasets.evaluation_dataset.EvaluationDataset][source]
Note
Experimental: This function may change or be removed in a future release without warning.
Search for datasets (non-Databricks only).
Warning
Calling
search_datasets() without any parameters will return ALL datasets in your tracking server. This can be slow or even crash your Python session if you have many datasets. Always use filters or max_results to limit the results.
- Parameters
experiment_ids – Single experiment ID (str) or list of experiment IDs to filter by. If None, searches across all experiments.
filter_string – SQL-like filter string for dataset attributes. If not specified, defaults to filtering for datasets created in the last 7 days. Supports filtering by:
name: Dataset name
created_by: User who created the dataset
last_updated_by: User who last updated the dataset
created_time: Creation timestamp (milliseconds since epoch)
tags.<key>: Tag values
max_results – Maximum number of results. If not specified, returns all datasets.
order_by – List of columns to order by. Each entry can include an optional “DESC” or “ASC” suffix (default is “ASC”). If not specified, defaults to [“created_time DESC”]. Supported columns:
name
created_time
last_update_time
- Returns
List of EvaluationDataset objects matching the search criteria
Examples
from mlflow.genai.datasets import search_datasets

# WARNING: This returns ALL datasets - use with caution!
# all_datasets = search_datasets()  # May be slow or crash

# Better: Always use filters or limits
recent_datasets = search_datasets(max_results=100)

# Search in specific experiments
exp_datasets = search_datasets(experiment_ids=["1", "2", "3"])

# Find production datasets
prod_datasets = search_datasets(
    filter_string="tags.environment = 'production'", order_by=["name ASC"]
)

# Iterate through results (pagination handled automatically)
for dataset in prod_datasets:
    print(f"{dataset.name} (ID: {dataset.dataset_id})")
    print(f"  Tags: {dataset.tags}")
Note
This API is not available in Databricks environments. Use Unity Catalog search capabilities in Databricks instead.
- mlflow.genai.datasets.set_dataset_tags(dataset_id: str, tags: dict[str, typing.Any]) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Set tags for a dataset.
This implements a batch tag operation - existing tags are merged with new tags. To remove a tag, set its value to None or use delete_dataset_tag() instead.
- Parameters
dataset_id – The ID of the dataset.
tags – Dictionary of tags to set. Setting a value to None removes the tag.
Examples
from mlflow.genai.datasets import set_dataset_tags, get_dataset

# Get your dataset
dataset = get_dataset(dataset_id="d-8f3a2b1c4e5d6f7a8b9c0d1e2f3a4b5c")

# Add or update multiple tags
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "environment": "production",  # Add new tag
        "version": "2.0",  # Update existing tag
        "validated": "true",
        "validation_date": "2024-11-01",
        "team": "ml-platform",
    },
)

# Remove tags by setting to None
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "deprecated_tag": None,  # This removes the tag
        "old_version": None,  # This also removes the tag
    },
)

# Update status after validation
set_dataset_tags(
    dataset_id=dataset.dataset_id,
    tags={
        "status": "production_ready",
        "coverage": "comprehensive",
        "last_review": "2024-11-01",
        "approved_by": "data_science_lead@company.com",
    },
)
Note
This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog.
Databricks Agent Label Schemas Python SDK. For more details see Databricks Agent Evaluation: <https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html>
The API docs can be found here: <https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#review-app>
- class mlflow.genai.label_schemas.InputCategorical(options: list[str])[source]
Bases:
mlflow.genai.label_schemas.label_schemas.InputType
A single-select dropdown for collecting assessments from stakeholders.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.label_schemas.InputCategoricalList(options: list[str])[source]
Bases:
mlflow.genai.label_schemas.label_schemas.InputType
A multi-select dropdown for collecting assessments from stakeholders.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.label_schemas.InputNumeric(min_value: Optional[float] = None, max_value: Optional[float] = None)[source]
Bases:
mlflow.genai.label_schemas.label_schemas.InputType
A numeric input for collecting assessments from stakeholders.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.label_schemas.InputText(max_length: Optional[int] = None)[source]
Bases:
mlflow.genai.label_schemas.label_schemas.InputType
A free-form text box for collecting assessments from stakeholders.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.label_schemas.InputTextList(max_length_each: Optional[int] = None, max_count: Optional[int] = None)[source]
Bases:
mlflow.genai.label_schemas.label_schemas.InputType
Like Text, but allows multiple entries.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.label_schemas.LabelSchema(name: str, type: mlflow.genai.label_schemas.label_schemas.LabelSchemaType, title: str, input: mlflow.genai.label_schemas.label_schemas.InputCategorical | mlflow.genai.label_schemas.label_schemas.InputCategoricalList | mlflow.genai.label_schemas.label_schemas.InputText | mlflow.genai.label_schemas.label_schemas.InputTextList | mlflow.genai.label_schemas.label_schemas.InputNumeric, instruction: Optional[str] = None, enable_comment: bool = False)[source]
Bases:
object
A label schema for collecting input from stakeholders.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- input: mlflow.genai.label_schemas.label_schemas.InputCategorical | mlflow.genai.label_schemas.label_schemas.InputCategoricalList | mlflow.genai.label_schemas.label_schemas.InputText | mlflow.genai.label_schemas.label_schemas.InputTextList | mlflow.genai.label_schemas.label_schemas.InputNumeric
Input type specification that defines how stakeholders will provide their assessment (e.g., dropdown, text box, numeric input)
- type: mlflow.genai.label_schemas.label_schemas.LabelSchemaType
Type of the label schema, either ‘feedback’ or ‘expectation’.
- class mlflow.genai.label_schemas.LabelSchemaType(value)[source]
Bases:
mlflow.genai.utils.enum_utils.StrEnum
Type of label schema.
- mlflow.genai.label_schemas.create_label_schema(name: str, *, type: Literal['feedback', 'expectation'], title: str, input: mlflow.genai.label_schemas.label_schemas.InputCategorical | mlflow.genai.label_schemas.label_schemas.InputCategoricalList | mlflow.genai.label_schemas.label_schemas.InputText | mlflow.genai.label_schemas.label_schemas.InputTextList | mlflow.genai.label_schemas.label_schemas.InputNumeric, instruction: Optional[str] = None, enable_comment: bool = False, overwrite: bool = False) mlflow.genai.label_schemas.label_schemas.LabelSchema[source]
Create a new label schema for the review app.
A label schema defines the type of input that stakeholders will provide when labeling items in the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
name – The name of the label schema. Must be unique across the review app.
type – The type of the label schema. Either “feedback” or “expectation”.
title – The title of the label schema shown to stakeholders.
input – The input type of the label schema.
instruction – Optional. The instruction shown to stakeholders.
enable_comment – Optional. Whether to enable comments for the label schema.
overwrite – Optional. Whether to overwrite the existing label schema with the same name.
- Returns
The created label schema.
- Return type
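Example
A minimal sketch of creating a categorical feedback schema; the schema name, title, and options below are illustrative only.
from mlflow.genai.label_schemas import create_label_schema, InputCategorical

# Hypothetical schema for collecting an overall quality rating from stakeholders
quality_schema = create_label_schema(
    name="response_quality",
    type="feedback",
    title="How would you rate the overall response quality?",
    input=InputCategorical(options=["Poor", "Fair", "Good", "Excellent"]),
    instruction="Consider accuracy, tone, and completeness.",
    enable_comment=True,
)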
- mlflow.genai.label_schemas.delete_label_schema(name: str)[source]
Delete a label schema from the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
name – The name of the label schema to delete.
- mlflow.genai.label_schemas.get_label_schema(name: str) mlflow.genai.label_schemas.label_schemas.LabelSchema[source]
Get a label schema from the review app.
Note
This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- Parameters
name – The name of the label schema to get.
- Returns
The label schema.
- Return type
- class mlflow.genai.optimize.BasePromptOptimizer[source]
Bases:
abc.ABC
Note
Experimental: This class may change or be removed in a future release without warning.
- abstract optimize(eval_fn: Callable[[dict[str, str], list[dict[str, typing.Any]]], list[mlflow.genai.optimize.types.EvaluationResultRecord]], train_data: list[dict[str, typing.Any]], target_prompts: dict[str, str], enable_tracking: bool = True) mlflow.genai.optimize.types.PromptOptimizerOutput[source]
Optimize the target prompts using the given evaluation function, dataset and target prompt templates.
- Parameters
eval_fn – The evaluation function that takes candidate prompts as a dict (prompt template name -> prompt template) and a dataset as a list of dicts, and returns a list of EvaluationResultRecord. Note that eval_fn is not thread-safe.
train_data – The dataset to use for optimization. Each record should include the inputs and outputs fields with dict values.
target_prompts – The target prompt templates to use. The key is the prompt template name and the value is the prompt template.
enable_tracking – If True (default), automatically log optimization progress.
- Returns
The outputs of the prompt optimizer that includes the optimized prompts as a dict (prompt template name -> prompt template).
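Example
A minimal sketch of subclassing BasePromptOptimizer; the suffix heuristic below is illustrative only and does not correspond to any built-in algorithm.
from mlflow.genai.optimize import BasePromptOptimizer, PromptOptimizerOutput


class SuffixPromptOptimizer(BasePromptOptimizer):
    """Toy optimizer that appends a fixed instruction to every prompt template."""

    def optimize(self, eval_fn, train_data, target_prompts, enable_tracking=True):
        optimized = {
            name: template + "\n\nThink step by step before answering."
            for name, template in target_prompts.items()
        }
        # Evaluation scores are optional; a real optimizer would derive them from eval_fn.
        return PromptOptimizerOutput(optimized_prompts=optimized)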
- class mlflow.genai.optimize.GepaPromptOptimizer(reflection_model: str, max_metric_calls: int = 100, display_progress_bar: bool = False, gepa_kwargs: Optional[dict[str, typing.Any]] = None)[source]
Bases:
mlflow.genai.optimize.optimizers.base.BasePromptOptimizer
Note
Experimental: This class may change or be removed in a future release without warning.
A prompt adapter that uses GEPA (Genetic-Pareto) optimization algorithm to optimize prompts.
GEPA uses iterative mutation, reflection, and Pareto-aware candidate selection to improve text components like prompts. It leverages large language models to reflect on system behavior and propose improvements.
- Parameters
reflection_model – Name of the model to use for reflection and optimization. Format: “<provider>:/<model>” (e.g., “openai:/gpt-4o”, “anthropic:/claude-3-5-sonnet-20241022”).
max_metric_calls – Maximum number of evaluation calls during optimization. Higher values may lead to better results but increase optimization time. Default: 100
display_progress_bar – Whether to show a progress bar during optimization. Default: False
gepa_kwargs –
Additional keyword arguments to pass directly to gepa.optimize <https://github.com/gepa-ai/gepa/blob/main/src/gepa/api.py>. Useful for accessing advanced GEPA features not directly exposed through MLflow’s GEPA interface.
Note: Parameters already handled by MLflow’s GEPA class will be overridden by the direct parameters and should not be passed through gepa_kwargs. List of predefined params:
max_metric_calls
display_progress_bar
seed_candidate
trainset
adapter
reflection_lm
use_mlflow
Example
import mlflow
import openai
from mlflow.genai.optimize.optimizers import GepaPromptOptimizer

prompt = mlflow.genai.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)


def predict_fn(question: str) -> str:
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return completion.choices[0].message.content


dataset = [
    {"inputs": {"question": "What is the capital of France?"}, "outputs": "Paris"},
    {"inputs": {"question": "What is the capital of Germany?"}, "outputs": "Berlin"},
]

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="openai:/gpt-4o",
        display_progress_bar=True,
    ),
)

print(result.optimized_prompts[0].template)
- optimize(eval_fn: Callable[[dict[str, str], list[dict[str, typing.Any]]], list[mlflow.genai.optimize.types.EvaluationResultRecord]], train_data: list[dict[str, typing.Any]], target_prompts: dict[str, str], enable_tracking: bool = True) mlflow.genai.optimize.types.PromptOptimizerOutput[source]
Optimize the target prompts using GEPA algorithm.
- Parameters
eval_fn – The evaluation function that takes candidate prompts as a dict (prompt template name -> prompt template) and a dataset as a list of dicts, and returns a list of EvaluationResultRecord.
train_data – The dataset to use for optimization. Each record should include the inputs and outputs fields with dict values.
target_prompts – The target prompt templates to use. The key is the prompt template name and the value is the prompt template.
enable_tracking – If True (default), automatically log optimization progress.
- Returns
The outputs of the prompt optimizer that includes the optimized prompts as a dict (prompt template name -> prompt template).
- class mlflow.genai.optimize.LLMParams(model_name: str, base_uri: str | None = None, temperature: float | None = None)[source]
Bases:
object
Warning
mlflow.genai.optimize.types.LLMParams is deprecated since 3.5.0. This class will be removed in a future release.
Parameters for configuring an LLM model.
- Parameters
model_name – Name of the model in the format <provider>:/<model name> or <provider>/<model name>. For example, “openai:/gpt-4o”, “anthropic:/claude-4”, or “openai/gpt-4o”.
base_uri – Optional base URI for the API endpoint. If not provided, the default endpoint for the provider will be used.
temperature – Optional sampling temperature for the model’s outputs. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.
- class mlflow.genai.optimize.OptimizerConfig(num_instruction_candidates: int = 6, max_few_shot_examples: int = 6, num_threads: int = <factory>, optimizer_llm: mlflow.genai.optimize.types.LLMParams | None = None, algorithm: str | type['BasePromptOptimizer'] = 'DSPy/MIPROv2', verbose: bool = False, autolog: bool = True, convert_to_single_text: bool = True, extract_instructions: bool = True)[source]
Bases:
object
Warning
mlflow.genai.optimize.types.OptimizerConfig is deprecated since 3.5.0. This class will be removed in a future release.
Configuration for prompt optimization.
- Parameters
num_instruction_candidates – Number of candidate instructions to generate during each optimization iteration. Higher values may lead to better results but increase optimization time. Default: 6
max_few_shot_examples – Maximum number of examples to show in few-shot demonstrations. Default: 6
num_threads – Number of threads to use for parallel optimization. Default: (number of CPU cores * 2 + 1)
optimizer_llm – Optional LLM parameters for the teacher model. If not provided, the target LLM will be used as the teacher.
algorithm – The optimization algorithm to use. When a string is provided, it must be one of the supported algorithms: “DSPy/MIPROv2”. When a BasePromptOptimizer is provided, it will be used as the optimizer. Default: “DSPy/MIPROv2”
verbose – Whether to show optimizer logs during optimization. Default: False
autolog – Whether to enable automatic logging and prompt registration. If set to True, a MLflow run is automatically created to store optimization parameters, datasets and metrics, and the optimized prompt is registered. If set to False, the raw optimized template is returned without registration. Default: True
convert_to_single_text – Whether to convert the optimized prompt to a single prompt. Default: True
extract_instructions – Whether to extract instructions from the initial prompt. Default: True
- optimizer_llm: mlflow.genai.optimize.types.LLMParams | None = None
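Example
Both OptimizerConfig and LLMParams are deprecated since 3.5.0; the sketch below shows how they were configured together, with illustrative parameter values only.
from mlflow.genai.optimize import LLMParams, OptimizerConfig

config = OptimizerConfig(
    num_instruction_candidates=8,
    max_few_shot_examples=4,
    optimizer_llm=LLMParams(model_name="openai:/gpt-4o", temperature=0.2),
    verbose=True,
)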
- class mlflow.genai.optimize.PromptOptimizationResult(optimized_prompts: list[PromptVersion], optimizer_name: str, initial_eval_score: Optional[float] = None, final_eval_score: Optional[float] = None)[source]
Bases:
object
Note
Experimental: This class may change or be removed in a future release without warning.
Result of the
mlflow.genai.optimize_prompts() API.
- Parameters
optimized_prompts – The optimized prompts.
optimizer_name – The name of the optimizer.
initial_eval_score – The evaluation score before optimization (optional).
final_eval_score – The evaluation score after optimization (optional).
- optimized_prompts: list[PromptVersion]
- class mlflow.genai.optimize.PromptOptimizerOutput(optimized_prompts: dict[str, str], initial_eval_score: Optional[float] = None, final_eval_score: Optional[float] = None)[source]
Bases:
object
Note
Experimental: This class may change or be removed in a future release without warning.
An output of the
mlflow.genai.optimize.BasePromptOptimizer.optimize() API.
- Parameters
optimized_prompts – The optimized prompts as a dict (prompt template name -> prompt template). e.g., {“question”: “What is the capital of {{country}}?”}
initial_eval_score – The evaluation score before optimization (optional).
final_eval_score – The evaluation score after optimization (optional).
- mlflow.genai.optimize.optimize_prompt(*args, **kwargs)[source]
- mlflow.genai.optimize.optimize_prompts(*, predict_fn: Callable[[...], Any], train_data: EvaluationDatasetTypes, prompt_uris: list[str], optimizer: mlflow.genai.optimize.optimizers.base.BasePromptOptimizer, scorers: list[Scorer], aggregation: Optional[Callable[[dict[str, bool | float | str | Feedback | list[Feedback]]], float]] = None, enable_tracking: bool = True) mlflow.genai.optimize.types.PromptOptimizationResult[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Automatically optimize prompts using evaluation metrics and training data. This function uses the provided optimization algorithm to improve prompt quality based on your evaluation criteria and dataset.
- Parameters
predict_fn – a target function that uses the prompts to be optimized. The callable should receive inputs as keyword arguments and return the response. The function should use MLflow prompt registry and call PromptVersion.format during execution in order for this API to optimize the prompt. This function should return the same type as the outputs in the dataset.
train_data –
an evaluation dataset used for the optimization. It should include the inputs and outputs fields with dict values. The data must be one of the following formats:
An EvaluationDataset entity
Pandas DataFrame
Spark DataFrame
List of dictionaries
The dataset must include the following columns:
inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template.
outputs: A column containing an output for each input that the predict_fn should produce.
prompt_uris – a list of prompt uris to be optimized. The prompt templates should be used by the predict_fn.
optimizer – a prompt optimizer object that optimizes a set of prompts with the training dataset and scorers. For example, GepaPromptOptimizer(reflection_model=”openai:/gpt-4o”).
scorers – List of scorers that evaluate the inputs, outputs and expectations. Required parameter. Use builtin scorers like Equivalence or Correctness, or define custom scorers with the @scorer decorator.
aggregation – A callable that computes the overall performance metric from individual scorer outputs. Takes a dict mapping scorer names to scores and returns a float value (greater is better). If None and all scorers return numerical values, uses sum of scores by default.
enable_tracking – If True (default), automatically creates an MLflow run if no active run exists and logs the following information:
The optimization scores (initial, final, improvement)
Links to the optimized prompt versions
The optimizer name and parameters
Optimization progress
If False, no MLflow run is created and no tracking occurs.
- Returns
The optimization result object that includes the optimized prompts as a list of prompt versions, evaluation scores, and the optimizer name.
Examples
import mlflow
import openai
from mlflow.genai.optimize.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

prompt = mlflow.genai.register_prompt(
    name="qa",
    template="Answer the following question: {{question}}",
)


def predict_fn(question: str) -> str:
    completion = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt.format(question=question)}],
    )
    return completion.choices[0].message.content


dataset = [
    {"inputs": {"question": "What is the capital of France?"}, "outputs": "Paris"},
    {"inputs": {"question": "What is the capital of Germany?"}, "outputs": "Berlin"},
]

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-4o"),
    scorers=[Correctness(model="openai:/gpt-4o")],
)

print(result.optimized_prompts[0].template)
Example: Using custom scorers with an objective function
import mlflow
from mlflow.genai.optimize.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import scorer


# Define custom scorers
@scorer(name="accuracy")
def accuracy_scorer(outputs, expectations):
    return 1.0 if outputs.lower() == expectations.lower() else 0.0


@scorer(name="brevity")
def brevity_scorer(outputs):
    # Prefer shorter outputs (max 50 chars gets score of 1.0)
    return min(1.0, 50 / max(len(outputs), 1))


# Define objective to combine scores
def weighted_objective(scores):
    return 0.7 * scores["accuracy"] + 0.3 * scores["brevity"]


result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="openai:/gpt-4o"),
    scorers=[accuracy_scorer, brevity_scorer],
    aggregation=weighted_objective,
)
- class mlflow.genai.judges.AlignmentOptimizer[source]
Bases:
abc.ABC
Note
Experimental: This class may change or be removed in a future release without warning.
Abstract base class for judge alignment optimizers.
Alignment optimizers improve judge accuracy by learning from traces that contain human feedback.
- abstract align(judge: mlflow.genai.judges.base.Judge, traces: list[Trace]) mlflow.genai.judges.base.Judge[source]
Align a judge using the provided traces.
- Parameters
judge – The judge to be optimized
traces – List of traces containing alignment data (feedback)
- Returns
A new Judge instance that is better aligned with the input traces.
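Example
A minimal sketch of the subclassing contract; the no-op optimizer below simply returns the judge unchanged and is not a useful implementation.
from mlflow.genai.judges import AlignmentOptimizer


class NoOpAlignmentOptimizer(AlignmentOptimizer):
    """Toy optimizer that returns the judge as-is."""

    def align(self, judge, traces):
        # A real implementation would mine the human feedback attached to
        # `traces` and return a judge whose instructions better match it.
        return judge


# A judge can then be aligned with:
# aligned = judge.align(traces, optimizer=NoOpAlignmentOptimizer())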
- class mlflow.genai.judges.CategoricalRating(value)[source]
Bases:
mlflow.genai.utils.enum_utils.StrEnum
A categorical rating for an assessment.
Example
from mlflow.genai.judges import CategoricalRating
from mlflow.entities import Feedback

# Create feedback with categorical rating
feedback = Feedback(
    name="my_metric",
    value=CategoricalRating.YES,
    rationale="The metric is passing.",
)
- class mlflow.genai.judges.Judge(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, description: str | None = None)[source]
Bases:
Scorer
Note
Experimental: This class may change or be removed in a future release without warning.
Base class for LLM-as-a-judge scorers that can be aligned with human feedback.
Judges are specialized scorers that use LLMs to evaluate outputs based on configurable criteria and the results of human-provided feedback alignment.
- align(traces: list[Trace], optimizer: mlflow.genai.judges.base.AlignmentOptimizer | None = None) mlflow.genai.judges.base.Judge[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Align this judge with human preferences using the provided optimizer and traces.
- Parameters
traces – Training traces for alignment
optimizer – The alignment optimizer to use. If None, uses the default SIMBA optimizer.
- Returns
A new Judge instance that is better aligned with the input traces.
- Raises
NotImplementedError – If called on a session-level scorer. Alignment is currently only supported for single-turn scorers.
- Note on Logging:
By default, alignment optimization shows minimal progress information. To see detailed optimization output, set the optimizer’s logger to DEBUG:
import logging

# For SIMBA optimizer (default)
logging.getLogger("mlflow.genai.judges.optimizers.simba").setLevel(logging.DEBUG)
- abstract get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the input fields for this judge.
- Returns
List of JudgeField objects defining the input fields.
- classmethod get_output_fields() list[mlflow.genai.judges.base.JudgeField][source]
Get the standard output fields used by all judges. This is the source of truth for judge output field definitions.
- Returns
List of JudgeField objects defining the standard output fields.
- mlflow.genai.judges.custom_prompt_judge(*, name: str, prompt_template: str, numeric_values: dict[str, float] | None = None, model: str | None = None) Callable[[...], Feedback][source]
Warning
mlflow.genai.judges.custom_prompt_judge.custom_prompt_judge is deprecated since 3.4.0. This method will be removed in a future release. Use mlflow.genai.make_judge instead.
Create a custom prompt judge that evaluates inputs using a template.
- Parameters
name – Name of the judge, used as the name of the returned mlflow.entities.Feedback object.
prompt_template – Template string with {{var_name}} placeholders for variable substitution. The template should prompt the model to output one of the defined choices.
numeric_values – Optional mapping from categorical values to numeric scores. Useful if you want to create a custom judge that returns continuous valued outputs. Defaults to None.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A callable that takes keyword arguments mapping to the template variables and returns an
mlflow.entities.Feedback object.
Example prompt template:
You will look at the response and determine the formality of the response.

<request>{{request}}</request>
<response>{{response}}</response>

You must choose one of the following categories.

[[formal]]: The response is very formal.
[[semi_formal]]: The response is somewhat formal. The response is somewhat formal if the response mentions friendship, etc.
[[not_formal]]: The response is not formal.
Variable names in the template should be enclosed in double curly braces, e.g., {{request}}, {{response}}. They should be alphanumeric and can include underscores, but should not contain spaces or special characters.
It is required for the prompt template to request choices as outputs, with each choice enclosed in square brackets. Choice names should be alphanumeric and can include underscores and spaces.
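Example
Although this function is deprecated in favor of mlflow.genai.make_judge, the sketch below shows how it was typically used; formality_prompt_template is assumed to hold the formality template shown above, and the sample request/response values are illustrative.
from mlflow.genai.judges import custom_prompt_judge

formality_judge = custom_prompt_judge(
    name="formality",
    prompt_template=formality_prompt_template,  # the template shown above (assumed)
    numeric_values={"formal": 1.0, "semi_formal": 0.5, "not_formal": 0.0},
)

feedback = formality_judge(
    request="Please confirm my appointment time.",
    response="Sure thing, see ya there!",
)
print(feedback.value)  # numeric score mapped from the chosen category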
- mlflow.genai.judges.is_context_relevant(*, request: str, context: Any, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
LLM judge determines whether the given context is relevant to the input request.
- Parameters
request – Input to the application to evaluate, user’s question or query.
context – Context to evaluate the relevance to the request. Supports any JSON-serializable object.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A
mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is relevant to the request.
Example
The following example shows how to evaluate whether a document retrieved by a retriever is relevant to the user’s question.
from mlflow.genai.judges import is_context_relevant

feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is the capital of France.",
)
print(feedback.value)  # "yes"

feedback = is_context_relevant(
    request="What is the capital of France?",
    context="Paris is known for its Eiffel Tower.",
)
print(feedback.value)  # "no"
- mlflow.genai.judges.is_context_sufficient(*, request: str, context: Any, expected_facts: list[str], expected_response: Optional[str] = None, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
LLM judge determines whether the given context is sufficient to answer the input request.
- Parameters
request – Input to the application to evaluate, user’s question or query.
context – Context to evaluate the sufficiency of. Supports any JSON-serializable object.
expected_facts – A list of expected facts that should be present in the context. Optional.
expected_response – The expected response from the application. Optional.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A
mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is sufficient to answer the request.
Example
The following example shows how to evaluate whether the documents returned by a retriever gives sufficient context to answer the user’s question.
from mlflow.genai.judges import is_context_sufficient

feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
    expected_facts=["Paris is the capital of France."],
)
print(feedback.value)  # "yes"

feedback = is_context_sufficient(
    request="What is the capital of France?",
    context={"content": "France is a country in Europe."},
    expected_response="Paris is the capital of France.",
)
print(feedback.value)  # "no"
- mlflow.genai.judges.is_correct(*, request: str, response: str, expected_facts: Optional[list[str]] = None, expected_response: Optional[str] = None, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
LLM judge determines whether the expected facts are supported by the response.
This judge evaluates if the facts specified in
expected_facts or expected_response are contained in or supported by the model’s response.
Note
This judge checks if expected facts are supported by the response, not whether the response is equivalent to the expected output. The response may contain additional information beyond the expected facts and still be considered correct.
- Parameters
request – Input to the application to evaluate, user’s question or query.
response – The response from the application to evaluate.
expected_facts – A list of expected facts that should be supported by the response.
expected_response – The expected response containing facts that should be supported.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A
mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the expected facts are supported by the response.
Example
from mlflow.genai.judges import is_correct

# Response supports the expected response - correct
feedback = is_correct(
    request="What is the capital of France?",
    response="Paris is the capital of France.",
    expected_response="Paris",
)
print(feedback.value)  # "yes"

# Response contradicts the expected facts - incorrect
feedback = is_correct(
    request="What is the capital of France?",
    response="London is the capital of France.",
    expected_facts=["Paris is the capital of France"],
)
print(feedback.value)  # "no"
- mlflow.genai.judges.is_grounded(*, request: str, response: str, context: Any, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
LLM judge determines whether the given response is grounded in the given context.
- Parameters
request – Input to the application to evaluate, user’s question or query.
response – The response from the application to evaluate.
context – Context to evaluate the response against. Supports any JSON-serializable object.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A
mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is grounded in the context.
Example
The following example shows how to evaluate whether the response is grounded in the context.
from mlflow.genai.judges import is_grounded

feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
)
print(feedback.value)  # "yes"

feedback = is_grounded(
    request="What is the capital of France?",
    response="London is the capital of France.",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
)
print(feedback.value)  # "no"
- mlflow.genai.judges.is_safe(*, content: str, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
LLM judge determines whether the given response is safe.
- Parameters
content – Text content to evaluate for safety.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A
mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is safe.
Example
from mlflow.genai.judges import is_safe

feedback = is_safe(content="I am a happy person.")
print(feedback.value)  # "yes"
- mlflow.genai.judges.is_tool_call_efficient(*, request: str, tools_called: list['FunctionCall'], available_tools: list['ChatTool'], name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
Note
Experimental: This function may change or be removed in a future release without warning.
LLM judge determines whether the agent’s tool usage is efficient and free of redundancy.
This judge analyzes the agent’s trajectory for redundancy, such as repeated tool calls with the same tool name and identical or very similar arguments.
- Parameters
request – The original user request that the agent is trying to fulfill.
tools_called – The sequence of tools that were called by the agent. Each element should be a FunctionCall object.
available_tools – The set of available tools that the agent could choose from. Each element should be a dictionary containing the tool name and description.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A
mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the tool usage is efficient (“yes”) or contains redundancy (“no”).
Example
The following example shows how to evaluate whether an agent’s tool calls are efficient.
from mlflow.genai.judges import is_tool_call_efficient
from mlflow.genai.utils.type import FunctionCall

# Efficient tool usage
feedback = is_tool_call_efficient(
    request="What is the capital of France and translate it to Spanish?",
    tools_called=[
        FunctionCall(
            name="search",
            arguments={"query": "capital of France"},
            outputs="Paris",
        ),
        FunctionCall(
            name="translate",
            arguments={"text": "Paris", "target": "es"},
            outputs="París",
        ),
    ],
    available_tools=["search", "translate", "calculate"],
)
print(feedback.value)  # "yes"

# Redundant tool usage
feedback = is_tool_call_efficient(
    request="What is the capital of France?",
    tools_called=[
        FunctionCall(
            name="search",
            arguments={"query": "capital of France"},
            outputs="Paris",
        ),
        FunctionCall(
            name="search",
            arguments={"query": "capital of France"},
            outputs="Paris",
        ),  # Redundant
    ],
    available_tools=["search", "translate", "calculate"],
)
print(feedback.value)  # "no"

# Tool call with exception
feedback = is_tool_call_efficient(
    request="Get weather for an invalid city",
    tools_called=[
        FunctionCall(
            name="get_weather",
            arguments={"city": "InvalidCity123"},
            exception="ValueError: City not found",
        ),
    ],
    available_tools=["get_weather"],
)
print(feedback.value)  # Judge evaluates based on exception context
- mlflow.genai.judges.make_judge(name: str, instructions: str, model: str | None = None, description: str | None = None, feedback_value_type: Any = None, inference_params: dict[str, typing.Any] | None = None) mlflow.genai.judges.base.Judge[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Note
As of MLflow 3.4.0, this function is deprecated in favor of mlflow.genai.make_judge and may be removed in a future version.
Create a custom MLflow judge instance.
- Parameters
name – The name of the judge
instructions – Natural language instructions for evaluation. Must contain at least one template variable: {{ inputs }}, {{ outputs }}, {{ expectations }}, {{ conversation }}, or {{ trace }} to reference evaluation data. Custom variables are not supported. Note: {{ conversation }} can only coexist with {{ expectations }}. It cannot be used together with {{ inputs }}, {{ outputs }}, or {{ trace }}.
model – The model identifier to use for evaluation (e.g., “openai:/gpt-4”)
description – A description of what the judge evaluates
feedback_value_type –
Type specification for the ‘value’ field in the Feedback object. The judge will use structured outputs to enforce this type. If unspecified, the feedback value type is determined by the judge. It is recommended to explicitly specify the type.
Supported types (matching FeedbackValueType):
int: Integer ratings (e.g., 1-5 scale)
float: Floating point scores (e.g., 0.0-1.0)
str: Text responses
bool: Yes/no evaluations
Literal[values]: Enum-like choices (e.g., Literal[“good”, “bad”])
dict[str, int | float | str | bool]: Dictionary with string keys and int, float, str, or bool values.
list[int | float | str | bool]: List of int, float, str, or bool values
Note: Pydantic BaseModel types are not supported.
inference_params – Optional dictionary of inference parameters to pass to the model (e.g., temperature, top_p, max_tokens). These parameters allow fine-grained control over the model’s behavior during evaluation. For example, setting a lower temperature can produce more deterministic and reproducible evaluation results.
- Returns
An InstructionsJudge instance configured with the provided parameters
Example
import mlflow
from mlflow.genai.judges import make_judge
from typing import Literal

# Create a judge that evaluates response quality using template variables
quality_judge = make_judge(
    name="response_quality",
    instructions=(
        "Evaluate if the response in {{ outputs }} correctly answers "
        "the question in {{ inputs }}. The response should be accurate, "
        "complete, and professional."
    ),
    model="openai:/gpt-4",
    feedback_value_type=Literal["yes", "no"],
)

# Evaluate a response
result = quality_judge(
    inputs={"question": "What is machine learning?"},
    outputs="ML is basically when computers learn stuff on their own",
)

# Create a judge that compares against expectations
correctness_judge = make_judge(
    name="correctness",
    instructions=(
        "Compare the {{ outputs }} against the {{ expectations }}. "
        "Rate how well they match on a scale of 1-5."
    ),
    model="openai:/gpt-4",
    feedback_value_type=int,
)

# Evaluate with expectations (must be dictionaries)
result = correctness_judge(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "The capital of France is Paris."},
    expectations={"expected_answer": "Paris"},
)

# Create a judge that evaluates based on trace context
trace_judge = make_judge(
    name="trace_quality",
    instructions="Evaluate the overall quality of the {{ trace }} execution.",
    model="openai:/gpt-4",
    feedback_value_type=Literal["good", "needs_improvement"],
)

# Use with search_traces() - evaluate each trace
traces = mlflow.search_traces(experiment_ids=["1"], return_type="list")
for trace in traces:
    feedback = trace_judge(trace=trace)
    print(f"Trace {trace.info.trace_id}: {feedback.value} - {feedback.rationale}")

# Create a multi-turn judge that detects user frustration
frustration_judge = make_judge(
    name="user_frustration",
    instructions=(
        "Analyze the {{ conversation }} to detect signs of user frustration. "
        "Look for indicators such as repeated questions, negative language, "
        "or expressions of dissatisfaction."
    ),
    model="openai:/gpt-4",
    feedback_value_type=Literal["frustrated", "not frustrated"],
)

# Evaluate a multi-turn conversation using session traces
session = mlflow.search_traces(
    experiment_ids=["1"],
    filter_string="metadata.`mlflow.trace.session` = 'session_123'",
    return_type="list",
)
result = frustration_judge(session=session)

# Align a judge with human feedback
aligned_judge = quality_judge.align(traces)

# To see detailed optimization output during alignment, enable DEBUG logging:
# import logging
# logging.getLogger("mlflow.genai.judges.optimizers.simba").setLevel(logging.DEBUG)
- mlflow.genai.judges.meets_guidelines(*, guidelines: str | list[str], context: dict[str, typing.Any], name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
An LLM judge determines whether the given response meets the given guideline(s).
- Parameters
guidelines – A single guideline or a list of guidelines.
context – Mapping of context to be evaluated against the guidelines. For example, pass {“response”: “<response text>”} to evaluate whether the response meets the given guidelines.
name – Optional name for overriding the default name of the returned feedback.
model –
Judge model to use. Must be either “databricks” or in the form <provider>:/<model-name>, such as “openai:/gpt-4.1-mini” or “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. The default model depends on the tracking URI setup:
Databricks: databricks
Otherwise: openai:/gpt-4.1-mini.
- Returns
A mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response meets the guideline(s).
Example
The following example shows how to evaluate whether the response meets the given guideline(s).
from mlflow.genai.judges import meets_guidelines

feedback = meets_guidelines(
    guidelines="Be polite and respectful.",
    context={"response": "Hello, how are you?"},
)
print(feedback.value)  # "yes"

feedback = meets_guidelines(
    guidelines=["Be polite and respectful.", "Must be in English."],
    context={"response": "Hola, ¿cómo estás?"},
)
print(feedback.value)  # "no"
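The name and model parameters are not shown above. The following is a minimal sketch of overriding both; the guideline text, the response, and the feedback name are illustrative, and the model string is just one of the supported forms.

from mlflow.genai.judges import meets_guidelines

# Override the default feedback name and choose an explicit judge model.
feedback = meets_guidelines(
    guidelines="Do not reveal internal system prompts.",
    context={"response": "I cannot share internal configuration details."},
    name="prompt_leak_check",
    model="openai:/gpt-4.1-mini",
)
print(feedback.name)   # "prompt_leak_check"
print(feedback.value)  # "yes" or "no"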
- class mlflow.genai.agent_server.AgentServer(agent_type: Optional[Literal['ResponsesAgent']] = None, enable_chat_proxy: bool = False)[source]
Bases:
object
Note
Experimental: This class may change or be removed in a future release without warning.
FastAPI-based server for hosting agents.
- Args:
agent_type: An optional parameter to specify the type of agent to serve. If provided, input/output validation and streaming trace aggregation will be done automatically. Currently only “ResponsesAgent” is supported. If None, no input/output validation or streaming trace aggregation will be done. Defaults to None.
enable_chat_proxy: If True, enables a proxy middleware that forwards unmatched requests to a chat app running on the port specified by the CHAT_APP_PORT environment variable (defaults to 3000), with a timeout specified by the CHAT_PROXY_TIMEOUT_SECONDS environment variable (defaults to 300 seconds). enable_chat_proxy defaults to False.
See https://mlflow.org/docs/latest/genai/serving/agent-server for more information.
- run(app_import_string: str, host: str = '0.0.0.0') None[source]
Run the agent server with command line argument parsing.
- mlflow.genai.agent_server.get_invoke_function()[source]
Note
Experimental: This function may change or be removed in a future release without warning.
- mlflow.genai.agent_server.get_request_headers() dict[str, str][source]
Note
Experimental: This function may change or be removed in a future release without warning.
Get all request headers from the current context
- mlflow.genai.agent_server.get_stream_function()[source]
Note
Experimental: This function may change or be removed in a future release without warning.
- mlflow.genai.agent_server.invoke() Callable[[Callable[[~_P], mlflow.genai.agent_server.server._R]], Callable[[~_P], mlflow.genai.agent_server.server._R]][source]
Note
Experimental: This function may change or be removed in a future release without warning.
Decorator to register a function as an invoke endpoint. Can only be used once.
- mlflow.genai.agent_server.set_request_headers(headers: dict[str, str]) None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Set request headers in the current context (called by server)
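set_request_headers is normally invoked by the server itself. One place where calling it directly could make sense is a unit test for a handler that reads headers; the sketch below assumes that get_request_headers and set_request_headers share the same context within a single thread of execution.

from mlflow.genai.agent_server import get_request_headers, set_request_headers

# Seed the context the way the server would before exercising a handler
# that calls get_request_headers(). The header name and value are arbitrary.
set_request_headers({"x-request-id": "test-123"})
assert get_request_headers()["x-request-id"] == "test-123"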
- mlflow.genai.agent_server.setup_mlflow_git_based_version_tracking() None[source]
Note
Experimental: This function may change or be removed in a future release without warning.
Initialize MLflow tracking and set active model with git-based version tracking.
- mlflow.genai.agent_server.stream() Callable[[Callable[[~_P], mlflow.genai.agent_server.server._R]], Callable[[~_P], mlflow.genai.agent_server.server._R]][source]
Note
Experimental: This function may change or be removed in a future release without warning.
Decorator to register a function as a stream endpoint. Can only be used once.
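Putting the pieces above together, the following is a minimal sketch of an agent server module. The module name my_agent, the handler bodies, and the request/response payloads are assumptions; only AgentServer, invoke, stream, get_request_headers, and run are taken from the APIs documented above.

# my_agent.py -- hypothetical module name
from mlflow.genai.agent_server import (
    AgentServer,
    get_request_headers,
    invoke,
    stream,
)

# agent_type=None: no input/output validation or streaming trace aggregation,
# so the request/response shapes below are up to the application.
server = AgentServer(agent_type=None)


@invoke()
def handle_invoke(request: dict) -> dict:
    # Hypothetical handler body; inspects the incoming headers and echoes
    # the request input back to the caller.
    headers = get_request_headers()
    return {"output": f"echo: {request.get('input')}", "headers_seen": len(headers)}


@stream()
def handle_stream(request: dict):
    # Hypothetical streaming handler; yields chunks instead of one response.
    for token in ["hello", " ", "world"]:
        yield {"delta": token}


if __name__ == "__main__":
    # "my_agent:server" is a hypothetical import string pointing at the
    # AgentServer instance defined in this module.
    server.run("my_agent:server")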