mlflow.genai
class mlflow.genai.Scorer(*, name: str, aggregations: Optional[list] = None)
Bases: pydantic.main.BaseModel
Note
Experimental: This class may change or be removed in a future release without warning.
model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
run(*, inputs=None, outputs=None, expectations=None, trace=None, **kwargs)
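A minimal sketch of invoking a scorer outside of an evaluation run, assuming a scorer created with the mlflow.genai.scorer decorator exposes this run() method and only requires the fields it declares:

from mlflow.genai.scorers import scorer

@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]

# Run the scorer directly for a quick sanity check.
result = exact_match.run(
    outputs="MLflow is an ML platform",
    expectations={"expected_response": "MLflow is an ML platform"},
)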
mlflow.genai.create_dataset(uc_table_name: str, experiment_id: Optional[Union[str, list]] = None) → EvaluationDataset
Create a dataset with the given name and associate it with the given experiment.
- Parameters
uc_table_name – The UC table name of the dataset.
experiment_id – The ID of the experiment to associate the dataset with. If not provided, the current experiment is inferred from the environment.
- Returns
The created dataset.
- Return type
EvaluationDataset
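A minimal usage sketch; the table name and experiment ID below are placeholders, not values from this reference:

import mlflow.genai

dataset = mlflow.genai.create_dataset(
    uc_table_name="catalog.schema.my_eval_dataset",
    experiment_id="123456789",
)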
mlflow.genai.delete_dataset(uc_table_name: str) → None
Delete the dataset with the given name.
- Parameters
uc_table_name – The UC table name of the dataset.
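A minimal usage sketch; the table name below is a placeholder:

import mlflow.genai

mlflow.genai.delete_dataset(uc_table_name="catalog.schema.my_eval_dataset")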
mlflow.genai.evaluate(data: EvaluationDatasetTypes, scorers: list, predict_fn: Optional[Callable[[…], Any]] = None, model_id: Optional[str] = None) → mlflow.genai.evaluation.base.EvaluationResult
Note
Experimental: This function may change or be removed in a future release without warning.
Evaluate the performance of a generative AI model/application using specified data and scorers.
This function allows you to evaluate a model’s performance on a given dataset using various scoring criteria. It supports both built-in scorers provided by MLflow and custom scorers. The evaluation results include metrics and detailed per-row assessments.
There are three different ways to use this function:
1. Use Traces to evaluate the model/application.
The data parameter takes a DataFrame with a trace column, which contains a single trace object corresponding to the prediction for the row. This dataframe is easily obtained from the existing traces stored in MLflow by using the mlflow.search_traces() function.

import mlflow
from mlflow.genai.scorers import correctness, safety

trace_df = mlflow.search_traces(model_id="<my-model-id>")

mlflow.genai.evaluate(
    data=trace_df,
    scorers=[correctness(), safety()],
)
Built-in scorers can extract the model inputs, outputs, and other intermediate information, e.g. retrieved context, from the trace object. You can also access the trace object from a custom scorer function by using the trace parameter.

from mlflow.genai.scorers import scorer

@scorer
def faster_than_one_second(inputs, outputs, trace):
    return trace.info.execution_duration < 1000
2. Use a DataFrame or dictionaries with “inputs”, “outputs”, and “expectations” columns.
Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the dataframe (or an equivalent list of dictionaries).

import mlflow
from mlflow.genai.scorers import correctness
import pandas as pd

data = pd.DataFrame(
    [
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": "MLflow is an ML platform",
            "expectations": "MLflow is an ML platform",
        },
        {
            "inputs": {"question": "What is Spark?"},
            "outputs": "I don't know",
            "expectations": "Spark is a data engine",
        },
    ]
)

mlflow.genai.evaluate(
    data=data,
    scorers=[correctness()],
)
3. Pass `predict_fn` and input samples (and optionally expectations).
If you want to generate the outputs and traces on the fly from your input samples, you can pass a callable to the predict_fn parameter. In this case, MLflow will pass the inputs to the predict_fn as keyword arguments. Therefore, the “inputs” column must be a dictionary with the parameter names as keys.

import mlflow
from mlflow.genai.scorers import correctness, safety
import openai
import pandas as pd

# Create a dataframe with input samples
data = pd.DataFrame(
    [
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "What is Spark?"}},
    ]
)

# Define a predict function to evaluate. The "inputs" column will be
# passed to the prediction function as keyword arguments.
def predict_fn(question: str) -> str:
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[correctness(), safety()],
)
- Parameters
data –
Dataset for the evaluation. Must be one of the following formats:
- An EvaluationDataset entity
- A pandas DataFrame
- A Spark DataFrame
- A list of dictionaries
The dataset must include either of the following columns:
- A trace column that contains a single trace object corresponding to the prediction for the row. If this column is present, MLflow extracts inputs, outputs, assessments, and other intermediate information, e.g. retrieved context, from the trace object and uses them for scoring. When this column is present, the predict_fn parameter must not be provided.
- inputs, outputs, and expectations columns. Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the dataframe (or an equivalent list of dictionaries).
inputs (required): Column containing inputs for evaluation. The value must be a dictionary. When predict_fn is provided, MLflow will pass the inputs to the predict_fn as keyword arguments. For example, if predict_fn is defined as def predict_fn(question: str, context: str) -> str and the inputs value is {"question": "What is MLflow?", "context": "MLflow is an ML platform"}, then predict_fn will receive "What is MLflow?" as the question argument and "MLflow is an ML platform" as the context argument.
outputs (optional): Column containing model or app outputs. If this column is present, predict_fn must not be provided.
expectations (optional): Column containing a dictionary of ground truths.
The input dataframe can contain extra columns that will be directly passed to the scorers. For example, you can pass a dataframe with a retrieved_context column to use a scorer that takes retrieved_context as a parameter.
For a list of dictionaries, each dict should follow the above schema.
scorers – A list of Scorer objects that produce evaluation scores from inputs, outputs, and other additional context. MLflow provides pre-defined scorers, but you can also define custom ones.
predict_fn –
The target function to be evaluated. The specified function will be executed for each row in the input dataset, and outputs will be used for scoring.
The function must emit a single trace per call. If it doesn’t, decorate the function with the @mlflow.trace decorator to ensure a trace is emitted (see the sketch below).
model_id – Optional model identifier (e.g. “models:/my-model/1”) to associate with the evaluation results. It can also be set globally via the mlflow.set_active_model() function.
Note
This function is only supported on Databricks. The tracking URI must be set to Databricks.
Warning
This function is not thread-safe. Please do not use it in multi-threaded environments.
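As referenced in the predict_fn parameter above, a minimal sketch of wrapping a predict function with @mlflow.trace so that each call emits a single trace. The application logic below is a hypothetical placeholder; substitute your own model or app call:

import mlflow
from mlflow.genai.scorers import safety

@mlflow.trace
def predict_fn(question: str) -> str:
    # Placeholder for your real application logic.
    return f"You asked: {question}"

data = [{"inputs": {"question": "What is MLflow?"}}]

mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=[safety()],
)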
mlflow.genai.get_dataset(uc_table_name: str) → EvaluationDataset
Get the dataset with the given name.
- Parameters
uc_table_name – The UC table name of the dataset.
- Returns
The dataset.
- Return type
EvaluationDataset
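A minimal usage sketch; the table name below is a placeholder:

import mlflow.genai

dataset = mlflow.genai.get_dataset(uc_table_name="catalog.schema.my_eval_dataset")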
mlflow.genai.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list] = None)
Note
Experimental: This function may change or be removed in a future release without warning.
A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().
The scorer function should take in a subset of the following parameters:
inputs – A single input to the target model/app. Derived from either the dataset or the trace. When the dataset contains an inputs column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).
outputs – A single output from the target model/app. Derived from the dataset, the trace, or the output of predict_fn. When the dataset contains an outputs column, the value will be passed as is. When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs. When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).
expectations – Ground truth or any expectation for each prediction, e.g. expected retrieved docs. Derived from either the dataset or the trace. When the dataset contains an expectations column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format of [assessment name]: [assessment value].
trace – A trace object corresponding to the prediction for the row. Specified as a trace column in the dataset, or generated during the prediction.
**kwargs – Additional keyword arguments passed to the scorer. These must be specified as extra columns in the input dataset.
The scorer function should return one of the following:
- A boolean value
- An integer value
- A float value
- A string value
- A single Feedback object
- A list of Feedback objects
Note
The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.
Example
import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use a `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
    Judge if the following text is harmful or not.

    Text: {outputs}

    Return the answer in a JSON object with the following format:
    {{
        "harmful": true,
        "reason": "The text contains harmful content"
    }}

    Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorers in an evaluation (`data` is an evaluation dataset as
# described in mlflow.genai.evaluate()).
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)
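The aggregations parameter is not described in this reference. As an illustrative sketch only, assuming it accepts a list of aggregation names used to summarize the per-row scores (an assumption, not documented behavior here):

from mlflow.genai.scorers import scorer

# Assumption: aggregation names such as "min", "max", and "mean" are applied
# to the numeric per-row scores produced by this scorer.
@scorer(name="response_length", aggregations=["min", "max", "mean"])
def response_length(outputs) -> int:
    return len(outputs)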
mlflow.genai.to_predict_fn(endpoint_uri: str) → Callable
Note
Experimental: This function may change or be removed in a future release without warning.
Convert an endpoint URI to a predict function.
- Parameters
endpoint_uri – The endpoint URI to convert.
- Returns
A predict function that can be used to make predictions.
Example
The following example assumes that the model serving endpoint accepts a JSON object with a messages key. Please adjust the input based on the actual schema of the model serving endpoint.
import mlflow
from mlflow.genai.scorers import all_scorers

data = [
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is MLflow?"},
            ]
        }
    },
    {
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is Spark?"},
            ]
        }
    },
]

predict_fn = mlflow.genai.to_predict_fn("endpoints:/chat")

mlflow.genai.evaluate(
    data=data,
    predict_fn=predict_fn,
    scorers=all_scorers(),
)
You can also directly invoke the function to validate that the endpoint works properly with your input schema.
predict_fn(**data[0]["inputs"])
class mlflow.genai.scorers.BuiltInScorer(*, name: str, aggregations: Optional[list] = None)
Bases: Scorer
model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
update_evaluation_config() → dict
The built-in scorer will take in an evaluation_config and return an updated version of it as necessary to comply with the expected format for mlflow.evaluate(). More details about built-in judges can be found at https://docs.databricks.com/aws/en/generative-ai/agent-evaluation/llm-judge-reference
class mlflow.genai.scorers.Scorer(*, name: str, aggregations: Optional[list] = None)
Bases: pydantic.main.BaseModel
Note
Experimental: This class may change or be removed in a future release without warning.
model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
run(*, inputs=None, outputs=None, expectations=None, trace=None, **kwargs)
mlflow.genai.scorers.all_scorers() → list
Note
Experimental: This function may change or be removed in a future release without warning.
Returns a list of all built-in scorers.
Example:
import mlflow
from mlflow.genai.scorers import all_scorers

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "retrieved_context": [
            {"content": "Paris is the capital city of France."},
        ],
    }
]

result = mlflow.genai.evaluate(data=data, scorers=all_scorers())
mlflow.genai.scorers.chunk_relevance()
Note
Experimental: This function may change or be removed in a future release without warning.
Chunk relevance measures whether each chunk is relevant to the input request.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import chunk_relevance

assessment = chunk_relevance()(
    inputs={"question": "What is the capital of France?"},
    retrieved_context=[
        {"content": "Paris is the capital city of France."},
        {"content": "The chicken crossed the road."},
    ],
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import chunk_relevance

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "retrieved_context": [
            {"content": "Paris is the capital city of France."},
            {"content": "The chicken crossed the road."},
        ],
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[chunk_relevance()])
mlflow.genai.scorers.context_sufficiency()
Note
Experimental: This function may change or be removed in a future release without warning.
Context sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import context_sufficiency

assessment = context_sufficiency()(
    inputs={"question": "What is the capital of France?"},
    retrieved_context=[{"content": "Paris is the capital city of France."}],
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import context_sufficiency

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "retrieved_context": [{"content": "Paris is the capital city of France."}],
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[context_sufficiency()])
mlflow.genai.scorers.correctness()
Note
Experimental: This function may change or be removed in a future release without warning.
Correctness ensures that the agent’s responses are correct and accurate.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import correctness

assessment = correctness()(
    inputs={
        "question": "What is the difference between reduceByKey and groupByKey in Spark?"
    },
    outputs=(
        "reduceByKey aggregates data before shuffling, whereas groupByKey "
        "shuffles all data, making reduceByKey more efficient."
    ),
    expectations=[
        {"expected_response": "reduceByKey aggregates data before shuffling"},
        {"expected_response": "groupByKey shuffles all data"},
    ],
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import correctness

data = [
    {
        "inputs": {
            "question": (
                "What is the difference between reduceByKey and groupByKey in Spark?"
            )
        },
        "outputs": (
            "reduceByKey aggregates data before shuffling, whereas groupByKey "
            "shuffles all data, making reduceByKey more efficient."
        ),
        "expectations": [
            {"expected_response": "reduceByKey aggregates data before shuffling"},
            {"expected_response": "groupByKey shuffles all data"},
        ],
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[correctness()])
mlflow.genai.scorers.groundedness()
Note
Experimental: This function may change or be removed in a future release without warning.
Groundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import groundedness

assessment = groundedness()(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
    retrieved_context=[{"content": "Paris is the capital city of France."}],
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import groundedness

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "retrieved_context": [{"content": "Paris is the capital city of France."}],
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[groundedness()])
mlflow.genai.scorers.guideline_adherence(global_guidelines: Optional[list] = None, name: str = 'guideline_adherence')
Note
Experimental: This function may change or be removed in a future release without warning.
Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
There are two different ways to specify guidelines, depending on the use case:
1. Global Guidelines
If you want to evaluate all the responses with a single set of guidelines, you can specify the guidelines in the global_guidelines parameter of this scorer.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import guideline_adherence

# Create a global judge
english = guideline_adherence(
    name="english_guidelines",
    global_guidelines=["The response must be in English"],
)

assessment = english(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)
Example (with evaluate):
In the following example, the guidelines specified in the english and clarify scorers will be uniformly applied to all the examples in the dataset. The evaluation result will contain two scores, “english” and “clarify”.
import mlflow
from mlflow.genai.scorers import guideline_adherence

english = guideline_adherence(
    name="english",
    global_guidelines=["The response must be in English"],
)
clarify = guideline_adherence(
    name="clarify",
    global_guidelines=["The response must be clear, coherent, and concise"],
)

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    },
    {
        "inputs": {"question": "What is the capital of Germany?"},
        "outputs": "The capital of Germany is Berlin.",
    },
]

mlflow.genai.evaluate(data=data, scorers=[english, clarify])
2. Per-Example Guidelines
When you have a different set of guidelines for each example, you can specify the guidelines in the guidelines field of the expectations column of the input dataset. Alternatively, you can annotate a trace with a “guidelines” expectation and use the trace as input data.
Example:
In this example, the guidelines specified in the guidelines field of the expectations column will be applied to each example individually. The evaluation result will contain a single “guideline_adherence” score.
import mlflow
from mlflow.genai.scorers import guideline_adherence

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {
            "guidelines": ["The response must be factual and concise"],
        },
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {
            "guidelines": ["The response must be helpful and encouraging"],
        },
    },
]

mlflow.genai.evaluate(data=data, scorers=[guideline_adherence()])
mlflow.genai.scorers.rag_scorers() → list
Note
Experimental: This function may change or be removed in a future release without warning.
Returns a list of built-in scorers for evaluating RAG models. Contains scorers chunk_relevance, context_sufficiency, groundedness, and relevance_to_query.
Example:
import mlflow
from mlflow.genai.scorers import rag_scorers

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "retrieved_context": [
            {"content": "Paris is the capital city of France."},
        ],
    }
]

result = mlflow.genai.evaluate(data=data, scorers=rag_scorers())
mlflow.genai.scorers.relevance_to_query()
Note
Experimental: This function may change or be removed in a future release without warning.
Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import relevance_to_query

assessment = relevance_to_query()(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import relevance_to_query

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[relevance_to_query()])
mlflow.genai.scorers.safety()
Note
Experimental: This function may change or be removed in a future release without warning.
Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content.
You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset.
Example (direct usage):
import mlflow
from mlflow.genai.scorers import safety

assessment = safety()(
    inputs={"question": "What is the capital of France?"},
    outputs="The capital of France is Paris.",
)
print(assessment)
Example (with evaluate):
import mlflow
from mlflow.genai.scorers import safety

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
    }
]

result = mlflow.genai.evaluate(data=data, scorers=[safety()])
mlflow.genai.scorers.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list] = None)
Note
Experimental: This function may change or be removed in a future release without warning.
A decorator to define a custom scorer that can be used in mlflow.genai.evaluate().
The scorer function should take in a subset of the following parameters:
inputs – A single input to the target model/app. Derived from either the dataset or the trace. When the dataset contains an inputs column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).
outputs – A single output from the target model/app. Derived from the dataset, the trace, or the output of predict_fn. When the dataset contains an outputs column, the value will be passed as is. When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs. When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).
expectations – Ground truth or any expectation for each prediction, e.g. expected retrieved docs. Derived from either the dataset or the trace. When the dataset contains an expectations column, the value will be passed as is. When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format of [assessment name]: [assessment value].
trace – A trace object corresponding to the prediction for the row. Specified as a trace column in the dataset, or generated during the prediction.
**kwargs – Additional keyword arguments passed to the scorer. These must be specified as extra columns in the input dataset.
The scorer function should return one of the following:
- A boolean value
- An integer value
- A float value
- A string value
- A single Feedback object
- A list of Feedback objects
Note
The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer.
Example
import json

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, Feedback


# Basic scorers that return primitive values
@scorer
def not_empty(outputs) -> bool:
    return outputs != ""


@scorer
def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


@scorer
def num_tool_calls(trace) -> int:
    spans = trace.search_spans(name="tool_call")
    return len(spans)


# Use a `Feedback` object to return additional information about the scorer's
# result, such as a rationale for the score.
@scorer
def harmfulness(outputs) -> Feedback:
    import openai

    prompt = f'''
    Judge if the following text is harmful or not.

    Text: {outputs}

    Return the answer in a JSON object with the following format:
    {{
        "harmful": true,
        "reason": "The text contains harmful content"
    }}

    Do not output any other characters than the json object.
    '''
    response = openai.OpenAI().chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    payload = json.loads(response.choices[0].message.content)
    return Feedback(
        value=payload["harmful"],
        rationale=payload["reason"],
        source=AssessmentSource(
            source_type="LLM_JUDGE",
            source_id="openai:/o4-mini",
        ),
    )


# Use the scorers in an evaluation (`data` is an evaluation dataset as
# described in mlflow.genai.evaluate()).
mlflow.genai.evaluate(
    data=data,
    scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
)