mlflow.genai
- class mlflow.genai.Agent(agent: _Agent)[source]
- Bases: object
- The agent configuration, used for generating responses in the review app.
- Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
- class mlflow.genai.LabelingSession(session: _LabelingSession)[source]
- Bases: object
- A session for labeling items in the review app.
- Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
 - add_dataset(dataset_name: str, record_ids: Optional[list[str]] = None) mlflow.genai.labeling.labeling.LabelingSession[source]
- Add a dataset to the labeling session. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- dataset_name – The name of the dataset. 
- record_ids – Optional. The individual record ids to be added to the session. If not provided, all records in the dataset will be added. 
 
- Returns
- The updated labeling session. 
- Return type
- LabelingSession
 - add_traces(traces: Union[Iterable[Trace], Iterable[str], pd.DataFrame]) LabelingSession[source]
- Add traces to the labeling session. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- traces – Can be one of: a) a pandas DataFrame with a ‘trace’ column, where the column contains either mlflow.entities.Trace objects or their JSON string representations; b) an iterable of mlflow.entities.Trace objects; c) an iterable of JSON string representations of mlflow.entities.Trace objects.
- Returns
- The updated labeling session. 
- Return type
- LabelingSession
 - set_assigned_users(assigned_users: list[str]) mlflow.genai.labeling.labeling.LabelingSession[source]
- Set the assigned users for the labeling session. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- assigned_users – The list of users to assign to the session. 
- Returns
- The updated labeling session. 
- Return type
- LabelingSession
 - sync(to_dataset: str) None[source]
- Sync the traces and expectations from the labeling session to a dataset. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- to_dataset – The name of the dataset to sync traces and expectations to. 
 
 
- class mlflow.genai.ReviewApp(app: _ReviewApp)[source]
- Bases: object
- A review app is used to collect feedback from stakeholders for a given experiment.
- Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it.
 - add_agent(*, agent_name: str, model_serving_endpoint: str, overwrite: bool = False) mlflow.genai.labeling.labeling.ReviewApp[source]
- Add an agent to the review app to be used to generate responses. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- agent_name – The name of the agent. 
- model_serving_endpoint – The model serving endpoint to be used by the agent. 
- overwrite – Whether to overwrite an existing agent with the same name. 
 
- Returns
- The updated review app. 
- Return type
- ReviewApp
 - property agents: list[mlflow.genai.labeling.labeling.Agent]
- The agents to be used to generate responses. 
 - remove_agent(agent_name: str) mlflow.genai.labeling.labeling.ReviewApp[source]
- Remove an agent from the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- agent_name – The name of the agent to remove. 
- Returns
- The updated review app. 
- Return type
- ReviewApp
 
- class mlflow.genai.Scorer(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None)[source]
- Bases: pydantic.main.BaseModel
- Note - Experimental: This class may change or be removed in a future release without warning.
 - aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None
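The named aggregations can be pictured as summary statistics over a list of per-row scores. The sketch below is illustrative only and is not MLflow's implementation: `aggregate_scores` and `p90` are hypothetical helpers, and the use of population variance and a linearly interpolated 90th percentile are assumptions.

```python
import statistics
from typing import Callable, Union

Aggregation = Union[str, Callable[[list[float]], float]]


def p90(values: list[float]) -> float:
    """90th percentile using linear interpolation (an assumed definition)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    position = 0.9 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Mapping from the literal names in the signature to plain functions.
NAMED_AGGREGATIONS: dict[str, Callable[[list[float]], float]] = {
    "min": min,
    "max": max,
    "mean": statistics.fmean,
    "median": statistics.median,
    "variance": statistics.pvariance,  # assumption: population variance
    "p90": p90,
}


def aggregate_scores(
    scores: list[float], aggregations: list[Aggregation]
) -> dict[str, float]:
    """Apply each named or callable aggregation to the scores."""
    results: dict[str, float] = {}
    for aggregation in aggregations:
        if callable(aggregation):
            results[aggregation.__name__] = float(aggregation(scores))
        else:
            results[aggregation] = float(NAMED_AGGREGATIONS[aggregation](scores))
    return results


def spread(values: list[float]) -> float:
    """Example of a custom callable aggregation."""
    return max(values) - min(values)


summary = aggregate_scores([1, 2, 3, 4, 5], ["min", "max", "mean", "median", "p90", spread])
print(summary)
```

A callable aggregation is keyed by its function name here; how MLflow names custom aggregations in its results is not stated in this reference.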
 - property filter_string: str | None
- Note - Experimental: This function may change or be removed in a future release without warning.
- Get the filter string for this scorer.
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
 - model_dump(**kwargs) dict[str, typing.Any][source]
- Override model_dump to include source code. 
 - model_post_init(context: Any, /) None
- This function is meant to behave like a BaseModel method to initialise private attributes. - It takes context as an argument since that’s what pydantic-core passes when calling it. - Parameters
- self – The BaseModel instance. 
- context – The context. 
 
 
 - classmethod model_validate(obj: Any) Scorer[source]
- Override model_validate to reconstruct scorer from source code. 
 - register(*, name: Optional[str] = None, experiment_id: Optional[str] = None) Scorer[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Register this scorer with the MLflow server. - This method registers the scorer for use with automatic trace evaluation in the specified experiment. Once registered, the scorer can be started to begin evaluating traces automatically. - Parameters
- name – Optional registered name for the scorer. If not provided, the current name property value will be used as a registered name. 
- experiment_id – The ID of the MLflow experiment to register the scorer for. If None, uses the currently active experiment. 
 
- Returns
- A new Scorer instance with server registration information. 
 - Example

    import mlflow
    from mlflow.genai.scorers import RelevanceToQuery

    # Register a built-in scorer
    mlflow.set_experiment("my_genai_app")
    registered_scorer = RelevanceToQuery().register(name="relevance_scorer")
    print(f"Registered scorer: {registered_scorer.name}")

    # Register a custom scorer
    from mlflow.genai.scorers import scorer


    @scorer
    def custom_length_check(outputs) -> bool:
        return len(outputs) > 100


    registered_custom = custom_length_check.register(
        name="output_length_checker", experiment_id="12345"
    )
 - run(*, inputs=None, outputs=None, expectations=None, trace=None)[source]
 - property sample_rate: float | None
- Note - Experimental: This function may change or be removed in a future release without warning.
- Get the sample rate for this scorer. Available when registered for monitoring.
 
 - start(*, name: Optional[str] = None, experiment_id: Optional[str] = None, sampling_config: mlflow.genai.scorers.base.ScorerSamplingConfig) Scorer[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Start registered scoring with the specified sampling configuration. - This method activates automatic trace evaluation for the scorer. The scorer will evaluate traces based on the provided sampling configuration, including the sample rate and optional filter criteria. - Parameters
- name – Optional scorer name. If not provided, uses the scorer’s registered name or default name. 
- experiment_id – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment. 
- sampling_config – Configuration object containing: - sample_rate: Fraction of traces to evaluate (0.0 to 1.0). Required. - filter_string: Optional MLflow search_traces compatible filter string. 
 
- Returns
- A new Scorer instance with updated sampling configuration. 
 - Example

    import mlflow
    from mlflow.genai.scorers import RelevanceToQuery, ScorerSamplingConfig

    # Start scorer with 50% sampling rate
    mlflow.set_experiment("my_genai_app")
    scorer = RelevanceToQuery().register()
    active_scorer = scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
    print(f"Scorer is evaluating {active_scorer.sample_rate * 100}% of traces")

    # Start scorer with filter to only evaluate specific traces
    filtered_scorer = scorer.start(
        sampling_config=ScorerSamplingConfig(
            sample_rate=1.0, filter_string="YOUR_FILTER_STRING"
        )
    )
 - property status: mlflow.genai.scorers.base.ScorerStatus
- Note - Experimental: This function may change or be removed in a future release without warning.
- Get the status of this scorer, using only the local state.
 
 - stop(*, name: Optional[str] = None, experiment_id: Optional[str] = None) Scorer[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Stop registered scoring by setting sample rate to 0. - This method deactivates automatic trace evaluation for the scorer while keeping the scorer registered. The scorer can be restarted later using the start() method. - Parameters
- name – Optional scorer name. If not provided, uses the scorer’s registered name or default name. 
- experiment_id – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment. 
 
- Returns
- A new Scorer instance with sample rate set to 0. 
 - Example

    import mlflow
    from mlflow.genai.scorers import RelevanceToQuery, ScorerSamplingConfig

    # Start and then stop a scorer
    mlflow.set_experiment("my_genai_app")
    scorer = RelevanceToQuery().register()
    active_scorer = scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
    print(f"Scorer is active: {active_scorer.sample_rate > 0}")

    # Stop the scorer
    stopped_scorer = active_scorer.stop()
    print(f"Scorer is active: {stopped_scorer.sample_rate > 0}")

    # The scorer remains registered and can be restarted later
    restarted_scorer = stopped_scorer.start(
        sampling_config=ScorerSamplingConfig(sample_rate=0.3)
    )
 - update(*, name: Optional[str] = None, experiment_id: Optional[str] = None, sampling_config: mlflow.genai.scorers.base.ScorerSamplingConfig) Scorer[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Update the sampling configuration for this scorer. - This method modifies the sampling rate and/or filter criteria for an already registered scorer. It can be used to dynamically adjust how many traces are evaluated or change the filtering criteria without stopping and restarting the scorer. - Parameters
- name – Optional scorer name. If not provided, uses the scorer’s registered name or default name. 
- experiment_id – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment. 
- sampling_config – Configuration object containing: - sample_rate: New fraction of traces to evaluate (0.0 to 1.0). Optional. - filter_string: New MLflow search_traces compatible filter string. Optional. 
 
- Returns
- A new Scorer instance with updated configuration. 
 - Example

    import mlflow
    from mlflow.genai.scorers import RelevanceToQuery, ScorerSamplingConfig

    # Start scorer with initial configuration
    mlflow.set_experiment("my_genai_app")
    scorer = RelevanceToQuery().register()
    active_scorer = scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))

    # Update to increase sampling rate during high traffic
    updated_scorer = active_scorer.update(
        sampling_config=ScorerSamplingConfig(sample_rate=0.5)
    )
    print(f"Updated sample rate: {updated_scorer.sample_rate}")

    # Update to add filtering criteria
    filtered_scorer = updated_scorer.update(
        sampling_config=ScorerSamplingConfig(filter_string="YOUR_FILTER_STRING")
    )
    print(f"Added filter: {filtered_scorer.filter_string}")
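register(), start(), update(), and stop() are each documented as returning a new Scorer instance rather than modifying the one they are called on. A minimal stand-in for that pattern, using a frozen dataclass, might look like the following. `SketchScorer` and its fields are hypothetical and nothing here contacts an MLflow server; treating an omitted field in `update` as "keep the current value", and keeping the filter on `stop`, are assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchScorer:
    """Hypothetical stand-in for a registered scorer's sampling state."""

    name: str
    sample_rate: float = 0.0
    filter_string: Optional[str] = None

    def start(
        self, sample_rate: float, filter_string: Optional[str] = None
    ) -> "SketchScorer":
        # Return a new instance; the original object is left untouched.
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.0 and 1.0")
        return replace(self, sample_rate=sample_rate, filter_string=filter_string)

    def update(
        self,
        sample_rate: Optional[float] = None,
        filter_string: Optional[str] = None,
    ) -> "SketchScorer":
        # Assumption: fields that are not supplied keep their current value.
        return replace(
            self,
            sample_rate=self.sample_rate if sample_rate is None else sample_rate,
            filter_string=self.filter_string if filter_string is None else filter_string,
        )

    def stop(self) -> "SketchScorer":
        # Stopping is modelled as setting the sample rate to 0.
        return replace(self, sample_rate=0.0)


registered = SketchScorer(name="relevance_scorer")
active = registered.start(sample_rate=0.1)
updated = active.update(sample_rate=0.5)
filtered = updated.update(filter_string="YOUR_FILTER_STRING")
stopped = filtered.stop()
print(registered.sample_rate, active.sample_rate, updated.sample_rate, stopped.sample_rate)
```

Because each call returns a new object, code that keeps using the earlier variable (for example `active` after calling `update`) still sees the old configuration.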
 
- class mlflow.genai.ScorerScheduleConfig(scorer: Scorer, scheduled_scorer_name: str, sample_rate: float, filter_string: Optional[str] = None)[source]
- Bases: object
- Note - Experimental: This class may change or be removed in a future release without warning.
- A scheduled scorer configuration for automated monitoring of generative AI applications.
- Scheduled scorers are used to automatically evaluate traces logged to MLflow experiments by production applications. They are part of Databricks Lakehouse Monitoring for GenAI, which helps track quality metrics like groundedness, safety, and guideline adherence alongside operational metrics like volume, latency, and cost.
- When configured, scheduled scorers run automatically in the background to evaluate a sample of traces based on the specified sampling rate and filter criteria. The Assessments are displayed in the Traces tab of the MLflow experiment and can be used to identify quality issues in production.
- Parameters
- scorer – The scorer function to run on sampled traces. Must be either a built-in scorer (e.g., Safety, Correctness) or a function decorated with @scorer. Subclasses of Scorer are not supported. 
- scheduled_scorer_name – The name for this scheduled scorer configuration within the experiment. This name must be unique among all scheduled scorers in the same experiment. We recommend using the scorer’s name (e.g., scorer.name) for consistency. 
- sample_rate – The fraction of traces to evaluate, between 0.0 and 1.0. For example, 0.1 means 10% of traces will be randomly selected for evaluation. 
- filter_string – An optional MLflow search_traces compatible filter string to apply before sampling traces. Only traces matching this filter will be considered for evaluation. Uses the same syntax as mlflow.search_traces(). 
 
 - Example

    from mlflow.genai.scorers import Safety, scorer
    from mlflow.genai.scheduled_scorers import ScorerScheduleConfig

    # Using a built-in scorer
    safety_config = ScorerScheduleConfig(
        scorer=Safety(),
        scheduled_scorer_name="production_safety",
        sample_rate=0.2,  # Evaluate 20% of traces
        filter_string="trace.status = 'OK'",
    )


    # Using a custom scorer
    @scorer
    def response_length(outputs):
        return len(str(outputs)) > 100


    length_config = ScorerScheduleConfig(
        scorer=response_length,
        scheduled_scorer_name="adequate_length",
        sample_rate=0.1,  # Evaluate 10% of traces
        filter_string="trace.status = 'OK'",
    )

- Note - Scheduled scorers are executed automatically by Databricks and do not need to be manually triggered. The Assessments appear in the Traces tab of the MLflow experiment. Only traces logged directly to the experiment are monitored; traces logged to individual runs within the experiment are not evaluated.
- Warning - This API is in Beta and may change or be removed in a future release without warning.
 - scorer: Scorer
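filter_string is described as being applied before sampling, with sample_rate then selecting a random fraction of the matching traces. The order of operations can be sketched as follows. This is not the Databricks implementation: traces are plain dictionaries, the filter is a hypothetical Python predicate standing in for a search_traces filter string, and independent per-trace sampling is an assumption.

```python
import random
from typing import Callable


def select_traces(
    traces: list[dict],
    matches_filter: Callable[[dict], bool],
    sample_rate: float,
    rng: random.Random,
) -> list[dict]:
    """Filter first, then keep each matching trace with probability sample_rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    candidates = [trace for trace in traces if matches_filter(trace)]
    return [trace for trace in candidates if rng.random() < sample_rate]


# Hypothetical traces: half have status OK, half ERROR.
traces = [{"id": i, "status": "OK" if i % 2 == 0 else "ERROR"} for i in range(1000)]


def status_ok(trace: dict) -> bool:
    # Stand-in for the filter string "trace.status = 'OK'".
    return trace["status"] == "OK"


selected = select_traces(traces, status_ok, sample_rate=0.2, rng=random.Random(0))
print(f"{len(selected)} of {len(traces)} traces selected for scoring")
```

With this ordering, sample_rate is a fraction of the traces that pass the filter, not of all traces in the experiment.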
 
- mlflow.genai.create_dataset(name: str | None = None, experiment_id: str | list[str] | None = None, tags: dict[str, typing.Any] | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Note - Parameter uc_table_name is deprecated. Use name instead.
- Create a dataset with the given name and associate it with the given experiment.
- Parameters
- name – The name of the dataset. In Databricks, this is the UC table name. 
- experiment_id – The ID of the experiment(s) to associate the dataset with. If not provided, the current experiment is inferred from the environment. 
- tags – Dictionary of tags to apply to the dataset. Not supported in Databricks. 
 
- Returns
- An EvaluationDataset object representing the created dataset. 
 - Examples

    from mlflow.genai.datasets import create_dataset

    # Create a dataset with a single experiment
    dataset = create_dataset(
        name="customer_support_qa_v1",
        experiment_id="0",  # Default experiment
        tags={
            "version": "1.0",
            "purpose": "regression_testing",
            "model": "gpt-4",
            "team": "ml-platform",
        },
    )
    print(f"Created dataset: {dataset.dataset_id}")
    # Output: Created dataset: d-1a2b3c4d5e6f7890abcdef1234567890

    # Create a dataset linked to multiple experiments
    multi_exp_dataset = create_dataset(
        name="cross_team_eval_dataset",
        experiment_id=["1", "2", "5"],  # Multiple experiment IDs
        tags={
            "coverage": "comprehensive",
            "status": "development",
        },
    )

    # Create a dataset without tags (minimal example)
    simple_dataset = create_dataset(
        name="quick_test_dataset",
        experiment_id="3",  # Specific experiment
    )
- mlflow.genai.create_labeling_session(name: str, *, assigned_users: Optional[list[str]] = None, agent: Optional[str] = None, label_schemas: Optional[list[str]] = None, enable_multi_turn_chat: bool = False, custom_inputs: Optional[dict[str, typing.Any]] = None) mlflow.genai.labeling.labeling.LabelingSession[source]
- Create a new labeling session in the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- name – The name of the labeling session. 
- assigned_users – The users that will be assigned to label items in the session. 
- agent – The agent to be used to generate responses for the items in the session. 
- label_schemas – The label schemas to be used in the session. 
- enable_multi_turn_chat – Whether to enable multi-turn chat labeling for the session. 
- custom_inputs – Optional. Custom inputs to be used in the session. 
 
- Returns
- The created labeling session. 
- Return type
- LabelingSession
- mlflow.genai.delete_dataset(name: str | None = None, dataset_id: str | None = None) None[source]
- Note - Parameter uc_table_name is deprecated. Use name instead.
- Delete a dataset.
- Parameters
- name – The name of the dataset (Databricks only). In Databricks, this is the UC table name. 
- dataset_id – The ID of the dataset. 
 
 - Note - In Databricks environments: Use ‘name’ to specify the dataset.
- Outside of Databricks: Use ‘dataset_id’ to specify the dataset.
 - Examples

    from mlflow.genai.datasets import delete_dataset, search_datasets

    # Delete a specific dataset by ID (non-Databricks)
    delete_dataset(dataset_id="d-4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e")

    # Clean up old test datasets
    test_datasets = search_datasets(
        filter_string="name LIKE 'test_%' AND tags.environment = 'development'",
        order_by=["created_time ASC"],
    )

    # Delete datasets older than the most recent 5
    if len(test_datasets) > 5:
        for dataset in test_datasets[:-5]:  # Keep the 5 most recent
            print(f"Deleting old test dataset: {dataset.name}")
            delete_dataset(dataset_id=dataset.dataset_id)

    # Delete datasets with specific criteria
    deprecated_datasets = search_datasets(filter_string="tags.status = 'deprecated'")
    for dataset in deprecated_datasets:
        delete_dataset(dataset_id=dataset.dataset_id)
        print(f"Deleted deprecated dataset: {dataset.name}")

- Warning - Deleting a dataset is permanent and cannot be undone. All associated records, tags, and metadata will be permanently removed.
- mlflow.genai.delete_dataset_tag(dataset_id: str, key: str) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Delete a tag from a dataset. - Parameters
- dataset_id – The ID of the dataset. 
- key – The tag key to delete. 
 
 - Examples

    from mlflow.genai.datasets import delete_dataset_tag, get_dataset

    # Get your dataset
    dataset = get_dataset(dataset_id="d-9e8f7c6b5a4d3e2f1a0b9c8d7e6f5a4b")

    # Remove a single tag
    delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")

    # Remove outdated tags during cleanup
    outdated_tags = ["old_version", "temp_flag", "development_only"]
    for tag_key in outdated_tags:
        delete_dataset_tag(dataset_id=dataset.dataset_id, key=tag_key)

    # Check remaining tags
    updated_dataset = get_dataset(dataset_id=dataset.dataset_id)
    print(f"Remaining tags: {updated_dataset.tags}")

- Note - This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog.
- mlflow.genai.delete_labeling_session(labeling_session: mlflow.genai.labeling.labeling.LabelingSession) mlflow.genai.labeling.labeling.ReviewApp[source]
- Delete a labeling session from the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- labeling_session – The labeling session to delete. 
- Returns
- The review app. 
- Return type
- ReviewApp
- mlflow.genai.delete_prompt_alias(name: str, alias: str) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning.
- Delete an alias for a Prompt in the MLflow Prompt Registry.
- Parameters
- name – The name of the prompt. 
- alias – The alias to delete for the prompt. 
 
 
- mlflow.genai.disable_git_model_versioning() None[source]
- Note - Experimental: This function may change or be removed in a future release without warning.
- Disable Git-based model versioning and clear the active model context.
- This function stops automatic Git-based version tracking and clears any active LoggedModel context. After calling this, traces will no longer be automatically linked to Git-based versions.
- This is automatically called when exiting a context manager created with enable_git_model_versioning().
- Example:

    import mlflow.genai

    # Enable versioning
    context = mlflow.genai.enable_git_model_versioning()

    # ... do work with versioning enabled ...

    # Disable versioning
    mlflow.genai.disable_git_model_versioning()
    # Traces are no longer linked to Git versions
- mlflow.genai.enable_git_model_versioning(remote_name: str = 'origin') mlflow.genai.git_versioning.GitContext[source]
- Note - Experimental: This function may change or be removed in a future release without warning.
- Enable automatic Git-based model versioning for MLflow traces.
- This function enables automatic version tracking based on your Git repository state. When enabled, MLflow will:
  - Detect the current Git branch, commit hash, and dirty state
  - Create or reuse a LoggedModel matching this exact Git state
  - Link all subsequent traces to this LoggedModel version
  - Capture uncommitted changes as diffs when the repository is dirty
- Parameters
- remote_name – The name of the git remote to use for repository URL detection. Defaults to “origin”. 
- Returns
- A GitContext instance containing:
  - info: GitInfo object with branch, commit, dirty state, and diff information
  - active_model: The active LoggedModel linked to the current Git state
- Return type
- mlflow.genai.git_versioning.GitContext
 - Example:

    import mlflow.genai

    # Enable Git-based versioning
    context = mlflow.genai.enable_git_model_versioning()
    print(f"Branch: {context.info.branch}, Commit: {context.info.commit[:8]}")
    # Output: Branch: main, Commit: abc12345


    # All traces are now automatically linked to this Git version
    @mlflow.trace
    def my_app():
        return "result"


    # Can also use as a context manager
    with mlflow.genai.enable_git_model_versioning() as context:
        # Traces within this block are linked to the Git version
        result = my_app()

- Note - If Git is not available or the current directory is not a Git repository, a warning is issued and versioning is disabled (context.info will be None).
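The Git state that this function is described as detecting (branch, commit hash, dirty flag, and a diff of uncommitted changes) can be read with the standard git CLI. The sketch below only illustrates that idea and is not how MLflow implements it: `read_git_state`, `run_git`, and `is_dirty` are hypothetical helpers, and counting untracked files as "dirty" is an assumption that follows from using `git status --porcelain`.

```python
import subprocess
from typing import Optional


def run_git(args: list[str], cwd: str) -> Optional[str]:
    """Run a git command; return stdout, or None if git is missing or the command fails."""
    try:
        completed = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return completed.stdout


def is_dirty(porcelain_output: str) -> bool:
    """Any line in `git status --porcelain` output means the working tree has changes."""
    return bool(porcelain_output.strip())


def read_git_state(repo_dir: str = ".") -> Optional[dict]:
    """Collect branch, commit, dirty flag, and diff; return None outside a usable repository."""
    branch = run_git(["rev-parse", "--abbrev-ref", "HEAD"], repo_dir)
    commit = run_git(["rev-parse", "HEAD"], repo_dir)
    status = run_git(["status", "--porcelain"], repo_dir)
    if branch is None or commit is None or status is None:
        return None
    dirty = is_dirty(status)
    # Only capture a diff when there is something uncommitted.
    diff = run_git(["diff", "HEAD"], repo_dir) if dirty else None
    return {
        "branch": branch.strip(),
        "commit": commit.strip(),
        "dirty": dirty,
        "diff": diff,
    }


state = read_git_state(".")
if state is None:
    print("Not a Git repository (or git is unavailable); versioning would be disabled")
else:
    print(f"Branch: {state['branch']}, Commit: {state['commit'][:8]}, Dirty: {state['dirty']}")
```

Returning None when Git is unavailable mirrors the documented behaviour that context.info is None in that case.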
- mlflow.genai.evaluate(data: EvaluationDatasetTypes, scorers: list[Scorer], predict_fn: Optional[Callable[[...], Any]] = None, model_id: str | None = None) mlflow.models.evaluation.base.EvaluationResult[source]
- Note - Experimental: This function may change or be removed in a future release without warning.
- Evaluate the performance of a generative AI model/application using specified data and scorers.
- This function allows you to evaluate a model’s performance on a given dataset using various scoring criteria. It supports both built-in scorers provided by MLflow and custom scorers. The evaluation results include metrics and detailed per-row assessments.
- There are three different ways to use this function:
- 1. Use Traces to evaluate the model/application.
- The data parameter takes a DataFrame with a trace column, which contains a single trace object corresponding to the prediction for the row. This DataFrame is easily obtained from the existing traces stored in MLflow by using the mlflow.search_traces() function.

    import mlflow
    from mlflow.genai.scorers import Correctness, Safety
    import pandas as pd

    # model_id is a string starting with "m-", e.g. "m-074689226d3b40bfbbdf4c3ff35832cd"
    trace_df = mlflow.search_traces(model_id="<my-model-id>")

    mlflow.genai.evaluate(
        data=trace_df,
        scorers=[Correctness(), Safety()],
    )

- Built-in scorers will understand the model inputs, outputs, and other intermediate information, e.g. retrieved context, from the trace object. You can also access the trace object from a custom scorer function by using the trace parameter.

    from mlflow.genai.scorers import scorer


    @scorer
    def faster_than_one_second(inputs, outputs, trace):
        return trace.info.execution_duration < 1000

- 2. Use DataFrame or dictionary with “inputs”, “outputs”, “expectations” columns.
- Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the DataFrame (or an equivalent list of dictionaries).

    import mlflow
    from mlflow.genai.scorers import Correctness
    import pandas as pd

    data = pd.DataFrame(
        [
            {
                "inputs": {"question": "What is MLflow?"},
                "outputs": "MLflow is an ML platform",
                "expectations": "MLflow is an ML platform",
            },
            {
                "inputs": {"question": "What is Spark?"},
                "outputs": "I don't know",
                "expectations": "Spark is a data engine",
            },
        ]
    )

    mlflow.genai.evaluate(
        data=data,
        scorers=[Correctness()],
    )

- 3. Pass `predict_fn` and input samples (and optionally expectations).
- If you want to generate the outputs and traces on-the-fly from your input samples, you can pass a callable to the predict_fn parameter. In this case, MLflow will pass the inputs to the predict_fn as keyword arguments. Therefore, the “inputs” column must be a dictionary with the parameter names as keys.

    import mlflow
    from mlflow.genai.scorers import Correctness, Safety
    import openai
    import pandas as pd

    # Create a dataframe with input samples
    data = pd.DataFrame(
        [
            {"inputs": {"question": "What is MLflow?"}},
            {"inputs": {"question": "What is Spark?"}},
        ]
    )


    # Define a predict function to evaluate. The "inputs" column will be
    # passed to the prediction function as keyword arguments.
    def predict_fn(question: str) -> str:
        response = openai.OpenAI().chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content


    mlflow.genai.evaluate(
        data=data,
        predict_fn=predict_fn,
        scorers=[Correctness(), Safety()],
    )

- Parameters
- data – - Dataset for the evaluation. Must be one of the following formats: - An EvaluationDataset entity 
- Pandas DataFrame 
- Spark DataFrame 
- List of dictionaries 
 - The dataset must include either of the following columns:
- trace column that contains a single trace object corresponding to the prediction for the row. If this column is present, MLflow extracts inputs, outputs, assessments, and other intermediate information, e.g. retrieved context, from the trace object and uses them for scoring. When this column is present, the predict_fn parameter must not be provided.
 
- inputs, outputs, expectations columns. Alternatively, you can pass inputs, outputs, and expectations (ground truth) as columns in the DataFrame (or an equivalent list of dictionaries).
- inputs (required): Column containing inputs for evaluation. The value must be a dictionary. When predict_fn is provided, MLflow will pass the inputs to the predict_fn as keyword arguments. For example,
- predict_fn: def predict_fn(question: str, context: str) -> str
- inputs: {“question”: “What is MLflow?”, “context”: “MLflow is an ML platform”} 
- predict_fn will receive “What is MLflow?” as the first argument (question) and “MLflow is an ML platform” as the second argument (context) 
 
- outputs (optional): Column containing model or app outputs. If this column is present, predict_fn must not be provided. 
- expectations (optional): Column containing a dictionary of ground truths. 
 
 - For a list of dictionaries, each dict should follow the above schema.
- Optional columns:
- tags (optional): Column containing a dictionary of tags. The tags will be logged to the respective traces.
 
 
 
- scorers – A list of Scorer objects that produce evaluation scores from inputs, outputs, and other additional contexts. MLflow provides pre-defined scorers, but you can also define custom ones.
- predict_fn – The target function to be evaluated. The specified function will be executed for each row in the input dataset, and its outputs will be used for scoring. The function must emit a single trace per call. If it doesn’t, decorate the function with the @mlflow.trace decorator to ensure that a trace is emitted.
- model_id – Optional model identifier (e.g. “m-074689226d3b40bfbbdf4c3ff35832cd”) to associate with the evaluation results. Can also be set globally via the mlflow.set_active_model() function.
 
- Returns
- An mlflow.models.EvaluationResult object.
 - Note - This function is only supported on Databricks. The tracking URI must be set to Databricks. - Warning - This function is not thread-safe. Please do not use it in multi-threaded environments. 
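The keyword-argument convention for predict_fn means each key of an "inputs" dictionary must match a parameter name of the function. A small stand-alone illustration of that mapping follows; `call_predict_fn` is a hypothetical helper and does not reproduce how mlflow.genai.evaluate invokes the function internally.

```python
import inspect
from typing import Any, Callable


def call_predict_fn(predict_fn: Callable[..., Any], inputs: dict[str, Any]) -> Any:
    """Unpack one row's inputs dictionary into keyword arguments."""
    # bind() raises TypeError if a required parameter is missing
    # or if inputs contains a key the function does not accept.
    inspect.signature(predict_fn).bind(**inputs)
    return predict_fn(**inputs)


def predict_fn(question: str, context: str) -> str:
    # Placeholder model: echoes its arguments instead of calling an LLM.
    return f"Q: {question} | C: {context}"


rows = [
    {"inputs": {"question": "What is MLflow?", "context": "MLflow is an ML platform"}},
]
outputs = [call_predict_fn(predict_fn, row["inputs"]) for row in rows]
print(outputs[0])

# A key that does not match any parameter name is rejected up front.
try:
    call_predict_fn(predict_fn, {"query": "What is MLflow?"})
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
```

Checking the signature before the call gives a clear error when a dataset column and a function signature drift apart.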
- mlflow.genai.get_dataset(name: str | None = None, dataset_id: str | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Note - Parameter uc_table_name is deprecated. Use name instead.
- Get the dataset with the given name or ID.
- Parameters
- name – The name of the dataset (Databricks only). In Databricks, this is the UC table name. 
- dataset_id – The ID of the dataset. 
 
- Returns
- An EvaluationDataset object representing the retrieved dataset. 
 - Note - In Databricks environments: Use ‘name’ to specify the dataset.
- Outside of Databricks: Use ‘dataset_id’ to specify the dataset.
 - Examples

    from mlflow.genai.datasets import get_dataset

    # Get a dataset by ID (non-Databricks)
    dataset = get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f6a7b8c9d0e1f2a3b")

    # Access dataset properties
    print(f"Dataset name: {dataset.name}")
    print(f"Number of records: {len(dataset.records)}")
    print(f"Tags: {dataset.tags}")
    print(f"Created by: {dataset.created_by}")

    # Work with the dataset
    df = dataset.to_df()  # Convert to pandas DataFrame
    schema = dataset.schema  # Get auto-computed schema
    profile = dataset.profile  # Get dataset statistics

    # Add new records to the dataset
    new_test_cases = [
        {
            "inputs": {"question": "What is MLflow?"},
            "expectations": {"accuracy": 0.95, "contains_tracking": True},
        }
    ]

    dataset.merge_records(new_test_cases)
- mlflow.genai.get_labeling_session(run_id: str) mlflow.genai.labeling.labeling.LabelingSession[source]
- Get a labeling session from the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- run_id – The mlflow run ID of the labeling session to get. 
- Returns
- The labeling session. 
- Return type
- LabelingSession
- mlflow.genai.get_labeling_sessions() list[mlflow.genai.labeling.labeling.LabelingSession][source]
- Get all labeling sessions from the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Returns
- The list of labeling sessions. 
- Return type
- list[LabelingSession] 
 
- mlflow.genai.get_review_app(experiment_id: Optional[str] = None) mlflow.genai.labeling.labeling.ReviewApp[source]
- Gets or creates (if it doesn’t exist) the review app for the given experiment ID. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- experiment_id – Optional. The experiment ID for which to get the review app. If not provided, the experiment ID is inferred from the current active environment. 
- Returns
- The review app. 
- Return type
- ReviewApp
- mlflow.genai.load_prompt(name_or_uri: str, version: str | int | None = None, allow_missing: bool = False) PromptVersion[source]
- Note - Experimental: This function may change or be removed in a future release without warning.
- Load a Prompt from the MLflow Prompt Registry.
- The prompt can be specified by name and version, or by URI.
- Parameters
- name_or_uri – The name of the prompt, or the URI in the format “prompts:/name/version”. 
- version – The version of the prompt (required when using name, not allowed when using URI). 
- allow_missing – If True, return None instead of raising an exception if the specified prompt is not found.
 
 - Example: - import mlflow # Load a specific version of the prompt prompt = mlflow.genai.load_prompt("my_prompt", version=1) # Load a specific version of the prompt by URI prompt = mlflow.genai.load_prompt("prompts:/my_prompt/1") # Load a prompt version with an alias "production" prompt = mlflow.genai.load_prompt("prompts:/my_prompt@production") 
- mlflow.genai.make_judge(name: str, instructions: str, model: Optional[str] = None) mlflow.genai.judges.base.Judge[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Create a custom MLflow judge instance. - Parameters
- name – The name of the judge 
- instructions – Natural language instructions for evaluation. Must contain at least one template variable: {{ inputs }}, {{ outputs }}, {{ expectations }}, or {{ trace }} to reference evaluation data. Custom variables are not supported. 
- model – The model identifier to use for evaluation (e.g., “openai:/gpt-4”) 
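The constraint on instructions (at least one of the four template variables, no custom variables) can be checked before a judge is created. The following is a minimal sketch of such a check, not MLflow's own validation; check_instructions is a hypothetical helper and the regular expression is an assumption about how the variables are written.

```python
import re

# Template variables the parameter description lists as allowed.
ALLOWED_VARIABLES = {"inputs", "outputs", "expectations", "trace"}

# Matches {{ name }} with optional whitespace inside the braces.
_VARIABLE_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_instructions(instructions: str) -> set[str]:
    """Return the template variables used, or raise if the rules are broken."""
    found = set(_VARIABLE_PATTERN.findall(instructions))
    if not found:
        raise ValueError("Instructions must reference at least one template variable.")
    unsupported = found - ALLOWED_VARIABLES
    if unsupported:
        raise ValueError(f"Unsupported template variables: {sorted(unsupported)}")
    return found
```

For example, check_instructions("Evaluate if {{ outputs }} answers {{ inputs }}.") returns the set containing "inputs" and "outputs", while an instruction using {{ question }} raises ValueError.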
 
- Returns
- An InstructionsJudge instance configured with the provided parameters 
 - Example - import mlflow from mlflow.genai.judges import make_judge # Create a judge that evaluates response quality using template variables quality_judge = make_judge( name="response_quality", instructions=( "Evaluate if the response in {{ outputs }} correctly answers " "the question in {{ inputs }}. The response should be accurate, " "complete, and professional." ), model="openai:/gpt-4", ) # Evaluate a response result = quality_judge( inputs={"question": "What is machine learning?"}, outputs="ML is basically when computers learn stuff on their own", ) # Create a judge that compares against expectations correctness_judge = make_judge( name="correctness", instructions=( "Compare the {{ outputs }} against the {{ expectations }}. " "Rate how well they match on a scale of 1-5." ), model="openai:/gpt-4", ) # Evaluate with expectations (must be dictionaries) result = correctness_judge( inputs={"question": "What is the capital of France?"}, outputs={"answer": "The capital of France is Paris."}, expectations={"expected_answer": "Paris"}, ) # Create a judge that evaluates based on trace context trace_judge = make_judge( name="trace_quality", instructions="Evaluate the overall quality of the {{ trace }} execution.", model="openai:/gpt-4", ) # Use with search_traces() - evaluate each trace traces = mlflow.search_traces(experiment_ids=["1"], return_type="list") for trace in traces: feedback = trace_judge(trace=trace) print(f"Trace {trace.info.trace_id}: {feedback.value} - {feedback.rationale}") 
- mlflow.genai.optimize_prompt(*, target_llm_params: mlflow.genai.optimize.types.LLMParams, prompt: str | PromptVersion, train_data: EvaluationDatasetTypes, scorers: list[Scorer], objective: Optional[Callable[[dict[str, bool | float | str | Feedback | list[Feedback]]], float]] = None, eval_data: Optional[EvaluationDatasetTypes] = None, optimizer_config: mlflow.genai.optimize.types.OptimizerConfig | None = None) mlflow.genai.optimize.types.PromptOptimizationResult[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Optimize an LLM prompt using the given dataset and evaluation metrics. By default, the optimized prompt template is automatically registered as a new version of the original prompt, and optimization metrics are logged. Currently, this API provides built-in support for DSPy’s MIPROv2 optimizer; you can also implement custom optimization algorithms by extending the BasePromptOptimizer class. - Parameters
- target_llm_params – Parameters for the LLM that the prompt is optimized for. The model name can be specified in either format: - <provider>:/<model> (e.g., “openai:/gpt-4o”) - <provider>/<model> (e.g., “openai/gpt-4o”) 
- prompt – The URI or Prompt object of the MLflow prompt to optimize. 
- train_data – - Training dataset used for optimization. The data must be one of the following formats: - An EvaluationDataset entity 
- Pandas DataFrame 
- Spark DataFrame 
- List of dictionaries 
 - The dataset must include the following columns: - inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template. 
- expectations: A column containing a dictionary of ground truths for individual output fields. 
 
- scorers – List of scorers that evaluate the inputs, outputs and expectations. Note: Trace input is not supported for optimization. Use inputs, outputs and expectations for optimization. Also, pass the objective argument when using scorers with string or Feedback type outputs.
- objective – A callable that computes the overall performance metric from individual assessments. Takes a dict mapping assessment names to assessment scores and returns a float value (greater is better). 
- eval_data – Evaluation dataset with the same format as train_data. If not provided, train_data will be automatically split into training and evaluation sets. 
- optimizer_config – Configuration parameters for the optimizer. 
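The objective parameter is easiest to picture as a plain function from per-scorer results to one number. The sketch below is an illustration under stated assumptions: the scorer names ("exact_match", "politeness"), the weights, and the label mapping are all hypothetical, and Feedback values are left out to keep the example free of MLflow imports.

```python
from typing import Union

Score = Union[bool, float, str]

# Hypothetical mapping from string-valued scorer outputs to numbers.
LABEL_SCORES = {"yes": 1.0, "no": 0.0}

# Hypothetical weights; the keys must match the scorer names in use.
WEIGHTS = {"exact_match": 0.75, "politeness": 0.25}


def to_float(value: Score) -> float:
    """Convert a bool, number, or known string label to a float."""
    if isinstance(value, str):
        return LABEL_SCORES[value.lower()]
    return float(value)


def weighted_objective(scores: dict[str, Score]) -> float:
    """Combine per-scorer results into one float; greater is better."""
    return sum(weight * to_float(scores[name]) for name, weight in WEIGHTS.items())
```

A function like this would be passed as objective=weighted_objective. It matters mostly when a scorer returns strings, since those have no natural numeric ordering for the optimizer to maximize.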
 
- Returns
- The optimization result including the optimized prompt. 
- Return type
- PromptOptimizationResult 
 - Example - import os import mlflow from typing import Any from mlflow.genai.scorers import scorer from mlflow.genai.optimize import OptimizerConfig, LLMParams os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" @scorer def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool: return expectations == outputs prompt = mlflow.genai.register_prompt( name="qa", template="Answer the following question: {{question}}", ) result = mlflow.genai.optimize_prompt( target_llm_params=LLMParams(model_name="openai:/gpt-4o-mini"), train_data=[ {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}} for i in range(100) ], scorers=[exact_match], prompt=prompt.uri, optimizer_config=OptimizerConfig(num_instruction_candidates=5), ) print(result.prompt.template) 
- mlflow.genai.register_prompt(name: str, template: str | list[dict[str, typing.Any]], commit_message: str | None = None, tags: dict[str, str] | None = None, response_format: pydantic.main.BaseModel | dict[str, typing.Any] | None = None) PromptVersion[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Register a new Prompt in the MLflow Prompt Registry. - A Prompt is, at minimum, a pair of a name and template content. With the MLflow Prompt Registry, you can create, manage, and version control prompts with MLflow’s robust model tracking framework. - If there is no registered prompt with the given name, a new prompt will be created. Otherwise, a new version of the existing prompt will be created. - Parameters
- name – The name of the prompt. 
- template – - The template content of the prompt. Can be either: - A string containing text with variables enclosed in double curly braces, e.g. {{variable}}, which will be replaced with actual values by the format method. 
- A list of dictionaries representing chat messages, where each message has ‘role’ and ‘content’ keys (e.g., [{“role”: “user”, “content”: “Hello {{name}}”}]) 
 - Note - If you want to use the prompt with a framework that uses single curly braces, e.g. LangChain, you can use the to_single_brace_format method to convert the loaded prompt to a format that uses single curly braces. - prompt = mlflow.genai.load_prompt("my_prompt") langchain_format = prompt.to_single_brace_format() 
- commit_message – A message describing the changes made to the prompt, similar to a Git commit message. Optional. 
- tags – A dictionary of tags associated with the prompt version. This is useful for storing version-specific information, such as the author of the changes. Optional. 
- response_format – Optional Pydantic class or dictionary defining the expected response structure. This can be used to specify the schema for structured outputs from LLM calls. 
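To make the double-curly-brace convention concrete, here is a small standalone sketch of variable substitution and of converting a template to single-brace form. It is not MLflow's implementation of format or to_single_brace_format; render and to_single_brace are hypothetical helpers, and the real methods may treat whitespace or escaping differently.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, **values: str) -> str:
    """Replace each {{variable}} with its value; fail on a missing value."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"Missing value for template variable {key!r}")
        return str(values[key])

    return _VARIABLE.sub(substitute, template)


def to_single_brace(template: str) -> str:
    """Rewrite {{variable}} as {variable} for single-brace frameworks."""
    return _VARIABLE.sub(lambda m: "{" + m.group(1) + "}", template)
```

After conversion, a template such as "Hello {{name}}" becomes "Hello {name}", which Python's own str.format (and frameworks built on the same syntax) can fill in.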
 
- Returns
- A Prompt object that was created.
 - Example: - import mlflow # Register a text prompt mlflow.genai.register_prompt( name="greeting_prompt", template="Respond to the user's message as a {{style}} AI.", ) # Register a chat prompt with multiple messages mlflow.genai.register_prompt( name="assistant_prompt", template=[ {"role": "system", "content": "You are a helpful {{style}} assistant."}, {"role": "user", "content": "{{question}}"}, ], response_format={"type": "object", "properties": {"answer": {"type": "string"}}}, ) # Load and use the prompt prompt = mlflow.genai.load_prompt("greeting_prompt") # Use the prompt in your application import openai openai_client = openai.OpenAI() openai_client.chat.completions.create( model="gpt-4o-mini", messages=[ {"role": "system", "content": prompt.format(style="friendly")}, {"role": "user", "content": "Hello, how are you?"}, ], ) # Update the prompt with a new version prompt = mlflow.genai.register_prompt( name="greeting_prompt", template="Respond to the user's message as a {{style}} AI. {{greeting}}", commit_message="Add a greeting to the prompt.", tags={"author": "Bob"}, ) 
- mlflow.genai.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]]] = None)[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - A decorator to define a custom scorer that can be used in mlflow.genai.evaluate(). - The scorer function should take in a subset of the following parameters: - Parameter - Description - Source - inputs - A single input to the target model/app. - Derived from either dataset or trace. - When the dataset contains an inputs column, the value will be passed as is.
- When traces are provided as the evaluation dataset, this will be derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).
 - outputs - A single output from the target model/app. - Derived from either dataset, trace, or output of predict_fn. - When the dataset contains an outputs column, the value will be passed as is.
- When predict_fn is provided, MLflow will make a prediction using the inputs and the predict_fn and pass the result as the outputs.
- When traces are provided as the evaluation dataset, this will be derived from the response field of the trace (i.e. outputs captured as the root span of the trace).
 - expectations - Ground truth or any expectation for each prediction, e.g., expected retrieved docs. - Derived from either dataset or trace. - When the dataset contains an expectations column, the value will be passed as is.
- When traces are provided as the evaluation dataset, this will be a dictionary that contains a set of assessments in the format of [assessment name]: [assessment value]. 
 - trace - A trace object corresponding to the prediction for the row. - Specified as a trace column in the dataset, or generated during the prediction. - The scorer function should return one of the following: - A boolean value 
- An integer value 
- A float value 
- A string value 
- A single Feedback object
- A list of Feedback objects
 - Note - The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer. - Parameters
- func – The scorer function to be decorated. 
- name – The name of the scorer. 
- aggregations – - A list of aggregation functions to apply to the scorer’s output. The aggregation functions can be either a string or a callable. - If a string, it must be one of [“min”, “max”, “mean”, “median”, “variance”, “p90”]. 
- If a callable, it must take a list of values and return a single value. 
 - By default, “mean” is used as the aggregation function. 
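The aggregation options can be pictured as a lookup from names to functions, with callables passed through unchanged. The sketch below uses only the standard library and is not how MLflow computes these values; in particular, the choice of population variance and of linear interpolation for p90 are assumptions, and aggregate is a hypothetical helper.

```python
import math
import statistics


def p90(values):
    """90th percentile with linear interpolation between closest ranks (an assumption)."""
    ordered = sorted(values)
    position = 0.9 * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# The six string names listed for the aggregations parameter.
NAMED_AGGREGATIONS = {
    "min": min,
    "max": max,
    "mean": statistics.fmean,
    "median": statistics.median,
    "variance": statistics.pvariance,  # population variance, assumed
    "p90": p90,
}


def aggregate(values, aggregations=None):
    """Apply named or callable aggregations to a non-empty list; default is the mean."""
    results = {}
    for agg in aggregations or ["mean"]:
        if callable(agg):
            results[agg.__name__] = float(agg(values))
        else:
            results[agg] = float(NAMED_AGGREGATIONS[agg](values))
    return results
```

A custom aggregation is then just a function of a list, for example def value_range(values): return max(values) - min(values), passed alongside the string names.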
 
 - Example - import json import mlflow from mlflow.genai.scorers import scorer from mlflow.entities import AssessmentSource, Feedback # Basic scorers that return primitive values @scorer def not_empty(outputs) -> bool: return outputs != "" @scorer def exact_match(outputs, expectations) -> bool: return outputs == expectations["expected_response"] @scorer def num_tool_calls(trace) -> int: spans = trace.search_spans(name="tool_call") return len(spans) # Use `Feedback` object to return additional information about the scorer's # result, such as a rationale for the score. @scorer def harmfulness(outputs) -> Feedback: import openai prompt = f''' Judge if the following text is harmful or not. Text: {outputs} Return the answer in a JSON object with the following format: {{ "harmful": true, "reason": "The text contains harmful content" }} Do not output any other characters than the json object. ''' response = openai.OpenAI().chat.completions.create( model="o4-mini", messages=[{"role": "user", "content": prompt}], ) payload = json.loads(response.choices[0].message.content) return Feedback( value=payload["harmful"], rationale=payload["reason"], source=AssessmentSource( source_type="LLM_JUDGE", source_id="openai:/o4-mini", ), ) # Use the scorer in an evaluation mlflow.genai.evaluate( data=data, scorers=[not_empty, exact_match, num_tool_calls, harmfulness], ) 
- mlflow.genai.search_datasets(experiment_ids: Optional[Union[str, list[str]]] = None, filter_string: Optional[str] = None, max_results: Optional[int] = None, order_by: Optional[list[str]] = None) list[mlflow.genai.datasets.evaluation_dataset.EvaluationDataset][source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Search for datasets (non-Databricks only). - Warning - Calling search_datasets() without any parameters will return ALL datasets in your tracking server. This can be slow or even crash your Python session if you have many datasets. Always use filters or max_results to limit the results. - Parameters
- experiment_ids – Single experiment ID (str) or list of experiment IDs to filter by. If None, searches across all experiments. 
- filter_string – SQL-like filter string for dataset attributes. If not specified, defaults to filtering for datasets created in the last 7 days. Supports filtering by: - name: Dataset name - created_by: User who created the dataset - last_updated_by: User who last updated the dataset - created_time: Creation timestamp (milliseconds since epoch) - tags.<key>: Tag values 
- max_results – Maximum number of results. If not specified, returns all datasets. 
- order_by – List of columns to order by. Each entry can include an optional “DESC” or “ASC” suffix (default is “ASC”). If not specified, defaults to [“created_time DESC”]. Supported columns: - name - created_time - last_update_time 
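The order_by entries follow a "column [ASC|DESC]" shape with ASC as the default. The following sketch shows one way to interpret such entries on plain dictionaries; it runs locally and does not reflect how the tracking server sorts results. parse_order_by and sort_records are hypothetical helpers, and the sample record keys are illustrative.

```python
def parse_order_by(entry: str) -> tuple[str, bool]:
    """Return (column, descending) for an entry such as 'name DESC'."""
    parts = entry.split()
    if len(parts) == 1:
        return parts[0], False  # ASC is the documented default
    if len(parts) == 2 and parts[1].upper() in ("ASC", "DESC"):
        return parts[0], parts[1].upper() == "DESC"
    raise ValueError(f"Invalid order_by entry: {entry!r}")


def sort_records(records, order_by=None):
    """Sort dict records by the given entries; the first entry has priority."""
    ordered = list(records)
    # Python's sort is stable, so applying keys from last to first
    # leaves the first entry as the primary sort key.
    for entry in reversed(order_by or ["created_time DESC"]):
        column, descending = parse_order_by(entry)
        ordered.sort(key=lambda record: record[column], reverse=descending)
    return ordered
```

With no order_by, the sketch falls back to "created_time DESC", matching the default described above.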
 
- Returns
- List of EvaluationDataset objects matching the search criteria 
 - Examples - from mlflow.genai.datasets import search_datasets # WARNING: This returns ALL datasets - use with caution! # all_datasets = search_datasets() # May be slow or crash # Better: Always use filters or limits recent_datasets = search_datasets(max_results=100) # Search in specific experiments exp_datasets = search_datasets(experiment_ids=["1", "2", "3"]) # Find production datasets prod_datasets = search_datasets( filter_string="tags.environment = 'production'", order_by=["name ASC"] ) # Iterate through results (pagination handled automatically) for dataset in prod_datasets: print(f"{dataset.name} (ID: {dataset.dataset_id})") print(f" Records: {len(dataset.records)}") print(f" Tags: {dataset.tags}") - Note - This API is not available in Databricks environments. Use Unity Catalog search capabilities in Databricks instead. 
- mlflow.genai.search_prompts(filter_string: str | None = None, max_results: int | None = None) PagedList[Prompt][source]
- Note - Experimental: This function may change or be removed in a future release without warning. 
- mlflow.genai.set_dataset_tags(dataset_id: str, tags: dict[str, typing.Any]) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Set tags for a dataset. - This implements a batch tag operation - existing tags are merged with new tags. To remove a tag, set its value to None or use delete_dataset_tag() instead. - Parameters
- dataset_id – The ID of the dataset. 
- tags – Dictionary of tags to set. Setting a value to None removes the tag. 
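The merge semantics described above (new tags are merged into existing ones, and a None value removes a tag) amount to the following dictionary operation. This is a local sketch of the behavior as documented here, not the server-side implementation; merge_tags is a hypothetical helper.

```python
from typing import Any, Optional


def merge_tags(existing: dict[str, Any], updates: dict[str, Optional[Any]]) -> dict[str, Any]:
    """Merge updates into existing tags; a None value removes the tag."""
    merged = dict(existing)  # leave the caller's dictionary untouched
    for key, value in updates.items():
        if value is None:
            merged.pop(key, None)  # removing a missing tag is not an error here
        else:
            merged[key] = value
    return merged
```

Tags that are not mentioned in the update are kept, which is what distinguishes a merge from a full replacement.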
 
 - Examples - from mlflow.genai.datasets import set_dataset_tags, get_dataset # Get your dataset dataset = get_dataset(dataset_id="d-8f3a2b1c4e5d6f7a8b9c0d1e2f3a4b5c") # Add or update multiple tags set_dataset_tags( dataset_id=dataset.dataset_id, tags={ "environment": "production", # Add new tag "version": "2.0", # Update existing tag "validated": "true", "validation_date": "2024-11-01", "team": "ml-platform", }, ) # Remove tags by setting to None set_dataset_tags( dataset_id=dataset.dataset_id, tags={ "deprecated_tag": None, # This removes the tag "old_version": None, # This also removes the tag }, ) # Update status after validation set_dataset_tags( dataset_id=dataset.dataset_id, tags={ "status": "production_ready", "coverage": "comprehensive", "last_review": "2024-11-01", "approved_by": "data_science_lead@company.com", }, ) - Note - This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog. 
- mlflow.genai.set_prompt_alias(name: str, alias: str, version: int) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Set an alias for a Prompt in the MLflow Prompt Registry. - Parameters
- name – The name of the prompt. 
- alias – The alias to set for the prompt. 
- version – The version of the prompt. 
 
 - Example: - import mlflow # Set an alias for the prompt mlflow.genai.set_prompt_alias(name="my_prompt", version=1, alias="production") # Load the prompt by alias (use "@" to specify the alias) prompt = mlflow.genai.load_prompt("prompts:/my_prompt@production") # Switch the alias to a new version of the prompt mlflow.genai.set_prompt_alias(name="my_prompt", version=2, alias="production") # Delete the alias mlflow.genai.delete_prompt_alias(name="my_prompt", alias="production") 
- mlflow.genai.to_predict_fn(endpoint_uri: str) Callable[[...], Any][source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Convert an endpoint URI to a predict function. - Parameters
- endpoint_uri – The endpoint URI to convert. 
- Returns
- A predict function that can be used to make predictions. 
 - Example - The following example assumes that the model serving endpoint accepts a JSON object with a messages key. Please adjust the input based on the actual schema of the model serving endpoint. - import mlflow from mlflow.genai.scorers import get_all_scorers data = [ { "inputs": { "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is MLflow?"}, ] } }, { "inputs": { "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is Spark?"}, ] } }, ] predict_fn = mlflow.genai.to_predict_fn("endpoints:/chat") mlflow.genai.evaluate( data=data, predict_fn=predict_fn, scorers=get_all_scorers(), ) - You can also directly invoke the function to validate if the endpoint works properly with your input schema. - predict_fn(**data[0]["inputs"]) 
- class mlflow.genai.scorers.Correctness(*, name: str = 'correctness', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - Correctness ensures that the agent’s responses are correct and accurate. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “correctness”. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
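The accepted shapes of the model argument ("databricks" or <provider>:/<model-name>) can be checked with a few lines of string handling. This is a sketch for illustration only; parse_judge_model is a hypothetical helper, and it does not verify that the provider is one MLflow or LiteLLM actually supports.

```python
from typing import Optional


def parse_judge_model(model: str) -> tuple[str, Optional[str]]:
    """Split a judge model identifier into (provider, model_name)."""
    # The bare string "databricks" is the one documented special case.
    if model == "databricks":
        return "databricks", None
    provider, separator, name = model.partition(":/")
    if not separator or not provider or not name:
        raise ValueError(f"Expected 'databricks' or '<provider>:/<model-name>', got {model!r}")
    return provider, name
```

For instance, "openai:/gpt-4.1-mini" splits into ("openai", "gpt-4.1-mini"), while a bare "gpt-4" is rejected because it names no provider.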
 
 
 - Example (direct usage): - import mlflow from mlflow.genai.scorers import Correctness assessment = Correctness(name="my_correctness")( inputs={ "question": "What is the difference between reduceByKey and groupByKey in Spark?" }, outputs=( "reduceByKey aggregates data before shuffling, whereas groupByKey " "shuffles all data, making reduceByKey more efficient." ), expectations=[ {"expected_response": "reduceByKey aggregates data before shuffling"}, {"expected_response": "groupByKey shuffles all data"}, ], ) print(assessment) - Example (with evaluate): - import mlflow from mlflow.genai.scorers import Correctness data = [ { "inputs": { "question": ( "What is the difference between reduceByKey and groupByKey in Spark?" ) }, "outputs": ( "reduceByKey aggregates data before shuffling, whereas groupByKey " "shuffles all data, making reduceByKey more efficient." ), "expectations": { "expected_response": ( "reduceByKey aggregates data before shuffling. " "groupByKey shuffles all data" ), }, } ] result = mlflow.genai.evaluate(data=data, scorers=[Correctness()]) - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the Correctness judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict. 
 - model_post_init(context: Any, /) None
- This function is meant to behave like a BaseModel method to initialise private attributes. - It takes context as an argument since that’s what pydantic-core passes when calling it. - Parameters
- self – The BaseModel instance. 
- context – The context. 
 
 
 - validate_columns(columns: set[str]) None[source]
 
- class mlflow.genai.scorers.ExpectationsGuidelines(*, name: str = 'expectations_guidelines', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - This scorer evaluates whether the agent’s response follows specific constraints or instructions provided for each row in the input dataset. This scorer is useful when you have a different set of guidelines for each example. - To use this scorer, the input dataset should contain the expectations column with the guidelines field. Then pass this scorer to mlflow.genai.evaluate for running full evaluation on the input dataset. - Parameters
- name – The name of the scorer. Defaults to “expectations_guidelines”. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example: - In this example, the guidelines specified in the guidelines field of the expectations column will be applied to each example individually. The evaluation result will contain a single “expectations_guidelines” score. - import mlflow from mlflow.genai.scorers import ExpectationsGuidelines data = [ { "inputs": {"question": "What is the capital of France?"}, "outputs": "The capital of France is Paris.", "expectations": { "guidelines": ["The response must be factual and concise"], }, }, { "inputs": {"question": "How to learn Python?"}, "outputs": "You can read a book or take a course.", "expectations": { "guidelines": ["The response must be helpful and encouraging"], }, }, ] mlflow.genai.evaluate(data=data, scorers=[ExpectationsGuidelines()]) - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the ExpectationsGuidelines judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict. 
 - model_post_init(context: Any, /) None
- This function is meant to behave like a BaseModel method to initialise private attributes. - It takes context as an argument since that’s what pydantic-core passes when calling it. - Parameters
- self – The BaseModel instance. 
- context – The context. 
 
 
 - validate_columns(columns: set[str]) None[source]
 
- class mlflow.genai.scorers.Guidelines(*, name: str = 'guidelines', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'outputs'}, guidelines: str | list[str], model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - Guideline adherence evaluates whether the agent’s response follows specific constraints or instructions provided in the guidelines. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “guidelines”. 
- guidelines – A single guideline text or a list of guidelines. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example (direct usage): - import mlflow from mlflow.genai.scorers import Guidelines # Create a global judge english = Guidelines( name="english_guidelines", guidelines=["The response must be in English"], ) feedback = english( inputs={"question": "What is the capital of France?"}, outputs="The capital of France is Paris.", ) print(feedback) - Example (with evaluate): - In the following example, the guidelines specified in the english and clarify scorers will be uniformly applied to all the examples in the dataset. The evaluation result will contain two scores, “english” and “clarify”. - import mlflow from mlflow.genai.scorers import Guidelines english = Guidelines( name="english", guidelines=["The response must be in English"], ) clarify = Guidelines( name="clarify", guidelines=["The response must be clear, coherent, and concise"], ) data = [ { "inputs": {"question": "What is the capital of France?"}, "outputs": "The capital of France is Paris.", }, { "inputs": {"question": "What is the capital of Germany?"}, "outputs": "The capital of Germany is Berlin.", }, ] mlflow.genai.evaluate(data=data, scorers=[english, clarify]) - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the Guidelines judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict. 
 
- class mlflow.genai.scorers.RelevanceToQuery(*, name: str = 'relevance_to_query', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - Relevance ensures that the agent’s response directly addresses the user’s input without deviating into unrelated topics. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “relevance_to_query”. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example (direct usage): - import mlflow from mlflow.genai.scorers import RelevanceToQuery assessment = RelevanceToQuery(name="my_relevance_to_query")( inputs={"question": "What is the capital of France?"}, outputs="The capital of France is Paris.", ) print(assessment) - Example (with evaluate): - import mlflow from mlflow.genai.scorers import RelevanceToQuery data = [ { "inputs": {"question": "What is the capital of France?"}, "outputs": "The capital of France is Paris.", } ] result = mlflow.genai.evaluate(data=data, scorers=[RelevanceToQuery()]) - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the RelevanceToQuery judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict. 
 
- class mlflow.genai.scorers.RetrievalGroundedness(*, name: str = 'retrieval_groundedness', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'trace'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - RetrievalGroundedness assesses whether the agent’s response is aligned with the information provided in the retrieved context. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “retrieval_groundedness”. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example (direct usage): - import mlflow from mlflow.genai.scorers import RetrievalGroundedness trace = mlflow.get_trace("<your-trace-id>") feedback = RetrievalGroundedness(name="my_retrieval_groundedness")(trace=trace) print(feedback) - Example (with evaluate): - import mlflow from mlflow.genai.scorers import RetrievalGroundedness data = mlflow.search_traces(...) result = mlflow.genai.evaluate(data=data, scorers=[RetrievalGroundedness()]) - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the RetrievalGroundedness judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict. 
 
- class mlflow.genai.scorers.RetrievalRelevance(*, name: str = 'retrieval_relevance', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'trace'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - Retrieval relevance measures whether each chunk is relevant to the input request. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “retrieval_relevance”. 
- model – - Judge model to use. Must be either “databricks” or of the form <provider>:/<model-name>, such as “openai:/gpt-4.1-mini” or “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”]; more providers are supported through LiteLLM. The default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example (direct usage):

    import mlflow
    from mlflow.genai.scorers import RetrievalRelevance

    trace = mlflow.get_trace("<your-trace-id>")
    feedbacks = RetrievalRelevance(name="my_retrieval_relevance")(trace=trace)
    print(feedbacks)

 - Example (with evaluate):

    import mlflow
    from mlflow.genai.scorers import RetrievalRelevance

    data = mlflow.search_traces(...)
    result = mlflow.genai.evaluate(data=data, scorers=[RetrievalRelevance()])

 - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the RetrievalRelevance judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict. 
 
- class mlflow.genai.scorers.RetrievalSufficiency(*, name: str = 'retrieval_sufficiency', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'trace'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - Retrieval sufficiency evaluates whether the retrieved documents provide all necessary information to generate the expected response. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “retrieval_sufficiency”. 
- model – - Judge model to use. Must be either “databricks” or of the form <provider>:/<model-name>, such as “openai:/gpt-4.1-mini” or “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”]; more providers are supported through LiteLLM. The default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example (direct usage):

    import mlflow
    from mlflow.genai.scorers import RetrievalSufficiency

    trace = mlflow.get_trace("<your-trace-id>")
    feedback = RetrievalSufficiency(name="my_retrieval_sufficiency")(trace=trace)
    print(feedback)

 - Example (with evaluate):

    import mlflow
    from mlflow.genai.scorers import RetrievalSufficiency

    data = mlflow.search_traces(...)
    result = mlflow.genai.evaluate(data=data, scorers=[RetrievalSufficiency()])

 - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the RetrievalSufficiency judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict. 
 - model_post_init(context: Any, /) None
- This function is meant to behave like a BaseModel method to initialise private attributes. - It takes context as an argument since that’s what pydantic-core passes when calling it. - Parameters
- self – The BaseModel instance. 
- context – The context. 
 
 
 - validate_columns(columns: set[str]) None[source]
 
- class mlflow.genai.scorers.Safety(*, name: str = 'safety', aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None, required_columns: set[str] = {'inputs', 'outputs'}, model: str | None = None)[source]
- Bases: - mlflow.genai.scorers.builtin_scorers.BuiltInScorer- Note - Experimental: This class may change or be removed in a future release without warning. - Safety ensures that the agent’s responses do not contain harmful, offensive, or toxic content. - You can invoke the scorer directly with a single input for testing, or pass it to mlflow.genai.evaluate for running full evaluation on a dataset. - Parameters
- name – The name of the scorer. Defaults to “safety”. 
- model – - Judge model to use. Must be either “databricks” or of the form <provider>:/<model-name>, such as “openai:/gpt-4.1-mini” or “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”]; more providers are supported through LiteLLM. The default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
 - Example (direct usage):

    import mlflow
    from mlflow.genai.scorers import Safety

    assessment = Safety(name="my_safety")(outputs="The capital of France is Paris.")
    print(assessment)

 - Example (with evaluate):

    import mlflow
    from mlflow.genai.scorers import Safety

    data = [
        {
            "inputs": {"question": "What is the capital of France?"},
            "outputs": "The capital of France is Paris.",
        }
    ]
    result = mlflow.genai.evaluate(data=data, scorers=[Safety()])

 - get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for the Safety judge. - Returns
- List of JudgeField objects defining the input fields based on the __call__ method. 
 
 - model_config: ClassVar[ConfigDict] = {}
- Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict. 
 
- class mlflow.genai.scorers.ScorerSamplingConfig(sample_rate: Optional[float] = None, filter_string: Optional[str] = None)[source]
- Bases: - object- Configuration for registered scorer sampling. 
- mlflow.genai.scorers.delete_scorer(*, name: str, experiment_id: Optional[str] = None, version: Optional[Union[int, str]] = None) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Delete a registered scorer from the MLflow experiment. - This function permanently removes scorer registrations. The behavior of this function varies depending on the backend store and version parameter: - OSS MLflow Tracking Backend:
- Supports versioning with granular deletion options 
- Can delete specific versions or all versions of a scorer by setting version parameter to “all” 
 
- Databricks Backend:
- Does not support versioning 
- Deletes the entire scorer regardless of version parameter 
- version parameter must be None 
 
 - Parameters
- name (str) – The name of the scorer to delete. This must match exactly with the name used during scorer registration. 
- experiment_id (str, optional) – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment as determined by mlflow.get_experiment_by_name() or mlflow.set_experiment().
- version (int | str | None, optional) – The version(s) to delete. For the OSS MLflow tracking backend: if None, deletes only the latest version; if an integer, deletes that specific version; if the string ‘all’, deletes all versions of the scorer. For the Databricks backend, version must be None (versioning is not supported). 
 
- Raises
- mlflow.MlflowException – If the scorer with the specified name is not found in the experiment, if the specified version doesn’t exist, or if versioning is not supported for the current backend. 
 - Example

    from mlflow.genai.scorers import delete_scorer

    # Delete the latest version of a scorer from current experiment
    delete_scorer(name="accuracy_scorer")

    # Delete a specific version of a scorer
    delete_scorer(name="safety_scorer", version=2)

    # Delete all versions of a scorer
    delete_scorer(name="relevance_scorer", version="all")

    # Delete a scorer from a specific experiment
    delete_scorer(name="harmfulness_scorer", experiment_id="123", version=1)
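The version rules for the OSS tracking backend can be summarized in a few lines of plain Python. The sketch below is not MLflow's implementation: `versions_to_delete` is a hypothetical helper that works on an in-memory list of version numbers, and it raises `LookupError` where the real function is documented to raise `mlflow.MlflowException`.

```python
from typing import List, Optional, Union


def versions_to_delete(
    existing: List[int], version: Optional[Union[int, str]] = None
) -> List[int]:
    """Return which of the existing versions a delete request would target.

    Hypothetical helper mirroring the documented OSS-backend rules.
    """
    if not existing:
        raise LookupError("scorer not found")
    if version is None:
        # No version given: only the latest (highest) version is targeted.
        return [max(existing)]
    if version == "all":
        # The string "all" targets every version.
        return sorted(existing)
    if isinstance(version, int) and version in existing:
        # An integer targets exactly that version.
        return [version]
    raise LookupError(f"version {version!r} does not exist")


print(versions_to_delete([1, 2, 3]))
print(versions_to_delete([1, 2, 3], version="all"))
```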
- mlflow.genai.scorers.get_all_scorers() list[mlflow.genai.scorers.builtin_scorers.BuiltInScorer][source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Returns a list of all built-in scorers.

 - Example:

    import mlflow
    from mlflow.genai.scorers import get_all_scorers

    data = [
        {
            "inputs": {"question": "What is the capital of France?"},
            "outputs": "The capital of France is Paris.",
            "expectations": {"expected_response": "Paris is the capital city of France."},
        }
    ]
    result = mlflow.genai.evaluate(data=data, scorers=get_all_scorers())
- mlflow.genai.scorers.get_scorer(*, name: str, experiment_id: Optional[str] = None, version: Optional[int] = None) Scorer[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Retrieve a specific registered scorer by name and optional version. - This function retrieves a single Scorer instance from the specified experiment. If no version is specified, it returns the latest (highest version number) scorer with the given name. - Parameters
- name (str) – The name of the registered scorer to retrieve. This must match exactly with the name used during scorer registration. 
- experiment_id (str, optional) – The ID of the MLflow experiment containing the scorer. If None, uses the currently active experiment as determined by mlflow.get_experiment_by_name() or mlflow.set_experiment().
- version (int, optional) – The specific version of the scorer to retrieve. If None, returns the scorer with the highest version number (latest version). 
 
- Returns
- A Scorer object representing the requested scorer. 
- Return type
- Scorer 
- Raises
- mlflow.MlflowException – If the scorer with the specified name is not found in the experiment, if the specified version doesn’t exist, if the experiment doesn’t exist, or if there are issues with the backend store connection. 
 - Example

    from mlflow.genai.scorers import get_scorer

    # Get the latest version of a scorer
    latest_scorer = get_scorer(name="accuracy_scorer")

    # Get a specific version of a scorer
    v2_scorer = get_scorer(name="safety_scorer", version=2)

    # Get a scorer from a specific experiment
    scorer = get_scorer(name="relevance_scorer", experiment_id="123")

 - Note - When no version is specified, the function automatically returns the latest version. 
- This function works with both OSS MLflow tracking backend and Databricks backend. 
- For Databricks backend, versioning is not supported, so the version parameter should be None. 
 
- mlflow.genai.scorers.list_scorers(*, experiment_id: Optional[str] = None) list[Scorer][source]
- Note - Experimental: This function may change or be removed in a future release without warning. - List all registered scorers for an experiment. - This function retrieves all scorers that have been registered in the specified experiment. For each scorer name, only the latest version is returned. - The function automatically determines the appropriate backend store (MLflow tracking store, Databricks, etc.) based on the current MLflow configuration and experiment location. - Parameters
- experiment_id (str, optional) – The ID of the MLflow experiment containing the scorers. If None, uses the currently active experiment as determined by mlflow.get_experiment_by_name() or mlflow.set_experiment().
- Returns
- A list of Scorer objects, each representing the latest version of a registered scorer with its current configuration. The list may be empty if no scorers have been registered in the experiment. 
 
- Return type
- list[Scorer] 
- Raises
- mlflow.MlflowException – If the experiment doesn’t exist or if there are issues with the backend store connection. 
 - Example

    from mlflow.genai.scorers import list_scorers

    # List all scorers in the current experiment
    scorers = list_scorers()

    # List all scorers in a specific experiment
    scorers = list_scorers(experiment_id="123")

    # Process the returned scorers
    for scorer in scorers:
        print(f"Scorer: {scorer.name}")

 - Note - Only the latest version of each scorer is returned. 
- This function works with both OSS MLflow tracking backend and Databricks backend. 
 
- mlflow.genai.scorers.scorer(func=None, *, name: Optional[str] = None, aggregations: Optional[list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]]] = None)[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - A decorator to define a custom scorer that can be used in mlflow.genai.evaluate(). - The scorer function should take in a subset of the following parameters:
- inputs – A single input to the target model/app. Source: derived from either the dataset or the trace.
  - When the dataset contains an inputs column, the value is passed as is.
  - When traces are provided as the evaluation dataset, the value is derived from the inputs field of the trace (i.e. inputs captured as the root span of the trace).
- outputs – A single output from the target model/app. Source: derived from the dataset, the trace, or the output of predict_fn.
  - When the dataset contains an outputs column, the value is passed as is.
  - When predict_fn is provided, MLflow makes a prediction using the inputs and predict_fn and passes the result as the outputs.
  - When traces are provided as the evaluation dataset, the value is derived from the response field of the trace (i.e. outputs captured as the root span of the trace).
- expectations – Ground truth or any expectation for each prediction, e.g. expected retrieved docs. Source: derived from either the dataset or the trace.
  - When the dataset contains an expectations column, the value is passed as is.
  - When traces are provided as the evaluation dataset, the value is a dictionary that contains a set of assessments in the format [assessment name]: [assessment value].
- trace – A trace object corresponding to the prediction for the row. Source: specified as a trace column in the dataset, or generated during the prediction.

 - The scorer function should return one of the following:
- A boolean value 
- An integer value 
- A float value 
- A string value 
- A single Feedback object
- A list of Feedback objects

 - Note - The metric name will be determined by the scorer function’s name or a custom name specified in the name parameter for the scorer. - Parameters
- func – The scorer function to be decorated. 
- name – The name of the scorer. 
- aggregations – - A list of aggregation functions to apply to the scorer’s output. The aggregation functions can be either a string or a callable. - If a string, it must be one of [“min”, “max”, “mean”, “median”, “variance”, “p90”]. 
- If a callable, it must take a list of values and return a single value. 
 - By default, “mean” is used as the aggregation function. 
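A callable aggregation, as described above, is simply a function that takes the list of per-row values and returns one number. The two functions below are illustrative examples, not built-ins; their names (`value_range`, `pass_rate`) and the 0.5 threshold are assumptions made for this sketch.

```python
from typing import List, Union

Number = Union[int, float]


def value_range(values: List[Number]) -> float:
    """Spread between the largest and smallest score."""
    return float(max(values) - min(values))


def pass_rate(values: List[Number]) -> float:
    """Fraction of scores at or above an assumed threshold of 0.5."""
    return sum(1 for v in values if v >= 0.5) / len(values)


scores = [0.2, 0.5, 0.9, 1.0]
print(value_range(scores), pass_rate(scores))
```

Functions like these could then be listed alongside the built-in names, for example `aggregations=["mean", value_range]`.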
 
 - Example

    import json

    import mlflow
    from mlflow.entities import AssessmentSource, Feedback
    from mlflow.genai.scorers import scorer


    # Basic scorers that return primitive values
    @scorer
    def not_empty(outputs) -> bool:
        return outputs != ""


    @scorer
    def exact_match(outputs, expectations) -> bool:
        return outputs == expectations["expected_response"]


    @scorer
    def num_tool_calls(trace) -> int:
        spans = trace.search_spans(name="tool_call")
        return len(spans)


    # Use a `Feedback` object to return additional information about the scorer's
    # result, such as a rationale for the score.
    @scorer
    def harmfulness(outputs) -> Feedback:
        import openai

        prompt = f'''
            Judge if the following text is harmful or not.

            Text:
            {outputs}

            Return the answer in a JSON object with the following format:
            {{
                "harmful": true,
                "reason": "The text contains harmful content"
            }}

            Do not output any other characters than the json object.
        '''
        response = openai.OpenAI().chat.completions.create(
            model="o4-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        payload = json.loads(response.choices[0].message.content)
        return Feedback(
            value=payload["harmful"],
            rationale=payload["reason"],
            source=AssessmentSource(
                source_type="LLM_JUDGE",
                source_id="openai:/o4-mini",
            ),
        )


    # Use the scorers in an evaluation. `data` is the evaluation dataset and
    # is assumed to be defined elsewhere.
    mlflow.genai.evaluate(
        data=data,
        scorers=[not_empty, exact_match, num_tool_calls, harmfulness],
    )
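The rule that a scorer may declare any subset of `inputs`, `outputs`, `expectations`, and `trace` can be pictured as signature-based dispatch. The sketch below shows one way a caller could supply only the arguments a function declares; `call_with_declared_args` is a hypothetical helper and is not a description of how MLflow does this internally.

```python
import inspect
from typing import Any, Callable, Dict

SUPPORTED = ("inputs", "outputs", "expectations", "trace")


def call_with_declared_args(func: Callable[..., Any], row: Dict[str, Any]) -> Any:
    """Call func with only the supported parameters it declares."""
    declared = list(inspect.signature(func).parameters)
    unknown = [name for name in declared if name not in SUPPORTED]
    if unknown:
        raise TypeError(f"Unsupported scorer parameters: {unknown}")
    # Parameters missing from the row are passed as None.
    return func(**{name: row.get(name) for name in declared})


def not_empty(outputs) -> bool:
    return outputs != ""


def exact_match(outputs, expectations) -> bool:
    return outputs == expectations["expected_response"]


row = {
    "inputs": {"question": "What is the capital of France?"},
    "outputs": "Paris",
    "expectations": {"expected_response": "Paris"},
}
print(call_with_declared_args(not_empty, row))
print(call_with_declared_args(exact_match, row))
```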
- Databricks Agent Datasets Python SDK. For more details see Databricks Agent Evaluation:
- <https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html> 
The API docs can be found here: <https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#datasets>
- class mlflow.genai.datasets.EvaluationDataset(dataset)[source]
- Bases: - mlflow.data.dataset.Dataset,- mlflow.data.pyfunc_dataset_mixin.PyFuncConvertibleDatasetMixin- The public API for evaluation datasets in MLflow’s GenAI module. - This class provides a unified interface for evaluation datasets, supporting both: - Standard MLflow evaluation datasets (backed by MLflow’s tracking store) 
- Databricks managed datasets (backed by Unity Catalog tables) through the databricks-agents library 
 - property create_time: int | str | None
- Alias for created_time (for backward compatibility with managed datasets). 
 - property digest: str | None
- String digest (hash) of the dataset, provided by the caller, that uniquely identifies the dataset. 
 - classmethod from_dict(data: dict[str, typing.Any]) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Create instance from dictionary representation. - Note: This creates an MLflow dataset from serialized data. Databricks managed datasets are loaded directly from Unity Catalog, not from dict. 
 - classmethod from_proto(proto)[source]
- Create instance from protobuf representation. - Note: This creates an MLflow dataset from serialized protobuf data. Databricks managed datasets are loaded directly from Unity Catalog, not from protobuf. 
 - has_records() bool[source]
- Check if dataset records are loaded without triggering a load. 
 - merge_records(records: list[dict[str, Any]] | pd.DataFrame | pyspark.sql.DataFrame) EvaluationDataset[source]
- Merge records into the dataset. 
 - set_profile(profile: str) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Set the profile of the dataset. 
 - to_df() pd.DataFrame[source]
- Convert the dataset to a pandas DataFrame. 
 - to_dict() dict[str, typing.Any][source]
- Convert to dictionary representation. 
 - to_evaluation_dataset(path=None, feature_names=None)[source]
- Converts the dataset to the legacy EvaluationDataset for model evaluation. Required for use with mlflow.evaluate(). 
 - to_proto()[source]
- Convert to protobuf representation. 
 
- mlflow.genai.datasets.add_dataset_to_experiments(dataset_id: str, experiment_ids: list[str]) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Add a dataset to additional experiments. - This allows reusing datasets across multiple experiments for evaluation purposes. - Parameters
- dataset_id – The ID of the dataset to update. 
- experiment_ids – List of experiment IDs to associate with the dataset. 
 
- Returns
- The updated EvaluationDataset with new experiment associations. 
 - Example

    import mlflow
    from mlflow.genai.datasets import add_dataset_to_experiments

    # Add dataset to new experiments
    dataset = add_dataset_to_experiments(
        dataset_id="d-abc123", experiment_ids=["1", "2", "3"]
    )
    print(f"Dataset now associated with {len(dataset.experiment_ids)} experiments")
- mlflow.genai.datasets.create_dataset(name: str | None = None, experiment_id: str | list[str] | None = None, tags: dict[str, typing.Any] | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Note - Parameter uc_table_name is deprecated. Use name instead. - Create a dataset with the given name and associate it with the given experiment. - Parameters
- name – The name of the dataset. In Databricks, this is the UC table name. 
- experiment_id – The ID of the experiment(s) to associate the dataset with. If not provided, the current experiment is inferred from the environment. 
- tags – Dictionary of tags to apply to the dataset. Not supported in Databricks. 
 
- Returns
- An EvaluationDataset object representing the created dataset. 
 - Examples

    from mlflow.genai.datasets import create_dataset

    # Create a dataset with a single experiment
    dataset = create_dataset(
        name="customer_support_qa_v1",
        experiment_id="0",  # Default experiment
        tags={
            "version": "1.0",
            "purpose": "regression_testing",
            "model": "gpt-4",
            "team": "ml-platform",
        },
    )
    print(f"Created dataset: {dataset.dataset_id}")
    # Output: Created dataset: d-1a2b3c4d5e6f7890abcdef1234567890

    # Create a dataset linked to multiple experiments
    multi_exp_dataset = create_dataset(
        name="cross_team_eval_dataset",
        experiment_id=["1", "2", "5"],  # Multiple experiment IDs
        tags={
            "coverage": "comprehensive",
            "status": "development",
        },
    )

    # Create a dataset without tags (minimal example)
    simple_dataset = create_dataset(
        name="quick_test_dataset",
        experiment_id="3",  # Specific experiment
    )
- mlflow.genai.datasets.delete_dataset(name: str | None = None, dataset_id: str | None = None) None[source]
- Note - Parameter uc_table_name is deprecated. Use name instead. - Delete a dataset. - Parameters
- name – The name of the dataset (Databricks only). In Databricks, this is the UC table name. 
- dataset_id – The ID of the dataset. 
 
 - Note - In Databricks environments: Use ‘name’ to specify the dataset. 
- Outside of Databricks: Use ‘dataset_id’ to specify the dataset. 
 - Examples

    from mlflow.genai.datasets import delete_dataset, search_datasets

    # Delete a specific dataset by ID (non-Databricks)
    delete_dataset(dataset_id="d-4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e")

    # Clean up old test datasets
    test_datasets = search_datasets(
        filter_string="name LIKE 'test_%' AND tags.environment = 'development'",
        order_by=["created_time ASC"],
    )

    # Delete datasets older than the most recent 5
    if len(test_datasets) > 5:
        for dataset in test_datasets[:-5]:  # Keep the 5 most recent
            print(f"Deleting old test dataset: {dataset.name}")
            delete_dataset(dataset_id=dataset.dataset_id)

    # Delete datasets with specific criteria
    deprecated_datasets = search_datasets(filter_string="tags.status = 'deprecated'")
    for dataset in deprecated_datasets:
        delete_dataset(dataset_id=dataset.dataset_id)
        print(f"Deleted deprecated dataset: {dataset.name}")

 - Warning - Deleting a dataset is permanent and cannot be undone. All associated records, tags, and metadata will be permanently removed. 
- mlflow.genai.datasets.delete_dataset_tag(dataset_id: str, key: str) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Delete a tag from a dataset. - Parameters
- dataset_id – The ID of the dataset. 
- key – The tag key to delete. 
 
 - Examples

    from mlflow.genai.datasets import delete_dataset_tag, get_dataset

    # Get your dataset
    dataset = get_dataset(dataset_id="d-9e8f7c6b5a4d3e2f1a0b9c8d7e6f5a4b")

    # Remove a single tag
    delete_dataset_tag(dataset_id=dataset.dataset_id, key="deprecated")

    # Remove outdated tags during cleanup
    outdated_tags = ["old_version", "temp_flag", "development_only"]
    for tag_key in outdated_tags:
        delete_dataset_tag(dataset_id=dataset.dataset_id, key=tag_key)

    # Check remaining tags
    updated_dataset = get_dataset(dataset_id=dataset.dataset_id)
    print(f"Remaining tags: {updated_dataset.tags}")

 - Note - This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog. 
- mlflow.genai.datasets.get_dataset(name: str | None = None, dataset_id: str | None = None) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Note - Parameter uc_table_name is deprecated. Use name instead. - Get the dataset with the given name or ID. - Parameters
- name – The name of the dataset (Databricks only). In Databricks, this is the UC table name. 
- dataset_id – The ID of the dataset. 
 
- Returns
- An EvaluationDataset object representing the retrieved dataset. 
 - Note - In Databricks environments: Use ‘name’ to specify the dataset. 
- Outside of Databricks: Use ‘dataset_id’ to specify the dataset. 
 - Examples

    from mlflow.genai.datasets import get_dataset

    # Get a dataset by ID (non-Databricks)
    dataset = get_dataset(dataset_id="d-7f2e3a9b8c1d4e5f6a7b8c9d0e1f2a3b")

    # Access dataset properties
    print(f"Dataset name: {dataset.name}")
    print(f"Number of records: {len(dataset.records)}")
    print(f"Tags: {dataset.tags}")
    print(f"Created by: {dataset.created_by}")

    # Work with the dataset
    df = dataset.to_df()  # Convert to pandas DataFrame
    schema = dataset.schema  # Get auto-computed schema
    profile = dataset.profile  # Get dataset statistics

    # Add new records to the dataset
    new_test_cases = [
        {
            "inputs": {"question": "What is MLflow?"},
            "expectations": {"accuracy": 0.95, "contains_tracking": True},
        }
    ]
    dataset.merge_records(new_test_cases)
- mlflow.genai.datasets.remove_dataset_from_experiments(dataset_id: str, experiment_ids: list[str]) mlflow.genai.datasets.evaluation_dataset.EvaluationDataset[source]
- Remove a dataset from experiments. - This operation is idempotent - removing non-existent associations will not raise an error but will issue a warning. - Parameters
- dataset_id – The ID of the dataset to update. 
- experiment_ids – List of experiment IDs to disassociate from the dataset. 
 
- Returns
- The updated EvaluationDataset after removing experiment associations. 
 - Example

    import mlflow
    from mlflow.genai.datasets import remove_dataset_from_experiments

    # Remove dataset from experiments
    dataset = remove_dataset_from_experiments(
        dataset_id="d-abc123", experiment_ids=["1", "2"]
    )
    print(f"Dataset now associated with {len(dataset.experiment_ids)} experiments")
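The idempotent behaviour described for this function (warn, but do not fail, when an association does not exist) can be modelled on a plain list of experiment IDs. This is a sketch only: `remove_experiment_ids` is a hypothetical helper and the warning text is an assumption, not MLflow's actual message.

```python
import warnings
from typing import Iterable, List


def remove_experiment_ids(current: List[str], to_remove: Iterable[str]) -> List[str]:
    """Return current without the given IDs, warning about IDs that were absent."""
    to_remove = list(to_remove)
    missing = [exp_id for exp_id in to_remove if exp_id not in current]
    if missing:
        # Removing a non-existent association is not an error, only a warning.
        warnings.warn(f"Dataset is not associated with experiments: {missing}")
    drop = set(to_remove)
    return [exp_id for exp_id in current if exp_id not in drop]


print(remove_experiment_ids(["1", "2", "3"], ["2"]))
```

Calling the helper a second time with the same IDs returns the same list and only emits a warning, which is what makes the operation safe to repeat.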
- mlflow.genai.datasets.search_datasets(experiment_ids: Optional[Union[str, list[str]]] = None, filter_string: Optional[str] = None, max_results: Optional[int] = None, order_by: Optional[list[str]] = None) list[mlflow.genai.datasets.evaluation_dataset.EvaluationDataset][source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Search for datasets (non-Databricks only). - Warning - Calling search_datasets() without any parameters will return ALL datasets in your tracking server. This can be slow or even crash your Python session if you have many datasets. Always use filters or max_results to limit the results. - Parameters
- experiment_ids – Single experiment ID (str) or list of experiment IDs to filter by. If None, searches across all experiments. 
- filter_string – SQL-like filter string for dataset attributes. If not specified, defaults to filtering for datasets created in the last 7 days. Supports filtering by:
  - name: Dataset name
  - created_by: User who created the dataset
  - last_updated_by: User who last updated the dataset
  - created_time: Creation timestamp (milliseconds since epoch)
  - tags.<key>: Tag values
- max_results – Maximum number of results. If not specified, returns all datasets. 
- order_by – List of columns to order by. Each entry can include an optional “DESC” or “ASC” suffix (default is “ASC”). If not specified, defaults to [“created_time DESC”]. Supported columns:
  - name
  - created_time
  - last_update_time
 
- Returns
- List of EvaluationDataset objects matching the search criteria. 
 - Examples

    from mlflow.genai.datasets import search_datasets

    # WARNING: This returns ALL datasets - use with caution!
    # all_datasets = search_datasets()  # May be slow or crash

    # Better: Always use filters or limits
    recent_datasets = search_datasets(max_results=100)

    # Search in specific experiments
    exp_datasets = search_datasets(experiment_ids=["1", "2", "3"])

    # Find production datasets
    prod_datasets = search_datasets(
        filter_string="tags.environment = 'production'", order_by=["name ASC"]
    )

    # Iterate through results (pagination handled automatically)
    for dataset in prod_datasets:
        print(f"{dataset.name} (ID: {dataset.dataset_id})")
        print(f" Records: {len(dataset.records)}")
        print(f" Tags: {dataset.tags}")

 - Note - This API is not available in Databricks environments. Use Unity Catalog search capabilities in Databricks instead. 
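The `order_by` entry format described for this function (a column name with an optional “ASC” or “DESC” suffix, defaulting to “ASC”) can be sketched as a small parser. `parse_order_by` is a hypothetical helper for illustration; how MLflow itself validates these entries is not shown here, and accepting lowercase suffixes is an assumption of the sketch.

```python
from typing import List, Tuple

SUPPORTED_COLUMNS = {"name", "created_time", "last_update_time"}


def parse_order_by(entries: List[str]) -> List[Tuple[str, bool]]:
    """Turn entries like "created_time DESC" into (column, descending) pairs."""
    parsed = []
    for entry in entries:
        parts = entry.split()
        if len(parts) == 1:
            # No suffix: ascending order is the documented default.
            column, direction = parts[0], "ASC"
        elif len(parts) == 2:
            column, direction = parts[0], parts[1].upper()
        else:
            raise ValueError(f"Malformed order_by entry: {entry!r}")
        if column not in SUPPORTED_COLUMNS or direction not in ("ASC", "DESC"):
            raise ValueError(f"Unsupported order_by entry: {entry!r}")
        parsed.append((column, direction == "DESC"))
    return parsed


print(parse_order_by(["name", "created_time DESC"]))
```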
- mlflow.genai.datasets.set_dataset_tags(dataset_id: str, tags: dict[str, typing.Any]) None[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Set tags for a dataset. - This implements a batch tag operation - existing tags are merged with new tags. To remove a tag, set its value to None or use delete_dataset_tag() instead. - Parameters
- dataset_id – The ID of the dataset. 
- tags – Dictionary of tags to set. Setting a value to None removes the tag. 
 
 - Examples

    from mlflow.genai.datasets import set_dataset_tags, get_dataset

    # Get your dataset
    dataset = get_dataset(dataset_id="d-8f3a2b1c4e5d6f7a8b9c0d1e2f3a4b5c")

    # Add or update multiple tags
    set_dataset_tags(
        dataset_id=dataset.dataset_id,
        tags={
            "environment": "production",  # Add new tag
            "version": "2.0",  # Update existing tag
            "validated": "true",
            "validation_date": "2024-11-01",
            "team": "ml-platform",
        },
    )

    # Remove tags by setting to None
    set_dataset_tags(
        dataset_id=dataset.dataset_id,
        tags={
            "deprecated_tag": None,  # This removes the tag
            "old_version": None,  # This also removes the tag
        },
    )

    # Update status after validation
    set_dataset_tags(
        dataset_id=dataset.dataset_id,
        tags={
            "status": "production_ready",
            "coverage": "comprehensive",
            "last_review": "2024-11-01",
            "approved_by": "data_science_lead@company.com",
        },
    )

 - Note - This API is not available in Databricks environments yet. Tags in Databricks are managed through Unity Catalog. 
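The merge semantics described for `set_dataset_tags` (new tags are merged into existing ones, and a value of None removes the tag) can be modelled with ordinary dictionaries. `merge_tags` below is a hypothetical helper that operates on in-memory dicts only; it illustrates the documented behaviour rather than MLflow's implementation.

```python
from typing import Dict, Optional


def merge_tags(
    existing: Dict[str, str], updates: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Merge updates into existing tags; a None value removes the key."""
    merged = dict(existing)  # Leave the caller's dict untouched.
    for key, value in updates.items():
        if value is None:
            merged.pop(key, None)  # Removing an absent tag is a no-op.
        else:
            merged[key] = value
    return merged


tags = {"version": "1.0", "old_version": "0.9"}
print(merge_tags(tags, {"version": "2.0", "old_version": None, "team": "ml-platform"}))
```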
Databricks Agent Label Schemas Python SDK. For more details see Databricks Agent Evaluation: <https://docs.databricks.com/en/generative-ai/agent-evaluation/index.html>
The API docs can be found here: <https://api-docs.databricks.com/python/databricks-agents/latest/databricks_agent_eval.html#review-app>
- class mlflow.genai.label_schemas.InputCategorical(options: list[str])[source]
- Bases: - mlflow.genai.label_schemas.label_schemas.InputType- A single-select dropdown for collecting assessments from stakeholders. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. 
- class mlflow.genai.label_schemas.InputCategoricalList(options: list[str])[source]
- Bases: - mlflow.genai.label_schemas.label_schemas.InputType- A multi-select dropdown for collecting assessments from stakeholders. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. 
- class mlflow.genai.label_schemas.InputNumeric(min_value: Optional[float] = None, max_value: Optional[float] = None)[source]
- Bases: - mlflow.genai.label_schemas.label_schemas.InputType- A numeric input for collecting assessments from stakeholders. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. 
- class mlflow.genai.label_schemas.InputText(max_length: Optional[int] = None)[source]
- Bases: - mlflow.genai.label_schemas.label_schemas.InputType- A free-form text box for collecting assessments from stakeholders. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. 
- class mlflow.genai.label_schemas.InputTextList(max_length_each: Optional[int] = None, max_count: Optional[int] = None)[source]
- Bases: - mlflow.genai.label_schemas.label_schemas.InputType- Like InputText, but allows multiple entries. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. 
- class mlflow.genai.label_schemas.LabelSchema(name: str, type: mlflow.genai.label_schemas.label_schemas.LabelSchemaType, title: str, input: mlflow.genai.label_schemas.label_schemas.InputCategorical | mlflow.genai.label_schemas.label_schemas.InputCategoricalList | mlflow.genai.label_schemas.label_schemas.InputText | mlflow.genai.label_schemas.label_schemas.InputTextList | mlflow.genai.label_schemas.label_schemas.InputNumeric, instruction: Optional[str] = None, enable_comment: bool = False)[source]
- Bases: - object- A label schema for collecting input from stakeholders. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - input: mlflow.genai.label_schemas.label_schemas.InputCategorical | mlflow.genai.label_schemas.label_schemas.InputCategoricalList | mlflow.genai.label_schemas.label_schemas.InputText | mlflow.genai.label_schemas.label_schemas.InputTextList | mlflow.genai.label_schemas.label_schemas.InputNumeric
- Input type specification that defines how stakeholders will provide their assessment (e.g., dropdown, text box, numeric input) 
 - type: mlflow.genai.label_schemas.label_schemas.LabelSchemaType
- Type of the label schema, either ‘feedback’ or ‘expectation’. 
 
- class mlflow.genai.label_schemas.LabelSchemaType(value)[source]
- Bases: - mlflow.genai.utils.enum_utils.StrEnum- Type of label schema. 
- mlflow.genai.label_schemas.create_label_schema(name: str, *, type: Literal['feedback', 'expectation'], title: str, input: mlflow.genai.label_schemas.label_schemas.InputCategorical | mlflow.genai.label_schemas.label_schemas.InputCategoricalList | mlflow.genai.label_schemas.label_schemas.InputText | mlflow.genai.label_schemas.label_schemas.InputTextList | mlflow.genai.label_schemas.label_schemas.InputNumeric, instruction: Optional[str] = None, enable_comment: bool = False, overwrite: bool = False) mlflow.genai.label_schemas.label_schemas.LabelSchema[source]
- Create a new label schema for the review app. - A label schema defines the type of input that stakeholders will provide when labeling items in the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- name – The name of the label schema. Must be unique across the review app. 
- type – The type of the label schema. Either “feedback” or “expectation”. 
- title – The title of the label schema shown to stakeholders. 
- input – The input type of the label schema. 
- instruction – Optional. The instruction shown to stakeholders. 
- enable_comment – Optional. Whether to enable comments for the label schema. 
- overwrite – Optional. Whether to overwrite the existing label schema with the same name. 
 
- Returns
- The created label schema. 
- Return type
 
- mlflow.genai.label_schemas.delete_label_schema(name: str) mlflow.genai.labeling.labeling.ReviewApp[source]
- Delete a label schema from the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- name – The name of the label schema to delete. 
- Returns
- The review app. 
- Return type
 
- mlflow.genai.label_schemas.get_label_schema(name: str) mlflow.genai.label_schemas.label_schemas.LabelSchema[source]
- Get a label schema from the review app. - Note - This functionality is only available in Databricks. Please run pip install mlflow[databricks] to use it. - Parameters
- name – The name of the label schema to get. 
- Returns
- The label schema. 
- Return type
 
- class mlflow.genai.optimize.BasePromptOptimizer(optimizer_config: mlflow.genai.optimize.types.OptimizerConfig)[source]
- Bases: - abc.ABC- Note - Experimental: This class may change or be removed in a future release without warning. - abstract optimize(prompt: PromptVersion, target_llm_params: mlflow.genai.optimize.types.LLMParams, train_data: pd.DataFrame, scorers: list[Scorer], objective: Optional[Callable[[dict[str, bool | float | str | Feedback | list[Feedback]]], float]] = None, eval_data: Optional[pd.DataFrame] = None) mlflow.genai.optimize.types.OptimizerOutput[source]
- Optimize the given prompt using the specified configuration. - Parameters
- prompt – The prompt to optimize. 
- target_llm_params – Parameters for the agent LLM. 
- train_data – Training dataset for optimization. 
- scorers – List of scorers to evaluate the optimization. 
- objective – Optional function to compute overall performance metric. 
- eval_data – Optional evaluation dataset. 
 
- Returns
- The optimized prompt version registered in the prompt registry as a new version. 
 
 - property optimizer_config: mlflow.genai.optimize.types.OptimizerConfig
 
- class mlflow.genai.optimize.DSPyPromptOptimizer(optimizer_config: mlflow.genai.optimize.types.OptimizerConfig)[source]
- Bases: - mlflow.genai.optimize.optimizers.base_optimizer.BasePromptOptimizer- Note - Experimental: This class may change or be removed in a future release without warning. - optimize(prompt: PromptVersion, target_llm_params: mlflow.genai.optimize.types.LLMParams, train_data: pd.DataFrame, scorers: list[Scorer], objective: Optional[Callable[[dict[str, bool | float | str | Feedback | list[Feedback]]], float]] = None, eval_data: Optional[pd.DataFrame] = None) mlflow.genai.optimize.types.OptimizerOutput[source]
- Optimize the given prompt using the specified configuration. - Parameters
- prompt – The prompt to optimize. 
- target_llm_params – Parameters for the agent LLM. 
- train_data – Training dataset for optimization. 
- scorers – List of scorers to evaluate the optimization. 
- objective – Optional function to compute overall performance metric. 
- eval_data – Optional evaluation dataset. 
 
- Returns
- The optimized prompt version registered in the prompt registry as a new version. 
 
 - run_optimization(prompt: PromptVersion, program: dspy.Module, metric: Callable[[dspy.Example], float], train_data: list['dspy.Example'], eval_data: list['dspy.Example']) mlflow.genai.optimize.types.OptimizerOutput[source]
- Run the optimization process for the given prompt and program. - Parameters
- prompt (PromptVersion) – The prompt version to optimize. 
- program (dspy.Module) – The DSPy program/module to optimize. 
- metric (Callable[[dspy.Example], float]) – A callable that computes a metric score for a given example. 
- train_data (list[dspy.Example]) – List of training examples for optimization. 
- eval_data (list[dspy.Example]) – List of evaluation examples for validation. 
 
- Returns
- The result of the optimization, including the optimized prompt and metrics. 
- Return type
- Raises
- NotImplementedError – This method must be implemented by subclasses. 
 
 
- class mlflow.genai.optimize.LLMParams(model_name: str, base_uri: Optional[str] = None, temperature: Optional[float] = None)[source]
- Bases: - object- Note - Experimental: This class may change or be removed in a future release without warning. - Parameters for configuring an LLM. - Parameters
- model_name – Name of the model in the format <provider>:/<model name> or <provider>/<model name>. For example, “openai:/gpt-4o”, “anthropic:/claude-4”, or “openai/gpt-4o”. 
- base_uri – Optional base URI for the API endpoint. If not provided, the default endpoint for the provider will be used. 
- temperature – Optional sampling temperature for the model’s outputs. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic. 
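The two accepted `model_name` formats described above can be told apart with a small parser. The sketch below is illustrative only: `split_model_name` is a hypothetical helper, not an MLflow function, and MLflow's own parsing may differ in edge cases.

```python
def split_model_name(model_name: str) -> tuple[str, str]:
    """Split '<provider>:/<model>' or '<provider>/<model>' into (provider, model)."""
    # Try the longer separator first so 'openai:/gpt-4o' is not split on '/' alone.
    for separator in (":/", "/"):
        provider, found, model = model_name.partition(separator)
        if found and provider and model:
            return provider, model
    raise ValueError(
        f"Expected '<provider>:/<model>' or '<provider>/<model>', got {model_name!r}"
    )


print(split_model_name("openai:/gpt-4o"))  # ('openai', 'gpt-4o')
print(split_model_name("openai/gpt-4o"))  # ('openai', 'gpt-4o')
```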
 
 
- class mlflow.genai.optimize.OptimizerConfig(num_instruction_candidates: int = 6, max_few_show_examples: int = 6, num_threads: int = <factory>, optimizer_llm: Optional[mlflow.genai.optimize.types.LLMParams] = None, algorithm: str | type['BasePromptOptimizer'] = 'DSPy/MIPROv2', verbose: bool = False, autolog: bool = True, convert_to_single_text: bool = True, extract_instructions: bool = True)[source]
- Bases: - object- Note - Experimental: This class may change or be removed in a future release without warning. - Configuration for prompt optimization. - Parameters
- num_instruction_candidates – Number of candidate instructions to generate during each optimization iteration. Higher values may lead to better results but increase optimization time. Default: 6 
- max_few_show_examples – Maximum number of examples to show in few-shot demonstrations. Default: 6 
- num_threads – Number of threads to use for parallel optimization. Default: (number of CPU cores * 2 + 1) 
- optimizer_llm – Optional LLM parameters for the teacher model. If not provided, the target LLM will be used as the teacher. 
- algorithm – The optimization algorithm to use. When a string is provided, it must be one of the supported algorithms: “DSPy/MIPROv2”. When a BasePromptOptimizer is provided, it will be used as the optimizer. Default: “DSPy/MIPROv2” 
- verbose – Whether to show optimizer logs during optimization. Default: False 
- autolog – Whether to enable automatic logging and prompt registration. If set to True, an MLflow run is automatically created to store optimization parameters, datasets and metrics, and the optimized prompt is registered. If set to False, the raw optimized template is returned without registration. Default: True 
- convert_to_single_text – Whether to convert the optimized prompt to a single prompt. Default: True 
- extract_instructions – Whether to extract instructions from the initial prompt. Default: True 
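The `<factory>` default shown for `num_threads` in the signature means the value is computed when the object is created rather than fixed in advance. A minimal sketch of that pattern with a standard dataclass follows; `ConfigSketch` and `default_num_threads` are hypothetical names used for illustration, not MLflow's definitions, and the formula simply follows the "number of CPU cores * 2 + 1" default stated above.

```python
import os
from dataclasses import dataclass, field


def default_num_threads() -> int:
    # os.cpu_count() may return None, so fall back to a single core.
    return (os.cpu_count() or 1) * 2 + 1


@dataclass
class ConfigSketch:
    """Hypothetical stand-in that models only a few OptimizerConfig fields."""

    num_instruction_candidates: int = 6
    num_threads: int = field(default_factory=default_num_threads)
    verbose: bool = False


print(ConfigSketch().num_threads)  # machine dependent
print(ConfigSketch(num_threads=4).num_threads)  # 4
```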
 
 - optimizer_llm: mlflow.genai.optimize.types.LLMParams | None = None
 
- class mlflow.genai.optimize.OptimizerOutput(*, optimized_prompt: str | dict[str, typing.Any], optimizer_name: str, final_eval_score: Optional[float] = None, initial_eval_score: Optional[float] = None)[source]
- Bases: - object- Note - Experimental: This class may change or be removed in a future release without warning. - Output of the optimize method of mlflow.genai.optimize.BasePromptOptimizer. - Parameters
- optimized_prompt – The optimized prompt version entity. 
- optimizer_name – The name of the optimizer. 
- final_eval_score – The final evaluation score of the optimized prompt. 
- initial_eval_score – The evaluation score of the initial prompt, before optimization. 
 
 
- class mlflow.genai.optimize.PromptOptimizationResult(prompt: str | dict[str, typing.Any] | PromptVersion, initial_prompt: PromptVersion, optimizer_name: str, final_eval_score: float | None, initial_eval_score: float | None)[source]
- Bases: - object- Note - Experimental: This class may change or be removed in a future release without warning. - Result of the mlflow.genai.optimize_prompt() API. - Parameters
- prompt – The optimized prompt. When autolog=True (default), this is a PromptVersion entity containing the registered optimized template. When autolog=False, this is the raw optimized template (str or dict). 
- initial_prompt – A prompt version entity containing the initial template. 
- optimizer_name – The name of the optimizer. 
- final_eval_score – The final evaluation score of the optimized prompt. 
- initial_eval_score – The evaluation score of the initial prompt, before optimization. 
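Because `prompt` is a PromptVersion when autolog=True and a raw template (str or dict) when autolog=False, calling code may want to handle both shapes. The sketch below shows one way to do so. `FakePromptVersion` is a hypothetical stand-in that models only the `template` attribute, and `extract_template` is an illustrative helper, not part of MLflow.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class FakePromptVersion:
    """Hypothetical stand-in for a registered PromptVersion; only `template` is modeled."""

    template: str


def extract_template(
    prompt: Union[str, dict[str, Any], FakePromptVersion],
) -> Union[str, dict[str, Any]]:
    """Return the raw template whether or not the prompt was registered."""
    if isinstance(prompt, (str, dict)):
        return prompt  # autolog=False case: already the raw template
    return prompt.template  # autolog=True case: unwrap the registered entity


print(extract_template("Answer: {{question}}"))
print(extract_template(FakePromptVersion(template="Answer: {{question}}")))
```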
 
 - initial_prompt: PromptVersion
 - prompt: str | dict[str, typing.Any] | PromptVersion
 
- mlflow.genai.optimize.format_dspy_prompt(program: dspy.Predict, convert_to_single_text: bool) dict[str, typing.Any] | str[source]
- Note - Experimental: This function may change or be removed in a future release without warning. 
- mlflow.genai.optimize.optimize_prompt(*, target_llm_params: mlflow.genai.optimize.types.LLMParams, prompt: str | PromptVersion, train_data: EvaluationDatasetTypes, scorers: list[Scorer], objective: Optional[Callable[[dict[str, bool | float | str | Feedback | list[Feedback]]], float]] = None, eval_data: Optional[EvaluationDatasetTypes] = None, optimizer_config: mlflow.genai.optimize.types.OptimizerConfig | None = None) mlflow.genai.optimize.types.PromptOptimizationResult[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Optimize an LLM prompt using the given dataset and evaluation metrics. By default, the optimized prompt template is automatically registered as a new version of the original prompt and optimization metrics are logged. Currently, this API provides built-in support for DSPy’s MIPROv2 optimizer. You can also implement custom optimization algorithms by extending the BasePromptOptimizer class. - Parameters
- target_llm_params – Parameters for the LLM that the prompt is optimized for. The model name can be specified in either format: - <provider>:/<model> (e.g., “openai:/gpt-4o”) - <provider>/<model> (e.g., “openai/gpt-4o”) 
- prompt – The URI or Prompt object of the MLflow prompt to optimize. 
- train_data – - Training dataset used for optimization. The data must be one of the following formats: - An EvaluationDataset entity 
- Pandas DataFrame 
- Spark DataFrame 
- List of dictionaries 
 - The dataset must include the following columns: - inputs: A column containing single inputs in dict format. Each input should contain keys matching the variables in the prompt template. 
- expectations: A column containing a dictionary of ground truths for individual output fields. 
 
- scorers – List of scorers that evaluate the inputs, outputs and expectations. Note: Trace input is not supported for optimization. Use inputs, outputs and expectations for optimization. Also, pass the objective argument when using scorers with string or Feedback type outputs.
- objective – A callable that computes the overall performance metric from individual assessments. Takes a dict mapping assessment names to assessment scores and returns a float value (greater is better). 
- eval_data – Evaluation dataset with the same format as train_data. If not provided, train_data will be automatically split into training and evaluation sets. 
- optimizer_config – Configuration parameters for the optimizer. 
 
- Returns
- The optimization result including the optimized prompt. 
- Return type
 - Example - import os import mlflow from typing import Any from mlflow.genai.scorers import scorer from mlflow.genai.optimize import OptimizerConfig, LLMParams os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" @scorer def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool: return expectations == outputs prompt = mlflow.genai.register_prompt( name="qa", template="Answer the following question: {{question}}", ) result = mlflow.genai.optimize_prompt( target_llm_params=LLMParams(model_name="openai:/gpt-4o-mini"), train_data=[ {"inputs": {"question": f"{i}+1"}, "expectations": {"answer": f"{i + 1}"}} for i in range(100) ], scorers=[exact_match], prompt=prompt.uri, optimizer_config=OptimizerConfig(num_instruction_candidates=5), ) print(result.prompt.template) 
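The `objective` parameter described above takes a dict mapping assessment names to scores and returns a single float where greater is better. A minimal sketch of such a callable follows. The scorer names, the weights, and the "yes"/"no" mapping are all assumptions made for illustration, and the sketch handles only bool, float, and str values, not Feedback objects.

```python
from typing import Union

Score = Union[bool, float, str]

# Assumption: string-valued scorers return "yes" or "no".
RATING_TO_SCORE = {"yes": 1.0, "no": 0.0}

# Hypothetical scorer names and weights; adjust to the scorers actually passed.
WEIGHTS = {"exact_match": 0.75, "politeness": 0.25}


def weighted_objective(scores: dict[str, Score]) -> float:
    """Combine per-scorer assessments into one number (greater is better)."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = scores[name]
        if isinstance(value, str):
            value = RATING_TO_SCORE[value.lower()]
        total += weight * float(value)  # bool converts to 0.0 or 1.0
    return total


print(weighted_objective({"exact_match": True, "politeness": "no"}))  # 0.75
```

A callable of this shape would be passed as `objective=weighted_objective` alongside the matching scorers.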
- class mlflow.genai.judges.AlignmentOptimizer[source]
- Bases: - abc.ABC- Note - Experimental: This class may change or be removed in a future release without warning. - Abstract base class for judge alignment optimizers. - Alignment optimizers improve judge accuracy by learning from traces that contain human feedback. - abstract align(judge: mlflow.genai.judges.base.Judge, traces: list[Trace]) mlflow.genai.judges.base.Judge[source]
- Align a judge using the provided traces. - Parameters
- judge – The judge to be optimized 
- traces – List of traces containing alignment data (feedback) 
 
- Returns
- A new Judge instance that is better aligned with the input traces. 
 
 
- class mlflow.genai.judges.CategoricalRating(value)[source]
- Bases: - mlflow.genai.utils.enum_utils.StrEnum- A categorical rating for an assessment. - Example - from mlflow.genai.judges import CategoricalRating from mlflow.entities import Feedback # Create feedback with categorical rating feedback = Feedback( name="my_metric", value=CategoricalRating.YES, rationale="The metric is passing." ) 
- class mlflow.genai.judges.Judge(*, name: str, aggregations: list[typing.Union[typing.Literal['min', 'max', 'mean', 'median', 'variance', 'p90'], typing.Callable[[list[int | float]], float]]] | None = None)[source]
- Bases: - Scorer- Note - Experimental: This class may change or be removed in a future release without warning. - Base class for LLM-as-a-judge scorers that can be aligned with human feedback. - Judges are specialized scorers that use LLMs to evaluate outputs based on configurable criteria and the results of human-provided feedback alignment. - align(traces: list[Trace], optimizer: Optional[mlflow.genai.judges.base.AlignmentOptimizer] = None) mlflow.genai.judges.base.Judge[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Align this judge with human preferences using the provided optimizer and traces. - Parameters
- traces – Training traces for alignment 
- optimizer – The alignment optimizer to use. If None, uses the default SIMBA optimizer. 
 
- Returns
- A new Judge instance that is better aligned with the input traces. 
 
 - abstract get_input_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the input fields for this judge. - Returns
- List of JudgeField objects defining the input fields. 
 
 - classmethod get_output_fields() list[mlflow.genai.judges.base.JudgeField][source]
- Get the standard output fields used by all judges. This is the source of truth for judge output field definitions. - Returns
- List of JudgeField objects defining the standard output fields. 
 
 
- mlflow.genai.judges.custom_prompt_judge(*, name: str, prompt_template: str, numeric_values: Optional[dict[str, float]] = None, model: Optional[str] = None) Callable[[...], Feedback][source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Create a custom prompt judge that evaluates inputs using a template. - Parameters
- name – Name of the judge, used as the name of the returned mlflow.entities.Feedback object.
- prompt_template – Template string with {{var_name}} placeholders for variable substitution. Should be prompted with choices as outputs. 
- numeric_values – Optional mapping from categorical values to numeric scores. Useful if you want to create a custom judge that returns continuous valued outputs. Defaults to None. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- A callable that takes keyword arguments mapping to the template variables and returns an mlflow.entities.Feedback object.
 - Example prompt template: - You will look at the response and determine the formality of the response. <request>{{request}}</request> <response>{{response}}</response> You must choose one of the following categories. [[formal]]: The response is very formal. [[semi_formal]]: The response is somewhat formal. The response is somewhat formal if the response mentions friendship, etc. [[not_formal]]: The response is not formal. - Variable names in the template should be enclosed in double curly braces, e.g., {{request}}, {{response}}. They should be alphanumeric and can include underscores, but should not contain spaces or special characters. - It is required for the prompt template to request choices as outputs, with each choice enclosed in square brackets. Choice names should be alphanumeric and can include underscores and spaces. 
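The template conventions above (double curly braces for variables, double square brackets for choices) can be checked locally before a template is handed to a judge. The following is a rough sketch using regular expressions. The patterns and helper names are assumptions for illustration and are not MLflow's own parser.

```python
import re

# Variables: alphanumeric plus underscores, wrapped in double curly braces.
VARIABLE_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")
# Choices: alphanumeric, underscores, and spaces, wrapped in double square brackets.
CHOICE_PATTERN = re.compile(r"\[\[([\w ]+)\]\]")


def find_variables(template: str) -> list[str]:
    return VARIABLE_PATTERN.findall(template)


def find_choices(template: str) -> list[str]:
    return CHOICE_PATTERN.findall(template)


def render(template: str, **variables: str) -> str:
    """Substitute each {{name}} placeholder; raises KeyError if a value is missing."""
    return VARIABLE_PATTERN.sub(lambda match: variables[match.group(1)], template)


template = (
    "Rate the formality of the response.\n"
    "<response>{{response}}</response>\n"
    "[[formal]]: The response is very formal.\n"
    "[[not_formal]]: The response is not formal.\n"
)

# A numeric_values mapping should cover exactly the choices named in the template.
numeric_values = {"formal": 1.0, "not_formal": 0.0}

print(find_variables(template))  # ['response']
print(find_choices(template))  # ['formal', 'not_formal']
print(set(numeric_values) == set(find_choices(template)))  # True
```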
- mlflow.genai.judges.is_context_relevant(*, request: str, context: Any, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
- LLM judge determines whether the given context is relevant to the input request. - Parameters
- request – Input to the application to evaluate, user’s question or query. 
- context – Context to evaluate the relevance to the request. Supports any JSON-serializable object. 
- name – Optional name for overriding the default name of the returned feedback. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is relevant to the request.
 - Example - The following example shows how to evaluate whether a document retrieved by a retriever is relevant to the user’s question. - from mlflow.genai.judges import is_context_relevant feedback = is_context_relevant( request="What is the capital of France?", context="Paris is the capital of France.", ) print(feedback.value) # "yes" feedback = is_context_relevant( request="What is the capital of France?", context="Paris is known for its Eiffel Tower.", ) print(feedback.value) # "no" 
- mlflow.genai.judges.is_context_sufficient(*, request: str, context: Any, expected_facts: list[str], expected_response: Optional[str] = None, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
- LLM judge determines whether the given context is sufficient to answer the input request. - Parameters
- request – Input to the application to evaluate, user’s question or query. 
- context – Context to evaluate the sufficiency of. Supports any JSON-serializable object. 
- expected_facts – A list of expected facts that should be present in the context. Optional. 
- expected_response – The expected response from the application. Optional. 
- name – Optional name for overriding the default name of the returned feedback. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the context is sufficient to answer the request.
 - Example - The following example shows how to evaluate whether the documents returned by a retriever give sufficient context to answer the user’s question. - from mlflow.genai.judges import is_context_sufficient feedback = is_context_sufficient( request="What is the capital of France?", context=[ {"content": "Paris is the capital of France."}, {"content": "Paris is known for its Eiffel Tower."}, ], expected_facts=["Paris is the capital of France."], ) print(feedback.value) # "yes" feedback = is_context_sufficient( request="What is the capital of France?", context={"content": "France is a country in Europe."}, expected_response="Paris is the capital of France.", ) print(feedback.value) # "no" 
- mlflow.genai.judges.is_correct(*, request: str, response: str, expected_facts: Optional[list[str]] = None, expected_response: Optional[str] = None, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
- LLM judge determines whether the given response is correct for the input request. - Parameters
- request – Input to the application to evaluate, user’s question or query. 
- response – The response from the application to evaluate. 
- expected_facts – A list of expected facts that should be present in the response. Optional. 
- expected_response – The expected response from the application. Optional. 
- name – Optional name for overriding the default name of the returned feedback. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is correct for the request.
 - Example - The following example shows how to evaluate whether the response is correct. - from mlflow.genai.judges import is_correct feedback = is_correct( request="What is the capital of France?", response="Paris is the capital of France.", expected_response="Paris", ) print(feedback.value) # "yes" feedback = is_correct( request="What is the capital of France?", response="London is the capital of France.", expected_facts=["Paris is the capital of France"], ) print(feedback.value) # "no" 
- mlflow.genai.judges.is_grounded(*, request: str, response: str, context: Any, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
- LLM judge determines whether the given response is grounded in the given context. - Parameters
- request – Input to the application to evaluate, user’s question or query. 
- response – The response from the application to evaluate. 
- context – Context to evaluate the response against. Supports any JSON-serializable object. 
- name – Optional name for overriding the default name of the returned feedback. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is grounded in the context.
 - Example - The following example shows how to evaluate whether the response is grounded in the context. - from mlflow.genai.judges import is_grounded feedback = is_grounded( request="What is the capital of France?", response="Paris", context=[ {"content": "Paris is the capital of France."}, {"content": "Paris is known for its Eiffel Tower."}, ], ) print(feedback.value) # "yes" feedback = is_grounded( request="What is the capital of France?", response="London is the capital of France.", context=[ {"content": "Paris is the capital of France."}, {"content": "Paris is known for its Eiffel Tower."}, ], ) print(feedback.value) # "no" 
- mlflow.genai.judges.is_safe(*, content: str, name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
- LLM judge determines whether the given response is safe. - Parameters
- content – Text content to evaluate for safety. 
- name – Optional name for overriding the default name of the returned feedback. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response is safe.
 - Example - from mlflow.genai.judges import is_safe feedback = is_safe(content="I am a happy person.") print(feedback.value) # "yes" 
- mlflow.genai.judges.make_judge(name: str, instructions: str, model: Optional[str] = None) mlflow.genai.judges.base.Judge[source]
- Note - Experimental: This function may change or be removed in a future release without warning. - Create a custom MLflow judge instance. - Parameters
- name – The name of the judge 
- instructions – Natural language instructions for evaluation. Must contain at least one template variable: {{ inputs }}, {{ outputs }}, {{ expectations }}, or {{ trace }} to reference evaluation data. Custom variables are not supported. 
- model – The model identifier to use for evaluation (e.g., “openai:/gpt-4”) 
 
- Returns
- An InstructionsJudge instance configured with the provided parameters 
 - Example - import mlflow from mlflow.genai.judges import make_judge # Create a judge that evaluates response quality using template variables quality_judge = make_judge( name="response_quality", instructions=( "Evaluate if the response in {{ outputs }} correctly answers " "the question in {{ inputs }}. The response should be accurate, " "complete, and professional." ), model="openai:/gpt-4", ) # Evaluate a response result = quality_judge( inputs={"question": "What is machine learning?"}, outputs="ML is basically when computers learn stuff on their own", ) # Create a judge that compares against expectations correctness_judge = make_judge( name="correctness", instructions=( "Compare the {{ outputs }} against the {{ expectations }}. " "Rate how well they match on a scale of 1-5." ), model="openai:/gpt-4", ) # Evaluate with expectations (must be dictionaries) result = correctness_judge( inputs={"question": "What is the capital of France?"}, outputs={"answer": "The capital of France is Paris."}, expectations={"expected_answer": "Paris"}, ) # Create a judge that evaluates based on trace context trace_judge = make_judge( name="trace_quality", instructions="Evaluate the overall quality of the {{ trace }} execution.", model="openai:/gpt-4", ) # Use with search_traces() - evaluate each trace traces = mlflow.search_traces(experiment_ids=["1"], return_type="list") for trace in traces: feedback = trace_judge(trace=trace) print(f"Trace {trace.info.trace_id}: {feedback.value} - {feedback.rationale}") 
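The rule stated for `instructions` (at least one of the four reserved template variables, and no custom variables) can be pre-checked before creating a judge. The sketch below is a hypothetical local validator, not MLflow's implementation, and the error messages MLflow actually raises may differ.

```python
import re

ALLOWED_VARIABLES = {"inputs", "outputs", "expectations", "trace"}
VARIABLE_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def check_instructions(instructions: str) -> set[str]:
    """Return the template variables used, or raise ValueError if the rules are broken."""
    found = set(VARIABLE_PATTERN.findall(instructions))
    if not found:
        raise ValueError(
            f"Instructions must reference at least one of {sorted(ALLOWED_VARIABLES)}"
        )
    unsupported = found - ALLOWED_VARIABLES
    if unsupported:
        raise ValueError(f"Custom variables are not supported: {sorted(unsupported)}")
    return found


used = check_instructions("Compare the {{ outputs }} against the {{ expectations }}.")
print(sorted(used))  # ['expectations', 'outputs']
```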
- mlflow.genai.judges.meets_guidelines(*, guidelines: str | list[str], context: dict[str, typing.Any], name: Optional[str] = None, model: Optional[str] = None) Feedback[source]
- LLM judge determines whether the given response meets the given guideline(s). - Parameters
- guidelines – A single guideline or a list of guidelines. 
- context – Mapping of context to be evaluated against the guidelines. For example, pass {“response”: “<response text>”} to evaluate whether the response meets the given guidelines. 
- name – Optional name for overriding the default name of the returned feedback. 
- model – - Judge model to use. Must be either “databricks” or a form of <provider>:/<model-name>, such as “openai:/gpt-4.1-mini”, “anthropic:/claude-3.5-sonnet-20240620”. MLflow natively supports [“openai”, “anthropic”, “bedrock”, “mistral”], and more providers are supported through LiteLLM. Default model depends on the tracking URI setup: - Databricks: databricks 
- Otherwise: openai:/gpt-4.1-mini. 
 
 
- Returns
- An mlflow.entities.assessment.Feedback object with a “yes” or “no” value indicating whether the response meets the guideline(s).
 - Example - The following example shows how to evaluate whether the response meets the given guideline(s). - from mlflow.genai.judges import meets_guidelines feedback = meets_guidelines( guidelines="Be polite and respectful.", context={"response": "Hello, how are you?"}, ) print(feedback.value) # "yes" feedback = meets_guidelines( guidelines=["Be polite and respectful.", "Must be in English."], context={"response": "Hola, ¿cómo estás?"}, ) print(feedback.value) # "no"