# Arize Phoenix
Arize Phoenix is an open-source LLM observability and evaluation framework from Arize AI. MLflow's Phoenix integration allows you to use Phoenix evaluators as MLflow scorers for detecting hallucinations, evaluating relevance, identifying toxicity, and more.
## Prerequisites

Phoenix scorers require the `arize-phoenix-evals` package:

```bash
pip install arize-phoenix-evals
```
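Phoenix evaluators also call an LLM judge; the examples in this guide use `openai:/gpt-4`, so they assume OpenAI credentials are available. A minimal setup sketch (setting the key via an environment variable is an assumption about your credential setup; swap in the appropriate key if you use a different judge provider):

```python
import os

# Assumption: the examples below use the "openai:/gpt-4" judge model,
# so an OpenAI API key must be available in the environment.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```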
## Quick Start
You can call Phoenix scorers directly:
```python
from mlflow.genai.scorers.phoenix import Hallucination

scorer = Hallucination(model="openai:/gpt-4")

feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
    expectations={"context": "France is a country in Europe. Its capital is Paris."},
)

print(feedback.value)  # "factual" or "hallucinated"
print(feedback.metadata["score"])  # Numeric score
```
Or use them in `mlflow.genai.evaluate`:
```python
import mlflow
from mlflow.genai.scorers.phoenix import Hallucination, Relevance

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
        "expectations": {
            "context": "MLflow is an ML platform for experiment tracking and model deployment."
        },
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
        "expectations": {
            "context": "MLflow provides APIs like mlflow.start_run() for experiment tracking."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Hallucination(model="openai:/gpt-4"),
        Relevance(model="openai:/gpt-4"),
    ],
)
```
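A short sketch of inspecting the result, assuming the object returned by `mlflow.genai.evaluate` exposes aggregate per-scorer scores through a `metrics` dictionary (as `mlflow.evaluate` results do); the per-row feedback is also recorded in MLflow and can be reviewed in the UI:

```python
# Assumption: `results.metrics` holds aggregate scores keyed by metric name,
# mirroring mlflow.evaluate; check your MLflow version's API reference if it differs.
for metric_name, value in results.metrics.items():
    print(f"{metric_name}: {value}")
```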
## Available Phoenix Scorers
Phoenix scorers evaluate different aspects of LLM outputs:
| Scorer | What does it evaluate? | Phoenix Docs |
|---|---|---|
| `Hallucination` | Does the output contain fabricated information not in the context? | Link |
| `Relevance` | Is the retrieved context relevant to the input query? | Link |
| `Toxicity` | Does the output contain toxic or harmful content? | Link |
| `QA` | Does the answer correctly address the question based on the reference? | Link |
| `Summarization` | Is the summary accurate and complete relative to the original text? | Link |
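A minimal sketch of using one of the other scorers from the table, assuming `Toxicity` is importable from the same module and follows the same constructor and call pattern as `Hallucination`; because it judges the output text itself, no reference context is passed, and the label names in the comment are assumptions:

```python
from mlflow.genai.scorers.phoenix import Toxicity

# Assumptions: Toxicity takes the same judge-model argument as Hallucination
# and scores the output text alone, so no `expectations` context is passed.
scorer = Toxicity(model="openai:/gpt-4")

feedback = scorer(
    inputs="Tell me about MLflow.",
    outputs="MLflow is an open-source platform for the ML lifecycle.",
)
print(feedback.value)  # e.g. "toxic" / "non-toxic" (label names assumed)
```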
## Creating Scorers by Name

You can also create Phoenix scorers dynamically using `get_scorer`:
```python
from mlflow.genai.scorers.phoenix import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="Hallucination",
    model="openai:/gpt-4",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
    expectations={"context": "MLflow is an ML platform."},
)
```
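Because `get_scorer` returns a regular MLflow scorer, scorers created by name can be passed to `mlflow.genai.evaluate` just like the class-based ones. A short sketch, reusing the `eval_dataset` from the Quick Start and assuming the names in the table above are valid `metric_name` values:

```python
import mlflow
from mlflow.genai.scorers.phoenix import get_scorer

# Build scorers by name; the metric names are taken from the table above.
scorers = [
    get_scorer(metric_name=name, model="openai:/gpt-4")
    for name in ["Hallucination", "Relevance"]
]

results = mlflow.genai.evaluate(
    data=eval_dataset,  # the dataset defined in the Quick Start example
    scorers=scorers,
)
```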