
# Arize Phoenix

Arize Phoenix is an open-source LLM observability and evaluation framework from Arize AI. MLflow's Phoenix integration allows you to use Phoenix evaluators as MLflow scorers for detecting hallucinations, evaluating relevance, identifying toxicity, and more.

## Prerequisites

Phoenix scorers require the `arize-phoenix-evals` package:

```bash
pip install arize-phoenix-evals
```
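
Because Phoenix evaluators are LLM-judged, you also need credentials for the judge model passed via the `model` argument (for example, an OpenAI API key when using `openai:/gpt-4`).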

## Quick Start

You can call Phoenix scorers directly:

```python
from mlflow.genai.scorers.phoenix import Hallucination

scorer = Hallucination(model="openai:/gpt-4")
feedback = scorer(
    inputs="What is the capital of France?",
    outputs="Paris is the capital of France.",
    expectations={"context": "France is a country in Europe. Its capital is Paris."},
)

print(feedback.value)  # "factual" or "hallucinated"
print(feedback.metadata["score"])  # Numeric score
```

Or use them in `mlflow.genai.evaluate()`:

```python
import mlflow
from mlflow.genai.scorers.phoenix import Hallucination, Relevance

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
        "expectations": {
            "context": "MLflow is an ML platform for experiment tracking and model deployment."
        },
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
        "expectations": {
            "context": "MLflow provides APIs like mlflow.start_run() for experiment tracking."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Hallucination(model="openai:/gpt-4"),
        Relevance(model="openai:/gpt-4"),
    ],
)
```

## Available Phoenix Scorers

Phoenix scorers evaluate different aspects of LLM outputs:

| Scorer | What does it evaluate? | Phoenix Docs |
|---|---|---|
| `Hallucination` | Does the output contain fabricated information not in the context? | Link |
| `Relevance` | Is the retrieved context relevant to the input query? | Link |
| `Toxicity` | Does the output contain toxic or harmful content? | Link |
| `QA` | Does the answer correctly address the question based on reference? | Link |
| `Summarization` | Is the summary accurate and complete relative to the original text? | Link |
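
As a rough sketch, assuming `Toxicity`, `QA`, and `Summarization` follow the same `model=` constructor pattern shown above for `Hallucination` and `Relevance`, several scorers can be combined in a single evaluation run:

```python
import mlflow
from mlflow.genai.scorers.phoenix import (
    QA,
    Hallucination,
    Relevance,
    Summarization,
    Toxicity,
)

# Assumption: each Phoenix scorer class accepts a `model` argument, as shown
# for Hallucination and Relevance in the examples above.
eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing ML workflows.",
        "expectations": {
            "context": "MLflow is an ML platform for experiment tracking and model deployment."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Hallucination(model="openai:/gpt-4"),
        Relevance(model="openai:/gpt-4"),
        Toxicity(model="openai:/gpt-4"),
        QA(model="openai:/gpt-4"),
        Summarization(model="openai:/gpt-4"),
    ],
)
```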

## Creating Scorers by Name

You can also create Phoenix scorers dynamically using `get_scorer`:

```python
from mlflow.genai.scorers.phoenix import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="Hallucination",
    model="openai:/gpt-4",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
    expectations={"context": "MLflow is an ML platform."},
)
```
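
This is convenient when the set of metrics is configuration-driven. A minimal sketch, assuming `get_scorer` accepts each metric name listed in the table above with the signature shown:

```python
from mlflow.genai.scorers.phoenix import get_scorer

# Build a list of scorers from metric names (e.g. read from a config file).
# Assumption: get_scorer accepts each metric name from the table above.
metric_names = ["Hallucination", "Relevance", "Toxicity"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names]
```

The resulting list can be passed as the `scorers` argument to `mlflow.genai.evaluate()`.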

## Next Steps