DeepEval

DeepEval is a comprehensive evaluation framework for LLM applications that provides metrics for RAG systems, agents, conversational AI, and safety evaluation. MLflow's DeepEval integration allows you to use most DeepEval metrics as MLflow scorers.

Prerequisites

DeepEval scorers require the `deepeval` package:

```bash
pip install deepeval
```
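
Most DeepEval metrics are LLM-judged, so the judge model referenced by `model=...` must be reachable from your environment. For example, if you use an OpenAI model URI such as `openai:/gpt-4`, the API key is typically read from the standard environment variable; you can set it in your shell or, as sketched below, in Python before running the scorers (adjust for your provider):

```python
import os

# Assumes an OpenAI-hosted judge model; other providers use their own credentials.
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
```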

Quick Start

You can call DeepEval scorers directly:

```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="openai:/gpt-4")
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source platform for managing machine learning workflows.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # 0.85
```

Or use them in `mlflow.genai.evaluate`:

```python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="openai:/gpt-4"),
        Faithfulness(threshold=0.8, model="openai:/gpt-4"),
    ],
)
```

Available DeepEval Scorers

DeepEval scorers are organized into categories based on their evaluation focus:

RAG (Retrieval-Augmented Generation) Metrics

Evaluate retrieval quality and answer generation in RAG systems:

| Scorer | What does it evaluate? |
| --- | --- |
| `AnswerRelevancy` | Is the output relevant to the input query? |
| `Faithfulness` | Is the output factually consistent with the retrieval context? |
| `ContextualRecall` | Does the retrieval context contain all the necessary information? |
| `ContextualPrecision` | Are relevant nodes ranked higher than irrelevant ones? |
| `ContextualRelevancy` | Is the retrieval context relevant to the query? |
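
The RAG scorers above can be combined into a single `mlflow.genai.evaluate` run, following the Quick Start pattern. A minimal sketch; note that the context-dependent metrics (such as `Faithfulness`, `ContextualRecall`, and `ContextualPrecision`) also need the retrieved context, and in some cases an expected answer, available to the scorer, for example from traces or expectations in your dataset, so treat the dataset below as illustrative:

```python
import mlflow
from mlflow.genai.scorers.deepeval import (
    AnswerRelevancy,
    ContextualPrecision,
    ContextualRecall,
    ContextualRelevancy,
    Faithfulness,
)

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
]

# A full RAG scorer suite with per-metric thresholds.
rag_scorers = [
    AnswerRelevancy(threshold=0.7, model="openai:/gpt-4o"),
    Faithfulness(threshold=0.8, model="openai:/gpt-4o"),
    ContextualRecall(threshold=0.7, model="openai:/gpt-4o"),
    ContextualPrecision(threshold=0.7, model="openai:/gpt-4o"),
    ContextualRelevancy(threshold=0.7, model="openai:/gpt-4o"),
]

results = mlflow.genai.evaluate(data=eval_dataset, scorers=rag_scorers)
```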

Agentic Metrics

Evaluate AI agent performance and behavior:

| Scorer | What does it evaluate? |
| --- | --- |
| `TaskCompletion` | Does the agent successfully complete its assigned task? |
| `ToolCorrectness` | Does the agent use the correct tools? |
| `ArgumentCorrectness` | Are tool arguments correct? |
| `StepEfficiency` | Does the agent take an optimal path? |
| `PlanAdherence` | Does the agent follow its plan? |
| `PlanQuality` | Is the agent's plan well-structured? |
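
Agentic metrics generally judge the agent's full execution (tool calls and intermediate steps) rather than a single output string. A minimal sketch, assuming the agent is already instrumented with MLflow Tracing and that these scorers read the relevant spans from each logged trace:

```python
import mlflow
from mlflow.genai.scorers.deepeval import TaskCompletion, ToolCorrectness

# Pull previously logged agent traces from the current experiment.
traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        TaskCompletion(model="openai:/gpt-4o"),
        ToolCorrectness(),  # compares tool calls against expected tools; assumes those are recorded
    ],
)
```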

Conversational Metrics

Evaluate multi-turn conversations and dialogue systems:

| Scorer | What does it evaluate? |
| --- | --- |
| `TurnRelevancy` | Is each turn relevant to the conversation? |
| `RoleAdherence` | Does the assistant maintain its assigned role? |
| `KnowledgeRetention` | Does the agent retain information across turns? |
| `ConversationCompleteness` | Are all user questions addressed? |
| `GoalAccuracy` | Does the conversation achieve its goal? |
| `ToolUse` | Does the agent use tools appropriately in conversation? |
| `TopicAdherence` | Does the conversation stay on topic? |
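
Conversational metrics score a whole multi-turn exchange rather than a single request/response pair. A sketch under the assumption that the conversation history is taken from logged traces (for example, traces grouped by session):

```python
import mlflow
from mlflow.genai.scorers.deepeval import KnowledgeRetention, TurnRelevancy

# Assumes multi-turn conversations were logged as traces via MLflow Tracing.
traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        TurnRelevancy(threshold=0.8, model="openai:/gpt-4o"),
        KnowledgeRetention(threshold=0.7, model="openai:/gpt-4o"),
    ],
)
```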

Safety Metrics

Detect harmful content, bias, and policy violations:

| Scorer | What does it evaluate? |
| --- | --- |
| `Bias` | Does the output contain biased content? |
| `Toxicity` | Does the output contain toxic language? |
| `NonAdvice` | Does the model inappropriately provide advice in restricted domains? |
| `Misuse` | Could the output be used for harmful purposes? |
| `PIILeakage` | Does the output leak personally identifiable information? |
| `RoleViolation` | Does the assistant break out of its assigned role? |
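
Safety scorers primarily judge the generated text itself, so the direct-call pattern from the Quick Start typically suffices. A sketch (how each metric weighs the input versus the output is defined by DeepEval):

```python
from mlflow.genai.scorers.deepeval import PIILeakage, Toxicity

toxicity = Toxicity(model="openai:/gpt-4o")
pii_leakage = PIILeakage(model="openai:/gpt-4o")

question = "How do I contact support?"
answer = "You can reach the support team at support@example.com."

print(toxicity(inputs=question, outputs=answer).value)
print(pii_leakage(inputs=question, outputs=answer).value)
```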

Other

Additional evaluation metrics for common use cases:

| Scorer | What does it evaluate? |
| --- | --- |
| `Hallucination` | Does the LLM fabricate information not in the context? |
| `Summarization` | Is the summary accurate and complete? |
| `JsonCorrectness` | Does JSON output match the expected schema? |
| `PromptAlignment` | Does the output align with prompt instructions? |
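
For example, `Summarization` compares a candidate summary against the source text it was generated from. A sketch assuming the source document goes in `inputs` and the summary in `outputs`, mirroring the Quick Start call pattern:

```python
from mlflow.genai.scorers.deepeval import Summarization

scorer = Summarization(threshold=0.7, model="openai:/gpt-4o")

source = (
    "MLflow is an open-source platform for managing the machine learning lifecycle, "
    "including experiment tracking, model packaging, and deployment."
)
summary = "MLflow is an open-source platform for managing the ML lifecycle."

feedback = scorer(inputs=source, outputs=summary)
print(feedback.value)
print(feedback.metadata["score"])
```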

Non-LLM

Fast, rule-based metrics that don't require LLM calls:

| Scorer | What does it evaluate? |
| --- | --- |
| `ExactMatch` | Does the output exactly match the expected output? |
| `PatternMatch` | Does the output match a regex pattern? |
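
These scorers need a reference value to compare against. A sketch assuming the expected text is supplied through the dataset's `expectations` field, the standard field in `mlflow.genai.evaluate` datasets; the exact key the scorer reads is an assumption, so check the docs for your version:

```python
import mlflow
from mlflow.genai.scorers.deepeval import ExactMatch

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing ML workflows.",
        # Hypothetical expectation key; the scorer is assumed to read the reference text from here.
        "expectations": {"expected_response": "MLflow is an open-source platform for managing ML workflows."},
    },
]

results = mlflow.genai.evaluate(data=eval_dataset, scorers=[ExactMatch()])
```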

Creating Scorers by Name

You can also create DeepEval scorers dynamically using `get_scorer`:

```python
from mlflow.genai.scorers.deepeval import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="openai:/gpt-4",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)
```
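
Because `get_scorer` takes the metric name as a string, it is convenient for building a scorer suite from configuration, e.g. a plain list of names. A sketch with thresholds left at their defaults, reusing `eval_dataset` from the Quick Start:

```python
import mlflow
from mlflow.genai.scorers.deepeval import get_scorer

metric_names = ["AnswerRelevancy", "Faithfulness", "Toxicity"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4o") for name in metric_names]

results = mlflow.genai.evaluate(data=eval_dataset, scorers=scorers)
```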

Configuration

DeepEval scorers accept all parameters supported by the underlying DeepEval metrics. Any additional keyword arguments are passed directly to the DeepEval metric constructor:

```python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# Common parameters
scorer = AnswerRelevancy(
    model="openai:/gpt-4",  # Model URI (also supports "databricks", "databricks:/endpoint", etc.)
    threshold=0.7,  # Pass/fail threshold (0.0-1.0, scorer passes if score >= threshold)
    include_reason=True,  # Include detailed rationale in feedback
)

# Metric-specific parameters are passed through to DeepEval
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,  # DeepEval-specific: number of conversation turns to consider
    strict_mode=True,  # DeepEval-specific: enforce stricter evaluation criteria
)
```
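
Continuing the example above: when `include_reason=True`, the judge's explanation is returned alongside the score; on MLflow `Feedback` objects this is typically exposed as the `rationale` attribute (an assumption worth verifying against your MLflow version):

```python
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source platform for managing ML workflows.",
)

print(feedback.value)      # "yes" or "no"
print(feedback.rationale)  # judge's explanation, assumed to be populated when include_reason=True
```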

Refer to the DeepEval documentation for metric-specific parameters.

Next Steps