# TruLens
TruLens is an evaluation and observability framework for LLM applications that provides feedback functions for RAG systems and agent trace analysis. MLflow's TruLens integration allows you to use TruLens feedback functions as MLflow scorers, including benchmarked goal-plan-action alignment evaluations for agent traces.
## Prerequisites
TruLens scorers require the `trulens` and `trulens-providers-litellm` packages:

```bash
pip install trulens trulens-providers-litellm
```
## Quick Start
**Invoke directly:**
```python
from mlflow.genai.scorers.trulens import Groundedness

scorer = Groundedness(model="openai:/gpt-5-mini")

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source platform for managing machine learning workflows.",
    expectations={
        "context": "MLflow is an ML platform for experiment tracking and model deployment."
    },
)
print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # 0.85
```
**Invoke with `evaluate()`:**

```python
import mlflow
from mlflow.genai.scorers.trulens import Groundedness, AnswerRelevance

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
        "expectations": {
            "context": "MLflow is an ML platform for experiment tracking and model deployment."
        },
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
        "expectations": {
            "context": "MLflow provides APIs like mlflow.start_run() for experiment tracking."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Groundedness(model="openai:/gpt-5-mini"),
        AnswerRelevance(model="openai:/gpt-5-mini"),
    ],
)
```
## Available TruLens Scorers
TruLens scorers are organized into categories based on their evaluation focus:
### RAG (Retrieval-Augmented Generation) Metrics
Evaluate retrieval quality and answer generation in RAG systems:
| Scorer | What does it evaluate? | TruLens Docs |
|---|---|---|
| Groundedness | Is the response grounded in the provided context? | Link |
| ContextRelevance | Is the retrieved context relevant to the input query? | Link |
| AnswerRelevance | Is the output relevant to the input query? | Link |
| Coherence | Is the output coherent and logically consistent? | Link |
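TruLens implements these metrics as LLM-judged feedback functions. Purely as intuition for what a groundedness score measures, the sketch below uses a token-overlap proxy: the fraction of the response's words that also appear in the context. This is an illustrative stand-in, not TruLens's actual LLM-based method, and `overlap_groundedness` is a hypothetical helper name.

```python
def _tokens(text: str) -> set[str]:
    """Lowercase and strip trailing punctuation from each word."""
    return {t.strip(".,!?") for t in text.lower().split()}


def overlap_groundedness(output: str, context: str) -> float:
    """Toy proxy for groundedness: the fraction of output tokens that
    also appear in the context. TruLens's real Groundedness scorer uses
    an LLM judge, not token overlap; this only builds intuition."""
    out = _tokens(output)
    if not out:
        return 0.0
    return len(out & _tokens(context)) / len(out)


print(overlap_groundedness(
    "MLflow is an open-source platform.",
    "MLflow is a platform for tracking.",
))  # 0.6 -- 3 of 5 response tokens appear in the context
```

A fully grounded response scores 1.0 under this proxy; a response with no token overlap scores 0.0.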
### Agent Trace Metrics
Evaluate AI agent execution traces using goal-plan-action alignment:
| Scorer | What does it evaluate? | TruLens Docs |
|---|---|---|
| LogicalConsistency | Is the agent's reasoning logically consistent throughout execution? | Link |
| ExecutionEfficiency | Does the agent take an optimal path without unnecessary steps? | Link |
| PlanAdherence | Does the agent follow its stated plan during execution? | Link |
| PlanQuality | Is the agent's plan well-structured and appropriate for the goal? | Link |
| ToolSelection | Does the agent choose the appropriate tools for each step? | Link |
| ToolCalling | Does the agent invoke tools with correct parameters? | Link |
Agent trace scorers require a `trace` argument and evaluate the full execution trace:

```python
import mlflow
from mlflow.genai.scorers.trulens import LogicalConsistency, ToolSelection

traces = mlflow.search_traces(experiment_ids=["1"])

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        LogicalConsistency(model="openai:/gpt-5-mini"),
        ToolSelection(model="openai:/gpt-5-mini"),
    ],
)
```
## Creating Scorers by Name
You can also create TruLens scorers dynamically with `get_scorer`:

```python
from mlflow.genai.scorers.trulens import get_scorer

# Create a scorer by metric name
scorer = get_scorer(
    metric_name="Groundedness",
    model="openai:/gpt-5-mini",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
    expectations={"context": "MLflow is an ML platform."},
)
```
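A name-based factory like this typically maps metric names to scorer classes in a registry. The sketch below illustrates that dispatch pattern with hypothetical stand-in classes; it is not MLflow's actual implementation of `get_scorer`.

```python
# Hypothetical sketch of a name-to-class scorer registry; MLflow's real
# get_scorer in mlflow.genai.scorers.trulens may differ.
class Groundedness:
    def __init__(self, model: str):
        self.model = model


class AnswerRelevance:
    def __init__(self, model: str):
        self.model = model


_SCORER_REGISTRY = {
    "Groundedness": Groundedness,
    "AnswerRelevance": AnswerRelevance,
}


def get_scorer(metric_name: str, model: str):
    """Look up a scorer class by name and instantiate it."""
    try:
        cls = _SCORER_REGISTRY[metric_name]
    except KeyError:
        raise ValueError(f"Unknown metric: {metric_name!r}") from None
    return cls(model=model)


scorer = get_scorer(metric_name="Groundedness", model="openai:/gpt-5-mini")
print(type(scorer).__name__)  # Groundedness
```

The registry pattern keeps name-based creation and direct class instantiation equivalent, and lets unknown names fail fast with a clear error.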
## Configuration
TruLens scorers accept common parameters that control evaluation behavior:

```python
from mlflow.genai.scorers.trulens import Groundedness, ContextRelevance

# Common parameters
scorer = Groundedness(
    model="openai:/gpt-5-mini",  # Model URI (also supports "databricks", "databricks:/endpoint", etc.)
    threshold=0.7,  # Pass/fail threshold (0.0-1.0); the scorer passes if score >= threshold
)

# The default threshold is 0.5
scorer = ContextRelevance(model="openai:/gpt-5-mini")
```
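The threshold ties the two outputs shown in the Quick Start together: the numeric score in `feedback.metadata["score"]` is compared against the threshold to produce the "yes"/"no" in `feedback.value`. A minimal sketch of that mapping (illustrative, with a hypothetical helper name, not MLflow's internal code):

```python
def to_pass_fail(score: float, threshold: float = 0.5) -> str:
    """Map a 0.0-1.0 feedback score to the "yes"/"no" value a scorer
    reports: it passes when score >= threshold. Illustrative sketch,
    not MLflow's actual implementation."""
    return "yes" if score >= threshold else "no"


print(to_pass_fail(0.85, threshold=0.7))  # yes
print(to_pass_fail(0.4))                  # no (default threshold 0.5)
```

So the same underlying score of 0.85 passes at the default threshold of 0.5 and also at a stricter threshold of 0.7, but would fail at 0.9.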
Refer to the TruLens documentation for additional details on feedback functions and advanced usage.