# RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed for LLM applications. MLflow's RAGAS integration lets you use RAGAS metrics as MLflow scorers to evaluate retrieval quality, answer generation, and other aspects of your LLM application.
## Prerequisites

RAGAS scorers require the `ragas` package:

```bash
pip install ragas
```
## Quick Start

You can call a RAGAS scorer directly on an MLflow trace:

```python
from mlflow.genai.scorers.ragas import Faithfulness

# `trace` is an existing MLflow trace, e.g. retrieved with mlflow.get_trace()
scorer = Faithfulness(model="openai:/gpt-4")
feedback = scorer(trace=trace)

print(feedback.value)  # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```
Or use them with `mlflow.genai.evaluate`:

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        ContextPrecision(model="openai:/gpt-4"),
    ],
)
```
## Available RAGAS Scorers
RAGAS scorers are organized into categories based on their evaluation focus:
### RAG (Retrieval-Augmented Generation) Metrics
Evaluate retrieval quality and answer generation in RAG systems:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? | Link |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision using reference answers | Link |
| ContextRecall | Does retrieval context contain all information needed to answer the query? | Link |
| NonLLMContextRecall | Non-LLM version of context recall using reference answers | Link |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? | Link |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? | Link |
| Faithfulness | Is the output factually consistent with the retrieval context? | Link |
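A typical setup scores several of these metrics together over logged traces, following the same pattern as the Quick Start. The metric choice below is illustrative; note that reference-based metrics such as ContextRecall also need an expected answer available to the scorer (for example via expectations attached to the evaluation data).

```python
import mlflow
from mlflow.genai.scorers.ragas import ContextRecall, Faithfulness, NoiseSensitivity

results = mlflow.genai.evaluate(
    data=mlflow.search_traces(),
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        NoiseSensitivity(model="openai:/gpt-4"),
        ContextRecall(model="openai:/gpt-4"),
    ],
)
```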
### Natural Language Comparison
Evaluate answer quality through natural language comparison:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| FactualCorrectness | Is the output factually correct compared to the expected answer? | Link |
| NonLLMStringSimilarity | String similarity between output and expected answer | Link |
| BleuScore | BLEU score for text comparison | Link |
| ChrfScore | CHRF score for text comparison | Link |
| RougeScore | ROUGE score for text comparison | Link |
| StringPresence | Is a specific string present in the output? | Link |
| ExactMatch | Does the output exactly match the expected output? | Link |
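Most of these metrics compare the generated output against an expected answer, so the evaluation data has to carry that reference. Below is a minimal sketch using a static dataset; the `inputs`/`outputs`/`expectations` layout follows MLflow's evaluation dataset format, but the exact expectation key each RAGAS scorer reads (here `expected_response`) is an assumption worth verifying against the docs.

```python
import mlflow
from mlflow.genai.scorers.ragas import BleuScore, ExactMatch

eval_data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "Paris is the capital of France.",
        # Expectation key name is an assumption; adjust to what your scorer expects
        "expectations": {"expected_response": "Paris is the capital of France."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[ExactMatch(), BleuScore()],  # deterministic metrics; no judge model needed
)
```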
### General Purpose
Flexible evaluation metrics for various use cases:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| AspectCritic | Evaluates specific aspects of the output using an LLM judge | Link |
| RubricsScore | Scores output based on predefined rubrics | Link |
| InstanceRubrics | Scores output based on instance-specific rubrics | Link |
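AspectCritic and the rubric-based scorers only become meaningful once you tell them what to judge. The sketch below passes RAGAS's `name` and `definition` arguments for AspectCritic through the MLflow wrapper as extra keyword arguments (see the Configuration section); treat the argument names as assumptions and confirm them against the RAGAS AspectCritic docs for your installed version.

```python
from mlflow.genai.scorers.ragas import AspectCritic

# `name` and `definition` are forwarded to the RAGAS AspectCritic constructor
# (argument names taken from the RAGAS docs; verify for your installed version).
scorer = AspectCritic(
    model="openai:/gpt-4",
    name="conciseness",
    definition="Is the response free of unnecessary repetition and filler?",
)
feedback = scorer(trace=trace)
print(feedback.value)
```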
### Other Tasks
Specialized metrics for specific tasks:
| Scorer | What does it evaluate? | RAGAS Docs |
|---|---|---|
| SummarizationScore | Quality of text summarization | Link |
## Creating Scorers by Name

You can also create RAGAS scorers dynamically with `get_scorer`:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Create a scorer by metric name
scorer = get_scorer(
    metric_name="Faithfulness",
    model="openai:/gpt-4",
)
feedback = scorer(trace=trace)
```
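Creating scorers by name is handy when the metric list is configuration-driven, for example read from a config file. A minimal sketch (metric names taken from the tables above):

```python
import mlflow
from mlflow.genai.scorers.ragas import get_scorer

# Build scorers from a configurable list of metric names
metric_names = ["Faithfulness", "ContextPrecision", "ContextRecall"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names]

results = mlflow.genai.evaluate(
    data=mlflow.search_traces(),
    scorers=scorers,
)
```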
## Configuration
RAGAS scorers accept metric-specific parameters. Any additional keyword arguments are passed directly to the RAGAS metric constructor:
```python
from mlflow.genai.scorers.ragas import ExactMatch, Faithfulness

# LLM-based metric with model specification
scorer = Faithfulness(model="openai:/gpt-4")

# Deterministic, non-LLM metric (no model required)
deterministic_scorer = ExactMatch()
```
Refer to the RAGAS documentation for metric-specific parameters and advanced usage.
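As one illustration of forwarding metric-specific keyword arguments, a rubric-based scorer such as RubricsScore can be configured with a scoring rubric. The `rubrics` argument and its `scoreN_description` keys below follow the RAGAS rubric format as documented upstream; treat them as assumptions and check the RAGAS docs for the exact schema in your installed version.

```python
from mlflow.genai.scorers.ragas import RubricsScore

# `rubrics` is forwarded to the RAGAS RubricsScore constructor
# (key names follow the RAGAS docs; verify for your installed version).
scorer = RubricsScore(
    model="openai:/gpt-4",
    rubrics={
        "score1_description": "The response is incorrect or irrelevant.",
        "score2_description": "The response is mostly incorrect with minor relevant content.",
        "score3_description": "The response is partially correct but misses key details.",
        "score4_description": "The response is mostly correct with minor omissions.",
        "score5_description": "The response is accurate, complete, and well grounded.",
    },
)
feedback = scorer(trace=trace)
```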