
RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed for LLM applications. MLflow's RAGAS integration allows you to use RAGAS metrics as MLflow scorers for evaluating retrieval quality, answer generation, and other aspects of LLM applications.

Prerequisites

RAGAS scorers require the ragas package:

```bash
pip install ragas
```

Quick Start

You can call RAGAS scorers directly:

```python
from mlflow.genai.scorers.ragas import Faithfulness

scorer = Faithfulness(model="openai:/gpt-4")

# `trace` is an MLflow Trace object, e.g. retrieved with mlflow.get_trace(<trace_id>)
feedback = scorer(trace=trace)

print(feedback.value)  # Score between 0.0 and 1.0
print(feedback.rationale)  # Explanation of the score
```

Or use them in mlflow.genai.evaluate:

```python
import mlflow
from mlflow.genai.scorers.ragas import Faithfulness, ContextPrecision

traces = mlflow.search_traces()
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        Faithfulness(model="openai:/gpt-4"),
        ContextPrecision(model="openai:/gpt-4"),
    ],
)
```

Available RAGAS Scorers

RAGAS scorers are organized into categories based on their evaluation focus:

RAG (Retrieval-Augmented Generation) Metrics

Evaluate retrieval quality and answer generation in RAG systems:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| ContextPrecision | Are relevant retrieved documents ranked higher than irrelevant ones? | Link |
| NonLLMContextPrecisionWithReference | Non-LLM version of context precision using reference answers | Link |
| ContextRecall | Does the retrieved context contain all the information needed to answer the query? | Link |
| NonLLMContextRecall | Non-LLM version of context recall using reference answers | Link |
| ContextEntityRecall | Are entities from the expected answer present in the retrieved context? | Link |
| NoiseSensitivity | How sensitive is the model to irrelevant information in the context? | Link |
| Faithfulness | Is the output factually consistent with the retrieved context? | Link |
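
These scorers can be combined in a single evaluation run to cover both retrieval and generation quality. The following is a minimal sketch that follows the Quick Start pattern; it assumes your traces capture the retrieved context, and reference-based metrics such as ContextRecall also need an expected answer available in the evaluation data:

```python
import mlflow
from mlflow.genai.scorers.ragas import ContextPrecision, ContextRecall, Faithfulness

# Score retrieval quality and answer faithfulness over previously logged traces
traces = mlflow.search_traces()
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[
        ContextPrecision(model="openai:/gpt-4"),
        ContextRecall(model="openai:/gpt-4"),
        Faithfulness(model="openai:/gpt-4"),
    ],
)
```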

Natural Language Comparison

Evaluate answer quality through natural language comparison:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| FactualCorrectness | Is the output factually correct compared to the expected answer? | Link |
| NonLLMStringSimilarity | String similarity between the output and the expected answer | Link |
| BleuScore | BLEU score for text comparison | Link |
| ChrfScore | CHRF score for text comparison | Link |
| RougeScore | ROUGE score for text comparison | Link |
| StringPresence | Is a specific string present in the output? | Link |
| ExactMatch | Does the output exactly match the expected output? | Link |
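
Because most of these metrics compare the output against an expected answer, they pair naturally with a static evaluation dataset rather than raw traces. A minimal sketch follows; the inputs/outputs/expectations keys follow MLflow's evaluation dataset convention, and the expected_response field name is an assumption to verify against your MLflow version:

```python
import mlflow
from mlflow.genai.scorers.ragas import BleuScore, ExactMatch

# Each record pairs the application's output with the expected answer
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "Paris",
        "expectations": {"expected_response": "Paris"},  # assumed expectation key
    },
]

results = mlflow.genai.evaluate(
    data=data,
    scorers=[
        ExactMatch(),  # deterministic string match, no LLM required
        BleuScore(),  # n-gram overlap with the expected answer
    ],
)
```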

General Purpose

Flexible evaluation metrics for various use cases:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| AspectCritic | Evaluates specific aspects of the output using an LLM | Link |
| RubricsScore | Scores the output based on predefined rubrics | Link |
| InstanceRubrics | Scores the output based on instance-specific rubrics | Link |
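
AspectCritic is the most flexible of these: you describe the aspect to judge in natural language. The sketch below assumes the name and definition keyword arguments are forwarded to the RAGAS AspectCritic constructor (see Configuration below); check the RAGAS docs for the exact signature in your installed version:

```python
from mlflow.genai.scorers.ragas import AspectCritic

# Judge a custom, application-specific aspect of the output
conciseness = AspectCritic(
    model="openai:/gpt-4",
    name="conciseness",  # assumed RAGAS constructor argument
    definition="Is the answer direct and free of unnecessary filler?",
)

feedback = conciseness(trace=trace)
print(feedback.value, feedback.rationale)
```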

Other Tasks

Specialized metrics for specific tasks:

| Scorer | What does it evaluate? | RAGAS Docs |
| --- | --- | --- |
| SummarizationScore | Quality of text summarization | Link |

Creating Scorers by Name

You can also create RAGAS scorers dynamically using get_scorer:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Create a scorer by metric name
scorer = get_scorer(
    metric_name="Faithfulness",
    model="openai:/gpt-4",
)

feedback = scorer(trace=trace)
```
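
This is convenient when the set of metrics is configuration-driven, for example when building the scorer list from a plain list of metric names. A minimal sketch using names from the tables above:

```python
from mlflow.genai.scorers.ragas import get_scorer

# Metric names match the scorer class names listed above
metric_names = ["Faithfulness", "ContextPrecision", "ContextRecall"]
scorers = [get_scorer(metric_name=name, model="openai:/gpt-4") for name in metric_names]
```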

Configuration

RAGAS scorers accept metric-specific parameters. Any additional keyword arguments are passed directly to the RAGAS metric constructor:

```python
from mlflow.genai.scorers.ragas import ExactMatch, Faithfulness

# LLM-based metric with model specification
scorer = Faithfulness(model="openai:/gpt-4")

# Non-LLM metric (no model required)
deterministic_scorer = ExactMatch()
```
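
For example, a rubric-based scorer can take its rubric definition as an extra keyword argument, which is forwarded to the underlying RAGAS metric constructor. This is a hedged sketch; the rubrics parameter name and the scoreN_description key format follow the RAGAS RubricsScore API, so verify them against the RAGAS docs for your installed version:

```python
from mlflow.genai.scorers.ragas import RubricsScore

# Extra keyword arguments are forwarded to the RAGAS metric constructor
helpfulness = RubricsScore(
    model="openai:/gpt-4",
    rubrics={  # assumed RAGAS rubric format
        "score1_description": "The response does not address the question at all.",
        "score3_description": "The response partially addresses the question.",
        "score5_description": "The response fully and accurately addresses the question.",
    },
)
```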

Refer to the RAGAS documentation for metric-specific parameters and advanced usage.

Next Steps