# Guardrails AI

Guardrails AI is a framework for validating LLM outputs using a community-driven hub of validators for safety, PII detection, content quality, and more. MLflow's Guardrails AI integration allows you to use Guardrails validators as MLflow scorers, providing rule-based evaluation without requiring LLM calls.
## Prerequisites

Guardrails AI scorers require the `guardrails-ai` package:

```bash
pip install guardrails-ai
```
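Guardrails validators are distributed through the Guardrails Hub, so depending on your environment you may also need to install each validator you use before first running it. The `guardrails hub install` command is the standard Guardrails CLI; the hub URIs below are assumptions for illustration, so check each validator's hub page for the exact path:

```shell
# Install validators from the Guardrails Hub before first use.
# Hub URIs here are illustrative -- confirm the exact path on each
# validator's hub page.
guardrails hub install hub://guardrails/toxic_language
guardrails hub install hub://guardrails/detect_pii
```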
## Quick Start

**Invoke directly**

```python
from mlflow.genai.scorers.guardrails import ToxicLanguage

scorer = ToxicLanguage(threshold=0.7)

feedback = scorer(
    outputs="This is a professional and helpful response.",
)
print(feedback.value)  # "yes" or "no"
```
**Invoke with `evaluate()`**

```python
import mlflow
from mlflow.genai.scorers.guardrails import ToxicLanguage, DetectPII

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
    {
        "inputs": {"query": "How do I contact support?"},
        "outputs": "You can reach us at support@example.com or call 555-0123.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        ToxicLanguage(threshold=0.7),
        DetectPII(),
    ],
)
```
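Because each Guardrails scorer yields a simple "yes"/"no" feedback value, results are straightforward to post-process. The helper below is a hypothetical sketch (not part of the MLflow API) that tallies the distribution of feedback values, assuming you have collected them as the plain strings shown above:

```python
from collections import Counter


def feedback_distribution(values):
    """Return the fraction of each feedback value ("yes"/"no") in a run.

    `values` is assumed to be a plain list of feedback value strings,
    e.g. gathered from individual scorer calls; adapt the extraction step
    to however you read values out of your evaluate() results.
    """
    if not values:
        return {}
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


print(feedback_distribution(["no", "no", "yes", "no"]))
# {'no': 0.75, 'yes': 0.25}
```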
## Available Guardrails AI Scorers

### Safety and Content Quality

Detect harmful content, PII, and other safety issues in LLM outputs:
| Scorer | What does it evaluate? | Guardrails Hub |
|---|---|---|
| ToxicLanguage | Does the output contain toxic or offensive language? | Link |
| NSFWText | Does the output contain NSFW or explicit content? | Link |
| DetectJailbreak | Does the input contain a jailbreak or prompt injection attempt? | Link |
| DetectPII | Does the output contain personally identifiable information? | Link |
| SecretsPresent | Does the output contain API keys, tokens, or other secrets? | Link |
| GibberishText | Does the output contain nonsensical or incoherent text? | Link |
## Creating Scorers by Name

You can also create Guardrails AI scorers dynamically by name using `get_scorer`:
```python
from mlflow.genai.scorers.guardrails import get_scorer

# Create a scorer by validator name
scorer = get_scorer(
    validator_name="ToxicLanguage",
    threshold=0.7,
)

feedback = scorer(
    outputs="This is a professional response.",
)
```
## Configuration

Guardrails AI scorers accept validator-specific parameters. Any additional keyword arguments are passed directly to the underlying Guardrails validator:
```python
from mlflow.genai.scorers.guardrails import ToxicLanguage, DetectPII, DetectJailbreak

# Toxicity detection with a custom threshold
scorer = ToxicLanguage(
    threshold=0.7,  # Confidence threshold for detection
    validation_method="sentence",  # "sentence" or "full" text validation
)

# PII detection with custom entity types
pii_scorer = DetectPII(
    pii_entities=["CREDIT_CARD", "SSN", "EMAIL_ADDRESS"],
)

# Jailbreak detection with custom sensitivity
jailbreak_scorer = DetectJailbreak(
    threshold=0.9,  # Lower values are more sensitive
)
```
Refer to the Guardrails AI documentation and the Guardrails Hub for validator-specific parameters and the full list of available validators.