# Guardrails AI

Guardrails AI is a framework for validating LLM outputs using a community-driven hub of validators for safety, PII detection, content quality, and more. MLflow's Guardrails AI integration allows you to use Guardrails validators as MLflow scorers, providing rule-based evaluation without requiring LLM calls.
## Prerequisites

Guardrails AI scorers require the `guardrails-ai` package:

```bash
pip install guardrails-ai
```
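Guardrails validators are distributed through the Guardrails Hub, so depending on your environment you may also need to install each validator you use before first running it. The `guardrails hub install` command is the standard Guardrails CLI; the hub URIs below are assumptions for illustration, so check each validator's hub page for the exact path:

```shell
# Install validators from the Guardrails Hub before first use.
# Hub URIs here are illustrative -- confirm the exact path on each
# validator's hub page.
guardrails hub install hub://guardrails/toxic_language
guardrails hub install hub://guardrails/detect_pii
```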
## Quick Start

**Invoke directly**

```python
from mlflow.genai.scorers.guardrails import ToxicLanguage

scorer = ToxicLanguage(threshold=0.7)

feedback = scorer(
    outputs="This is a professional and helpful response.",
)
print(feedback.value)  # "yes" or "no"
```
**Invoke with `evaluate()`**

```python
import mlflow
from mlflow.genai.scorers.guardrails import ToxicLanguage, DetectPII

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
    {
        "inputs": {"query": "How do I contact support?"},
        "outputs": "You can reach us at support@example.com or call 555-0123.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        ToxicLanguage(threshold=0.7),
        DetectPII(),
    ],
)
```
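Because each Guardrails scorer yields a simple "yes"/"no" feedback value, results are straightforward to post-process. The helper below is a hypothetical sketch (not part of the MLflow API) that tallies the distribution of feedback values, assuming you have collected them as the plain strings shown above:

```python
from collections import Counter


def feedback_distribution(values):
    """Return the fraction of each feedback value ("yes"/"no") in a run.

    `values` is assumed to be a plain list of feedback value strings,
    e.g. gathered from individual scorer calls; adapt the extraction step
    to however you read values out of your evaluate() results.
    """
    if not values:
        return {}
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


print(feedback_distribution(["no", "no", "yes", "no"]))
# {'no': 0.75, 'yes': 0.25}
```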
## Available Guardrails AI Scorers

### Safety and Content Quality

Detect harmful content, PII, and other safety issues in LLM outputs:
| Scorer | What does it evaluate? | Guardrails Hub |
|---|---|---|
| ToxicLanguage | Does the output contain toxic or offensive language? | Link |
| NSFWText | Does the output contain NSFW or explicit content? | Link |
| DetectJailbreak | Does the input contain a jailbreak or prompt injection attempt? | Link |
| DetectPII | Does the output contain personally identifiable information? | Link |
| SecretsPresent | Does the output contain API keys, tokens, or other secrets? | Link |
| GibberishText | Does the output contain nonsensical or incoherent text? | Link |
## Creating Scorers by Name

You can also create Guardrails AI scorers dynamically by name using `get_scorer`:
```python
from mlflow.genai.scorers.guardrails import get_scorer

# Create a scorer by validator name
scorer = get_scorer(
    validator_name="ToxicLanguage",
    threshold=0.7,
)

feedback = scorer(
    outputs="This is a professional response.",
)
```
## Configuration

Guardrails AI scorers accept validator-specific parameters. Any additional keyword arguments are passed directly to the underlying Guardrails validator:
```python
from mlflow.genai.scorers.guardrails import ToxicLanguage, DetectPII, DetectJailbreak

# Toxicity detection with a custom threshold
scorer = ToxicLanguage(
    threshold=0.7,  # Confidence threshold for detection
    validation_method="sentence",  # "sentence" or "full" text validation
)

# PII detection with custom entity types
pii_scorer = DetectPII(
    pii_entities=["CREDIT_CARD", "SSN", "EMAIL_ADDRESS"],
)

# Jailbreak detection with custom sensitivity
jailbreak_scorer = DetectJailbreak(
    threshold=0.9,  # Lower values are more sensitive
)
```
Refer to the Guardrails AI documentation and the Guardrails Hub for validator-specific parameters and the full list of available validators.