Create Custom LLM Scorers
MLflow's predefined LLM judge scorers are an excellent starting point for common quality dimensions in simpler applications. As your application grows more complex, you will need custom LLM judges that tune your evaluation criteria to the specific, nuanced business requirements of your use case and align them with your domain experts' judgment. MLflow provides robust, flexible ways to create custom LLM judges tailored to these requirements.
Guidelines (we suggest starting here)
Guidelines is a powerful scorer class that lets you quickly customize evaluation by defining natural-language criteria framed as pass/fail conditions. It is ideal for checking compliance with rules, style guides, or the inclusion and exclusion of specific information.
Guidelines have the distinct advantage of being easy to explain to business stakeholders ("we are evaluating whether the app follows this set of rules") and, as such, can often be written directly by domain experts.
Example usage
First, define each guideline as a plain string:
tone = "The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns."
easy_to_understand = "The response must use clear, concise language and structure responses logically. It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
Then pass each guideline to the Guidelines class to create a scorer and run evaluation:
import mlflow
from mlflow.genai.scorers import Guidelines

eval_dataset = [
    {
        "inputs": {"question": "I'm having trouble with my account. I can't log in."},
        "outputs": "I'm sorry to hear that you're having trouble logging in. Please provide me with your username and the specific issue you're experiencing, and I'll be happy to help you resolve it.",
    },
    {
        "inputs": {"question": "How much does a microwave cost?"},
        "outputs": "The microwave costs $100.",
    },
    {
        "inputs": {"question": "How does a refrigerator work?"},
        "outputs": "A refrigerator operates via thermodynamic vapor-compression cycles utilizing refrigerant phase transitions. The compressor pressurizes vapor which condenses externally, then expands through evaporator coils to absorb internal heat through endothermic vaporization.",
    },
]

mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        # Create a scorer for each guideline
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="easy_to_understand", guidelines=easy_to_understand),
        Guidelines(name="banned_topics", guidelines=banned_topics),
    ],
)

Selecting Judge Models
MLflow supports all major LLM providers, including OpenAI, Anthropic, Google, and xAI.
See Supported Models for more details.
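For example, you can typically pin the judge model when constructing a scorer. The sketch below assumes the Guidelines scorer accepts a model argument in the <provider>:/<model-name> form described in Supported Models; the model name shown is illustrative only.

# Minimal sketch: pin a specific judge model for a Guidelines scorer.
# Assumes a `model` argument in "<provider>:/<model-name>" form; see
# Supported Models for the providers and model names available to you.
from mlflow.genai.scorers import Guidelines

tone_judge = Guidelines(
    name="tone",
    guidelines=tone,  # the guideline string defined earlier
    model="openai:/gpt-4o-mini",  # illustrative model URI
)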
Bring Your Own Prompt (full control)
While Guidelines scorers have the distinct advantage of being easy to write and maintain, you may need more control, or your evaluation criteria may not be expressible as pass/fail guidelines.
The custom_prompt_judge API allows you to define a full prompt for the judge, while still letting MLflow handle complexities like response parsing.
See Bring Your Own Prompt for more details.
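As a rough sketch of what this looks like in practice (the import path, prompt-template placeholders, choice labels, and wrapping pattern below are assumptions based on the Bring Your Own Prompt page, not a definitive reference), a custom prompt judge can be wrapped in a scorer and passed to evaluation like any other:

# Rough sketch only: the template conventions and @scorer wrapper shown here
# are assumptions; see Bring Your Own Prompt for the authoritative usage.
import mlflow
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

formality_judge = custom_prompt_judge(
    name="formality",
    prompt_template=(
        "Evaluate the formality of the response.\n\n"
        "<request>{{request}}</request>\n"
        "<response>{{response}}</response>\n\n"
        "[[formal]]: The response is written in a professional register.\n"
        "[[informal]]: The response is casual or colloquial."
    ),
    numeric_values={"formal": 1, "informal": 0},
)

@scorer
def formality(inputs, outputs):
    # Map the evaluation record's fields onto the judge's template variables.
    return formality_judge(request=inputs["question"], response=outputs)

mlflow.genai.evaluate(data=eval_dataset, scorers=[formality])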
Next Steps
Evaluate Agents
Learn how to evaluate AI agents with specialized techniques and scorers
Evaluate Traces
Evaluate production traces to understand and improve your AI application's behavior
Collect User Feedback
Integrate user feedback to continuously improve your evaluation criteria and model performance