
Ground Truth Expectations

MLflow Expectations provide a systematic way to capture ground truth: the correct or desired outputs that your AI should produce. By establishing these reference points, you create the foundation for meaningful evaluation and continuous improvement of your GenAI applications.

For complete API documentation and implementation details, see the mlflow.log_expectation() reference.

What are Expectations?

Expectations define the "gold standard" for what your AI should produce given specific inputs. They represent the correct answer, desired behavior, or ideal output as determined by domain experts. Think of expectations as the answer key against which actual AI performance is measured.

Unlike feedback that evaluates what happened, expectations establish what should happen. They're always created by humans who have the expertise to define correct outcomes.

Prerequisites

Before using the Expectations API, ensure you have the following (a minimal setup sketch appears after this list):

  • MLflow 3.2.0 or later installed
  • An active MLflow tracking server or local tracking setup
  • Traces that have been logged from your GenAI application to an MLflow Experiment
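
A minimal setup sketch, assuming a local tracking server at http://localhost:5000 and a hypothetical generate_answer function as the traced application (adjust both to your environment); if mlflow.get_last_active_trace_id() is not available in your MLflow version, copy the trace ID from the MLflow UI instead:

import mlflow

# Point MLflow at your tracking setup (URI and experiment name are assumptions)
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("ground-truth-annotation")


# Any traced call produces a trace that can later be annotated with expectations
@mlflow.trace
def generate_answer(question: str) -> str:
    # Placeholder for your GenAI application logic
    return f"Answer to: {question}"


generate_answer("What is the speed of light?")

# Capture the trace ID for annotation; it is also visible in the MLflow UI
trace_id = mlflow.get_last_active_trace_id()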

Why Annotate Ground Truth?

Create Evaluation Baselines

Establish reference points for objective accuracy measurement. Without ground truth, you can't measure how well your AI performs against known correct answers.

Enable Systematic Testing

Transform ad-hoc testing into systematic evaluation by building datasets of expected outputs to consistently measure performance across versions and configurations.

Support Fine-Tuning and Training

Create high-quality training data from ground truth annotations. Essential for fine-tuning models and training automated evaluators.

Establish Quality Standards

Codify quality requirements and transform implicit knowledge into explicit, measurable criteria that everyone can understand and follow.

Types of Expectations

Factual Expectations

For questions with definitive answers:

import mlflow
from mlflow.entities import AssessmentSource
from mlflow.entities.assessment_source import AssessmentSourceType

# trace_id identifies the trace to annotate (copy it from the UI or search_traces)
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_answer",
    value="The speed of light in vacuum is 299,792,458 meters per second",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="physics_expert@university.edu",
    ),
)

Step-by-Step Guides

Add Ground Truth Annotation via UI

The MLflow UI provides an intuitive way to add expectations directly to traces. This approach is ideal for domain experts who need to define ground truth without writing code, and for collaborative annotation workflows where multiple stakeholders contribute different perspectives.

(Screenshot: adding an expectation to a trace in the MLflow UI)

The expectation will be immediately attached to the trace, establishing the ground truth reference for future evaluation.

Log Ground Truth via API

Use the programmatic mlflow.log_expectation() API when you need to automate expectation creation, integrate with existing annotation tools, or build custom ground truth collection workflows.

Programmatically create expectations for systematic ground truth collection:

1. Set up your annotation environment:

import mlflow
from mlflow.entities import AssessmentSource
from mlflow.entities.assessment_source import AssessmentSourceType

# Define your domain expert source
expert_source = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN, source_id="domain_expert@company.com"
)

2. Create expectations for different data types:

def log_factual_expectation(trace_id, question, correct_answer):
    """Log expectation for factual questions."""
    mlflow.log_expectation(
        trace_id=trace_id,
        name="expected_factual_answer",
        value=correct_answer,
        source=expert_source,
        metadata={
            "question": question,
            "expectation_type": "factual",
            "confidence": "high",
            "verified_by": "subject_matter_expert",
        },
    )


def log_structured_expectation(trace_id, expected_extraction):
    """Log expectation for structured data extraction."""
    mlflow.log_expectation(
        trace_id=trace_id,
        name="expected_extraction",
        value=expected_extraction,
        source=expert_source,
        metadata={
            "expectation_type": "structured",
            "schema_version": "v1.0",
            "annotation_guidelines": "company_extraction_standards_v2",
        },
    )


def log_behavioral_expectation(trace_id, expected_behavior):
    """Log expectation for AI behavior patterns."""
    mlflow.log_expectation(
        trace_id=trace_id,
        name="expected_behavior",
        value=expected_behavior,
        source=expert_source,
        metadata={
            "expectation_type": "behavioral",
            "behavior_category": "customer_service",
            "compliance_requirement": "company_policy_v3",
        },
    )

3. Use the functions in your annotation workflow:

# Example: Annotating a customer service interaction
trace_id = "tr-customer-service-001"

# Define what the AI should have said
factual_answer = "Your account balance is $1,234.56 as of today."
log_factual_expectation(trace_id, "What is my account balance?", factual_answer)

# Define expected data extraction
expected_extraction = {
    "intent": "account_balance_inquiry",
    "account_type": "checking",
    "urgency": "low",
    "requires_authentication": True,
}
log_structured_expectation(trace_id, expected_extraction)

# Define expected behavior
expected_behavior = {
    "should_verify_identity": True,
    "tone": "professional_helpful",
    "should_offer_additional_help": True,
    "escalation_required": False,
}
log_behavioral_expectation(trace_id, expected_behavior)

Expectation Annotation Workflows

Different stages of your AI development lifecycle require different approaches to expectation annotation. The following workflows help you systematically create and maintain ground truth expectations that align with your development process and quality goals.

Development Phase

Define success criteria by identifying test scenarios, creating expectations with domain experts, testing AI outputs, and iterating on configurations until expectations are met.

Production Monitoring

Enable systematic quality tracking by sampling production traces, adding expectations to create evaluation datasets, and tracking performance trends over time.
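
For example, a production-sampling loop might look like the sketch below; the experiment name, reviewer identity, and placeholder value are assumptions, and it relies on mlflow.search_traces() returning a DataFrame with a trace_id column in MLflow 3:

import mlflow
from mlflow.entities import AssessmentSource
from mlflow.entities.assessment_source import AssessmentSourceType

# Sample recent traces from the production experiment (name is an assumption)
mlflow.set_experiment("prod-chatbot")
sampled = mlflow.search_traces(max_results=20)

reviewer = AssessmentSource(
    source_type=AssessmentSourceType.HUMAN, source_id="reviewer@company.com"
)

# In a real workflow a human reviews each trace before defining the expectation;
# the placeholder value below only illustrates the call.
for trace_id in sampled["trace_id"]:
    mlflow.log_expectation(
        trace_id=trace_id,
        name="expected_answer",
        value="<ground truth defined by the reviewer>",
        source=reviewer,
    )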

Collaborative Annotation

Use team-based annotation where domain experts define initial expectations, review committees validate and refine, and consensus building resolves disagreements.

Best Practices

Be Specific and Measurable

Vague expectations lead to inconsistent evaluation. Define clear, specific criteria that can be objectively verified.
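
For example, rather than a loose description such as "a helpful refund answer", pin the expectation to fields that can be checked directly. The sketch below reuses the expert_source defined earlier; the field names and values are illustrative:

# Vague: hard to verify objectively
# value = "The answer should be helpful and mention refunds"

# Specific and measurable: each field can be checked directly
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_refund_response",
    value={
        "refund_window_days": 30,
        "must_mention": ["original payment method", "5-7 business days"],
        "must_not_mention": ["store credit only"],
    },
    source=expert_source,
)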

Document Your Reasoning

Use metadata to explain why an expectation is defined a certain way:

mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_diagnosis",
    value={
        "primary": "Type 2 Diabetes",
        "risk_factors": ["obesity", "family_history"],
        "recommended_tests": ["HbA1c", "fasting_glucose"],
    },
    metadata={
        "guideline_version": "ADA_2024",
        "confidence": "high",
        "based_on": "clinical_presentation_and_history",
    },
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN, source_id="endocrinologist@hospital.org"
    ),
)

Maintain Consistency

Use standardized naming and structure across your expectations to enable meaningful analysis and comparison.
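
One way to encourage this, sketched below with an assumed "expected_<kind>" naming scheme, is a small helper that annotators call instead of invoking mlflow.log_expectation() directly:

# Hypothetical helper that enforces a shared naming scheme and required
# metadata for every annotation
EXPECTATION_KINDS = {"factual", "extraction", "behavior"}


def log_standard_expectation(trace_id, kind, value, annotator_email, **metadata):
    if kind not in EXPECTATION_KINDS:
        raise ValueError(f"Unknown expectation kind: {kind}")
    mlflow.log_expectation(
        trace_id=trace_id,
        name=f"expected_{kind}",  # consistent "expected_<kind>" naming
        value=value,
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id=annotator_email
        ),
        metadata={"expectation_type": kind, **metadata},
    )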

Managing Expectations

Once you've defined expectations for your traces, you may need to retrieve, update, or delete them to maintain accurate ground truth data.

Retrieving Expectations

Retrieve specific expectations to analyze your ground truth data:

# Get a specific expectation by ID
expectation = mlflow.get_assessment(
    trace_id="tr-1234567890abcdef", assessment_id="a-0987654321abcdef"
)

# Access expectation details
name = expectation.name
value = expectation.value
source_type = expectation.source.source_type
metadata = expectation.metadata if hasattr(expectation, "metadata") else None

Updating Expectations

Update existing expectations when ground truth needs refinement:

from mlflow.entities import Expectation

# Update expectation with corrected information
updated_expectation = Expectation(
    name="expected_answer",
    value="The capital of France is Paris, located in the Île-de-France region",
)

mlflow.update_assessment(
    trace_id="tr-1234567890abcdef",
    assessment_id="a-0987654321abcdef",
    assessment=updated_expectation,
)

Deleting Expectations

Remove expectations that were logged incorrectly:

# Delete specific expectation
mlflow.delete_assessment(
    trace_id="tr-1234567890abcdef", assessment_id="a-5555666677778888"
)

Integration with Evaluation

Expectations are most powerful when combined with systematic evaluation:

  1. Automated scoring against expectations
  2. Human feedback on expectation achievement
  3. Gap analysis between expected and actual outputs (see the sketch after this list)
  4. Performance metrics based on expectation matching
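
A minimal gap-analysis sketch, assuming the expectation holds the reference answer and that the trace's final output is exposed as trace.data.response (the IDs below are placeholders):

import mlflow

trace_id = "tr-1234567890abcdef"
assessment_id = "a-0987654321abcdef"

# Fetch the annotated ground truth and the trace it describes
expectation = mlflow.get_assessment(trace_id=trace_id, assessment_id=assessment_id)
trace = mlflow.get_trace(trace_id)

print("Expected:", expectation.value)
print("Actual:  ", trace.data.response)
# From here, feed both values into a scorer of your choice (exact match,
# semantic similarity, an LLM judge) to quantify the gap.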

Next Steps