
Expectation Concepts

What are Expectations?

Expectations in MLflow represent the ground truth or desired outputs for your GenAI application. They provide a standardized way to capture what your AI system should produce for a given input, establishing the reference point against which actual performance is measured.

Expectations serve as the foundation for systematic evaluation, enabling you to define clear quality standards and measure how well your application meets them across different scenarios and use cases.

[Screenshot: Expectations displayed in the MLflow UI]

Use Cases

Ground Truth Definition

Establish clear, measurable standards for what your GenAI application should produce. For example, define expected answers for factual questions or desired formats for structured outputs.

Expert Knowledge Capture

Capture domain expertise by having subject matter experts define the correct outputs for complex scenarios, creating a knowledge base that can guide both development and evaluation.

Quality Standards

Set explicit quality benchmarks for safety, accuracy, and compliance requirements. Expectations help ensure your AI meets organizational and regulatory standards.

Model Comparison

Use expectations as a consistent baseline to compare different models, prompts, or configurations. This enables objective evaluation of which approach best meets your requirements.

Core Structure

Expectations are always created by human experts who understand the correct behavior for your AI system. The Expectation object in MLflow provides a standard container for storing these ground truth values along with metadata about their creation. Expectations are associated with a Trace, or a particular Span in the Trace, allowing you to define expected behavior at any level of granularity.

Expectation Object Schema

name (str): A string identifying the specific aspect being defined as ground truth.

value (Any): The expected value, which can be:

  • Text responses (e.g., "The capital of France is Paris")
  • Structured data (e.g., {"category": "complaint", "priority": "high"})
  • Lists (e.g., ["doc_123", "doc_456"] for expected retrieval results)
  • Any JSON-serializable value representing the desired output

source (AssessmentSource): The source of the expectation, always of type HUMAN for expectations. The ID typically identifies the expert who defined the ground truth (e.g., email, username, or team identifier).

metadata (Optional[dict[str, str]]): Optional key-value pairs providing context about the expectation, such as confidence level, annotation guidelines used, or version information.

create_time_ms (int): The timestamp of when the expectation was created, in milliseconds.

last_update_time_ms (int): The timestamp of when the expectation was last updated, in milliseconds.

trace_id (str): The ID of the trace that the expectation is attached to.

span_id (Optional[str]): The ID of the span that the expectation is attached to, if it targets a specific operation within the trace. For example, you can set expectations for which documents should be retrieved in a RAG application.
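
The sketch below shows how these fields map to a logging call. It assumes MLflow 3's mlflow.log_expectation API and the AssessmentSource entity; the trace ID, expert identifier, and metadata values are placeholders, and the timestamps are filled in automatically when the expectation is stored.

import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach a ground-truth answer to an existing trace.
# The trace_id and source_id below are placeholder values.
mlflow.log_expectation(
    trace_id="tr-1234567890abcdef",
    name="expected_answer",
    value="The capital of France is Paris.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="geography_expert@company.com",
    ),
    metadata={"confidence": "high", "reference": "Company knowledge base v2.1"},
)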

Expectation Examples

Expected Answer for Factual Question

{
  "name": "expected_answer",
  "value": "The capital of France is Paris. It has been the capital since 987 AD and is home to over 2 million people.",
  "source": {
    "source_type": "HUMAN",
    "source_id": "geography_expert@company.com"
  },
  "metadata": {
    "confidence": "high",
    "reference": "Company knowledge base v2.1"
  }
}

Expected Classification Output

{
  "name": "expected_classification",
  "value": {
    "category": "customer_complaint",
    "sentiment": "negative",
    "priority": "high",
    "department": "billing"
  },
  "source": {
    "source_type": "HUMAN",
    "source_id": "support_team_lead@company.com"
  },
  "metadata": {
    "classification_version": "v3.2",
    "based_on": "Historical ticket analysis"
  }
}

Expected Document Retrieval for RAG System

{
  "name": "expected_documents",
  "value": ["policy_doc_2024_v3", "faq_billing_section", "terms_of_service_5.1"],
  "source": {
    "source_type": "HUMAN",
    "source_id": "rag_specialist@company.com"
  },
  "metadata": {
    "relevance_threshold": "0.85",
    "expected_order": "by_relevance"
  }
}
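
Because this expectation targets retrieval rather than the final answer, it is typically attached to the retriever span instead of the whole trace. A minimal sketch, again assuming MLflow 3's mlflow.log_expectation; both IDs are placeholders you would look up from your own traces.

import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach the expected document list to the retriever span of an existing trace.
mlflow.log_expectation(
    trace_id="tr-1234567890abcdef",   # placeholder trace ID
    span_id="sp-abcdef1234567890",    # placeholder ID of the retriever span
    name="expected_documents",
    value=["policy_doc_2024_v3", "faq_billing_section", "terms_of_service_5.1"],
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="rag_specialist@company.com",
    ),
    metadata={"relevance_threshold": "0.85", "expected_order": "by_relevance"},
)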

Key Differences from Feedback

While both Expectations and Feedback are types of assessments in MLflow, they serve distinct purposes:

Aspect | Expectations | Feedback
Purpose | Define what the AI should produce | Evaluate how well the AI performed
Timing | Set before or during development | Applied after the AI generates output
Source | Always from human experts | Can come from humans, LLM judges, or code
Content | Ground truth values | Quality scores, pass/fail judgments
Usage | Reference point for evaluation | Actual evaluation results
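
In code, the distinction shows up as two separate logging calls. The sketch below assumes MLflow 3's mlflow.log_expectation and mlflow.log_feedback APIs; the trace ID and source identifiers are placeholders.

import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

trace_id = "tr-1234567890abcdef"  # placeholder

# Expectation: ground truth defined by a human expert before evaluation.
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_answer",
    value="The capital of France is Paris.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="geography_expert@company.com",
    ),
)

# Feedback: a judgment about the output the app actually produced,
# here attributed to an LLM judge.
mlflow.log_feedback(
    trace_id=trace_id,
    name="correctness",
    value=True,
    rationale="The response matches the expected answer.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.LLM_JUDGE,
        source_id="gpt-4o",  # placeholder judge identifier
    ),
)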

Best Practices

  1. Be Specific: Define expectations that are clear and unambiguous. Avoid vague expectations that could be interpreted multiple ways.

  2. Consider Edge Cases: Include expectations for edge cases and error scenarios, not just happy path examples.

  3. Version Your Standards: Use metadata to track which version of guidelines or standards was used to create expectations.

  4. Target Appropriate Granularity: Use span-level expectations when you need to validate specific operations (like retrieval or parsing) within a larger workflow.

  5. Maintain Consistency: Ensure multiple experts use the same criteria when defining expectations for similar scenarios.

Integration with Evaluation

Expectations work hand-in-hand with MLflow's evaluation capabilities:

  • Automated Comparison: Use expectations as the ground truth for automated evaluation metrics (see the sketch after this list)
  • Human Review: Compare actual outputs against expectations during manual review
  • LLM Judge Evaluation: Provide expectations as context to LLM judges for more accurate assessment
  • Performance Tracking: Monitor how well your system meets expectations over time
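
As a rough sketch of the automated-comparison path, the example below assumes MLflow 3's mlflow.genai.evaluate API and its scorer decorator, with expectations supplied as part of the evaluation dataset; the field name expected_answer follows the earlier examples, and the matching logic is deliberately simplistic.

import mlflow
from mlflow.genai.scorers import scorer

# Custom scorer comparing the application's output to the ground truth.
@scorer
def matches_expectation(outputs, expectations):
    return expectations["expected_answer"].lower() in outputs.lower()

# Static dataset: outputs were produced earlier, expectations hold the ground truth.
eval_data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {"expected_answer": "The capital of France is Paris."},
    }
]

mlflow.genai.evaluate(data=eval_data, scorers=[matches_expectation])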

Next Steps