# Agentic Metric
The Agentic metric evaluates AI agent performance by measuring complete conversation correctness. It supports pluggable statistical modes: frequentist mode returns point estimates for pass@K, while Bayesian mode propagates the uncertainty in the estimated success rate through the pass@K formula to produce credible intervals.

- Conversation Correctness: A conversation is correct only if ALL interactions are correct
- pass@K: Probability of ≥1 correct conversation when attempting k conversations (0.0–1.0)
- pass^K: Probability of all k conversations being correct (0.0–1.0)
- Tool Correctness: Evaluates tool selection, parameter accuracy, execution sequence, and result utilization per interaction
- Frequentist: `p = c/n`, a point estimate
- Bayesian: `p` is a Beta-Binomial posterior distribution; the pass@K formula is applied across all posterior samples, yielding a credible interval for pass@K and pass^K
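Given the definitions above, the frequentist point estimates follow directly from `p`: at least one success in `k` independent attempts is the complement of `k` failures, and all-`k` success is `p` to the `k`-th power. A minimal sketch (function names are illustrative, not the library's API):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one correct conversation in k independent attempts."""
    return 1.0 - (1.0 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts are correct."""
    return p ** k

# Example: 8 of 10 interactions correct -> p = 0.8
p = 8 / 10
print(round(pass_at_k(p, 3), 3))   # 1 - 0.2**3 = 0.992
print(round(pass_pow_k(p, 3), 3))  # 0.8**3 = 0.512
```

Note how pass@K rises with `k` while pass^K falls: attempting more conversations makes one success likelier but an unbroken streak rarer.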
`k` is a required parameter. pass@K and pass^K are computed per conversation using `n = total_interactions` and `c = correct_interactions`. The default `tool_threshold=1.0` requires perfect tool usage; lower it (e.g. `0.75`) to allow minor deviations.

## Installation
## Basic Usage
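The original usage snippet is not reproduced here, so the following is a plain-Python sketch of what the metric computes, using only concepts stated in these docs (conversation correctness, `total_interactions`, `correct_interactions`); it is not the library's API:

```python
# Each conversation is a list of per-interaction correctness booleans.
conversations = [
    [True, True, True],    # all interactions correct -> conversation correct
    [True, False, True],   # one failure -> conversation incorrect
]

# A conversation is correct only if ALL of its interactions are correct.
correct = [all(interactions) for interactions in conversations]
print(correct)  # [True, False]

# Per-conversation stats that feed the pass@K formulas.
for interactions in conversations:
    n = len(interactions)   # total_interactions
    c = sum(interactions)   # correct_interactions
    p = c / n               # frequentist point estimate of the success rate
    print(n, c, round(p, 2))
```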
## Required Parameters
| Parameter | Type | Description |
|---|---|---|
| `retriever` | `Type[Retriever]` | Data source class; each `Dataset` = 1 conversation |
| `model` | `BaseChatModel` | LangChain-compatible model for LLM-as-judge evaluation |
| `k` | `int` | Number of independent attempts for pass@K/pass^K computation |
## Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `statistical_mode` | `StatisticalMode` | `FrequentistMode()` | Statistical computation mode |
| `threshold` | `float` | `0.7` | Answer correctness threshold (0.0–1.0) |
| `tool_threshold` | `float` | `1.0` | Tool correctness threshold (0.0–1.0) |
| `tool_weights` | `dict[str, float]` | 0.25 each | Weights for tool aspects (selection, parameters, sequence, utilization) |
| `use_structured_output` | `bool` | `True` | Use LangChain structured output |
| `verbose` | `bool` | `False` | Enable verbose logging |
## Statistical Modes
### Frequentist

Computes `p = c/n` as a point estimate and plugs it directly into the pass@K formulas. Simple and fast. `pass_at_k_ci_low`, `pass_at_k_ci_high`, `pass_pow_k_ci_low`, and `pass_pow_k_ci_high` are all `None`.

### Bayesian

Why Bayesian matters for agentic evaluation: a pass@3 of 0.90 sounds great, but if it comes from only 5 conversations, the 95% CI might be [0.55, 0.99]. With 100 conversations, the same rate gives [0.84, 0.95], which is much more trustworthy. Use Bayesian mode when you have few test conversations and need to communicate reliability honestly.
## Data Requirements
Each `Dataset` represents one complete conversation. A conversation is correct only if ALL interactions are correct:
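The original data example is not reproduced here. As a hedged sketch, only `ground_truth_agentic` is named in these docs; the nested field names below are illustrative stand-ins for the four expectations described under "Define Clear Tool Expectations":

```python
# Sketch of one interaction's ground truth (nested field names are assumptions).
interaction = {
    "question": "What's the weather in Paris?",
    "ground_truth_agentic": {
        "expected_tools": ["get_weather"],           # expected tool names
        "expected_parameters": {"city": "Paris"},    # required parameters
        "sequence_matters": False,                   # whether call order is scored
        "results_should_influence_answer": True,     # tool output must shape the answer
    },
}
print(sorted(interaction["ground_truth_agentic"]))
```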
## Output Schema
- `AgenticMetric`
- `ToolCorrectnessScore`
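The schema details are not reproduced here; only the CI field names appear elsewhere in these docs. A sketch of what a result object plausibly carries (the class name and field set are assumptions, not the library's schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgenticResultSketch:
    """Illustrative stand-in for the metric's output object."""
    pass_at_k: float
    pass_pow_k: float
    # Credible-interval bounds: populated in Bayesian mode, None in frequentist mode.
    pass_at_k_ci_low: Optional[float] = None
    pass_at_k_ci_high: Optional[float] = None
    pass_pow_k_ci_low: Optional[float] = None
    pass_pow_k_ci_high: Optional[float] = None

r = AgenticResultSketch(pass_at_k=0.95, pass_pow_k=0.70)
print(r.pass_at_k_ci_low)  # None (frequentist mode)
```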
## Quality Assessment
| pass@K | pass^K | Assessment |
|---|---|---|
| 0.95 | 0.70 | ✅ Reliable — High success and consistency |
| 0.95 | 0.50 | ⚠️ Inconsistent — Can succeed but unreliable |
| 0.70 | any | ❌ Needs Improvement — Low success rate |
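The table's bands can be expressed as a small triage helper. The 0.9 and 0.6 cutoffs are assumptions chosen only to separate the example rows; they are not values defined by the library:

```python
def assess(pass_at_k: float, pass_pow_k: float) -> str:
    # Illustrative cutoffs: 0.9 separates high from low success,
    # 0.6 separates consistent from inconsistent behavior.
    if pass_at_k < 0.9:
        return "Needs Improvement"   # low success rate
    if pass_pow_k < 0.6:
        return "Inconsistent"        # can succeed but unreliable
    return "Reliable"                # high success and consistency

print(assess(0.95, 0.70))  # Reliable
print(assess(0.95, 0.50))  # Inconsistent
print(assess(0.70, 0.99))  # Needs Improvement
```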
## Custom Tool Weights
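The original snippet is not reproduced here. `tool_weights` redistributes emphasis across the four scored aspects (selection, parameters, sequence, utilization); a sketch, where the requirement that weights sum to 1.0 is an assumption based on the 0.25-each default:

```python
# Default: every aspect weighted equally (0.25 each).
default_weights = {"selection": 0.25, "parameters": 0.25, "sequence": 0.25, "utilization": 0.25}

# Example override: emphasize picking the right tool and passing correct
# parameters; de-emphasize call order and result utilization.
tool_weights = {"selection": 0.4, "parameters": 0.4, "sequence": 0.1, "utilization": 0.1}
assert abs(sum(tool_weights.values()) - 1.0) < 1e-9
```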
## Best Practices
### Use Bayesian Mode for Small Test Suites
If you have fewer than 30 conversations, frequentist pass@K estimates can be misleading. Bayesian mode shows you the credible interval, making it clear when more data is needed before drawing conclusions.
### Choose Appropriate K Values
- K=1: Evaluate single conversation success rate
- K=3–5: Balance between reliability and cost (recommended)
- K=10+: High-stakes scenarios requiring high confidence
### Set Meaningful Thresholds
- Strict (0.8–0.9): Factual accuracy matters (medical, legal)
- Moderate (0.7): General purpose — recommended default
- Lenient (0.6): Creative or subjective tasks
### Define Clear Tool Expectations

Provide a complete `ground_truth_agentic` per interaction with expected tool names, required parameters, whether sequence matters, and whether tool results should influence the final answer.

## Troubleshooting
### Judge Returns Low Scores for Correct Answers

Lower the `threshold` parameter (try 0.6–0.65), use a more capable judge model, or ensure the ground truth is clear and unambiguous. Check verbose logs to see the judge's reasoning.

### Tool Correctness Always Fails
The default `tool_threshold=1.0` requires perfect tool correctness. Lower it with `tool_threshold=0.75` to allow minor deviations. Verify that tool names match exactly (case-sensitive) and check the parameter structure.

### Bayesian CI is Very Wide
A wide CI means there is not enough data to estimate the true success rate precisely. This is intentional — collect more test conversations to narrow the interval.
## Next Steps
- Statistical Modes: Deep dive into Frequentist vs. Bayesian approaches
- BestOf Metric: Compare multiple agents in tournament-style evaluation
- Context Metric: Evaluate context alignment