
Agentic Metric

The Agentic metric evaluates AI agent performance by measuring complete conversation correctness. A conversation is correct only if ALL its interactions are correct. It supports pluggable statistical modes — frequentist returns point estimates for pass@K, while Bayesian propagates the uncertainty in the estimated success rate through the pass@K formula to produce credible intervals.
  • Conversation Correctness: A conversation is correct only if ALL interactions are correct
  • pass@K: Probability of ≥1 correct conversation when attempting k conversations (0.0–1.0)
  • pass^K: Probability of all k conversations being correct (0.0–1.0)
  • Tool Correctness: Evaluates tool selection, parameter accuracy, execution sequence, and result utilization per interaction
pass@k = 1 - (1 - p)^k   # Probability of ≥1 correct in k independent attempts
pass^k = p^k              # Probability of all k attempts correct

Where p = estimated success rate from evaluation
  • Frequentist: p = c/n — a point estimate
  • Bayesian: p is a Beta-Binomial posterior distribution — the pass@K formula is applied across all posterior samples, yielding a credible interval for pass@K and pass^K
k is a required parameter. pass@K and pass^K are computed per conversation using n = total_interactions and c = correct_interactions. The default tool_threshold=1.0 requires perfect tool usage — lower it (e.g. 0.75) to allow minor deviations.
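
The two formulas above can be sketched as plain Python — these are the generic pass@K/pass^K definitions, not gaussia internals:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one correct conversation in k independent attempts."""
    return 1.0 - (1.0 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts are correct."""
    return p ** k

# Per-conversation estimate: c correct interactions out of n total
n, c, k = 10, 7, 3
p = c / n
print(round(pass_at_k(p, k), 3))   # 0.973
print(round(pass_pow_k(p, k), 3))  # 0.343
```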

Installation

uv add gaussia
uv add langchain-openai  # Or your preferred LLM provider

Basic Usage

from gaussia.metrics.agentic import Agentic
from langchain_openai import ChatOpenAI
from your_retriever import AgenticRetriever

judge_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    threshold=0.7,
    verbose=True,
)

for metric in metrics:
    print(f"{metric.session_id}:")
    print(f"  pass@{metric.k} = {metric.pass_at_k:.3f}")
    print(f"  pass^{metric.k} = {metric.pass_pow_k:.3f}")

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| retriever | Type[Retriever] | Data source class — each Dataset = 1 conversation |
| model | BaseChatModel | LangChain-compatible model for LLM-as-judge evaluation |
| k | int | Number of independent attempts for pass@K/pass^K computation |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| threshold | float | 0.7 | Answer correctness threshold (0.0–1.0) |
| tool_threshold | float | 1.0 | Tool correctness threshold (0.0–1.0) |
| tool_weights | dict[str, float] | 0.25 each | Weights for tool aspects (selection, parameters, sequence, utilization) |
| use_structured_output | bool | True | Use LangChain structured output |
| verbose | bool | False | Enable verbose logging |

Statistical Modes

Frequentist mode (the default) computes p = c/n as a point estimate and plugs it directly into the pass@K formulas. It is simple and fast.
# With 7 correct out of 10 interactions, k=3:
# p = 7/10 = 0.70
# pass@3 = 1 - (1 - 0.70)^3 = 0.973
# pass^3 = 0.70^3 = 0.343
In frequentist mode the interval fields pass_at_k_ci_low, pass_at_k_ci_high, pass_pow_k_ci_low, and pass_pow_k_ci_high are all None.
Why Bayesian matters for agentic evaluation: A pass@3 of 0.90 sounds great — but if it comes from only 5 conversations, the 95% CI might be [0.55, 0.99]. With 100 conversations, the same rate gives [0.84, 0.95], which is much more trustworthy. Use Bayesian mode when you have few test conversations and need to communicate reliability honestly.
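
The effect described above can be reproduced with a minimal NumPy sketch of the Bayesian idea: sample the posterior over the success rate and push each sample through the pass@K formula. This is an illustration only, not gaussia's implementation — the uniform Beta(1, 1) prior and the 95% interval bounds are assumptions:

```python
import numpy as np

def pass_at_k_credible_interval(c, n, k, samples=100_000, seed=0):
    """95% credible interval for pass@k, assuming a Beta(1, 1) prior
    over the per-attempt success rate p with c successes out of n."""
    rng = np.random.default_rng(seed)
    p = rng.beta(1 + c, 1 + (n - c), size=samples)  # posterior samples of p
    pass_k = 1.0 - (1.0 - p) ** k                   # push each sample through pass@k
    return np.percentile(pass_k, [2.5, 97.5])

# Same 80% observed success rate, very different certainty:
lo, hi = pass_at_k_credible_interval(c=4, n=5, k=3)      # few conversations -> wide CI
lo2, hi2 = pass_at_k_credible_interval(c=80, n=100, k=3) # many conversations -> narrow CI
print(f"n=5:   [{lo:.2f}, {hi:.2f}]")
print(f"n=100: [{lo2:.2f}, {hi2:.2f}]")
```

Both runs have the same point estimate, but only the second supports a confident claim.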

Data Requirements

Each Dataset represents one complete conversation. A conversation is correct only if ALL interactions are correct:
from gaussia.core.retriever import Retriever
from gaussia.schemas.common import Dataset, Batch

class AgenticRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="conversation_001",
                assistant_id="agent_v1",
                language="english",
                context="Math calculator conversation",
                conversation=[
                    Batch(
                        qa_id="q1_interaction1",
                        query="What is 5 + 3?",
                        assistant="The result is 8.",
                        ground_truth_assistant="8",
                        agentic={
                            "tools_used": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "result": 8,
                                "step": 1
                            }],
                            "final_answer_uses_tools": True
                        },
                        ground_truth_agentic={
                            "expected_tools": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "step": 1
                            }],
                            "tool_sequence_matters": False
                        }
                    ),
                    Batch(
                        qa_id="q1_interaction2",
                        query="What is 100 / 4?",
                        assistant="100 divided by 4 is 25.",
                        ground_truth_assistant="25"
                    ),
                ],
            ),
        ]
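
The all-or-nothing rule above can be expressed directly. This is a sketch of the logic, not gaussia internals; the per-interaction judge scores are illustrative:

```python
threshold = 0.7  # answer correctness threshold, as in the run() example

# Hypothetical judge scores for the interactions of one conversation
correctness_scores = [0.85, 0.92, 0.64]

correct_indices = [i for i, s in enumerate(correctness_scores) if s >= threshold]
correct_interactions = len(correct_indices)
is_fully_correct = correct_interactions == len(correctness_scores)

print(correct_indices)   # [0, 1]
print(is_fully_correct)  # False: interaction 2 falls below the threshold
```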

Output Schema

AgenticMetric

class AgenticMetric(BaseMetric):
    session_id: str
    total_interactions: int
    correct_interactions: int
    is_fully_correct: bool
    threshold: float
    correctness_scores: list[float]
    correct_indices: list[int]
    tool_correctness_scores: list[ToolCorrectnessScore | None]
    k: int
    pass_at_k: float
    pass_at_k_ci_low: float | None    # Bayesian only
    pass_at_k_ci_high: float | None   # Bayesian only
    pass_pow_k: float
    pass_pow_k_ci_low: float | None   # Bayesian only
    pass_pow_k_ci_high: float | None  # Bayesian only

ToolCorrectnessScore

class ToolCorrectnessScore(BaseModel):
    tool_selection_correct: float   # 0-1: Correct tools chosen
    parameter_accuracy: float       # 0-1: Correct parameters passed
    sequence_correct: float         # 0-1: Correct order (if required)
    result_utilization: float       # 0-1: Tool results used in answer
    overall_correctness: float      # Weighted average
    is_correct: bool                # overall >= tool_threshold
    reasoning: str | None           # Explanation
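
How overall_correctness combines the four aspect scores can be sketched with the default 0.25 weights — the aspect scores below are illustrative, and this is a sketch of the weighting rule, not library code:

```python
tool_weights = {"selection": 0.25, "parameters": 0.25, "sequence": 0.25, "utilization": 0.25}
tool_threshold = 1.0  # default: perfect tool usage required

# Hypothetical aspect scores for one interaction
scores = {"selection": 1.0, "parameters": 1.0, "sequence": 1.0, "utilization": 0.8}

overall = sum(tool_weights[a] * scores[a] for a in tool_weights)
is_correct = overall >= tool_threshold

print(round(overall, 2))  # 0.95
print(is_correct)         # False: one imperfect aspect fails the default threshold
```

With tool_threshold=0.75, the same interaction would count as correct.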

Quality Assessment

| pass@K | pass^K | Assessment |
| --- | --- | --- |
| 0.95 | 0.70 | Reliable — High success and consistency |
| 0.95 | 0.50 | ⚠️ Inconsistent — Can succeed but unreliable |
| 0.70 | any | Needs Improvement — Low success rate |
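
The gap in the "inconsistent" row follows directly from the formulas: a moderate per-attempt success rate still yields a high pass@K, while pass^K exposes the unreliability. A quick check with an illustrative rate:

```python
p, k = 0.80, 3
pass_at_3 = 1 - (1 - p) ** k   # high: at least one of 3 attempts usually succeeds
pass_pow_3 = p ** k            # low: all 3 succeeding together is much rarer
print(round(pass_at_3, 3), round(pass_pow_3, 3))  # 0.992 0.512
```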

Custom Tool Weights

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    tool_weights={
        "selection": 0.4,
        "parameters": 0.2,
        "sequence": 0.1,
        "utilization": 0.3,
    },
)

Best Practices

If you have fewer than 30 conversations, frequentist pass@K estimates can be misleading. Bayesian mode shows you the credible interval, making it clear when more data is needed before drawing conclusions.
Choosing K:
  • K=1: Evaluate single conversation success rate
  • K=3–5: Balance between reliability and cost (recommended)
  • K=10+: High-stakes scenarios requiring high confidence
Choosing the correctness threshold:
  • Strict (0.8–0.9): Factual accuracy matters (medical, legal)
  • Moderate (0.7): General purpose — recommended default
  • Lenient (0.6): Creative or subjective tasks
Provide complete ground_truth_agentic per interaction with expected tool names, required parameters, whether sequence matters, and whether tool results should influence the final answer.

Troubleshooting

Low correctness scores: Lower the threshold parameter (try 0.6–0.65), use a more capable judge model, or ensure ground truth is clear and unambiguous. Check verbose logs to see judge reasoning.
Tool correctness always failing: The default tool_threshold=1.0 requires perfect tool correctness. Lower it with tool_threshold=0.75 to allow minor deviations. Verify tool names match exactly (case-sensitive) and check parameter structure.
Wide Bayesian credible intervals: A wide CI means there is not enough data to estimate the true success rate precisely. This is intentional — collect more test conversations to narrow the interval.

Next Steps

Statistical Modes

Deep dive into Frequentist vs Bayesian approaches

BestOf Metric

Compare multiple agents in tournament-style evaluation

Context Metric

Evaluate context alignment