
Agentic Metric

The Agentic metric evaluates AI agent performance by measuring complete conversation correctness. A conversation is correct only if ALL its interactions are correct. It supports pluggable statistical modes — frequentist returns point estimates for pass@K, while Bayesian propagates the uncertainty in the estimated success rate through the pass@K formula to produce credible intervals.
  • Conversation Correctness: A conversation is correct only if ALL interactions are correct
  • pass@K: Probability of ≥1 correct conversation when attempting k conversations (0.0–1.0)
  • pass^K: Probability of all k conversations being correct (0.0–1.0)
  • Tool Correctness: Evaluates tool selection, parameter accuracy, execution sequence, and result utilization per interaction
pass@k = 1 - (1 - p)^k   # Probability of ≥1 correct in k independent attempts
pass^k = p^k              # Probability of all k attempts correct

Where p = estimated success rate from evaluation
  • Frequentist: p = c/n — a point estimate
  • Bayesian: p is a Beta-Binomial posterior distribution — the pass@K formula is applied across all posterior samples, yielding a credible interval for pass@K and pass^K
k is a required parameter. pass@K and pass^K are computed per conversation using n = total_interactions and c = correct_interactions. The default tool_threshold=1.0 requires perfect tool usage — lower it (e.g. 0.75) to allow minor deviations.
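
The two formulas above can be sketched as plain Python — these are the generic pass@K/pass^K definitions, not gaussia internals:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one correct conversation in k independent attempts."""
    return 1.0 - (1.0 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts are correct."""
    return p ** k

# Per-conversation estimate: c correct interactions out of n total
n, c, k = 10, 7, 3
p = c / n
print(round(pass_at_k(p, k), 3))   # 0.973
print(round(pass_pow_k(p, k), 3))  # 0.343
```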

Installation

uv add gaussia
uv add langchain-openai  # Or your preferred LLM provider

Basic Usage

from gaussia.metrics.agentic import Agentic
from langchain_openai import ChatOpenAI
from your_retriever import AgenticRetriever

judge_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    threshold=0.7,
    verbose=True,
)

for metric in metrics:
    print(f"{metric.session_id}:")
    print(f"  pass@{metric.k} = {metric.pass_at_k:.3f}")
    print(f"  pass^{metric.k} = {metric.pass_pow_k:.3f}")

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| retriever | Type[Retriever] | Data source class — each Dataset = 1 conversation |
| model | BaseChatModel | LangChain-compatible model for LLM-as-judge evaluation |
| k | int | Number of independent attempts for pass@K/pass^K computation |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| threshold | float | 0.7 | Answer correctness threshold (0.0–1.0) |
| tool_threshold | float | 1.0 | Tool correctness threshold (0.0–1.0) |
| tool_weights | dict[str, float] | 0.25 each | Weights for tool aspects (selection, parameters, sequence, utilization) |
| use_structured_output | bool | True | Use LangChain structured output |
| verbose | bool | False | Enable verbose logging |

Statistical Modes

Frequentist mode (the default) computes p = c/n as a point estimate and plugs it directly into the pass@K formulas. It is simple and fast.
# With 7 correct out of 10 interactions, k=3:
# p = 7/10 = 0.70
# pass@3 = 1 - (1 - 0.70)^3 = 0.973
# pass^3 = 0.70^3 = 0.343
In frequentist mode the interval fields pass_at_k_ci_low, pass_at_k_ci_high, pass_pow_k_ci_low, and pass_pow_k_ci_high are all None.
Why Bayesian matters for agentic evaluation: A pass@3 of 0.90 sounds great — but if it comes from only 5 conversations, the 95% CI might be [0.55, 0.99]. With 100 conversations, the same rate gives [0.84, 0.95], which is much more trustworthy. Use Bayesian mode when you have few test conversations and need to communicate reliability honestly.
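
The effect described above can be reproduced with a minimal NumPy sketch of the Bayesian idea: sample the posterior over the success rate and push each sample through the pass@K formula. This is an illustration only, not gaussia's implementation — the uniform Beta(1, 1) prior and the 95% interval bounds are assumptions:

```python
import numpy as np

def pass_at_k_credible_interval(c, n, k, samples=100_000, seed=0):
    """95% credible interval for pass@k, assuming a Beta(1, 1) prior
    over the per-attempt success rate p with c successes out of n."""
    rng = np.random.default_rng(seed)
    p = rng.beta(1 + c, 1 + (n - c), size=samples)  # posterior samples of p
    pass_k = 1.0 - (1.0 - p) ** k                   # push each sample through pass@k
    return np.percentile(pass_k, [2.5, 97.5])

# Same 80% observed success rate, very different certainty:
lo, hi = pass_at_k_credible_interval(c=4, n=5, k=3)      # few conversations -> wide CI
lo2, hi2 = pass_at_k_credible_interval(c=80, n=100, k=3) # many conversations -> narrow CI
print(f"n=5:   [{lo:.2f}, {hi:.2f}]")
print(f"n=100: [{lo2:.2f}, {hi2:.2f}]")
```

Both runs have the same point estimate, but only the second supports a confident claim.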

Data Requirements

Each Dataset represents one complete conversation. A conversation is correct only if ALL interactions are correct:
from gaussia.core.retriever import Retriever
from gaussia.schemas.common import Dataset, Batch

class AgenticRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="conversation_001",
                assistant_id="agent_v1",
                language="english",
                context="Math calculator conversation",
                conversation=[
                    Batch(
                        qa_id="q1_interaction1",
                        query="What is 5 + 3?",
                        assistant="The result is 8.",
                        ground_truth_assistant="8",
                        agentic={
                            "tools_used": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "result": 8,
                                "step": 1
                            }],
                            "final_answer_uses_tools": True
                        },
                        ground_truth_agentic={
                            "expected_tools": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "step": 1
                            }],
                            "tool_sequence_matters": False
                        }
                    ),
                    Batch(
                        qa_id="q1_interaction2",
                        query="What is 100 / 4?",
                        assistant="100 divided by 4 is 25.",
                        ground_truth_assistant="25"
                    ),
                ],
            ),
        ]
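
The all-or-nothing rule above can be expressed directly. This is a sketch of the logic, not gaussia internals; the per-interaction judge scores are illustrative:

```python
threshold = 0.7  # answer correctness threshold, as in the run() example

# Hypothetical judge scores for the interactions of one conversation
correctness_scores = [0.85, 0.92, 0.64]

correct_indices = [i for i, s in enumerate(correctness_scores) if s >= threshold]
correct_interactions = len(correct_indices)
is_fully_correct = correct_interactions == len(correctness_scores)

print(correct_indices)   # [0, 1]
print(is_fully_correct)  # False: interaction 2 falls below the threshold
```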

Output Schema

AgenticMetric

class AgenticMetric(BaseMetric):
    session_id: str
    total_interactions: int
    correct_interactions: int
    is_fully_correct: bool
    threshold: float
    correctness_scores: list[float]
    correct_indices: list[int]
    tool_correctness_scores: list[ToolCorrectnessScore | None]
    k: int
    pass_at_k: float
    pass_at_k_ci_low: float | None    # Bayesian only
    pass_at_k_ci_high: float | None   # Bayesian only
    pass_pow_k: float
    pass_pow_k_ci_low: float | None   # Bayesian only
    pass_pow_k_ci_high: float | None  # Bayesian only

ToolCorrectnessScore

class ToolCorrectnessScore(BaseModel):
    tool_selection_correct: float   # 0-1: Correct tools chosen
    parameter_accuracy: float       # 0-1: Correct parameters passed
    sequence_correct: float         # 0-1: Correct order (if required)
    result_utilization: float       # 0-1: Tool results used in answer
    overall_correctness: float      # Weighted average
    is_correct: bool                # overall >= tool_threshold
    reasoning: str | None           # Explanation
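
How overall_correctness combines the four aspect scores can be sketched with the default 0.25 weights — the aspect scores below are illustrative, and this is a sketch of the weighting rule, not library code:

```python
tool_weights = {"selection": 0.25, "parameters": 0.25, "sequence": 0.25, "utilization": 0.25}
tool_threshold = 1.0  # default: perfect tool usage required

# Hypothetical aspect scores for one interaction
scores = {"selection": 1.0, "parameters": 1.0, "sequence": 1.0, "utilization": 0.8}

overall = sum(tool_weights[a] * scores[a] for a in tool_weights)
is_correct = overall >= tool_threshold

print(round(overall, 2))  # 0.95
print(is_correct)         # False: one imperfect aspect fails the default threshold
```

With tool_threshold=0.75, the same interaction would count as correct.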

Quality Assessment

| pass@K | pass^K | Assessment |
| --- | --- | --- |
| 0.95 | 0.70 | Reliable — High success and consistency |
| 0.95 | 0.50 | ⚠️ Inconsistent — Can succeed but unreliable |
| 0.70 | any | Needs Improvement — Low success rate |
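
The gap in the "inconsistent" row follows directly from the formulas: a moderate per-attempt success rate still yields a high pass@K, while pass^K exposes the unreliability. A quick check with an illustrative rate:

```python
p, k = 0.80, 3
pass_at_3 = 1 - (1 - p) ** k   # high: at least one of 3 attempts usually succeeds
pass_pow_3 = p ** k            # low: all 3 succeeding together is much rarer
print(round(pass_at_3, 3), round(pass_pow_3, 3))  # 0.992 0.512
```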

Custom Tool Weights

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    tool_weights={
        "selection": 0.4,
        "parameters": 0.2,
        "sequence": 0.1,
        "utilization": 0.3,
    },
)

Best Practices

If you have fewer than 30 conversations, frequentist pass@K estimates can be misleading. Bayesian mode shows you the credible interval, making it clear when more data is needed before drawing conclusions.
Choosing K:
  • K=1: Evaluate single conversation success rate
  • K=3–5: Balance between reliability and cost (recommended)
  • K=10+: High-stakes scenarios requiring high confidence
Choosing the correctness threshold:
  • Strict (0.8–0.9): Factual accuracy matters (medical, legal)
  • Moderate (0.7): General purpose — recommended default
  • Lenient (0.6): Creative or subjective tasks
Provide complete ground_truth_agentic per interaction with expected tool names, required parameters, whether sequence matters, and whether tool results should influence the final answer.

Troubleshooting

Low correctness scores: Lower the threshold parameter (try 0.6–0.65), use a more capable judge model, or ensure ground truth is clear and unambiguous. Check verbose logs to see judge reasoning.
Tool correctness always failing: The default tool_threshold=1.0 requires perfect tool correctness. Lower it with tool_threshold=0.75 to allow minor deviations. Verify tool names match exactly (case-sensitive) and check parameter structure.
Wide Bayesian credible intervals: A wide CI means there is not enough data to estimate the true success rate precisely. This is intentional — collect more test conversations to narrow the interval.

Next Steps

Statistical Modes

Deep dive into Frequentist vs Bayesian approaches

BestOf Metric

Compare multiple agents in tournament-style evaluation

Context Metric

Evaluate context alignment