Overview

The BestOf metric implements a king-of-the-hill tournament to compare multiple AI assistants. The first assistant becomes the initial King, and each subsequent assistant challenges the current King in a head-to-head LLM-judged comparison.

How it works

  • N-1 comparisons for N assistants (not a full bracket)
  • Order-dependent: The first assistant starts as King and defends
  • Requires at least 2 assistants per block
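The tournament flow above can be sketched in a few lines. This is an illustrative simulation, not Gaussia's implementation: the `judge` callable stands in for the LLM comparison (here it just prefers the longer answer so the example runs offline).

```python
def king_of_the_hill(answers: dict[str, str], judge=None) -> tuple[str, list[dict]]:
    """Return the winning assistant_id and a record of each contest."""
    if judge is None:
        # Toy judge: prefer the longer answer. A real judge would be an LLM call.
        judge = lambda left, right: "left" if len(left) >= len(right) else "right"
    ids = list(answers)
    if len(ids) < 2:
        raise ValueError("need at least 2 assistants per block")
    king = ids[0]  # the first assistant starts as King
    contests = []
    for rnd, challenger in enumerate(ids[1:], start=1):  # N-1 comparisons total
        verdict = judge(answers[king], answers[challenger])
        winner = king if verdict == "left" else challenger
        contests.append({"round": rnd, "left_id": king,
                         "right_id": challenger, "winner_id": winner})
        king = winner  # the winner defends in the next round

    return king, contests

winner, contests = king_of_the_hill(
    {"a": "short", "b": "a longer answer", "c": "mid"}
)
```

Note the order dependence: swapping the first two assistants changes who defends first, which can change intermediate rounds even when the final winner is the same.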

Usage

from langchain_openai import ChatOpenAI
from gaussia.metrics.best_of import BestOf

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

results = BestOf.run(
    MyRetriever,
    model=model,
    criteria="helpfulness",
)

for r in results:
    print(f"Winner: {r.bestof_winner_id}")
    for contest in r.bestof_contests:
        print(f"  Round {contest.round}: {contest.left_id} vs {contest.right_id} → {contest.winner_id}")

Your Retriever must return multiple Dataset entries with the same qa_id values but different assistant_id values. Each assistant's response to the same questions will be compared.
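To make the expected shape concrete, here is a hypothetical illustration: the field names (qa_id, assistant_id) mirror this page, but the plain-dict layout is an assumption, not Gaussia's actual Dataset type.

```python
from collections import defaultdict

# Two assistants answering the same two questions: qa_id repeats,
# assistant_id differs.
entries = [
    {"qa_id": "q1", "assistant_id": "assistant_a", "answer": "Paris"},
    {"qa_id": "q1", "assistant_id": "assistant_b", "answer": "The capital is Paris."},
    {"qa_id": "q2", "assistant_id": "assistant_a", "answer": "4"},
    {"qa_id": "q2", "assistant_id": "assistant_b", "answer": "2 + 2 = 4"},
]

# Group by qa_id: every question should have one answer per assistant.
by_question = defaultdict(dict)
for e in entries:
    by_question[e["qa_id"]][e["assistant_id"]] = e["answer"]

# Each block needs at least 2 assistants to hold a tournament.
assert all(len(answers) >= 2 for answers in by_question.values())
```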

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| retriever | type[Retriever] | required | Retriever class |
| model | BaseChatModel | required | LangChain model for judging |
| criteria | str | "BestOf" | Label describing the evaluation criteria |
| use_structured_output | bool | False | Use structured output |
| strict | bool | True | Strict schema validation |

Output schema

BestOfMetric

| Field | Type | Description |
| --- | --- | --- |
| session_id | str | Always "bestof" |
| qa_id | str | Interaction identifier or "batch_len_N" |
| assistant_id | str | Final winner's assistant ID |
| bestof_winner_id | str | The winning assistant |
| bestof_contests | list[BestOfContest] | All match records |

BestOfContest

| Field | Type | Description |
| --- | --- | --- |
| round | int | Round number |
| left_id | str | Current King's assistant ID |
| right_id | str | Challenger's assistant ID |
| winner_id | str | Winner or "tie" |
| confidence | float \| None | Judge's confidence |
| verdict | str \| None | Judge's verdict |
| reasoning | str \| None | Judge's reasoning |
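As a rough sketch of post-processing these records, the snippet below uses plain dicts in place of the BestOfContest model. The field names come from the schema table above; the data values are made up for illustration.

```python
# Two hypothetical contest records following the BestOfContest schema.
contests = [
    {"round": 1, "left_id": "a", "right_id": "b", "winner_id": "b",
     "confidence": 0.9, "verdict": "B is more helpful", "reasoning": "..."},
    {"round": 2, "left_id": "b", "right_id": "c", "winner_id": "tie",
     "confidence": None, "verdict": None, "reasoning": None},
]

# Count ties and rounds where the sitting King successfully defended.
ties = sum(1 for c in contests if c["winner_id"] == "tie")
defended = sum(1 for c in contests if c["winner_id"] == c["left_id"])
```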