Overview

The BestOf metric implements a king-of-the-hill tournament to compare multiple AI assistants. The first assistant becomes the initial King, and each subsequent assistant challenges the current King in a head-to-head LLM-judged comparison.

How it works

  • N-1 comparisons for N assistants (not a full bracket)
  • Order-dependent: The first assistant starts as King and defends
  • Requires at least 2 assistants per block
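The tournament flow above can be sketched in a few lines. This is an illustrative simulation, not Gaussia's implementation: the `judge` callable stands in for the LLM comparison (here it just prefers the longer answer so the example runs offline).

```python
def king_of_the_hill(answers: dict[str, str], judge=None) -> tuple[str, list[dict]]:
    """Return the winning assistant_id and a record of each contest."""
    if judge is None:
        # Toy judge: prefer the longer answer. A real judge would be an LLM call.
        judge = lambda left, right: "left" if len(left) >= len(right) else "right"
    ids = list(answers)
    if len(ids) < 2:
        raise ValueError("need at least 2 assistants per block")
    king = ids[0]  # the first assistant starts as King
    contests = []
    for rnd, challenger in enumerate(ids[1:], start=1):  # N-1 comparisons total
        verdict = judge(answers[king], answers[challenger])
        winner = king if verdict == "left" else challenger
        contests.append({"round": rnd, "left_id": king,
                         "right_id": challenger, "winner_id": winner})
        king = winner  # the winner defends in the next round

    return king, contests

winner, contests = king_of_the_hill(
    {"a": "short", "b": "a longer answer", "c": "mid"}
)
```

Note the order dependence: swapping the first two assistants changes who defends first, which can change intermediate rounds even when the final winner is the same.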

Usage

from langchain_openai import ChatOpenAI
from gaussia.metrics.best_of import BestOf

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

results = BestOf.run(
    MyRetriever,
    model=model,
    criteria="helpfulness",
)

for r in results:
    print(f"Winner: {r.bestof_winner_id}")
    for contest in r.bestof_contests:
        print(f"  Round {contest.round}: {contest.left_id} vs {contest.right_id} → {contest.winner_id}")

Your Retriever must return multiple Dataset entries with the same qa_id values but different assistant_id values. Each assistant's response to the same questions will be compared.
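To make the expected shape concrete, here is a hypothetical illustration: the field names (qa_id, assistant_id) mirror this page, but the plain-dict layout is an assumption, not Gaussia's actual Dataset type.

```python
from collections import defaultdict

# Two assistants answering the same two questions: qa_id repeats,
# assistant_id differs.
entries = [
    {"qa_id": "q1", "assistant_id": "assistant_a", "answer": "Paris"},
    {"qa_id": "q1", "assistant_id": "assistant_b", "answer": "The capital is Paris."},
    {"qa_id": "q2", "assistant_id": "assistant_a", "answer": "4"},
    {"qa_id": "q2", "assistant_id": "assistant_b", "answer": "2 + 2 = 4"},
]

# Group by qa_id: every question should have one answer per assistant.
by_question = defaultdict(dict)
for e in entries:
    by_question[e["qa_id"]][e["assistant_id"]] = e["answer"]

# Each block needs at least 2 assistants to hold a tournament.
assert all(len(answers) >= 2 for answers in by_question.values())
```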

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| retriever | type[Retriever] | required | Retriever class |
| model | BaseChatModel | required | LangChain model for judging |
| criteria | str | "BestOf" | Label describing the evaluation criteria |
| use_structured_output | bool | False | Use structured output |
| strict | bool | True | Strict schema validation |

Output schema

BestOfMetric

| Field | Type | Description |
| --- | --- | --- |
| session_id | str | Always "bestof" |
| qa_id | str | Interaction identifier or "batch_len_N" |
| assistant_id | str | Final winner's assistant ID |
| bestof_winner_id | str | The winning assistant |
| bestof_contests | list[BestOfContest] | All match records |

BestOfContest

| Field | Type | Description |
| --- | --- | --- |
| round | int | Round number |
| left_id | str | Current King's assistant ID |
| right_id | str | Challenger's assistant ID |
| winner_id | str | Winner or "tie" |
| confidence | float \| None | Judge's confidence |
| verdict | str \| None | Judge's verdict |
| reasoning | str \| None | Judge's reasoning |
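As a rough sketch of post-processing these records, the snippet below uses plain dicts in place of the BestOfContest model. The field names come from the schema table above; the data values are made up for illustration.

```python
# Two hypothetical contest records following the BestOfContest schema.
contests = [
    {"round": 1, "left_id": "a", "right_id": "b", "winner_id": "b",
     "confidence": 0.9, "verdict": "B is more helpful", "reasoning": "..."},
    {"round": 2, "left_id": "b", "right_id": "c", "winner_id": "tie",
     "confidence": None, "verdict": None, "reasoning": None},
]

# Count ties and rounds where the sitting King successfully defended.
ties = sum(1 for c in contests if c["winner_id"] == "tie")
defended = sum(1 for c in contests if c["winner_id"] == c["left_id"])
```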