## Overview
Several Gaussia metrics (Context, Conversational, BestOf, Agentic) use an LLM-as-a-Judge pattern to evaluate AI responses. The Judge class handles prompt rendering, model invocation, and response parsing.
## How it works

The Judge supports two evaluation modes:

| Mode | How it works | Best for |
|---|---|---|
| Structured output | Uses LangChain's `create_agent` with `response_format` for schema-validated responses | Models that support structured outputs (GPT-4o, Gemini) |
| Regex extraction | Embeds the JSON schema in the prompt and extracts the response from a markdown code block | Any model, including open-source models |
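The regex-extraction mode can be sketched roughly as follows. This is a simplified illustration, not the actual Judge implementation; the `extract_json` helper is hypothetical, and its marker defaults mirror the `bos_json_clause`/`eos_json_clause` parameters described below.

````python
import json
import re

def extract_json(text: str, bos: str = "```json", eos: str = "```") -> dict:
    """Extract the first JSON object found between the given markers.

    Simplified sketch of regex-mode extraction: the judge prompt asks
    the model to answer inside a markdown code block, and the response
    is parsed back out with a regex.
    """
    pattern = re.escape(bos) + r"(.*?)" + re.escape(eos)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON block found in model response")
    return json.loads(match.group(1))

response = 'The context is relevant.\n```json\n{"score": 0.9, "reason": "on-topic"}\n```'
print(extract_json(response))  # {'score': 0.9, 'reason': 'on-topic'}
````

Because this mode only needs the model to emit a fenced JSON block, it works with models that have no native structured-output support.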
## Configuration

You configure the judge through the metric's constructor parameters:

````python
from langchain_openai import ChatOpenAI

from gaussia.metrics.context import Context

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Structured output mode (recommended for supported models)
results = Context.run(
    MyRetriever,
    model=model,
    use_structured_output=True,
    strict=True,
)

# Regex extraction mode (works with any model)
results = Context.run(
    MyRetriever,
    model=model,
    use_structured_output=False,
    bos_json_clause="```json",
    eos_json_clause="```",
)
````
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `model` | required | Any LangChain `BaseChatModel` instance |
| `use_structured_output` | `False` | Use schema-validated structured output |
| `strict` | `True` | Enforce strict schema validation |
| `bos_json_clause` | `` ```json `` | Opening marker for JSON extraction (regex mode only) |
| `eos_json_clause` | `` ``` `` | Closing marker for JSON extraction (regex mode only) |
## Compatible models

The Judge works with any LangChain-compatible chat model:

```python
# OpenAI
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o-mini")

# Anthropic
from langchain_anthropic import ChatAnthropic
model = ChatAnthropic(model="claude-sonnet-4-20250514")

# Groq
from langchain_groq import ChatGroq
model = ChatGroq(model="llama-3.3-70b-versatile")

# Google
from langchain_google_genai import ChatGoogleGenerativeAI
model = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
```
When available, the Judge automatically extracts reasoning content from the model's response. This is supported by models that expose chain-of-thought reasoning (e.g., OpenAI's reasoning models, Anthropic's extended thinking). The reasoning is returned as the first element of the tuple from `judge.check()` and is used internally for logging and debugging.
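One way such reasoning extraction might be sketched is shown below. The `split_reasoning` helper is hypothetical (the Judge's internal logic may differ); the block shapes follow Anthropic's extended-thinking content format, where a message's content is a list of `{"type": "thinking", ...}` and `{"type": "text", ...}` blocks.

```python
def split_reasoning(content) -> tuple[str, str]:
    """Separate reasoning blocks from answer text in a message's content.

    Assumes provider-style content blocks such as
    {"type": "thinking", "thinking": "..."} alongside
    {"type": "text", "text": "..."}; a plain string has no reasoning.
    """
    if isinstance(content, str):
        return "", content
    reasoning, answer = [], []
    for block in content:
        if block.get("type") == "thinking":
            reasoning.append(block.get("thinking", ""))
        else:
            answer.append(block.get("text", ""))
    return "\n".join(reasoning), "\n".join(answer)

blocks = [
    {"type": "thinking", "thinking": "The answer cites the context."},
    {"type": "text", "text": '{"score": 1}'},
]
print(split_reasoning(blocks))  # ('The answer cites the context.', '{"score": 1}')
```

For models that return plain string content, the reasoning element is simply empty.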
For best results with `use_structured_output=True`, use models that natively support structured outputs, such as GPT-4o or Gemini. For open-source models, regex extraction (`use_structured_output=False`) is more reliable.