LogprobEvaluator is an opt-in judge for the GEPA optimizer. Instead of asking the model to verbalize a numeric score, which is coarse and tends to snap to round numbers, it asks a Yes/No question and reads the model’s token log probabilities. The probability mass on “Yes” versus “No” becomes a continuous, graded score.
The payoff shows on uncertain judgments. A verbalized score jumps from 0.7 to 0.9 with nothing in between; a logprob is continuous, so borderline answers get borderline scores. This is a port of pygaussia’s guardian scoring.
Use it
Construct a LogprobEvaluator and pass its evaluate function as the optimizer’s evaluator. evaluate is a bound field, so you can pass it around freely.
Options
The judge model. Token log probabilities are requested from it.
The natural-language criterion the answer is judged against.
Token representing a passing judgment. Compared case-insensitively.
Token representing a failing judgment. Compared case-insensitively.
Where the signal comes from
The score is derived from raw token log probabilities, which any consumer can request directly from a LanguageModel. Pass a number for logprobs to get top-K alternatives per token; passing true returns the chosen tokens with no alternatives.
Math.exp(logprob) converts a log probability into a probability. The evaluator reads the probability on the positive token versus the negative token to produce its score.
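As a rough illustration of that conversion, the sketch below turns a list of top-K alternatives into a score between 0 and 1. It is not the library's implementation: the TokenAlternative shape and the yesNoScore helper are hypothetical names, normalizing as P(yes) / (P(yes) + P(no)) is an assumption about how the two probabilities are combined, and trimming whitespace from tokens is an added convenience on top of the case-insensitive comparison.

```typescript
// Hypothetical shape for one top-K alternative; the real TokenLogprob type may differ.
interface TokenAlternative {
  token: string;
  logprob: number;
}

// Assumed scoring rule: P(positive) / (P(positive) + P(negative)).
// Returns null when either token is missing, so the caller can decide what to do.
function yesNoScore(
  alternatives: TokenAlternative[],
  positive = "Yes",
  negative = "No",
): number | null {
  const norm = (s: string) => s.trim().toLowerCase();
  let pYes = 0;
  let pNo = 0;
  let sawYes = false;
  let sawNo = false;

  for (const alt of alternatives) {
    const t = norm(alt.token);
    if (t === norm(positive)) {
      pYes += Math.exp(alt.logprob); // log probability -> probability
      sawYes = true;
    } else if (t === norm(negative)) {
      pNo += Math.exp(alt.logprob);
      sawNo = true;
    }
  }

  const total = pYes + pNo;
  if (!sawYes || !sawNo || total === 0) return null;
  return pYes / total;
}
```

With alternatives carrying probabilities 0.6 for "Yes" and 0.3 for "No", this rule gives roughly 0.67: a borderline answer lands between the extremes rather than at one of them.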
Logprob vs. structured judging
On a clear-cut answer both judges agree. On a borderline answer the logprob judge is graded where the structured judge rounds off. The partial row is where logprobs earn their keep: it is genuinely half-right, and the logprob score reflects that instead of snapping to 0 or 1.
Fallback behavior
Log probabilities are provider-dependent. When the model returns no usable logprobs, or the response does not contain both the positive and negative tokens, the evaluator transparently falls back to the structured-output LLMEvaluator so optimization continues uninterrupted.
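The fallback condition can be pictured as a small predicate. This is a sketch under assumptions, not the evaluator's actual code: TokenAlternative and hasUsableSignal are hypothetical names, and trimming whitespace before comparing tokens is an added assumption.

```typescript
// Hypothetical shape for one top-K alternative; the real TokenLogprob type may differ.
interface TokenAlternative {
  token: string;
  logprob: number;
}

// True only when logprobs were returned AND both judgment tokens appear among them.
// A false result corresponds to the case where a structured judge would take over.
function hasUsableSignal(
  alternatives: TokenAlternative[] | undefined,
  positive = "Yes",
  negative = "No",
): boolean {
  if (!alternatives || alternatives.length === 0) return false;
  const seen = new Set(alternatives.map((a) => a.token.trim().toLowerCase()));
  return seen.has(positive.trim().toLowerCase()) && seen.has(negative.trim().toLowerCase());
}
```

A provider that omits logprobs yields undefined here, and a response whose alternatives include only one of the two tokens yields false; either way the check fails and the structured path is the one to use.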
To benefit from logprob scoring, use a provider that returns token log probabilities (for example, OpenAI). The ai-sdk adapter maps provider logprobs into the vendor-neutral TokenLogprob[] shape.
When to use it
Logprob judge
Subjective criteria where confidence is genuinely graded, and your provider exposes logprobs.
Default LLM judge
A simple structured-output judge. Omit evaluator, or use LLMEvaluator with a rubric.
Deterministic evaluator
Scoring is checkable in code and you want it cheap and reproducible.