run returns those scores. By the end you will see exactly where the model does its work.
Prerequisites
- The SDK and its peers installed. See Installation.
- The ai peer and a provider package such as @ai-sdk/openai, with provider credentials configured. The evaluation below calls a real model.
1. Load a dataset with a retriever
A retriever returns Dataset[]. Build datasets and batches through their schemas so they are validated. Here one session holds two questions, one answered well and one answered poorly, so the scores vary.
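As a rough illustration of the shape described here, the sketch below builds one session with two questions and returns it from a retriever. It does not import the SDK: the Batch and Dataset interfaces, their field names, the makeBatch and makeDataset helpers, and the loadDataset method are all assumptions standing in for the SDK's real schemas and retriever contract, which may differ.

```typescript
// Stand-in shapes for illustration only; the SDK's real schemas may differ.
interface Batch {
  qaId: string;
  query: string;
  assistant: string; // the answer being evaluated
  groundTruth: string; // the reference answer
}

interface Dataset {
  sessionId: string;
  conversation: Batch[];
}

// Minimal validation in place of the SDK's schemas: reject empty strings.
function nonEmpty(value: string, field: string): string {
  if (value.trim() === "") {
    throw new Error(field + " must not be empty");
  }
  return value;
}

// Hypothetical helper: build a batch only if every field passes validation.
function makeBatch(input: Batch): Batch {
  return {
    qaId: nonEmpty(input.qaId, "qaId"),
    query: nonEmpty(input.query, "query"),
    assistant: nonEmpty(input.assistant, "assistant"),
    groundTruth: nonEmpty(input.groundTruth, "groundTruth"),
  };
}

// Hypothetical helper: a session needs an id and at least one batch.
function makeDataset(sessionId: string, conversation: Batch[]): Dataset {
  if (conversation.length === 0) {
    throw new Error("conversation must not be empty");
  }
  return { sessionId: nonEmpty(sessionId, "sessionId"), conversation };
}

// A retriever that returns Dataset[]: one session, two questions.
class StaticRetriever {
  loadDataset(): Dataset[] {
    return [
      makeDataset("session-1", [
        makeBatch({
          qaId: "q1",
          query: "What is the capital of France?",
          assistant: "Paris",
          groundTruth: "Paris",
        }),
        makeBatch({
          qaId: "q2",
          query: "What is the capital of Germany?",
          assistant: "Lyon", // deliberately wrong so the scores vary
          groundTruth: "Berlin",
        }),
      ]),
    ];
  }
}

const datasets = new StaticRetriever().loadDataset();
```

Routing every record through a constructor function means a malformed entry fails at load time instead of surfacing later as a confusing score.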
2. Wire a model
Evaluators reach a language model through an adapter. Wrap a Vercel AI SDK model so it satisfies the LanguageModel contract. This model is what the evaluator calls in the next step.
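The adapter idea can be sketched without the SDK. In the block below, the LanguageModel interface with a single generate method and the wrapModel helper are assumptions made for illustration; the SDK's actual contract and adapter may look different. The GenerateFn type is modeled on the Vercel AI SDK's generateText call, which resolves to an object with a text field, and a canned generator is used so the sketch runs offline.

```typescript
// Hypothetical contract; the SDK's LanguageModel interface may expose
// different methods.
interface LanguageModel {
  generate(prompt: string): Promise<string>;
}

// Shape of a text-generation call: takes a prompt, resolves to { text }.
type GenerateFn = (args: { prompt: string }) => Promise<{ text: string }>;

// Hypothetical adapter: wrap any generator so it satisfies LanguageModel.
function wrapModel(generate: GenerateFn): LanguageModel {
  return {
    async generate(prompt: string): Promise<string> {
      const result = await generate({ prompt });
      return result.text.trim();
    },
  };
}

// With the real packages installed, the generator would look roughly like:
//   import { generateText } from "ai";
//   import { openai } from "@ai-sdk/openai";
//   const model = wrapModel(({ prompt }) =>
//     generateText({ model: openai("gpt-4o-mini"), prompt }));
// A canned generator keeps this sketch runnable without credentials.
const model = wrapModel(async ({ prompt }) => ({
  text: " echo: " + prompt + " ",
}));
```

Keeping the contract this narrow means the evaluator depends only on a prompt-in, text-out function, so a test double can replace the provider.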
3. Define an evaluator that uses the model
Subclass Gaussia and implement batch. This method is where the measuring happens: the base class calls it once per session, and you decide what to compute. Here you ask the model to score each answer against its ground truth and push the score to this.metrics.
The evaluator uses the model constant from step 2. The base class does not inject one, so your evaluator decides whether and how to use a model. A purely deterministic evaluator (string match, regex) needs no model at all.
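A minimal sketch of this pattern follows. It is not the SDK's code: EvaluatorBase is a hypothetical stand-in for the Gaussia base class, the batch signature and the Metric shape are assumptions, and the judge is a canned model so the block runs offline. What it does show is the structure described above: batch reads a module-level model, asks it for a score per answer, and pushes the result to this.metrics.

```typescript
interface Batch {
  query: string;
  assistant: string;
  groundTruth: string;
}

interface LanguageModel {
  generate(prompt: string): Promise<string>;
}

interface Metric {
  sessionId: string;
  query: string;
  score: number;
}

// Hypothetical stand-in for the SDK base class: it only holds the metrics.
abstract class EvaluatorBase {
  metrics: Metric[] = [];
  abstract batch(sessionId: string, batches: Batch[]): Promise<void>;
}

// Turn a free-form reply into a score in [0, 1]; unparseable replies score 0.
function parseScore(reply: string): number {
  const match = reply.match(/\d+(\.\d+)?/);
  if (!match) return 0;
  return Math.min(1, Math.max(0, parseFloat(match[0])));
}

// Canned judge so the sketch runs offline; a real one would call a provider.
const model: LanguageModel = {
  async generate(prompt: string): Promise<string> {
    return prompt.indexOf("Answer: Paris") >= 0 ? "1.0" : "0.2";
  },
};

class JudgeEvaluator extends EvaluatorBase {
  // Called once per session; the model is read from module scope because
  // the base class does not inject one.
  async batch(sessionId: string, batches: Batch[]): Promise<void> {
    for (const item of batches) {
      const prompt =
        "Rate from 0 to 1 how well the answer matches the reference.\n" +
        "Question: " + item.query + "\n" +
        "Answer: " + item.assistant + "\n" +
        "Reference: " + item.groundTruth;
      const reply = await model.generate(prompt);
      this.metrics.push({
        sessionId,
        query: item.query,
        score: parseScore(reply),
      });
    }
  }
}
```

Clamping and defaulting inside parseScore is one possible design choice: model replies are free text, so the evaluator should not fail on a reply that contains no number.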
4. Run it and read the scores
run constructs the evaluator from your retriever class, drives every session through batch, and returns whatever you collected in this.metrics.
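The flow described here can be approximated in a few lines. The run function below is a hypothetical stand-in, not the SDK's implementation: its signature, the Retriever interface, and the Metric shape are assumptions, and the judge function is a canned substitute for a model call so the sketch runs offline.

```typescript
interface Batch {
  assistant: string;
  groundTruth: string;
}

interface Dataset {
  sessionId: string;
  conversation: Batch[];
}

interface Retriever {
  loadDataset(): Dataset[];
}

interface Metric {
  sessionId: string;
  score: number;
}

// Hypothetical stand-in for the SDK base class.
abstract class EvaluatorBase {
  metrics: Metric[] = [];
  abstract batch(sessionId: string, batches: Batch[]): Promise<void>;
}

// Hypothetical runner: build the evaluator and retriever, send every
// session through batch, then return the collected metrics.
async function run(
  Evaluator: new () => EvaluatorBase,
  RetrieverClass: new () => Retriever
): Promise<Metric[]> {
  const evaluator = new Evaluator();
  for (const dataset of new RetrieverClass().loadDataset()) {
    await evaluator.batch(dataset.sessionId, dataset.conversation);
  }
  return evaluator.metrics;
}

// Canned judge standing in for a model call: 1 on a match, 0 otherwise.
async function judge(answer: string, reference: string): Promise<number> {
  const same =
    answer.trim().toLowerCase() === reference.trim().toLowerCase();
  return same ? 1 : 0;
}

class JudgeEvaluator extends EvaluatorBase {
  async batch(sessionId: string, batches: Batch[]): Promise<void> {
    for (const item of batches) {
      const score = await judge(item.assistant, item.groundTruth);
      this.metrics.push({ sessionId, score });
    }
  }
}

class TwoQuestionRetriever implements Retriever {
  loadDataset(): Dataset[] {
    return [
      {
        sessionId: "session-1",
        conversation: [
          { assistant: "Paris", groundTruth: "paris" },
          { assistant: "Lyon", groundTruth: "Berlin" },
        ],
      },
    ];
  }
}

// Resolves to one metric per answer, in dataset order.
const scores = run(JudgeEvaluator, TwoQuestionRetriever);
```

Because the runner only awaits batch and reads metrics afterwards, everything that determines a score lives in the evaluator.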
The model scored each answer inside batch, and run returned the per-answer scores. The model was the judge, not a side demo.
Next steps
- Evaluators: The full batch lifecycle, iteration levels, and weighting.
- Generators: Generate datasets from context documents instead of writing them by hand.