The @gaussia/sdk/prompt-optimizer subpath improves a system prompt against a dataset using GEPA (Generative Evolutionary Prompt Adaptation). Each iteration:
  1. runs the current prompt over the dataset (the executor),
  2. scores each answer (the evaluator),
  3. collects answers scoring below failureThreshold as failing examples,
  4. asks the model for improved candidate prompts driven by those failures,
  5. keeps the best candidate only if it strictly improves the mean score,
  6. repeats until nothing fails, nothing improves, or the iteration budget is spent.
The optimizer talks only to the LanguageModel interface — it has no idea which provider you use.
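The loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SDK's implementation: the names Example, Run, Score, Propose, evaluatePrompt, and optimizeSketch are hypothetical, and the executor and evaluator are reduced to synchronous functions to keep the control flow visible.

```typescript
// Hypothetical, simplified types; the SDK's real executor and evaluator may be async.
type Example = { query: string; expected: string };
type Run = (prompt: string, query: string) => string;
type Score = (actual: string, expected: string) => number;
type Propose = (prompt: string, failing: Example[]) => string[];

interface SketchResult {
  prompt: string;
  initialScore: number;
  finalScore: number;
  iterationsRun: number;
}

// Steps 1-3: run the prompt, score each answer, collect examples below the threshold.
function evaluatePrompt(prompt: string, data: Example[], run: Run, score: Score, failureThreshold: number) {
  const scored = data.map((ex) => ({ ex, s: score(run(prompt, ex.query), ex.expected) }));
  const mean = scored.reduce((sum, p) => sum + p.s, 0) / scored.length;
  const failing = scored.filter((p) => p.s < failureThreshold).map((p) => p.ex);
  return { mean, failing };
}

function optimizeSketch(
  seedPrompt: string,
  data: Example[],
  run: Run,
  score: Score,
  propose: Propose,
  iterations = 5,
  failureThreshold = 0.6,
): SketchResult {
  let best = seedPrompt;
  let current = evaluatePrompt(best, data, run, score, failureThreshold);
  const initialScore = current.mean;
  let iterationsRun = 0;

  for (let i = 0; i < iterations; i++) {
    if (current.failing.length === 0) break; // nothing fails
    iterationsRun++;
    // Step 4: ask for candidates driven by the failures, then evaluate each one.
    const evaluated = propose(best, current.failing).map((prompt) => ({
      prompt,
      result: evaluatePrompt(prompt, data, run, score, failureThreshold),
    }));
    if (evaluated.length === 0) break;
    const top = evaluated.reduce((a, b) => (b.result.mean > a.result.mean ? b : a));
    // Step 5: keep the best candidate only on a strict improvement of the mean.
    if (top.result.mean <= current.mean) break; // nothing improves
    best = top.prompt;
    current = top.result;
  }
  return { prompt: best, initialScore, finalScore: current.mean, iterationsRun };
}
```

In this sketch a round in which no candidate improves still counts toward iterationsRun, while a seed prompt with no failing examples ends the run with zero rounds.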

Optimize a prompt

run takes a Retriever class, the config passed to that retriever’s constructor, and the options. A small inline retriever that returns a fixed Dataset[] is often all you need.
import type { Retriever } from "@gaussia/sdk";
import { GEPAOptimizer } from "@gaussia/sdk/prompt-optimizer";
import { Dataset } from "@gaussia/sdk/schemas";
import type { DatasetT } from "@gaussia/sdk/schemas";

class EvalSet implements Retriever {
  readonly iterationLevel = "full_dataset" as const;
  constructor(private readonly data: DatasetT[]) {}
  async loadDataset(): Promise<DatasetT[]> {
    return this.data;
  }
}

const datasets: DatasetT[] = [
  Dataset.parse({
    sessionId: "support",
    assistantId: "help-bot",
    context: "Refunds reach the original payment method within 5 business days; a refund needs the order ID.",
    conversation: [
      { qaId: "time", query: "How long does a refund take?", assistant: "", groundTruthAssistant: "Within 5 business days, to your original payment method." },
      { qaId: "need", query: "What do I need for a refund?", assistant: "", groundTruthAssistant: "Your order ID." },
    ],
  }),
];

const result = await GEPAOptimizer.run(EvalSet, datasets, {
  model, // any LanguageModel
  seedPrompt: "Help the customer.",
  objective: "Answer refund questions accurately and concisely from the context only.",
});

console.log(result.initialScore, "→", result.finalScore);
console.log(result.optimizedPrompt);

Options

  • model (LanguageModel, required): Used for candidate generation, and for the default judge when no evaluator is given.
  • seedPrompt (string, required): The starting system prompt to improve.
  • objective (string, required): What a good answer must do. Also the default judge’s criteria.
  • executor (Executor): How answers are produced. Omit to call the model directly; supply one to optimize a real pipeline.
  • evaluator (Evaluator): How answers are scored. Omit for the built-in LLM judge; supply one to grade with a rubric or with code.
  • iterations (number, default 5): Maximum optimization rounds.
  • candidatesPerIteration (number, default 3): Candidate prompts generated per round.
  • failureThreshold (number, default 0.6): Score (in [0,1]) below which an answer counts as failing and drives candidate generation.
  • concurrency (number, default 1): Parallel evaluation chains. Results are gathered in input order, so the output is identical for any value, only faster. 1 matches the reference behavior exactly.
  • candidateRetries (number, default 2): Retries a transient malformed candidate response before throwing OptimizerError.
  • onProgress ((event: ProgressEvent) => void): Called once per executed round with { iteration, bestScore, failing }.
  • logger (Logger): Optional logger for diagnostics.
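The concurrency guarantee (results gathered in input order regardless of how many chains run in parallel) can be illustrated with a small worker-pool helper. mapWithConcurrency is a hypothetical name and not an SDK export; the block is a sketch of the general technique, not the optimizer's internals.

```typescript
// Hypothetical helper: runs `task` over `items` with at most `limit` calls in flight
// and stores each result at the index of its input, so completion order does not matter.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JavaScript runs one callback at a time

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T, index);
    }
  }

  // Never start more workers than there are items, and always start at least one.
  const workers = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return results;
}
```

With limit set to 1 the helper degenerates to a plain sequential loop, which is why a concurrency of 1 is the easiest setting to compare against a reference run.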

The two seams

Two function contracts let you adapt the optimizer to your system without subclassing.

Evaluator — how answers are scored

type Evaluator = (
  actual: string,
  expected: string,
  query: string,
  context: string,
) => number | Promise<number>;
By convention an evaluator returns [0,1], but a custom evaluator owns its range. Supplying one fully replaces the default — the built-in LLM judge is never constructed.
To use the default, omit evaluator; the built-in judge then scores each answer against your objective.
const result = await GEPAOptimizer.run(EvalSet, datasets, {
  model,
  seedPrompt: "Help the customer.",
  objective: "Answer accurately and concisely from the context only.",
});
For better-calibrated scoring on subjective criteria, see the logprob evaluator.
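Grading with code instead of the built-in judge might look like the sketch below. The Evaluator type is copied from the contract above so the block stands alone; tokenize and tokenF1 are hypothetical names, and token-overlap F1 is just one possible code-based scorer, not something the SDK is known to ship.

```typescript
// Copied from the contract above so this block is self-contained.
type Evaluator = (
  actual: string,
  expected: string,
  query: string,
  context: string,
) => number | Promise<number>;

// Lowercase, then split on anything that is not an ASCII letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Hypothetical scorer: F1 over overlapping tokens, always in [0, 1].
const tokenF1: Evaluator = (actual, expected) => {
  const a = tokenize(actual);
  const e = tokenize(expected);
  // Two empty answers match; one empty answer cannot match a non-empty one.
  if (a.length === 0 || e.length === 0) return a.length === e.length ? 1 : 0;

  // Count each expected token, then consume the counts as actual tokens match.
  const remaining = new Map<string, number>();
  for (const t of e) remaining.set(t, (remaining.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of a) {
    const left = remaining.get(t) ?? 0;
    if (left > 0) {
      overlap++;
      remaining.set(t, left - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / a.length;
  const recall = overlap / e.length;
  return (2 * precision * recall) / (precision + recall);
};
```

Passing it as evaluator: tokenF1 in the options would replace the built-in judge. The query and context arguments are ignored here, which is allowed because a function with fewer parameters still satisfies the contract.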

Executor — how answers are produced

type Executor = (
  prompt: string,
  query: string,
  context: string,
) => string | Promise<string>;
Omit it to call the model directly. Supply one to optimize the prompt for your real system — a RAG chain, an agent, a tool pipeline — so the prompt is tuned end to end.
import type { Executor } from "@gaussia/sdk/prompt-optimizer";

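// Placeholder for your own retrieval step; swap in your real search or vector lookup.
declare function retrieveArticle(query: string): Promise<{ context: string }>;

// A retrieve-then-generate pipeline. It does its own retrieval, like production RAG.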
const ragPipeline: Executor = async (prompt, query) => {
  const article = await retrieveArticle(query);
  const { text } = await model.generateText({
    system: `${prompt}\n\nKnowledge base article:\n${article.context}`,
    prompt: query,
  });
  return text;
};

const result = await GEPAOptimizer.run(EvalSet, datasets, {
  model, // still used for candidate generation + the default judge
  seedPrompt: "Help the customer.",
  objective: "Answer accurately from retrieved context.",
  executor: ragPipeline,
});

Watch it run

onProgress fires once per executed round. Use it for live feedback.
await GEPAOptimizer.run(EvalSet, datasets, {
  model,
  seedPrompt: "Help the customer.",
  objective: "Answer accurately and concisely.",
  iterations: 4,
  candidatesPerIteration: 2,
  failureThreshold: 0.8,
  onProgress: ({ iteration, bestScore, failing }) =>
    console.log(`round ${iteration}: best ${bestScore.toFixed(2)} · ${failing} failing`),
});
If onProgress fires zero times or only once, GEPA converged early: either the seed already passed, or no candidate improved. Raise failureThreshold and the iteration budget when you want more rounds to observe. Output is non-deterministic against a real model.
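To check for early convergence after a run rather than by reading the console, the events can be recorded. The sketch below assumes the event shape named above, with failing taken to be a count and bestScore taken to be the running best; RoundEvent and createProgressRecorder are hypothetical names.

```typescript
// Assumed event shape, based on the fields named above; `failing` is taken to be a count.
type RoundEvent = { iteration: number; bestScore: number; failing: number };

// Hypothetical helper: returns a callback to pass as onProgress, plus two read-outs.
function createProgressRecorder() {
  const events: RoundEvent[] = [];
  return {
    // Pass this as the onProgress option.
    onProgress: (event: RoundEvent): void => {
      events.push(event);
    },
    // True when zero or one round was reported, the early-convergence case described above.
    convergedEarly: (): boolean => events.length <= 1,
    // One-line summary built from the most recent event.
    summary: (): string => {
      const last = events[events.length - 1];
      if (last === undefined) return "0 round(s)";
      return `${events.length} round(s), best ${last.bestScore.toFixed(2)}, ${last.failing} failing`;
    },
  };
}
```

After run resolves, convergedEarly() indicates whether raising failureThreshold or the iteration budget is worth trying.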

The result

run returns a validated OptimizationResult:
  • optimizedPrompt (string): The best prompt found (the seed if nothing improved).
  • initialScore (number): Mean score of the seed prompt.
  • finalScore (number): Best mean score reached.
  • iterationsRun (number): Rounds actually executed (0 if the seed passed immediately).
  • nExamples (number): Number of evaluation examples.
  • history (IterationResult[]): Per-round detail: iteration, bestPrompt, bestScore, candidates (each { prompt, score }), and failingExamples (each { query, context, expected, actual, score }).

Inspect the trajectory

console.log(`score ${result.initialScore.toFixed(2)} → ${result.finalScore.toFixed(2)} over ${result.iterationsRun} round(s)`);

for (const round of result.history) {
  console.log(`Round ${round.iteration} (best ${round.bestScore.toFixed(2)}):`);
  for (const c of round.candidates) {
    const mark = c.prompt === round.bestPrompt ? "★" : "·";
    console.log(`  ${mark} ${c.score.toFixed(2)}  ${c.prompt.slice(0, 60)}`);
  }
  console.log(`  driven by ${round.failingExamples.length} failing example(s)`);
}
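The history can also show which queries keep failing from round to round, which often says more about the dataset or the objective than about the prompt. The sketch below assumes the field names listed under "The result"; the types are declared locally so the block stands alone, and persistentFailures is a hypothetical helper, not an SDK function.

```typescript
// Assumed shapes, following the field names listed under "The result".
type FailingExample = { query: string; context: string; expected: string; actual: string; score: number };
type Round = { iteration: number; failingExamples: FailingExample[] };

// Hypothetical helper: counts how often each query appears among the failing examples
// across all rounds, and tracks the lowest score seen for it.
function persistentFailures(rounds: Round[]): Array<{ query: string; count: number; worstScore: number }> {
  const byQuery = new Map<string, { count: number; worstScore: number }>();
  for (const round of rounds) {
    for (const ex of round.failingExamples) {
      const seen = byQuery.get(ex.query);
      if (seen) {
        seen.count++;
        seen.worstScore = Math.min(seen.worstScore, ex.score);
      } else {
        byQuery.set(ex.query, { count: 1, worstScore: ex.score });
      }
    }
  }
  // Most frequent first; ties broken by the lowest score.
  return Array.from(byQuery, ([query, v]) => ({ query, count: v.count, worstScore: v.worstScore }))
    .sort((a, b) => b.count - a.count || a.worstScore - b.worstScore);
}
```

Calling persistentFailures(result.history) would then list the stubborn queries first, assuming the history entries match the shapes declared here.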

Errors and resilience

  • Candidate generation can return a malformed response. candidateRetries (default 2) retries such transient failures before run throws OptimizerError. Smaller models produce malformed responses more often; retry the run or switch to a stronger model.
  • The result schemas (OptimizationResult, IterationResult, CandidateResult, FailingExample) come from @gaussia/sdk/schemas and are validated before run returns.
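The retry behavior described above follows a common pattern: generate, parse, and try again on a malformed response until the budget is spent. The sketch below is an illustration of that pattern only. CandidateError is a hypothetical stand-in for OptimizerError, generateWithRetries is a hypothetical name, and treating retries as extra attempts (so 2 retries allow up to 3 calls) is an assumption, not a documented SDK detail.

```typescript
// Hypothetical stand-in for the SDK's OptimizerError; the real class may differ.
class CandidateError extends Error {}

// Hypothetical helper: calls `generate`, parses its output, and retries when either step throws.
// Assumption: `retries` counts extra attempts, so retries = 2 allows up to 3 calls in total.
async function generateWithRetries<T>(
  generate: () => Promise<string>,
  parse: (raw: string) => T,
  retries = 2,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return parse(await generate());
    } catch (error) {
      lastError = error; // malformed response: loop again if attempts remain
    }
  }
  throw new CandidateError(`no valid candidate after ${retries + 1} attempts: ${String(lastError)}`);
}
```

A caller that catches the final error can decide whether to rerun with a larger retry budget or a stronger model.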

Runtime

The prompt-optimizer bundle contains zero AI-SDK bytes and no Node-only built-ins, so it is fully isomorphic. Streaming retrievers are rejected.