- job submission
- provider registration
- benchmark fan-out
- runtime execution in Kubernetes
- experiment tracking integration
- dataset construction
- metric execution
- benchmark-specific validation
- benchmark results and artifacts
Gaussia ships an optional EvalHub provider adapter in
gaussia[evalhub]. The adapter reads an EvalHub job spec, builds a gaussia.Dataset, dispatches the requested metric, and reports results back to EvalHub.How the integration works
In a typical setup:- Your system submits an EvalHub evaluation job.
- The job contains one or more benchmarks with
provider_id: "gaussia". - EvalHub launches one runtime job per benchmark.
- The Gaussia provider reads the benchmark payload from
JobSpec.parameters. - The provider builds a
gaussia.Dataset. - The provider runs the requested benchmark.
- The provider reports benchmark status back to EvalHub.
- The provider returns structured results and artifacts.
- If MLflow is enabled, benchmark evidence is stored there as well.
Logical flow
Deployment view
One event to many benchmarks
The contract between EvalHub and Gaussia
The preferred contract is to put a nativegaussia.Dataset inside each benchmark and put operational identifiers in metadata:
- benchmark-local business input
- one EvalHub job with many Gaussia benchmarks
- a simple provider contract
- clean benchmark dispatch inside the provider
Build a gaussia.Dataset from your payload
The most important design step is not the EvalHub side. It is the mapping from your source payload into gaussia.Dataset.
Gaussia expects a conversational structure with:
session_idassistant_idcontextconversation: list[Batch]
human/assistant pairs from a persisted message list.
Minimal mapping example
Batch objects.
The source can be:
- a fixture
- an HTTP request
- an event bus message
- a replay from storage
- an audit pipeline
Choose benchmarks and runtime dependencies
Expose only the benchmarks your payload can actually support. A common policy is:- always include
humanity,context, andconversational - include
biasandtoxicityonly when the payload yields at least 5human/assistantpairs
humanitydoes not need an external judge modelcontextandconversationalneed a judge modelbiasneeds a guardian-style modeltoxicityneeds embeddings and clustering support
A maintainable provider pattern
Keep these concerns separate:- payload parsing and dataset construction
- benchmark dispatch
- benchmark execution
- artifact serialization
- MLflow logging, if you use it
if/elif chain.
Example dispatch pattern
- add benchmarks
- test each benchmark in isolation
- keep benchmark-specific requirements local
- avoid a monolithic adapter function
Run the built-in provider adapter
Use the packaged EvalHub adapter as the provider entrypoint:humanitycontextconversationalbiastoxicity
GAUSSIA_* runtime settings from the environment, logs benchmark-level MLflow evidence when MLFLOW_TRACKING_URI is configured, and writes OCI artifacts when EvalHub requests an OCI export.
Register the provider in EvalHub
A practical provider definition should stay focused on:- provider identity
- runtime image and command
- benchmark registration
Runtime settings
Use your platform’s usual mechanism to inject these values into the provider runtime. In Kubernetes, that usually meansenv, Secret, ConfigMap, or your own rendered provider spec.
Judge model settings
Judge model settings
These settings are required for
context and conversational.| Variable | Required | Purpose |
|---|---|---|
GAUSSIA_JUDGE_MODEL | Yes | Judge model name |
GAUSSIA_JUDGE_API_KEY | Yes | Credentials for the judge endpoint |
GAUSSIA_JUDGE_BASE_URL | No | Override for OpenAI-compatible endpoints |
GAUSSIA_JUDGE_TEMPERATURE | No | Judge temperature. Default is 0.0 |
GAUSSIA_JUDGE_USE_STRUCTURED_OUTPUT | No | Enable structured output. Default is true |
Guardian settings
Guardian settings
These settings are required for
bias.| Variable | Required | Purpose |
|---|---|---|
GAUSSIA_GUARDIAN_MODEL | Yes | Guardian model name |
GAUSSIA_GUARDIAN_API_KEY | Yes | Credentials for the guardian endpoint |
GAUSSIA_GUARDIAN_BASE_URL | No | Override for OpenAI-compatible endpoints |
GAUSSIA_GUARDIAN_TEMPERATURE | No | Guardian temperature. Default is 0.01 |
GAUSSIA_GUARDIAN_LOGPROBS | No | Enable guardian logprobs. Default is false |
Writable cache settings
Writable cache settings
These settings are strongly recommended for
bias and toxicity.| Variable | Typical value |
|---|---|
HOME | /data |
XDG_CACHE_HOME | /data/.cache |
HF_HOME | /data/.cache/huggingface |
TRANSFORMERS_CACHE | /data/.cache/huggingface/transformers |
SENTENCE_TRANSFORMERS_HOME | /data/.cache/sentence-transformers |
Toxicity tuning
Toxicity tuning
The defaults are usually enough to get started. The most common overrides are:
| Variable | Default | Purpose |
|---|---|---|
GAUSSIA_TOXICITY_EMBEDDING_MODEL | all-MiniLM-L6-v2 | Embedding model |
GAUSSIA_TOXICITY_STATISTICAL_MODE | frequentist | Frequentist or Bayesian mode |
GAUSSIA_TOXICITY_MIN_CLUSTER_SIZE | 5 | Minimum cluster size |
GAUSSIA_TOXICITY_UMAP_N_NEIGHBORS | 15 | UMAP neighborhood size |
GAUSSIA_TOXICITY_GROUP_PROTOTYPES_JSON | built-in defaults | Override protected group prototypes |
GAUSSIA_TOXICITY_BAYESIAN_MC_SAMPLES | 10000 | Bayesian sampling depth |
MLflow run logging
MLflow run logging
If your provider logs one MLflow run per benchmark, the common runtime settings are:
| Variable | Required | Purpose |
|---|---|---|
MLFLOW_TRACKING_URI | Yes | Tracking server base URL |
MLFLOW_TRACKING_TOKEN | No | Direct bearer token |
MLFLOW_TRACKING_TOKEN_PATH | No | Path-based token source |
MLFLOW_TOKEN_PATH | No | Fallback token path |
MLFLOW_WORKSPACE | No | Workspace or tenant header for workspace-aware deployments |
Add an optional bridge
You only need a bridge if your upstream system does not already submit EvalHub jobs directly. A bridge usually does this:- accepts an incoming event or payload
- normalizes the payload shape
- derives interaction count
- decides benchmark eligibility
- builds a single
JobSubmissionRequest - deduplicates with a business key such as
stream_id + control_id - submits the job to EvalHub
Example job request
context_persistance while they migrate to dataset and metadata.
Choose a result model
A robust setup looks like this:- one EvalHub evaluation job per source event
- one provider benchmark execution per benchmark
- one MLflow run per benchmark, if benchmark-level evidence matters to you
- one top-level orchestration record
- clear per-benchmark isolation
- traceable benchmark metrics
- strong auditability in MLflow
Understand how MLflow fits in
A common pattern is:- EvalHub owns experiment-level orchestration
- the Gaussia provider logs benchmark-level evidence
benchmark_idevaluation_job_idassistant_idsession_idstream_idcontrol_idagentspace_id
Production notes
Writable cache matters
If you runbias or toxicity in containers, give the pod writable cache directories for Hugging Face, Transformers, and Sentence Transformers. This matters much more in provider jobs than in a local notebook workflow.
toxicity needs more resources
toxicity is usually the heaviest benchmark because it may involve embeddings, dimensionality reduction, clustering, and group profiling. Do not size it like humanity.
Some MLflow deployments use Kubernetes identity
In some OpenShift AI and Open Data Hub deployments, MLflow is authenticated with Kubernetes or OpenShift identity instead of a standalone API token. That means your runtime may need:- a valid projected ServiceAccount token
- workspace RBAC
- a workspace header such as
X-MLFLOW-WORKSPACE
Keep benchmark eligibility deterministic
Ifbias and toxicity require enough interactions, gate them before submission. Do not submit benchmarks that you already know cannot run.
That keeps failures meaningful instead of noisy.
Next steps
Python SDK quickstart
Start with the Gaussia SDK basics before wiring EvalHub.
Gaussia architecture
Understand retrievers, datasets, and the core processing model.
Context metric
Learn how judge-based conversational evaluation works.
Toxicity metric
Review the heaviest runtime dependency in a typical EvalHub integration.