
Architecture

Gaussia follows a simple yet powerful architecture designed for extensibility and ease of use.

Overview

Data Flow

The core data flow in Gaussia is:
  1. Load data: Retriever.load_dataset() returns your conversation data as list[Dataset], Iterator[Dataset], or Iterator[StreamedBatch]
  2. Process datasets: the Gaussia base class iterates through the datasets in _process()
  3. Compute metrics: each metric's batch() implementation processes each conversation batch
  4. Collect results: results are stored in self.metrics

Gaussia Base Class

All metrics inherit from Gaussia (gaussia/core/base.py):
from abc import ABC, abstractmethod
from typing import Type
from gaussia.core.retriever import Retriever
from gaussia.schemas.common import Batch

class Gaussia(ABC):
    def __init__(self, retriever: Type[Retriever], verbose: bool = False, **kwargs):
        self.retriever = retriever(**kwargs)
        self.metrics = []
        self.verbose = verbose

    @abstractmethod
    def batch(self, session_id: str, context: str, assistant_id: str,
              batch: list[Batch], language: str | None) -> None:
        """Process a batch of conversations. Implemented by each metric."""
        pass

    @classmethod
    def run(cls, retriever: Type[Retriever], **kwargs) -> list:
        """One-shot execution: instantiate and process."""
        instance = cls(retriever, **kwargs)
        instance._process()
        return instance.metrics

Retriever

Abstract base class for data loading:
from abc import ABC, abstractmethod
from collections.abc import Iterator
from gaussia.schemas.common import Dataset, StreamedBatch

class Retriever(ABC):
    def __init__(self, **kwargs):
        pass

    @property
    def iteration_level(self) -> IterationLevel:
        return IterationLevel.FULL_DATASET  # default

    @abstractmethod
    def load_dataset(self) -> list[Dataset] | Iterator[Dataset] | Iterator[StreamedBatch]:
        """Load and return datasets for evaluation."""
        pass

Data Structures

Dataset: A complete conversation session
from pydantic import BaseModel

class Dataset(BaseModel):
    session_id: str          # Unique session identifier
    assistant_id: str        # ID of the assistant being evaluated
    language: str | None     # Language code (e.g., "english")
    context: str             # System context/instructions
    conversation: list[Batch] # List of Q&A interactions
Batch: A single Q&A interaction
class Batch(BaseModel):
    qa_id: str                          # Unique interaction ID
    query: str                          # User question
    assistant: str                      # Assistant response
    ground_truth_assistant: str | None  # Expected response
    observation: str | None             # Additional notes
    weight: float | None                # Importance weight
    agentic: dict | None                # Tool usage metadata
    ground_truth_agentic: dict | None   # Expected tool usage
    logprobs: dict | None               # Log probabilities
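A concrete payload matching these schemas might look as follows. Plain dicts are used here so the example runs without the library installed; the field names follow the schemas above, and the values are invented for illustration:

```python
# One Q&A interaction; optional fields may simply be None
batch = {
    "qa_id": "qa-001",
    "query": "What is the refund policy?",
    "assistant": "Refunds are accepted within 30 days.",
    "ground_truth_assistant": "Refunds within 30 days.",
    "observation": None,
    "weight": 1.0,              # optional importance weight
    "agentic": None,
    "ground_truth_agentic": None,
    "logprobs": None,
}

# A complete conversation session wrapping the interaction above
dataset = {
    "session_id": "sess-001",
    "assistant_id": "support-bot",
    "language": "english",
    "context": "You are a helpful support assistant.",
    "conversation": [batch],
}
```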

Metric Architecture

Each metric follows this pattern:
from gaussia.core.base import Gaussia

class MyMetric(Gaussia):
    def __init__(self, retriever, verbose=False, **kwargs):
        super().__init__(retriever, verbose, **kwargs)
        # Initialize metric-specific components

    def batch(self, session_id, context, assistant_id, batch, language):
        # Process the batch and compute metrics
        result = self._compute(batch)
        self.metrics.append(result)
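To make the pattern concrete, here is a toy metric built on a minimal stand-in base class. The ResponseLength metric is hypothetical and exists only to show where computation and collection happen:

```python
class Gaussia:  # minimal stand-in for gaussia.core.base.Gaussia
    def __init__(self, **kwargs):
        self.metrics = []

class ResponseLength(Gaussia):
    """Hypothetical metric: mean assistant-response length per session."""
    def batch(self, session_id, context, assistant_id, batch, language):
        lengths = [len(b["assistant"]) for b in batch]
        self.metrics.append({
            "session_id": session_id,
            "mean_length": sum(lengths) / len(lengths),
        })

m = ResponseLength()
m.batch("s1", "ctx", "a1", [{"assistant": "Hi"}, {"assistant": "Hello!"}], None)
# m.metrics now holds one result for session "s1"
```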

Statistical Modes

Gaussia supports two statistical approaches: frequentist and Bayesian. FrequentistMode returns point estimates (floats):
from gaussia.statistical import FrequentistMode

metrics = Toxicity.run(
    MyRetriever,
    statistical_mode=FrequentistMode(),
)
# Returns: metric.group_profiling.frequentist.DIDT = 0.33
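The difference between the two approaches can be shown with a small worked example. This is an illustration of the statistical idea only, not Gaussia's implementation: given 3 toxic responses out of 10, a frequentist mode reports a single point estimate, while a Bayesian mode reports a posterior distribution (here a Beta posterior under a uniform Beta(1, 1) prior):

```python
toxic, total = 3, 10

# Frequentist: the observed rate as a single point estimate
frequentist_estimate = toxic / total          # 0.3

# Bayesian: Beta(1, 1) prior updated with the observations
alpha = 1 + toxic                             # 4
beta = 1 + (total - toxic)                    # 8
posterior_mean = alpha / (alpha + beta)       # 4/12, about 0.333
```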

Module Structure

gaussia/
├── core/
│   ├── base.py           # Gaussia base class
│   ├── retriever.py      # Retriever abstract class
│   ├── guardian.py        # Guardian interface (bias detection)
│   ├── sentiment.py       # Sentiment analyzer interface
│   ├── loader.py          # Toxicity loader interface
│   └── extractor.py       # Group extractor interface
├── metrics/
│   ├── context.py         # Context metric
│   ├── conversational.py  # Conversational metric
│   ├── toxicity.py        # Toxicity metric
│   ├── bias.py            # Bias metric
│   ├── humanity.py        # Humanity metric
│   ├── best_of.py         # BestOf metric
│   ├── agentic.py         # Agentic metric
│   ├── vision.py          # Vision metrics
│   └── regulatory.py      # Regulatory metric
├── schemas/
│   ├── common.py          # Dataset, Batch schemas
│   └── ...                # Metric-specific schemas
├── statistical/
│   ├── base.py            # StatisticalMode interface
│   ├── frequentist.py     # Frequentist implementation
│   └── bayesian.py        # Bayesian implementation
├── generators/            # Test dataset generation
├── llm/                   # LLM integration (Judge)
├── guardians/             # Guardian implementations
├── extractors/            # Group extractor implementations
└── loaders/               # Toxicity lexicon loaders

Extension Points

Gaussia is designed for extensibility:
Component            Interface         Purpose
Retriever            load_dataset()    Load custom data sources
Guardian             is_biased()       Custom bias detection
SentimentAnalyzer    infer()           Custom sentiment analysis
ToxicityLoader       load()            Custom toxicity lexicons
BaseGroupExtractor   detect_one()      Custom group detection
StatisticalMode      various methods   Custom statistical analysis
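As a sketch of what plugging into one of these extension points might look like, here is a hypothetical keyword-based Guardian implementing is_biased(). The class name, blocklist approach, and terms are all illustrative; a real guardian would wrap an actual detector:

```python
class KeywordGuardian:
    """Hypothetical Guardian: flags text containing blocklisted terms."""
    def __init__(self, blocklist):
        self.blocklist = [term.lower() for term in blocklist]

    def is_biased(self, text: str) -> bool:
        # Naive substring check; real detectors would be far more robust
        lowered = text.lower()
        return any(term in lowered for term in self.blocklist)

guardian = KeywordGuardian(["badterm"])
```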

Next Steps

Retriever: create custom retrievers for any data source
Dataset & Batch: understand the data structures
Statistical Modes: frequentist vs Bayesian approaches