Metrics Overview

Gaussia provides nine specialized metrics for comprehensive AI evaluation. Each metric focuses on a different aspect of AI behavior and quality.

Available Metrics

Context

Evaluates how well responses align with provided context and instructions.

Conversational

Evaluates dialogue quality using Grice’s Maxims (Quality, Quantity, Relation, Manner).

Toxicity

Measures toxic language with clustering and demographic group profiling using the DIDT framework.

Bias

Detects bias across protected attributes (gender, race, religion, nationality, sexual orientation).

Humanity

Analyzes emotional depth and human-likeness using the NRC Emotion Lexicon.

BestOf

Compares multiple assistants head-to-head with tournament-style evaluation.

Agentic

Evaluates AI agent responses with pass@K metrics and tool-correctness scoring.

Vision

Evaluates VLM scene descriptions using semantic similarity and hallucination detection.

Regulatory

Evaluates responses for regulatory compliance using RAG-based retrieval.

Comparison Table

| Metric | Purpose | Output Type | LLM Required |
| --- | --- | --- | --- |
| Context | Measure context alignment | Per-session scores | Yes (Judge) |
| Conversational | Evaluate dialogue quality | Per-session scores | Yes (Judge) |
| Toxicity | Detect toxic language patterns | Per-session metrics | No |
| Bias | Identify biased responses | Per-session metrics | Yes (Guardian) |
| Humanity | Analyze emotional expression | Per-interaction scores | No |
| BestOf | Compare multiple assistants | Tournament results | Yes (Judge) |
| Agentic | Agent correctness with pass@K | Per-session metrics | Yes (Judge) |
| Vision | VLM similarity / hallucination | Per-session metrics | No |
| Regulatory | Regulatory compliance | Per-session scores | No |

Common Usage Pattern

All metrics follow the same usage pattern:
from gaussia.metrics.<metric> import <Metric>
from gaussia.core.retriever import Retriever

# 1. Define your retriever
class MyRetriever(Retriever):
    def load_dataset(self):
        # Return list[Dataset]
        pass

# 2. Run the metric
results = <Metric>.run(
    MyRetriever,
    **metric_specific_parameters,
    verbose=True,
)

# 3. Analyze results
for result in results:
    # Process metric-specific output
    pass
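
As a concrete instance of this pattern, a run of the lexicon-based Humanity metric might look like the sketch below. How Dataset objects are built is not documented on this page, so the load_dataset body is a placeholder to fill in for your installation.

from gaussia.metrics.humanity import Humanity
from gaussia.core.retriever import Retriever

class FileRetriever(Retriever):
    def load_dataset(self):
        # Placeholder: construct and return your list[Dataset] here;
        # the construction API is not shown on this page.
        ...

results = Humanity.run(FileRetriever, verbose=True)
for result in results:
    print(result)  # per-interaction emotion scores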

Metric Categories

Lexicon-Based Metrics

These metrics use predefined lexicons and don’t require external LLMs:
  • Toxicity: Uses Hurtlex toxicity lexicon + HDBSCAN clustering
  • Humanity: Uses NRC Emotion Lexicon for emotion detection
  • Vision: Uses embedding-based similarity scoring
# No LLM required
from gaussia.metrics.toxicity import Toxicity

results = Toxicity.run(
    MyRetriever,
    group_prototypes={...},
    verbose=True,
)
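
For intuition on the clustering step, here is a generic HDBSCAN run over response embeddings using the standalone hdbscan package. This illustrates the technique only, not Gaussia's internal implementation, and the embeddings are random stand-ins for real encoder output.

import numpy as np
import hdbscan

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in response embeddings

# Density-based clustering; a label of -1 marks noise points.
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embeddings)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {(labels == -1).sum()} noise points")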

LLM-Judge Metrics

These metrics use an LLM as a judge to evaluate responses:
  • Context: Evaluates context alignment
  • Conversational: Evaluates dialogue quality
  • BestOf: Compares assistants in tournaments
  • Agentic: Evaluates agent correctness
# Requires LangChain-compatible model
from gaussia.metrics.context import Context
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

results = Context.run(
    MyRetriever,
    model=judge,
    use_structured_output=True,
    verbose=True,
)
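
For reference, pass@K is conventionally computed with the unbiased estimator from Chen et al. (2021): with n sampled attempts of which c succeed, pass@k = 1 - C(n-c, k) / C(n, k). Whether Gaussia uses exactly this estimator is not stated on this page; a minimal standalone implementation looks like this:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k attempts drawn from
    # n samples (c of them passing) succeeds.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=5))  # 0.9166...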

Guardian-Based Metrics

These metrics use specialized guardian models for detection:
  • Bias: Uses guardian models for bias detection
from gaussia.metrics.bias import Bias

results = Bias.run(
    MyRetriever,
    guardian=MyGuardian,
    verbose=True,
)
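
The Guardian interface itself is not documented on this page. Purely to illustrate the shape such a component might take, here is a hypothetical guardian backed by a keyword blocklist; every name and method below is an assumption, not Gaussia's API.

BLOCKED_TERMS = {"term_a", "term_b"}  # placeholder blocklist

class MyGuardian:
    # Hypothetical interface: the method name and signature are
    # assumptions, not Gaussia's documented Guardian contract.
    def detect(self, text: str) -> bool:
        return any(term in text.lower() for term in BLOCKED_TERMS)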

RAG-Based Metrics

These metrics use retrieval-augmented generation:
  • Regulatory: Retrieves and cross-references regulatory documents
from gaussia.metrics.regulatory import Regulatory

results = Regulatory.run(
    MyRetriever,
    corpus_connector=corpus,
    embedder=embedder,
    reranker=reranker,
)
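
To illustrate the retrieve-then-rerank flow behind RAG scoring (not Gaussia's internals), the sketch below selects the top-k corpus passages by cosine similarity over placeholder embeddings; a reranker would then re-score the candidates against the query text.

import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=5):
    # Normalize, score every document against the query, take top-k.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 64))  # stand-in corpus embeddings
query_vec = rng.normal(size=64)         # stand-in query embedding

idx, scores = cosine_top_k(query_vec, doc_vecs)
print(list(zip(idx.tolist(), scores.round(3).tolist())))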

Choosing a Metric

  • Use Toxicity for detecting toxic language patterns and demographic targeting.
  • Use Bias for detecting discrimination across protected attributes.
  • Use Context for measuring alignment with system context.
  • Use Conversational for assessing dialogue using Grice’s Maxims.
  • Use Humanity for analyzing emotional depth and human-likeness.
  • Use BestOf for tournament-style head-to-head comparisons.
  • Use Agentic for pass@K metrics and tool-correctness scoring.
  • Use Vision for VLM hallucination detection and similarity scoring.
  • Use Regulatory for evaluating responses against a regulatory corpus.

Next Steps

Context

Start with context evaluation

Toxicity

Learn about toxicity detection

Agentic

Evaluate AI agents