Metrics Overview
Gaussia provides nine specialized metrics for comprehensive AI evaluation. Each metric focuses on a different aspect of AI behavior and quality.
Available Metrics
Context
Evaluates how well responses align with provided context and instructions.
Conversational
Evaluates dialogue quality using Grice’s Maxims (Quality, Quantity, Relation, Manner).
Toxicity
Measures toxic language with clustering and demographic group profiling using the DIDT framework.
Bias
Detects bias across protected attributes (gender, race, religion, nationality, sexual orientation).
Humanity
Analyzes emotional depth and human-likeness using the NRC Emotion Lexicon.
BestOf
Tournament-style evaluation to compare multiple assistants head-to-head.
Agentic
Evaluates AI agent responses with pass@K metrics and tool correctness.
Vision
Evaluates VLM scene descriptions using semantic similarity and hallucination detection.
Regulatory
Evaluates responses against regulatory requirements using RAG-based retrieval.
Comparison Table
| Metric | Purpose | Output Type | LLM Required |
|---|---|---|---|
| Context | Measure context alignment | Per-session scores | Yes (Judge) |
| Conversational | Evaluate dialogue quality | Per-session scores | Yes (Judge) |
| Toxicity | Detect toxic language patterns | Per-session metrics | No |
| Bias | Identify biased responses | Per-session metrics | Yes (Guardian) |
| Humanity | Analyze emotional expression | Per-interaction scores | No |
| BestOf | Compare multiple assistants | Tournament results | Yes (Judge) |
| Agentic | Agent correctness with pass@K | Per-session metrics | Yes (Judge) |
| Vision | VLM similarity / hallucination | Per-session metrics | No |
| Regulatory | Regulatory compliance | Per-session scores | No |
Common Usage Pattern
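The shared flow is: construct the metric, run it over a session, inspect the scores. The sketch below illustrates that shape with a stub metric; the class name, `evaluate` signature, and scoring logic are stand-ins for illustration, not Gaussia's documented API.

```python
from dataclasses import dataclass

# Stand-in for a Gaussia metric. The real classes and signatures are
# assumptions here, but every metric follows this same shape.
@dataclass
class ExampleMetric:
    threshold: float = 0.5

    def evaluate(self, session: list[dict]) -> dict:
        # A real metric scores each assistant turn on its own criteria;
        # this stub just scores turn length to show the flow.
        scores = [
            min(len(turn["content"].split()) / 20, 1.0)
            for turn in session if turn["role"] == "assistant"
        ]
        return {"per_turn": scores, "mean": sum(scores) / len(scores)}

session = [
    {"role": "user", "content": "Summarise the report."},
    {"role": "assistant", "content": "The report covers Q3 revenue and churn."},
]

metric = ExampleMetric()           # 1. construct the metric
result = metric.evaluate(session)  # 2. run it over a session
print(result["mean"])              # 3. inspect the scores -> 0.35
```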
All metrics follow the same usage pattern: construct the metric, evaluate your sessions, and inspect the returned scores.
Metric Categories
Lexicon-Based Metrics
These metrics use predefined lexicons and don’t require external LLMs:
- Toxicity: Uses the Hurtlex toxicity lexicon + HDBSCAN clustering
- Humanity: Uses NRC Emotion Lexicon for emotion detection
- Vision: Uses embedding-based similarity scoring
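The core of lexicon-based scoring needs no LLM: tokenize the text, look each word up in the lexicon, and aggregate the hits. The toy three-entry lexicon below is a stand-in for Hurtlex or the NRC Emotion Lexicon, and the sketch omits the HDBSCAN clustering step the Toxicity metric adds on top:

```python
# Toy stand-in lexicon; real lexicons map thousands of terms to categories.
TOY_LEXICON = {"idiot": "toxic", "stupid": "toxic", "joyful": "emotion:joy"}

def lexicon_hits(text: str, lexicon: dict[str, str]) -> dict[str, int]:
    """Count lexicon category hits in a piece of text."""
    counts: dict[str, int] = {}
    for word in text.lower().split():
        tag = lexicon.get(word.strip(".,!?"))  # ignore trailing punctuation
        if tag:
            counts[tag] = counts.get(tag, 0) + 1
    return counts

print(lexicon_hits("You stupid idiot!", TOY_LEXICON))  # {'toxic': 2}
```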
LLM-Judge Metrics
These metrics use an LLM as a judge to evaluate responses:
- Context: Evaluates context alignment
- Conversational: Evaluates dialogue quality
- BestOf: Compares assistants in tournaments
- Agentic: Evaluates agent correctness
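For the Agentic metric's pass@K, the standard unbiased estimator (the probability that at least one of k samples drawn from n attempts, c of them correct, passes) can be computed as below; whether Gaussia uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```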
Guardian-Based Metrics
These metrics use specialized guardian models for detection:
- Bias: Uses guardian models for bias detection
RAG-Based Metrics
These metrics use retrieval-augmented generation:
- Regulatory: Retrieves and cross-references regulatory documents
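The retrieval step can be sketched as ranking a corpus of regulatory clauses by similarity to the response. The toy version below uses bag-of-words cosine similarity in pure Python; a real pipeline would use dense embeddings and cross-reference the retrieved clause with the response, and the two-clause corpus is invented for illustration:

```python
from collections import Counter
from math import sqrt

# Invented two-clause regulatory corpus, for illustration only.
CORPUS = [
    "Personal data must be stored within the EU.",
    "Ads must not target children under 13.",
]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str]) -> str:
    """Return the corpus clause most similar to the query."""
    q = Counter(query.lower().split())
    return max(corpus, key=lambda doc: cosine(q, Counter(doc.lower().split())))

print(retrieve("Where may personal data be stored?", CORPUS))
```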
Choosing a Metric
I want to detect harmful language
Use Toxicity for detecting toxic language patterns and demographic targeting.
I want to check for bias
Use Bias for detecting discrimination across protected attributes.
I want to ensure responses follow instructions
Use Context for measuring alignment with system context.
I want to evaluate conversation quality
Use Conversational for assessing dialogue using Grice’s Maxims.
I want to measure emotional expression
Use Humanity for analyzing emotional depth and human-likeness.
I want to compare multiple assistants
Use BestOf for tournament-style head-to-head comparisons.
I want to evaluate an AI agent
Use Agentic for pass@K metrics and tool correctness scoring.
I want to evaluate a vision model
Use Vision for VLM hallucination detection and similarity scoring.
I want to check regulatory compliance
Use Regulatory for evaluating responses against a regulatory corpus.
Next Steps
Context
Start with context evaluation
Toxicity
Learn about toxicity detection
Agentic
Evaluate AI agents