Overview

The Vision module provides two complementary metrics for evaluating Vision Language Models (VLMs):
- **VisionSimilarity**: How accurately the VLM describes scenes compared to human ground truth
- **VisionHallucination**: How often the VLM describes content not present in the scene

Both metrics use a pluggable `SimilarityScorer` (defaulting to cosine similarity with `all-mpnet-base-v2`).
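
Conceptually, both metrics embed the two descriptions and compare the vectors with cosine similarity. A minimal, library-independent sketch of that comparison (the toy vectors below are illustrative stand-ins, not real sentence embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: dot product
    divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ≈ 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In the real metrics this comparison runs on high-dimensional embeddings produced by the configured scorer, not hand-written vectors.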

VisionSimilarity

Measures semantic similarity between VLM descriptions and human annotations.
```python
from gaussia.metrics.vision import VisionSimilarity

results = VisionSimilarity.run(MyRetriever)

for r in results:
    print(f"Mean similarity: {r.mean_similarity:.0%}")
    print(f"Range: [{r.min_similarity:.0%}, {r.max_similarity:.0%}]")
    print(r.summary)
```

Output

| Field | Type | Description |
|---|---|---|
| `mean_similarity` | `float` | Average similarity across all frames |
| `min_similarity` | `float` | Minimum similarity score |
| `max_similarity` | `float` | Maximum similarity score |
| `summary` | `str` | Human-readable summary |
| `interactions` | `list[VisionSimilarityInteraction]` | Per-frame scores |

VisionHallucination

Flags frames where similarity falls below a threshold as hallucinations.
```python
from gaussia.metrics.vision import VisionHallucination

results = VisionHallucination.run(
    MyRetriever,
    threshold=0.75,
)

for r in results:
    print(f"Hallucination rate: {r.hallucination_rate:.0%}")
    print(f"Hallucinations: {r.n_hallucinations}/{r.n_frames}")
```

Output

| Field | Type | Description |
|---|---|---|
| `hallucination_rate` | `float` | Fraction of hallucinated frames |
| `n_hallucinations` | `int` | Number of hallucinated frames |
| `n_frames` | `int` | Total frames evaluated |
| `threshold` | `float` | Threshold used |
| `summary` | `str` | Human-readable summary |
| `interactions` | `list[VisionHallucinationInteraction]` | Per-frame results |
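
The aggregation itself is simple: a frame counts as a hallucination when its similarity falls below the threshold, and the rate is the fraction of such frames. A hedged pure-Python sketch of that bookkeeping (the field names mirror the table above; the scores are made up):

```python
def hallucination_stats(scores: list[float], threshold: float = 0.75) -> dict:
    """Flag frames whose similarity is below `threshold` and aggregate."""
    flagged = [s for s in scores if s < threshold]
    return {
        "n_frames": len(scores),
        "n_hallucinations": len(flagged),
        "hallucination_rate": len(flagged) / len(scores) if scores else 0.0,
    }

# Two of the four frames fall below 0.75, giving a rate of 0.5.
stats = hallucination_stats([0.92, 0.61, 0.80, 0.55], threshold=0.75)
print(stats)
```

This is only an illustration of the counting logic; the actual metric computes the per-frame scores with the configured scorer before aggregating.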

Parameters (both metrics)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `retriever` | `type[Retriever]` | required | Retriever class |
| `scorer` | `SimilarityScorer` | Cosine + mpnet | Similarity scoring strategy |
| `threshold` | `float` | `0.75` | Hallucination threshold |

Custom scorer

```python
from gaussia.embedders import SentenceTransformerEmbedder
from gaussia.scorers import CosineSimilarity

scorer = CosineSimilarity(SentenceTransformerEmbedder(model="all-MiniLM-L6-v2"))
results = VisionSimilarity.run(MyRetriever, scorer=scorer)
```
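
You can also write a scorer from scratch. The exact `SimilarityScorer` protocol is not documented here, so the sketch below assumes it only needs a `score(candidate, reference) -> float` method returning a value in [0, 1] (an assumption to verify against the library), using token-overlap (Jaccard) similarity as a deliberately simple stand-in:

```python
class JaccardScorer:
    """Hypothetical custom scorer using token-overlap (Jaccard) similarity.

    Assumes the SimilarityScorer protocol requires only a
    `score(candidate, reference)` method returning a float in [0, 1];
    check the actual protocol before relying on this shape.
    """

    def score(self, candidate: str, reference: str) -> float:
        a = set(candidate.lower().split())
        b = set(reference.lower().split())
        if not a and not b:
            return 1.0  # two empty descriptions count as identical
        return len(a & b) / len(a | b)

scorer = JaccardScorer()
# Shared tokens {a, in, park} out of 5 distinct tokens → 0.6
print(scorer.score("a dog in a park", "a cat in a park"))
```

A lexical scorer like this misses paraphrases ("jogging" vs "running" scores zero overlap), which is why the default is embedding-based; it is mainly useful as a cheap baseline or for testing.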

Expected batch format

```python
Batch(
    qa_id="frame-001",
    query="Describe the scene",
    assistant="A person walking a dog in a park",        # VLM output
    ground_truth_assistant="A woman jogging with her golden retriever",  # Human annotation
)
```

Requires the `vision` extra: `pip install "gaussia[vision]"`.