Overview
The Generators module creates synthetic Dataset objects from context documents. This is useful for bootstrapping evaluations when you don’t have real conversation data.
How it works
- A context loader reads and chunks your documents
- A chunk selection strategy picks which chunks to process
- The generator uses an LLM to create realistic QA pairs from each chunk
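The three stages above can be sketched in plain Python. This is an illustration of the flow only, not Gaussia's implementation: `load_chunks`, `select_sequential`, `generate_qa`, `QAPair`, and the `fake_llm` stub are all hypothetical names, and a real run would call an LLM where the stub is used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QAPair:
    question: str
    answer: str


def load_chunks(text: str) -> list[str]:
    """Stage 1 (assumed): split a document into chunks on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def select_sequential(chunks: list[str]) -> list[str]:
    """Stage 2 (assumed): keep every chunk, in document order."""
    return list(chunks)


def generate_qa(
    chunks: list[str], llm: Callable[[str], tuple[str, str]]
) -> list[QAPair]:
    """Stage 3 (assumed): ask the model for one QA pair per selected chunk."""
    return [QAPair(*llm(chunk)) for chunk in chunks]


def fake_llm(chunk: str) -> tuple[str, str]:
    """Stand-in for a real LLM call so the sketch runs offline."""
    first_line = chunk.splitlines()[0]
    return (f"What does the section '{first_line}' say?", chunk)


document = "Refunds\nRefunds take 5 days.\n\nShipping\nWe ship worldwide."
pairs = generate_qa(select_sequential(load_chunks(document)), fake_llm)
for pair in pairs:
    print(pair.question)
```

Keeping the stages as separate functions mirrors the split described above: the loader and the selection step can be swapped without touching the generation step.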
Usage
from langchain_openai import ChatOpenAI
from gaussia.generators import BaseGenerator, create_markdown_loader

model = ChatOpenAI(model="gpt-4o-mini")
generator = BaseGenerator(model=model)
loader = create_markdown_loader()

# generate_dataset is awaited, so run this inside an async function or a notebook cell
datasets = await generator.generate_dataset(
    context_loader=loader,
    source="./docs/knowledge_base.md",
    assistant_id="my-assistant",
)
Context loaders
LocalMarkdownLoader
Reads markdown files and splits them into chunks based on headers and size:
from gaussia.generators import create_markdown_loader

loader = create_markdown_loader(
    max_chunk_size=2000,   # Max characters per chunk
    min_chunk_size=200,    # Min characters per chunk
    overlap=100,           # Overlap between size-based chunks
    header_levels=[1, 2],  # Split on H1 and H2
)
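To make the header-and-size splitting concrete, here is a minimal sketch of one way such chunking can work. It is not the loader's actual code: `split_on_headers` and `split_by_size` are hypothetical helpers, the parameter names are borrowed from the call above, and `min_chunk_size` handling is deliberately left out because its exact behaviour is not described here.

```python
import re


def split_on_headers(text: str, header_levels: list[int]) -> list[str]:
    """Start a new section at each header whose level is in header_levels."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        match = re.match(r"(#+)\s", line)
        if match and len(match.group(1)) in header_levels and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def split_by_size(section: str, max_chunk_size: int, overlap: int) -> list[str]:
    """Cut an oversized section into windows that share `overlap` characters."""
    if len(section) <= max_chunk_size:
        return [section]
    step = max_chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_chunk_size")
    return [
        section[start:start + max_chunk_size]
        for start in range(0, len(section) - overlap, step)
    ]


doc = "# Guide\nIntro text.\n## Setup\n" + "x" * 50 + "\n### Detail\nStays with Setup."
sections = split_on_headers(doc, header_levels=[1, 2])
chunks = [
    piece
    for section in sections
    for piece in split_by_size(section, max_chunk_size=40, overlap=10)
]
print(len(sections), len(chunks))
```

In this sketch the H3 header stays inside the "Setup" section because only levels 1 and 2 trigger a split, and consecutive size-based windows repeat the last `overlap` characters of the previous window so that text cut at a boundary still appears whole in one chunk.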
Custom loader
Implement BaseContextLoader for custom document sources:
from gaussia.generators import BaseContextLoader, Chunk

class MyLoader(BaseContextLoader):
    def load(self, source: str) -> list[Chunk]:
        # Return list of Chunk objects
        ...
Chunk selection strategies
| Strategy | Description |
|---|---|
| SequentialStrategy | Process all chunks in order (default) |
| RandomSamplingStrategy | Randomly sample chunks multiple times |
from gaussia.generators import RandomSamplingStrategy

datasets = await generator.generate_dataset(
    context_loader=loader,
    source="./docs/",
    assistant_id="my-assistant",
    strategy=RandomSamplingStrategy(n_samples=10, seed=42),
)