Collection Naming Strategy

Why This Module Exists

When working with vector databases like Qdrant, the name of your collection is not cosmetic — it is a data integrity contract. Mixing incompatible vectors in the same collection silently corrupts your search results with no error or warning.

This module (collection_name.py) generates deterministic, self-documenting collection names that encode the project, embedding model, vector dimensions, chunk size, and search type. If any of these parameters changes, a new name is produced, which means a new collection, so incompatible vectors are never mixed by accident.

The Golden Rule of Vector Collections

You MUST create a new collection whenever the mathematical comparison between two vectors becomes invalid.

If you compare a vector from Model A with a vector from Model B, the result is meaningless garbage. The numbers look real, cosine similarity returns a value, but it means nothing. There is no error. There is no crash. Your app just silently returns wrong answers.

Therefore, incompatible vectors cannot live in the same collection.
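
To make the "no error, no crash" point concrete, here is a minimal sketch with toy 4-dimensional vectors. The two "models" are purely hypothetical: Model B is simulated by reversing the coordinate order of Model A, which stands in for two embedding spaces whose axes mean different things.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def to_model_b(vec):
    """Hypothetical 'Model B': same information, axes in a different order."""
    return list(reversed(vec))

# Toy 'Model A' embeddings of a related query and document.
doc_a = [0.9, 0.1, 0.0, 0.2]
query_a = [0.8, 0.2, 0.1, 0.1]

doc_b = to_model_b(doc_a)
query_b = to_model_b(query_a)

same_space_a = cosine(query_a, doc_a)  # high: both vectors come from Model A
same_space_b = cosine(query_b, doc_b)  # equally high: both come from Model B
cross_space = cosine(query_a, doc_b)   # a valid float, but far lower and meaningless

print(same_space_a, same_space_b, cross_space)
```

Nothing in the cross-space call fails or warns. The only defence is never letting the two kinds of vectors share a collection.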

The 4 Factors That Force a New Collection

| # | Factor | Why It Forces a New Collection | Priority |
|---|--------|--------------------------------|----------|
| 1 | Model | `jina-v3` vectors are numerically unrelated to `openai-v3` vectors. They live in completely different mathematical spaces. | 🔴 Critical |
| 2 | Dimensions | You cannot insert a 1024-dim vector into a collection built for 768-dim. Qdrant will reject it outright. | 🔴 Critical |
| 3 | Distance Metric | A collection using `Cosine` ranks results differently than one using `Dot Product`. Mixing them invalidates your ranking logic. | 🟠 High |
| 4 | Chunking Strategy | If Batch A was chunked by paragraphs and Batch B by sentences, search results become biased and inconsistent. Apples vs. oranges. | 🟡 Medium |
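
The distance-metric row can be illustrated with a small sketch. The vectors and document names below are made up for the example. They show that cosine and dot product can pick different top results when vectors are not normalized.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.0],  # same direction as the query, small magnitude
    "off_axis_long": [3.0, 3.0],  # 45 degrees off the query, large magnitude
}

best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))

print(best_by_cosine)  # → aligned_short
print(best_by_dot)     # → off_axis_long
```

Cosine only looks at direction, while dot product also rewards magnitude. That is why the same stored vectors can rank differently under the two metrics.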

Naming Format

[PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE]

Example Breakdown

imam_reza_jina-v3_1024d_512t_hybrid
│          │       │      │     │
│          │       │      │     └── hybrid = sparse + dense vectors enabled
│          │       │      └── 512t = 512 tokens per chunk
│          │       └── 1024d = vector dimensionality
│          └── jina-v3 = embedding model
└── imam_reza = project / domain name

| Segment | Purpose |
|---------|---------|
| `imam_reza` | Project/Domain: keeps this data isolated from other apps or datasets. |
| `jina-v3` | Model: the embedding brain. Changing models = new collection, always. |
| `1024d` | Dimensions: tells you (and Qdrant) the vector size at a glance. |
| `512t` | Chunk size: "512 tokens per chunk". If you later experiment with 128-token chunks, a separate collection lets you compare performance fairly. |
| `hybrid` | Search type: indicates sparse vectors are enabled alongside dense vectors. |
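
As a rough sketch of how the five segments fit together, the format can be assembled with a single f-string. `build_collection_name` is a hypothetical helper written for this illustration and is not the module's actual implementation. The `dense` label for non-hybrid collections follows the usage examples in this document.

```python
def build_collection_name(project, model, dimensions, chunk_size, is_hybrid):
    """Hypothetical helper: join the five segments in the documented order."""
    search_type = "hybrid" if is_hybrid else "dense"
    return f"{project}_{model}_{dimensions}d_{chunk_size}t_{search_type}"

name = build_collection_name("imam_reza", "jina-v3", 1024, 512, True)
print(name)  # → imam_reza_jina-v3_1024d_512t_hybrid

# Changing any one input yields a different name, and therefore a different collection.
smaller_chunks = build_collection_name("imam_reza", "jina-v3", 1024, 128, True)
print(smaller_chunks)  # → imam_reza_jina-v3_1024d_128t_hybrid
```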

How It Works

get_collection_name() pulls values from two sources:

- config/embeddings.yaml: the active embedding model, its id, and dimensions.
- Environment variables (.env): BASE_COLLECTION_NAME, EMBEDDER_DIMENSIONS, CHUNK_SIZE, IS_HYBRID.

Every parameter can also be overridden explicitly via function arguments.

Resolution Priority (per parameter)

| Parameter | 1st (highest) | 2nd | 3rd (fallback) |
|-----------|---------------|-----|----------------|
| `project_name` | Explicit arg | `BASE_COLLECTION_NAME` env var | `"default_project"` |
| `model_name` | Explicit arg | `embeddings.yaml` → `default` key | (none) |
| `dimensions` | Explicit arg | `embeddings.yaml` → model config | `EMBEDDER_DIMENSIONS` env var |
| `chunk_size` | Explicit arg | `CHUNK_SIZE` env var | `500` |
| `is_hybrid` | Explicit arg | `IS_HYBRID` env var | `true` |
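
The priority order in the table can be sketched as a chain of fallbacks. The helpers below are hypothetical and cover only `chunk_size` and `is_hybrid`. The set of strings treated as "true" for `IS_HYBRID` is an assumption, since the module's actual parsing is not shown here.

```python
import os

def resolve_chunk_size(explicit=None, env=None):
    """Hypothetical: explicit arg, then CHUNK_SIZE env var, then 500."""
    env = os.environ if env is None else env
    if explicit is not None:
        return int(explicit)
    raw = env.get("CHUNK_SIZE")
    if raw:
        return int(raw)
    return 500

def resolve_is_hybrid(explicit=None, env=None):
    """Hypothetical: explicit arg, then IS_HYBRID env var, then True."""
    env = os.environ if env is None else env
    if explicit is not None:
        return bool(explicit)
    raw = env.get("IS_HYBRID")
    if raw is None:
        return True
    # Assumed truthy spellings; the real module may accept a different set.
    return raw.strip().lower() in {"1", "true", "yes", "on"}

# A plain dict stands in for the environment so the example is deterministic.
fake_env = {"CHUNK_SIZE": "256", "IS_HYBRID": "false"}
print(resolve_chunk_size(128, env=fake_env))  # → 128 (explicit arg wins)
print(resolve_chunk_size(env=fake_env))       # → 256 (env var)
print(resolve_chunk_size(env={}))             # → 500 (fallback)
print(resolve_is_hybrid(env=fake_env))        # → False (env var)
```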

Model Lookup

The model_name parameter is flexible — it accepts either:

- A config key from embeddings.yaml (e.g., "jina_AI")
- A model id (e.g., "jina-embeddings-v4")
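
A lookup that accepts either form might look like the sketch below. The dictionary layout (a top-level `default` key plus a `models` mapping with `id` and `dimensions` fields) is an assumption about what embeddings.yaml contains after loading, and `find_model` is a hypothetical name. The pairing of each config key with its model id is inferred from the examples in this document.

```python
# Assumed shape of the loaded embeddings.yaml; the real file may differ.
EMBEDDINGS = {
    "default": "jina_AI",
    "models": {
        "jina_AI": {"id": "jina-embeddings-v4", "dimensions": 1024},
        "openai_small": {"id": "text-embedding-3-small", "dimensions": 1536},
    },
}

def find_model(model_name, config=EMBEDDINGS):
    """Hypothetical lookup: try the config key first, then match on model id."""
    models = config["models"]
    if model_name is None:
        model_name = config["default"]
    if model_name in models:
        return models[model_name]
    for entry in models.values():
        if entry["id"] == model_name:
            return entry
    raise KeyError(f"Unknown embedding model: {model_name!r}")

by_key = find_model("jina_AI")
by_id = find_model("jina-embeddings-v4")
print(by_key == by_id)  # → True
```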

Usage

from src.utils.collection_name import get_collection_name

# All defaults from config + env
name = get_collection_name()
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_hybrid"

# Override chunk size
name = get_collection_name(chunk_size=128)
# → "dovodi_collection_jina-embeddings-v4_1024d_128t_hybrid"

# Dense-only collection
name = get_collection_name(is_hybrid=False)
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_dense"

# Different model from embeddings.yaml
name = get_collection_name(model_name="openai_small")
# → "dovodi_collection_text-embedding-3-small_1536d_500t_hybrid"

What Happens If You Don't Do This

| Scenario | What Goes Wrong |
|----------|-----------------|
| You switch from `jina-v3` to `openai-v3` but keep the same collection | Old jina vectors and new openai vectors get compared. Search returns nonsense. No error is raised. |
| You change dimensions from 1024 to 768 | Qdrant rejects the insert with a dimension-mismatch error. At least this one is loud. |
| You change chunk size from 512 to 128 | Short and long chunks are ranked against each other, which biases similarity scores and makes results inconsistent. Search quality degrades silently. |
| You switch from Cosine to Dot Product | Dot product also rewards vector magnitude, so for unnormalized vectors the ranking can change substantially. For normalized vectors the two metrics rank identically. |

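The naming convention guards against the first three mistakes by construction: if the model, dimensions, or chunk size changes, a new collection name is generated, and the old data is never touched. The distance metric is not one of the segments in the name format, so a metric change still has to be handled deliberately by creating a separate collection.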
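
Because the name is self-documenting, it can also be read back and used as a cheap sanity check before writing vectors. The sketch below is hypothetical and is not part of collection_name.py. It assumes the model segment never contains an underscore, which holds for the examples in this document but is not guaranteed in general.

```python
def parse_collection_name(name):
    """Hypothetical parser: split from the right, since the project may contain '_'."""
    project_and_model, dims, chunk, search_type = name.rsplit("_", 3)
    project, model = project_and_model.rsplit("_", 1)
    if not dims.endswith("d") or not chunk.endswith("t"):
        raise ValueError(f"Unexpected collection name format: {name!r}")
    return {
        "project": project,
        "model": model,
        "dimensions": int(dims[:-1]),
        "chunk_size": int(chunk[:-1]),
        "is_hybrid": search_type == "hybrid",
    }

def check_vector_fits(name, vector):
    """Raise early if a vector's length disagrees with the name's dimension segment."""
    expected = parse_collection_name(name)["dimensions"]
    if len(vector) != expected:
        raise ValueError(f"{name} expects {expected} dims, got {len(vector)}")

parsed = parse_collection_name("imam_reza_jina-v3_1024d_512t_hybrid")
print(parsed["project"], parsed["model"], parsed["dimensions"])
# → imam_reza jina-v3 1024

check_vector_fits("imam_reza_jina-v3_1024d_512t_hybrid", [0.0] * 1024)  # passes silently
```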
