Collection Naming Strategy
Why This Module Exists
When working with vector databases like Qdrant, the name of your collection is not cosmetic — it is a data integrity contract. Mixing incompatible vectors in the same collection silently corrupts your search results with no error or warning.
This module (collection_name.py) generates deterministic, self-documenting collection names that encode the project, embedding model, dimensions, chunk size, and search type. If any of these changes, a new name is produced, which means a new collection, so incompatible vectors never end up mixed.
The Golden Rule of Vector Collections
You MUST create a new collection whenever the mathematical comparison between two vectors becomes invalid.
If you compare a vector from Model A with a vector from Model B, the result is meaningless garbage. The numbers look real, cosine similarity returns a value, but it means nothing. There is no error. There is no crash. Your app just silently returns wrong answers.
Therefore, incompatible vectors cannot live in the same collection.
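To make the "no error, no crash" point concrete, here is a minimal sketch in plain Python. `fake_embed` is a hypothetical stand-in for two unrelated embedding models (it only produces deterministic pseudo-random vectors), so the numbers illustrate the failure mode rather than real model behaviour.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def fake_embed(text, model, dim=8):
    """Hypothetical embedder: a deterministic pseudo-random vector per (model, text)."""
    rng = random.Random(f"{model}:{text}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


text = "opening hours of the library"
vec_a = fake_embed(text, model="model_a")
vec_b = fake_embed(text, model="model_b")

# Same text, same model: the vectors are identical, so similarity is ~1.0.
same_model = cosine(vec_a, fake_embed(text, model="model_a"))

# Same text, different models: still a perfectly valid-looking float,
# but it says nothing about how similar the texts are.
cross_model = cosine(vec_a, vec_b)

print(f"same model:  {same_model:.3f}")
print(f"cross model: {cross_model:.3f}")
```

Nothing in the arithmetic can tell that `vec_a` and `vec_b` come from different spaces, which is why the separation has to happen at the collection level.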
The 4 Factors That Force a New Collection
| # | Factor | Why It Forces a New Collection | Priority |
|---|---|---|---|
| 1 | Model | jina-v3 vectors are numerically unrelated to openai-v3 vectors. They live in completely different mathematical spaces. | 🔴 Critical |
| 2 | Dimensions | You cannot insert a 1024-dim vector into a collection built for 768-dim. Qdrant will reject it outright. | 🔴 Critical |
| 3 | Distance Metric | A collection using Cosine ranks results differently than one using Dot Product. Mixing them invalidates your ranking logic. | 🟠 High |
| 4 | Chunking Strategy | If Batch A was chunked by paragraphs and Batch B by sentences, search results become biased and inconsistent. Apples vs. oranges. | 🟡 Medium |
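
The distance-metric row can be illustrated with a small sketch. The vectors below are made-up two-dimensional examples, not output from any model; they only show that Cosine and Dot Product can order the same documents differently when vectors are not normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.0],  # same direction as the query, small magnitude
    "off_axis_long": [3.0, 4.0],  # different direction, large magnitude
}

# Cosine only looks at direction, so the aligned document wins.
by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)

# Dot product also rewards magnitude, so the long document wins.
by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)

# After normalizing to unit length, dot product agrees with cosine again.
unit_docs = {k: normalize(v) for k, v in docs.items()}
by_dot_unit = sorted(unit_docs, key=lambda k: dot(query, unit_docs[k]), reverse=True)

print(by_cosine)    # ['aligned_short', 'off_axis_long']
print(by_dot)       # ['off_axis_long', 'aligned_short']
print(by_dot_unit)  # ['aligned_short', 'off_axis_long']
```

If the embedding model already returns unit-length vectors, the two metrics produce the same ordering, so how much this factor matters depends on the model.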
Naming Format
[PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE]
Example Breakdown
imam_reza_jina-v3_1024d_512t_hybrid
│         │       │     │    │
│         │       │     │    └── hybrid = sparse + dense vectors enabled
│         │       │     └── 512t = 512 tokens per chunk
│         │       └── 1024d = vector dimensionality
│         └── jina-v3 = embedding model
└── imam_reza = project / domain name
| Segment | Purpose |
|---|---|
| `imam_reza` | Project/Domain: keeps this data isolated from other apps or datasets. |
| `jina-v3` | Model: the embedding brain. Changing models = new collection, always. |
| `1024d` | Dimensions: tells you (and Qdrant) the vector size at a glance. |
| `512t` | Chunk size: "512 tokens per chunk". If you later experiment with 128-token chunks, a separate collection lets you compare performance fairly. |
| `hybrid` | Search type: indicates sparse vectors are enabled alongside dense vectors. |
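
As a rough illustration of the format, the segments can be joined with a single f-string. `build_collection_name` is a hypothetical helper written for this example, not the module's actual code; the `dense` suffix for non-hybrid collections is taken from the usage examples in this document.

```python
def build_collection_name(project, model, dimensions, chunk_size, is_hybrid):
    """Join the five segments in the documented order (illustrative only)."""
    search_type = "hybrid" if is_hybrid else "dense"
    return f"{project}_{model}_{dimensions}d_{chunk_size}t_{search_type}"


name = build_collection_name("imam_reza", "jina-v3", 1024, 512, True)
print(name)  # imam_reza_jina-v3_1024d_512t_hybrid

# Changing any single parameter yields a different name, hence a different collection.
smaller_chunks = build_collection_name("imam_reza", "jina-v3", 1024, 128, True)
dense_only = build_collection_name("imam_reza", "jina-v3", 1024, 512, False)
print(smaller_chunks)  # imam_reza_jina-v3_1024d_128t_hybrid
print(dense_only)      # imam_reza_jina-v3_1024d_512t_dense
```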
How It Works
get_collection_name() pulls values from two sources:
- `config/embeddings.yaml`: the active embedding model, its `id`, and `dimensions`.
- Environment variables (`.env`): `BASE_COLLECTION_NAME`, `EMBEDDER_DIMENSIONS`, `CHUNK_SIZE`, `IS_HYBRID`.
Every parameter can also be overridden explicitly via function arguments.
Resolution Priority (per parameter)
| Parameter | 1st (highest) | 2nd | 3rd (fallback) |
|---|---|---|---|
| `project_name` | Explicit arg | `BASE_COLLECTION_NAME` env var | `"default_project"` |
| `model_name` | Explicit arg | `embeddings.yaml` → `default` key | (none) |
| `dimensions` | Explicit arg | `embeddings.yaml` → model config | `EMBEDDER_DIMENSIONS` env var |
| `chunk_size` | Explicit arg | `CHUNK_SIZE` env var | `500` |
| `is_hybrid` | Explicit arg | `IS_HYBRID` env var | `true` |
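
The priority order for `chunk_size` and `is_hybrid` could look roughly like the sketch below. The helper names (`first_set`, `resolve_chunk_size`, `resolve_is_hybrid`) and the rule for parsing `IS_HYBRID` as a boolean are assumptions made for illustration; the real module may resolve these differently.

```python
import os


def first_set(*candidates):
    """Return the first candidate that is neither None nor an empty string."""
    for value in candidates:
        if value is not None and value != "":
            return value
    return None


def resolve_chunk_size(explicit=None, env=os.environ):
    # Explicit arg > CHUNK_SIZE env var > fallback of 500.
    return int(first_set(explicit, env.get("CHUNK_SIZE"), 500))


def resolve_is_hybrid(explicit=None, env=os.environ):
    # Explicit arg > IS_HYBRID env var > fallback of True.
    value = first_set(explicit, env.get("IS_HYBRID"), True)
    if isinstance(value, str):
        # Assumed parsing rule: only these spellings count as true.
        return value.strip().lower() in {"1", "true", "yes"}
    return bool(value)


# A plain dict stands in for the environment so the example is repeatable.
fake_env = {"CHUNK_SIZE": "256", "IS_HYBRID": "false"}
print(resolve_chunk_size(128, env=fake_env))   # 128 (explicit arg wins)
print(resolve_chunk_size(env=fake_env))        # 256 (env var)
print(resolve_chunk_size(env={}))              # 500 (fallback)
print(resolve_is_hybrid(env=fake_env))         # False (env var)
print(resolve_is_hybrid(env={}))               # True (fallback)
```

Passing the environment in as a mapping keeps the resolution logic easy to test without touching the real process environment.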
Model Lookup
The `model_name` parameter accepts either:
- A config key from `embeddings.yaml` (e.g., `"jina_AI"`)
- A model id (e.g., `"jina-embeddings-v4"`)
Usage
from src.utils.collection_name import get_collection_name
# All defaults from config + env
name = get_collection_name()
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_hybrid"
# Override chunk size
name = get_collection_name(chunk_size=128)
# → "dovodi_collection_jina-embeddings-v4_1024d_128t_hybrid"
# Dense-only collection
name = get_collection_name(is_hybrid=False)
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_dense"
# Different model from embeddings.yaml
name = get_collection_name(model_name="openai_small")
# → "dovodi_collection_text-embedding-3-small_1536d_500t_hybrid"
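
The last call above passes a config key, while the generated name contains the model id. A lookup that accepts either form might be sketched as follows. The `EMBEDDINGS` dict is a hypothetical in-memory stand-in for the parsed `embeddings.yaml`; its layout, and the pairing of `jina_AI` with `jina-embeddings-v4`, are assumptions, since the real file is not shown here.

```python
# Hypothetical stand-in for the parsed contents of embeddings.yaml.
EMBEDDINGS = {
    "jina_AI": {"id": "jina-embeddings-v4", "dimensions": 1024},
    "openai_small": {"id": "text-embedding-3-small", "dimensions": 1536},
}


def find_model(model_name, embeddings=EMBEDDINGS):
    """Resolve a model by config key first, then by model id."""
    if model_name in embeddings:
        return embeddings[model_name]
    for entry in embeddings.values():
        if entry["id"] == model_name:
            return entry
    raise KeyError(f"unknown embedding model: {model_name!r}")


by_key = find_model("openai_small")
by_id = find_model("text-embedding-3-small")
print(by_key["id"], by_key["dimensions"])  # text-embedding-3-small 1536
print(by_key is by_id)                     # True
```

Raising on an unknown name, rather than falling back silently, keeps a typo in `model_name` from producing a misleading collection name.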
What Happens If You Don't Do This
| Scenario | What Goes Wrong |
|---|---|
| You switch from jina-v3 to openai-v3 but keep the same collection | Old jina vectors and new openai vectors get compared. Search returns nonsense. No error is raised. |
| You change dimensions from 1024 to 768 | Qdrant rejects the insert with a dimension-mismatch error. At least this one is loud. |
| You change chunk size from 512 to 128 | Short and long chunks end up competing in the same ranking, which can bias scores toward one granularity. Search quality degrades silently. |
| You switch from Cosine to Dot Product | For unnormalized vectors, Dot Product rewards magnitude while Cosine ignores it, so the same query can rank documents in a different order. For unit-length vectors the two rank identically. |
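The naming convention guards against these mistakes for every parameter it encodes: if the project, model, dimensions, chunk size, or search type changes, a new collection name is generated, and the old data is never touched. The naming format has no segment for the distance metric, so that setting has to be kept consistent by other means.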