Collection Naming Strategy

Why This Module Exists

When working with vector databases like Qdrant, the name of your collection is not cosmetic — it is a data integrity contract. Mixing incompatible vectors in the same collection silently corrupts your search results with no error or warning.

This module (collection_name.py) generates deterministic, self-documenting collection names that encode the project, embedding model, dimensions, chunk size, and search type. If any of these parameters changes, a new name is produced, which means a new collection, so incompatible vectors never end up side by side.


The Golden Rule of Vector Collections

You MUST create a new collection whenever the mathematical comparison between two vectors becomes invalid.

If you compare a vector from Model A with a vector from Model B, the result is meaningless garbage. The numbers look real, cosine similarity returns a value, but it means nothing. There is no error. There is no crash. Your app just silently returns wrong answers.

Therefore, incompatible vectors cannot live in the same collection.
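
To make the failure mode concrete, here is a minimal sketch in pure Python. The two "models" are hypothetical stand-ins (independent random projections of the same input features), not real embedding models, and all names are illustrative. The only point is that cosine similarity happily returns a number in both cases, with no error to tell you the cross-model comparison is meaningless.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_toy_model(seed, in_dim=32, out_dim=64):
    """Hypothetical stand-in for an embedding model: a fixed random projection."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def embed(features):
        return [sum(w * f for w, f in zip(row, features)) for row in weights]

    return embed


model_a = make_toy_model(seed=1)
model_b = make_toy_model(seed=2)

# Stands in for the content of one document.
feature_rng = random.Random(0)
text_features = [feature_rng.gauss(0, 1) for _ in range(32)]

# Same document, same model: similarity is (numerically) 1.
same_model = cosine(model_a(text_features), model_a(text_features))
# Same document, different models: a valid-looking number that carries no meaning.
cross_model = cosine(model_a(text_features), model_b(text_features))

print(f"same text, same model:       {same_model:.3f}")
print(f"same text, different models: {cross_model:.3f}")
```

Both calls succeed. Nothing in the second result signals that the two vectors were never comparable in the first place.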


The 4 Factors That Force a New Collection

| # | Factor | Why It Forces a New Collection | Priority |
|---|--------|--------------------------------|----------|
| 1 | Model | jina-v3 vectors are numerically unrelated to openai-v3 vectors. They live in completely different mathematical spaces. | 🔴 Critical |
| 2 | Dimensions | You cannot insert a 1024-dim vector into a collection built for 768-dim. Qdrant will reject it outright. | 🔴 Critical |
| 3 | Distance Metric | A collection using Cosine ranks results differently than one using Dot Product. Mixing them invalidates your ranking logic. | 🟠 High |
| 4 | Chunking Strategy | If Batch A was chunked by paragraphs and Batch B by sentences, search results become biased and inconsistent. Apples vs. oranges. | 🟡 Medium |
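
The distance-metric factor is easy to underestimate, so here is a small sketch in pure Python. The vectors and document names are made up for illustration. It shows that Cosine and Dot Product can rank the same two unnormalized vectors in opposite order, because Dot Product rewards magnitude while Cosine only looks at direction.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.0],  # same direction as the query, small magnitude
    "off_axis_long": [5.0, 5.0],  # 45 degrees off the query, large magnitude
}

by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
by_dot = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)

print(by_cosine)  # → ['aligned_short', 'off_axis_long']
print(by_dot)     # → ['off_axis_long', 'aligned_short']
```

If every vector is normalized to unit length, the two metrics produce the same ordering, which is why the problem only shows up with unnormalized embeddings.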

Naming Format

[PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE]

Example Breakdown

imam_reza_jina-v3_1024d_512t_hybrid
│         │       │     │    │
│         │       │     │    └── hybrid = sparse + dense vectors enabled
│         │       │     └── 512t = 512 tokens per chunk
│         │       └── 1024d = vector dimensionality
│         └── jina-v3 = embedding model
└── imam_reza = project / domain name

| Segment | Purpose |
|---------|---------|
| imam_reza | Project/Domain: keeps this data isolated from other apps or datasets. |
| jina-v3 | Model: the embedding brain. Changing models = new collection, always. |
| 1024d | Dimensions: tells you (and Qdrant) the vector size at a glance. |
| 512t | Chunk size: "512 tokens per chunk". If you later experiment with 128-token chunks, a separate collection lets you compare performance fairly. |
| hybrid | Search type: indicates sparse vectors are enabled alongside dense vectors. |
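
As a rough illustration of the format, the sketch below builds a name from its parts and parses it back. `build_name`, `parse_name`, and `NAME_PATTERN` are hypothetical helpers written for this README, not the actual contents of collection_name.py. The parser assumes the model segment contains no underscores (true for the examples in this document), while the project segment may.

```python
import re

# Assumption: the model segment has no underscores; the project segment may.
NAME_PATTERN = re.compile(
    r"^(?P<project>.+)_(?P<model>[^_]+)"
    r"_(?P<dimensions>\d+)d_(?P<chunk_size>\d+)t_(?P<type>hybrid|dense)$"
)


def build_name(project, model, dimensions, chunk_size, is_hybrid):
    """Assemble [PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE]."""
    search_type = "hybrid" if is_hybrid else "dense"
    return f"{project}_{model}_{dimensions}d_{chunk_size}t_{search_type}"


def parse_name(name):
    """Split a collection name back into its segments."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised collection name: {name!r}")
    parts = match.groupdict()
    parts["dimensions"] = int(parts["dimensions"])
    parts["chunk_size"] = int(parts["chunk_size"])
    return parts


name = build_name("imam_reza", "jina-v3", 1024, 512, is_hybrid=True)
print(name)  # → imam_reza_jina-v3_1024d_512t_hybrid

parts = parse_name(name)
print(parts["project"], parts["model"], parts["dimensions"], parts["chunk_size"], parts["type"])
# → imam_reza jina-v3 1024 512 hybrid
```

Because the name is a pure function of its inputs, changing any single argument yields a different string, and therefore a different collection.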

How It Works

get_collection_name() pulls values from two sources:

  1. config/embeddings.yaml — the active embedding model, its id, and dimensions.
  2. Environment variables (.env) — BASE_COLLECTION_NAME, EMBEDDER_DIMENSIONS, CHUNK_SIZE, IS_HYBRID.

Every parameter can also be overridden explicitly via function arguments.

Resolution Priority (per parameter)

| Parameter | 1st (highest) | 2nd | 3rd (fallback) |
|-----------|---------------|-----|----------------|
| project_name | Explicit arg | BASE_COLLECTION_NAME env var | "default_project" |
| model_name | Explicit arg | embeddings.yaml → default key | |
| dimensions | Explicit arg | embeddings.yaml → model config | EMBEDDER_DIMENSIONS env var |
| chunk_size | Explicit arg | CHUNK_SIZE env var | 500 |
| is_hybrid | Explicit arg | IS_HYBRID env var | true |
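
The "explicit arg, then env var, then fallback" chain can be sketched as follows for two of the parameters. This is an illustration of the priority order only, not the module's real code: the function names, the `env` parameter (added so the example can run without touching real environment variables), and the set of strings treated as true are all assumptions.

```python
import os


def resolve_chunk_size(chunk_size=None, env=None):
    """Explicit arg, then CHUNK_SIZE env var, then the default of 500."""
    env = os.environ if env is None else env
    if chunk_size is not None:
        return int(chunk_size)
    raw = env.get("CHUNK_SIZE")
    if raw:
        return int(raw)
    return 500


def resolve_is_hybrid(is_hybrid=None, env=None):
    """Explicit arg, then IS_HYBRID env var, then the default of True."""
    env = os.environ if env is None else env
    if is_hybrid is not None:
        return bool(is_hybrid)
    raw = env.get("IS_HYBRID")
    if raw is not None and raw.strip():
        # Assumption: which strings count as true is not specified in this README.
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return True


fake_env = {"CHUNK_SIZE": "256", "IS_HYBRID": "false"}

print(resolve_chunk_size(128, env=fake_env))  # → 128 (explicit arg wins)
print(resolve_chunk_size(env=fake_env))       # → 256 (env var)
print(resolve_chunk_size(env={}))             # → 500 (fallback)
print(resolve_is_hybrid(env=fake_env))        # → False (env var)
print(resolve_is_hybrid(env={}))              # → True (fallback)
```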

Model Lookup

The model_name parameter is flexible — it accepts either:

  • A config key from embeddings.yaml (e.g., "jina_AI")
  • A model id (e.g., "jina-embeddings-v4")
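
A lookup that accepts either form might look like the sketch below. The dictionary is a hypothetical in-memory mirror of config/embeddings.yaml; its layout (`default`, `models`, `id`, `dimensions`) is assumed for illustration, and `find_model` is not the module's real function. The keys, ids, and dimensions are taken from the examples in this document.

```python
# Hypothetical mirror of config/embeddings.yaml; the real file layout may differ.
EMBEDDINGS_CONFIG = {
    "default": "jina_AI",
    "models": {
        "jina_AI": {"id": "jina-embeddings-v4", "dimensions": 1024},
        "openai_small": {"id": "text-embedding-3-small", "dimensions": 1536},
    },
}


def find_model(model_name=None, config=EMBEDDINGS_CONFIG):
    """Resolve a config key or a model id to its config entry."""
    models = config["models"]
    wanted = model_name or config["default"]

    # First try: treat the value as a config key.
    if wanted in models:
        return models[wanted]

    # Second try: treat the value as a model id.
    for entry in models.values():
        if entry["id"] == wanted:
            return entry

    raise KeyError(f"unknown embedding model: {wanted!r}")


print(find_model("jina_AI")["id"])                    # → jina-embeddings-v4
print(find_model("jina-embeddings-v4")["dimensions"]) # → 1024
print(find_model()["id"])                             # → jina-embeddings-v4
```

Checking config keys before model ids keeps the lookup unambiguous if a key and an id ever happen to share a string.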

Usage

from src.utils.collection_name import get_collection_name

# All defaults from config + env
name = get_collection_name()
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_hybrid"

# Override chunk size
name = get_collection_name(chunk_size=128)
# → "dovodi_collection_jina-embeddings-v4_1024d_128t_hybrid"

# Dense-only collection
name = get_collection_name(is_hybrid=False)
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_dense"

# Different model from embeddings.yaml
name = get_collection_name(model_name="openai_small")
# → "dovodi_collection_text-embedding-3-small_1536d_500t_hybrid"

What Happens If You Don't Do This

| Scenario | What Goes Wrong |
|----------|-----------------|
| You switch from jina-v3 to openai-v3 but keep the same collection | Old jina vectors and new openai vectors get compared. Search returns nonsense. No error is raised. |
| You change dimensions from 1024 to 768 | Qdrant rejects the insert with a dimension-mismatch error. At least this one is loud. |
| You change chunk size from 512 to 128 | Short chunks can score disproportionately high against long chunks in the same ranking. Search quality degrades silently. |
| You switch from Cosine to Dot Product | Rankings change for unnormalized vectors, because Dot Product rewards vector magnitude and Cosine ignores it. Results that ranked first can drop down the list. |

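The naming convention guards against these mistakes for every parameter it encodes: if the project, model, dimensions, chunk size, or search type changes, a new collection name is generated and the old data is never touched. The distance metric is not part of the name format shown above, so a metric change still calls for a deliberately created new collection.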