# Collection Naming Strategy

## Why This Module Exists

When working with vector databases like Qdrant, **the name of your collection is not cosmetic: it is a data integrity contract.** Mixing incompatible vectors in the same collection silently corrupts your search results, with no error or warning. This module (`collection_name.py`) generates deterministic, self-documenting collection names that encode every critical parameter. If any parameter changes, a new name is produced, which means a new collection, preventing data corruption automatically.

---

## The Golden Rule of Vector Collections

> **You MUST create a new collection whenever the mathematical comparison between two vectors becomes invalid.**

If you compare a vector from **Model A** with a vector from **Model B**, the result is **meaningless garbage**. The numbers look real and cosine similarity returns a value, but it means nothing. There is no error. There is no crash. Your app just silently returns wrong answers.

Therefore, incompatible vectors **cannot live in the same collection.**

---

## The 4 Factors That Force a New Collection

| # | Factor | Why It Forces a New Collection | Priority |
|---|--------|-------------------------------|----------|
| 1 | **Model** | `jina-v3` vectors are numerically unrelated to `openai-v3` vectors. They live in completely different mathematical spaces. | :red_circle: Critical |
| 2 | **Dimensions** | You cannot insert a 1024-dim vector into a collection built for 768-dim vectors. Qdrant will reject it outright. | :red_circle: Critical |
| 3 | **Distance Metric** | A collection using `Cosine` ranks results differently than one using `Dot Product`. Mixing them invalidates your ranking logic. | :orange_circle: High |
| 4 | **Chunking Strategy** | If Batch A was chunked by paragraphs and Batch B by sentences, search results become biased and inconsistent. An apples-to-oranges comparison. | :yellow_circle: Medium |

---

## Naming Format

```
[PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE]
```

### Example Breakdown

```
imam_reza_jina-v3_1024d_512t_hybrid
│         │       │     │    │
│         │       │     │    └── hybrid = sparse + dense vectors enabled
│         │       │     └── 512t = 512 tokens per chunk
│         │       └── 1024d = vector dimensionality
│         └── jina-v3 = embedding model
└── imam_reza = project / domain name
```

| Segment | Purpose |
|---------|---------|
| `imam_reza` | **Project/Domain**: keeps this data isolated from other apps or datasets. |
| `jina-v3` | **Model**: the embedding brain. Changing models means a new collection, always. |
| `1024d` | **Dimensions**: tells you (and Qdrant) the vector size at a glance. |
| `512t` | **Chunk size**: 512 tokens per chunk. If you later experiment with 128-token chunks, a separate collection lets you compare performance fairly. |
| `hybrid` | **Search type**: indicates sparse vectors are enabled alongside dense vectors. |

---

## How It Works

`get_collection_name()` pulls values from two sources:

1. **`config/embeddings.yaml`**: the active embedding model, its `id`, and `dimensions`.
2. **Environment variables** (`.env`): `BASE_COLLECTION_NAME`, `EMBEDDER_DIMENSIONS`, `CHUNK_SIZE`, `IS_HYBRID`.

Every parameter can also be overridden explicitly via function arguments.
### Resolution Priority (per parameter)

| Parameter | 1st (highest) | 2nd | 3rd (fallback) |
|-----------|---------------|-----|----------------|
| `project_name` | Explicit arg | `BASE_COLLECTION_NAME` env var | `"default_project"` |
| `model_name` | Explicit arg | `embeddings.yaml` → `default` key | — |
| `dimensions` | Explicit arg | `embeddings.yaml` → model config | `EMBEDDER_DIMENSIONS` env var |
| `chunk_size` | Explicit arg | `CHUNK_SIZE` env var | `500` |
| `is_hybrid` | Explicit arg | `IS_HYBRID` env var | `true` |

### Model Lookup

The `model_name` parameter is flexible; it accepts either:

- A **config key** from `embeddings.yaml` (e.g., `"jina_AI"`)
- A **model id** (e.g., `"jina-embeddings-v4"`)

---

## Usage

```python
from src.utils.collection_name import get_collection_name

# All defaults from config + env
name = get_collection_name()
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_hybrid"

# Override chunk size
name = get_collection_name(chunk_size=128)
# → "dovodi_collection_jina-embeddings-v4_1024d_128t_hybrid"

# Dense-only collection
name = get_collection_name(is_hybrid=False)
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_dense"

# Different model from embeddings.yaml
name = get_collection_name(model_name="openai_small")
# → "dovodi_collection_text-embedding-3-small_1536d_500t_hybrid"
```

---

## What Happens If You Don't Do This

| Scenario | What Goes Wrong |
|----------|-----------------|
| You switch from `jina-v3` to `openai-v3` but keep the same collection | Old jina vectors and new openai vectors get compared. Search returns nonsense. **No error is raised.** |
| You change dimensions from 1024 to 768 | Qdrant rejects the insert with a dimension-mismatch error. At least this failure is loud. |
| You change chunk size from 512 to 128 | Short chunks get unfairly boosted in similarity scores against long chunks. Search quality degrades silently. |
| You switch from Cosine to Dot Product | Rankings diverge for unnormalized vectors: top results can drop to the bottom. |

The naming convention makes these mistakes **structurally impossible**: if any parameter changes, a new collection name is generated, and the old data is never touched.
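To make the "structurally impossible" claim concrete, the sketch below (a hypothetical local helper, not the module's code) shows that changing any single parameter produces a distinct name, so writes after a config change can never land in the old collection:

```python
def build_name(project: str, model_id: str, dims: int, chunk: int, hybrid: bool) -> str:
    # Same [PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE] format as above.
    return f"{project}_{model_id}_{dims}d_{chunk}t_{'hybrid' if hybrid else 'dense'}"

old = build_name("imam_reza", "jina-v3", 1024, 512, True)
# → "imam_reza_jina-v3_1024d_512t_hybrid"

# Switching the embedding model (and its dimensions) changes the name,
# and therefore the collection:
new = build_name("imam_reza", "openai-v3", 1536, 512, True)
assert old != new  # incompatible vectors can never share a collection
```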