# Collection Naming Strategy

## Why This Module Exists

When working with vector databases like Qdrant, **the name of your collection is not cosmetic: it is a data integrity contract.** Mixing incompatible vectors in the same collection silently corrupts your search results, with no error or warning. This module (`collection_name.py`) generates deterministic, self-documenting collection names that encode every critical parameter. If any parameter changes, a new name is produced, which means a new collection, preventing data corruption automatically.

---

## The Golden Rule of Vector Collections

> **You MUST create a new collection whenever the mathematical comparison between two vectors becomes invalid.**

If you compare a vector from **Model A** with a vector from **Model B**, the result is **meaningless garbage**. The numbers look real and cosine similarity returns a value, but it means nothing. There is no error. There is no crash. Your app just silently returns wrong answers.

Therefore, incompatible vectors **cannot live in the same collection.**

---

## The 4 Factors That Force a New Collection

| # | Factor | Why It Forces a New Collection | Priority |
|---|--------|-------------------------------|----------|
| 1 | **Model** | `jina-v3` vectors are numerically unrelated to `openai-v3` vectors. They live in completely different mathematical spaces. | :red_circle: Critical |
| 2 | **Dimensions** | You cannot insert a 1024-dim vector into a collection built for 768-dim vectors. Qdrant will reject it outright. | :red_circle: Critical |
| 3 | **Distance Metric** | A collection using `Cosine` ranks results differently than one using `Dot Product`. Mixing them invalidates your ranking logic. | :orange_circle: High |
| 4 | **Chunking Strategy** | If Batch A was chunked by paragraphs and Batch B by sentences, search results become biased and inconsistent. An apples-to-oranges comparison. | :yellow_circle: Medium |

---

## Naming Format

```
[PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE]
```

### Example Breakdown

```
imam_reza_jina-v3_1024d_512t_hybrid
│         │       │     │    │
│         │       │     │    └── hybrid = sparse + dense vectors enabled
│         │       │     └── 512t = 512 tokens per chunk
│         │       └── 1024d = vector dimensionality
│         └── jina-v3 = embedding model
└── imam_reza = project / domain name
```

| Segment | Purpose |
|---------|---------|
| `imam_reza` | **Project/Domain**: keeps this data isolated from other apps or datasets. |
| `jina-v3` | **Model**: the embedding brain. Changing models means a new collection, always. |
| `1024d` | **Dimensions**: tells you (and Qdrant) the vector size at a glance. |
| `512t` | **Chunk size**: 512 tokens per chunk. If you later experiment with 128-token chunks, a separate collection lets you compare performance fairly. |
| `hybrid` | **Search type**: indicates sparse vectors are enabled alongside dense vectors. |

---

## How It Works

`get_collection_name()` pulls values from two sources:

1. **`config/embeddings.yaml`**: the active embedding model, its `id`, and `dimensions`.
2. **Environment variables** (`.env`): `BASE_COLLECTION_NAME`, `EMBEDDER_DIMENSIONS`, `CHUNK_SIZE`, `IS_HYBRID`.

Every parameter can also be overridden explicitly via function arguments.
### Resolution Priority (per parameter)

| Parameter | 1st (highest) | 2nd | 3rd (fallback) |
|-----------|---------------|-----|----------------|
| `project_name` | Explicit arg | `BASE_COLLECTION_NAME` env var | `"default_project"` |
| `model_name` | Explicit arg | `embeddings.yaml` → `default` key | — |
| `dimensions` | Explicit arg | `embeddings.yaml` → model config | `EMBEDDER_DIMENSIONS` env var |
| `chunk_size` | Explicit arg | `CHUNK_SIZE` env var | `500` |
| `is_hybrid` | Explicit arg | `IS_HYBRID` env var | `true` |

### Model Lookup

The `model_name` parameter is flexible; it accepts either:

- A **config key** from `embeddings.yaml` (e.g., `"jina_AI"`)
- A **model id** (e.g., `"jina-embeddings-v4"`)

---

## Usage

```python
from src.utils.collection_name import get_collection_name

# All defaults from config + env
name = get_collection_name()
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_hybrid"

# Override chunk size
name = get_collection_name(chunk_size=128)
# → "dovodi_collection_jina-embeddings-v4_1024d_128t_hybrid"

# Dense-only collection
name = get_collection_name(is_hybrid=False)
# → "dovodi_collection_jina-embeddings-v4_1024d_500t_dense"

# Different model from embeddings.yaml
name = get_collection_name(model_name="openai_small")
# → "dovodi_collection_text-embedding-3-small_1536d_500t_hybrid"
```

---

## What Happens If You Don't Do This

| Scenario | What Goes Wrong |
|----------|-----------------|
| You switch from `jina-v3` to `openai-v3` but keep the same collection | Old jina vectors and new openai vectors get compared. Search returns nonsense. **No error is raised.** |
| You change dimensions from 1024 to 768 | Qdrant rejects the insert with a dimension-mismatch error. At least this failure is loud. |
| You change chunk size from 512 to 128 | Short chunks get unfairly boosted in similarity scores against long chunks. Search quality degrades silently. |
| You switch from Cosine to Dot Product | Rankings diverge for unnormalized vectors: top results can drop to the bottom. |

The naming convention makes these mistakes **structurally impossible**: if any parameter changes, a new collection name is generated, and the old data is never touched.
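To make the "structurally impossible" claim concrete, the sketch below (a hypothetical local helper, not the module's code) shows that changing any single parameter produces a distinct name, so writes after a config change can never land in the old collection:

```python
def build_name(project: str, model_id: str, dims: int, chunk: int, hybrid: bool) -> str:
    # Same [PROJECT]_[MODEL]_[DIMENSIONS]d_[CHUNK_SIZE]t_[TYPE] format as above.
    return f"{project}_{model_id}_{dims}d_{chunk}t_{'hybrid' if hybrid else 'dense'}"

old = build_name("imam_reza", "jina-v3", 1024, 512, True)
# → "imam_reza_jina-v3_1024d_512t_hybrid"

# Switching the embedding model (and its dimensions) changes the name,
# and therefore the collection:
new = build_name("imam_reza", "openai-v3", 1536, 512, True)
assert old != new  # incompatible vectors can never share a collection
```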