# Incremental Vector Synchronization

**PostgreSQL → Qdrant | Database-Driven Incremental Embedding Pipeline**

---

## Overview

This feature introduces a fully automated, admin-triggered pipeline that synchronizes textual data from a relational database (PostgreSQL / Django ORM) to a vector database (Qdrant). The system guarantees **zero duplicate embeddings**, **crash resilience**, and seamless **A/B testing** across different embedding models, all without manual file uploads.

| Property           | Value                                 |
| ------------------ | ------------------------------------- |
| **Source**         | PostgreSQL (Django Models)            |
| **Target**         | Qdrant Vector Database                |
| **State Tracking** | JSONB Arrays (`embedded_in` column)   |
| **Deduplication**  | Deterministic Hash ID via MD5         |
| **Trigger**        | Admin UI → FastAPI Background Task    |

---

## Architecture

```
┌──────────────────────┐        HTTP POST        ┌──────────────────────────┐
│   Django Admin UI    │  ────────────────────▸  │      FastAPI Agent       │
│  (EmbeddingSession)  │                         │   /api/sync-knowledge    │
└──────────────────────┘                         └────────────┬─────────────┘
         ▲                                                    │
         │ progress updates                            BackgroundTask
         │ (status, %, items)                                 │
         │                                                    ▼
┌────────┴─────────────┐                         ┌──────────────────────────┐
│      PostgreSQL      │ ◂────── read/update ──▸ │  run_global_embedding_   │
│ embedded_in (JSONB)  │                         │     sync(session_id)     │
└──────────────────────┘                         └────────────┬─────────────┘
                                                              │
                                                       embed + upsert
                                                              │
                                                              ▼
                                                 ┌──────────────────────────┐
                                                 │     Qdrant Vector DB     │
                                                 │(hybrid search collection)│
                                                 └──────────────────────────┘
```

---

## How It Works

### 1. State Tracking via JSONB

Every syncable Django model (e.g. `Article`, `Hadith`) includes an `embedded_in` column:

```python
# Django model field
embedded_in = models.JSONField(default=list)
# Example value: ["dovodi_jina-embeddings-v4_hybrid", "dovodi_text-embedding-3-small_hybrid"]
```

**Why an array instead of a boolean?** A simple `is_embedded = True` flag cannot distinguish *which* model the row was embedded into.
By storing the exact collection name, the system natively supports multiple embedding models simultaneously. Switching the active model in `config/embeddings.yaml` from `jina_AI` to `openai_small` causes every row to be detected as "missing" for the new collection, which triggers a full, automatic re-sync.

**Self-healing on update:** When an admin edits the text of a record in Django admin, the overridden `save()` method resets `embedded_in` to `[]`, guaranteeing the updated content is picked up in the next sync cycle.

### 2. The Trigger Mechanism (Django → FastAPI)

A dedicated `EmbeddingSession` model in Django tracks the lifecycle of each sync operation:

| Field             | Purpose                                           |
| ----------------- | ------------------------------------------------- |
| `status`          | `PENDING` / `PROCESSING` / `COMPLETED` / `FAILED` |
| `total_items`     | Total rows needing embedding                      |
| `processed_items` | Rows embedded so far                              |
| `progress`        | Percentage (0–100)                                |
| `error_message`   | Captured exception on failure                     |

When the admin saves a new session, Django sends a **non-blocking** `HTTP POST` to the FastAPI agent:

```
POST /api/sync-knowledge
Body: { "session_id": 42 }
```

The FastAPI route (`src/api/routes.py`) receives this and delegates to a `BackgroundTask`:

```python
@router.post("/api/sync-knowledge")
async def sync_knowledge(request: SyncRequest, background_tasks: BackgroundTasks):
    # Schedule the sync to run after the response is sent, so the request returns immediately.
    background_tasks.add_task(run_global_embedding_sync, request.session_id)
    return {"status": "started", "session_id": request.session_id}
```

The Django admin UI renders a **live Tailwind progress bar** powered by `processed_items / total_items`.

### 3. The Smart Background Worker

The core logic lives in `src/knowledge/sync_rag.py` → `run_global_embedding_sync(session_id)`:

**Step A: Resolve the active model:**

```python
embed_factory = EmbeddingFactory()        # reads config/embeddings.yaml
embedder = embed_factory.get_embedder()   # returns the default embedder
vector_db = get_qdrant_store(embedder=embedder)
active_collection_name = vector_db.collection
```

The collection name is dynamically composed as `{BASE_COLLECTION_NAME}_{model_id}_hybrid` (e.g. `dovodi_jina-embeddings-v4_hybrid`).

**Step B: Query only missing rows:**

Using PostgreSQL's native JSONB containment operator (`@>`), the system fetches *only* the rows whose `embedded_in` array does **not** contain the active collection name:

```sql
SELECT * FROM article_article
WHERE NOT (CAST(embedded_in AS jsonb) @> CAST('["dovodi_jina-embeddings-v4_hybrid"]' AS jsonb))
```

This is the key to incremental sync: already-embedded rows are never touched.

**Step C: Process, embed, upsert:**

For each pending row, the worker:

1. Formats the record into a continuous string (`TITLE: ... \n CONTENT: ...`)
2. Generates a **deterministic Qdrant ID** (see below)
3. Wraps it in an Agno `Document` and calls `vector_db.upsert()`
4. Updates the row's `embedded_in` array in PostgreSQL
5. Reports progress to the `EmbeddingSession` every 5 items

### 4. Deterministic Deduplication

To prevent duplicate vectors when resuming a crashed sync or re-embedding updated text, auto-generated UUIDs are **not used**. Instead, the system computes a deterministic ID:

```python
import hashlib
import uuid

hash_input = f"{data_type}_{row.id}_{active_collection_name}"
# e.g. "ARTICLE_42_dovodi_jina-embeddings-v4_hybrid"
hash_id = hashlib.md5(hash_input.encode()).hexdigest()
qdrant_id = str(uuid.UUID(hash_id))  # valid UUID from MD5
```

This ID is passed as `content_hash` to `vector_db.upsert()`, which causes Qdrant to **overwrite** any existing point with the same ID rather than creating a duplicate.

### 5. Closing the Loop

After a successful upsert, the worker appends the active collection name to the row's `embedded_in` array:

```sql
UPDATE article_article
SET embedded_in = CAST(embedded_in AS jsonb) || CAST('["dovodi_jina-embeddings-v4_hybrid"]' AS jsonb)
WHERE id = 42
```

Once all tables are processed, the session status is set to `COMPLETED`. If any exception occurs, the status is set to `FAILED` with the error message captured.

---

## File Reference

| File | Responsibility |
| --- | --- |
| `src/knowledge/sync_rag.py` | Core sync logic: queries missing rows, embeds, upserts, updates state |
| `src/api/routes.py` | FastAPI endpoint `/api/sync-knowledge` that triggers the background task |
| `src/main.py` | Application factory; includes the API router |
| `src/knowledge/embedding_factory.py` | Reads `config/embeddings.yaml` and instantiates the correct embedder |
| `src/knowledge/vector_store.py` | Builds the Qdrant client with dynamic collection naming |
| `config/embeddings.yaml` | Declares available embedding models and the active default |

---

## Configuration

The active embedding model is controlled by `config/embeddings.yaml`:

```yaml
embeddings:
  default: jina_AI  # switch to "openai_small" to trigger full re-sync
  models:
    openai_small:
      provider: "openai"
      id: "text-embedding-3-small"
      dimensions: 1536
      api_key: ${OPENAI_API_KEY}
    jina_AI:
      provider: "jinaai"
      id: "jina-embeddings-v4"
      dimensions: 1024
      api_key: ${JINA_API_KEY}
```

Changing `default` from one model to another is all that is needed: the sync worker will detect that no rows contain the new collection name and re-embed everything.

---

## Why This Design is Production-Ready

| Property | Detail |
| --- | --- |
| **Zero data duplication** | Deterministic MD5 hashing ensures the same row always maps to the same Qdrant point ID. Re-running sync never creates duplicates. |
| **Seamless model switching** | Changing the embedding model in YAML causes the system to treat all rows as "missing" for the new collection. No manual migration needed. |
| **Crash resilience** | State is committed row-by-row to PostgreSQL. If the server crashes at 99%, the next sync run picks up only the remaining 1%. |
| **Cost efficiency** | Only un-embedded rows hit the AI API. Already-processed data is never re-sent, saving API costs. |
| **Non-blocking UI** | The Django admin fires a non-blocking HTTP call and renders live progress, so the UI never freezes. |
| **Multi-table support** | The `tables_to_sync` list can be extended with new Django models without touching the core algorithm. |
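
The duplicate-free and crash-resilience properties listed above can be illustrated with a small in-memory sketch. This is not the project's worker: a plain `dict` stands in for the Qdrant collection, rows are dictionaries rather than Django models, and `deterministic_point_id`, `sync`, and `fail_at` are hypothetical names introduced only for this example. The ID formula is the only part taken from the document.

```python
import hashlib
import uuid


def deterministic_point_id(data_type: str, row_id: int, collection: str) -> str:
    """Derive a stable UUID string from the row identity and the target collection."""
    digest = hashlib.md5(f"{data_type}_{row_id}_{collection}".encode()).hexdigest()
    return str(uuid.UUID(hex=digest))


def sync(rows, store, collection, data_type="ARTICLE", fail_at=None):
    """Process every row not yet marked for `collection`.

    `fail_at` (hypothetical) simulates a crash when that row id is reached.
    Returns the number of rows that were pending at the start of the run.
    """
    pending = [r for r in rows if collection not in r["embedded_in"]]
    for row in pending:
        if row["id"] == fail_at:
            raise RuntimeError(f"simulated crash at row {row['id']}")
        point_id = deterministic_point_id(data_type, row["id"], collection)
        store[point_id] = row["text"]          # upsert: the same ID overwrites
        row["embedded_in"].append(collection)  # state is recorded per row
    return len(pending)


collection = "dovodi_jina-embeddings-v4_hybrid"
rows = [{"id": i, "text": f"text {i}", "embedded_in": []} for i in range(1, 6)]
store = {}  # stand-in for a Qdrant collection: point ID -> payload

# First run crashes when it reaches row 4, after rows 1-3 are stored.
try:
    sync(rows, store, collection, fail_at=4)
except RuntimeError:
    pass
points_after_crash = len(store)

resumed = sync(rows, store, collection)  # only rows 4 and 5 are still pending
rerun = sync(rows, store, collection)    # nothing left to process

# Editing a row resets its state; the re-sync overwrites the existing point.
rows[0]["text"] = "edited text"
rows[0]["embedded_in"] = []
after_edit = sync(rows, store, collection)
```

Because the point ID depends only on the data type, row id, and collection name, a crash between the upsert and the state update would at worst cause one row to be embedded again and written over its own point, not duplicated.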