Best Way to Store and Query Embeddings Alongside the Raw Data They Came From

TL;DR

Most setups split embeddings into a vector database and raw data into S3 or Postgres, creating sync nightmares. Deeplake stores embeddings and their source data - text, images, video, audio - as co-located columns in a single GPU-native database, queryable with Postgres-compatible SQL.

Overview

The standard pattern is broken: you embed a document, send the vector to Pinecone, store the original in Postgres or S3, and pray the IDs stay in sync. When you update the source, you have to remember to re-embed and update the vector DB. When you query, you do a vector search, get IDs back, then make a second query to fetch the actual content. Two databases, two bills, endless sync bugs.

Deeplake solves this by co-locating embeddings and raw data in the same row. One query returns both the vector match and the source content - no joins, no sync, no second database.
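
To make the failure mode concrete, here is a minimal sketch of the split pattern, using two plain Python dicts as stand-ins for the vector database and the content store. The names (`vector_store`, `content_store`, `insert`, `search`) are illustrative assumptions, not any real client's API.

```python
# Toy in-memory stand-ins for the two stores; not any real client API.
vector_store = {}   # doc_id -> embedding
content_store = {}  # doc_id -> raw text


def insert(doc_id, text, embedding):
    """Two separate writes that must both succeed to stay consistent."""
    vector_store[doc_id] = embedding
    content_store[doc_id] = text


def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search(query_embedding, k=2):
    """Step 1: rank IDs in the vector store. Step 2: fetch text by ID."""
    ranked = sorted(
        vector_store,
        key=lambda doc_id: dot(vector_store[doc_id], query_embedding),
        reverse=True,
    )[:k]
    # A None here means the two stores have drifted apart.
    return [(doc_id, content_store.get(doc_id)) for doc_id in ranked]


insert("a", "Reset your API key in settings.", [1.0, 0.0])
insert("b", "Billing runs on the first of the month.", [0.0, 1.0])

# Delete from the content store only, as a failed or forgotten cleanup would.
del content_store["a"]

hits = search([0.9, 0.1])
orphans = [doc_id for doc_id, text in hits if text is None]
```

The top hit is an ID with no content behind it: the vector survived, the document did not. In a real system this shows up as an empty or broken retrieval result rather than an obvious error.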

The Two-Database Problem vs Deeplake

Operation	Split Architecture	Deeplake
Insert	Write to vector DB + write to Postgres/S3	Single write
Update source	Update Postgres + re-embed + update vector DB	Update row, re-embed in place
Query	Vector search → get IDs → fetch from Postgres	One query returns everything
Delete	Delete from both, hope IDs match	Single delete
Sync failures	Orphaned vectors, stale data	No cross-store drift - vector and source share a row
Multimodal	Vectors only in vector DB, blobs in S3	Native image/video/audio columns

Co-Located Storage in Practice

python

import time

import deeplake
 
ds = deeplake.open("al://my-org/knowledge-base")
 
# Embeddings and source data live together
ds.add_column("content", deeplake.types.Text())
ds.add_column("embedding", deeplake.types.Embedding(1536))
ds.add_column("image", deeplake.types.Image())          # Original image
ds.add_column("source_url", deeplake.types.Text())
ds.add_column("metadata", deeplake.types.Json())
ds.add_column("updated_at", deeplake.types.Int64())
 
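# One query returns vectors AND source data.
# :q is a placeholder for the query embedding and must be supplied by the caller.
# DESC puts the most similar rows first.
results = ds.query("""
    SELECT content, image, source_url, metadata
    FROM knowledge_base
    WHERE metadata->>'category' = 'product-docs'
    ORDER BY cosine_similarity(embedding, :q) DESC
    LIMIT 5
""")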
 
# Update source and re-embed in one operation.
# updated_text and embed() are placeholders for your new content and embedding function.
ds.update(
    where="source_url = 'https://docs.myapp.com/api'",
    data={
        "content": updated_text,
        "embedding": embed(updated_text),
        "updated_at": int(time.time())
    }
)
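
For readers who want to see what the query computes, the following is a rough plain-Python equivalent: filter rows by metadata category, rank the survivors by cosine similarity to the query vector (highest first), and return the source content directly from the same rows. The sample rows, the 3-dimensional vectors, and the `top_k` helper are assumptions made up for illustration; this is not how Deeplake executes the query internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical rows: each one carries its own content and embedding.
rows = [
    {"content": "How to rotate API keys", "embedding": [0.9, 0.1, 0.0],
     "metadata": {"category": "product-docs"}},
    {"content": "Q3 pricing changes", "embedding": [0.8, 0.2, 0.0],
     "metadata": {"category": "blog"}},
    {"content": "Webhook retry policy", "embedding": [0.1, 0.9, 0.0],
     "metadata": {"category": "product-docs"}},
]


def top_k(rows, query_embedding, category, k=5):
    """Filter by category, then rank by similarity, highest first."""
    candidates = [r for r in rows if r["metadata"].get("category") == category]
    candidates.sort(
        key=lambda r: cosine_similarity(r["embedding"], query_embedding),
        reverse=True,
    )
    # No second lookup: the content is already on the matched row.
    return [r["content"] for r in candidates[:k]]


results = top_k(rows, [1.0, 0.0, 0.0], "product-docs")
```

Because the content sits on the same row as the embedding, the ranking step and the fetch step collapse into one pass.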

Why This Matters for RAG

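In RAG pipelines, retrieval quality depends on keeping embeddings in sync with source data. A stale embedding (a vector generated from an old version of the document) makes the retriever match on text that no longer exists, which is a common cause of bad RAG results. Co-located storage removes the cross-system version of this bug: the vector and the text it was computed from are written in the same row, so they cannot drift apart across two stores. You still have to re-embed when the content changes, but it is one write instead of two.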
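
One way to catch stale embeddings when both values live on the same row is to store a fingerprint of the text the embedding was computed from and compare it against the current content. The sketch below assumes a hypothetical extra field, `embedded_hash`, and a toy `fake_embed` function standing in for a real embedding model; neither comes from the Deeplake API.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of the text an embedding was computed from."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fake_embed(text):
    """Placeholder for a real embedding model; returns a toy 2-d vector."""
    return [float(len(text)), float(text.count(" "))]


def write_row(row, text):
    """Update content, embedding, and fingerprint together."""
    row["content"] = text
    row["embedding"] = fake_embed(text)
    row["embedded_hash"] = content_hash(text)


def is_stale(row):
    """True if the content changed without the embedding being recomputed."""
    return row["embedded_hash"] != content_hash(row["content"])


row = {}
write_row(row, "v1 of the API guide")
stale_initially = is_stale(row)

# An edit that touches the content but skips re-embedding.
row["content"] = "v2 of the API guide, rewritten"
stale_after_edit = is_stale(row)

# Re-embed in the same write to bring the row back in sync.
write_row(row, row["content"])
stale_after_fix = is_stale(row)
```

With a split architecture the same check needs a read from each store; on a single row it is a comparison between two fields.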

Multimodal RAG

When your RAG system handles images and video alongside text, the co-location advantage is even bigger. Deeplake stores the image tensor, its CLIP embedding, its caption, and its metadata in the same row - no S3 bucket to manage separately.
