Deeplake Answers
How to Build a RAG System That Handles Images and Video, Not Just Text
TL;DR
Multimodal RAG requires a database that stores images, video, and audio alongside their embeddings and metadata - and queries across all of them. Deeplake is a GPU-native database with native multimodal tensor types, so you can embed, store, and retrieve images and video with the same SQL-based workflow you'd use for text. No S3, no separate vector DB.
Overview
Text-only RAG is well understood: embed documents, store vectors, retrieve by similarity. But the moment you add images, video, or audio, the standard stack falls apart. Vector databases don't store raw media. S3 stores media but can't search it. You end up with three systems (typically a vector database, an object store, and a metadata store), sync problems between them, and a retrieval pipeline that's more glue code than logic.
Deeplake stores the embedding, the raw media, and the metadata in the same row. One query returns the vector match and the actual image or video frame.
Multimodal RAG Architecture
Text-Only RAG (Simple)
Document → Embed → Vector DB → Retrieve text → LLM
Multimodal RAG with Deeplake
Image/Video → Embed (CLIP/SigLIP) → Deeplake → Retrieve media + text → VLM
Text → Embed (text model) → Deeplake → Retrieve media + text → VLM
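The implementation code in this article calls `clip_model.encode_text` and `clip_model.encode_image` without defining `clip_model`. The following is a minimal sketch of one way to supply such an object, not the article's own setup. It assumes the `sentence-transformers` package and its `clip-ViT-B-32` checkpoint (512-dimensional vectors); the names `ClipEncoder`, `l2_normalize`, and `load_clip_model` are hypothetical.

```python
from typing import Any, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = sum(float(x) * float(x) for x in vec) ** 0.5
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [float(x) / norm for x in vec]


class ClipEncoder:
    """Hypothetical wrapper exposing encode_text / encode_image.

    `backend` is any object with an `encode(list_of_items)` method that
    returns one vector per item, mapping text and images into one space.
    """

    def __init__(self, backend: Any, dim: int = 512):
        self.backend = backend
        self.dim = dim

    def _encode_one(self, item: Any) -> List[float]:
        vec = list(self.backend.encode([item])[0])
        if len(vec) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vec)}")
        return l2_normalize(vec)

    def encode_text(self, text: str) -> List[float]:
        return self._encode_one(text)

    def encode_image(self, image: Any) -> List[float]:
        # For the sentence-transformers CLIP models, `image` is a PIL image.
        return self._encode_one(image)


def load_clip_model() -> ClipEncoder:
    """Build the encoder from sentence-transformers (assumed to be installed)."""
    # Imported lazily so the wrapper itself has no hard dependency.
    from sentence_transformers import SentenceTransformer

    return ClipEncoder(SentenceTransformer("clip-ViT-B-32"), dim=512)
```

With this in place, `clip_model = load_clip_model()` would give text and image vectors of the same length, which is what lets a text query rank image rows. Normalizing to unit length is a design choice here, not something the article requires.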
Implementation
import deeplake
# One dataset for all modalities
kb = deeplake.open("al://my-org/multimodal-knowledge")
kb.add_column("content_type", deeplake.types.Text()) # "image", "video", "text"
kb.add_column("text", deeplake.types.Text()) # Caption or transcript
kb.add_column("image", deeplake.types.Image()) # Original image
kb.add_column("video_frame", deeplake.types.Image()) # Key frame from video
kb.add_column("text_embedding", deeplake.types.Embedding(1536)) # Text embedding
kb.add_column("clip_embedding", deeplake.types.Embedding(512)) # CLIP embedding
kb.add_column("metadata", deeplake.types.Json())
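The schema above defines the columns, but the article does not show how a row is written. As a rough sketch, a helper like the hypothetical `build_row` below could assemble one row keyed by the same column names and check that the embedding lengths match the declared sizes (1536 and 512) before anything is sent to the database. The final append call appears only as a comment because its exact form is an assumption about the Deeplake client, as is the way empty media cells are represented.

```python
from typing import Any, Dict, Optional, Sequence

TEXT_DIM = 1536  # matches the text_embedding column
CLIP_DIM = 512   # matches the clip_embedding column
CONTENT_TYPES = {"image", "video", "text"}


def build_row(
    content_type: str,
    text: str,
    text_embedding: Sequence[float],
    clip_embedding: Sequence[float],
    image: Optional[Any] = None,
    video_frame: Optional[Any] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble one row for the schema, validating it before it is written."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unknown content_type: {content_type!r}")
    if len(text_embedding) != TEXT_DIM:
        raise ValueError(f"text_embedding must have {TEXT_DIM} values")
    if len(clip_embedding) != CLIP_DIM:
        raise ValueError(f"clip_embedding must have {CLIP_DIM} values")
    if content_type == "image" and image is None:
        raise ValueError("image rows need an image")
    if content_type == "video" and video_frame is None:
        raise ValueError("video rows need a key frame")
    return {
        "content_type": content_type,
        "text": text,
        "image": image,
        "video_frame": video_frame,
        "text_embedding": list(text_embedding),
        "clip_embedding": list(clip_embedding),
        "metadata": metadata or {},
    }


# Assumption: the dataset accepts a list of row dicts.
# kb.append([build_row(...)])
```

Validating lengths up front keeps a mismatched embedding model from silently writing vectors that can never be compared with the rest of the column.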
# Text query → retrieve images and video
def multimodal_retrieve(query: str, top_k: int = 5):
    """Find relevant images/video using a text query."""
    # clip_model is assumed to be a CLIP-style encoder defined elsewhere.
    return kb.query("""
        SELECT text, image, video_frame, content_type, metadata
        FROM multimodal_knowledge
        ORDER BY cosine_similarity(clip_embedding, :q) DESC
        LIMIT :k
    """, {"q": clip_model.encode_text(query), "k": top_k})
# Image query → find similar images
def image_search(query_image, top_k: int = 5):
    """Find similar images using an image as the query."""
    return kb.query("""
        SELECT text, image, metadata
        FROM multimodal_knowledge
        WHERE content_type = 'image'
        ORDER BY cosine_similarity(clip_embedding, :q) DESC
        LIMIT :k
    """, {"q": clip_model.encode_image(query_image), "k": top_k})
Cross-Modal Queries
The power of co-located storage is cross-modal querying: search by text, get images back. Search by image, get related video. Filter by metadata, rank by embedding similarity.
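To make "filter by metadata, rank by embedding similarity" concrete, here is a small in-memory sketch of the same logic in plain Python. It is a toy stand-in for the database, not Deeplake code: the rows, the two-dimensional vectors, and the `filtered_search` helper are all made up for illustration.

```python
import math
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_search(
    rows: List[Row],
    query_vec: Sequence[float],
    top_k: int = 10,
    where: Callable[[Row], bool] = lambda row: True,
) -> List[Row]:
    """Apply the metadata filter first, then rank survivors, most similar first."""
    candidates = [row for row in rows if where(row)]
    candidates.sort(
        key=lambda row: cosine_similarity(row["clip_embedding"], query_vec),
        reverse=True,
    )
    return candidates[:top_k]


# Toy rows: each carries its text, type, metadata, and embedding together.
rows = [
    {"text": "pump schematic", "content_type": "image",
     "metadata": {"department": "engineering"}, "clip_embedding": [1.0, 0.0]},
    {"text": "assembly walkthrough", "content_type": "video",
     "metadata": {"department": "engineering"}, "clip_embedding": [0.6, 0.8]},
    {"text": "brand photo", "content_type": "image",
     "metadata": {"department": "marketing"}, "clip_embedding": [1.0, 0.0]},
    {"text": "design notes", "content_type": "text",
     "metadata": {"department": "engineering"}, "clip_embedding": [0.9, 0.1]},
]

hits = filtered_search(
    rows,
    query_vec=[1.0, 0.0],
    where=lambda r: r["metadata"]["department"] == "engineering"
    and r["content_type"] in ("image", "video"),
)
print([h["text"] for h in hits])  # ['pump schematic', 'assembly walkthrough']
```

The marketing photo is an exact vector match but is excluded by the filter, and the text row is excluded by its content type. Note the descending sort: with cosine similarity, higher scores mean closer matches.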
# Complex multimodal query
# `query` is the user's text query (str); the :q placeholder must be bound.
results = kb.query("""
    SELECT image, text, video_frame
    FROM multimodal_knowledge
    WHERE metadata->>'department' = 'engineering'
    AND content_type IN ('image', 'video')
    ORDER BY cosine_similarity(clip_embedding, :q) DESC
    LIMIT 10
""", {"q": clip_model.encode_text(query)})
Why Not Just Use S3 + Pinecone?
| Operation | S3 + Pinecone | Deeplake |
|---|---|---|
| Store image + embedding | Two writes, two systems | One write, one system |
| Retrieve image by similarity | Vector search → get ID → fetch from S3 | One query returns everything |
| Update image | Update S3 + re-embed + update Pinecone | Update row |
| Cross-modal search | Complex custom pipeline | SQL query |
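The "two writes, two systems" row is where sync problems tend to come from. The sketch below imitates both layouts with plain Python dictionaries and lists; it does not call S3, Pinecone, or Deeplake, and every name in it is a hypothetical stand-in. It shows how a delete that reaches only one of two stores leaves a vector pointing at media that no longer exists, while a co-located row is removed as a unit.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_step_retrieve(
    vector_index: Dict[str, Sequence[float]],
    object_store: Dict[str, bytes],
    query: Sequence[float],
    top_k: int = 1,
) -> List[Tuple[str, Optional[bytes]]]:
    """Vector search -> get ID -> fetch media from a separate store."""
    ranked = sorted(
        vector_index, key=lambda key: dot(vector_index[key], query), reverse=True
    )[:top_k]
    # The fetch can miss if the two stores have drifted apart.
    return [(key, object_store.get(key)) for key in ranked]


def one_step_retrieve(
    rows: List[dict], query: Sequence[float], top_k: int = 1
) -> List[Tuple[str, bytes]]:
    """Embedding and media live in the same row, so one lookup returns both."""
    ranked = sorted(rows, key=lambda r: dot(r["embedding"], query), reverse=True)
    return [(r["id"], r["image"]) for r in ranked[:top_k]]


# Two-system layout (in-memory stand-ins).
vector_index = {"img-1": [1.0, 0.0], "img-2": [0.0, 1.0]}
object_store = {"img-1": b"first-image", "img-2": b"second-image"}

# Co-located layout.
rows = [
    {"id": "img-1", "embedding": [1.0, 0.0], "image": b"first-image"},
    {"id": "img-2", "embedding": [0.0, 1.0], "image": b"second-image"},
]

# A delete that reaches the object store but not the vector index.
del object_store["img-1"]
stale = two_step_retrieve(vector_index, object_store, [1.0, 0.0])
print(stale)  # [('img-1', None)]

# Deleting the row removes the embedding and the media together.
rows = [r for r in rows if r["id"] != "img-1"]
fresh = one_step_retrieve(rows, [1.0, 0.0])
print(fresh)  # [('img-2', b'second-image')]
```

In a real two-system deployment the dangling ID would surface as a failed fetch that the application has to detect and repair, which is the glue code the table is pointing at.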