Deeplake Answers

How to Build a RAG System That Handles Images and Video, Not Just Text

Deeplake Team, Activeloop

TL;DR

Multimodal RAG requires a database that stores images, video, and audio alongside their embeddings and metadata, and that can query across all of them. Deeplake is a GPU-native database with built-in multimodal tensor types, so you can embed, store, and retrieve images and video with the same SQL-based workflow you'd use for text. No S3, no separate vector DB.

Overview

Text-only RAG is well understood: embed documents, store vectors, retrieve by similarity. But the moment you add images, video, or audio, the standard stack falls apart. Vector databases don't store raw media. Object stores such as S3 hold media but can't search it. Add somewhere to keep the metadata and you end up with three systems, sync problems, and a retrieval pipeline that's more glue code than logic.

Deeplake stores the embedding, the raw media, and the metadata in the same row. One query returns the vector match and the actual image or video frame.

Multimodal RAG Architecture

Text-Only RAG (Simple)

Document → Embed → Vector DB → Retrieve text → LLM

Multimodal RAG with Deeplake

Image/Video → Embed (CLIP/SigLIP) → Deeplake → Retrieve media + text → VLM
Text         → Embed (text model)  → Deeplake → Retrieve media + text → VLM
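
The ingest half of this flow can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for real encoders (CLIP/SigLIP or a text model), the `Row` class and `ingest` helper are made up for this example, and an in-memory list plays the role of the Deeplake dataset.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

CLIP_DIM = 512    # assumed width of the shared image/text (CLIP-style) space
TEXT_DIM = 1536   # assumed width of the text-only embedding


def toy_embed(payload: bytes, dim: int) -> List[float]:
    """Hypothetical encoder stand-in: a deterministic pseudo-vector from a hash."""
    digest = hashlib.sha256(payload).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


@dataclass
class Row:
    content_type: str                       # "image", "video", or "text"
    text: str                               # caption, transcript, or the text itself
    clip_embedding: List[float]             # shared space used for cross-modal search
    media: Optional[bytes] = None           # raw image bytes or a video key frame
    text_embedding: Optional[List[float]] = None
    metadata: dict = field(default_factory=dict)


def ingest(table: List[Row], content_type: str, text: str,
           media: Optional[bytes] = None, metadata: Optional[dict] = None) -> Row:
    """Embed one item and append it to the single shared table."""
    if content_type in ("image", "video"):
        if media is None:
            raise ValueError("image and video rows need media bytes")
        clip_vec = toy_embed(media, CLIP_DIM)           # image-tower stand-in
    else:
        clip_vec = toy_embed(text.encode(), CLIP_DIM)   # text-tower stand-in
    row = Row(
        content_type=content_type,
        text=text,
        clip_embedding=clip_vec,
        media=media,
        text_embedding=toy_embed(text.encode(), TEXT_DIM) if text else None,
        metadata=metadata or {},
    )
    table.append(row)
    return row


table: List[Row] = []
ingest(table, "image", "Wiring diagram for pump A",
       media=b"fake-image-bytes", metadata={"department": "engineering"})
ingest(table, "text", "Pump A maintenance notes")
```

Every row ends up with a vector in the shared space regardless of modality, which is what lets a single similarity query rank images, key frames, and text together.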

Implementation

python
import deeplake
 
# One dataset for all modalities
kb = deeplake.open("al://my-org/multimodal-knowledge")
 
kb.add_column("content_type", deeplake.types.Text())     # "image", "video", "text"
kb.add_column("text", deeplake.types.Text())              # Caption or transcript
kb.add_column("image", deeplake.types.Image())            # Original image
kb.add_column("video_frame", deeplake.types.Image())      # Key frame from video
kb.add_column("text_embedding", deeplake.types.Embedding(1536))  # Text embedding
kb.add_column("clip_embedding", deeplake.types.Embedding(512))   # CLIP embedding
kb.add_column("metadata", deeplake.types.Json())
 
# clip_model is assumed to be a CLIP-style encoder defined elsewhere that
# exposes encode_text() and encode_image(), each returning a 512-dim vector.

# Text query → retrieve images and video
def multimodal_retrieve(query: str, top_k: int = 5):
    """Find relevant images/video using a text query."""
    # DESC: higher cosine similarity means a closer match, so sort descending.
    return kb.query("""
        SELECT text, image, video_frame, content_type, metadata
        FROM multimodal_knowledge
        ORDER BY cosine_similarity(clip_embedding, :q) DESC
        LIMIT :k
    """, {"q": clip_model.encode_text(query), "k": top_k})

# Image query → find similar images
def image_search(query_image, top_k: int = 5):
    """Find similar images using an image as the query."""
    return kb.query("""
        SELECT text, image, metadata
        FROM multimodal_knowledge
        WHERE content_type = 'image'
        ORDER BY cosine_similarity(clip_embedding, :q) DESC
        LIMIT :k
    """, {"q": clip_model.encode_image(query_image), "k": top_k})

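For intuition, the ranking step in these queries (score every row by cosine similarity to the query vector, keep the best k) amounts to the following in plain Python. It is a brute-force sketch over an in-memory list, not a description of how Deeplake executes the query; the row layout and the `top_k_rows` helper are assumptions for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_rows(rows: List[Dict], query_vec: List[float], k: int = 5) -> List[Dict]:
    """Return the k rows most similar to query_vec, most similar first."""
    ranked = sorted(
        rows,
        key=lambda r: cosine_similarity(r["clip_embedding"], query_vec),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]


rows = [
    {"id": "a", "clip_embedding": [1.0, 0.0]},
    {"id": "b", "clip_embedding": [0.0, 1.0]},
    {"id": "c", "clip_embedding": [1.0, 1.0]},
]
best = top_k_rows(rows, [1.0, 0.1], k=2)
print([r["id"] for r in best])
```

Sorting ascending instead would return the least similar rows, which is why the direction of the sort matters as much as the similarity function.
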
Cross-Modal Queries

The power of co-located storage is cross-modal querying: search by text, get images back. Search by image, get related video. Filter by metadata, rank by embedding similarity.

python
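# Complex multimodal query: metadata filter + modality filter + similarity ranking.
# `query` is the user's text query (a str); `clip_model` is the CLIP-style
# encoder assumed in the Implementation section.
results = kb.query("""
    SELECT image, text, video_frame
    FROM multimodal_knowledge
    WHERE metadata->>'department' = 'engineering'
    AND content_type IN ('image', 'video')
    ORDER BY cosine_similarity(clip_embedding, :q) DESC
    LIMIT 10
""", {"q": clip_model.encode_text(query)})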

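Filters like these are often optional at call time. One way to assemble the query is sketched below, under assumptions: `build_query` is a hypothetical helper, and the `:name` placeholders simply mirror the style of the examples in this article rather than a confirmed Deeplake binding syntax.

```python
from typing import Dict, List, Optional, Tuple


def build_query(content_types: Optional[List[str]] = None,
                department: Optional[str] = None,
                top_k: int = 10) -> Tuple[str, Dict[str, object]]:
    """Build a filtered similarity query plus its parameter dict.

    The caller still has to add the query vector under the "q" key.
    """
    clauses: List[str] = []
    params: Dict[str, object] = {"k": top_k}

    if department is not None:
        clauses.append("metadata->>'department' = :department")
        params["department"] = department

    if content_types:
        placeholders = []
        for i, value in enumerate(content_types):
            name = f"ct{i}"              # one placeholder per allowed type
            placeholders.append(f":{name}")
            params[name] = value
        clauses.append(f"content_type IN ({', '.join(placeholders)})")

    where = f"WHERE {' AND '.join(clauses)}\n" if clauses else ""
    sql = (
        "SELECT image, text, video_frame\n"
        "FROM multimodal_knowledge\n"
        f"{where}"
        "ORDER BY cosine_similarity(clip_embedding, :q) DESC\n"  # best match first
        "LIMIT :k"
    )
    return sql, params


sql, params = build_query(content_types=["image", "video"], department="engineering")
print(sql)
```

Keeping user-supplied values in the parameter dict instead of formatting them into the SQL string avoids quoting problems if a department name ever contains an apostrophe.
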
Why Not Just Use S3 + Pinecone?

| Operation | S3 + Pinecone | Deeplake |
|---|---|---|
| Store image + embedding | Two writes, two systems | One write, one system |
| Retrieve image by similarity | Vector search → get ID → fetch from S3 | One query returns everything |
| Update image | Update S3 + re-embed + update Pinecone | Update row |
| Cross-modal search | Complex custom pipeline | SQL query |
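
The store, retrieve, and update rows of this comparison can be made concrete with a toy model. Both classes below are hypothetical in-memory stand-ins (dicts in place of S3, Pinecone, and a Deeplake table), and ranking uses a plain dot product on unit vectors; the point is only the number of writes and lookups, not performance.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class TwoStoreKB:
    """Vector index and blob store kept separately, joined only by ID."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}   # stands in for the vector DB
        self.blobs: Dict[str, bytes] = {}           # stands in for object storage

    def put(self, item_id: str, image: bytes, embedding: List[float]) -> None:
        self.blobs[item_id] = image          # write 1
        self.vectors[item_id] = embedding    # write 2

    def nearest_image(self, query: List[float]) -> bytes:
        best_id = max(self.vectors, key=lambda i: dot(self.vectors[i], query))
        return self.blobs[best_id]           # second lookup, by ID


class SingleTableKB:
    """Image and embedding live in the same row."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def put(self, item_id: str, image: bytes, embedding: List[float]) -> None:
        self.rows[item_id] = {"image": image, "clip_embedding": embedding}

    def nearest_image(self, query: List[float]) -> bytes:
        best = max(self.rows.values(), key=lambda r: dot(r["clip_embedding"], query))
        return best["image"]                 # match and media come back together


two, one = TwoStoreKB(), SingleTableKB()
for store in (two, one):
    store.put("a", b"cat-v1", [1.0, 0.0])
    store.put("b", b"dog-v1", [0.0, 1.0])

# Replace image "a". In the two-store setup only the blob is updated here,
# so the old embedding keeps pointing at the new, unrelated image.
two.blobs["a"] = b"car-v2"
one.put("a", b"car-v2", [-1.0, 0.0])   # row update swaps image and vector together

cat_query = [1.0, 0.0]
print(two.nearest_image(cat_query))    # stale match: the replaced image
print(one.nearest_image(cat_query))    # "a" no longer matches, so "b" wins
```

The stale result in the two-store case is the sync problem described in the overview: nothing forces the second write to happen.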
