How to Build a RAG System That Handles Images and Video, Not Just Text

TL;DR

Multimodal RAG requires a database that stores images, video, and audio alongside their embeddings and metadata, and that can query across all of them. Deeplake is a GPU-native database with native multimodal tensor types, so you can embed, store, and retrieve images and video with the same SQL-based workflow you'd use for text. No S3, no separate vector DB.

Overview

Text-only RAG is well understood: embed documents, store vectors, retrieve by similarity. But the moment you add images, video, or audio, the standard stack falls apart. Vector databases don't store raw media. S3 stores media but can't search it. You end up with three systems (a vector index, an object store, and a metadata store), sync problems, and a retrieval pipeline that's more glue code than logic.

Deeplake stores the embedding, the raw media, and the metadata in the same row. One query returns the vector match and the actual image or video frame.

Multimodal RAG Architecture

Text-Only RAG (Simple)

Document → Embed → Vector DB → Retrieve text → LLM

Multimodal RAG with Deeplake

Image/Video → Embed (CLIP/SigLIP) → Deeplake → Retrieve media + text → VLM
Text         → Embed (text model)  → Deeplake → Retrieve media + text → VLM
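
To make the retrieval step in the diagram concrete, here is a minimal sketch in plain Python that ranks rows of mixed modality against one query vector. The toy 3-dimensional vectors, the row dicts, and the `top_k_rows` helper are illustrative assumptions; in a real pipeline the vectors would come from a CLIP-style model and the ranking would run inside the database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_rows(rows, query_vec, k=2):
    """Return the k rows whose embedding is most similar to query_vec."""
    ranked = sorted(
        rows,
        key=lambda r: cosine_similarity(r["clip_embedding"], query_vec),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]

# Hypothetical rows: every modality shares one embedding space,
# so a single query vector can be compared against all of them.
rows = [
    {"content_type": "image", "text": "pump diagram", "clip_embedding": [0.9, 0.1, 0.0]},
    {"content_type": "video", "text": "assembly walkthrough", "clip_embedding": [0.7, 0.7, 0.0]},
    {"content_type": "text", "text": "holiday schedule", "clip_embedding": [0.0, 0.1, 0.9]},
]
query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded text query

hits = top_k_rows(rows, query_vec, k=2)
print([h["text"] for h in hits])
```

Because image, video, and text rows are compared in the same vector space, one text query can surface any of them.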

Implementation

python

import deeplake

# One dataset for all modalities
kb = deeplake.open("al://my-org/multimodal-knowledge")

kb.add_column("content_type", deeplake.types.Text())     # "image", "video", "text"
kb.add_column("text", deeplake.types.Text())              # Caption or transcript
kb.add_column("image", deeplake.types.Image())            # Original image
kb.add_column("video_frame", deeplake.types.Image())      # Key frame from video
kb.add_column("text_embedding", deeplake.types.Embedding(1536))  # Text embedding
kb.add_column("clip_embedding", deeplake.types.Embedding(512))   # CLIP embedding
kb.add_column("metadata", deeplake.types.Json())

# `clip_model` is assumed to be a CLIP-style encoder wrapper, defined elsewhere, whose
# encode_text() and encode_image() return 512-dimensional vectors in the same space.

# Text query → retrieve images and video
def multimodal_retrieve(query: str, top_k: int = 5):
    """Find relevant images/video using a text query."""
    # DESC puts the most similar rows first; ascending order would return the least similar.
    return kb.query("""
        SELECT text, image, video_frame, content_type, metadata
        FROM multimodal_knowledge
        ORDER BY cosine_similarity(clip_embedding, :q) DESC
        LIMIT :k
    """, {"q": clip_model.encode_text(query), "k": top_k})

# Image query → find similar images
def image_search(query_image, top_k: int = 5):
    """Find similar images using an image as the query."""
    return kb.query("""
        SELECT text, image, metadata
        FROM multimodal_knowledge
        WHERE content_type = 'image'
        ORDER BY cosine_similarity(clip_embedding, :q) DESC
        LIMIT :k
    """, {"q": clip_model.encode_image(query_image), "k": top_k})

Cross-Modal Queries

The power of co-located storage is cross-modal querying: search by text, get images back. Search by image, get related video. Filter by metadata, rank by embedding similarity.

python

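# Complex multimodal query: metadata filter + content-type filter + similarity ranking
# `query` is the user's text query; `clip_model` is the CLIP-style encoder used in the functions above
results = kb.query("""
    SELECT image, text, video_frame
    FROM multimodal_knowledge
    WHERE metadata->>'department' = 'engineering'
    AND content_type IN ('image', 'video')
    ORDER BY cosine_similarity(clip_embedding, :q) DESC
    LIMIT 10
""", {"q": clip_model.encode_text(query)})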

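The query above combines a metadata predicate, a content-type predicate, and a similarity ranking. As a rough mental model of those semantics (not of how Deeplake executes it), the sketch below does the same thing over in-memory rows. The `filter_then_rank` helper, the row contents, and the 2-dimensional unit vectors are all hypothetical; with unit-length vectors the dot product equals cosine similarity.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def filter_then_rank(rows, query_vec, department, content_types, limit=10):
    """Apply the WHERE predicates first, then order by similarity and truncate."""
    candidates = [
        r for r in rows
        if r["metadata"].get("department") == department
        and r["content_type"] in content_types
    ]
    candidates.sort(key=lambda r: dot(r["clip_embedding"], query_vec), reverse=True)
    return candidates[:limit]

# Hypothetical rows with unit-length embeddings
rows = [
    {"id": 1, "content_type": "image", "metadata": {"department": "engineering"}, "clip_embedding": [1.0, 0.0]},
    {"id": 2, "content_type": "video", "metadata": {"department": "engineering"}, "clip_embedding": [0.6, 0.8]},
    {"id": 3, "content_type": "image", "metadata": {"department": "marketing"}, "clip_embedding": [0.0, 1.0]},
    {"id": 4, "content_type": "text", "metadata": {"department": "engineering"}, "clip_embedding": [0.0, 1.0]},
]
query_vec = [0.0, 1.0]

matches = filter_then_rank(rows, query_vec, "engineering", {"image", "video"})
print([r["id"] for r in matches])
```

Rows 3 and 4 are the closest vectors to the query, yet neither is returned: one fails the department filter and the other fails the content-type filter. The filters decide which rows are eligible, and similarity only orders what remains.
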
Why Not Just Use S3 + Pinecone?

Operation	S3 + Pinecone	Deeplake
Store image + embedding	Two writes, two systems	One write, one system
Retrieve image by similarity	Vector search → get ID → fetch from S3	One query returns everything
Update image	Update S3 + re-embed + update Pinecone	Update row
Cross-modal search	Complex custom pipeline	SQL query
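
The sync problem behind the first and third rows of the table can be illustrated with a toy model. The dicts below are in-memory stand-ins, not the real S3 or Pinecone clients, and the helper names are made up for this sketch; the point is only that two independent writes can drift apart while a single row cannot.

```python
# Toy stand-ins: plain dicts, not real storage clients.
object_store = {}   # id -> raw image bytes (plays the role of the object store)
vector_index = {}   # id -> embedding       (plays the role of the vector database)
table = {}          # id -> full row        (plays the role of a co-located store)

def write_split(item_id, image, embedding):
    """Two writes to two systems; a failure between them leaves them out of sync."""
    object_store[item_id] = image
    vector_index[item_id] = embedding

def write_colocated(item_id, image, embedding):
    """One write: media and embedding live in the same row."""
    table[item_id] = {"image": image, "embedding": embedding}

def orphaned_ids():
    """IDs present in one of the split stores but missing from the other."""
    return set(object_store) ^ set(vector_index)

write_split("a", b"img-a", [0.1, 0.2])
write_colocated("a", b"img-a", [0.1, 0.2])

# Simulate a pipeline that uploads a second image but never indexes it.
object_store["b"] = b"img-b"

print(orphaned_ids())
```

In the split setup, something has to detect and repair the orphaned ID; in the co-located setup, a row either exists with both fields or does not exist at all.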

