How Do I Curate a Video Training Dataset With Captions, Embeddings, and Quality Scores?

TL;DR

Curating a video dataset means storing frames, captions, embeddings, and quality scores together, then querying across all of them to build the right training subset. Deeplake stores multimodal data (video frames, text, embeddings) as co-located columns, with Postgres-compatible SQL for curation queries and direct GPU streaming for training.

Overview

Building a high-quality video training dataset means more than dumping clips into S3. You need frame-level access, captions aligned to frames, CLIP or SigLIP embeddings for semantic search, quality scores for filtering, and the ability to create curated subsets without copying terabytes of data. Deeplake handles all of this in a single GPU-native database.

Dataset Schema

python

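import deeplake

# Open the dataset that will hold one row per extracted frame
ds = deeplake.open("al://my-org/video-training-data")

# One column per modality or score; every column shares the same row index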
ds.add_column("video_frame", deeplake.types.Image())
ds.add_column("caption", deeplake.types.Text())
ds.add_column("clip_embedding", deeplake.types.Embedding(512))
ds.add_column("text_embedding", deeplake.types.Embedding(1536))
ds.add_column("quality_score", deeplake.types.Float32())
ds.add_column("aesthetic_score", deeplake.types.Float32())
ds.add_column("nsfw_score", deeplake.types.Float32())
ds.add_column("video_id", deeplake.types.Text())
ds.add_column("frame_index", deeplake.types.Int64())
ds.add_column("resolution", deeplake.types.Text())
ds.add_column("metadata", deeplake.types.Json())
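
Because every column shares one row index, it helps to check each frame record before it is appended. The sketch below is plain Python and does not call Deeplake. `validate_row` is a hypothetical helper, and the assumption that all three scores are normalized to the range [0, 1] is ours rather than something the schema enforces.

```python
from typing import Any

CLIP_DIM = 512    # matches Embedding(512) in the schema above
TEXT_DIM = 1536   # matches Embedding(1536) in the schema above
SCORE_COLUMNS = ("quality_score", "aesthetic_score", "nsfw_score")


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return the problems found in one frame record (empty list if valid)."""
    problems = []
    if len(row["clip_embedding"]) != CLIP_DIM:
        problems.append("clip_embedding must have 512 values")
    if len(row["text_embedding"]) != TEXT_DIM:
        problems.append("text_embedding must have 1536 values")
    for name in SCORE_COLUMNS:
        score = row[name]
        # Assumption: scores live in [0, 1]. NaN fails this comparison too.
        if not (0.0 <= score <= 1.0):
            problems.append(f"{name} out of range: {score}")
    if row["frame_index"] < 0:
        problems.append("frame_index must be non-negative")
    if not row["caption"].strip():
        problems.append("caption is empty")
    return problems


# Illustrative record; the values are made up for the example.
example_row = {
    "caption": "A chef slices onions on a wooden board",
    "clip_embedding": [0.0] * CLIP_DIM,
    "text_embedding": [0.0] * TEXT_DIM,
    "quality_score": 0.92,
    "aesthetic_score": 0.81,
    "nsfw_score": 0.01,
    "video_id": "vid_0001",
    "frame_index": 120,
    "resolution": "1080p",
}

print(validate_row(example_row))
```

Rejecting a bad record at this point is cheaper than discovering a wrong embedding width or an out-of-range score after a curation query has already used it.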

Curation Workflow

Step 1: Filter by Quality

python

# High-quality, safe, high-aesthetic frames
high_quality = ds.query("""
    SELECT video_frame, caption, quality_score, aesthetic_score
    FROM video_training_data
    WHERE quality_score > 0.8
    AND aesthetic_score > 0.7
    AND nsfw_score < 0.1
    AND resolution IN ('1080p', '4k')
""")
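
The thresholds in that query can also be written as a plain Python predicate. This is useful for unit-testing the cutoffs on a handful of rows before running them over the full dataset. The sketch assumes rows are available as dictionaries. `passes_quality_filter` and the sample rows are hypothetical.

```python
ALLOWED_RESOLUTIONS = {"1080p", "4k"}


def passes_quality_filter(row: dict) -> bool:
    """Mirror the WHERE clause: strict inequalities, same thresholds."""
    return (
        row["quality_score"] > 0.8
        and row["aesthetic_score"] > 0.7
        and row["nsfw_score"] < 0.1
        and row["resolution"] in ALLOWED_RESOLUTIONS
    )


# Made-up rows that exercise each condition, including a boundary value.
rows = [
    {"video_id": "a", "quality_score": 0.90, "aesthetic_score": 0.80,
     "nsfw_score": 0.02, "resolution": "1080p"},
    {"video_id": "b", "quality_score": 0.80, "aesthetic_score": 0.90,
     "nsfw_score": 0.02, "resolution": "4k"},      # exactly 0.8 is excluded
    {"video_id": "c", "quality_score": 0.95, "aesthetic_score": 0.90,
     "nsfw_score": 0.30, "resolution": "4k"},      # fails the NSFW cutoff
    {"video_id": "d", "quality_score": 0.95, "aesthetic_score": 0.90,
     "nsfw_score": 0.01, "resolution": "720p"},    # resolution not allowed
]

kept_ids = [r["video_id"] for r in rows if passes_quality_filter(r)]
print(kept_ids)
```

Note that the comparisons are strict, so a frame scoring exactly 0.8 on quality is dropped. Use `>=` in both places if the boundary should be included.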

Step 2: Semantic Curation

python

# Find frames matching a concept.
# :cooking_vec is a placeholder for the CLIP embedding of the concept query.
# Sort descending so the most similar frames come first.
cooking_scenes = ds.query("""
    SELECT video_frame, caption, video_id, frame_index
    FROM video_training_data
    WHERE quality_score > 0.8
    ORDER BY cosine_similarity(clip_embedding, :cooking_vec) DESC
    LIMIT 5000
""")

# Build a candidate pool for a diverse set.
# This query only ranks frames by similarity to :target_vec; it does not
# remove near-duplicates, so deduplicate the results in a separate pass.
diverse_set = ds.query("""
    SELECT video_frame, caption, clip_embedding
    FROM video_training_data
    WHERE quality_score > 0.85
    ORDER BY cosine_similarity(clip_embedding, :target_vec) DESC
    LIMIT 10000
""")
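
Ranking by similarity does not by itself produce variety, since consecutive frames from one clip tend to have nearly identical embeddings. One common way to thin a candidate pool is a greedy pass that keeps a frame only if it is not too close to any frame already kept. The sketch below is plain Python and is not part of Deeplake. `select_diverse`, the 0.95 threshold, and the two-dimensional toy embeddings are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def select_diverse(candidates, max_similarity=0.95):
    """Greedily keep candidates whose embedding is not a near-duplicate
    of any candidate kept so far. Input order decides which copy wins."""
    kept = []
    for cand in candidates:
        is_novel = all(
            cosine_similarity(cand["clip_embedding"], k["clip_embedding"])
            <= max_similarity
            for k in kept
        )
        if is_novel:
            kept.append(cand)
    return kept


# Toy pool: "b" is almost identical to "a"; "c" points elsewhere.
pool = [
    {"id": "a", "clip_embedding": [1.0, 0.0]},
    {"id": "b", "clip_embedding": [0.99, 0.01]},
    {"id": "c", "clip_embedding": [0.0, 1.0]},
]

diverse_ids = [row["id"] for row in select_diverse(pool)]
print(diverse_ids)
```

The pass compares each candidate against everything kept so far, so its cost grows with the product of pool size and kept size. That is acceptable for a pool of a few thousand frames; larger pools usually call for an approximate nearest-neighbor index instead.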

Step 3: Create a Training Branch

python

# Branch for this training run: lightweight, no data copy
training_v2 = ds.branch("training-v2-high-quality")

# Stream directly to GPU
dataloader = (
    training_v2.dataloader()
    .query("SELECT * WHERE quality_score > 0.85")
    .pytorch(batch_size=16, num_workers=8, shuffle=True)
)

for batch in dataloader:
    frames = batch["video_frame"]      # Already tensors
    captions = batch["caption"]
    loss = model(frames, captions)
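
The idea of a curated subset that copies no data can be illustrated with an index-based view, where the subset stores only row positions and reads through to the shared rows. This is a conceptual sketch in plain Python, not a description of how Deeplake implements branches. `SubsetView` and the sample rows are hypothetical.

```python
class SubsetView:
    """A read-only view over selected rows of a parent collection.

    Only the list of row indices is stored; the rows themselves are
    shared with the parent rather than duplicated.
    """

    def __init__(self, rows, indices):
        self._rows = rows              # shared reference, not a copy
        self._indices = list(indices)  # small: one integer per kept row

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, position):
        return self._rows[self._indices[position]]

    def __iter__(self):
        for index in self._indices:
            yield self._rows[index]


# Made-up parent rows standing in for the full dataset.
all_rows = [
    {"video_id": "v1", "quality_score": 0.91},
    {"video_id": "v2", "quality_score": 0.62},
    {"video_id": "v3", "quality_score": 0.88},
]

keep = [i for i, row in enumerate(all_rows) if row["quality_score"] > 0.85]
training_subset = SubsetView(all_rows, keep)

print(len(training_subset), [row["video_id"] for row in training_subset])
```

The subset costs one integer per selected row regardless of how large the frames are, which is why this style of selection scales to terabyte-sized datasets where copying files would not.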

Why Not S3 + Parquet + Custom Scripts?

Operation	S3 + Parquet	Deeplake
Store frames + embeddings + captions	Three systems	One dataset
Filter by quality score	Parquet scan (slow for frames)	SQL query
Semantic search for curation	Custom FAISS pipeline	Built-in vector search
Create training subset	Copy files to new bucket	Lightweight branch
Stream to GPU	Custom dataloader + S3 reads	Native PyTorch dataloader
Version datasets	Manual snapshots	Branch/merge/diff
