How Do I Curate a Video Training Dataset With Captions, Embeddings, and Quality Scores?

Deeplake Team, Activeloop

TL;DR

Video dataset curation requires storing frames, captions, embeddings, and quality scores together, then querying across all of them to build the right training subset. Deeplake natively stores multimodal data (video frames, text, embeddings) as co-located columns, with Postgres-compatible SQL for curation queries and direct GPU streaming for training.

Overview

Building a high-quality video training dataset means more than dumping clips into S3. You need frame-level access, captions aligned to frames, CLIP or SigLIP embeddings for semantic search, quality scores for filtering, and the ability to create curated subsets without copying terabytes of data. Deeplake handles all of this in a single GPU-native database.

Dataset Schema

python
import deeplake

# Open the dataset (assumes it already exists at this path)
ds = deeplake.open("al://my-org/video-training-data")

# Frame content and its caption
ds.add_column("video_frame", deeplake.types.Image())
ds.add_column("caption", deeplake.types.Text())

# Embeddings for semantic search (dimensions must match the encoders you use)
ds.add_column("clip_embedding", deeplake.types.Embedding(512))
ds.add_column("text_embedding", deeplake.types.Embedding(1536))

# Per-frame scores used for filtering
ds.add_column("quality_score", deeplake.types.Float32())
ds.add_column("aesthetic_score", deeplake.types.Float32())
ds.add_column("nsfw_score", deeplake.types.Float32())

# Provenance and free-form metadata
ds.add_column("video_id", deeplake.types.Text())
ds.add_column("frame_index", deeplake.types.Int64())
ds.add_column("resolution", deeplake.types.Text())
ds.add_column("metadata", deeplake.types.Json())
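To make the schema concrete, here is a minimal sketch of how one extracted frame might be packed into a row with these eleven columns. `FrameRecord` and `to_column_batch` are hypothetical helpers, not part of Deeplake. The sketch assumes that scores lie in [0, 1] (as the thresholds used later suggest) and that rows are handed to the dataset as a dict mapping column names to lists of values.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class FrameRecord:
    """One row of the schema above (hypothetical helper, not a Deeplake class)."""
    video_frame: np.ndarray      # H x W x 3 uint8 image
    caption: str
    clip_embedding: np.ndarray   # shape (512,)
    text_embedding: np.ndarray   # shape (1536,)
    quality_score: float
    aesthetic_score: float
    nsfw_score: float
    video_id: str
    frame_index: int
    resolution: str
    metadata: dict = field(default_factory=dict)

    def validate(self) -> None:
        # Catch shape and range problems before they reach the dataset.
        if self.video_frame.ndim != 3 or self.video_frame.shape[2] != 3:
            raise ValueError("video_frame must have shape (H, W, 3)")
        if self.clip_embedding.shape != (512,):
            raise ValueError("clip_embedding must have shape (512,)")
        if self.text_embedding.shape != (1536,):
            raise ValueError("text_embedding must have shape (1536,)")
        for name in ("quality_score", "aesthetic_score", "nsfw_score"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:  # assumed score range
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def to_column_batch(records):
    """Pivot a list of records into {column_name: [values, ...]}."""
    batch = {}
    for record in records:
        record.validate()
        for name, value in vars(record).items():
            batch.setdefault(name, []).append(value)
    return batch


example = FrameRecord(
    video_frame=np.zeros((1080, 1920, 3), dtype=np.uint8),
    caption="A chef slices onions on a wooden board.",
    clip_embedding=np.zeros(512, dtype=np.float32),
    text_embedding=np.zeros(1536, dtype=np.float32),
    quality_score=0.91,
    aesthetic_score=0.78,
    nsfw_score=0.01,
    video_id="vid_0001",
    frame_index=0,
    resolution="1080p",
    metadata={"fps": 30},
)
batch = to_column_batch([example])
```

Validating on the client side keeps malformed embeddings or out-of-range scores from silently skewing later curation queries.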

Curation Workflow

Step 1: Filter by Quality

python
# High-quality, safe, high-aesthetic frames
high_quality = ds.query("""
    SELECT video_frame, caption, quality_score, aesthetic_score
    FROM video_training_data
    WHERE quality_score > 0.8
    AND aesthetic_score > 0.7
    AND nsfw_score < 0.1
    AND resolution IN ('1080p', '4k')
""")
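Before running the filter on the full dataset, it can help to check how strict the thresholds are on a small sample of scores. The sketch below applies the same four conditions with NumPy boolean masks. The `quality_mask` function and the sample values are illustrative assumptions, not Deeplake API or real data.

```python
import numpy as np


def quality_mask(quality, aesthetic, nsfw, resolution,
                 min_quality=0.8, min_aesthetic=0.7, max_nsfw=0.1,
                 allowed_resolutions=("1080p", "4k")):
    """Return a boolean array marking rows that pass every filter."""
    quality = np.asarray(quality, dtype=float)
    aesthetic = np.asarray(aesthetic, dtype=float)
    nsfw = np.asarray(nsfw, dtype=float)
    resolution = np.asarray(resolution)
    return (
        (quality > min_quality)
        & (aesthetic > min_aesthetic)
        & (nsfw < max_nsfw)
        & np.isin(resolution, allowed_resolutions)
    )


# Made-up sample scores: each of rows 1-4 fails exactly one condition.
mask = quality_mask(
    quality=[0.95, 0.85, 0.60, 0.90, 0.92],
    aesthetic=[0.80, 0.65, 0.90, 0.75, 0.88],
    nsfw=[0.01, 0.02, 0.03, 0.40, 0.05],
    resolution=["1080p", "4k", "4k", "1080p", "720p"],
)
keep_rate = mask.mean()
print(mask.tolist(), keep_rate)
```

If the keep rate on a representative sample is very low, loosening one threshold at a time shows which condition is discarding the most frames.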

Step 2: Semantic Curation

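python
# Find frames matching a concept.
# :cooking_vec is a placeholder for the query embedding (for example, the CLIP
# text embedding of "a person cooking"); supply it before running the query.
cooking_scenes = ds.query("""
    SELECT video_frame, caption, video_id, frame_index
    FROM video_training_data
    WHERE quality_score > 0.8
    ORDER BY cosine_similarity(clip_embedding, :cooking_vec) DESC
    LIMIT 5000
""")

# Build a candidate pool for a diverse subset.
# This query only ranks by similarity to :target_vec; it does not remove
# near-duplicates. Filter the returned clip_embedding values by pairwise
# similarity afterwards to ensure variety.
diverse_set = ds.query("""
    SELECT video_frame, caption, clip_embedding
    FROM video_training_data
    WHERE quality_score > 0.85
    ORDER BY cosine_similarity(clip_embedding, :target_vec) DESC
    LIMIT 10000
""")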
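Ranking by similarity alone tends to return many near-identical frames from the same clip. One common way to thin them out is a greedy pass over the ranked embeddings that keeps a frame only if it is not too similar to anything already kept. The sketch below shows that idea in plain NumPy. The `greedy_diverse_subset` name and the 0.95 similarity cutoff are assumptions for illustration, not a Deeplake feature.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def greedy_diverse_subset(embeddings, max_similarity=0.95, limit=None):
    """Return indices of kept rows, preserving the input (ranked) order.

    A row is kept only if its cosine similarity to every previously kept
    row is at most max_similarity. Cost is O(n * k) dot products for n
    candidates and k kept rows.
    """
    unit = normalize_rows(embeddings)
    kept = []
    for i in range(len(unit)):
        if limit is not None and len(kept) >= limit:
            break
        if kept:
            sims = unit[kept] @ unit[i]
            if sims.max() > max_similarity:
                continue  # near-duplicate of something already kept
        kept.append(i)
    return kept


# Toy 2-D embeddings: row 1 is almost identical to row 0.
candidates = np.array([
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
    [0.7, 0.7],
])
kept = greedy_diverse_subset(candidates, max_similarity=0.95)
print(kept)
```

Because the pass preserves input order, feeding it rows already sorted by relevance keeps the most relevant member of each near-duplicate group.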

Step 3: Create a Training Branch

python
# Branch for this training run: lightweight, no data copy
training_v2 = ds.branch("training-v2-high-quality")

# Stream directly to GPU
dataloader = (
    training_v2.dataloader()
    .query("SELECT * WHERE quality_score > 0.85")
    .pytorch(batch_size=16, num_workers=8, shuffle=True)
)

# `model` is your own training model; it is not defined in this snippet
for batch in dataloader:
    frames = batch["video_frame"]      # Already tensors
    captions = batch["caption"]
    loss = model(frames, captions)

Why Not S3 + Parquet + Custom Scripts?

| Operation | S3 + Parquet | Deeplake |
| --- | --- | --- |
| Store frames + embeddings + captions | Three systems | One dataset |
| Filter by quality score | Parquet scan (slow for frames) | SQL query |
| Semantic search for curation | Custom FAISS pipeline | Built-in vector search |
| Create training subset | Copy files to new bucket | Lightweight branch |
| Stream to GPU | Custom dataloader + S3 reads | Native PyTorch dataloader |
| Version datasets | Manual snapshots | Branch/merge/diff |
