Deeplake Answers
How Do I Curate a Video Training Dataset With Captions, Embeddings, and Quality Scores?
TL;DR
Video dataset curation requires storing frames, captions, embeddings, and quality scores together, then querying across all of them to build the right training subset. Deeplake natively stores multimodal data (video frames, text, embeddings) as co-located columns, with Postgres-compatible SQL for curation queries and direct GPU streaming for training.
Overview
Building a high-quality video training dataset means more than dumping clips into S3. You need frame-level access, captions aligned to frames, CLIP or SigLIP embeddings for semantic search, quality scores for filtering, and the ability to create curated subsets without copying terabytes of data. Deeplake handles all of this in a single GPU-native database.
Dataset Schema
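import deeplake
# Open an existing dataset; if it does not exist yet, create it first (deeplake.create in recent versions)
ds = deeplake.open("al://my-org/video-training-data")
# One row per frame: pixels, text, embeddings, and scores sit side by side as columns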
ds.add_column("video_frame", deeplake.types.Image())
ds.add_column("caption", deeplake.types.Text())
ds.add_column("clip_embedding", deeplake.types.Embedding(512))
ds.add_column("text_embedding", deeplake.types.Embedding(1536))
ds.add_column("quality_score", deeplake.types.Float32())
ds.add_column("aesthetic_score", deeplake.types.Float32())
ds.add_column("nsfw_score", deeplake.types.Float32())
ds.add_column("video_id", deeplake.types.Text())
ds.add_column("frame_index", deeplake.types.Int64())
ds.add_column("resolution", deeplake.types.Text())
ds.add_column("metadata", deeplake.types.Json())
Curation Workflow
Step 1: Filter by Quality
# High-quality, safe, high-aesthetic frames
high_quality = ds.query("""
SELECT video_frame, caption, quality_score, aesthetic_score
FROM video_training_data
WHERE quality_score > 0.8
AND aesthetic_score > 0.7
AND nsfw_score < 0.1
AND resolution IN ('1080p', '4k')
""")
Step 2: Semantic Curation
# Find frames matching a concept
# :cooking_vec is a placeholder for the CLIP embedding of the concept (here, a cooking prompt)
cooking_scenes = ds.query("""
SELECT video_frame, caption, video_id, frame_index
FROM video_training_data
WHERE quality_score > 0.8
ORDER BY cosine_similarity(clip_embedding, :cooking_vec) DESC
LIMIT 5000
""")
# Build a candidate pool for a diverse subset
# This query only ranks by similarity to :target_vec; near-duplicates still
# have to be removed afterwards by comparing embedding distances
diverse_set = ds.query("""
SELECT video_frame, caption, clip_embedding
FROM video_training_data
WHERE quality_score > 0.85
ORDER BY cosine_similarity(clip_embedding, :target_vec) DESC
LIMIT 10000
""")
Step 3: Create a Training Branch
# Branch for this training run - lightweight, no data copy
training_v2 = ds.branch("training-v2-high-quality")
# Stream directly to GPU
dataloader = training_v2.dataloader() \
    .query("SELECT * WHERE quality_score > 0.85") \
    .pytorch(batch_size=16, num_workers=8, shuffle=True)
for batch in dataloader:
    frames = batch["video_frame"]  # Already tensors
    captions = batch["caption"]
    loss = model(frames, captions)  # model is your own training module
Why Not S3 + Parquet + Custom Scripts?
| Operation | S3 + Parquet | Deeplake |
|---|---|---|
| Store frames + embeddings + captions | Three systems | One dataset |
| Filter by quality score | Parquet scan (slow for frames) | SQL query |
| Semantic search for curation | Custom FAISS pipeline | Built-in vector search |
| Create training subset | Copy files to new bucket | Lightweight branch |
| Stream to GPU | Custom dataloader + S3 reads | Native PyTorch dataloader |
| Version datasets | Manual snapshots | Branch/merge/diff |