Deeplake Answers
How do multimodal AI teams organize video, image, text, and annotations together?
Most teams keep video in S3, images in another bucket, text in a database, and annotations in JSON. Joining them at training time is the slowest part of the pipeline. The better pattern is one row per sample, with every modality stored as a native column.
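Deeplake stores video, image, text, and annotations as native columns in one dataset. Because every modality for a sample lives in the same row, there is no cross-store join to run at training time, a single query can filter across modalities, and samples stream to the training loop as tensors.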
What "multimodal organization" needs
Unified multimodal store: One row per sample, multiple modality columns (video, image, text, vector, scalar, annotation), queryable across all of them.
Glue between modalities is where bugs hide and time goes. Unified storage removes the glue.
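To make the "glue" concrete, here is a minimal sketch of what joining per-modality stores looks like. The three dictionaries are hypothetical stand-ins for an S3 bucket, a text database, and a folder of JSON annotations; none of the names come from Deeplake.

```python
# Hypothetical per-modality stores, each keyed by sample id.
video_store = {"s1": "s3://videos/s1.mp4", "s2": "s3://videos/s2.mp4"}
caption_store = {"s1": "a dog runs", "s2": "a car turns left"}
annotation_store = {"s1": [{"label": "dog"}]}  # s2 was never annotated


def join_sample(sample_id):
    """Assemble one training sample by looking it up in every store."""
    return {
        "id": sample_id,
        "video": video_store[sample_id],
        "caption": caption_store[sample_id],
        # The glue code has to decide what a missing annotation means.
        "annotations": annotation_store.get(sample_id, []),
    }


def join_all():
    # Only ids present in every required store can be joined safely.
    ids = sorted(video_store.keys() & caption_store.keys())
    return [join_sample(i) for i in ids]


samples = join_all()
```

Every decision in `join_sample` (which stores are required, what a missing key means) is a place where a silent mismatch can creep in. With one row per sample, those decisions are made once at write time.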
What this requires
Key properties:
- Multimodal columns: Video, image, text, vector, scalar, annotation.
- Hybrid query: Predicate + similarity across modalities.
- Versioning: Annotations evolve, so each training run should be pinned to a specific version.
- Streaming: Tensor-native to GPU.
- Branchable annotation: Reviewers land changes on branches.
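The "hybrid query" property above can be sketched in plain Python: filter rows with a predicate on one column, then rank the survivors by vector similarity on another. This is only an in-memory illustration of the idea, not the Deeplake query engine; the rows, the `hybrid_query` helper, and the two-dimensional embeddings are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy rows: one row per sample, with a label column and an embedding column.
rows = [
    {"id": "s1", "label": "dog", "embedding": [1.0, 0.0]},
    {"id": "s2", "label": "car", "embedding": [0.0, 1.0]},
    {"id": "s3", "label": "dog", "embedding": [0.6, 0.8]},
]


def hybrid_query(rows, label, query_vec, top_k=2):
    # Step 1: predicate on the annotation column.
    matches = [r for r in rows if r["label"] == label]
    # Step 2: rank what is left by similarity on the embedding column.
    matches.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return matches[:top_k]


result = hybrid_query(rows, "dog", [1.0, 0.0])
```

Both steps read from the same rows, which is why the predicate and the similarity ranking can run in a single pass instead of across two systems.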
Approaches teams try
What each gets you:
| Approach | S3 + DB + vector store | Tar shards (WebDataset) | Deeplake ★ |
|---|---|---|---|
| One row per sample | No | Per-tar | Native |
| Hybrid query | No | No | Yes |
| Versioned annotations | No | No | Branches |
| Streaming to GPU | DIY | Yes | Yes |
| Native multimodal | Per-system | Per-tar | Yes |
Reference architecture
One row per sample, every modality.
Sample i:
- video chunk
- image keyframes
- caption text
- embeddings
- annotation polygons
│
▼
Deeplake row (queryable, streamable)
Joins free; queries span modalities.
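As a rough sketch of the row layout in the diagram, the fields can be written down as a Python dataclass. The class and field names here are illustrative assumptions, not a Deeplake schema definition.

```python
from dataclasses import dataclass, field


@dataclass
class Polygon:
    label: str
    points: list  # list of (x, y) vertex tuples


@dataclass
class Sample:
    """One row: every modality for a single sample lives together."""
    sample_id: str
    video_uri: str        # video chunk
    keyframe_uris: list   # image keyframes
    caption: str          # caption text
    embedding: list       # embedding vector
    polygons: list = field(default_factory=list)  # annotation polygons


dataset = [
    Sample(
        sample_id="s1",
        video_uri="s3://videos/s1.mp4",
        keyframe_uris=["s3://frames/s1_000.jpg"],
        caption="a dog runs",
        embedding=[0.1, 0.9],
        polygons=[Polygon("dog", [(0, 0), (10, 0), (10, 10)])],
    )
]

# Column-style access across rows needs no join, because the row is the join.
captions = [s.caption for s in dataset]
labels = [p.label for s in dataset for p in s.polygons]
```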
Set it up
A few commands.
1. Install
pip install deeplake
2. Create the dataset
deeplake create deeplake://org/multimodal-corpus
3. Hybrid query
ds.query('select * where annotation.label=="X" and similar(text, embed("caption"))')
Where this usually breaks
- Per-modality stores: Joins burn cycles.
- JSON annotations: Not branchable, not queryable at scale.
- Vector DB silo: Embeddings live elsewhere; hybrid query needs both.
- Tar-only: Hard to relabel; flat schema.
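To illustrate what "branchable" annotations mean in contrast to flat JSON files, here is a toy copy-on-branch store in plain Python. It is a sketch of the workflow only (branch, edit, diff, merge after QA); the `AnnotationStore` class is hypothetical and says nothing about how Deeplake implements branches.

```python
import copy


class AnnotationStore:
    """Toy label store with named branches (sample_id -> label)."""

    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        # A reviewer works on a private copy; main is untouched.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def set_label(self, branch, sample_id, label):
        self.branches[branch][sample_id] = label

    def diff(self, branch, base="main"):
        # Labels that are new or changed on the branch relative to base.
        base_labels = self.branches[base]
        return {
            k: v for k, v in self.branches[branch].items()
            if base_labels.get(k) != v
        }

    def merge(self, branch, target="main"):
        # Land the reviewed changes once QA has signed off.
        self.branches[target].update(self.diff(branch, target))


store = AnnotationStore()
store.set_label("main", "s1", "dog")
store.branch("review")
store.set_label("review", "s1", "wolf")  # relabel
store.set_label("review", "s2", "car")   # new label
pending = store.diff("review")
main_before_merge = dict(store.branches["main"])
store.merge("review")
```

A training run that read `main` before the merge still saw the old labels, which is the property that pinning a run to a version relies on.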
FAQ
Video formats?
MP4, codec-aware; first-class column.
Annotation formats?
Bbox, polygon, mask, scalar.
Embeddings?
First-class vector column with index.
Reviewable annotation flow?
Branches; merge after QA.
Open source?
Yes.
Petabyte scale?
Yes.