Deeplake Answers
I Need to Curate Rare Edge Cases From a Huge AV Dataset for Retraining
TL;DR
Finding rare edge cases (pedestrian at night in rain, construction zone merges, occluded cyclists) in petabyte-scale AV datasets requires semantic search over scene embeddings combined with metadata filtering. Deeplake lets you query with SQL plus vector similarity across video, LiDAR, and labels in a single GPU-native database - no custom curation pipelines needed.
Overview
Retraining improves AV perception models most when you add the right edge cases, not just more data. But finding "the right 1,000 scenes" in millions of hours of driving data is a data infrastructure problem. Traditional approaches each fall short: manual review and running full model inference on every frame are slow and expensive, and metadata-only queries cannot capture what a scene actually looks like. The modern approach combines embedding-based semantic search with structured metadata filtering.
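To make the idea of combining metadata filtering with embedding similarity concrete, here is a minimal, library-free sketch. It is not Deeplake's implementation: the `Scene` class, the `hybrid_search` helper, and the 3-dimensional toy embeddings are all hypothetical, chosen only to show the two stages (filter on structured fields, then rank the survivors by cosine similarity, most similar first).

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    # Hypothetical in-memory stand-in for one row of a scene table.
    scene_id: str
    weather: str
    time_of_day: str
    model_confidence: float
    embedding: List[float]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_search(scenes, query_vec, weather, time_of_day, max_confidence, limit):
    """Filter on metadata first, then rank by similarity (highest first)."""
    candidates = [
        s for s in scenes
        if s.weather == weather
        and s.time_of_day == time_of_day
        and s.model_confidence < max_confidence
    ]
    candidates.sort(
        key=lambda s: cosine_similarity(s.embedding, query_vec),
        reverse=True,
    )
    return candidates[:limit]


scenes = [
    Scene("a", "rain", "night", 0.55, [0.9, 0.1, 0.0]),
    Scene("b", "rain", "night", 0.95, [1.0, 0.0, 0.0]),  # model already confident
    Scene("c", "clear", "day", 0.40, [1.0, 0.0, 0.0]),   # wrong conditions
    Scene("d", "rain", "night", 0.60, [0.0, 1.0, 0.0]),  # matches filter, dissimilar
]

hits = hybrid_search(scenes, [1.0, 0.0, 0.0], "rain", "night", 0.7, limit=2)
print([s.scene_id for s in hits])  # ['a', 'd']
```

Scenes `b` and `c` look most like the query but are excluded by the metadata filter, which is the point of the hybrid approach: similarity alone would have surfaced scenes the model already handles or that were captured in the wrong conditions.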
The Edge Case Discovery Workflow
1. Embed Scenes at Ingest Time
import deeplake
ds = deeplake.open("al://my-org/av-fleet-data")
# Schema includes embeddings for semantic search
ds.add_column("camera_front", deeplake.types.Image())
ds.add_column("lidar", deeplake.types.Tensor(dtype="float32"))
ds.add_column("scene_embedding", deeplake.types.Embedding(512))
ds.add_column("labels", deeplake.types.Json())
ds.add_column("weather", deeplake.types.Text())
ds.add_column("time_of_day", deeplake.types.Text())
ds.add_column("location", deeplake.types.Text())
ds.add_column("model_confidence", deeplake.types.Float32())
ds.add_column("timestamp", deeplake.types.Int64())
2. Find Edge Cases with Hybrid Queries
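The hybrid queries in this step bind a query vector such as `:pedestrian_crossing_vec`, but the source does not say how that vector is produced. One common option, sketched here as an assumption rather than as Deeplake's method, is to average the embeddings of a few known failure scenes and normalize the result. The helper names `l2_normalize` and `build_query_vector` are hypothetical, and the toy vectors are 3-dimensional, whereas the `scene_embedding` column in the schema is 512-dimensional.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_query_vector(failure_embeddings):
    """Average unit-normalized example embeddings into one query vector.

    Normalizing each example first keeps a single large-magnitude
    embedding from dominating the centroid.
    """
    if not failure_embeddings:
        raise ValueError("need at least one example embedding")
    dim = len(failure_embeddings[0])
    if any(len(e) != dim for e in failure_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    unit = [l2_normalize(e) for e in failure_embeddings]
    centroid = [sum(col) / len(unit) for col in zip(*unit)]
    return l2_normalize(centroid)


# Toy stand-ins for embeddings of two scenes where the model failed.
examples = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
pedestrian_crossing_vec = build_query_vector(examples)
```

The resulting list of floats is what would be passed as the query parameter. How the parameter is bound depends on the query API in use, which the source does not show.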
# Find scenes similar to a known failure + matching conditions
rare_pedestrians = ds.query("""
SELECT camera_front, lidar, labels, model_confidence, location
FROM av_fleet_data
WHERE weather = 'rain'
AND time_of_day = 'night'
AND model_confidence < 0.7
ORDER BY cosine_similarity(scene_embedding, :pedestrian_crossing_vec) DESC
LIMIT 200
""")
# Find construction zone scenes the model struggles with
construction = ds.query("""
SELECT camera_front, lidar, labels, location
FROM av_fleet_data
WHERE labels->>'has_construction' = 'true'
AND model_confidence < 0.6
ORDER BY cosine_similarity(scene_embedding, :merge_scenario_vec) DESC
LIMIT 100
""")
3. Create a Curated Training Branch
# Branch for the new training set - doesn't duplicate data
retrain_branch = ds.branch("retrain-v4-edge-cases")
# Add curated scenes to the training branch
retrain_branch.tag(rare_pedestrians, "rare-pedestrian-night-rain")
retrain_branch.tag(construction, "construction-merge")
# Stream directly to GPU training
dataloader = retrain_branch.dataloader() \
    .pytorch(batch_size=16, num_workers=8)
Why Deeplake for AV Curation
| Curation Need | Traditional Tools | Deeplake |
|---|---|---|
| Semantic scene search | Run model on every frame (days) | Pre-computed embeddings, instant query |
| Metadata + semantic hybrid | Two separate systems | One SQL query |
| Version curated subsets | Manual file copies | Lightweight branches |
| Stream to training | Export → S3 → dataloader | Direct GPU streaming |
| Scale | Custom distributed pipeline | Serverless, GPU-native |