I Need to Curate Rare Edge Cases From a Huge AV Dataset for Retraining

Deeplake Team, Activeloop

TL;DR

Finding rare edge cases (pedestrians at night in rain, construction-zone merges, occluded cyclists) in petabyte-scale AV datasets requires semantic search over scene embeddings combined with metadata filtering. Deeplake lets you query with SQL plus vector similarity across video, LiDAR, and labels in a single GPU-native database, with no custom curation pipelines needed.

Overview

Retraining improves AV perception models most when you add the right edge cases, not simply more data. But finding "the right 1,000 scenes" in millions of hours of driving data is a data infrastructure problem. Traditional approaches (manual review, metadata-only queries, or running full model inference on every frame) are slow and expensive. The modern approach combines embedding-based semantic search with structured metadata filtering.
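
As a rough illustration of the combined approach, the sketch below applies a metadata filter and then ranks the surviving scenes by cosine similarity to a query embedding. It is plain Python over an in-memory list, not Deeplake's implementation; the `scene_id` field, the toy 3-dimensional vectors, and the `hybrid_search` helper are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(rows, query_vec, predicate, k):
    """Return the scene ids of the k rows most similar to query_vec among rows passing predicate."""
    # Structured filter first: keep only rows whose metadata matches.
    candidates = [row for row in rows if predicate(row)]
    # Then rank the survivors, most similar first.
    candidates.sort(
        key=lambda row: cosine_similarity(row["embedding"], query_vec),
        reverse=True,
    )
    return [row["scene_id"] for row in candidates[:k]]

# Toy data (hypothetical): real scene embeddings would be much larger.
rows = [
    {"scene_id": "s0", "weather": "rain", "time_of_day": "night", "embedding": [1.0, 0.0, 0.0]},
    {"scene_id": "s1", "weather": "clear", "time_of_day": "night", "embedding": [0.9, 0.1, 0.0]},
    {"scene_id": "s2", "weather": "rain", "time_of_day": "night", "embedding": [0.0, 1.0, 0.0]},
    {"scene_id": "s3", "weather": "rain", "time_of_day": "night", "embedding": [0.8, 0.2, 0.0]},
]
query = [1.0, 0.0, 0.0]  # stands in for the embedding of a known failure case

hits = hybrid_search(
    rows,
    query,
    predicate=lambda r: r["weather"] == "rain" and r["time_of_day"] == "night",
    k=2,
)
print(hits)
```

Filtering before ranking limits the similarity computation to rows that already satisfy the structured conditions, which is the same shape as a SQL `WHERE` clause followed by an `ORDER BY` on similarity.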

The Edge Case Discovery Workflow

1. Embed Scenes at Ingest Time

python
import deeplake
 
ds = deeplake.open("al://my-org/av-fleet-data")
 
# Raw sensor data
ds.add_column("camera_front", deeplake.types.Image())
ds.add_column("lidar", deeplake.types.Tensor(dtype="float32"))

# 512-dimensional scene embedding used for semantic (vector similarity) search
ds.add_column("scene_embedding", deeplake.types.Embedding(512))

# Labels and structured metadata used for filtering
ds.add_column("labels", deeplake.types.Json())
ds.add_column("weather", deeplake.types.Text())
ds.add_column("time_of_day", deeplake.types.Text())
ds.add_column("location", deeplake.types.Text())
ds.add_column("model_confidence", deeplake.types.Float32())
ds.add_column("timestamp", deeplake.types.Int64())
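
The schema above reserves a `scene_embedding` column but does not show how a scene-level vector is produced. One common option, sketched here purely as an assumption, is to mean-pool per-frame embeddings from whatever vision encoder you already use and L2-normalize the result. The helpers `scene_embedding` and `make_row` are hypothetical, the image, LiDAR, and label columns are omitted for brevity, and writing the row into the dataset is left to the Deeplake documentation.

```python
import math

def scene_embedding(frame_embeddings):
    """Mean-pool per-frame vectors, then L2-normalize the pooled vector."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    dim = len(frame_embeddings[0])
    count = len(frame_embeddings)
    mean = [sum(vec[i] for vec in frame_embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector unchanged rather than dividing by zero.
    return [x / norm for x in mean] if norm else mean

def make_row(frame_embeddings, weather, time_of_day, location, model_confidence, timestamp):
    """Build a dict keyed by the metadata and embedding column names from the schema."""
    return {
        "scene_embedding": scene_embedding(frame_embeddings),
        "weather": weather,
        "time_of_day": time_of_day,
        "location": location,
        "model_confidence": model_confidence,
        "timestamp": timestamp,
    }

# Toy 2-d frame embeddings; the schema above expects 512 dimensions.
row = make_row(
    frame_embeddings=[[3.0, 0.0], [3.0, 8.0]],
    weather="rain",
    time_of_day="night",
    location="example-intersection",  # hypothetical value
    model_confidence=0.55,
    timestamp=1700000000,
)
print(row["scene_embedding"])
```

Normalizing at ingest time means cosine similarity reduces to a dot product at query time, which is one reason to compute the embedding once when the scene is stored rather than per query.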

2. Find Edge Cases with Hybrid Queries

python
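# :pedestrian_crossing_vec and :merge_scenario_vec are placeholders for query
# embeddings (for example, the scene_embedding of a known failure case); they
# must be bound or inlined before the query is run.

# Find scenes similar to a known failure + matching conditions
rare_pedestrians = ds.query("""
    SELECT camera_front, lidar, labels, model_confidence, location
    FROM av_fleet_data
    WHERE weather = 'rain'
    AND time_of_day = 'night'
    AND model_confidence < 0.7
    ORDER BY cosine_similarity(scene_embedding, :pedestrian_crossing_vec) DESC
    LIMIT 200
""")
 
# Find construction zone scenes the model struggles with
# DESC puts the most similar scenes first; the default ascending order
# would return the least similar ones.
construction = ds.query("""
    SELECT camera_front, lidar, labels, location
    FROM av_fleet_data
    WHERE labels->>'has_construction' = 'true'
    AND model_confidence < 0.6
    ORDER BY cosine_similarity(scene_embedding, :merge_scenario_vec) DESC
    LIMIT 100
""")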

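The queries above refer to query vectors through named placeholders without showing how they are supplied. If your query engine has no parameter binding, one workaround is to inline each vector as an array literal before sending the query. The sketch below assumes the engine accepts an `ARRAY[...]` literal in that position, which you should confirm against the Deeplake query documentation; `to_array_literal` and `inline_vectors` are hypothetical helpers, and the substitution is deliberately naive (it would also rewrite a `:name` that appears inside a string literal).

```python
import re

def to_array_literal(vec):
    """Render a vector as an ARRAY[...] literal (assumed syntax, verify for your engine)."""
    return "ARRAY[" + ", ".join(repr(float(x)) for x in vec) + "]"

def inline_vectors(sql, vectors):
    """Replace each :name placeholder in sql with the array literal for vectors[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in vectors:
            raise KeyError(f"no vector supplied for placeholder :{name}")
        return to_array_literal(vectors[name])
    return re.sub(r":([A-Za-z_][A-Za-z0-9_]*)", substitute, sql)

sql = (
    "SELECT camera_front FROM av_fleet_data "
    "WHERE weather = 'rain' "
    "ORDER BY cosine_similarity(scene_embedding, :query_vec) DESC LIMIT 5"
)
bound = inline_vectors(sql, {"query_vec": [0.5, 0.25]})
print(bound)
```

Only inline vectors that your own code produced; values from untrusted input should go through real parameter binding if the engine offers it.
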
3. Create a Curated Training Branch

python
# Branch for the new training set; branching does not duplicate the underlying data
retrain_branch = ds.branch("retrain-v4-edge-cases")
 
# Tag each curated result set on the training branch
retrain_branch.tag(rare_pedestrians, "rare-pedestrian-night-rain")
retrain_branch.tag(construction, "construction-merge")
 
# Stream directly to GPU training
dataloader = retrain_branch.dataloader().pytorch(batch_size=16, num_workers=8)
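
When several curated result sets feed one training branch, the same scene can match more than one query. The sketch below shows one way to keep such a scene once while recording every tag it earned. It is plain Python and independent of Deeplake; the scene ids and the `build_manifest` helper are hypothetical.

```python
def build_manifest(tagged_results):
    """Map each scene id to the list of tags whose result set contains it."""
    manifest = {}
    for tag, scene_ids in tagged_results.items():
        for scene_id in scene_ids:
            tags = manifest.setdefault(scene_id, [])
            # Guard against a scene id repeated within one result set.
            if tag not in tags:
                tags.append(tag)
    return manifest

# Hypothetical ids: scene s2 is returned by both curation queries.
manifest = build_manifest({
    "rare-pedestrian-night-rain": ["s1", "s2"],
    "construction-merge": ["s2", "s3"],
})
for scene_id, tags in manifest.items():
    print(scene_id, tags)
```

Keeping one entry per scene avoids silently over-weighting scenes that happen to satisfy several edge-case definitions.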

Why Deeplake for AV Curation

| Curation Need | Traditional Tools | Deeplake |
| --- | --- | --- |
| Semantic scene search | Run model on every frame (days) | Pre-computed embeddings, instant query |
| Metadata + semantic hybrid | Two separate systems | One SQL query |
| Version curated subsets | Manual file copies | Lightweight branches |
| Stream to training | Export → S3 → dataloader | Direct GPU streaming |
| Scale | Custom distributed pipeline | Serverless, GPU-native |
