Video Preprocessing: Foundations of Semantic Search

How we prep videos for our AI-video gen assistant

Oct 29, 2025

A guest post from Adithya Thayyil on WAT.ai’s ClipABit team!

You have a video editor with 200 files of raw footage. You need that exact 3-second clip where someone opens the fridge. Clicking through files manually? That’s your whole afternoon gone.

We’re building ClipABit, a semantic search engine that integrates directly with video editors. Upload your footage, query with natural language (“show me fridge scenes”), get results in seconds. But before any embedding or cross-attention happens, there’s a massive preprocessing problem:

1080p @ 30fps = 100 MB/minute

A typical editor’s project folder: 50+ hours of footage = 300 GB. You can’t embed every frame (that’s 5.4 million frames), you can’t store them all, and half are blurry B-roll anyway.
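For anyone who wants to check the math, here is a quick back-of-envelope sketch. The inputs are the figures quoted above; the only assumption is 1 GB = 1000 MB.

```python
# Back-of-envelope numbers from the text above.
MB_PER_MINUTE = 100     # approximate size of 1080p @ 30 fps footage
HOURS_OF_FOOTAGE = 50   # a typical project folder
FPS = 30

total_minutes = HOURS_OF_FOOTAGE * 60
total_gb = total_minutes * MB_PER_MINUTE / 1000  # assuming 1 GB = 1000 MB
total_frames = HOURS_OF_FOOTAGE * 3600 * FPS

print(f"{total_gb:.0f} GB, {total_frames / 1e6:.1f} million frames")
# prints "300 GB, 5.4 million frames"
```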

Preprocessing solves this! No transformers, no attention mechanisms, but it’s what makes the semantic search engine actually work.

The Three Problems We Solve

Chunking: Where do moments start and end?

Video is continuous. Search needs discrete chunks. How do you split “cooking dinner” from “eating dinner”?

Approaches tried:

Static chunking (every 10 seconds): Simple but dumb. Splits activities mid-action. Your “pouring coffee” search hits two chunks: “pour—” and “—ing coffee.”
Scene detection (PySceneDetect): Uses histogram differences between consecutive frames to detect cuts. Works great for finding semantically meaningful boundaries, but chunk lengths are unpredictable: you get 2-second chunks and 60-second chunks. Still critical for aligning chunks with actual content changes [1].
Hybrid (scene detection + constraints): Our winner. Detect scene changes but enforce 5–20 second limits. Semantic boundaries without chaos.
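To make the hybrid idea concrete, here is a minimal sketch of enforcing the 5–20 second limits on top of a list of detected cut timestamps. The merge-and-split policy and the names `hybrid_chunks`, `MIN_LEN`, and `MAX_LEN` are assumptions for illustration, not necessarily what ClipABit does; the cut timestamps would come from a scene detector such as PySceneDetect.

```python
import math
from typing import List, Tuple

MIN_LEN = 5.0   # seconds, lower limit from the text
MAX_LEN = 20.0  # seconds, upper limit from the text


def hybrid_chunks(cuts: List[float], duration: float) -> List[Tuple[float, float]]:
    """Turn scene-cut timestamps into (start, end) chunks of 5-20 seconds.

    Hypothetical policy: ignore any cut that would create a chunk shorter
    than MIN_LEN, then split any span longer than MAX_LEN into equal pieces.
    A video shorter than MIN_LEN still yields one (short) chunk.
    """
    if duration <= 0:
        return []

    # Keep only cuts that are at least MIN_LEN after the previous boundary.
    boundaries = [0.0]
    for cut in sorted(cuts):
        if 0.0 < cut < duration and cut - boundaries[-1] >= MIN_LEN:
            boundaries.append(cut)

    # If the tail would be too short, merge it into the previous chunk.
    if len(boundaries) > 1 and duration - boundaries[-1] < MIN_LEN:
        boundaries.pop()
    boundaries.append(duration)

    # Split spans that are too long into equal pieces of at most MAX_LEN.
    chunks = []
    for start, end in zip(boundaries, boundaries[1:]):
        pieces = max(1, math.ceil((end - start) / MAX_LEN))
        step = (end - start) / pieces
        for i in range(pieces):
            piece_end = end if i == pieces - 1 else start + (i + 1) * step
            chunks.append((start + i * step, piece_end))
    return chunks


# Cuts at 2 s and 9 s are too close to a previous boundary and get ignored;
# the 43-second span between 7 s and 50 s is split into three pieces.
chunks = hybrid_chunks([2.0, 7.0, 9.0, 50.0, 58.0], duration=60.0)
print(len(chunks))  # prints 5
```

Splitting long spans into equal pieces is the simplest choice; one could instead look for the weakest sub-threshold scene change inside the span and split there.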

Result: ~20,000 semantic chunks for 100 hours, all 5–20 seconds long. Matches what VideoCLIP and similar video-text models expect [2].

Frame Selection: Which frames actually matter?

30 fps = 300 frames per 10-second chunk. If someone’s sitting still, those 300 frames are basically identical. Waste of storage and compute.

Approaches:

Dense sampling (1 frame/sec): Simple baseline, 10 frames per chunk.
Adaptive sampling (0.5–2 fps based on complexity): Our winner. Analyze the scene: if static (sleeping), sample 0.5 fps. If dynamic (cooking), sample 2 fps.

Example:

- Sleeping: complexity = 0.15 → 0.5 fps → 5 frames/10s
- Cooking: complexity = 0.75 → 1.8 fps → 18 frames/10s
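Here is one hypothetical way to map a complexity score to a sampling rate. The post does not spell out the actual formula or how complexity is computed, so treat this as a sketch: the score is taken as given, and the clamped linear mapping and its `slope` of 2.4 are assumptions, picked only because they reproduce the two examples above.

```python
def sampling_fps(complexity: float, slope: float = 2.4,
                 min_fps: float = 0.5, max_fps: float = 2.0) -> float:
    """Clamp a linear function of complexity into the 0.5-2 fps range.

    The slope is a hypothetical value, not taken from the ClipABit code.
    """
    return min(max_fps, max(min_fps, slope * complexity))


def frames_for_chunk(complexity: float, duration_s: float = 10.0) -> int:
    """Number of frames to keep from a chunk of the given length."""
    return round(sampling_fps(complexity) * duration_s)


print(frames_for_chunk(0.15))  # sleeping: prints 5
print(frames_for_chunk(0.75))  # cooking: prints 18
```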

Result: Significant storage savings vs dense sampling, same search quality. This matters for downstream CLIP-based video encoders [3].

Quality Filtering: Remove the junk

Not all frames are usable. Some are blurry, overexposed, or too dark.

Blur detection (Laplacian variance):
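A minimal, dependency-free sketch of the idea: apply a Laplacian (edge) filter and measure how much the response varies. Sharp frames have strong edges and high variance; blurry frames do not. The threshold of 100 is an assumed starting point that needs tuning per dataset, and the function names are hypothetical. In practice the usual one-liner is OpenCV's `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
from statistics import pvariance
from typing import List


def laplacian_variance(gray: List[List[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a grayscale frame as rows of 0-255 intensities.
    """
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("frame must be at least 3x3 pixels")
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def is_blurry(gray: List[List[float]], threshold: float = 100.0) -> bool:
    """Flag a frame as blurry; the default threshold is an assumption."""
    return laplacian_variance(gray) < threshold


# A checkerboard is all edges; a flat gray frame has none.
sharp = [[255.0 if (x + y) % 2 else 0.0 for x in range(8)] for y in range(8)]
flat = [[128.0] * 8 for _ in range(8)]
print(is_blurry(sharp), is_blurry(flat))  # prints "False True"
```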

Exposure check (mean brightness):
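A minimal sketch of the check: average the grayscale intensities and reject frames that fall outside a usable range. The cutoffs of 40 and 220 and the names below are assumptions for illustration, not the values ClipABit uses.

```python
from statistics import fmean
from typing import List


def mean_brightness(gray: List[List[float]]) -> float:
    """Average intensity of a grayscale frame with 0-255 pixel values."""
    return fmean(pixel for row in gray for pixel in row)


def exposure_label(gray: List[List[float]],
                   dark_below: float = 40.0,
                   bright_above: float = 220.0) -> str:
    """Classify a frame by mean brightness; both cutoffs are assumptions."""
    brightness = mean_brightness(gray)
    if brightness < dark_below:
        return "too_dark"
    if brightness > bright_above:
        return "overexposed"
    return "ok"


night = [[10.0] * 4 for _ in range(4)]
blown_out = [[250.0] * 4 for _ in range(4)]
normal = [[128.0] * 4 for _ in range(4)]
print(exposure_label(night), exposure_label(blown_out), exposure_label(normal))
# prints "too_dark overexposed ok"
```

Mean brightness alone misses frames that are half black and half white; a contrast measure such as the standard deviation of intensities would be a natural companion check.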

Result: Filter out ~19% of extracted frames (blurry, bad exposure, low contrast).

Production Pipeline

[Figure: flow chart of the pipeline]

Try It Yourself!

We have an interactive demo :)

You can:

Upload your own videos: Test preprocessing on real footage (mp4, avi, mov, mkv)
Compare chunking strategies: See how static intervals vs scene detection vs hybrid affect your content
Experiment with frame selection: Toggle between keyframe, dense, and adaptive sampling to find the sweet spot
Visualize the pipeline: Interactive timeline shows exactly which frames get selected from each chunk

Or run it locally:

git clone https://github.com/ClipABit/preproc-research.git
cd preproc-research
uv sync
uv run streamlit run app.py

References

[1] PySceneDetect: Intelligent Scene Detection for Videos
Brandon Castellano. GitHub Repository. https://github.com/Breakthrough/PySceneDetect
Open-source tool for automatic scene boundary detection using content-aware algorithms (histogram differences, adaptive thresholding).

[2] VideoCLIP: Contrastive Pre-Training for Zero-shot Video-Text Understanding
Xu et al. EMNLP 2021.
https://arxiv.org/abs/2109.14084
Demonstrates effective video-text alignment using 5-20 second temporal chunks with sparse frame sampling for efficient multimodal learning.

[3] CLIP: Learning Transferable Visual Models From Natural Language Supervision
Radford et al. OpenAI, 2021.
https://arxiv.org/abs/2103.00020
Foundation model for image-text understanding. Frame selection strategies optimize the trade-off between coverage and computational efficiency for CLIP-based video encoders.

WAT.ai Blog
