I Built an AI-Powered Screenshot Search Engine
How I combined OCR, computer vision, and local LLMs to solve the "where the hell did I put that screenshot?" problem
We've all been there. You took a screenshot of some important code, an error message, or a design mockup weeks ago, and now you need to find it. You scroll through hundreds of files in your Screenshots folder, squinting at tiny thumbnails, trying to remember if it was the blue one or the green one. After 10 minutes of digital archaeology, you either find it by pure luck or give up and recreate whatever you needed.
I got so frustrated with this problem that I decided to build a solution: Screenshot Hub - an AI-powered screenshot search engine that lets you find images using natural language queries like "blue login screen," "error message," or "cat drawing." It combines multiple search techniques including OCR text extraction, color analysis, AI image captioning, and visual similarity matching.
The Problem: Screenshots Are Visual Data in a Text World
Modern operating systems treat screenshots like any other file - they're organized chronologically in folders with timestamps for names. But screenshots contain rich visual and textual information that traditional file systems ignore completely. When I need to find "that terminal window with the Docker error," I have to rely on my memory of when I took it and manually browse through files.
The problem gets worse as your screenshot collection grows. I had over 2,000 screenshots accumulated over two years, ranging from code snippets to design references to error messages. Finding anything specific became an exercise in futility.
What I wanted was simple: type what I'm looking for and get relevant results, just like searching through documents or emails. But screenshots aren't documents - they're images containing text, colors, objects, and visual layouts that need to be understood contextually.
The Solution: Multi-Modal AI Search
Screenshot Hub approaches this by extracting multiple types of information from each image and making it all searchable:
1. OCR Text Extraction
Using Tesseract OCR, it extracts any text content from screenshots. This handles the obvious case where you remember some of the text you're looking for. Search for "sudo apt install" and it finds all screenshots containing that command.
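Under the hood this is a thin wrapper around Tesseract. Here is a minimal sketch of the OCR step, assuming pytesseract and Pillow are installed; the real extract_text in Screenshot Hub may differ in its preprocessing:

from pathlib import Path
from PIL import Image
import pytesseract

def extract_text(image_path: Path) -> str:
    """Run Tesseract OCR on a screenshot and return the raw text."""
    with Image.open(image_path) as img:
        # Grayscale conversion tends to help Tesseract on UI screenshots
        return pytesseract.image_to_string(img.convert('L'))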
2. Color-Based Search
It analyzes the color composition of each image using HSV histograms. Search for "blue interface" and it finds screenshots dominated by blue colors. This is surprisingly useful for finding screenshots of specific applications or UI themes.
3. AI Image Captioning
This is where it gets interesting. Using local AI models (via Ollama), it generates semantic descriptions of what's actually shown in each screenshot. A screenshot of a terminal becomes searchable by "terminal window," "command line," or "black screen with white text." A code editor screenshot becomes findable via "programming," "Python code," or "text editor."
4. Visual Similarity (CLIP Embeddings)
Using OpenAI's CLIP model, it creates vector embeddings that capture visual similarity. This lets you find screenshots that look similar to your query, even if they don't share text or colors. It's particularly good at finding screenshots with similar layouts or UI patterns.
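For this to work, the text query has to be embedded into the same vector space as the images. A minimal sketch of the query side, assuming open_clip (the production code caches the model instead of reloading it, as shown later in this post):

import numpy as np
import open_clip
import torch

def extract_text_embedding(query: str) -> np.ndarray:
    """Embed a text query into the same vector space as the image embeddings."""
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    with torch.no_grad():
        tokens = tokenizer([query])
        features = model.encode_text(tokens)
        # L2 normalize so dot products equal cosine similarities
        features = features / features.norm(dim=-1, keepdim=True)

    return features.cpu().numpy().flatten()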
5. Fusion Ranking
The real magic happens when combining all these approaches. The "fusion" search mode runs all methods in parallel and intelligently weights the results:
def combined_search(query: str, limit: int = 20,
                    weights: Dict[str, float] = None) -> Tuple[List[Dict], str]:
    """Perform fusion search across all modalities."""
    # Default fusion weights
    if weights is None:
        weights = {
            'clip': 0.35,      # CLIP visual similarity
            'filename': 0.3,   # Filename fuzzy match
            'semantic': 0.25,  # AI captions/labels
            'text': 0.1        # OCR text
        }

    all_results = {}  # file_id -> result_data
    methods_used = []

    # Run the filename, semantic, and OCR searches
    for method in ['filename', 'semantic', 'text']:
        search_func = {
            'filename': search_filename,
            'semantic': search_semantic,
            'text': search_ocr_text
        }[method]

        results = search_func(query, limit * 2)
        if results:
            methods_used.append(method)
            for result in results:
                file_id = result['id']
                if file_id not in all_results:
                    all_results[file_id] = result.copy()
                    all_results[file_id]['scores'] = {}
                # Normalize scores to 0-1 range
                score = result.get('confidence', 0) / 100.0
                all_results[file_id]['scores'][method] = score

    # CLIP embedding search
    query_embedding = extract_text_embedding(query)
    if query_embedding is not None:
        clip_results = search_by_embedding(query_embedding, limit * 2)
        if clip_results:
            methods_used.append('clip')
            for result in clip_results:
                file_id = result['file_id']
                if file_id not in all_results:
                    file_data = get_file_by_id(file_id)
                    if not file_data:
                        continue
                    all_results[file_id] = file_data
                    all_results[file_id]['scores'] = {}
                # CLIP similarity is already 0-1 normalized
                all_results[file_id]['scores']['clip'] = result['similarity']

    # Calculate weighted fusion scores
    final_results = []
    for file_id, result in all_results.items():
        total_score = 0.0
        active_weight = 0.0
        for method, score in result.get('scores', {}).items():
            if method in weights and score > 0:
                total_score += weights[method] * score
                active_weight += weights[method]
        if active_weight > 0:
            result['fusion_score'] = total_score / active_weight
            final_results.append(result)

    # Sort by fusion score and return top results
    final_results.sort(key=lambda x: x['fusion_score'], reverse=True)
    return final_results[:limit], f"fusion ({', '.join(methods_used)})"
Default Weights:
- CLIP visual similarity: 35% weight (most important for layout/visual patterns)
- Filename matching: 30% weight (often very relevant)
- AI semantic content: 25% weight (good for content understanding)
- OCR text: 10% weight (specific but limited scope)
Technical Deep Dive
Architecture Choices
I built Screenshot Hub as a command-line tool in Python, optimizing for local processing and privacy. Here's why:
Local Processing: All AI inference happens on your machine. Your screenshots never leave your computer, which is crucial for privacy-conscious users (especially developers with potentially sensitive code/data).
SQLite + FAISS: Uses SQLite for metadata and full-text search, with a separate FAISS vector index for similarity search. This hybrid approach gives you the benefits of both relational data management and fast vector operations.
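In practice that means metadata, OCR text, and captions live in SQLite (with FTS5 for text search), while the CLIP vectors live in a FAISS index keyed by the SQLite row ids. Here is a minimal sketch of that split, with hypothetical table and helper names; the real schema and search_by_embedding in Screenshot Hub differ in detail:

import sqlite3
import faiss
import numpy as np

EMBED_DIM = 512  # ViT-B-32 CLIP embeddings

def build_vector_index(db_path: str) -> faiss.Index:
    """Load stored CLIP embeddings from SQLite and build a cosine-similarity index."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, embedding FROM files WHERE embedding IS NOT NULL"
    ).fetchall()
    conn.close()

    ids = np.array([row[0] for row in rows], dtype=np.int64)
    vectors = np.vstack([np.frombuffer(row[1], dtype=np.float32) for row in rows])

    # Embeddings are L2-normalized, so inner product equals cosine similarity
    index = faiss.IndexIDMap(faiss.IndexFlatIP(EMBED_DIM))
    index.add_with_ids(vectors, ids)
    return index

def search_vector_index(index: faiss.Index, query_vec: np.ndarray, limit: int = 20) -> list:
    """Return (file_id, similarity) results for the closest stored screenshots."""
    scores, ids = index.search(query_vec.reshape(1, -1).astype(np.float32), limit)
    return [{'file_id': int(i), 'similarity': float(s)}
            for i, s in zip(ids[0], scores[0]) if i != -1]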
# Intelligent search routing based on query analysis
def route_search(query: str, limit: int = 20) -> Tuple[List[Dict], str]:
    """Route search query to most appropriate method."""
    query = query.strip()

    # Try filename search first (often most relevant)
    filename_results = search_filename(query, limit)
    if filename_results:
        for result in filename_results:
            result['search_method'] = 'filename'
        return filename_results, "filename (fuzzy)"

    # Try semantic search (captions + labels)
    semantic_results = search_semantic(query, limit)
    if semantic_results:
        for result in semantic_results:
            result['search_method'] = 'semantic'
        return semantic_results, "semantic (AI captions + labels)"

    # Fall back to OCR text search
    ocr_results = search_ocr_text(query, limit)
    for result in ocr_results:
        result['search_method'] = 'text'
    return ocr_results, "text (OCR)"
# Robust database corruption handling
def handle_db_corruption(func):
    """Decorator to handle database corruption gracefully."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except sqlite3.DatabaseError as e:
            if ("malformed" in str(e).lower() or
                    "corrupt" in str(e).lower() or
                    "invalid fts5 file format" in str(e).lower()):
                logger.warning(f"Database corruption detected: {e}")
                logger.warning("Rebuilding database...")
                rebuild_database()
                # Retry the operation once after rebuild
                return func(*args, **kwargs)
            else:
                raise
    return wrapper
Modular Design: Each search method is implemented as a separate module, making it easy to add new techniques or swap out models. Want to try a different vision model? Just modify the caption module.
The AI Pipeline
The most interesting part is the AI captioning pipeline. Initially, I used BLIP-2 from Hugging Face, but it was slow and resource-heavy. The breakthrough came when I integrated Ollama for local LLM inference.
def encode_image_base64(image_path: Path) -> str:
    """Encode and optimize image for Ollama processing."""
    with Image.open(image_path) as img:
        # Convert to RGB and resize for efficiency
        if img.mode != 'RGB':
            img = img.convert('RGB')

        # Resize if too large (max 1024x1024 for most models)
        max_size = 1024
        if img.width > max_size or img.height > max_size:
            img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)

        # Convert to base64
        buffer = io.BytesIO()
        img.save(buffer, format='JPEG', quality=85)
        image_bytes = buffer.getvalue()
        return base64.b64encode(image_bytes).decode('utf-8')

def generate_caption_ollama(image_path: Path, model: str = "qwen2.5vl") -> Tuple[str, float]:
    """Generate AI caption using Ollama vision models."""
    try:
        # Load data extraction prompt from external file
        prompt = load_prompt_template("data_extraction")
        image_b64 = encode_image_base64(image_path)

        response = ollama.generate(
            model=model,
            prompt=prompt,
            images=[image_b64],
            options={'temperature': 0.1}  # Low temp for consistency
        )
        caption = response['response'].strip()

        # Try to parse as JSON for structured data extraction
        try:
            data = json.loads(caption)
            # Convert structured data back to searchable text
            caption = f"{data.get('description', '')} {' '.join(data.get('text_elements', []))}"
        except json.JSONDecodeError:
            # Fallback to raw caption
            pass

        return caption, 0.8  # High confidence for local models
    except Exception as e:
        logger.error(f"Ollama captioning failed: {e}")
        return "", 0.0
This structured approach extracts multiple layers of information from each screenshot, making them searchable across different dimensions.
Performance Optimizations
With thousands of screenshots, performance matters. Here's how I optimized the indexing pipeline:
Incremental Updates: Only processes new or modified files by comparing modification times and file hashes.
def get_file_stats(file_path: Path) -> Tuple[int, float, str]:
    """Get file size, modification time, and hash for change detection."""
    stat = file_path.stat()
    size = stat.st_size
    mtime = stat.st_mtime

    # Calculate hash for duplicate detection
    hasher = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)

    return size, mtime, hasher.hexdigest()

@handle_db_corruption
def should_process_file(file_path: Path) -> bool:
    """Check if file needs processing based on hash and mtime."""
    size, mtime, new_hash = get_file_stats(file_path)
    existing_file = get_file_by_path(str(file_path))

    if not existing_file:
        return True  # New file

    # Check if file changed
    return (existing_file['hash'] != new_hash or
            existing_file['mtime'] != mtime)
Parallel Processing: OCR, color analysis, and AI captioning run in parallel using ThreadPoolExecutor.
def parallel_extract_features(image_path: Path, config: dict) -> dict:
    """Extract all features in parallel for better performance."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Submit all extraction tasks
        futures = {}
        if config.get('enable_ocr', True):
            futures['ocr'] = executor.submit(extract_text, image_path)
        if config.get('enable_color', True):
            futures['color'] = executor.submit(extract_color_histogram, image_path)
        if config.get('enable_ai', True):
            futures['caption'] = executor.submit(generate_caption, image_path)
        if config.get('enable_clip', True):
            futures['embedding'] = executor.submit(extract_image_embedding, image_path)

        # Collect results, with a per-feature timeout
        results = {}
        for feature, future in futures.items():
            try:
                results[feature] = future.result(timeout=30)
            except Exception as e:
                logger.warning(f"Feature extraction failed for {feature}: {e}")
                results[feature] = None

    return results
Smart Caching: Models are loaded once and kept in memory during batch operations.
# Global model cache to avoid repeated loading
_clip_model = None
_clip_preprocess = None

def load_clip_model(model_name: str = "ViT-B-32") -> Tuple[Any, Any]:
    """Load CLIP model with caching and device optimization."""
    global _clip_model, _clip_preprocess
    if _clip_model is not None and _clip_preprocess is not None:
        return _clip_model, _clip_preprocess

    import open_clip
    import torch

    # Load model and preprocessing
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name,
        pretrained="openai"
    )

    # Set to evaluation mode
    model.eval()

    # Move to best available device
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = model.to(device)

    # Cache globally for reuse
    _clip_model = model
    _clip_preprocess = preprocess
    return model, preprocess

def extract_image_embedding(image_path: Path) -> Optional[np.ndarray]:
    """Extract CLIP embedding with cached model."""
    import torch

    model, preprocess = load_clip_model()  # Uses cached model
    device = next(model.parameters()).device

    with torch.no_grad():
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        features = model.encode_image(image)
        # L2 normalize for cosine similarity
        features = features / features.norm(dim=-1, keepdim=True)

    return features.cpu().numpy().flatten()
Progressive Indexing: You can start searching as soon as basic indexing (filename + OCR) is complete, while AI features continue processing in the background.
Color Analysis with Perceptual Accuracy:
def extract_color_histogram(image_path: Path) -> Tuple[np.ndarray, str]:
    """Extract HSV color histogram for color-based search."""
    img = cv2.imread(str(image_path))
    if img is None:
        return np.zeros(27), "#000000"  # 3x3x3 HSV bins

    # Convert BGR to HSV so hue, saturation, and brightness can be binned separately
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Calculate 3D histogram (3x3x3 bins for H,S,V)
    hist = cv2.calcHist(
        [hsv],
        [0, 1, 2],                # H, S, V channels
        None,
        [3, 3, 3],                # 3 bins each for H, S, V
        [0, 180, 0, 256, 0, 256]  # ranges
    )

    # Flatten and normalize to probability distribution
    hist_vector = hist.flatten()
    hist_vector = hist_vector / (hist_vector.sum() + 1e-7)

    # Extract dominant color using K-means
    dominant_color = get_dominant_color_kmeans(img)

    return hist_vector, dominant_color

def detect_color_words(query: str) -> List[str]:
    """Detect color terms in search queries."""
    color_patterns = {
        'red': ['red', 'crimson', 'scarlet'],
        'blue': ['blue', 'navy', 'azure', 'cyan'],
        'green': ['green', 'lime', 'forest', 'mint'],
        'yellow': ['yellow', 'gold', 'amber'],
        'orange': ['orange', 'coral', 'peach'],
        'purple': ['purple', 'violet', 'magenta'],
        'black': ['black', 'dark'],
        'white': ['white', 'light'],
        'gray': ['gray', 'grey', 'silver']
    }

    detected_colors = []
    query_lower = query.lower()
    for base_color, variants in color_patterns.items():
        if any(variant in query_lower for variant in variants):
            detected_colors.append(base_color)

    return detected_colors
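Once color words are detected, the stored histograms can be ranked against them. One way to do that is to compare each screenshot's histogram with the histogram of a solid-color swatch; the sketch below is illustrative (the REFERENCE_BGR table and scoring function are assumptions, not Screenshot Hub's actual ranking code):

import cv2
import numpy as np

# Rough BGR values for a few base colors (illustrative only)
REFERENCE_BGR = {
    'red': (0, 0, 255),
    'blue': (255, 0, 0),
    'green': (0, 255, 0),
}

def color_similarity(stored_hist: np.ndarray, color: str) -> float:
    """Histogram intersection between a screenshot and a solid-color swatch."""
    swatch = np.full((32, 32, 3), REFERENCE_BGR[color], dtype=np.uint8)
    hsv = cv2.cvtColor(swatch, cv2.COLOR_BGR2HSV)
    ref = cv2.calcHist([hsv], [0, 1, 2], None, [3, 3, 3],
                       [0, 180, 0, 256, 0, 256]).flatten()
    ref = ref / (ref.sum() + 1e-7)
    # 1.0 means the screenshot's color distribution matches the swatch exactly
    return float(np.minimum(stored_hist, ref).sum())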
Real-World Performance
After indexing my 2,000+ screenshot collection, here are some real performance numbers:
Indexing Speed:
- Basic (filename + OCR): ~2 seconds per screenshot
- With AI captioning: ~5-8 seconds per screenshot (depending on model)
- With CLIP embeddings: ~1 additional second per screenshot
Query Speed:
- Text/filename search: <100ms
- Color search: <200ms
- Semantic search: <150ms
- CLIP similarity: <300ms
- Fusion search: <500ms (all methods combined)
Storage Requirements:
- Database: ~50KB per screenshot
- FAISS vector index: ~2KB per screenshot
- Model files: ~400MB (CLIP) + variable (Ollama models)
What I Learned
1. Multi-modal Search Is Genuinely Better
I was skeptical about whether the complexity of multiple search methods would actually improve results. It does, dramatically. Single-method search often misses relevant results that other methods catch. The fusion ranking produces noticeably better results than any individual technique.
2. Local AI Is Ready for Production
Modern vision-language models running locally via Ollama are surprisingly capable. qwen2.5vl correctly identifies code editors, terminal windows, web browsers, and even specific programming languages in screenshots. The quality gap between local and cloud models has narrowed significantly.
3. Color Search Is Underrated
I almost skipped color-based search as a gimmick, but it turned out to be incredibly useful. Searching for "red error" or "blue interface" often finds exactly what you need when you remember the visual appearance but not the specific content.
4. Performance Tuning Matters
Early versions were too slow for daily use. Adding parallel processing, smart caching, and incremental updates made the difference between a neat demo and a practical tool.
Practical Usage Patterns
After using Screenshot Hub for several months, some interesting usage patterns emerged:
Debug Session Recovery: "Find that Redis error from last week" → searches semantic content for Redis-related screenshots
Design Reference: "Show me all the blue login screens" → combines color and semantic search to find UI examples
Code Archaeology: "Where did I screenshot that Python function?" → OCR text search for specific code patterns
Visual Similarity: "Find screenshots similar to this terminal layout" → CLIP embeddings find visually similar command-line interfaces
Future Directions
There's still room for improvement:
Better Context Understanding: Current AI models describe what they see but don't understand relationships between UI elements or the purpose of different screenshot types.
Temporal Queries: "Screenshots from when I was working on project X" requires combining search with timeline analysis.
Duplicate Detection: Using CLIP embeddings to identify and dedupe visually similar screenshots; a rough sketch of how this could work appears after this list.
Smart Collections: Automatically grouping related screenshots (e.g., all screenshots from the same debugging session).
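For duplicate detection in particular, the stored embeddings already contain most of what's needed. A minimal sketch, assuming L2-normalized CLIP embeddings and a hypothetical similarity threshold (this is not yet implemented in Screenshot Hub):

import numpy as np

def find_near_duplicates(embeddings: np.ndarray, file_ids: list,
                         threshold: float = 0.97) -> list:
    """Return pairs of file ids whose embeddings are nearly identical."""
    # With L2-normalized vectors, the dot product is the cosine similarity
    sims = embeddings @ embeddings.T
    pairs = []
    for i in range(len(file_ids)):
        for j in range(i + 1, len(file_ids)):
            if sims[i, j] >= threshold:
                pairs.append((file_ids[i], file_ids[j], float(sims[i, j])))
    return pairs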
Open Source and Privacy
Screenshot Hub is open source (MIT license) and designed with privacy as a core principle. All processing happens locally - no cloud APIs, no data collection, no telemetry. You can audit exactly what it does with your screenshots.
The tool is available at [github link] and works on macOS, Linux, and Windows. Installation is straightforward with uv/pip, and optional dependencies let you choose which AI features to enable.
Why This Matters
Screenshot search might seem like a niche problem, but it reflects a broader issue: we're generating more visual data than ever, but our tools for organizing and finding it haven't evolved. Screenshots, photos, designs, charts, and diagrams all contain rich information that current file systems treat as opaque binary blobs.
Screenshot Hub demonstrates what becomes possible when you treat images as queryable, searchable data. The same techniques could apply to photo libraries, design assets, document scanning, or any visual content management problem.
As AI models become more capable and accessible, we'll see more applications that understand and organize our digital lives in contextually meaningful ways. The future of file management isn't alphabetical sorting - it's semantic understanding.
Getting Started
If you want to try Screenshot Hub:
- Install with uv tool install screenshot-hub or pip install screenshot-hub
- Install Tesseract OCR for text extraction
- Optionally install Ollama for AI features
- Run hub index ~/Pictures/Screenshots to start indexing
- Search with hub search "your query"
The tool is designed to be useful immediately with just basic OCR, then gets more powerful as you add AI capabilities.