I Built an AI-Powered Screenshot Search Engine
How I combined OCR, computer vision, and local LLMs to solve the "where the hell did I put that screenshot?" problem
We've all been there. You took a screenshot of some important code, an error message, or a design mockup weeks ago, and now you need to find it. You scroll through hundreds of files in your Screenshots folder, squinting at tiny thumbnails, trying to remember if it was the blue one or the green one. After 10 minutes of digital archaeology, you either find it by pure luck or give up and recreate whatever you needed.
I got so frustrated with this problem that I decided to build a solution: Screenshot Hub - an AI-powered screenshot search engine that lets you find images using natural language queries like "blue login screen," "error message," or "cat drawing." It combines multiple search techniques including OCR text extraction, color analysis, AI image captioning, and visual similarity matching.
The Problem: Screenshots Are Visual Data in a Text World
Modern operating systems treat screenshots like any other file - they're organized chronologically in folders with timestamps for names. But screenshots contain rich visual and textual information that traditional file systems ignore completely. When I need to find "that terminal window with the Docker error," I have to rely on my memory of when I took it and manually browse through files.
The problem gets worse as your screenshot collection grows. I had over 2,000 screenshots accumulated over two years, ranging from code snippets to design references to error messages. Finding anything specific became an exercise in futility.
What I wanted was simple: type what I'm looking for and get relevant results, just like searching through documents or emails. But screenshots aren't documents - they're images containing text, colors, objects, and visual layouts that need to be understood contextually.
The Solution: Multi-Modal AI Search
Screenshot Hub approaches this by extracting multiple types of information from each image and making it all searchable:
1. OCR Text Extraction
Using Tesseract OCR, it extracts any text content from screenshots. This handles the obvious case where you remember some of the text you're looking for. Search for "sudo apt install" and it finds all screenshots containing that command.
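Under the hood this is a thin wrapper around Tesseract. Here is a minimal sketch of the OCR step, assuming pytesseract and Pillow are installed; the real extract_text in Screenshot Hub may differ in its preprocessing:

from pathlib import Path
from PIL import Image
import pytesseract

def extract_text(image_path: Path) -> str:
    """Run Tesseract OCR on a screenshot and return the raw text."""
    with Image.open(image_path) as img:
        # Grayscale conversion tends to help Tesseract on UI screenshots
        return pytesseract.image_to_string(img.convert('L'))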
2. Color-Based Search
It analyzes the color composition of each image using HSV histograms. Search for "blue interface" and it finds screenshots dominated by blue colors. This is surprisingly useful for finding screenshots of specific applications or UI themes.
3. AI Image Captioning
This is where it gets interesting. Using local AI models (via Ollama), it generates semantic descriptions of what's actually shown in each screenshot. A screenshot of a terminal becomes searchable by "terminal window," "command line," or "black screen with white text." A code editor screenshot becomes findable via "programming," "Python code," or "text editor."
4. Visual Similarity (CLIP Embeddings)
Using OpenAI's CLIP model, it creates vector embeddings that capture visual similarity. This lets you find screenshots that look similar to your query, even if they don't share text or colors. It's particularly good at finding screenshots with similar layouts or UI patterns.
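For this to work, the text query has to be embedded into the same vector space as the images. A minimal sketch of the query side, assuming open_clip (the production code caches the model instead of reloading it, as shown later in this post):

import numpy as np
import open_clip
import torch

def extract_text_embedding(query: str) -> np.ndarray:
    """Embed a text query into the same vector space as the image embeddings."""
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    with torch.no_grad():
        tokens = tokenizer([query])
        features = model.encode_text(tokens)
        # L2 normalize so dot products equal cosine similarities
        features = features / features.norm(dim=-1, keepdim=True)

    return features.cpu().numpy().flatten()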
5. Fusion Ranking
The real magic happens when combining all these approaches. The "fusion" search mode runs all methods in parallel and intelligently weights the results:
def combined_search(query: str, limit: int = 20,
                    weights: Dict[str, float] = None) -> Tuple[List[Dict], str]:
    """Perform fusion search across all modalities."""
    # Default fusion weights
    if weights is None:
        weights = {
            'clip': 0.35,      # CLIP visual similarity
            'filename': 0.3,   # Filename fuzzy match
            'semantic': 0.25,  # AI captions/labels
            'text': 0.1        # OCR text
        }

    all_results = {}  # file_id -> result_data
    methods_used = []

    # Run the filename, semantic, and OCR searches
    for method in ['filename', 'semantic', 'text']:
        search_func = {
            'filename': search_filename,
            'semantic': search_semantic,
            'text': search_ocr_text
        }[method]

        results = search_func(query, limit * 2)
        if results:
            methods_used.append(method)
            for result in results:
                file_id = result['id']
                if file_id not in all_results:
                    all_results[file_id] = result.copy()
                    all_results[file_id]['scores'] = {}
                # Normalize scores to 0-1 range
                score = result.get('confidence', 0) / 100.0
                all_results[file_id]['scores'][method] = score

    # CLIP embedding search
    query_embedding = extract_text_embedding(query)
    if query_embedding is not None:
        clip_results = search_by_embedding(query_embedding, limit * 2)
        if clip_results:
            methods_used.append('clip')
            for result in clip_results:
                file_id = result['file_id']
                if file_id not in all_results:
                    file_data = get_file_by_id(file_id)
                    if not file_data:
                        continue
                    all_results[file_id] = file_data
                    all_results[file_id]['scores'] = {}
                # CLIP similarity is already 0-1 normalized
                all_results[file_id]['scores']['clip'] = result['similarity']

    # Calculate weighted fusion scores
    final_results = []
    for file_id, result in all_results.items():
        total_score = 0.0
        active_weight = 0.0
        for method, score in result.get('scores', {}).items():
            if method in weights and score > 0:
                total_score += weights[method] * score
                active_weight += weights[method]
        if active_weight > 0:
            result['fusion_score'] = total_score / active_weight
            final_results.append(result)

    # Sort by fusion score and return top results
    final_results.sort(key=lambda x: x['fusion_score'], reverse=True)
    return final_results[:limit], f"fusion ({', '.join(methods_used)})"
Default Weights:
- CLIP visual similarity: 35% weight (most important for layout/visual patterns)
- Filename matching: 30% weight (often very relevant)
- AI semantic content: 25% weight (good for content understanding)
- OCR text: 10% weight (specific but limited scope)
Technical Deep Dive
Architecture Choices
I built Screenshot Hub as a command-line tool in Python, optimizing for local processing and privacy. Here's why:
Local Processing: All AI inference happens on your machine. Your screenshots never leave your computer, which is crucial for privacy-conscious users (especially developers with potentially sensitive code/data).
SQLite + FAISS: Uses SQLite for metadata and full-text search, with a separate FAISS vector index for similarity search. This hybrid approach gives you the benefits of both relational data management and fast vector operations.
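In practice that means metadata, OCR text, and captions live in SQLite (with FTS5 for text search), while the CLIP vectors live in a FAISS index keyed by the SQLite row ids. Here is a minimal sketch of that split, with hypothetical table and helper names; the real schema and search_by_embedding in Screenshot Hub differ in detail:

import sqlite3
import faiss
import numpy as np

EMBED_DIM = 512  # ViT-B-32 CLIP embeddings

def build_vector_index(db_path: str) -> faiss.Index:
    """Load stored CLIP embeddings from SQLite and build a cosine-similarity index."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, embedding FROM files WHERE embedding IS NOT NULL"
    ).fetchall()
    conn.close()

    ids = np.array([row[0] for row in rows], dtype=np.int64)
    vectors = np.vstack([np.frombuffer(row[1], dtype=np.float32) for row in rows])

    # Embeddings are L2-normalized, so inner product equals cosine similarity
    index = faiss.IndexIDMap(faiss.IndexFlatIP(EMBED_DIM))
    index.add_with_ids(vectors, ids)
    return index

def search_vector_index(index: faiss.Index, query_vec: np.ndarray, limit: int = 20) -> list:
    """Return (file_id, similarity) results for the closest stored screenshots."""
    scores, ids = index.search(query_vec.reshape(1, -1).astype(np.float32), limit)
    return [{'file_id': int(i), 'similarity': float(s)}
            for i, s in zip(ids[0], scores[0]) if i != -1]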
# Intelligent search routing based on query analysis
def route_search(query: str, limit: int = 20) -> Tuple[List[Dict], str]:
    """Route search query to most appropriate method."""
    query = query.strip()

    # Try filename search first (often most relevant)
    filename_results = search_filename(query, limit)
    if filename_results:
        for result in filename_results:
            result['search_method'] = 'filename'
        return filename_results, "filename (fuzzy)"

    # Try semantic search (captions + labels)
    semantic_results = search_semantic(query, limit)
    if semantic_results:
        for result in semantic_results:
            result['search_method'] = 'semantic'
        return semantic_results, "semantic (AI captions + labels)"

    # Fall back to OCR text search
    ocr_results = search_ocr_text(query, limit)
    for result in ocr_results:
        result['search_method'] = 'text'
    return ocr_results, "text (OCR)"
# Robust database corruption handling
def handle_db_corruption(func):
    """Decorator to handle database corruption gracefully."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except sqlite3.DatabaseError as e:
            if ("malformed" in str(e).lower() or
                    "corrupt" in str(e).lower() or
                    "invalid fts5 file format" in str(e).lower()):
                logger.warning(f"Database corruption detected: {e}")
                logger.warning("Rebuilding database...")
                rebuild_database()
                # Retry the operation once after rebuild
                return func(*args, **kwargs)
            else:
                raise
    return wrapper
Modular Design: Each search method is implemented as a separate module, making it easy to add new techniques or swap out models. Want to try a different vision model? Just modify the caption module.
The AI Pipeline
The most interesting part is the AI captioning pipeline. Initially, I used BLIP-2 from Hugging Face, but it was slow and resource-heavy. The breakthrough came when I integrated Ollama for local LLM inference.
def encode_image_base64(image_path: Path) -> str:
    """Encode and optimize image for Ollama processing."""
    with Image.open(image_path) as img:
        # Convert to RGB and resize for efficiency
        if img.mode != 'RGB':
            img = img.convert('RGB')

        # Resize if too large (max 1024x1024 for most models)
        max_size = 1024
        if img.width > max_size or img.height > max_size:
            img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)

        # Convert to base64
        buffer = io.BytesIO()
        img.save(buffer, format='JPEG', quality=85)
        image_bytes = buffer.getvalue()
        return base64.b64encode(image_bytes).decode('utf-8')

def generate_caption_ollama(image_path: Path, model: str = "qwen2.5vl") -> Tuple[str, float]:
    """Generate AI caption using Ollama vision models."""
    try:
        # Load data extraction prompt from external file
        prompt = load_prompt_template("data_extraction")
        image_b64 = encode_image_base64(image_path)

        response = ollama.generate(
            model=model,
            prompt=prompt,
            images=[image_b64],
            options={'temperature': 0.1}  # Low temp for consistency
        )
        caption = response['response'].strip()

        # Try to parse as JSON for structured data extraction
        try:
            data = json.loads(caption)
            # Convert structured data back to searchable text
            caption = f"{data.get('description', '')} {' '.join(data.get('text_elements', []))}"
        except json.JSONDecodeError:
            # Fallback to raw caption
            pass

        return caption, 0.8  # High confidence for local models
    except Exception as e:
        logger.error(f"Ollama captioning failed: {e}")
        return "", 0.0
This structured approach extracts multiple layers of information from each screenshot, making them searchable across different dimensions.
Performance Optimizations
With thousands of screenshots, performance matters. Here's how I optimized the indexing pipeline:
Incremental Updates: Only processes new or modified files by comparing modification times and file hashes.
def get_file_stats(file_path: Path) -> Tuple[int, float, str]:
    """Get file size, modification time, and hash for change detection."""
    stat = file_path.stat()
    size = stat.st_size
    mtime = stat.st_mtime

    # Calculate hash for duplicate detection
    hasher = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)

    return size, mtime, hasher.hexdigest()

@handle_db_corruption
def should_process_file(file_path: Path) -> bool:
    """Check if file needs processing based on hash and mtime."""
    size, mtime, new_hash = get_file_stats(file_path)
    existing_file = get_file_by_path(str(file_path))

    if not existing_file:
        return True  # New file

    # Check if file changed
    return (existing_file['hash'] != new_hash or
            existing_file['mtime'] != mtime)
Parallel Processing: OCR, color analysis, and AI captioning run in parallel using ThreadPoolExecutor.
def parallel_extract_features(image_path: Path, config: dict) -> dict:
    """Extract all features in parallel for better performance."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Submit all extraction tasks
        futures = {}
        if config.get('enable_ocr', True):
            futures['ocr'] = executor.submit(extract_text, image_path)
        if config.get('enable_color', True):
            futures['color'] = executor.submit(extract_color_histogram, image_path)
        if config.get('enable_ai', True):
            futures['caption'] = executor.submit(generate_caption, image_path)
        if config.get('enable_clip', True):
            futures['embedding'] = executor.submit(extract_image_embedding, image_path)

        # Collect results, with a per-feature timeout
        results = {}
        for feature, future in futures.items():
            try:
                results[feature] = future.result(timeout=30)
            except Exception as e:
                logger.warning(f"Feature extraction failed for {feature}: {e}")
                results[feature] = None

    return results
Smart Caching: Models are loaded once and kept in memory during batch operations.
# Global model cache to avoid repeated loading
_clip_model = None
_clip_preprocess = None

def load_clip_model(model_name: str = "ViT-B-32") -> Tuple[Any, Any]:
    """Load CLIP model with caching and device optimization."""
    global _clip_model, _clip_preprocess
    if _clip_model is not None and _clip_preprocess is not None:
        return _clip_model, _clip_preprocess

    import open_clip
    import torch

    # Load model and preprocessing
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name,
        pretrained="openai"
    )

    # Set to evaluation mode
    model.eval()

    # Move to best available device
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model = model.to(device)

    # Cache globally for reuse
    _clip_model = model
    _clip_preprocess = preprocess
    return model, preprocess

def extract_image_embedding(image_path: Path) -> Optional[np.ndarray]:
    """Extract CLIP embedding with cached model."""
    import torch

    model, preprocess = load_clip_model()  # Uses cached model
    device = next(model.parameters()).device

    with torch.no_grad():
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        features = model.encode_image(image)
        # L2 normalize for cosine similarity
        features = features / features.norm(dim=-1, keepdim=True)

    return features.cpu().numpy().flatten()
Progressive Indexing: You can start searching as soon as basic indexing (filename + OCR) is complete, while AI features continue processing in the background.
Color Analysis with Perceptual Accuracy:
def extract_color_histogram(image_path: Path) -> Tuple[np.ndarray, str]:
    """Extract HSV color histogram for color-based search."""
    img = cv2.imread(str(image_path))
    if img is None:
        return np.zeros(27), "#000000"  # 3x3x3 HSV bins

    # Convert BGR to HSV so hue, saturation, and brightness can be binned separately
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Calculate 3D histogram (3x3x3 bins for H,S,V)
    hist = cv2.calcHist(
        [hsv],
        [0, 1, 2],                # H, S, V channels
        None,
        [3, 3, 3],                # 3 bins each for H, S, V
        [0, 180, 0, 256, 0, 256]  # ranges
    )

    # Flatten and normalize to probability distribution
    hist_vector = hist.flatten()
    hist_vector = hist_vector / (hist_vector.sum() + 1e-7)

    # Extract dominant color using K-means
    dominant_color = get_dominant_color_kmeans(img)

    return hist_vector, dominant_color

def detect_color_words(query: str) -> List[str]:
    """Detect color terms in search queries."""
    color_patterns = {
        'red': ['red', 'crimson', 'scarlet'],
        'blue': ['blue', 'navy', 'azure', 'cyan'],
        'green': ['green', 'lime', 'forest', 'mint'],
        'yellow': ['yellow', 'gold', 'amber'],
        'orange': ['orange', 'coral', 'peach'],
        'purple': ['purple', 'violet', 'magenta'],
        'black': ['black', 'dark'],
        'white': ['white', 'light'],
        'gray': ['gray', 'grey', 'silver']
    }

    detected_colors = []
    query_lower = query.lower()
    for base_color, variants in color_patterns.items():
        if any(variant in query_lower for variant in variants):
            detected_colors.append(base_color)

    return detected_colors
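Once color words are detected, the stored histograms can be ranked against them. One way to do that is to compare each screenshot's histogram with the histogram of a solid-color swatch; the sketch below is illustrative (the REFERENCE_BGR table and scoring function are assumptions, not Screenshot Hub's actual ranking code):

import cv2
import numpy as np

# Rough BGR values for a few base colors (illustrative only)
REFERENCE_BGR = {
    'red': (0, 0, 255),
    'blue': (255, 0, 0),
    'green': (0, 255, 0),
}

def color_similarity(stored_hist: np.ndarray, color: str) -> float:
    """Histogram intersection between a screenshot and a solid-color swatch."""
    swatch = np.full((32, 32, 3), REFERENCE_BGR[color], dtype=np.uint8)
    hsv = cv2.cvtColor(swatch, cv2.COLOR_BGR2HSV)
    ref = cv2.calcHist([hsv], [0, 1, 2], None, [3, 3, 3],
                       [0, 180, 0, 256, 0, 256]).flatten()
    ref = ref / (ref.sum() + 1e-7)
    # 1.0 means the screenshot's color distribution matches the swatch exactly
    return float(np.minimum(stored_hist, ref).sum())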
Real-World Performance
After indexing my 2,000+ screenshot collection, here are some real performance numbers:
Indexing Speed:
- Basic (filename + OCR): ~2 seconds per screenshot
- With AI captioning: ~5-8 seconds per screenshot (depending on model)
- With CLIP embeddings: ~1 additional second per screenshot
Query Speed:
- Text/filename search: <100ms
- Color search: <200ms
- Semantic search: <150ms
- CLIP similarity: <300ms
- Fusion search: <500ms (all methods combined)
Storage Requirements:
- Database: ~50KB per screenshot
- FAISS vector index: ~2KB per screenshot
- Model files: ~400MB (CLIP) + variable (Ollama models)
What I Learned
1. Multi-modal Search Is Genuinely Better
I was skeptical about whether the complexity of multiple search methods would actually improve results. It does, dramatically. Single-method search often misses relevant results that other methods catch. The fusion ranking produces noticeably better results than any individual technique.
2. Local AI Is Ready for Production
Modern vision-language models running locally via Ollama are surprisingly capable. qwen2.5vl correctly identifies code editors, terminal windows, web browsers, and even specific programming languages in screenshots. The quality gap between local and cloud models has narrowed significantly.
3. Color Search Is Underrated
I almost skipped color-based search as a gimmick, but it turned out to be incredibly useful. Searching for "red error" or "blue interface" often finds exactly what you need when you remember the visual appearance but not the specific content.
4. Performance Tuning Matters
Early versions were too slow for daily use. Adding parallel processing, smart caching, and incremental updates made the difference between a neat demo and a practical tool.
Practical Usage Patterns
After using Screenshot Hub for several months, some interesting usage patterns emerged:
Debug Session Recovery: "Find that Redis error from last week" → searches semantic content for Redis-related screenshots
Design Reference: "Show me all the blue login screens" → combines color and semantic search to find UI examples
Code Archaeology: "Where did I screenshot that Python function?" → OCR text search for specific code patterns
Visual Similarity: "Find screenshots similar to this terminal layout" → CLIP embeddings find visually similar command-line interfaces
Future Directions
There's still room for improvement:
Better Context Understanding: Current AI models describe what they see but don't understand relationships between UI elements or the purpose of different screenshot types.
Temporal Queries: "Screenshots from when I was working on project X" requires combining search with timeline analysis.
Duplicate Detection: Using CLIP embeddings to identify and dedupe visually similar screenshots; a rough sketch of how this could work appears after this list.
Smart Collections: Automatically grouping related screenshots (e.g., all screenshots from the same debugging session).
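For duplicate detection in particular, the stored embeddings already contain most of what's needed. A minimal sketch, assuming L2-normalized CLIP embeddings and a hypothetical similarity threshold (this is not yet implemented in Screenshot Hub):

import numpy as np

def find_near_duplicates(embeddings: np.ndarray, file_ids: list,
                         threshold: float = 0.97) -> list:
    """Return pairs of file ids whose embeddings are nearly identical."""
    # With L2-normalized vectors, the dot product is the cosine similarity
    sims = embeddings @ embeddings.T
    pairs = []
    for i in range(len(file_ids)):
        for j in range(i + 1, len(file_ids)):
            if sims[i, j] >= threshold:
                pairs.append((file_ids[i], file_ids[j], float(sims[i, j])))
    return pairs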
Open Source and Privacy
Screenshot Hub is open source (MIT license) and designed with privacy as a core principle. All processing happens locally - no cloud APIs, no data collection, no telemetry. You can audit exactly what it does with your screenshots.
The tool is available at [github link] and works on macOS, Linux, and Windows. Installation is straightforward with uv/pip, and optional dependencies let you choose which AI features to enable.
Why This Matters
Screenshot search might seem like a niche problem, but it reflects a broader issue: we're generating more visual data than ever, but our tools for organizing and finding it haven't evolved. Screenshots, photos, designs, charts, and diagrams all contain rich information that current file systems treat as opaque binary blobs.
Screenshot Hub demonstrates what becomes possible when you treat images as queryable, searchable data. The same techniques could apply to photo libraries, design assets, document scanning, or any visual content management problem.
As AI models become more capable and accessible, we'll see more applications that understand and organize our digital lives in contextually meaningful ways. The future of file management isn't alphabetical sorting - it's semantic understanding.
Getting Started
If you want to try Screenshot Hub:
- Install with uv tool install screenshot-hub or pip install screenshot-hub
- Install Tesseract OCR for text extraction
- Optionally install Ollama for AI features
- Run hub index ~/Pictures/Screenshots to start indexing
- Search with hub search "your query"
The tool is designed to be useful immediately with just basic OCR, then gets more powerful as you add AI capabilities.