Igor's Techno Club

Fast and Simple 'Similar Topics' Recommendations with TF-IDF and Python

I have a hobby project called HN Distilled, where I needed to add a "Similar Topics" section at the bottom of each post summary. I implemented it with a bit of Python code that calculated cosine similarity between vectorized representations of the posts. It might sound complex, but in essence it was just a few methods from scikit-learn. This is how it worked.

Post Summaries

HN Distilled is a web application that summarizes the discussion of each Hacker News post and extracts its main themes. Each post I stored in the DB had the following fields: id (the HN id), title, summary, and some other meta information. The summary content was generated by an LLM, which took the Hacker News post by id, retrieved the comments, and extracted the main themes from them, with direct quotes where necessary. So, to build Similar Topics, I started with the simplest approach.
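
For illustration, a stored post might look roughly like this (the exact shape is a sketch, not the real schema):

post = {
    "id": 38901234,                           # Hacker News item id (made-up value)
    "title": "Ask HN: ...",                   # original post title
    "summary": "Commenters focused on ...",   # LLM-generated summary of the discussion
    # ...plus some other meta information
}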

Overlapping Words

I took all the words from one text and intersected them with another post's words. The bigger the overlap, the more similar the topics should be. In reality, it was almost the opposite: after I implemented this approach, I noticed that the "similar" topics were not relevant at all, almost random.

import re

def compute_similarity(text1: str, text2: str) -> float:
    """Compute similarity between two texts using a simple word overlap method."""
    words1 = set(re.findall(r'\w+', text1.lower()))
    words2 = set(re.findall(r'\w+', text2.lower()))

    # Compute Jaccard similarity: intersection over union of the two word sets
    intersection = len(words1.intersection(words2))
    union = len(words1.union(words2))

    return intersection / union if union > 0 else 0.0

Embeddings

My next idea was to use the proper, de-facto standard approach: build embeddings for each post, store them in the DB, and do some SQL querying to get the most similar posts. It would have worked, but I wanted something even simpler than that, and I didn't want to pay for an external API to generate embeddings.

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) quantifies a word's importance in a document relative to a collection of documents. The score increases when a word appears frequently in a document but decreases when the word is common across many documents. This helps identify words that are characteristic of and relevant to a specific document within a larger corpus. As a result, documents can be compared by their most significant keywords.
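
Roughly, a term's score is its frequency in the document multiplied by a penalty for how common the term is across the corpus. A toy, hand-rolled version (not scikit-learn's exact formula, which also smooths the IDF and normalizes each document vector) shows the intuition:

import math
from collections import Counter

corpus = [
    "rust memory safety rust",
    "python packaging tools",
    "rust compiler performance",
]

def tf_idf(term: str, doc: str, docs: list[str]) -> float:
    words = doc.split()
    tf = Counter(words)[term] / len(words)          # how often the term appears in this doc
    df = sum(1 for d in docs if term in d.split())  # how many docs contain the term
    idf = math.log(len(docs) / df) if df else 0.0   # rarer across the corpus -> bigger weight
    return tf * idf

print(tf_idf("rust", corpus[0], corpus))    # appears twice here but also elsewhere -> moderate score
print(tf_idf("safety", corpus[0], corpus))  # appears only in this doc -> higher weight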

After a few minutes of googling, I found the scikit-learn library, which could transform the documents (the posts) into a matrix of TF-IDF features that could later be used for calculating cosine similarity.

This is how I used it:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

where docs was a list of post summaries, each concatenated with its title.
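
Something along these lines, assuming each post is a dict with the fields mentioned above:

docs = [f"{post['title']} {post['summary']}" for post in posts]
ids = [post['id'] for post in posts]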

Once that was done, I could calculate similar posts:

from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix)

The output of this method was a matrix:

[[1.         0.         0.         ... 0.         0.05609762 0.        ]
 [0.         1.         0.01893518 ... 0.         0.         0.        ]
 [0.         0.01893518 1.         ... 0.01913076 0.02921412 0.04122173]
 ...
 [0.         0.         0.01913076 ... 1.         0.06757316 0.00756559]
 [0.05609762 0.         0.02921412 ... 0.06757316 1.         0.0212254 ]
 [0.         0.         0.04122173 ... 0.00756559 0.0212254  1.        ]]

Here every post is compared against every other post, and the scores are laid out as a matrix. So entry [0][0] says post 0 is 100% similar to itself (1 in the matrix). The closer a value is to 1, the more similar the content is in meaning.

What was left was to get the top N topics by sorting the values in descending order (not forgetting to exclude the current post):

# ids - HN post ids
TOP_N_RECOMMENDATIONS = 5  # number of similar posts to keep (value is illustrative)

recommendations = []
for idx, current_id in enumerate(ids):
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Exclude self by filtering out the identical index
    sim_scores = [(i, score) for i, score in sim_scores if i != idx]
    # Sort scores in descending order of similarity
    sim_scores.sort(key=lambda x: x[1], reverse=True)

    # Capture the top N recommendations
    top_indices = [i for i, _ in sim_scores[:TOP_N_RECOMMENDATIONS]]
    recommended_ids = [ids[i] for i in top_indices]
    recommendations.append((current_id, recommended_ids))

The beauty of this approach is that everything is done in memory and it works very fast: the calculation for 1,000 posts takes around 2 seconds.
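
The resulting IDs then just need to be written back to the database. With the Supabase client (table and column names here are placeholders, not the real schema), that could look like:

import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

for post_id, recommended_ids in recommendations:
    # 'recommendations', 'post_id' and 'recommended_ids' are illustrative names
    supabase.table("recommendations").upsert(
        {"post_id": post_id, "recommended_ids": recommended_ids}
    ).execute()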

Automation

Once I had the main logic implemented and ready to use, the last thing left was to automate the task. I host all my projects on Fly.io, so I needed a periodically running job there as well.

You may think that cron jobs would be a good choice here, but in my experience they are awkward to work with, especially when it comes to debugging a job. This time I decided to stick to a Python-only approach and used the APScheduler library:

import logging

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

def start_scheduler():
    scheduler = BlockingScheduler()
    scheduler.add_job(
        update_recommendations,  # The function to execute
        CronTrigger(minute='*/1'),  # Run every minute
        name='update_recommendations',  # Human-readable identifier for the job
        max_instances=1,  # Only allow one instance to run at a time
        coalesce=True,  # If multiple executions are missed, only run once
        misfire_grace_time=300  # Allow the job to run up to 300 seconds (5 mins) late
    )
    logging.info("Starting scheduler...")
    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        logging.info("Scheduler stopped...")

The only thing left was to prepare a Dockerfile and a proper Fly.io configuration:

FROM python:3.13-slim

WORKDIR /app

# Copy and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY update_recommendations.py .

# Environment variables will be provided by Fly.io
ENV SUPABASE_URL=""
ENV SUPABASE_KEY=""

# Run the script
CMD ["python", "update_recommendations.py"] 

The Fly.io configuration (fly.toml):

app = 'update-recommendations'
primary_region = 'waw'

[build]
  dockerfile = 'Dockerfile'

[processes]
  app = 'python update_recommendations.py'

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1
  min_machines_running = 1  # Ensures at least one instance is always running

#ai #projects