How related posts are computed

The number next to each related post at the bottom of a page is that post’s “similarity” to the currently viewed page, ranging from -1.0 to 1.0.

I am using the following process to compute related posts locally:

  1. Summarize every post and TIL using local Llama 3.1 (llama3.1:8b-instruct-q5_0 via Ollama) with the following prompt:
Prompt

You are an analyst and editor with many years of experience in reading and synthesizing content.

Here is a blog post:

<BLOGPOST>

{ blog_post }

</BLOGPOST>

Please create a comprehensive and concise summary of the blog post. Focus on the main concepts, key details, and central arguments.

<INSTRUCTIONS>

  • Include any specific technologies, methods, or frameworks mentioned.
  • Don’t use more than 7 sentences.
  • Respond in plaintext. Don’t add formatting or linebreak characters to your response.
  • Don’t repeat the instructions of the task. Respond directly.

</INSTRUCTIONS>

  2. Embed the summary using ChromaDB’s default embedding model all-MiniLM-L6-v2 and store the embedding together with metadata about the post in a persistent ChromaDB (link) vector database (a file on my computer).
  3. Compute the cosine similarity (link) between the embeddings for each pair of posts. A score of 1.0 indicates vectors pointing in the same direction, 0.0 orthogonal vectors, and -1.0 vectors pointing in opposite directions.
  4. Write a YAML file that contains, for every post, a link to the most similar post and their similarity score.
  5. Use a Hugo partial to include the data from the YAML file about the most relevant posts at the bottom of each page. (Rough Python sketches of steps 1–4 are shown after this list.)
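
The script linked in the Code section below does this with LangChain; purely as an illustration, here is a minimal sketch of step 1 that talks to the local model directly through the ollama Python package. The prompt constant, the num_ctx value, and the function name are assumptions for this sketch, not taken from the actual script.

```python
# Sketch of step 1 (not the actual script): summarize one post with the
# local Llama 3.1 model via the ollama Python package.
import ollama

SUMMARY_PROMPT = """You are an analyst and editor with many years of experience \
in reading and synthesizing content.

Here is a blog post:

<BLOGPOST>
{blog_post}
</BLOGPOST>

Please create a comprehensive and concise summary of the blog post. \
Focus on the main concepts, key details, and central arguments.
"""  # the <INSTRUCTIONS> block from above is omitted here for brevity


def summarize(blog_post: str) -> str:
    response = ollama.chat(
        model="llama3.1:8b-instruct-q5_0",
        messages=[{"role": "user", "content": SUMMARY_PROMPT.format(blog_post=blog_post)}],
        # assumption: ~4000 tokens of context is enough for most posts (see "Noteworthy")
        options={"num_ctx": 4096},
    )
    return response["message"]["content"]
```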
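
Steps 2–4 can be sketched along these lines, assuming the chromadb, numpy, and pyyaml packages; the collection name, paths, and output file are placeholders rather than the actual setup.

```python
# Sketch of steps 2-4 with assumed names and paths (not the actual script).
from pathlib import Path

import chromadb
import numpy as np
import yaml

# Step 2: persistent ChromaDB collection; without an explicit embedding
# function it falls back to the default model, all-MiniLM-L6-v2.
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(name="post-summaries")

# One add() call per post, with the summary produced in step 1:
# collection.add(ids=[slug], documents=[summary], metadatas=[{"url": url}])

# Step 3: pairwise cosine similarity between all stored embeddings.
records = collection.get(include=["embeddings"])
ids = records["ids"]
vectors = np.asarray(records["embeddings"], dtype=float)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize to unit length
similarity = vectors @ vectors.T        # entries range from -1.0 to 1.0
np.fill_diagonal(similarity, -np.inf)   # exclude each post's similarity to itself

# Step 4: for every post, keep the most similar other post and its score.
related = {
    post_id: {
        "related": ids[int(similarity[i].argmax())],
        "similarity": round(float(similarity[i].max()), 2),
    }
    for i, post_id in enumerate(ids)
}
Path("data/related.yaml").write_text(yaml.safe_dump(related))
```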

Noteworthy

  • Maybe not surprising, but Llama 3.1 8B performed much better than Llama 3.2 3B; the latter hallucinated quite a few parts of the summary.
  • I had to explicitly instruct the model not to restate the task. Otherwise, every summary would have started with “Here is your concise blog post summary in not more than 7 sentences.”, which would have created artificial similarity between the embeddings.
  • Most blog posts are shorter than 15,000 characters, or roughly 4,000 tokens, a figure I had to work out to configure the model’s context window correctly.
  • ChromaDB’s default embedding model, all-MiniLM-L6-v2, has 384 dimensions and a maximum input sequence length of 256 BERT tokens (~100–150 words). I checked some of the summaries and they were around 160–220 BERT tokens, so they stay below that limit. There is the option to use another embedding model with a higher maximum sequence length; a good embedding model leaderboard to help make that choice is available here (link). A sketch of how to configure a different model follows this list.
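
A different Sentence-Transformers model can be passed to ChromaDB when the collection is created. A minimal sketch, with a purely illustrative model name (pick one from the leaderboard and check its maximum sequence length):

```python
# Sketch: use a non-default Sentence-Transformers model in ChromaDB.
# The model name below is illustrative, not a recommendation.
import chromadb
from chromadb.utils import embedding_functions

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_or_create_collection(
    name="post-summaries-mpnet",
    embedding_function=embedder,
)
```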

Example

For the blog post Book notes: Four Thousand Weeks by Oliver Burkeman, the LLM came up with this summary, which was then embedded:

Model: llama3.1:8b-instruct-q5_0 - response

“The blog post discusses the book “Four Thousand Weeks” by Oliver Burkeman, which explores the concept of time and how to make the most of our limited lives. The author argues that modern society’s focus on productivity and efficiency is flawed, as it leads to a never-ending cycle of work and stress. Instead, he advocates for embracing our finitude and focusing on the present moment. Key concepts include “atelic activities” (enjoying leisure time for its own sake), “radical incrementalism” (making small progress towards goals), and “cosmic insignificance therapy” (putting life’s problems into perspective by considering the vastness of the universe). The author also suggests tools for embracing finitude, such as keeping a done list to focus on completed tasks. Overall, the book encourages readers to reevaluate their priorities and find meaning in the present moment.”

Embeddings

I can visualize the embeddings in two dimensions using t-SNE (link):

Figure 1. Plot of blog post summary embeddings in 2D t-SNE space.

The plot looks somewhat reasonable. The code-heavy Jupyter notebook posts are at the bottom, clustered around duckdb-large-datasets. The book reviews (4000-weeks, how_big_things-get-done, how-to-win-friends) are fairly close together. SQL-related posts are clustered at the top. On the other hand, I would have expected reading-and-note-taking to be closer to writing-well.
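
A plot like this can be generated with scikit-learn’s t-SNE implementation and matplotlib; the sketch below shows one way to do it and is not the code behind the figure above.

```python
# Sketch: project the stored summary embeddings to 2D with t-SNE and plot them.
import chromadb
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

collection = chromadb.PersistentClient(path="./chroma").get_collection("post-summaries")
records = collection.get(include=["embeddings"])
vectors = np.asarray(records["embeddings"], dtype=float)

# Perplexity must be smaller than the number of samples; a small value
# works for a few dozen posts.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), post_id in zip(coords, records["ids"]):
    ax.annotate(post_id, (x, y), fontsize=8)  # label each point with its slug
plt.show()
```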

Code

I am using this LangChain script (link) to compute the recommendations.


If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

Tags:
#data-engineering   #llm   #llama