RAG as Recommendation Systems
Simple Vector Search
- Retrieve: Search the knowledge base for text relevant to the query
- Augment: Combine the query with the retrieved text
- Generate: Generate an answer based on the query + retrieved text
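The three steps can be sketched end to end. This is a minimal, self-contained illustration: `embed()` is a toy bag-of-words stand-in for a real embedding model, and the Generate step is left as a comment since it is just an LLM call.

```python
import math
import re

def embed(text: str) -> dict:
    # Toy bag-of-words "embedding" so the sketch runs without a real model.
    words = re.findall(r"[a-z]+", text.lower())
    return {w: words.count(w) for w in set(words)}

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query; keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query: str, context: list[str]) -> str:
    # Combine query + retrieved text into one prompt.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "Postgres supports full-text search.",
    "The moon landing was in 1969.",
    "pgvector adds vector search to Postgres.",
]
query = "Does Postgres support vector search?"
prompt = augment(query, retrieve(query, docs))
# Generate: send `prompt` to the LLM and return its answer.
```

With a real pipeline, `embed()` would be an embedding model and `retrieve()` a vector index lookup; the structure stays the same.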
Simple Vector Search Is Not Enough
- Hybrid Search (Keyword + Vector)
- Embeddings are good at finding similar things but struggle with specific keywords.
- Solution: Do both! Postgres supports full-text search (BM25-style keyword ranking) alongside vector search via the pgvector extension.
- Keyword search top n → Vector search top m → Combine results and rank with RRF (Reciprocal Rank Fusion: rank docs based on position in each result list).
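RRF itself is tiny: each ranked list contributes `1 / (k + rank)` per document, and documents that appear high in both lists win. A sketch, using `k=60` (the constant from the original RRF paper) and hypothetical doc ids:

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each list contributes 1/(k + rank) for every document it contains.
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_top = ["d3", "d1", "d7"]   # BM25 results (hypothetical ids)
vector_top  = ["d1", "d5", "d3"]   # vector search results
fused = rrf([keyword_top, vector_top])
# "d1" ranks high in both lists, so it comes out on top.
```

Because RRF only uses rank positions, it never has to reconcile BM25 scores with cosine similarities, which live on incompatible scales.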
Reranking
The above methods are best guesses; the top 20 candidates might include only three relevant ones. Sending all 20 to the LLM adds noise and cost.
Use a second-stage cross-encoder to rerank candidates: it compares the query against each candidate and outputs a relevance score.
Types of reranking:
- Relevance based: Initial retrieval is semantically similar, but not necessarily contextually relevant.
- Use a cross-encoder: take each (query, document) pair from the initial top n; the cross-encoder outputs a single relevance score. Sort by score.
- Diversity based: New problem: the top 3 most relevant docs might be the exact same thing, wasting context and the model's attention.
- Select a subset of documents that are highly relevant yet dissimilar from each other, commonly using MMR (Maximal Marginal Relevance).
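MMR trades relevance against redundancy with a weight `lam`. A sketch over precomputed scores (`rel[i]` is doc i's relevance to the query, `sim[i][j]` is doc-to-doc similarity; all values here are made up for illustration):

```python
def mmr(rel: list[float], sim: list[list[float]], k: int, lam: float = 0.5) -> list[int]:
    # Greedily pick docs scoring high on: lam * relevance - (1-lam) * redundancy,
    # where redundancy is the max similarity to anything already selected.
    selected: list[int] = []
    remaining = list(range(len(rel)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR takes 0, then prefers 2 over 1.
rel = [0.9, 0.88, 0.7]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
picked = mmr(rel, sim, k=2)
```

With `lam=1.0` this degenerates to plain relevance sorting; lowering `lam` pushes harder for diversity.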
Improving Chunks
Chunks may lack context (e.g., “50% improvement overall” — overall of what?). Two approaches:
- Parent Document Retrieval (Preferred)
- Embed small chunks for precise search; when a chunk is retrieved, return its parent chunk.
- Example: Embed sentence-level chunks, retrieve a sentence chunk, then return the entire section.
- Contextual RAG / Propositional Indexing
- Enrich chunks with summaries before embedding.
- Example: Ask the LLM to summarize the document, then embed (chunk + summary) to add context.
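A sketch of the preferred approach, parent document retrieval. The chunk/section data and `search_chunks()` keyword matcher are invented stand-ins for a real chunk store and vector search; the point is the indirection from matched chunk to returned parent.

```python
# Parents: full sections. Children: small chunks that get embedded/searched.
sections = {
    "s1": "Full text of section 1 ... results improved 50% overall ...",
    "s2": "Full text of section 2 ...",
}
chunks = [
    {"id": "c1", "parent": "s1", "text": "results improved 50% overall"},
    {"id": "c2", "parent": "s2", "text": "the baseline used BM25"},
]

def search_chunks(query: str, chunks: list[dict]) -> list[dict]:
    # Naive keyword match as a stand-in for vector similarity over small chunks.
    return [c for c in chunks if any(w in c["text"] for w in query.lower().split())]

def retrieve_parents(query: str) -> list[str]:
    hits = search_chunks(query, chunks)
    # Deduplicate: several matching chunks may share the same parent section.
    return list({sections[c["parent"]] for c in hits})

context = retrieve_parents("how much improvement overall")
# A small chunk matched, but the entire parent section is returned as context.
```

Search stays precise because it runs over small chunks, while the LLM receives the surrounding section that explains what "overall" refers to.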
Improving the User Query
Users often ask vague or multi-part questions, and a single vector search can fail.
- Multi Query: Have the LLM generate multiple versions of the query and run searches for each.
- Step-back Prompting: Ask the LLM to generate a more general question from the user query.
- Example:
- User: “Who’s the second person to walk on the moon?”
- Step-back: “What were the Apollo moon landing missions and their crews?”
- Retrieve with both queries to get general and specific chunks.
- HyDE (Hypothetical Document Embeddings):
- The LLM generates a fake answer to the user question; embed this fake answer and find semantically similar documents.
- The fake answer is more document-like than the query.
- Decomposition (Multi-hop):
- Break complex questions into sub-queries:
- “Compare feature A with product B price.”
- Sub-queries: “Feature A” and “Product B price.”
- Run retrieval for each sub-query, then synthesize the comparison.
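Decomposition can be sketched as a small loop. `llm_decompose()` and the tiny `corpus` below are stubs standing in for a real LLM call and a real retriever:

```python
def llm_decompose(question: str) -> list[str]:
    # Stand-in for an LLM call that splits a complex question into sub-queries.
    return ["feature A specifications", "product B price"]

def retrieve(query: str) -> list[str]:
    # Stand-in for any retrieval strategy above (hybrid search, reranking, ...).
    corpus = {
        "feature A specifications": ["Feature A supports streaming."],
        "product B price": ["Product B costs $49/month."],
    }
    return corpus.get(query, [])

question = "Compare feature A with product B price."
# Run retrieval once per sub-query, then pool the results.
context = [doc for sub in llm_decompose(question) for doc in retrieve(sub)]
# Final step: send question + pooled context to the LLM to synthesize the comparison.
```

The same loop structure also covers Multi Query: swap `llm_decompose()` for a call that rephrases the question instead of splitting it.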
Evaluating RAG Pipelines
Build a gold-standard answers dataset. Evaluate using:
- Precision: Percentage of retrieved chunks that are relevant.
- (LLM-as-a-judge: “Is this chunk relevant to the user’s query? Yes/No.”)
- Recall: Did the retrieval fetch all necessary chunks?
- Faithfulness/Groundedness: Did the LLM stick to provided context or hallucinate?
- (LLM-as-a-judge: “Does the provided context support this statement in the final answer? Yes/No.”)
- Answer Relevance: Did the final answer address the user’s query?
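Precision and recall over a gold-standard set reduce to simple counting. In this sketch the relevance judgment is a membership check against the gold set; in practice that line would be an LLM-as-a-judge call:

```python
def precision_recall(retrieved: list[str], gold: set[str]) -> tuple[float, float]:
    # Judge stub: a chunk is "relevant" if it appears in the gold set.
    relevant = [c for c in retrieved if c in gold]
    # Precision: what fraction of retrieved chunks are relevant?
    precision = len(relevant) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the necessary chunks were fetched?
    recall = len(set(relevant)) / len(gold) if gold else 0.0
    return precision, recall

retrieved = ["c1", "c4", "c2", "c9"]   # hypothetical chunk ids from the pipeline
gold = {"c1", "c2", "c3"}              # chunks a human marked as necessary
p, r = precision_recall(retrieved, gold)
# p = 0.5 (2 of 4 retrieved chunks are relevant)
# r = 2/3 (one necessary chunk, c3, was missed)
```

Faithfulness and answer relevance need an LLM judge rather than set arithmetic, but they plug into the same evaluation loop per gold example.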
It can be helpful to think of RAG systems as recommendation systems.
There are many ways to surface relevant content without embeddings:
- Keyword/full-text search
- Metadata filters
- Hybrid approaches
These methods often deliver extremely accurate and specific retrievals.
Beware vendor lock-in with proprietary embedding services. Whenever possible, favor open-source models (e.g., BGE over OpenAI). Cloud-based embedding providers are still emerging—Fireworks AI is one option, but their models may lag behind the latest open-source offerings.
Today, vendors earn more from LLMs than from embedding services. Query decomposition and multi-step retrieval can increase per-query costs. New technologies are often expensive at first; however, as models improve and competition grows, prices *tend* to fall.