RAG as Recommendation Systems
RAG works better when you think like a search engineer.
Simple Vector Search
- Retrieve: Search the knowledge base for text relevant to the query.
- Augment: Combine query + relevant text.
- Generate: Generate the answer from query + relevant text.
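The three steps above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the word-overlap scoring stands in for a real vector search, and `generate` is a hypothetical callable wrapping whatever LLM you use.

```python
from typing import Callable


def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by how many query words they share."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.lower().split())), p) for p in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most k passages and drop those with no overlap at all.
    return [passage for score, passage in scored[:k] if score > 0]


def augment(query: str, passages: list[str]) -> str:
    """Combine the query and the retrieved text into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


def answer(query: str, knowledge_base: list[str], generate: Callable[[str], str]) -> str:
    """Retrieve, augment, then hand the prompt to the (assumed) LLM callable."""
    return generate(augment(query, retrieve(query, knowledge_base)))
```

Passing `generate=lambda prompt: prompt` is a handy way to inspect the prompt that would reach the model.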
Simple Vector Search Is Not Enough
Embeddings find similar meaning.
They can miss exact names, IDs, error codes, and product terms.
Use hybrid search:
- Run keyword search.
- Run vector search.
- Merge the results.
- Rank them with Reciprocal Rank Fusion (RRF), which rewards documents that rank well in both lists.
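A minimal sketch of the merge step, assuming each search returns a list of document IDs ordered best first. Each document scores 1 / (k + rank) per list it appears in, and the scores are summed. The constant `k = 60` is a commonly used default, not a requirement, and the document IDs in the usage note are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more for a high rank, and earns again
            # for every list it appears in.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, with keyword results `["a", "b", "c"]` and vector results `["b", "d", "a"]`, documents `a` and `b` appear in both lists and end up ahead of `c` and `d`. RRF only needs ranks, so it sidesteps the problem that keyword scores and vector similarities are on different scales.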
Reranking
The first retrieval pass is a guess.
The top 20 candidates might include only three useful docs. Sending all 20 to the model adds noise and cost.
Use a cross-encoder as a second pass to rerank the candidates: it reads the query together with each candidate and outputs a relevance score.
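The second pass can be sketched as a function that scores every (query, candidate) pair and keeps the best few. Here `score_fn` is a placeholder for a real cross-encoder, and `overlap_score` is a deliberately crude stand-in so the example runs without a model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score each (query, candidate) pair and return the top_n candidates."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Stable sort: candidates with equal scores keep their first-pass order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0
```

In practice `score_fn` would wrap a trained model. The first pass stays cheap because it compares precomputed vectors, while the reranker runs once per candidate, which is why it is applied to 20 candidates rather than the whole knowledge base.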
Types Of Reranking
- Relevance-based: Score each (query, document) pair, then sort by score.
- Diversity-based: Avoid sending five versions of the same doc. Use MMR when the top results repeat the same point.
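Diversity-based reranking with MMR (Maximal Marginal Relevance) can be sketched as a greedy loop: at each step pick the document that is relevant to the query but not too similar to what has already been selected. The vectors, the document IDs, and the `lambda_` default below are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    top_n: int = 2,
    lambda_: float = 0.5,
) -> list[str]:
    """Greedy MMR: trade relevance against redundancy with already-picked docs."""
    selected: list[str] = []
    remaining = list(doc_vecs)
    while remaining and len(selected) < top_n:

        def mmr_score(doc_id: str) -> float:
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lambda_` closer to 1 favors relevance, closer to 0 favors variety. When two near-identical documents compete, the second one is penalized by its similarity to the first, which lets a different but still relevant document take the slot.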
Improving Chunks
Chunks often lose meaning.
"50% improvement overall" means nothing unless you know "overall" of what.
- Parent Document Retrieval (Preferred):
- Embed small chunks for precise search.
- When one chunk matches, return the larger section around it.
- Contextual RAG / Propositional Indexing:
- Add a short document summary to each chunk before embedding.
- The chunk now carries more context into search.
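Both ideas can be sketched without any particular library. In this illustration, word overlap stands in for embedding similarity, and the `Chunk` type, `parents` mapping, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str       # small chunk used for matching
    parent_id: str  # key of the larger section it was cut from


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(
    query: str, chunks: list[Chunk], parents: dict[str, str], k: int = 2
) -> list[str]:
    """Match on small chunks, but return the larger parent sections."""
    ranked = sorted(chunks, key=lambda c: word_overlap(query, c.text), reverse=True)
    parent_ids: list[str] = []
    for chunk in ranked[:k]:
        # Skip non-matches, and return each parent section only once.
        if word_overlap(query, chunk.text) > 0 and chunk.parent_id not in parent_ids:
            parent_ids.append(chunk.parent_id)
    return [parents[pid] for pid in parent_ids]


def contextualize(chunk_text: str, doc_summary: str) -> str:
    """Text to embed: the document summary followed by the chunk itself."""
    return f"{doc_summary}\n\n{chunk_text}"
```

With parent retrieval, a match on "50% improvement overall" returns the whole section that says what improved. With contextualized chunks, the summary travels with the chunk into the index, so the chunk can match queries that mention the document's topic even when the chunk text does not.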
Improving The User Query
Users ask vague questions.
They also ask three questions in one sentence.
A single vector search can miss in both cases.
- Multi Query: Have the LLM generate multiple versions of the query and run searches for each.
- Step-back Prompting: Ask the LLM to generate a more general question from the user query.
- Example:
- User: "Who's the second person to walk on the moon?"
- Step-back: "What were the Apollo moon landing missions and their crews?"
- Retrieve with both queries to get general and specific chunks.
- HyDE (Hypothetical Document Embeddings):
- Have the LLM generate a hypothetical (fake) answer.
- Embed the fake answer.
- Search for real documents that look like it.
- Decomposition (Multi-hop):
- Break complex questions into sub-queries:
- "Compare feature A with product B price."
- Sub-queries: "Feature A" and "Product B price."
- Run retrieval for each sub-query, then synthesize the comparison.
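Multi-query, step-back prompting, and decomposition share the same retrieval shape: produce several queries, search with each, and merge the results. A minimal sketch of that fan-out, assuming the queries were already produced by an LLM and that `search` is a hypothetical callable returning document IDs ordered best first:

```python
from typing import Callable


def fan_out_retrieve(
    queries: list[str],
    search: Callable[[str], list[str]],
    k: int = 3,
) -> list[str]:
    """Run one search per query and merge the hits, dropping duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc_id in search(query)[:k]:
            # The same document often comes back for several queries.
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

For the decomposition example above, `queries` would be `["Feature A", "Product B price"]`. The merged list preserves first-seen order, so a rank-fusion or reranking step is still worth applying before the results go to the model.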
Evaluating RAG Pipelines
Build a set of questions with known good answers. Then measure:
- Precision: Percentage of retrieved chunks that are relevant.
- LLM-as-a-judge: "Is this chunk relevant to the user's query? Yes/No."
- Recall: Did retrieval fetch all necessary chunks?
- Faithfulness: Did the model stick to the provided context?
- LLM-as-a-judge: "Does the provided context support this statement in the final answer? Yes/No."
- Answer Relevance: Did the final answer address the user's query?
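Once each test question has labeled relevant chunks, the retrieval metrics reduce to simple set arithmetic. A minimal sketch, assuming chunk IDs are strings and that the faithfulness verdicts come from an LLM judge answering Yes/No per statement (both are assumptions of this example):

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of one retrieval result against labeled chunks."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: share of retrieved chunks that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of the necessary chunks that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def faithfulness(verdicts: list[bool]) -> float:
    """Share of answer statements the judge marked as supported by the context."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Averaging these numbers over the whole question set gives a baseline to compare against after each change to chunking, retrieval, or reranking.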
RAG is a recommendation system for context.
There are many ways to surface relevant content without embeddings:
- Keyword/full-text search
- Metadata filters
- Hybrid approaches
For queries that hinge on exact names, IDs, or error codes, these methods often beat pure embedding search.
Beware vendor lock-in with proprietary embedding services. Prefer open-source models when they are good enough.
Today, vendors earn more from LLMs than from embedding services. Query decomposition and multi-step retrieval can increase per-query costs. New technologies are expensive at first; as models improve and competition grows, prices tend to fall.