RAG as Recommendation Systems
Simple Vector Search
- Retrieve: Search the knowledge base for text relevant to the query
- Augment: Combine the query with the retrieved text
- Generate: Generate an answer based on the query + retrieved text
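The three steps can be sketched end to end. This is a minimal, self-contained illustration: `embed()` is a toy bag-of-words stand-in for a real embedding model, and the Generate step is left as a comment since it is just an LLM call.

```python
import math
import re

def embed(text: str) -> dict:
    # Toy bag-of-words "embedding" so the sketch runs without a real model.
    words = re.findall(r"[a-z]+", text.lower())
    return {w: words.count(w) for w in set(words)}

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query; keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query: str, context: list[str]) -> str:
    # Combine query + retrieved text into one prompt.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "Postgres supports full-text search.",
    "The moon landing was in 1969.",
    "pgvector adds vector search to Postgres.",
]
query = "Does Postgres support vector search?"
prompt = augment(query, retrieve(query, docs))
# Generate: send `prompt` to the LLM and return its answer.
```

With a real pipeline, `embed()` would be an embedding model and `retrieve()` a vector index lookup; the structure stays the same.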
Simple Vector Search Is Not Enough
- Hybrid Search (Keyword + Vector)
- Embeddings are good at finding similar things but struggle with specific keywords.
- Solution: Do both! Postgres supports full-text search (BM25-style keyword ranking) alongside vector search via the pgvector extension.
- Keyword search top n → Vector search top m → Combine results and rank with RRF (Reciprocal Rank Fusion: rank docs based on position in each result list).
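RRF itself is tiny: each ranked list contributes `1 / (k + rank)` per document, and documents that appear high in both lists win. A sketch, using `k=60` (the constant from the original RRF paper) and hypothetical doc ids:

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each list contributes 1/(k + rank) for every document it contains.
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_top = ["d3", "d1", "d7"]   # BM25 results (hypothetical ids)
vector_top  = ["d1", "d5", "d3"]   # vector search results
fused = rrf([keyword_top, vector_top])
# "d1" ranks high in both lists, so it comes out on top.
```

Because RRF only uses rank positions, it never has to reconcile BM25 scores with cosine similarities, which live on incompatible scales.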
Reranking
The above methods are best guesses; the top 20 candidates might include only three relevant ones. Sending all 20 to the LLM adds noise and cost.
Use a second-stage cross-encoder to rerank candidates: it compares the query against each candidate and outputs a relevance score.
Types of reranking:
- Relevance based: Initial retrieval is semantically similar, but not necessarily contextually relevant.
- Use a cross-encoder: take each (query, document) pair from the initial top n; the cross-encoder outputs a single relevance score. Sort by score.
- Diversity based: New problem: the top 3 most relevant docs might be the exact same thing, wasting context and the model's attention.
- Select a subset of documents that are highly relevant yet dissimilar from each other, commonly using MMR (Maximal Marginal Relevance).
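MMR trades relevance against redundancy with a weight `lam`. A sketch over precomputed scores (`rel[i]` is doc i's relevance to the query, `sim[i][j]` is doc-to-doc similarity; all values here are made up for illustration):

```python
def mmr(rel: list[float], sim: list[list[float]], k: int, lam: float = 0.5) -> list[int]:
    # Greedily pick docs scoring high on: lam * relevance - (1-lam) * redundancy,
    # where redundancy is the max similarity to anything already selected.
    selected: list[int] = []
    remaining = list(range(len(rel)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR takes 0, then prefers 2 over 1.
rel = [0.9, 0.88, 0.7]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
picked = mmr(rel, sim, k=2)
```

With `lam=1.0` this degenerates to plain relevance sorting; lowering `lam` pushes harder for diversity.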
Improving Chunks
Chunks may lack context (e.g., “50% improvement overall” — overall of what?). Two approaches:
- Parent Document Retrieval (Preferred)
- Embed small chunks for precise search; when a chunk is retrieved, return its parent chunk.
- Example: Embed sentence-level chunks, retrieve a sentence chunk, then return the entire section.
- Contextual RAG / Propositional Indexing
- Enrich chunks with summaries before embedding.
- Example: Ask the LLM to summarize the document, then embed (chunk + summary) to add context.
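A sketch of the preferred approach, parent document retrieval. The chunk/section data and `search_chunks()` keyword matcher are invented stand-ins for a real chunk store and vector search; the point is the indirection from matched chunk to returned parent.

```python
# Parents: full sections. Children: small chunks that get embedded/searched.
sections = {
    "s1": "Full text of section 1 ... results improved 50% overall ...",
    "s2": "Full text of section 2 ...",
}
chunks = [
    {"id": "c1", "parent": "s1", "text": "results improved 50% overall"},
    {"id": "c2", "parent": "s2", "text": "the baseline used BM25"},
]

def search_chunks(query: str, chunks: list[dict]) -> list[dict]:
    # Naive keyword match as a stand-in for vector similarity over small chunks.
    return [c for c in chunks if any(w in c["text"] for w in query.lower().split())]

def retrieve_parents(query: str) -> list[str]:
    hits = search_chunks(query, chunks)
    # Deduplicate: several matching chunks may share the same parent section.
    return list({sections[c["parent"]] for c in hits})

context = retrieve_parents("how much improvement overall")
# A small chunk matched, but the entire parent section is returned as context.
```

Search stays precise because it runs over small chunks, while the LLM receives the surrounding section that explains what "overall" refers to.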
Improving the User Query
Users often ask vague or multi-part questions, and a single vector search can fail.
- Multi Query: Have the LLM generate multiple versions of the query and run searches for each.
- Step-back Prompting: Ask the LLM to generate a more general question from the user query.
- Example:
- User: “Who’s the second person to walk on the moon?”
- Step-back: “What were the Apollo moon landing missions and their crews?”
- Retrieve with both queries to get general and specific chunks.
- HyDE (Hypothetical Document Embeddings):
- The LLM generates a fake answer to the user question; embed this fake answer and find semantically similar documents.
- The fake answer is more document-like than the query.
- Decomposition (Multi-hop):
- Break complex questions into sub-queries:
- “Compare feature A with product B price.”
- Sub-queries: “Feature A” and “Product B price.”
- Run retrieval for each sub-query, then synthesize the comparison.
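Decomposition can be sketched as a small loop. `llm_decompose()` and the tiny `corpus` below are stubs standing in for a real LLM call and a real retriever:

```python
def llm_decompose(question: str) -> list[str]:
    # Stand-in for an LLM call that splits a complex question into sub-queries.
    return ["feature A specifications", "product B price"]

def retrieve(query: str) -> list[str]:
    # Stand-in for any retrieval strategy above (hybrid search, reranking, ...).
    corpus = {
        "feature A specifications": ["Feature A supports streaming."],
        "product B price": ["Product B costs $49/month."],
    }
    return corpus.get(query, [])

question = "Compare feature A with product B price."
# Run retrieval once per sub-query, then pool the results.
context = [doc for sub in llm_decompose(question) for doc in retrieve(sub)]
# Final step: send question + pooled context to the LLM to synthesize the comparison.
```

The same loop structure also covers Multi Query: swap `llm_decompose()` for a call that rephrases the question instead of splitting it.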
Evaluating RAG Pipelines
Build a gold-standard answers dataset. Evaluate using:
- Precision: Percentage of retrieved chunks that are relevant.
- (LLM-as-a-judge: “Is this chunk relevant to the user’s query? Yes/No.”)
- Recall: Did the retrieval fetch all necessary chunks?
- Faithfulness/Groundedness: Did the LLM stick to provided context or hallucinate?
- (LLM-as-a-judge: “Does the provided context support this statement in the final answer? Yes/No.”)
- Answer Relevance: Did the final answer address the user’s query?
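Precision and recall over a gold-standard set reduce to simple counting. In this sketch the relevance judgment is a membership check against the gold set; in practice that line would be an LLM-as-a-judge call:

```python
def precision_recall(retrieved: list[str], gold: set[str]) -> tuple[float, float]:
    # Judge stub: a chunk is "relevant" if it appears in the gold set.
    relevant = [c for c in retrieved if c in gold]
    # Precision: what fraction of retrieved chunks are relevant?
    precision = len(relevant) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the necessary chunks were fetched?
    recall = len(set(relevant)) / len(gold) if gold else 0.0
    return precision, recall

retrieved = ["c1", "c4", "c2", "c9"]   # hypothetical chunk ids from the pipeline
gold = {"c1", "c2", "c3"}              # chunks a human marked as necessary
p, r = precision_recall(retrieved, gold)
# p = 0.5 (2 of 4 retrieved chunks are relevant)
# r = 2/3 (one necessary chunk, c3, was missed)
```

Faithfulness and answer relevance need an LLM judge rather than set arithmetic, but they plug into the same evaluation loop per gold example.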
It can be helpful to think of RAG systems as recommendation systems.
There are many ways to surface relevant content without embeddings:
- Keyword/full-text search
- Metadata filters
- Hybrid approaches
These methods often deliver extremely accurate and specific retrievals.
Beware vendor lock-in with proprietary embedding services. Whenever possible, favor open-source models (e.g., BGE over OpenAI). Cloud-based embedding providers are still emerging—Fireworks AI is one option, but their models may lag behind the latest open-source offerings.
Today, vendors earn more from LLMs than from embedding services. Query decomposition and multi-step retrieval can increase per-query costs. New technologies are often expensive at first; however, as models improve and competition grows, prices *tend* to fall.