
RAG as Recommendation Systems

·Bryan Lai

Simple Vector Search

  1. Retrieve: Search the knowledge base for text relevant to the query
  2. Augment: Combine query + relevant text
  3. Generate: Generate answer based on query + relevant text
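The three steps above can be sketched end to end. The bag-of-words embedding and prompt template below are toy stand-ins (a real pipeline would use an embedding model and an LLM call for the Generate step):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 1: rank documents by similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query: str, chunks: list[str]) -> str:
    """Step 2: combine query + retrieved text into one prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Postgres supports full-text search.",
    "The moon orbits the earth.",
    "pgvector adds vector search to Postgres.",
]
query = "vector search in Postgres"
prompt = augment(query, retrieve(query, docs))
# Step 3: `prompt` would now be passed to the LLM to generate the answer.
```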

Simple Vector Search Is Not Enough

  1. Hybrid Search (Keyword + Vector)
    • Embeddings are good at finding similar things but struggle with specific keywords.
    • Solution: Do both! Postgres, for example, offers full-text keyword search (BM25 is the standard keyword-ranking function) and vector search via the pgvector extension.
      • Keyword search top n → Vector search top m → Combine results and rank with RRF (Reciprocal Rank Fusion: rank docs based on position in each result list).
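RRF itself is a few lines: each document's fused score is the sum of 1/(k + rank) over every result list it appears in (k = 60 is the constant from the original RRF paper). A minimal sketch:

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_top = ["d3", "d1", "d7"]   # BM25 results, best first
vector_top  = ["d1", "d5", "d3"]   # embedding results, best first
fused = rrf([keyword_top, vector_top])
# d1 and d3 rank highest because both lists agree on them.
```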

Reranking

The above methods are best guesses; the top 20 candidates might include only three relevant ones. Sending all 20 to the LLM adds noise and cost.
Use a second-stage model, a cross-encoder, to rerank candidates: it reads the query and each candidate jointly and outputs a relevance score.

Types of reranking:

  1. Relevance-based: Initial retrieval finds chunks that are semantically similar to the query, but not necessarily contextually relevant.
    • Use a cross-encoder: feed it each (query, document) pair from the initial top n; it outputs a single relevance score. Sort by score.
  2. Diversity-based: New problem: the top 3 most relevant docs might say the exact same thing, wasting context and the model’s attention.
    • Select a subset of documents that are highly relevant yet dissimilar from each other, commonly using MMR (Maximal Marginal Relevance).

Improving Chunks

Chunks may lack context (e.g., “50% improvement overall” — overall of what?). Two approaches:

  1. Parent Document Retrieval (Preferred)

    • Embed small chunks for precise search; when a chunk is retrieved, return its parent chunk.
    • Example: Embed sentence-level chunks, retrieve a sentence chunk, then return the entire section.
  2. Contextual RAG / Propositional Indexing

    • Enrich chunks with summaries before embedding.
    • Example: Ask the LLM to summarize the document, then embed (chunk + summary) to add context.
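Parent-document retrieval mostly comes down to bookkeeping: each small chunk carries a pointer to its parent, and retrieval returns the parent. A toy sketch, with a keyword-overlap match standing in for vector search:

```python
# Parent sections keyed by a section id (hypothetical example data).
sections = {
    "perf": "Benchmarks ran for 3 weeks. We saw a 50% improvement overall.",
    "intro": "This report covers the Q3 database migration.",
}

# Child chunks (sentences) each keep a pointer to their parent section.
chunks = [
    {"text": sentence.strip(), "parent": section_id}
    for section_id, section in sections.items()
    for sentence in section.split(". ") if sentence
]

def retrieve_parent(query: str) -> str:
    """Match small chunks, but return the whole parent section."""
    terms = set(query.lower().split())
    best = max(chunks,
               key=lambda c: len(terms & set(c["text"].lower().split())))
    return sections[best["parent"]]
```

Querying for "50% improvement" matches the small sentence chunk, but the caller receives the full "perf" section, so the surrounding context ("Benchmarks ran for 3 weeks") comes along with it.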

Improving the User Query

Users often ask vague or multi-part questions, and a single vector search can fail.

  1. Multi Query: Have the LLM generate multiple versions of the query and run searches for each.
  2. Step-back Prompting: Ask the LLM to generate a more general question from the user query.
    • Example:
      • User: “Who’s the second person to walk on the moon?”
      • Step-back: “What were the Apollo moon landing missions and their crews?”
      • Retrieve with both queries to get general and specific chunks.
  3. HyDE (Hypothetical Document Embeddings):
    • Have the LLM generate a hypothetical answer to the user question, embed that answer, and find semantically similar documents.
    • The fake answer is more document-like than the query.
  4. Decomposition (Multi-hop):
    • Break complex questions into sub-queries:
      • “Compare feature A with product B price.”
      • Sub-queries: “Feature A” and “Product B price.”
      • Run retrieval for each sub-query, then synthesize the comparison.
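Taking HyDE as one example, the flow can be sketched as follows. `generate_hypothetical` is a hardcoded stand-in for an LLM call, and the bag-of-words similarity is a toy substitute for real embeddings:

```python
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a: Counter, b: Counter) -> int:
    """Toy similarity: shared-term count."""
    return sum(min(a[t], b[t]) for t in a)

def generate_hypothetical(question: str) -> str:
    # In practice: llm(f"Write a short passage answering: {question}")
    return ("Buzz Aldrin was the second astronaut to walk on the moon "
            "during Apollo 11.")

docs = [
    "Apollo 11 crew: Neil Armstrong, Buzz Aldrin, and Michael Collins.",
    "The moon has no atmosphere.",
]
q = "Who's the second person to walk on the moon?"
# Embed the fake answer, not the question; it looks more like the corpus.
hyde_vec = embed(generate_hypothetical(q))
best = max(docs, key=lambda d: overlap(hyde_vec, embed(d)))
```

The hypothetical answer shares vocabulary ("Apollo", "Aldrin") with the crew document that the terse question alone would miss.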

Evaluating RAG Pipelines

Build a gold-standard answers dataset. Evaluate using:

  1. Precision: Percentage of retrieved chunks that are relevant.
    • (LLM-as-a-judge: “Is this chunk relevant to the user’s query? Yes/No.”)
  2. Recall: Did the retrieval fetch all necessary chunks?
  3. Faithfulness/Groundedness: Did the LLM stick to provided context or hallucinate?
    • (LLM-as-a-judge: “Does the provided context support this statement in the final answer? Yes/No.”)
  4. Answer Relevance: Did the final answer address the user’s query?

It can be helpful to think of RAG pipelines as recommendation systems.

There are many ways to surface relevant content without embeddings:

  • Keyword/full-text search
  • Metadata filters
  • Hybrid approaches

These methods often deliver extremely accurate and specific retrievals.

Beware vendor lock-in with proprietary embedding services. Whenever possible, favor open-source models (e.g., BGE over OpenAI). Cloud-based embedding providers are still emerging—Fireworks AI is one option, but their models may lag behind the latest open-source offerings.

Today, vendors earn more from LLMs than from embedding services. Query decomposition and multi-step retrieval can increase per-query costs. New technologies are often expensive at first; however, as models improve and competition grows, prices tend to fall.