Personal Notes/Guide for AI Engineering

Bryan Lai

(Synthesized from insights shared by experts at Traceloop, QuotientAI, HoneyHive, Halluminate, and OpenPipe)

I. Introduction: The Imperative of Evaluation

Building applications with Large Language Models (LLMs) introduces a paradigm shift from traditional deterministic software engineering. LLMs are inherently non-deterministic, creative, and sometimes unpredictable ("magical pony sometimes works, sometimes doesn't" - Dhruv Singh). This makes rigorous evaluation not just a best practice, but a fundamental necessity for building reliable, trustworthy, and valuable AI systems, especially complex ones like agents where errors can compound catastrophically.

  • The Core Challenge: Moving beyond "vibe checks" and subjective assessments to systematically measure and improve performance against defined objectives. You cannot improve what you cannot measure.
  • Why It's Critical:
    • Quality & Reliability: Ensure outputs meet user expectations and business requirements. Reduce hallucinations, errors, and inconsistencies.
    • Trust & Safety: Verify outputs are grounded, factual (when needed), unbiased, and safe for users.
    • Performance Optimization: Identify bottlenecks, compare models/prompts, and guide improvements (including fine-tuning).
    • Cost Management: Justify resource allocation and potentially enable moves to smaller, cheaper models via fine-tuning validation.
    • Understanding the System: Gain insights into failure modes, user interactions, and model capabilities/limitations.
  • The Mindset Shift: Evaluation is not just QA or testing; it's an integral part of the AI development lifecycle – R&D, product definition, and continuous improvement. Treat it as a first-class citizen, not an afterthought ("Evals are part of your product" - Freddie Vargas).

II. Foundational Principles for Effective Evaluation

Several core philosophies emerged consistently across the discussions:

  1. Start Simple, Iterate Rigorously: Don't over-engineer your application or your evals initially. Get a basic version working, deploy it (even to a small group), collect real data, and then build out more sophisticated evals based on observed failures and needs. ("Start simple... only if that doesn't work you can make things more complex" - Nir Gazit). Avoid the temptation to build the most complex system upfront.
  2. Eval-Driven Development: Define your performance expectations before or during development, not just after. Your understanding of what "good" looks like shapes the system you build. ("Your eval is what you're building" - Dhruv Singh).
  3. Decompose Complexity: Break down complex tasks and workflows into smaller, manageable components. Evaluate each component individually (unit tests) before evaluating the integrated system (integration tests). This applies to both the application logic (e.g., RAG retrieval vs. generation) and the evaluations themselves. ("Evaluate for one thing at a time" - Wyatt Marshall).
  4. Data is King (Real Data is Gold): Evaluations must be based on data that accurately reflects real-world usage patterns.
    • Log Everything: Capture inputs, outputs, intermediate steps, tool calls, user feedback, timestamps, model versions, etc. Structured logging (e.g., OpenLLMetry) is crucial for analysis. ("Log everything that you can because you don't know what will be useful" - Nir Gazit).
    • Prioritize Production Data: Use real user queries and interactions whenever possible. Synthetic data has limitations and may not capture the true input distribution or user intent. ("Real data... that is the gold truth" - Kyle Corbitt).
    • Golden Datasets: Curate a high-quality, representative set of examples with known good outputs or desired properties. This is essential for consistent benchmarking and regression testing. Start small, but ensure it covers critical paths and known edge cases.
  5. Confidence Over Certainty: In the non-deterministic world of LLMs, aim for high confidence that your system meets requirements within acceptable thresholds, rather than absolute certainty. Evaluation is about building this confidence systematically. (Wyatt Marshall).
  6. Metrics Matter, But Context is Crucial: Choose metrics relevant to your objectives. Understand the limitations of automated metrics (especially proxies) and always correlate them with end-user value and human judgment.
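The "log everything" principle above can be sketched with the standard library alone. This is an illustrative sketch, not the OpenLLMetry schema: the field names and the `log_llm_call` helper are assumptions for demonstration, and a real setup would use a structured-logging platform instead of a plain logger.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_calls")

def log_llm_call(model: str, prompt: str, output: str, **extra) -> dict:
    """Emit one structured JSON record per LLM call (illustrative schema)."""
    record = {
        "call_id": str(uuid.uuid4()),   # lets you trace multi-step workflows
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "output": output,
        **extra,  # e.g. latency_ms, tool_calls, user_feedback
    }
    logger.info(json.dumps(record))
    return record

rec = log_llm_call("gpt-4o", "Summarize: ...", "A short summary.", latency_ms=412)
```

Keeping one JSON object per call makes later analysis (clustering failures, building golden datasets) a matter of filtering log lines rather than re-instrumenting the app.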

III. The Evaluation Lifecycle: A Practical Workflow

Building a robust evaluation system is an iterative process:

  1. Define Objectives & Expectations:
    • What specific task(s) should the system perform?
    • What defines a successful outcome? (e.g., correct classification, relevant summary, completed task by agent)
    • What defines failure? (e.g., hallucination, incorrect tool use, task abandonment, harmful content)
    • What are the acceptable performance thresholds? (e.g., accuracy > 95%, latency < 500ms, task success rate > 90%)
    • Consider different dimensions: Accuracy, Relevance, Coherence, Faithfulness/Groundedness, Safety, Tone, Conciseness, Completeness, Task Completion, Tool Use Correctness, etc.
  2. Establish a Baseline:
    • Start with the strongest available prompted model (e.g., GPT-4o, Claude 3.5 Sonnet) to understand the task's feasibility and gather initial data. (Kyle Corbitt)
    • This provides a benchmark against which to measure future improvements (e.g., from fine-tuning or better prompting).
  3. Gather & Prepare Data:
    • Implement comprehensive logging from day one.
    • Collect real user interactions if possible. If not, carefully curate or simulate representative data.
    • Create a "Golden Dataset": Manually verify or create high-quality input/output pairs or inputs with desired properties for key scenarios and edge cases. This is your ground truth for regression testing.
  4. Design & Implement Evaluation Techniques (The Toolbox - See Section IV):
    • Choose appropriate methods based on the task and objectives. Start with simpler, deterministic checks.
    • Develop prompts for LLM-as-judge carefully, focusing on single criteria and clear instructions.
    • Set up an evaluation harness or use a platform (like HoneyHive, Traceloop, QuotientAI, Halluminate) to automate running evals.
  5. Run Evaluations Iteratively:
    • Fast Evals (Inner Loop): Run quick checks (e.g., unit tests, cheap LLM judges on small datasets) during development/prompt tuning for rapid feedback. (Kyle Corbitt)
    • Slow Evals (Outer Loop): Run comprehensive evaluations on larger datasets or using more expensive methods (e.g., human review, complex LLM judges) periodically or pre-deployment. Measure against business/product KPIs. (Kyle Corbitt)
  6. Analyze Failures & Identify Patterns:
    • Don't just look at aggregate scores. Dive deep into where and why the system fails.
    • Use observability tools to trace errors back through multi-step processes.
    • Cluster inputs/outputs to find patterns in failures (e.g., specific user intents, topics causing hallucinations). (Freddie Vargas)
    • Look at distributions (e.g., tool usage frequency) for anomalies. (Nir Gazit)
  7. Iterate and Improve:
    • Use failure analysis to guide improvements: refine prompts, improve RAG retrieval, add/fix tools, fine-tune models, update training/eval data.
    • Update your evaluation suite itself based on new failure modes discovered.
  8. Monitor Continuously in Production:
    • Track key metrics over time to detect performance degradation or data drift. (Kyle Corbitt, Wyatt Marshall)
    • Sample production traffic for ongoing evaluation. Set up alerts for significant drops in performance or changes in distributions.
    • Continuously collect user feedback (explicit thumbs up/down, implicit signals like regeneration, edits).

IV. Evaluation Techniques: The Toolbox

Choose the right tool(s) for the job, often layering multiple techniques:

  1. Deterministic Checks (Code-Based Evals):
    • What: Use traditional code assertions.
    • Examples: Regex matching, keyword presence/absence, output length constraints, JSON schema validation, checking API call success codes.
    • Pros: Fast, cheap, reliable, objective.
    • Cons: Limited to easily quantifiable aspects; brittle for complex/subjective outputs.
    • When: Essential sanity checks, validating structured output, basic guardrails. Start here. (Dhruv Singh)
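A minimal sketch of such code-based checks, using only the standard library. The specific criteria (length cap, boilerplate regex, JSON validity) are illustrative examples, not a fixed recipe:

```python
import json
import re

def _is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def check_output(text: str) -> dict:
    """Run cheap deterministic checks on a model output; criteria are illustrative."""
    return {
        "non_empty": bool(text.strip()),
        "under_500_chars": len(text) <= 500,
        "no_apology_boilerplate": not re.search(r"(?i)as an ai language model", text),
        "valid_json": _is_valid_json(text),
    }

print(check_output('{"answer": 42}'))
```

Each check is a named boolean, so results aggregate cleanly across a dataset and failures point to a specific property rather than a vague "bad output."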
  2. Reference-Based Evals:
    • What: Compare the generated output against a known "golden" or reference answer.
    • Examples: Exact match (for classification), BLEU/ROUGE (for summarization/translation - use with caution), Semantic Similarity (using embeddings).
    • Pros: Objective if the reference is truly golden.
    • Cons: Requires high-quality reference data (hard to get/maintain), surface-level metrics like BLEU/ROUGE can be misleading.
    • When: Classification, extraction with known answers, regression testing against a golden set.
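Two reference-based checks can be sketched as follows. The bag-of-words cosine is a deliberately crude, dependency-free stand-in for embedding-based semantic similarity (which would normally use a real embedding model); treat it only as a rough regression signal:

```python
import math
from collections import Counter

def exact_match(pred: str, ref: str) -> bool:
    """Strict reference check, suitable for classification labels."""
    return pred.strip().lower() == ref.strip().lower()

def cosine_overlap(pred: str, ref: str) -> float:
    """Bag-of-words cosine similarity: a crude stand-in for embedding
    similarity. 1.0 = identical word counts, 0.0 = no shared words."""
    a, b = Counter(pred.lower().split()), Counter(ref.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(exact_match("Positive", " positive "))  # True
print(cosine_overlap("the cat sat", "the cat slept"))
```

The same caution as with BLEU/ROUGE applies: surface overlap can score a wrong answer highly and a correct paraphrase poorly.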
  3. LLM-as-Judge Evals:
    • What: Use another LLM (the "judge") to evaluate the output based on given criteria.
    • Pros: Can evaluate subjective qualities (tone, coherence, relevance, safety), flexible.
    • Cons: Can be slow/expensive, subject to the judge LLM's own biases and limitations, requires careful setup for reliability. Do not treat as an infallible black box.
    • Best Practices (CRITICAL):
      • Decompose Criteria: Evaluate one specific criterion per judge call (e.g., one call for factual accuracy, another for tone). Avoid vague, multi-faceted prompts like "Is this response good?". (Multiple speakers)
      • Clear Prompting: Provide precise instructions, context, definitions of criteria, and few-shot examples (good/bad).
      • Binary Outputs: Frame criteria as yes/no questions where possible for easier aggregation and clearer results. (Eugene Yan via speakers)
      • Focus on Explanations: Prompt the judge LLM to explain its reasoning (Chain-of-Thought). Align this reasoning with human rationale, don't just trust the score/label. This is key to tuning the evaluator. (Dhruv Singh's crucial insight)
      • Mitigate Bias: Be aware of self-preference bias (LLMs favoring their own output style) and positional bias. Randomize order when comparing outputs.
      • Use Multiple Judges (Jury): Have several different judge models evaluate and aggregate results (e.g., majority vote) to improve robustness and reduce single-model bias. (Wyatt Marshall)
      • Fine-tune Judges: For specific, high-volume evaluation tasks, consider fine-tuning a dedicated judge model on human-aligned evaluation data. (Wyatt Marshall)
      • Reflection: Ask the judge LLM to review or critique its own evaluation ("Are you sure?") – often yields a small but surprisingly effective boost in reliability. (Wyatt Marshall)
    • When: Evaluating subjective qualities, complex outputs where deterministic checks fail, large-scale evaluation where human review is impractical.
  4. Human Evaluation:
    • What: Humans review and score outputs based on guidelines.
    • Pros: Gold standard for subjective quality, nuance, and alignment with user needs. Essential for creating golden datasets and validating automated evals.
    • Cons: Slow, expensive, potentially inconsistent (requires clear guidelines and annotator training), doesn't scale easily.
    • Best Practices:
      • Use domain experts for complex tasks.
      • Focus human effort on reviewing/editing model outputs rather than generating from scratch. (Kyle Corbitt)
      • Prefer comparative judgments (A vs. B) over absolute scores where possible; pairwise comparisons tend to be more reliable and consistent across annotators.
      • Gather feedback directly from end-users within the application (thumbs up/down, feedback forms, implicit signals).
    • When: Establishing ground truth, evaluating highly subjective or safety-critical aspects, validating LLM-as-judge systems, handling complex edge cases.

V. Evaluating Agentic Systems: The Next Frontier

Agents introduce significant complexity due to their multi-step, multi-turn nature, tool use, and potential for long-running interactions.

  • Key Challenges: Compounding errors, evaluating intermediate steps, assessing tool use, long-term goal alignment, getting stuck in loops.
  • Evaluation Strategies:
    1. Step-wise Evaluation: Apply the techniques above (deterministic, LLM-judge, human) to each step of the agent's process (e.g., evaluating the chosen tool, the inputs to the tool, the processing of the tool's output). Treat these as "unit tests." (Nir Gazit, Freddie Vargas)
    2. Trajectory Evaluation: Evaluate the entire sequence of actions (the trajectory) taken by the agent, not just the final outcome. Did it take an efficient path? Did it get stuck? Did it achieve sub-goals correctly? (Dhruv Singh)
    3. Tool Use Evaluation: Specifically check:
      • Was the correct tool selected for the current sub-task/intent? (Requires intent detection or ground truth).
      • Were the parameters/inputs provided to the tool correct and well-formed?
      • Was the output of the tool correctly interpreted and used in the next step? (Freddie Vargas)
    4. Task Completion Evaluation: Did the agent successfully achieve the overall goal given the initial instructions? This is the ultimate "end-to-end test."
    5. Simulation: Test agents in controlled, simulated environments that mimic production conditions. Run many simulations to understand behavior across different scenarios and measure success rates. Analogy: Self-driving cars in simulators. (Dhruv Singh)
    6. Supervisor Agents (Eval Agents): Build a separate agent whose job is to monitor and evaluate the primary agent's trajectory in real-time or post-hoc.
      • It reviews the primary agent's steps, reasoning, and tool calls against predefined criteria or learned patterns.
      • Crucially, align the supervisor agent's reasoning with human expert judgment. (Dhruv Singh, Wyatt Marshall)
      • Can potentially provide real-time correction or guidance (Test-Time Search / Verifier idea). (Dhruv Singh)
      • Can escalate to humans when uncertain or encountering critical failures (Human-in-the-Loop supervision). (Wyatt Marshall)
    7. Loop Detection & Analysis: Monitor how often agents get stuck repeating steps. Analyze the inputs/states that cause loops to identify areas needing improvement (e.g., better tools, different prompts). (Nir Gazit)
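The loop-detection idea above can be sketched as a cheap heuristic over a logged trajectory: flag any (tool, serialized-arguments) pair the agent repeats too many times in one run. The threshold and the trajectory shape are assumptions for illustration:

```python
from collections import Counter

def detect_loops(trajectory: list[tuple[str, str]],
                 threshold: int = 3) -> list[tuple[str, str]]:
    """Flag (tool, serialized_args) pairs repeated >= threshold times in a
    single agent run - a cheap heuristic for 'stuck in a loop'."""
    counts = Counter(trajectory)
    return [step for step, n in counts.items() if n >= threshold]

run = [
    ("search", "q=refund policy"),
    ("search", "q=refund policy"),
    ("search", "q=refund policy"),
    ("answer", "..."),
]
print(detect_loops(run))  # the repeated search step is flagged
```

Exact-repetition counting misses loops with slightly varying arguments; a fuller version would normalize or cluster the arguments first, but even this crude check surfaces the most common stuck states in logs.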

VI. Evaluating RAG Systems

Retrieval-Augmented Generation (RAG) is a common multi-step pattern requiring specific evaluation focus:

  • Decomposition is Key: Evaluate the Retrieval step and the Generation step separately before evaluating end-to-end. (Wyatt Marshall)
  • Retrieval Evaluation:
    • Metrics: Hit Rate (Is relevant context retrieved?), Mean Reciprocal Rank (MRR), Precision@K, Recall@K.
    • Method: Use a golden dataset of queries with known relevant document chunks. Check if the retriever finds these chunks within the top K results.
  • Generation Evaluation (Conditioned on Retrieved Context):
    • Metrics: Faithfulness / Groundedness (Does the answer accurately reflect the provided context? Is it free of hallucination based on the context?), Relevance (Does the answer address the query?).
    • Method: Use LLM-as-judge (prompted to check support only within the given context snippets) or human evaluation.
  • End-to-End Evaluation: Measure overall answer quality, relevance, and faithfulness, understanding that failures can originate in either retrieval or generation.
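The retrieval metrics above (Hit Rate, MRR) are straightforward to compute given a golden set of queries with known relevant chunks. A minimal sketch, assuming chunks are identified by string IDs:

```python
def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any known-relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk; 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs from a golden set."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["c3", "c1", "c7"], {"c1"}),  # relevant at rank 2 -> RR = 0.5
    (["c9", "c2", "c4"], {"c8"}),  # missed entirely    -> RR = 0.0
]
print(mean_reciprocal_rank(runs))  # 0.25
```

Scoring retrieval this way, independently of generation, tells you whether a bad final answer was doomed before the LLM ever saw the context.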

VII. Tooling and Infrastructure

  • Observability & Logging: Essential for capturing the data needed for evaluation. Use structured logging platforms that understand LLM workflows (e.g., Traceloop/OpenLLMetry, HoneyHive, LangSmith, etc.).
  • Evaluation Platforms: Tools designed to manage datasets, run evaluations (code-based, LLM-based, human), track results, compare experiments, and visualize outcomes (e.g., HoneyHive, Traceloop, QuotientAI, Halluminate, LangSmith, OpenPipe for fine-tuning evals).
  • Visualization: Tools to visualize agent trajectories, distributions, and failure patterns can be invaluable for debugging. (Nir Gazit mentioned Traceloop's capability).

VIII. Organizational Considerations

  • Dedicated Effort: Treat evaluation seriously. Consider dedicating engineer time or even forming a small team focused on evaluation infrastructure and quality, especially for complex or critical applications. (Dhruv Singh, Freddie Vargas)
  • Balanced Approach: Don't fall into "eval paralysis." Use time-boxed efforts (like sprints) to build out eval suites iteratively. Focus on the 80/20 – cover the most critical paths and failure modes first. (Freddie Vargas)
  • Cross-Functional Collaboration: Evaluation requires input from product managers (defining goals), engineers (implementing evals), and potentially domain experts/users (providing ground truth).

IX. Common Pitfalls and Misconceptions

  • Starting Too Late: Trying to bolt on evaluations after a complex system is already built is much harder.
  • Evaluating on Poor/Unrepresentative Data: Your evals are only as good as the data they run on.
  • Over-Reliance on Single Metrics/Methods: Use a diverse set of evaluations; understand the limitations of each metric (e.g., BLEU/ROUGE).
  • Treating LLM-as-Judge as Ground Truth: They are helpful tools but biased and imperfect. Always validate against human judgment, especially initially. Focus on aligning the reasoning.
  • Not Decomposing Complexity: Trying to evaluate too many things at once leads to unreliable results.
  • Ignoring Production Monitoring: Models drift, data drifts. Continuous evaluation is necessary.
  • Thinking Fine-tuning Solves Everything: Fine-tuning helps, but it needs good data and evaluation to be effective. It won't magically fix foundational issues if the base model fundamentally can't perform the task or the data lacks signal. (Kyle Corbitt)
  • Skipping the Baseline: Not knowing how a simple prompted approach performs makes it hard to measure the value of more complex solutions like fine-tuning or agents.

X. Conclusion: Building Trust Through Rigor

Evaluating LLM applications and agents is a challenging but essential engineering discipline. It requires a shift in mindset, a commitment to data quality, a structured approach, and continuous iteration. By embracing these principles and techniques, teams can move beyond brittle prototypes and build robust, reliable, and truly valuable AI systems that users can trust. The journey involves starting simple, decomposing complexity, measuring what matters, learning from failures, and relentlessly aligning system behavior with desired outcomes through the powerful lens of evaluation.