Personal Notes/Guide for AI Engineering

Bryan Lai

(Synthesized from insights shared by experts at Traceloop, QuotientAI, HoneyHive, Halluminate, and OpenPipe)

I. Introduction: The Imperative of Evaluation

Building applications with Large Language Models (LLMs) introduces a paradigm shift from traditional deterministic software engineering. LLMs are inherently non-deterministic, creative, and sometimes unpredictable ("magical pony sometimes works, sometimes doesn't" - Dhruv Singh). This makes rigorous evaluation not just a best practice, but a fundamental necessity for building reliable, trustworthy, and valuable AI systems, especially complex ones like agents where errors can compound catastrophically.

  • The Core Challenge: Moving beyond "vibe checks" and subjective assessments to systematically measure and improve performance against defined objectives. You cannot improve what you cannot measure.
  • Why It's Critical:
    • Quality & Reliability: Ensure outputs meet user expectations and business requirements. Reduce hallucinations, errors, and inconsistencies.
    • Trust & Safety: Verify outputs are grounded, factual (when needed), unbiased, and safe for users.
    • Performance Optimization: Identify bottlenecks, compare models/prompts, and guide improvements (including fine-tuning).
    • Cost Management: Justify resource allocation and potentially enable moves to smaller, cheaper models via fine-tuning validation.
    • Understanding the System: Gain insights into failure modes, user interactions, and model capabilities/limitations.
  • The Mindset Shift: Evaluation is not just QA or testing; it's an integral part of the AI development lifecycle – R&D, product definition, and continuous improvement. Treat it as a first-class citizen, not an afterthought ("Evals are part of your product" - Freddie Vargas).

II. Foundational Principles for Effective Evaluation

Several core philosophies emerged consistently across the discussions:

  1. Start Simple, Iterate Rigorously: Don't over-engineer your application or your evals initially. Get a basic version working, deploy it (even to a small group), collect real data, and then build out more sophisticated evals based on observed failures and needs. ("Start simple... only if that doesn't work you can make things more complex" - Nir Gazit). Avoid the temptation to build the most complex system upfront.
  2. Eval-Driven Development: Define your performance expectations before or during development, not just after. Your understanding of what "good" looks like shapes the system you build. ("Your eval is what you're building" - Dhruv Singh).
  3. Decompose Complexity: Break down complex tasks and workflows into smaller, manageable components. Evaluate each component individually (unit tests) before evaluating the integrated system (integration tests). This applies to both the application logic (e.g., RAG retrieval vs. generation) and the evaluations themselves. ("Evaluate for one thing at a time" - Wyatt Marshall).
  4. Data is King (Real Data is Gold): Evaluations must be based on data that accurately reflects real-world usage patterns.
    • Log Everything: Capture inputs, outputs, intermediate steps, tool calls, user feedback, timestamps, model versions, etc. Structured logging (e.g., OpenLLMetry) is crucial for analysis. ("Log everything that you can because you don't know what will be useful" - Nir Gazit).
    • Prioritize Production Data: Use real user queries and interactions whenever possible. Synthetic data has limitations and may not capture the true input distribution or user intent. ("Real data... that is the gold truth" - Kyle Corbitt).
    • Golden Datasets: Curate a high-quality, representative set of examples with known good outputs or desired properties. This is essential for consistent benchmarking and regression testing. Start small, but ensure it covers critical paths and known edge cases.
  5. Confidence Over Certainty: In the non-deterministic world of LLMs, aim for high confidence that your system meets requirements within acceptable thresholds, rather than absolute certainty. Evaluation is about building this confidence systematically. (Wyatt Marshall).
  6. Metrics Matter, But Context is Crucial: Choose metrics relevant to your objectives. Understand the limitations of automated metrics (especially proxies) and always correlate them with end-user value and human judgment.
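The "log everything" principle above can be sketched with the standard library alone. This is an illustrative sketch, not the OpenLLMetry schema: the field names and the `log_llm_call` helper are assumptions for demonstration, and a real setup would use a structured-logging platform instead of a plain logger.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_calls")

def log_llm_call(model: str, prompt: str, output: str, **extra) -> dict:
    """Emit one structured JSON record per LLM call (illustrative schema)."""
    record = {
        "call_id": str(uuid.uuid4()),   # lets you trace multi-step workflows
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "output": output,
        **extra,  # e.g. latency_ms, tool_calls, user_feedback
    }
    logger.info(json.dumps(record))
    return record

rec = log_llm_call("gpt-4o", "Summarize: ...", "A short summary.", latency_ms=412)
```

Keeping one JSON object per call makes later analysis (clustering failures, building golden datasets) a matter of filtering log lines rather than re-instrumenting the app.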

III. The Evaluation Lifecycle: A Practical Workflow

Building a robust evaluation system is an iterative process:

  1. Define Objectives & Expectations:
    • What specific task(s) should the system perform?
    • What defines a successful outcome? (e.g., correct classification, relevant summary, completed task by agent)
    • What defines failure? (e.g., hallucination, incorrect tool use, task abandonment, harmful content)
    • What are the acceptable performance thresholds? (e.g., accuracy > 95%, latency < 500ms, task success rate > 90%)
    • Consider different dimensions: Accuracy, Relevance, Coherence, Faithfulness/Groundedness, Safety, Tone, Conciseness, Completeness, Task Completion, Tool Use Correctness, etc.
  2. Establish a Baseline:
    • Start with the strongest available prompted model (e.g., GPT-4o, Claude 3.5 Sonnet) to understand the task's feasibility and gather initial data. (Kyle Corbitt)
    • This provides a benchmark against which to measure future improvements (e.g., from fine-tuning or better prompting).
  3. Gather & Prepare Data:
    • Implement comprehensive logging from day one.
    • Collect real user interactions if possible. If not, carefully curate or simulate representative data.
    • Create a "Golden Dataset": Manually verify or create high-quality input/output pairs or inputs with desired properties for key scenarios and edge cases. This is your ground truth for regression testing.
  4. Design & Implement Evaluation Techniques (The Toolbox - See Section IV):
    • Choose appropriate methods based on the task and objectives. Start with simpler, deterministic checks.
    • Develop prompts for LLM-as-judge carefully, focusing on single criteria and clear instructions.
    • Set up an evaluation harness or use a platform (like HoneyHive, Traceloop, QuotientAI, Halluminate) to automate running evals.
  5. Run Evaluations Iteratively:
    • Fast Evals (Inner Loop): Run quick checks (e.g., unit tests, cheap LLM judges on small datasets) during development/prompt tuning for rapid feedback. (Kyle Corbitt)
    • Slow Evals (Outer Loop): Run comprehensive evaluations on larger datasets or using more expensive methods (e.g., human review, complex LLM judges) periodically or pre-deployment. Measure against business/product KPIs. (Kyle Corbitt)
  6. Analyze Failures & Identify Patterns:
    • Don't just look at aggregate scores. Dive deep into where and why the system fails.
    • Use observability tools to trace errors back through multi-step processes.
    • Cluster inputs/outputs to find patterns in failures (e.g., specific user intents, topics causing hallucinations). (Freddie Vargas)
    • Look at distributions (e.g., tool usage frequency) for anomalies. (Nir Gazit)
  7. Iterate and Improve:
    • Use failure analysis to guide improvements: refine prompts, improve RAG retrieval, add/fix tools, fine-tune models, update training/eval data.
    • Update your evaluation suite itself based on new failure modes discovered.
  8. Monitor Continuously in Production:
    • Track key metrics over time to detect performance degradation or data drift. (Kyle Corbitt, Wyatt Marshall)
    • Sample production traffic for ongoing evaluation. Set up alerts for significant drops in performance or changes in distributions.
    • Continuously collect user feedback (explicit thumbs up/down, implicit signals like regeneration, edits).

IV. Evaluation Techniques: The Toolbox

Choose the right tool(s) for the job, often layering multiple techniques:

  1. Deterministic Checks (Code-Based Evals):
    • What: Use traditional code assertions.
    • Examples: Regex matching, keyword presence/absence, output length constraints, JSON schema validation, checking API call success codes.
    • Pros: Fast, cheap, reliable, objective.
    • Cons: Limited to easily quantifiable aspects; brittle for complex/subjective outputs.
    • When: Essential sanity checks, validating structured output, basic guardrails. Start here. (Dhruv Singh)
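A minimal sketch of such code-based checks, using only the standard library. The specific criteria (length cap, boilerplate regex, JSON validity) are illustrative examples, not a fixed recipe:

```python
import json
import re

def _is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def check_output(text: str) -> dict:
    """Run cheap deterministic checks on a model output; criteria are illustrative."""
    return {
        "non_empty": bool(text.strip()),
        "under_500_chars": len(text) <= 500,
        "no_apology_boilerplate": not re.search(r"(?i)as an ai language model", text),
        "valid_json": _is_valid_json(text),
    }

print(check_output('{"answer": 42}'))
```

Each check is a named boolean, so results aggregate cleanly across a dataset and failures point to a specific property rather than a vague "bad output."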
  2. Reference-Based Evals:
    • What: Compare the generated output against a known "golden" or reference answer.
    • Examples: Exact match (for classification), BLEU/ROUGE (for summarization/translation - use with caution), Semantic Similarity (using embeddings).
    • Pros: Objective if the reference is truly golden.
    • Cons: Requires high-quality reference data (hard to get/maintain), surface-level metrics like BLEU/ROUGE can be misleading.
    • When: Classification, extraction with known answers, regression testing against a golden set.
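Two reference-based checks can be sketched as follows. The bag-of-words cosine is a deliberately crude, dependency-free stand-in for embedding-based semantic similarity (which would normally use a real embedding model); treat it only as a rough regression signal:

```python
import math
from collections import Counter

def exact_match(pred: str, ref: str) -> bool:
    """Strict reference check, suitable for classification labels."""
    return pred.strip().lower() == ref.strip().lower()

def cosine_overlap(pred: str, ref: str) -> float:
    """Bag-of-words cosine similarity: a crude stand-in for embedding
    similarity. 1.0 = identical word counts, 0.0 = no shared words."""
    a, b = Counter(pred.lower().split()), Counter(ref.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(exact_match("Positive", " positive "))  # True
print(cosine_overlap("the cat sat", "the cat slept"))
```

The same caution as with BLEU/ROUGE applies: surface overlap can score a wrong answer highly and a correct paraphrase poorly.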
  3. LLM-as-Judge Evals:
    • What: Use another LLM (the "judge") to evaluate the output based on given criteria.
    • Pros: Can evaluate subjective qualities (tone, coherence, relevance, safety), flexible.
    • Cons: Can be slow/expensive, subject to the judge LLM's own biases and limitations, requires careful setup for reliability. Do not treat as an infallible black box.
    • Best Practices (CRITICAL):
      • Decompose Criteria: Evaluate one specific criterion per judge call (e.g., one call for factual accuracy, another for tone). Avoid vague, multi-faceted prompts like "Is this response good?". (Multiple speakers)
      • Clear Prompting: Provide precise instructions, context, definitions of criteria, and few-shot examples (good/bad).
      • Binary Outputs: Frame criteria as yes/no questions where possible for easier aggregation and clearer results. (Eugene Yan via speakers)
      • Focus on Explanations: Prompt the judge LLM to explain its reasoning (Chain-of-Thought). Align this reasoning with human rationale, don't just trust the score/label. This is key to tuning the evaluator. (Dhruv Singh's crucial insight)
      • Mitigate Bias: Be aware of self-preference bias (LLMs favoring their own output style) and positional bias. Randomize order when comparing outputs.
      • Use Multiple Judges (Jury): Have several different judge models evaluate and aggregate results (e.g., majority vote) to improve robustness and reduce single-model bias. (Wyatt Marshall)
      • Fine-tune Judges: For specific, high-volume evaluation tasks, consider fine-tuning a dedicated judge model on human-aligned evaluation data. (Wyatt Marshall)
      • Reflection: Ask the judge LLM to review or critique its own evaluation ("Are you sure?") – often yields a small but surprisingly effective boost in reliability. (Wyatt Marshall)
    • When: Evaluating subjective qualities, complex outputs where deterministic checks fail, large-scale evaluation where human review is impractical.
  4. Human Evaluation:
    • What: Humans review and score outputs based on guidelines.
    • Pros: Gold standard for subjective quality, nuance, and alignment with user needs. Essential for creating golden datasets and validating automated evals.
    • Cons: Slow, expensive, potentially inconsistent (requires clear guidelines and annotator training), doesn't scale easily.
    • Best Practices:
      • Use domain experts for complex tasks.
      • Focus human effort on reviewing/editing model outputs rather than generating from scratch. (Kyle Corbitt)
      • Prefer comparative judgments (A vs. B) over absolute scores where possible; pairwise comparisons tend to be more reliable and consistent across annotators.
      • Gather feedback directly from end-users within the application (thumbs up/down, feedback forms, implicit signals).
    • When: Establishing ground truth, evaluating highly subjective or safety-critical aspects, validating LLM-as-judge systems, handling complex edge cases.

V. Evaluating Agentic Systems: The Next Frontier

Agents introduce significant complexity due to their multi-step, multi-turn nature, tool use, and potential for long-running interactions.

  • Key Challenges: Compounding errors, evaluating intermediate steps, assessing tool use, long-term goal alignment, getting stuck in loops.
  • Evaluation Strategies:
    1. Step-wise Evaluation: Apply the techniques above (deterministic, LLM-judge, human) to each step of the agent's process (e.g., evaluating the chosen tool, the inputs to the tool, the processing of the tool's output). Treat these as "unit tests." (Nir Gazit, Freddie Vargas)
    2. Trajectory Evaluation: Evaluate the entire sequence of actions (the trajectory) taken by the agent, not just the final outcome. Did it take an efficient path? Did it get stuck? Did it achieve sub-goals correctly? (Dhruv Singh)
    3. Tool Use Evaluation: Specifically check:
      • Was the correct tool selected for the current sub-task/intent? (Requires intent detection or ground truth).
      • Were the parameters/inputs provided to the tool correct and well-formed?
      • Was the output of the tool correctly interpreted and used in the next step? (Freddie Vargas)
    4. Task Completion Evaluation: Did the agent successfully achieve the overall goal given the initial instructions? This is the ultimate "end-to-end test."
    5. Simulation: Test agents in controlled, simulated environments that mimic production conditions. Run many simulations to understand behavior across different scenarios and measure success rates. Analogy: Self-driving cars in simulators. (Dhruv Singh)
    6. Supervisor Agents (Eval Agents): Build a separate agent whose job is to monitor and evaluate the primary agent's trajectory in real-time or post-hoc.
      • It reviews the primary agent's steps, reasoning, and tool calls against predefined criteria or learned patterns.
      • Crucially, align the supervisor agent's reasoning with human expert judgment. (Dhruv Singh, Wyatt Marshall)
      • Can potentially provide real-time correction or guidance (Test-Time Search / Verifier idea). (Dhruv Singh)
      • Can escalate to humans when uncertain or encountering critical failures (Human-in-the-Loop supervision). (Wyatt Marshall)
    7. Loop Detection & Analysis: Monitor how often agents get stuck repeating steps. Analyze the inputs/states that cause loops to identify areas needing improvement (e.g., better tools, different prompts). (Nir Gazit)
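The loop-detection idea above can be sketched as a cheap heuristic over a logged trajectory: flag any (tool, serialized-arguments) pair the agent repeats too many times in one run. The threshold and the trajectory shape are assumptions for illustration:

```python
from collections import Counter

def detect_loops(trajectory: list[tuple[str, str]],
                 threshold: int = 3) -> list[tuple[str, str]]:
    """Flag (tool, serialized_args) pairs repeated >= threshold times in a
    single agent run - a cheap heuristic for 'stuck in a loop'."""
    counts = Counter(trajectory)
    return [step for step, n in counts.items() if n >= threshold]

run = [
    ("search", "q=refund policy"),
    ("search", "q=refund policy"),
    ("search", "q=refund policy"),
    ("answer", "..."),
]
print(detect_loops(run))  # the repeated search step is flagged
```

Exact-repetition counting misses loops with slightly varying arguments; a fuller version would normalize or cluster the arguments first, but even this crude check surfaces the most common stuck states in logs.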

VI. Evaluating RAG Systems

Retrieval-Augmented Generation (RAG) is a common multi-step pattern requiring specific evaluation focus:

  • Decomposition is Key: Evaluate the Retrieval step and the Generation step separately before evaluating end-to-end. (Wyatt Marshall)
  • Retrieval Evaluation:
    • Metrics: Hit Rate (Is relevant context retrieved?), Mean Reciprocal Rank (MRR), Precision@K, Recall@K.
    • Method: Use a golden dataset of queries with known relevant document chunks. Check if the retriever finds these chunks within the top K results.
  • Generation Evaluation (Conditioned on Retrieved Context):
    • Metrics: Faithfulness / Groundedness (Does the answer accurately reflect the provided context? Is it free of hallucination based on the context?), Relevance (Does the answer address the query?).
    • Method: Use LLM-as-judge (prompted to check support only within the given context snippets) or human evaluation.
  • End-to-End Evaluation: Measure overall answer quality, relevance, and faithfulness, understanding that failures can originate in either retrieval or generation.
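The retrieval metrics above (Hit Rate, MRR) are straightforward to compute given a golden set of queries with known relevant chunks. A minimal sketch, assuming chunks are identified by string IDs:

```python
def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any known-relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk; 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs from a golden set."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["c3", "c1", "c7"], {"c1"}),  # relevant at rank 2 -> RR = 0.5
    (["c9", "c2", "c4"], {"c8"}),  # missed entirely    -> RR = 0.0
]
print(mean_reciprocal_rank(runs))  # 0.25
```

Scoring retrieval this way, independently of generation, tells you whether a bad final answer was doomed before the LLM ever saw the context.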

VII. Tooling and Infrastructure

  • Observability & Logging: Essential for capturing the data needed for evaluation. Use structured logging platforms that understand LLM workflows (e.g., Traceloop/OpenLLMetry, HoneyHive, LangSmith, etc.).
  • Evaluation Platforms: Tools designed to manage datasets, run evaluations (code-based, LLM-based, human), track results, compare experiments, and visualize outcomes (e.g., HoneyHive, Traceloop, QuotientAI, Halluminate, LangSmith, OpenPipe for fine-tuning evals).
  • Visualization: Tools to visualize agent trajectories, distributions, and failure patterns can be invaluable for debugging. (Nir Gazit mentioned Traceloop's capability).

VIII. Organizational Considerations

  • Dedicated Effort: Treat evaluation seriously. Consider dedicating engineer time or even forming a small team focused on evaluation infrastructure and quality, especially for complex or critical applications. (Dhruv Singh, Freddie Vargas)
  • Balanced Approach: Don't fall into "eval paralysis." Use time-boxed efforts (like sprints) to build out eval suites iteratively. Focus on the 80/20 – cover the most critical paths and failure modes first. (Freddie Vargas)
  • Cross-Functional Collaboration: Evaluation requires input from product managers (defining goals), engineers (implementing evals), and potentially domain experts/users (providing ground truth).

IX. Common Pitfalls and Misconceptions

  • Starting Too Late: Trying to bolt on evaluations after a complex system is already built is much harder.
  • Evaluating on Poor/Unrepresentative Data: Your evals are only as good as the data they run on.
  • Over-Reliance on Single Metrics/Methods: Use a diverse set of evaluations; understand the limitations of each metric (e.g., BLEU/ROUGE).
  • Treating LLM-as-Judge as Ground Truth: They are helpful tools but biased and imperfect. Always validate against human judgment, especially initially. Focus on aligning the reasoning.
  • Not Decomposing Complexity: Trying to evaluate too many things at once leads to unreliable results.
  • Ignoring Production Monitoring: Models drift, data drifts. Continuous evaluation is necessary.
  • Thinking Fine-tuning Solves Everything: Fine-tuning helps, but it needs good data and evaluation to be effective. It won't magically fix foundational issues if the base model fundamentally can't perform the task or the data lacks signal. (Kyle Corbitt)
  • Skipping the Baseline: Not knowing how a simple prompted approach performs makes it hard to measure the value of more complex solutions like fine-tuning or agents.

X. Conclusion: Building Trust Through Rigor

Evaluating LLM applications and agents is a challenging but essential engineering discipline. It requires a shift in mindset, a commitment to data quality, a structured approach, and continuous iteration. By embracing these principles and techniques, teams can move beyond brittle prototypes and build robust, reliable, and truly valuable AI systems that users can trust. The journey involves starting simple, decomposing complexity, measuring what matters, learning from failures, and relentlessly aligning system behavior with desired outcomes through the powerful lens of evaluation.