What is the difference between offline and online evaluation?

Offline evaluation runs the bot against a fixed golden dataset on every change, ideally in your CI pipeline, to catch regressions before release. Online evaluation happens on real users through A/B tests, feedback buttons, and guardrail metrics. You need both: offline protects you before shipping, while online reveals what only real usage exposes.

How many examples do I need in my test set?

Start with 50 to 100 well-chosen examples that cover your most important and riskiest scenarios, not thousands of random ones. A small, representative, well-labeled set beats a large noisy one, and you grow it over time by adding real failures from production.

Can I rely on an AI model to evaluate itself?

LLM-as-a-judge is fast, scalable, and widely used, but it carries biases (such as favoring longer answers) and must be calibrated against human judgment on a sample. Use it for qualitative dimensions, keep a periodic human review, and never treat its score as ground truth without checking.

Which metric should I start with?

Task success — whether the user actually achieved their goal — because it is closest to business value. For systems grounded on your documents (RAG), add faithfulness to prevent hallucination, then layer in safety, cost, and latency.

How often should I re-evaluate?

On every meaningful change — a prompt edit, a model upgrade, or a knowledge-base update — and continuously in production. Evaluation is not a one-time launch gate; it is an ongoing process that tells you whether each change helped or hurt.

How to Evaluate Your AI Chatbot: A Practical Guide to LLM Evaluation

Why Evaluating Your Chatbot Is Not Optional

Launching an AI chatbot is easy today; knowing whether it is actually good is the hard part. Language models are probabilistic by nature: the same question can yield two different answers, and "I tried it and it looked fine" on a handful of prompts does not mean it is ready for thousands of customers. Evaluation is the systematic way to answer one question: does the bot do its job accurately, safely, and consistently — before you ship it, and after every change?

The rule is simple: you cannot improve what you do not measure. Without an evaluation system, every prompt tweak or model upgrade becomes a gamble where you do not know whether you improved something or quietly broke something else.

Step Zero: Define What "Success" Means for You

Before any metric, answer this: what task is the bot supposed to accomplish? A support bot succeeds when it resolves the issue or routes it correctly. A sales bot succeeds when it qualifies the lead and collects the right data. An internal employee bot succeeds when it answers accurately and grounded in your policies. Tie evaluation to a business goal, not to a vague sense that the answer reads nicely.

Build a "Golden Dataset"

The foundation of any serious evaluation is a fixed set of examples you test against every time. Collect it from three sources:

Real questions from actual chat logs (after redacting personal data).
Edge cases: vague, incomplete, or out-of-scope questions, or different dialects.
Adversarial cases: attempts to push the bot past its limits or leak information.

Start with 50–100 examples that cover your most important scenarios rather than thousands of random ones. For each, define — as much as possible — the reference answer or the criterion that makes a response correct.

An analytics dashboard with charts and performance metrics, symbolizing measuring AI chatbot performance — Evaluation turns bot quality from a gut feeling into numbers you can track and improve.

What Exactly Do You Measure? The Core Dimensions

"Quality" is a fuzzy word, so break it into measurable dimensions:

Task success: did the user actually achieve their goal? The most important metric and the closest to business value.
Faithfulness (no hallucination): in systems grounded on your sources (RAG), is the answer built only on the retrieved documents, or did the bot invent something?
Relevancy: does the answer actually address the user's question, or just talk around the topic?
Retrieval quality: in RAG, did the system pull the right passages from your knowledge base? (context precision and recall). Most RAG failures come from poor retrieval, not the model itself.
Correctness: does the answer match a known reference?
Safety and compliance: does it refuse dangerous requests? Does it protect personal data? Does it hold the brand's tone and policies? Does it resist jailbreak attempts?
Cost and latency: a great answer that takes 30 seconds or costs too much per conversation can be a commercial failure. Always measure tokens and time.

The Three Evaluation Methods

Human evaluation: the gold standard for quality, but slow, expensive, and hard to scale. Use it to build your reference and to calibrate the automated methods.
LLM-as-a-judge: you use a strong model to score the bot's answers against a clear rubric, or by comparing two answers. Fast and scalable, and the most common approach today — provided you calibrate it against human judgment and watch for its biases (such as favoring the longer answer).
Reference-based automated metrics: such as exact match for closed-answer questions, or semantic similarity via embeddings. Cheap and instant, but limited for open-ended answers.

In practice: mix all three. Fast automated metrics as a first guard, an LLM judge for the qualitative dimensions, and a periodic human review of a sample.

Offline vs. Online Evaluation

There are two complementary types, and you need both:

Offline: you run the bot against your golden dataset on every change, ideally inside a CI/CD pipeline so no update reaches production if the scores drop. This is the regression test that stops you from breaking what already worked.
Online: on real users — A/B testing between two versions, a feedback button (👍/👎), and guardrail metrics (handoff-to-human rate, repeated questions, conversation abandonment). Monitor production continuously, because real user behavior reveals what no test set can.

A Practical Workflow to Launch Evaluation

Define the 3–5 dimensions that truly matter to you (e.g., task success, faithfulness, safety).
Build a small but representative golden set (50–100 cases).
Write a clear scoring rubric for each dimension: what earns a 5 and what earns a 1.
Automate the run: a script that runs the bot over the set and computes the scores (human + LLM judge).
Put it in the deployment pipeline: any score below the threshold blocks the release.
Add a production loop: collect user feedback and failed cases, and keep adding them to your golden set.

Common Mistakes That Make Evaluation Worthless

Testing only the happy path and ignoring edge and adversarial cases.
A test set that is too small or not representative of your real users.
Relying on an LLM judge without calibrating it against human judgment.
Ignoring cost and latency and focusing on quality alone.
Evaluating once at launch instead of making it a continuous process with every update.

The value is not in shipping a bot that talks — it is in owning a system that tells you, in numbers, whether it got better or worse after every change.

Tools That Help

There are open-source and commercial tools that shortcut building an evaluation pipeline, including: Ragas and DeepEval for RAG systems; LangSmith, Braintrust, and Arize Phoenix for monitoring and evaluation; and OpenAI Evals and promptfoo for writing tests. Pick the tool that fits your stack, but remember that the methodology matters more than the tool.

Origami's Role

At Origami we are a technology company that builds AI systems and measures them before handing them over: we design a golden dataset tailored to your domain, build an automated evaluation pipeline wired into deployment, and add production monitoring that catches regressions early and keeps cost under control. The goal is a bot whose numbers you can trust, not just a demo.

Sources

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023): arxiv.org/abs/2306.05685
Ragas documentation for RAG evaluation metrics (faithfulness, relevancy, context quality): docs.ragas.io
LangSmith evaluation documentation: docs.smith.langchain.com
OpenAI Evals: github.com/openai/evals
Anthropic guidance on building evaluations: docs.anthropic.com

How to Evaluate Your AI Chatbot: A Practical Guide to LLM Evaluation

Why Evaluating Your Chatbot Is Not Optional

Step Zero: Define What "Success" Means for You

Build a "Golden Dataset"

What Exactly Do You Measure? The Core Dimensions

The Three Evaluation Methods

Offline vs. Online Evaluation

A Practical Workflow to Launch Evaluation

Common Mistakes That Make Evaluation Worthless

Tools That Help

Origami's Role

Sources

Frequently Asked Questions

Rate this article

Related Articles

Weekly newsletter

Looking for a software solution for your business?

One session. Twenty minutes. No commitments.