Have you ever tried to measure fog with a ruler? Probably not. But evaluating AI chatbots can feel a lot like that. Fog is undefined, slippery, and without clear edges, which makes a ruler a pretty useless tool. Just when you think you’ve nailed down the perfect metric or found “the right answer,” poof: the fog shifts, and you’re back at square one. Still, if you care about building trustworthy AI, learning to navigate this fog is essential.

The perfect evaluation system: Imagine one magic metric that perfectly reflects user satisfaction and business impact. You’d simply tweak a parameter, rerun the evaluation, and if the score goes up, you deploy. That’s the dream; sadly, I have learned it’s just that: a dream.

Drawing of a man looking confused into the fog while holding a ruler. In the fog, the text 'ChatBot Performance' is visible.
I sometimes feel I am getting lost in the fog, trying to figure out if our chatbot is getting better or worse. Now I think I have found a way forward, which I will share with you.

Evaluating an AI chatbot sounds simple: ask questions, collect reference answers, compare results, and track progress. But in practice, benchmarking conversational AI is much messier. The process is full of shifting expectations, ambiguous targets, and evolving requirements.

After spending a lot of time designing and refining a chatbot evaluation system through several iterations, I’ve come to appreciate both the necessity and the nuance of this work. In this post, I’ll walk you through the real-world challenges, some lessons learned, and practical considerations from my experience.

In the next sections, I’ll start with the challenge of building a solid evaluation dataset, including why “golden standard” answers are often elusive and how to handle questions that shouldn’t be answered. Then we’ll look at the practicalities of running evaluations, including the issues with randomness, managing costs, and selecting tooling. Finally, I’ll cover how to turn all those results into insights that actually help you improve your chatbot and communicate clearly with your team and stakeholders. Let’s dive in.

Building and maintaining a reliable evaluation dataset

A solid evaluation process starts with a strong dataset of representative questions and high-quality target answers. But this is much harder than it sounds. We put a lot of effort into curating relevant questions from real users and working with domain experts to define suitable answers.

Creating and maintaining a reliable evaluation dataset is an ongoing challenge. Reference answers often need expert input and regular updates, since user needs and “truth” shift over time. Instead of chasing perfection, focus on keeping your dataset relevant and flexible—collaborate with domain experts, but don’t expect static answers to cover every scenario.

The many faces of “correct” answers

One of the first things I learned is that there’s rarely a single “correct” answer to any question posed to our chatbot. What’s “correct” depends on the user’s expectations, the level of detail they want, and even their location and background.

The image is a flowchart illustrating an evaluation system for a conversational AI Q&A chatbot, highlighting the complexity of evaluating chatbot answers based on varied user profiles and evaluation questions. It begins with inputs of 'Evaluation Question' and 'User Profile,' which guide the chatbot in generating answers, subsequently undergoing evaluation to produce a report. Key notes on the left side indicate that questions may be multi-intent ambiguous and that non-determinism can lead to different results with identical inputs. The right side underscores the difficulty of defining 'correct' answers due to diverse user expectations, listing different reference answers tailored to specific user profiles, such as detailed responses, short answers with different angles, location-specific answers, and suggestions to provide links instead of direct answers. The flowchart cautions that relying on reference answers for evaluation is tricky and recommends avoiding it when possible.
The elements of an evaluation system and how different user profiles, expectations and non-determinism make it challenging.

In our work with product data chatbots, I’ve seen identical questions require different answers for reasons like:

  • Some users want concise, to-the-point summaries, while others expect comprehensive, detail-rich explanations.
  • Even among subject-matter experts, opinions often diverge on what makes the “best” answer, especially for broad or nuanced topics. Does a “correct” answer mean that all the statements are factually accurate, or that it is helpful, complete, or easy to understand?
  • Both user needs and “truth” evolve: new products are launched, existing ones are retired, and best practices change over time.
  • The same question can require different answers depending on the user’s location—what’s available or relevant in Denmark may differ from India due to local products, terminology, or business priorities.

Sometimes, the real user need isn’t clearly stated in the question—the implicit context can be just as important as the words used. When gaps in knowledge aren’t specified, the LLM may fill them with guesses that aren’t always correct. If this “multi-intent ambiguity” isn’t considered when designing your evaluation system, you risk punishing improvements simply because they differ from a narrow definition of the “correct” answer.

These variations make it extremely challenging to create an evaluation dataset based mainly around a “ground truth” / “golden standard” answer that is both comprehensive and fair. Achieving true objectivity is often impossible, so the goal becomes building consensus, capturing the diversity of real-world expectations, and being transparent about ambiguities. High-quality, reviewed reference answers for evaluation take time to produce, and maintaining quality is an ongoing effort. Your evaluation dataset needs continual review and updating to stay relevant. All of this made me realize:

Aha-moment: I do not think it is worth investing significant resources into generating perfect “golden standard” answers. Instead, we should focus on other metrics and reference-free ways of evaluating performance.

It’s certainly helpful to have reference answers for some questions, especially when you can provide a range of rated responses rather than just one. However, it’s important not to make reference answers the central focus of your evaluation strategy. Instead, consider designing evaluation methods that don’t depend heavily on static ground truth answers — more on that later in this post.
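To make that concrete, here is a minimal sketch of how an evaluation entry could carry a range of rated reference answers instead of a single golden one. The schema, field names, and rating scale are my own illustration, not the structure we actually use:

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceAnswer:
    """One acceptable answer, rated by a domain expert (1 = poor, 5 = excellent)."""
    text: str
    rating: int
    note: str = ""  # e.g. "concise version" or "Denmark-specific"


@dataclass
class EvalEntry:
    """An evaluation question with a range of rated reference answers, not one golden truth."""
    question: str
    references: list[ReferenceAnswer] = field(default_factory=list)

    def best_reference(self) -> ReferenceAnswer | None:
        """Return the highest-rated reference answer, if any exist."""
        return max(self.references, key=lambda r: r.rating, default=None)


entry = EvalEntry(
    question="Which product fits a small household?",
    references=[
        ReferenceAnswer("Product A is the usual choice for small households.", rating=5),
        ReferenceAnswer("Product A or B, depending on budget.", rating=4, note="short version"),
    ],
)
print(entry.best_reference().text)
```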

When saying nothing is the smartest move

Not every question deserves—or should get—an answer from your chatbot. A sophisticated evaluation system acknowledges that sometimes, the most accurate or responsible action is for the AI to decline to answer.

Personal reflection: In life I have learned that saying “I do not know” or “I have not made up my opinion on that matter” is an important and admirable skill. Too many people have strong opinions on topics they know little about, and in these cases sharing an opinion just creates noise and misdirection. Sometimes the best answer is “I don’t know” — your chatbot should learn that too.

Here are some scenarios where this principle applies for an AI chatbot:

  • Out-of-scope queries: The chatbot may be designed specifically for product-related questions, and not, say, for weather forecasts, trivia, or unrelated advice.
  • Abuse attempts or inappropriate input: The model should avoid engaging with offensive, nonsensical, or manipulative prompts.
  • Incomplete data: Sometimes, the user might ask questions that are in scope but where the knowledge source or action hasn’t been integrated yet. In this case, the chatbot should acknowledge the gap and indicate that adding the data or functionality is planned for future updates.

In our journey, we found that paying careful attention to these scenarios is vital for a fair and realistic evaluation:

  • We deliberately included such “should-not-answer” cases in our evaluation datasets, with the expectation that the chatbot either replies with a suitable refusal or avoids speculation.
  • Our metrics and dashboards were expanded to acknowledge these cases—not treating them as failures, but as successes when the AI respectfully withholds an answer.
  • We also differentiated between true “out of scope,” temporarily unanswerable (e.g., data not yet available), and clear cases of abuse or inappropriate requests.

This nuanced approach not only reflects real-world usage but encourages the chatbot to err on the side of accuracy, safety, and compliance rather than guessing wildly or fabricating responses. Practically, it also results in a more honest dialogue when reporting capabilities and limitations to stakeholders, and it reduces the risk of over-promising what the chatbot can actually deliver.

By incorporating and clearly tracking these scenarios, we ensure our evaluation process rewards appropriate restraint as much as it rewards correct answers. Sometimes, the best thing a chatbot can say is: “I can’t help with that”—and that’s not just acceptable, it’s preferable.
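As an illustration of what rewarding restraint can look like in practice, here is a minimal sketch (my own, not our production schema) where each evaluation case declares its expected behaviour, so a refusal counts as a pass when it should:

```python
from enum import Enum


class ExpectedBehavior(Enum):
    ANSWER = "answer"                            # in scope, data available
    REFUSE_OUT_OF_SCOPE = "refuse_out_of_scope"
    REFUSE_NO_DATA = "refuse_no_data"            # in scope, but the knowledge isn't integrated yet
    REFUSE_ABUSE = "refuse_abuse"


# Deliberately naive refusal detector; in practice an LLM judge or a classifier works better.
REFUSAL_MARKERS = ("i can't help with that", "i don't know", "outside my scope")


def is_refusal(answer: str) -> bool:
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)


def case_passes(expected: ExpectedBehavior, answer: str) -> bool:
    """A case passes when the bot answers if it should, and declines if it should."""
    if expected is ExpectedBehavior.ANSWER:
        return not is_refusal(answer)
    return is_refusal(answer)


print(case_passes(ExpectedBehavior.REFUSE_OUT_OF_SCOPE, "Sorry, I can't help with that."))  # True
```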

Keeping up with change: versioning and traceability

As datasets, product requirements, and evaluation practices evolve, so must the evaluation system itself. We quickly learned the importance of versioning both datasets and code, logging details about each test run, and keeping a record of what was changed and why.

This approach makes it possible to compare apples to apples, even as the system matures or shifts in scope. Maintaining good traceability means you can revisit prior experiments, analyze long-term progress, and understand the real impact of specific changes. Two evaluation runs cannot be compared if the input evaluation data or the evaluation metrics have changed, so we make sure to log all of this information and know which runs are comparable.
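A minimal sketch of the kind of run metadata we mean; the exact fields are illustrative, but the point is that two runs are only comparable when the dataset version and the metric configuration match:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetadata:
    """What we log for every evaluation run so results stay traceable and comparable."""
    run_id: str
    description: str
    tags: tuple[str, ...]
    dataset_version: str   # version of the evaluation dataset that was used
    code_version: str      # e.g. the git commit SHA of the evaluation code
    metrics_config: str    # hash or version of the metric definitions


def comparable(a: RunMetadata, b: RunMetadata) -> bool:
    """Only compare runs that used the same evaluation data and the same metrics."""
    return a.dataset_version == b.dataset_version and a.metrics_config == b.metrics_config
```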

For organizations where evaluation is a team effort, clearly labeling each test run (with a description and tags) is essential, so everyone knows and can remember what was being tested when looking at the evaluation report. Additionally, the ability to filter, rename, or delete runs helps collaborators avoid confusion: clutter will inevitably accumulate over time, and it can make the report harder to navigate.

Screenshot of terminal showing the help text for command line interface (CLI) for managing evaluation runs in a chatbot evaluation system. It includes commands for deleting and renaming runs, allowing users to keep the evaluation report organized by removing unnecessary clutter and providing clear descriptions for each run.
CLI we created to make it easy to delete and rename evaluation runs. Deleting runs that are not useful or no longer needed, and having good descriptions is essential for keeping the report organized.
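Since I can’t share the real tool, here is a stripped-down sketch of what such a CLI could look like; the command names and the JSON registry it edits are made up for illustration:

```python
import argparse
import json
from pathlib import Path

RUNS_FILE = Path("eval_runs.json")  # hypothetical registry of evaluation runs


def load_runs() -> dict:
    return json.loads(RUNS_FILE.read_text()) if RUNS_FILE.exists() else {}


def main() -> None:
    parser = argparse.ArgumentParser(description="Manage evaluation runs in the report.")
    sub = parser.add_subparsers(dest="command", required=True)

    delete = sub.add_parser("delete", help="remove a run that is no longer needed")
    delete.add_argument("run_id")

    rename = sub.add_parser("rename", help="give a run a clearer description")
    rename.add_argument("run_id")
    rename.add_argument("new_description")

    args = parser.parse_args()
    runs = load_runs()
    if args.command == "delete":
        runs.pop(args.run_id, None)
    else:  # rename
        runs.setdefault(args.run_id, {})["description"] = args.new_description
    RUNS_FILE.write_text(json.dumps(runs, indent=2))


if __name__ == "__main__":
    main()
```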

Without these practices, it’s easy to lose track of what was tested, why certain results occurred, or how to interpret changes over time. Good versioning and traceability are the backbone of a reliable evaluation system.

Running the evaluation: practical considerations and pitfalls

Once you have a solid dataset, the next step is to design and implement the actual evaluation process. This involves selecting appropriate metrics, orchestrating model calls, and aggregating results in a meaningful way. Here are some practical considerations based on our experience.

The challenge of randomness in LLM outputs

A key trait of large language models is that their answers aren’t always the same (outputs are non-deterministic). If you ask the same question twice, you might get different responses—even if all the settings are identical.

Why does this matter for evaluation? If small changes in model output can affect measured performance, then repeatability and reliable benchmarking become more difficult. Some of this randomness is essential for creativity or fluency, and even attempts to eliminate it may have undesirable effects on the utility of the chatbot.

This introduces a layer of uncertainty. For us, it meant that performance results can fluctuate even when nothing else changes. In practice, this makes it harder to confidently say whether an update has made things better or worse, especially for incremental improvements. Running the same input dataset through the same code 10 times showed an accuracy range of 8 percentage points!

Tough realization: I have spent much time tweaking parameters, changing prompts, or trying out advanced RAG methods. When looking at the evaluation results, I sometimes saw small improvements and then we deployed the change, but often I saw no improvement, or even accuracy regressions, for changes I had high expectations for. Now I have realized that, due to the accumulated randomness in the system, the signal got lost in the noise and I was actually just navigating randomly in the fog.

So, evaluation strategies must account for variance: averaging across runs, being cautious when comparing small differences, and focusing on significant trends rather than single numbers.
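In practice, that can be as simple as repeating the evaluation a few times and comparing the measured difference to the run-to-run noise. A rough sketch with made-up numbers:

```python
from statistics import mean, stdev

# Accuracy from repeating the *same* evaluation several times (illustrative numbers).
baseline_runs = [0.71, 0.74, 0.69, 0.73, 0.72]
candidate_runs = [0.74, 0.72, 0.76, 0.73, 0.75]

improvement = mean(candidate_runs) - mean(baseline_runs)
noise = max(stdev(baseline_runs), stdev(candidate_runs))
print(f"Improvement: {improvement:+.3f}, run-to-run noise: ~{noise:.3f}")

# Crude rule of thumb: only trust differences clearly larger than the noise.
if abs(improvement) < 2 * noise:
    print("Within the noise: do not ship based on this comparison alone.")
```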

Why evaluation can get expensive fast

When you’re excited about improving your chatbot, it’s tempting to throw every possible metric and a huge set of evaluation questions at it. But here’s a friendly warning from experience: the cost of running these evaluations can spiral out of control—especially if you’re using large language models that charge per token.

Each question you ask, every answer generated, and all the extra context you include in prompts add up to more tokens. Multiply that by dozens of metrics and hundreds (or thousands) of test cases, and suddenly your evaluation budget is looking less like a rounding error and more like a serious line item.

Here’s what we learned about keeping costs under control:

  • Be selective with metrics: Focus on the ones that truly matter. More metrics don’t always mean better insights, and each extra calculation can mean more model calls. We made it configurable to easily toggle metrics on and off, depending on needs.
  • Limit the size of your evaluation set: Start with a representative sample, not every possible question. You can always expand later if needed.
  • Batch and reuse where possible: If you can, reuse model outputs across multiple metrics. Avoid redundant calls.
  • Track and forecast costs: Keep an eye on how many tokens you’re burning through and estimate costs before running large-scale evaluations.

We actually track cost in the run overview in our evaluation report, so we can see how much each run costs and make informed decisions:

Screenshot of evaluation dashboard displaying the cost of each chatbot evaluation run. This helps teams monitor and manage evaluation expenses effectively.
A small section of our evaluation report showing costs for each evaluation run, making it easy to track and compare expenses over time.

By being intentional about what you measure and how often you run evaluations, you’ll save money—and keep your focus on what really drives improvement. After all, the goal is smarter chatbots, not runaway bills!
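As an example of the kind of back-of-the-envelope forecast we mean, here is a rough sketch; the token counts and prices are placeholders, not real rates:

```python
# Back-of-the-envelope cost forecast before kicking off a large evaluation run.
# All numbers are placeholders; plug in your own model's pricing and typical prompt sizes.
NUM_QUESTIONS = 500
JUDGE_METRICS = 6                    # LLM-as-a-judge metrics that each need an extra call
AVG_INPUT_TOKENS_PER_CALL = 2_000    # question + retrieved context + instructions
AVG_OUTPUT_TOKENS_PER_CALL = 300

PRICE_PER_1K_INPUT_TOKENS = 0.005    # USD, placeholder
PRICE_PER_1K_OUTPUT_TOKENS = 0.015   # USD, placeholder

calls = NUM_QUESTIONS * (1 + JUDGE_METRICS)  # one answer call plus one call per judge metric
cost = calls * (
    AVG_INPUT_TOKENS_PER_CALL / 1_000 * PRICE_PER_1K_INPUT_TOKENS
    + AVG_OUTPUT_TOKENS_PER_CALL / 1_000 * PRICE_PER_1K_OUTPUT_TOKENS
)
print(f"{calls} model calls, estimated cost: ${cost:,.2f}")
```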

Why relying too much on commercial orchestration platforms can slow you down

As we scaled our evaluation efforts, we found ourselves increasingly dependent on a commercial orchestration platform (we used promptflow in Azure ML) to manage the runtime and the evaluation workflows. My experience is that, while these platforms offer powerful tools for automating evaluations and aggregating results, they also come with their own set of challenges.

One major pain point is the lack of flexibility. Commercial and open-source platforms often impose rigid structures and workflows that can stifle experimentation and innovation. We found ourselves constrained by the platform’s limitations, which made it difficult to adapt our evaluation processes as our needs evolved. I find that too much unnecessary abstraction makes it hard to understand what is really going on, to debug issues, and to customize behavior for our specific use cases. For instance, we wanted to slightly modify how telemetry was collected, but this ended up creating a lot of extra work and complexity, since the logging mechanism is hidden behind layers of abstraction.

LLM orchestration platforms with built-in evaluation are only useful for getting to a PoC fast. However, for a mature application team with a system in production, LLM orchestration platforms just create unnecessary abstractions, making changes and fixes difficult.

It’s fine to start with an LLM Evaluation Framework or an LLM Orchestrator to get up and running quickly, but as your evaluation needs grow and become more complex, you may find that building custom solutions in-house provides greater control and adaptability.

I tried to define the difference between an LLM Evaluation Framework and an LLM Orchestrator in the card below; however, it is a fuzzy distinction and there is a lot of overlap between the two:

Definitions
LLM Evaluation Framework
A framework is a software library that provides tools and structure for building and evaluating AI models. Frameworks offer metrics (e.g., BLEU, ROUGE, perplexity), dataset-based evaluation (TruthfulQA, MMLU), prompt testing, and integration with eval libraries. Examples: Hugging Face evaluate, Geval, TruLens, PromptLayer, Ragas.
LLM Platform / Orchestrator
An LLM Platform / Orchestrator is a system that manages and automates workflows, coordinating agents, tools, and evaluation processes across complex AI tasks. Orchestrators support multi-agent evaluation, tool usage tracking, workflow observability, and feedback loops. Examples: AutoGen, CrewAI, LangSmith, Azure Foundry, PromptFlow, n8n

Ultimately, we learned the importance of maintaining a balance between leveraging commercial tools and building in-house capabilities. We have shifted toward building our own capabilities. By developing our own evaluation frameworks and processes instead of relying on open-source or commercial solutions, we were able to retain flexibility and control over our evaluation efforts.

Transforming raw evaluation data into actionable insights: why numbers alone aren’t enough

Finally, let’s talk about what really matters: turning raw evaluation results into insights that actually help you improve your LLM RAG or Agentic system. It’s easy to get lost in metrics and charts, but the real value comes when you connect the dots between what the numbers say and what your users actually experience. In this section, I’ll share how we moved beyond just collecting data—and started using it to guide smarter decisions, spot hidden issues, and celebrate real wins.

As mentioned earlier, when we discussed managing costs, it’s important to be selective about which metrics to track, but this is easier said than done. There are countless ways to measure chatbot performance, and not all metrics are created equal.

You can read about some of our evaluation metrics in my old blog post on evaluation metrics for RAG systems.

Many of the metrics are calculated using the LLM-as-a-judge approach, where we prompt an LLM to evaluate the quality of another LLM’s output based on certain criteria. It is important to use a completely different model for evaluating than the one you use for generating the answers, to avoid bias. This approach only gets you so far, which is why I recommend having inner loops where we evaluate certain components of the system in isolation; this allows for more tailored metrics that are more robust and less noisy. Read more in my blog post on inner and outer development loops for AI projects.
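For readers who haven’t seen it, a minimal LLM-as-a-judge call can look like the sketch below. It assumes the OpenAI Python client and a placeholder judge model name; the prompt and the 1-to-5 scale are illustrative, not the exact criteria we use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a chatbot answer.
Question: {question}
Answer: {answer}
Rate how helpful and factually careful the answer is, from 1 (poor) to 5 (excellent).
Reply with a single integer."""


def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """LLM-as-a-judge: ask a model (different from the one that generated the answer) for a score."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())
```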

In our pipeline, we experimented with almost twenty different metrics—some computed automatically. We had humans evaluate the answers, rating them between 1 and 5, and then we analyzed which of the automated metrics correlated best with human judgment. We also categorized metrics into different groups, such as “Retrieval”, “Generation”, and “Overall” to help make sense of the data.

Screenshot of annotation app showing a user interface for rating chatbot answers from 1 to 5. The app helps collect human ratings to validate and calibrate automated evaluation metrics.
A screenshot of our simple annotation app that let us quickly rate the answers from the chatbot with one to five stars. This made it easy to collect human ratings of the answers, which we then used to validate and calibrate our automated metrics. The data shown is mocked demo content - not real data.
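Checking which automated metrics to keep can then be as simple as a rank correlation against the human star ratings. A small sketch with mocked scores (the metric names and numbers are invented):

```python
from scipy.stats import spearmanr

# Human star ratings (1-5) and two automated metric scores for the same answers (mocked data).
human_ratings = [5, 4, 2, 5, 1, 3, 4, 2]
metric_a = [0.9, 0.8, 0.4, 0.95, 0.2, 0.6, 0.7, 0.5]  # tracks the humans well
metric_b = [0.7, 0.3, 0.6, 0.4, 0.2, 0.9, 0.5, 0.8]   # mostly noise

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    rho, p_value = spearmanr(human_ratings, scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```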

We found, however, that more metrics didn’t always translate to deeper understanding. The most valuable indicators were those that correlated well with expert human judgment or provided actionable insights about why certain answers were failing, so we know where to put our effort.

It’s tempting to track every possible metric, but this often leads to analysis paralysis or misleading signals. Instead, invest time in identifying which metrics are robust, easily interpretable, and truly reflect user value. Good metrics should have a strong correlation with business value.

Not only seeing what, but also why

Understanding the “why” behind the numbers is crucial for driving meaningful improvements. This means digging deeper into the data, conducting root cause analyses, and engaging with users to gather qualitative feedback.

It’s important to be able to drill down and explore the context by examining individual metrics alongside dimensions such as user intent, conversation context, selected sources, and message history. By tagging each entry in our evaluation dataset with details like question type, expected knowledge source, user profile, and other relevant metadata, we can uncover patterns and correlations that might otherwise go unnoticed. This richer context makes it much easier to understand why certain answers succeed or fail, and helps guide targeted improvements to the chatbot.
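A small sketch of what that tagging and drill-down can look like; the column names and data are illustrative, but the pattern of grouping pass rates by metadata is the point:

```python
import pandas as pd

# Per-question evaluation results, tagged with metadata from the evaluation dataset (mocked).
results = pd.DataFrame([
    {"question_type": "how-to",  "user_profile": "expert", "knowledge_source": "manuals", "passed": True},
    {"question_type": "how-to",  "user_profile": "novice", "knowledge_source": "manuals", "passed": False},
    {"question_type": "pricing", "user_profile": "expert", "knowledge_source": "catalog", "passed": True},
    {"question_type": "pricing", "user_profile": "novice", "knowledge_source": "catalog", "passed": True},
])

# Drill down: pass rate per question type and user profile shows *where* the chatbot struggles.
breakdown = results.groupby(["question_type", "user_profile"])["passed"].mean()
print(breakdown)
```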

In my mind, there should be a separate report dashboard for the developers and another for the product managers and business stakeholders. Below is a screenshot of how our developer view currently looks. It’s worth noting that it can be difficult to tell whether the numbers are good or not without comparing them to previous benchmarks.

Screenshot of evaluation dashboard displaying detailed metrics and breakdowns for data scientists, helping them analyze chatbot performance at a granular level.
A section of our evaluation report showing detailed metrics for use by the data scientists in the project.

Interestingly, one of the most useful tools was actually not the complicated evaluation methods described above. Instead, it was a simple “vibe-coded” comparison app that let us quickly compare the answers from two different evaluation runs on the evaluation set side by side. It also shows the expected answer in a separate column and links to the full prompt. This made it easy to spot differences in behaviour, identify regressions, and validate improvements. It also helped us build a real intuition about the system, instead of just looking at numbers, and understand more clearly what changes from one version to the next. We could save the HTML file, share it in an email with the business stakeholders, and get their feedback or approval on the changed behaviour.

Screenshot of comparison app displaying side-by-side answers from two chatbot evaluation runs, including expected answers and prompt links, to help spot differences and validate improvements. It shows the differences between the two answers with color coding for easy comparison.
A screenshot of our simple comparison app that let us quickly compare the answers from two different evaluation runs side-by-side from the evaluation set. This example is showing mocked demo content - not real data.
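Since the app itself isn’t something I can share, here is a toy sketch of the core idea: render two runs into a single, shareable HTML table. The data structure and file name are made up:

```python
import html

ROWS = [  # mocked output from two evaluation runs
    {"question": "Which product fits a small household?",
     "expected": "Product A.",
     "run_a": "Product A is recommended for small households.",
     "run_b": "Either Product A or Product B will work."},
]


def comparison_table(rows: list[dict]) -> str:
    """Render question, expected answer, and the answers from two runs side by side."""
    header = "<tr><th>Question</th><th>Expected</th><th>Run A</th><th>Run B</th></tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(row[key])}</td>"
                         for key in ("question", "expected", "run_a", "run_b")) + "</tr>"
        for row in rows
    )
    return f"<table border='1'>{header}{body}</table>"


with open("comparison.html", "w", encoding="utf-8") as f:
    f.write(comparison_table(ROWS))  # a single file that is easy to mail to stakeholders
```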

Presenting results stakeholders can understand

A good evaluation system doesn’t just generate numbers—it delivers insights that different audiences can understand and trust. Overly complex dashboards or walls of percentage points tend to alienate business stakeholders and risk creating more confusion than clarity.

Our solution was to focus on intuitive, aggregated measures (such as overall “accuracy”, “response rate”, and “faithfulness”), and to present information in clear groupings, each with a concise description of what it means and how it is calculated. We needed to apply a binary threshold to a weighted average score across several metrics: it’s much easier to communicate that a certain percentage of answers are “good enough” than to explain the nuances of each individual metric.
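A minimal sketch of that aggregation; the weights and threshold below are invented for illustration:

```python
# Combine several per-answer metric scores (all in [0, 1]) into one "good enough" verdict.
WEIGHTS = {"faithfulness": 0.5, "relevance": 0.3, "completeness": 0.2}  # illustrative weights
THRESHOLD = 0.7                                                          # illustrative cut-off

answers = [  # mocked metric scores for two answers
    {"faithfulness": 0.9, "relevance": 0.8, "completeness": 0.7},
    {"faithfulness": 0.4, "relevance": 0.9, "completeness": 0.6},
]


def is_good_enough(scores: dict[str, float]) -> bool:
    weighted = sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS)
    return weighted >= THRESHOLD


good_rate = sum(is_good_enough(a) for a in answers) / len(answers)
print(f"{good_rate:.0%} of answers are good enough")  # the single number stakeholders see
```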

Screenshot of evaluation dashboard for managers, displaying simplified key metrics and aggregated results to help stakeholders quickly understand chatbot performance.
A section of our stakeholder evaluation report showing simplified key metrics for managers. If you read the descriptions in the image, you should be able to understand how the metrics are calculated.

It is important to communicate that accuracy does not tell you much without context about how we evaluate. If only easy questions are added to the evaluation set, accuracy will be high, but it does not mean that the chatbot is good. If only hard questions are added to the evaluation set, accuracy will be low, but it does not mean that the chatbot is bad. Accuracy is only meaningful when you know what kind of questions are in the evaluation set.

Metrics often only make sense when compared to previous benchmarks or baselines. A single number without context is rarely informative.

Ultimately, effective reporting bridges the gap between technical evaluation and business outcomes, allowing teams to iterate with confidence.

Lessons learned

Over time, a few lessons emerged that guide how we approach evaluation today:

  • 🌫️ accept that some uncertainty is healthy: If your evaluation feels a bit foggy, that’s normal. Focus on learning and improving, not chasing unattainable certainty.
  • 🚫 don’t chase perfect answers: Perfect “golden standard” answers are a myth. Prioritize metrics that do not depend on static ground truth references.
  • 🤝 trust human judgment: Automated scores aren’t everything. Regular manual review of chatbot responses reveals real progress and valuable insights. Create tools to do this efficiently.
  • 🧩 don’t let tools dictate your process: Use frameworks and platforms to accelerate early progress, but don’t be afraid to build custom solutions when you outgrow them.
  • 🧭 embrace ambiguity, don’t fight it: Accept that conversational AI is inherently non-deterministic and that the line between correct and wrong answers is fuzzy. Build your evaluation systems so that they embrace this uncertainty.

Looking ahead, we’re exploring ways to become less reliant on static “golden standard” answers and instead work with a list of reference answers. We also need to better account for the inherent randomness in large language models. This requires yet another big restructuring of our evaluation system, but I believe it’s a necessary step toward more robust and meaningful assessments. I will share more about this in a future post.

Evaluating conversational AI may never become as simple as running a spellchecker, even though many futile attempts are being made to do just that (like the evaluation frameworks mentioned earlier). The challenge of capturing the nuances of human conversation and implicit context is exactly what keeps the work of evaluating conversational AI engaging and rewarding. Through careful data curation, thoughtful metric selection, transparent reporting, and a willingness to adapt, it’s possible to make meaningful progress—even when the finish line seems a bit (or a lot) blurry.

If you’re working on or using chatbots, I hope these insights help you build evaluation frameworks that are fair, robust, and genuinely informative. And if you still sometimes feel like you’re trying to measure fog with a ruler—trust me, you’re not alone.

Until then: keep your rulers sharp, your dashboards legible, and don’t be afraid of a little fog.