The messy reality of evaluating conversational AI: lessons from the fog

Have you ever tried to measure fog with a ruler? Probably not. But evaluating AI chatbots can feel a lot like that. Fog is undefined, slippery, and without clear edges, which makes the ruler a pretty useless tool. Just when you think you’ve nailed down the perfect metric or found “the right answer”, poof, the fog shifts and you’re back at square one. Still, if you care about building trustworthy AI, learning to navigate this fog is essential. ...

October 7, 2025 · 19 min · Martin Møldrup

Evaluating Large Language Model-Driven Chat or QnA Systems: A Comprehensive Guide

As more and more business problems can be solved using large language models (LLMs) in chat or QnA systems, the question of how to evaluate those systems has become increasingly important. Without proper evaluation, it is difficult to know whether a system is providing real value to the business and its users, or just misleading them and potentially causing harm. ...

August 25, 2023 · 16 min · Martin Møldrup