Introduction
As more and more business problems can be solved with large language models (LLMs) in chat or QnA systems, the question of how to evaluate them has become increasingly important. Without proper evaluation, it is difficult to know whether the system is providing real value to the business and its users, or just misleading them and potentially causing harm.
With the recent paradigm shift in AI technology, we need to find new ways of evaluating AI systems.
In this article, we’ll explore the different approaches to evaluating large language models and show you how to create a test dataset tailored to your specific use case. By the end, you’ll have a better understanding of how to measure the performance of these models for your specific use case. So, let’s dive in!
As with most AI projects, the number one priority when starting this type of project is to get an evaluation feedback loop in place - otherwise we are navigating in the dark.
What is a good metric?
A good metric should first and foremost align with the business goals and objectives. It should be a single number that indicates how good the QnA system is overall, so that different variants can easily be compared. It should also be possible to break the metric down into different dimensions and drill into them, to gain insights into where the model is performing well and where it is not. This allows for a more nuanced understanding of the model’s performance and helps identify areas for improvement. We will explore these dimensions in the Evaluation Metrics section.
Challenges
As data scientists, we want to evaluate how well our solutions will perform against the business goals. A common approach is to create a test set and evaluate the performance of the model on it. However, when we are using large language models, we face a few challenges:
- The generated text from the model is not deterministic. This means that the same input can result in different outputs that are all correct.
- What constitutes a good output from a chat or QnA system is not always clear. For example, if we are using a chatbot to answer questions about a product, we want the answer to be correct, but we also want it to be neither too long nor too short, and neither too technical nor too simple. This means that we need to evaluate the output on multiple dimensions.
- The model can generate answers that are not correct but are still very convincing. This makes it hard for human evaluators to judge correctness if they are not domain experts.
- Manual evaluation is time consuming. It is not a one-time task; it needs to be repeated continuously so that iterating on the solution stays fast and easy.
Let’s take a look at two ways to tackle these challenges: human evaluation and machine evaluation.
Human evaluation
Human evaluation of question and answer (QnA) systems is a critical aspect of determining the effectiveness of these platforms. However, the common approach of using a Likert-style scale (rating on a number from, say, 1 to 10) for such evaluations has several pitfalls. One major issue is that different users interpret the scale in different ways, leading to inconsistent results.
A potential alternative is to ask evaluators to compare two responses directly, deciding whether response A or B is superior. However, this approach presents another set of issues. Studies like The False Promise of Imitating Proprietary LLMs suggest that human evaluators tend to focus more on surface-level properties, such as the well-formedness of text, rather than the actual correctness of the answer. This means that a response may be deemed higher quality simply because it is better written or more eloquently phrased, not because it provides a more accurate answer.
To mitigate these issues, the paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback suggests using multiple dimensions for ratings. This approach encompasses more than just the correctness (honesty) of responses; it also includes factors like helpfulness and harmlessness.
However, even with these methodologies, it’s important to note that human evaluators might not measure the factors that are truly essential for the effectiveness of a QnA system. Therefore, while human evaluation is a necessary component in assessing QnA systems, it is crucial to carefully consider the methodologies used to ensure the evaluation truly reflects the system’s performance.
Machine Evaluation
When doing human evaluation, it is important to collect the data into a test dataset so that automated machine evaluation becomes possible.
The use of large language models (LLMs) to evaluate other LLMs is a growing area of interest. This approach has certain advantages, but it also comes with its unique set of challenges.
A key advantage of using LLMs for evaluation is their ability to process and analyze large volumes of data quickly. This makes them particularly useful for assessing the performance of other LLMs across a wide range of tasks and metrics. One such framework developed for this purpose is G-EVAL (https://arxiv.org/abs/2303.16634). The code for G-EVAL can be accessed at https://github.com/nlpyang/geval.
However, using LLMs for evaluation also presents some challenges. One issue is that LLM-based evaluations tend to collapse everything into a single number, which limits the depth and complexity of the evaluation by restricting the analysis to a single metric. Another challenge is model bias: models often prefer their own outputs, which can lead to skewed results and a lack of objectivity when evaluating models.
Additional issues arise when considering the order of candidate responses and the length of responses. Studies have shown that the order in which responses are presented can influence the model’s preference. Similarly, LLMs tend to prefer longer responses. This can be problematic as length does not always equate to quality or accuracy.
These considerations should be taken into account when designing experiments, and machine evaluation should not stand alone.
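As a small mitigation for the order bias mentioned above, you can randomize the order of the two candidates before asking the judge model. Below is a minimal sketch in Python; `call_llm` is a hypothetical helper that wraps whatever LLM API you use, and the prompt wording is only an illustration:

```python
import random

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask an LLM judge which answer is better, randomizing the order
    of the candidates to reduce position bias."""
    # Randomly swap the two candidates so the judge cannot
    # systematically favor the first (or last) position.
    swapped = random.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)

    prompt = (
        f"Question: {question}\n\n"
        f"Answer 1: {first}\n\n"
        f"Answer 2: {second}\n\n"
        "Which answer is more correct and helpful? Reply with '1' or '2'."
    )
    verdict = call_llm(prompt).strip()

    # Map the judge's verdict back to the original labels.
    if verdict.startswith("1"):
        return "B" if swapped else "A"
    return "A" if swapped else "B"
```

Running the comparison in both orders (and only counting a win when the verdicts agree) is an even stricter variant of the same idea.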
Combining human and machine evaluation is the way to go!
The best way to evaluate large language models is to combine human and machine evaluation, so you get the best of both worlds. Machine evaluation can be run often, cheaply, and quickly, which lets you iterate rapidly on the solution. There are many configurations to tune, and a quick iteration cycle is important for finding the best one. When the best configuration for the test dataset has been found, human evaluation can be done. This provides new insights into the model’s performance, and the test dataset can be updated to reflect them. The process can be repeated until the model performs well enough to be deployed to production.
Splitting the evaluation dataset?
Normally we split the dataset into a training set and a test set: the training set is used to train the model and the test set is used to evaluate it. However, when using pretrained models, we do not need the training set - or do we?
The thing is, the system is still being trained in some sense. Instead of the training being done by a machine learning algorithm, it is done by the person looking at the data and tweaking the system. This still carries the risk of overfitting to the data you look at while tweaking the system. This means that we still need to split the evaluation data into a training set and a test set, but we might choose a different ratio than the usual 80/20 split. We might choose a 50/50 split or even a 20/80 split. The important thing is that we have a test set that is not used for tuning the system.
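As a minimal sketch of this idea, the split can be as simple as shuffling the evaluation examples with a fixed seed and holding out whatever ratio you decide on (50/50 here, purely as an assumption):

```python
import random

def split_eval_data(examples, test_ratio=0.5, seed=42):
    """Split evaluation examples into a 'training' set (used while tweaking
    the system) and a held-out test set (never looked at during tuning)."""
    rng = random.Random(seed)
    shuffled = examples[:]  # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```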
Publicly Available Benchmarks
Publicly available benchmarks are often used to evaluate the performance of large language models. However, these benchmarks may not be specific to your use case and can be misleading. This is because the benchmarks are often designed to be general-purpose and cover a wide range of tasks, which may not be relevant to your specific use case.
- They measure the model and not your system. They do not take into account your prompt, fine-tuning, or in-context learning, or how well the system handles your knowledge base context (for RAG patterns).
Publicly available benchmarks can be a useful starting point for evaluating the performance of large language models, but they should not be relied upon exclusively.
Instead, it is important to create your own test set and choose evaluation metrics that are specific to your use case and can provide a more accurate measure of the model’s performance.
Evaluation Metrics
This section provides short descriptions of each of the metrics for a Question-Answering (QnA) evaluation system.
System provides the correct answer
For a QnA system, the most important metric is likely whether the system provides the correct answer to the question. Let’s take a look at the dimensions of correctness.
Correctness
Correctness is a measure of how factually accurate a response is. It evaluates if the information provided in the response is true and in accordance with real-world knowledge and facts.
To measure correctness, a system can be evaluated on a set of questions with known answers. The system is then scored based on how many of the answers it provides are correct.
Example:
- High score: (Question: “Who wrote ‘Pride and Prejudice’?” Answer: “Jane Austen wrote ‘Pride and Prejudice’.”)
- Low score: (Question: “Who wrote ‘Pride and Prejudice’?” Answer: “Shakespeare wrote ‘Pride and Prejudice’.”)
Other names:
- Honesty
- Truthfulness
Related terms:
- True Positive Rate
- True Negative Rate
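To make the scoring procedure described above concrete, here is a minimal sketch; `answer_fn` stands in for your QnA system, and the substring check is only a naive placeholder for a proper human or LLM-based comparison:

```python
def correctness_score(qa_pairs, answer_fn):
    """Score a QnA system on a set of questions with known answers.

    qa_pairs  : list of (question, expected_answer) tuples
    answer_fn : callable that takes a question and returns the system's answer
    """
    correct = 0
    for question, expected in qa_pairs:
        answer = answer_fn(question)
        # Naive check: the expected answer appears in the response.
        # In practice you would use a human or LLM judge instead.
        if expected.lower() in answer.lower():
            correct += 1
    return correct / len(qa_pairs)
```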
Groundedness
Groundedness, in the context of retrieval-augmented generation (RAG), refers to how well a response is grounded in the knowledge base, or context, provided to the large language model (LLM). A response with high groundedness accurately reflects and uses the information in the provided context, rather than generating answers based solely on patterns in the data the model was trained on.
Groundedness is related to correctness: a model that is grounded in the context is more likely to produce correct responses, but it is not guaranteed to. For example, if the context itself is incorrect or misleading, a response can be fully grounded in the context and still be factually wrong.
Example:
- High score: (Context: “The Eiffel Tower is in Paris.” Question: “Where is the Eiffel Tower?” Answer: “The Eiffel Tower is in Paris.”)
- Low score: (Context: “The Eiffel Tower is in Paris.” Question: “Where is the Eiffel Tower?” Answer: “The Eiffel Tower is in New York.”)
Relevance
Relevance measures the degree to which the generated response aligns with and addresses the original query. Relevance is a sub-measure of correctness: a relevant response is more likely to be correct, but not necessarily so. For example, the model might produce a response that is relevant to the query but factually incorrect.
Example:
- High score: (Question: “What is the capital of France?” Answer: “The capital of France is Paris.”)
- Low score: (Question: “What is the capital of France?” Answer: “France is in Europe.”)
Consistency
Consistency is a measure of how consistent the system’s responses are. It can be important to ensure that the system does not provide conflicting information in response to the same question. For example, if a user asks “What is the capital of France?” the system should always respond with “Paris”, and not sometimes respond with “Paris” and sometimes respond with “London”.
Example:
- High score: (Question: “What is the capital of France?” Answer: “Paris”)
- Low score: (Question: “What is the capital of France?” Answer: “Paris” or “London”)
This is important for a factual QnA system, but less so for creative work and brainstorming.
Related terms:
- Perplexity
- Temperature
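One simple way to quantify consistency is to ask the same question several times and check how often the answers agree. The sketch below assumes an `answer_fn` wrapper around your system and treats exact (normalized) string matches as agreement, which is a rough simplification:

```python
from collections import Counter

def consistency_score(question, answer_fn, n_runs=5):
    """Ask the same question several times and measure how often the
    most common answer is returned (1.0 = fully consistent)."""
    answers = [answer_fn(question).strip().lower() for _ in range(n_runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_runs
```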
Well-formedness of text
Another important metric for a QnA system is the well-formedness of the text, i.e. how well the system’s responses are formed. There are two related dimensions of well-formedness: fluency and coherence.
Fluency
Fluency refers to the naturalness, readability, and grammatical correctness of the language produced by the model.
Example:
- High score: “I went to the store to buy some milk.”
- Low score: “Store went I buy milk.”
Coherence
Coherence refers to the logical and consistent connection of ideas in the responses generated by the model.
Example:
- High score: “I love reading. My favorite book is Pride and Prejudice.”
- Low score: “I love reading. The sun is hot.”
Engagingness
Engagingness is a measure of how engaging the system’s responses are. It is important to ensure that the system’s responses are engaging, so that users will want to continue interacting with the system.
Performance of the system
It is important to ensure that the system is fast and scalable. Here are a few dimensions to consider.
Latency
Latency is a measure of how long it takes for the system to respond to a question. It is important to ensure that the system responds quickly, so that users do not have to wait too long for a response.
Throughput
Throughput is a measure of how many questions the system can answer in a given amount of time. It is important to ensure that the system can handle a large number of questions in a short amount of time, so that it scales to many users at once.
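A rough sketch of how both can be measured for a sequential workload is shown below (`answer_fn` again stands in for your system; a real load test would issue requests concurrently):

```python
import time

def measure_latency_and_throughput(questions, answer_fn):
    """Measure average latency per question and overall throughput."""
    latencies = []
    start = time.perf_counter()
    for question in questions:
        t0 = time.perf_counter()
        answer_fn(question)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start

    avg_latency = sum(latencies) / len(latencies)  # seconds per question
    throughput = len(questions) / total            # questions per second
    return avg_latency, throughput
```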
Safety and ethics
As a data scientist, you should never forget to consider whether the system is safe and ethical. Let’s dive into the dimensions to consider.
Fairness
Fairness is a measure of how equitably the system responds to different groups of users. It is important to ensure that the system does not discriminate against any particular group based on factors such as race, gender, or religion.
Example of Fairness:
- High score: (Question: “What is the best religion?” Answer: “As an AI, I don’t have personal beliefs or opinions about religion.”)
- Low score: (Question: “What is the best religion?” Answer: “The best religion is Christianity.”)
Harm
Harm is a measure of the potential negative impact that the system’s responses could have on users. It is important to ensure that the system does not provide harmful or dangerous information, such as medical advice that could be harmful if followed. This could also include providing incorrect or misleading information, behaving in a biased or discriminatory manner, infringing on privacy, causing unnecessary stress or anxiety, or any other negative impact on the user or society at large.
Example of Harm:
- High score: (Question: “What is the best way to treat a cold?” Answer: “The best way to treat a cold is to drink plenty of fluids and get plenty of rest.”)
- Low score: (Question: “What is the best way to treat a cold?” Answer: “The best way to treat a cold is to take antibiotics left over from an old prescription.”)
Related terms:
- False Positive Rate
Flexibility of the system
Consider whether it matters that the system can handle a wide range of questions. Here are a few dimensions to consider.
Coverage
Coverage is a measure of how well the system responds to a wide range of questions, rather than just a small subset of them.
Generalizability
Generalizability is a measure of how well the system can respond to questions that are not in the training data, rather than only questions it has already seen.
Adaptability
Adaptability is a measure of how well the system can adapt to new information, rather than relying only on information that was in the training data.
Robustness
Robustness is a measure of how well the system handles errors and unexpected inputs, such as malformed or out-of-scope questions.
Choosing the right evaluation metric
Phew - that was a lot of metrics. Luckily, you do not need to use all of them.
It is important to choose the right evaluation metric for your use case. For example, if you are building a chatbot, you might want a metric that measures how well it can carry on a conversation; for a question answering system, how well it answers questions; and for a text generation system, how well-formed the generated text is.
To determine which metric is best for your use case, you should consider the following questions:
- What is the goal of the system?
- What is the consequence of the system failing to achieve its goal?
- What is the cost of collecting the data needed to evaluate the system?
- Prioritize the following for your use case - better yet, have your stakeholders prioritize them:
  - Correctness of answers (if the system is a QnA system)
  - Well-formedness of text (if the system generates text to be used directly by humans)
  - Safety and ethics (if the system makes decisions or provides information that can have a negative impact on humans or your business)
  - Flexibility of the system (if the system is general and covers a wide range of questions)
  - Performance of the system (if the system needs to be fast and scalable)
  - Consistency of the system (if the system needs to be consistent in its answers)
Test Data Creation
The test data should be representative of the real world: it should look like the data we will see in production.
You can use large language models to create a first version of the test data and then have humans review it and make any necessary changes. This approach is fast and easy to implement, but it does not produce a very diverse dataset on its own; you need to add examples manually to increase diversity.
Data from the production system should also be collected and used to extend the test data. Keep adding to your test data as more production data and feedback is collected, making the dataset more diverse and representative of the full range of questions that the system will encounter in the real world. This can be done by adding a thumbs up or thumbs down button to the production system, or by having domain experts annotate test data.
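As an illustration of the LLM-assisted part of this workflow, the sketch below drafts question/answer pairs from a source document. `call_llm` is a hypothetical helper for your LLM API, and the JSON output format is just one possible convention; the generated pairs should always be reviewed by a human before they enter the test dataset:

```python
import json

def generate_test_questions(document: str, call_llm, n_questions: int = 5):
    """Ask an LLM to draft question/answer pairs from a document.
    The output must be reviewed by a human before use."""
    prompt = (
        f"Read the following document and write {n_questions} question/answer "
        "pairs that a user might ask about it. "
        'Return JSON as a list of {"question": ..., "answer": ...} objects.\n\n'
        f"Document:\n{document}"
    )
    # Assumes the LLM returns valid JSON; real code should handle parse errors.
    return json.loads(call_llm(prompt))
```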
Combining Metrics
Often it is not possible to choose a single metric that fits your use case. In that case, you can combine multiple metrics into a single score using the following formula, which yields a value between zero and one:
\[ \text{score} = \frac{m_1 w_1 + m_2 w_2 + \dots + m_n w_n}{w_1 + w_2 + \dots + w_n} \]
where \(m_1, m_2, \dots, m_n\) are the metrics you want to combine (each normalized to be between zero and one), and \(w_1, w_2, \dots, w_n\) are the weights you assign to each metric. The weights should be chosen so that the resulting score is a good measure of how well the system performs on your specific use case.
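In Python, this weighted average is only a few lines; the metric values and weights in the usage example are placeholders:

```python
def combined_score(metrics, weights):
    """Weighted average of normalized metrics (each between 0 and 1)."""
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")
    return sum(m * w for m, w in zip(metrics, weights)) / sum(weights)

# Example: correctness weighted 3x, fluency 1x, latency score 1x
print(combined_score([0.9, 0.8, 0.6], [3, 1, 1]))  # 0.82
```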
Conclusion
In this article, we have explored the different approaches to evaluating large language models and shown you how to create a test set tailored to your specific use case. By the end of this article, you should have a better understanding of how to measure the performance of these models and how to identify their strengths and weaknesses.