Evaluating RAG Part I: How to Evaluate Document Retrieval
A guide to the evaluation of components in retrieval augmented generation
01.11.23
Retrieval Augmented Generation (RAG) is a prominent AI technique to elicit more accurate, customized results from a large language model (LLM). As more organizations look to incorporate RAG into their products, the question of evaluation becomes increasingly relevant.
But how do you go about evaluating a complex system like RAG? What are the methods for evaluating RAG pipelines – and how can you choose the right one? Is there any satisfactory way to evaluate generative LLMs, which are characterized by their linguistic creativity? And what role does user feedback play?
In this mini-series, we'll address all of these questions and shed light on how we at deepset approach the evaluation of RAG systems. We'll talk about some general evaluation concepts, such as annotation and metrics, and how they apply to the evaluation of RAG.
The second part of our series will deal with the particularly difficult task of evaluating LLM-generated content. But before that, let's turn our attention to the first part of a RAG pipeline, namely the document retriever.
The role of retrieval in RAG
In its simplest form, a RAG pipeline is a composite system that combines document retrieval with an LLM. The retrieval component first matches incoming queries against a large database of documents, selecting those documents that are most likely to contain relevant information. The pipeline then packages these documents in a prompt to the LLM, asking it to base its answer on them.
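To make that flow concrete, here is a minimal sketch in plain Python. The keyword-overlap retriever, the tiny document list, and the prompt template are illustrative stand-ins rather than any particular framework's API; in a real pipeline, the retriever would be a BM25 or embedding-based model and the assembled prompt would be sent to an LLM.

```python
# A minimal, self-contained sketch of the retrieval and prompt-packaging steps.
# The keyword-overlap retriever and the tiny document list are toy stand-ins
# for a real retriever (BM25, embeddings, ...) and document store.

DOCUMENTS = [
    "RAG combines a document retriever with a large language model.",
    "The retriever selects documents that are likely to answer the query.",
    "Metrics such as recall and MRR score the quality of the retrieved documents.",
]

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Package the retrieved documents into the prompt for the LLM."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return (
        "Answer the question using only the documents below.\n"
        f"Documents:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

query = "How is the retriever evaluated?"
prompt = build_prompt(query, retrieve(query, DOCUMENTS))
# In a real pipeline, `prompt` would now be sent to the LLM.
```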
This division of labor illustrates why the retrieval component plays such an important role in the overall performance of the pipeline: if it returns the right documents to the LLM, the likelihood of a good answer is much higher than if it performs poorly and returns a less appropriate selection of documents. That's why, when we find that our RAG pipelines are not meeting our expectations, we usually look to evaluate the retriever first.
What is evaluation?
Evaluation provides a protocol for assessing the quality of a system. Note that "quality" is relative, as it refers to how well your application solves your particular use case. The outcome of an evaluation often determines whether you want to further improve your system before deploying it to production.
When building with LLMs (and machine learning models in general), it's wise to adopt a practice of fast iteration, which consists mainly of rapid prototyping, evaluation, and tweaking the system according to the results. This means that the evaluation must be automated in some way - otherwise it will not be reproducible and will take too long. A scalable approach to evaluation is provided by metrics.
Quantitative evaluation through metrics
Think of metrics as a proxy for human judgment about how well your system performs for your use case. They are not the evaluation itself - just a very useful and scalable tool to complement other, mostly qualitative evaluation methods such as user feedback.
The method you use to evaluate your RAG pipeline is highly individualized to your particular project and its ultimate goals. When it comes to evaluating the retriever component, there are a number of metrics you can use. Before we look at those in detail, let's talk about annotation.
Annotating retrieval data
To use metrics, you need annotated, or labeled, data. Ideally, this data should represent the variety of real-world use cases that your application is likely to encounter in production. For the retrieval task, labeling is relatively straightforward: for a given query-document pair, your annotators must determine whether the document is relevant to the query or not.
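One common way to store these judgments is a simple mapping from each query to the set of documents labeled relevant. The format below is only an illustrative sketch with made-up queries and document IDs, not a prescribed schema.

```python
# Illustrative relevance labels: for each query, the set of document IDs
# that annotators judged relevant. Queries and IDs are made up.
labels = {
    "How do I reset my password?": {"doc_17", "doc_42"},
    "What is the refund policy?": {"doc_03"},
}
```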
Metrics and their applications
The purpose of evaluating your retriever is to determine how relevant its document selection is to a given query. For example, if none of the documents it retrieved and passed to the LLM help answer the query, we want our evaluation metric to reflect that the retriever did a poor job. On the other hand, if the retriever selected only relevant documents, we want our metric to say that it did extremely well.
For a representative result, we'd like to run our retrieval module on the entire annotated evaluation dataset. By comparing the retrieved selection to the annotated ground truth, the various metrics compute a score between zero (the retriever didn't return any relevant documents) and one (the retriever returned a perfect selection of documents). The final score, an average of all scores across the dataset, is a convenient way to judge the overall performance of the retriever.
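In code, this procedure is a short loop: score each query's retrieved list against its labels with a per-query metric, then average. The function names below are hypothetical placeholders; concrete per-query metrics are sketched in the sections that follow.

```python
def evaluate(queries, labels, retrieve, metric, top_k=5):
    """Average a per-query metric (between 0 and 1) over the evaluation set.

    `retrieve(query, top_k)` returns a ranked list of document IDs,
    `labels[query]` is the set of relevant document IDs, and
    `metric(retrieved, relevant)` returns a score between 0 and 1.
    """
    scores = [metric(retrieve(q, top_k), labels[q]) for q in queries]
    return sum(scores) / len(scores)
```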
Not all metrics are the same, though. They differ mainly in their focus and permissiveness. Here we take a look at some of the most commonly used metrics for evaluating retrieval. (Visit our metrics documentation page for a more detailed explanation.)
Recall
What is it? Recall measures the proportion of the relevant documents for a query that the retriever actually returns. This metric treats each returned document the same, regardless of its position in the result list. It is also sensitive to the total number of documents returned: the more documents the retriever selects, the greater the likelihood that the relevant ones will be among them.
When should I use it? Recall is a useful metric if the order of your results doesn’t matter. For example, if you’re going to use all your results in a downstream application, they are likely to be reordered anyway.
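A per-query recall function that plugs into the evaluation loop above could look like this sketch.

```python
def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant documents that appear anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)
```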
Mean reciprocal rank (MRR)
What is it? Unlike recall, mean reciprocal rank does care about the order in which the documents are returned. It computes a score that indicates how high up in the list the first correctly retrieved document is.
When should I use it? You should use MRR when your primary concern is that a relevant result appears as high up in the list as possible. For example, if your users are looking for an answer or tool for a specific problem, it is most important that a helpful result be returned at the top.
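For a single query, the reciprocal rank rewards the first relevant hit by the inverse of its position; MRR is simply its mean over all queries, computed with the evaluation loop above.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document; 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```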
Mean average precision (mAP)
What is it? Like MRR, mean average precision is order-aware, but instead of considering only the first relevant document, it takes the position of every relevant document in the result list into account.
When should I use it? This metric is useful if your system is order-aware and needs to retrieve multiple documents in each run. For example, the more relevant results you can show users of a recommendation system, the more likely they are to click on those results.
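One common formulation of average precision for a single query is sketched below; averaging it over all queries in the evaluation set gives mAP.

```python
def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Average the precision measured at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)
```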
After evaluation
So you've evaluated your retrieval model and found it lacking. What happens next? Thanks to advances in composite AI systems, there's a wide variety of things you can do. Which one you choose depends on how your system scores on the different evaluation metrics. Here are two possible scenarios and how to respond to them:
- If your retriever scores high on recall but not on MRR, it's selecting the right documents but not returning them in the ideal order. One solution might be to reorder your results using a ranking model (see the sketch after this list).
- If recall is low, it means your retriever isn't finding the right documents. Improving your retrieval method – perhaps by choosing a stronger model, a hybrid retrieval approach, or even by fine-tuning a model on your own data – can go a long way toward achieving a more powerful RAG system.
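For the first scenario, one common option is a cross-encoder ranker that rescores the retrieved documents against the query. The sketch below uses the sentence-transformers CrossEncoder as an example; the model name is only an illustration, and other ranking models would slot in the same way.

```python
from sentence_transformers import CrossEncoder

# Rescore each (query, document) pair with a cross-encoder and reorder
# the retriever's results by that score. The model name is an example.
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str]) -> list[str]:
    scores = ranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]
```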
Dive deeper with our webinar
Evaluation is a critical component of any successful machine learning project. If you want to learn more about evaluating RAG systems, check out this excellent webinar by Rob.
In the webinar, Rob, who works as an applied NLP engineer here at deepset, provides useful insights into how to think about RAG evaluation in practice. He offers practical examples of different scenarios that require different scoring methods and walks us through best practices for data annotation.
Stay tuned to our blog for the next post on RAG evaluation, where we'll talk about the challenges of evaluating LLMs, scalable approaches to measuring the quality of AI-generated responses, and the importance of combining quantitative metrics with qualitative user feedback.