Metrics to Evaluate a Question Answering System
Use quantifiable metrics coupled with a labeled evaluation dataset to reliably evaluate your Haystack question answering system
30.09.21
If you want to draw conclusions about a system’s quality, subjective impressions are not enough. Rather, you’d want to use quantifiable metrics — coupled with a labeled evaluation dataset — to reliably evaluate your question answering (QA) system.
Having an evaluation pipeline in place allows you to:
- Conduct informed assessments of your system’s quality,
- Compare the performance of different models, and
- Identify underperforming components of your pipeline.
In this tutorial, we’ll explain the concepts behind setting up an evaluation protocol for your QA system. We’ll then go through an example QA system evaluation.
Evaluation of QA Systems: An Overview
If you’ve already used the Haystack NLP framework, you might appreciate its modular approach to building a working pipeline for extractive QA. Nonetheless, the result is a complex system for natural language processing (NLP) that poses some unique challenges in terms of evaluation.
An extractive QA system consists of one Retriever and one Reader model that are chained together in a Pipeline object. The retriever chooses a subset of documents from a (usually large) database in response to a query. The reader then closely scans those documents to extract the correct answer.
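To make this concrete, here is a minimal sketch of such a pipeline. It assumes a Haystack v1-style API (class names have shifted slightly between releases) and an Elasticsearch document store that has already been populated with your documents:

```python
# Minimal extractive QA pipeline sketch (assumes a Haystack v1-style API and a
# running Elasticsearch instance that already contains your documents).
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = ElasticsearchDocumentStore()
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
prediction = pipeline.run(
    query="What is Haystack?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
```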
During evaluation, you’ll want to judge how the pipeline is performing as a whole, but it’s equally important to examine both components individually to understand whether one is underperforming. If the reader shows low performance, you may need to fine-tune it to the specifics of your domain. But if the retriever is causing a bottleneck, you might increase the number of documents returned, or opt for a more powerful retrieval technique.
Datasets for evaluation
Your evaluation should be based on manually annotated data against which your system's predictions can be checked. In a question answering context, annotators mark text spans in documents that answer a given query. (If you want to learn more about annotation in QA, check out our Haystack annotation tool guide!) Some datasets provide one answer per question, while others mark multiple options.
When a document does not contain the answer to a query, the annotators mark “None” as the correct answer to be returned by the evaluated system.
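For illustration, a SQuAD-style label records the answer text together with its character start index in the document, and flags unanswerable questions instead of giving them a span. A single, simplified and made-up example might look like this:

```python
# A simplified, SQuAD-style label shown as a Python dict (illustrative only).
annotated_example = {
    "context": "Haystack is a question answering library in Python ...",
    "question": "What is Haystack?",
    "answers": [
        # the character offset lets closed domain evaluation check the exact span
        {"text": "a question answering library in Python", "answer_start": 12},
    ],
    "is_impossible": False,  # True (with an empty answers list) for "None" labels
}
```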
Open vs. closed domain
There are two evaluation modes known as “open domain” and “closed domain.”
Closed domain means single document QA. In this setting, you want to make sure the correct instance of a string is highlighted as the answer. So you compare the indices of predicted against labeled answers. Even if the two strings have identical content, if they occur in different documents, or in different positions in the same document, they count as wrong. This mode offers a stricter and more accurate evaluation if your labels include start and end indices.
Alternatively, you should go for open domain evaluation if your labels are only strings and have no start or end indices. In this mode, you look for a match or overlap between the two answer strings. Even if the predicted answer is extracted from a different position (in the same document or in a different document) than the correct answer, that’s fine as long as the strings match. Therefore, open domain evaluation is generally better if you know that the same answer can be found in different places in your corpus.
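To make the difference concrete, here is a schematic sketch of what counts as a correct prediction in each mode (the field names are illustrative, not Haystack's internal data model):

```python
# Schematic comparison of closed vs. open domain matching
# (field names are made up for illustration).

def is_correct_closed_domain(pred, label):
    # The predicted span must come from the same document and cover the same
    # character offsets as the labeled span.
    return (
        pred["document_id"] == label["document_id"]
        and pred["start"] == label["start"]
        and pred["end"] == label["end"]
    )

def is_correct_open_domain(pred, label):
    # Only the answer strings need to match, regardless of where they were found.
    return pred["answer"].strip().lower() == label["answer"].strip().lower()
```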
Retriever metrics
To evaluate your system’s quality, you’ll need easily interpretable metrics that mimic human judgment. Because the reader and retriever have different functions, we use different metrics to evaluate them.
When running our QA pipeline, we set the top_k parameter in the retriever to determine the number of candidate documents that the retriever returns. To evaluate the retriever, we want to know whether the document containing the right answer span is among those candidates.
Recall measures how often the correct document was among the retrieved documents. For a single query, the outcome is binary: either the correct document is among the retrieved ones, or it is not. Over the entire dataset, the recall score is a number between zero (no query retrieved the right document) and one (every query retrieved the right document).
In contrast to the recall metric, mean reciprocal rank (MRR) takes the position of the correct document (its "rank") into account. It does this to account for the fact that a query elicits multiple responses of varying relevance. Like recall, MRR can be a value between zero (no matches) and one (the system retrieved the correct document as the top result for all queries).
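Both retriever metrics are easy to compute by hand. The sketch below is an illustration rather than Haystack's internal implementation; it assumes that, for each query, you know the id of the document containing the gold answer as well as the ranked list of retrieved document ids:

```python
def recall_at_k(gold_doc_ids, retrieved_doc_ids_per_query):
    """Fraction of queries whose gold document appears among the retrieved documents."""
    hits = sum(
        1
        for gold, retrieved in zip(gold_doc_ids, retrieved_doc_ids_per_query)
        if gold in retrieved
    )
    return hits / len(gold_doc_ids)

def mean_reciprocal_rank(gold_doc_ids, retrieved_doc_ids_per_query):
    """Average of 1 / rank of the gold document; a query scores 0 if it was missed."""
    reciprocal_ranks = []
    for gold, retrieved in zip(gold_doc_ids, retrieved_doc_ids_per_query):
        if gold in retrieved:
            reciprocal_ranks.append(1.0 / (retrieved.index(gold) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: three queries, retriever top_k = 3
gold = ["d1", "d7", "d4"]
retrieved = [["d1", "d2", "d3"], ["d5", "d7", "d6"], ["d8", "d9", "d2"]]
print(recall_at_k(gold, retrieved))           # 2/3, one query missed its document
print(mean_reciprocal_rank(gold, retrieved))  # (1 + 1/2 + 0) / 3 = 0.5
```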
Reader metrics
When evaluating the reader, we want to look at whether, or to what extent, the selected answer passages match the correct answer or answers. The following metrics can evaluate either the reader in isolation or the QA system as a whole. To evaluate only the reader node, we skip the retrieval process by directly passing the document that contains the answer span to the reader.
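In recent Haystack releases, this kind of evaluation can be run through the pipeline's eval() method, which can additionally feed the gold documents straight to the reader (so-called isolated node evaluation). A rough sketch, assuming the pipeline from above and a list of evaluation labels called eval_labels (exact argument and metric names may differ between versions):

```python
# Rough sketch of pipeline evaluation in Haystack v1 (argument and metric names
# may differ between releases). `eval_labels` is assumed to be a list of
# MultiLabel objects built from your annotated dataset.
eval_result = pipeline.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 10}},
    add_isolated_node_eval=True,  # also run the reader on the gold documents alone
)
metrics = eval_result.calculate_metrics()
print(metrics["Retriever"]["recall_single_hit"])
print(metrics["Reader"]["f1"])
```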
The name says it all. Exact match (EM) measures the proportion of predictions that are identical to the correct answer. For example, for the annotated question-answer pair “What is Haystack? — A question answering library in Python,” even a predicted answer like “A Python question answering library” would yield a score of zero because it does not match the expected answer 100 percent.
The F1 score is more forgiving than the EM score, and it more closely resembles human judgment of how similar two answer strings are. It measures the word overlap between the labeled and the predicted answer. Thus, the two answers in the example above would receive a high F1 score (around 0.9) despite an exact match score of zero.
The accuracy metric is used in closed domain evaluation: the reader scores 1 as soon as the predicted answer has any word overlap with the labeled answer. Consider the pair of answers “San Francisco” and “San Francisco, California”. While F1 and EM would penalize these for not being exactly the same, accuracy would give them a perfect score. This metric is more reflective of the end user's experience, since in many use cases the context around the predicted answer is shown to the user as well.
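The snippet below gives simplified versions of these reader metrics, following the descriptions above rather than Haystack's internal implementation (which, for example, also normalizes punctuation and articles before comparing strings):

```python
# Simplified reader metrics, skipping the normalization that SQuAD-style
# evaluation scripts usually apply.

def exact_match(prediction: str, label: str) -> int:
    return int(prediction == label)

def f1(prediction: str, label: str) -> float:
    pred_tokens, label_tokens = prediction.lower().split(), label.lower().split()
    common = set(pred_tokens) & set(label_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(label_tokens)
    return 2 * precision * recall / (precision + recall)

def lenient_accuracy(prediction: str, label: str) -> int:
    # scores 1 as soon as the two answers share at least one word
    return int(bool(set(prediction.lower().split()) & set(label.lower().split())))

pred = "A Python question answering library"
gold = "A question answering library in Python"
print(exact_match(pred, gold))       # 0
print(round(f1(pred, gold), 2))      # 0.91
print(lenient_accuracy(pred, gold))  # 1
```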
Semantic answer similarity
While F1 is more flexible than EM, it still does not address the fact that two answers can be equivalent even if they don’t share the same tokens. For example, both scores would rate the answers “one hundred percent” and “100 %” as sharing zero similarity. But as humans, we know that the two express exactly the same thing.
To make up for this shortcoming, a few members of the deepset team recently introduced the semantic answer similarity (SAS) metric. (The paper was accepted at EMNLP 2021, which we’re very excited about!) Rather than lexical overlap, it uses a Transformer-based cross-encoder architecture to evaluate the semantic similarity of two answers. SAS is available in Haystack as of the latest release, and an article on how to use it is coming soon. (Updated 07/11/2022: The article on SAS was published here.)
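Under the hood, SAS scores each prediction/label pair with a cross-encoder model. Outside of Haystack, you can get a feel for the idea with the sentence-transformers library; the model below is one commonly used STS cross-encoder, chosen here purely as an example:

```python
# Scoring the semantic similarity of two answers with a cross-encoder.
# Illustrative only: Haystack wires this up for you when you pass a SAS model
# to its evaluation routines.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-large")
score = model.predict([("one hundred percent", "100 %")])
print(score)  # expected to be high, despite zero lexical overlap between the answers
```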
Evaluating a QA System
Updated 07/11/2022: For an updated version of this section, please check our new article here.
Optimizing Question Answering Pipeline Performance with Haystack
Now that you know how to approach a QA pipeline performance evaluation in Haystack, it’s time to build high-quality question answering systems that are tailored to your use case!
Start by heading over to our GitHub repository. If you like what you see, give us a star :)