Troubleshooting Haystack Pipelines
Pipelines in Haystack allow you to quickly design and implement the desired neural search architecture. In this article we describe how to troubleshoot a question answering system.
10.08.21
If you’ve already worked with the Haystack framework, then you know that it can streamline your setup of a fully functional question answering (QA) system. Whatever your system’s degree of intricacy, pipelines allow you to quickly design and implement your architecture of choice. However, they may also introduce some added complexity if your code stops working.
In this article, we’ll go over some common issues you may face when building neural search and question answering pipelines in Haystack, before showing you how to debug them.
What Is a Haystack Pipeline?
Between accessing a database, retrieving documents that match your query, and extracting the relevant answer passages, modern question answering systems require you to carefully orchestrate many complex processes. That’s no simple task.
Haystack Pipelines make that easier: they’re tools that allow you to piece together the different components of a QA system. Instead of executing the components individually and having to take care of the data flow between them, pipelines let you automate this process. By handling the data flow for you, they eliminate unnecessary bookkeeping and help your program run smoothly.
Under the hood, a Pipeline is defined as a DAG — a directed acyclic graph whose nodes correspond to different Haystack components, such as the Retriever, the Ranker or the Reader. The order of the components within a graph controls how data flows through that graph.
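To make the idea of a DAG concrete, here is a minimal sketch that uses only Python's standard library. It is not Haystack's implementation: the three functions are toy stand-ins for a Retriever, a Ranker and a Reader, and the names nodes, graph and run_graph are hypothetical. It only illustrates how the edges of the graph decide the order in which data flows.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy stand-ins for pipeline components (assumptions, not Haystack classes).
# Each one takes a dict and returns an extended dict.
def retriever(data):
    return {**data, "documents": ["doc A", "doc B", "doc C"]}

def ranker(data):
    # Pretend ranking: reverse alphabetical order.
    return {**data, "documents": sorted(data["documents"], reverse=True)}

def reader(data):
    # Pretend reading: the top-ranked document is the answer.
    return {**data, "answer": data["documents"][0]}

nodes = {"Retriever": retriever, "Ranker": ranker, "Reader": reader}

# Each node maps to the set of nodes whose output it consumes.
graph = {"Retriever": set(), "Ranker": {"Retriever"}, "Reader": {"Ranker"}}

def run_graph(query):
    """Run every node in an order that respects the graph's edges."""
    data = {"query": query}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        data = nodes[name](data)
    return order, data

order, result = run_graph("Who opened the Chamber of Secrets?")
print(order)
print(result["answer"])
```

Swapping the edges in graph (for example, letting the Reader consume the Retriever directly) changes the execution order without touching the components themselves.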
Haystack comes with several predefined pipeline classes to cover the most common setups, but you can also design and implement your own pipelines to fit your use case. Learn more by reading our Pipelines documentation page.
When Should You Debug a Haystack Pipeline?
As your component pipeline grows in complexity, so does the potential for bugs. Errors can occur during initialization or runtime, for instance when pairing two components whose inputs and outputs are incompatible.
How to Debug a Haystack Pipeline
Below, we cover a few solutions to some of the most common problems regarding pipelines in Haystack.
Solution 1: Run Each Component in Isolation
Debugging an entire pipeline is less straightforward than debugging each element individually. When components are combined in a pipeline, they’re invoked via a single call: the pipeline’s run() method. In the order defined by the graph, run() calls the individual components one by one through their own run() methods. These, in turn, call each component’s main method: for the Retriever, this method is called retrieve(); for the Reader, it’s predict(). To understand what is going on, we encourage you to look at the source code, see how Node.run() is defined, and check which other Node methods it calls.
Executing multiple function calls one after another within a pipeline can result in a nested and complicated stack trace that contains multiple layers of execution. This can complicate debugging significantly. To simplify things, you could run each component individually (i.e., outside of the pipeline) and test its functionality in isolation. You could achieve this by calling the components’ main methods, like so:
>>> retrieved_docs = retriever.retrieve(query="Who opened the Chamber of Secrets?", top_k=10)
>>> retrieved_docs
[{'text': 'He also reveals that the Chamber of Secrets has been opened before and immediately punishes himself, as he is not supposed to reveal anything. After Dobby disappears, Dumbledore, McGonagall, and Madam Pomfrey enter with Colin Creevey,...}]
>>> ranked_docs = ranker.predict(query="Who opened the Chamber of Secrets?", documents=retrieved_docs, top_k=1)
>>> ranked_docs
[{'text': 'At least part of the legend was revealed to be true in 1943, when Tom Marvolo Riddle, the heir of Slytherin, opened the Chamber and used the Basilisk to attack Muggle-borns...}]
Oftentimes, a component on its own may work as expected but fail when plugged into a pipeline. When that happens, the culprit might be the format of the data that’s being passed from one component to another.
Some components such as Retrievers and Rankers are designed to properly fit each other’s input and output formats, but not all components can be combined straight out-of-the-box, as we’ll see in an example. When in doubt, feel free to check our documentation pages or look at the components’ source code.
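One way to spot a format mismatch before wiring two components together is to inspect the output of the first one by hand. The sketch below assumes that documents are plain dictionaries, as in the printouts above, and that the downstream component expects an "embedding" key. The helper missing_fields and the sample documents are hypothetical and not part of Haystack.

```python
def missing_fields(documents, required):
    """Return (index, missing keys) for every document lacking a required key."""
    problems = []
    for index, doc in enumerate(documents):
        missing = sorted(key for key in required if key not in doc)
        if missing:
            problems.append((index, missing))
    return problems

# Made-up sample output of an upstream component.
retrieved_docs = [
    {"text": "Tom Riddle opened the Chamber in 1943.", "embedding": [0.1, 0.2]},
    {"text": "Dobby warns Harry about the Chamber."},  # no embedding
]

# Keys we assume the downstream component needs.
problems = missing_fields(retrieved_docs, required={"text", "embedding"})
print(problems)
```

An empty list means every document carries the fields the next component relies on. Anything else tells you which document to look at first.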
Initialization vs. Execution
It’s worth noting the difference between pipeline methods pertaining to execution versus initialization. If an error occurs during the run() method call, then it’s related to execution. Errors occurring at any other time likely reflect an initialization issue. The timing of the error should tip you off as to whether to check the components’ input and output formats (execution) or your database connections (initialization).
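The distinction can be sketched by guarding the two phases separately. The classes FakeStore and FakeRetriever and the function classify_failure are hypothetical stand-ins written for this illustration. They only mimic the shape of a document store and a retriever and do not reflect Haystack's actual classes.

```python
class FakeStore:
    """Hypothetical stand-in for a document store backed by a database."""
    def __init__(self, reachable):
        if not reachable:
            raise ConnectionError("cannot reach the document store")

class FakeRetriever:
    """Hypothetical stand-in for a retriever."""
    def __init__(self, document_store):
        self.document_store = document_store

    def retrieve(self, query, top_k):
        if not isinstance(top_k, int):
            raise TypeError("top_k must be an int")
        return [{"text": "placeholder document"}][:top_k]

def classify_failure(reachable, top_k):
    """Report whether a failure happened while building or while running."""
    try:
        retriever = FakeRetriever(FakeStore(reachable))
    except Exception as exc:
        return f"initialization: {exc}"
    try:
        retriever.retrieve(query="Who opened the Chamber of Secrets?", top_k=top_k)
    except Exception as exc:
        return f"execution: {exc}"
    return "ok"

print(classify_failure(reachable=False, top_k=10))
print(classify_failure(reachable=True, top_k="10"))
print(classify_failure(reachable=True, top_k=10))
```

In a real project you would rarely catch exceptions this broadly. The point is only that the phase in which the error surfaces narrows down where to look.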
Solution 2: Check the Error Printout
If you’re not a fan of cryptic error messages that seemingly withhold useful information, you’re not alone. Still, learning how to read error printouts goes a long way towards building bug-free programs. The good news is that Haystack makes errors human-readable.
When an error occurs within a node in a pipeline, Haystack captures the error and prints out a message that mentions the relevant node and input arguments, and a stack trace that includes the list of calls leading to the error. Consider the following program where we implement a custom Retriever-Ranker-Generator pipeline.
We’ll start by importing the modules and initializing the components:
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.ranker import FARMRanker
from haystack.generator import RAGenerator
# document_store is assumed to be an already initialized document store
retriever = ElasticsearchRetriever(document_store=document_store, top_k=10)
ranker = FARMRanker(model_name_or_path="sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking", top_k=5)
generator = RAGenerator(model_name_or_path="facebook/rag-token-nq", top_k=1)
Next, we’ll combine the elements into a pipeline:
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=retriever, name='Retriever', inputs=['Query'])
pipeline.add_node(component=ranker, name='Ranker', inputs=['Retriever'])
pipeline.add_node(component=generator, name='Generator', inputs=['Ranker'])
So far, so good. Now, let’s query our system:
query = "Who opened the Chamber of Secrets?"
result = pipeline.run(query=query)
Running the query against our pipeline produces the following error:
Exception: Exception while running node `Generator` with input `{'query': 'Who opened the Chamber of Secrets?', 'documents': [{'text': "==First opening of the Chamber of Secrets== Slytherin's Basilisk slithering through the Chamber of Secrets In 1943, the Chamber of Secrets was opened by a Slytherin Fifth year, named Tom Rid..."}]
AttributeError: _prepare_passage_embeddings need a DPR instance as self.retriever to embed document
Because the error occurs within a node in a pipeline, it’s captured and the program prints out information that’s specific enough to begin debugging right away. The message prints out the name of the node where the error occurred and the specific input that caused the error. You could take this input, retrieve the node by passing its name to the pipeline’s get_node() method (i.e., call pipeline.get_node("Generator")), and call run() on the retrieved node to test it.
(Hint: The solution to the above error would be to swap the ElasticsearchRetriever for a dense retriever such as the DensePassageRetriever, as required by Generators.)
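The replay workflow described above can be sketched with a toy pipeline. ToyPipeline, ToyRetriever and ToyGenerator are hypothetical classes written for this illustration. They imitate the error wrapping and the get_node() lookup but are not Haystack's implementation, and the error text is simplified.

```python
class ToyPipeline:
    """Simplified stand-in for a pipeline that runs nodes in insertion order."""
    def __init__(self):
        self.nodes = []  # list of (name, component)

    def add_node(self, component, name):
        self.nodes.append((name, component))

    def get_node(self, name):
        return dict(self.nodes)[name]

    def run(self, **data):
        for name, component in self.nodes:
            try:
                data = component.run(**data)
            except Exception as exc:
                # Re-raise with the node name and the input that caused the error.
                raise Exception(
                    f"Exception while running node `{name}` with input `{data}`: {exc}"
                ) from exc
        return data

class ToyRetriever:
    def run(self, query):
        return {"query": query, "documents": [{"text": "placeholder document"}]}

class ToyGenerator:
    def run(self, query, documents):
        if any("embedding" not in doc for doc in documents):
            raise AttributeError("documents need embeddings")
        return {"query": query, "answers": ["toy answer"]}

pipeline = ToyPipeline()
pipeline.add_node(ToyRetriever(), name="Retriever")
pipeline.add_node(ToyGenerator(), name="Generator")

message = None
try:
    pipeline.run(query="Who opened the Chamber of Secrets?")
except Exception as exc:
    message = str(exc)
    print(message)

# Replay the failing node on its own, using the input shown in the message.
failing_input = {
    "query": "Who opened the Chamber of Secrets?",
    "documents": [{"text": "placeholder document"}],
}
generator = pipeline.get_node("Generator")
isolated_error = None
try:
    generator.run(**failing_input)
except AttributeError as exc:
    isolated_error = str(exc)

# Once the input carries embeddings, the same node runs without error.
fixed_docs = [{**doc, "embedding": [0.0]} for doc in failing_input["documents"]]
fixed_output = generator.run(query=failing_input["query"], documents=fixed_docs)
print(isolated_error, fixed_output["answers"])
```

Running the node alone gives you a much shorter stack trace and lets you try different inputs without re-running the upstream components.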
Additionally, Python’s stack trace can help you pinpoint the exact line that raised the error. Here’s the full Python stack trace printed just below the Exception:
full stack trace: Traceback (most recent call last):
File "/home/ubuntu/haystack/haystack/pipeline.py", line 131, in run
node_output, stream_id = self.graph.nodes[node_id]["component"].run(**node_input)
File "/home/ubuntu/haystack/haystack/generator/base.py", line 29, in run
results = self.predict(query=query, documents=documents, top_k=top_k_generator)
File "/home/ubuntu/haystack/haystack/generator/transformers.py", line 239, in predict
passage_embeddings = self._prepare_passage_embeddings(docs=documents, embeddings=flat_docs_dict["embedding"])
File "/home/ubuntu/haystack/haystack/generator/transformers.py", line 183, in _prepare_passage_embeddings
raise AttributeError("_prepare_passage_embeddings need a DPR instance as self.retriever to embed document")
AttributeError: _prepare_passage_embeddings need a DPR instance as self.retriever to embed document
Even a relatively simple pipeline like the one above is still complex enough to raise an error with a verbose stack trace. As a rule of thumb, we recommend reading the stack trace from the bottom up: the most recent function calls are the ones most relevant to the error, and they show up at the bottom of the stack.
In our example, you can see that the bottom of the stack trace contains the AttributeError, whereas the lines just above it point to the module and line of code responsible for the error. Common errors, including those related to out-of-memory issues, SIGTERMs, and CUDA performance, all appear in the stack trace.
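If you prefer to pull out the bottom of the stack programmatically, Python's built-in traceback module can do it. The sketch below uses three toy functions (run, predict and prepare_embeddings, all made up for this example) that fail in the same nested fashion as the trace above, then extracts the innermost frame.

```python
import traceback

# Toy call chain that mirrors the nesting of the trace above.
def prepare_embeddings(documents):
    raise AttributeError("need a DPR instance to embed documents")

def predict(query, documents):
    return prepare_embeddings(documents)

def run(query, documents):
    return predict(query, documents)

try:
    run("Who opened the Chamber of Secrets?", [{"text": "placeholder document"}])
except AttributeError as exc:
    frames = traceback.extract_tb(exc.__traceback__)
    call_chain = [frame.name for frame in frames]  # outermost call first
    innermost = frames[-1]                         # the frame that raised
    error_line = f"{type(exc).__name__}: {exc}"

print(error_line)
print(f"raised in {innermost.name}(), line {innermost.lineno}")
print("call chain, outermost first:", " -> ".join(call_chain))
```

The last entry of the extracted frames corresponds to the bottom of the printed stack trace, which is where we suggest you start reading.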
Solution 3: Visualize the Graph
Finally, you can also visualize your pipeline — it’s as easy as calling the draw() method. The resulting visualization is a PNG file that shows the graph’s structure.
Visualizing your pipeline can serve as a sanity check for uncovering logical errors in it.
Need Help Debugging Your Haystack Question Answering System?
Don’t let bugs get in the way! If you run into any issue that’s not covered in this post, we encourage you to file a GitHub Issue or reach out by joining our Discord community. We’re happy to help!