
Customizing Retrieval Augmented Generation (RAG) Systems

Introduction to Advanced RAG: How to extend basic generative AI pipelines for any use case

Retrieval augmented generation (RAG) has become the standard for generative AI applications with LLMs. It's a smart, safe, and cost-effective way to improve LLM performance and reduce hallucinations, while providing the ability to dynamically expand the knowledge base at any time.

But business leaders and product managers take note: While the basic RAG setup is straightforward, it should be considered a starting point rather than a full-fledged solution. To truly harness the power of generative AI and gain a competitive advantage, you must customize and extend the basic setup to meet your specific business needs. 

The difference between a generic solution and a customized, production-ready RAG system is the key to unlocking the value of your AI initiatives.

In this blog post, we provide an introduction to the most common RAG-based paradigms. We explore how Compound AI supports these advanced implementations and provide guidance on mapping your unique use case to the ideal technology implementation. For decision makers looking to stay ahead in the AI race, understanding these customization strategies isn't just beneficial – it's mission-critical.

Basic RAG: The foundation

A classic RAG setup includes a retriever, a prompt component, and an LLM API connector:

  • The retriever acts like a smart search engine. When a user asks a question, the retriever selects documents from a database to build the context required to answer that question.
  • The prompt component takes the user's question and the retrieved documents and formats them in a way the LLM can understand.
  • The LLM API connector sends this formatted input to an LLM (like GPT-4) and gets back a generated response.
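To make this concrete, here's what a minimal version might look like in Python. Note that `vector_db.search` and `call_llm` are hypothetical placeholders for your document store client and LLM API connector of choice, not a specific library's API:

```python
# Minimal RAG sketch: retrieve -> build prompt -> generate.
# `vector_db.search` and `call_llm` are hypothetical placeholders for
# your document store client and LLM API connector.

def answer(question: str, vector_db, top_k: int = 5) -> str:
    # Retriever: fetch the documents most relevant to the question.
    documents = vector_db.search(question, top_k=top_k)

    # Prompt component: format question and context for the LLM.
    context = "\n\n".join(doc.content for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # LLM API connector: send the prompt and return the response.
    return call_llm(prompt)
```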

This setup helps ensure that the AI's answers are based on specific, verified information, not just what it "knows" from its training data. It’s ideal for basic question-answering systems where the primary goal is to ground the AI's answers in a knowledge base. But that's not always enough. Fortunately, the modularity principle embodied in Compound AI means that we don't have to stick with the standard setup.

Modularity, extensibility, customization: Compound AI's recipe for success

Compound AI is the dominant paradigm in AI system design today. It provides a way to build complex AI systems from multiple components that work together to solve tasks. These components can be driven by LLMs, by other machine learning models, or they can even be rule-based. With Compound AI's modular approach, additional components can be added to the basic RAG setup to create more sophisticated systems. This inherent adaptability enables AI teams to create LLM-based solutions for each unique business use case.

Customizing RAG: The most common setups

To help you navigate the world of components and pipeline configurations, here is an overview of advanced RAG setups. For every position in a basic RAG pipeline, we’ll present a typical component, its purpose, implementation, use cases, and other components that could replace or complement it.

Before retrieval 

Component name: Query classifier

Purpose: A query classifier evaluates the incoming query to decide which route in the pipeline graph it should take. It can be used to detect and block attempted prompt injections, or to flag queries that are off-topic for the specific pipeline. In an advanced RAG system with multiple routes, a classifier can introduce agentic behavior to the pipeline by determining which branch a query should be sent to.

Prompt injection occurs when users craft requests to an LLM-based application not to use it as designed, but with malicious intent – for example, to extract sensitive information or damage the brand's reputation.

Implementation overview: Depending on the complexity of the task, query classifiers can leverage small classification models or LLMs to assess the query and determine its further trajectory in the RAG graph.
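As a rough sketch, a lightweight classifier could be built on a zero-shot classification model from the Hugging Face transformers library. The route labels and confidence threshold below are illustrative assumptions you would tune for your own pipeline:

```python
from transformers import pipeline

# Zero-shot classifier as a lightweight query router.
# The route labels and threshold are illustrative assumptions.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

ROUTES = ["product question", "technical support", "off-topic"]

def route_query(query: str, threshold: float = 0.5) -> str:
    result = classifier(query, candidate_labels=ROUTES)
    top_label, top_score = result["labels"][0], result["scores"][0]
    # Low-confidence or off-topic queries skip retrieval entirely.
    if top_label == "off-topic" or top_score < threshold:
        return "reject"
    return top_label
```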

Typical use cases: Applications that need to make intelligent routing decisions based on the nature of incoming queries.

Alternative components: At this position in the pipeline, you could also include components to expand or decompose the query, sending more information to the retriever(s) in your pipeline for more relevant results.
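For illustration, a simple LLM-based query expansion might look like this, with `call_llm` again standing in as a hypothetical LLM connector:

```python
# Query expansion sketch: ask an LLM for paraphrases of the query,
# then retrieve with each variant. `call_llm` is a hypothetical helper.

def expand_query(query: str, n: int = 3) -> list[str]:
    prompt = (
        f"Rewrite the search query below in {n} different ways, "
        f"one per line.\n\nQuery: {query}"
    )
    variants = call_llm(prompt).splitlines()
    return [query] + [v.strip() for v in variants if v.strip()]
```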

During retrieval

Component name: Hybrid retrieval

Purpose: Your RAG pipeline needs a high quality retrieval component to work well. But different retrievers have different strengths and weaknesses. Semantic retrievers select documents based on meaning, while keyword-based retrievers identify matching documents based on common words.

Implementation overview: To make your retrieval more effective, you can combine the results of different methods to build the context for your LLM. This is called hybrid retrieval. A hybrid retrieval component consists of multiple retrievers whose results are then combined by an additional document-joining component. Because of its low effort and high reward, it's become a staple of RAG pipeline setups.
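One common way to join the results is reciprocal rank fusion (RRF), which rewards documents that rank highly in several retrievers' result lists. Here's a minimal sketch, which assumes each retriever returns a ranked list of document IDs:

```python
from collections import defaultdict

# Reciprocal rank fusion (RRF): merge ranked lists from several
# retrievers into one ranking. Documents ranked highly by multiple
# retrievers rise to the top; k=60 is the conventional constant.

def reciprocal_rank_fusion(result_lists, k: int = 60):
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage sketch: `bm25` and `embedder` are hypothetical retrievers.
# merged = reciprocal_rank_fusion([bm25.search(q), embedder.search(q)])
```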

Typical use cases: Search systems dealing with diverse content types or multiple domains.

Alternative components: Another popular technique to improve retrieval and increase the chance of finding relevant information is to generate hypothetical document vectors based on the query, a technique known as HyDE (Hypothetical Document Embeddings). In addition to text-based retrieval, you can also do multimodal retrieval. A multimodal RAG setup might include specialized components for matching a query to images, audio files, or business intelligence tables (via text-to-SQL).
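To illustrate the idea behind HyDE, here's a minimal sketch; `call_llm` and `vector_db` are hypothetical placeholders:

```python
# HyDE sketch: retrieve with the embedding of a *hypothetical* answer
# rather than the raw query. `call_llm` and `vector_db` are
# hypothetical placeholders.

def hyde_retrieve(query: str, vector_db, top_k: int = 5):
    # 1. Let the LLM draft a plausible (possibly wrong) answer passage.
    fake_doc = call_llm(f"Write a short passage answering: {query}")
    # 2. Search the document store with that passage instead of the
    #    query; real documents tend to be closer in embedding space to
    #    an answer than to a question.
    return vector_db.search(fake_doc, top_k=top_k)
```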

After retrieval

Component name: Ranker

Purpose: Retrievers return the documents that are most relevant to the query, depending on the retrieval method used. However, they may not be as accurate in determining relevance as we would like. Therefore, it is good practice to add a ranking component to the pipeline. This component helps narrow down the documents we send to the LLM by more accurately ranking the retrieved documents by relevance. Rankers can be model-based or metadata-based.

Implementation overview: Because they deal with far fewer documents than retrievers, rankers can use more powerful models. Or they can use metadata such as a document's subject, author, or publication date to re-rank results based on recency, diversity, and other criteria. In a typical setup, the retriever sends 40-60 documents to the ranker, which re-ranks them and then sends only the top 20 to the LLM. This saves cost, cuts latency, and reduces noise in the data received by the LLM, improving its performance.
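For example, a model-based ranker might use a cross-encoder from the sentence-transformers library; the model name and cut-off below are illustrative choices:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker: scores each (query, document) pair jointly,
# which is slower than a retriever but considerably more accurate.
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_n: int = 20) -> list[str]:
    scores = ranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
    # Forward only the best-scoring documents to the LLM.
    return [doc for doc, _ in ranked[:top_n]]
```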

Typical use cases: In a hybrid retrieval pipeline or when retriever recall is low.

Recall in information retrieval is the fraction of relevant documents that are successfully retrieved out of all relevant documents in the collection.

Post-generation

Component name: Reference prediction

Purpose: In RAG, the LLM should use only the information contained in the retrieved documents to generate its answer. Source references make it easier to check the LLM's claims.

Implementation overview: You can ask the LLM to include references to the source documents in its answer. However, you can get more accurate results by using a specialized model. This model takes the answer generated by the LLM and the previously retrieved documents. It then predicts for each sentence which documents, if any, it is based on. Again, modularity reigns supreme.
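As a heavily simplified stand-in for such a specialized attribution model, the sketch below matches each answer sentence to its closest source document via TF-IDF similarity; the similarity threshold is an illustrative assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Reference prediction sketch: for each sentence of the answer, find
# the source document it most resembles. A production system would use
# a trained attribution model; TF-IDF similarity is a simple stand-in.

def predict_references(answer: str, documents: list[str], min_sim=0.2):
    # Crude sentence splitting for illustration purposes only.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    vectorizer = TfidfVectorizer().fit(documents + sentences)
    doc_vecs = vectorizer.transform(documents)
    references = []
    for sentence in sentences:
        sims = cosine_similarity(vectorizer.transform([sentence]), doc_vecs)[0]
        best = sims.argmax()
        # Only annotate sentences that clearly match a source document.
        references.append((sentence, best if sims[best] >= min_sim else None))
    return references
```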

Typical use cases: Any application where the results of the RAG pipeline are presented directly to the user. Each sentence then comes with a source annotation that the user can easily verify.

Alternative components: In industry applications, it’s common to have not just one prompt in a RAG pipeline, but several. Using the principle of modularity, this allows the system to process the query using different LLMs – taking advantage of their different strengths. For example, you could use one prompt + LLM connector block to generate a response, and another to translate that response into different languages, add additional context, or modify it in any other way.
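Here's a minimal sketch of two chained prompt + LLM blocks, with `call_llm` as a hypothetical connector; in practice, each block could even call a different model:

```python
# Two chained prompt + LLM blocks: the first generates the answer, the
# second post-processes it (here: translation). `call_llm` is a
# hypothetical connector; each call could target a different model.

def answer_and_translate(question: str, context: str, language: str) -> str:
    draft = call_llm(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(f"Translate the following answer into {language}:\n{draft}")
```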

Custom RAG in the real world

The examples above are all blueprints of what an advanced RAG setup can look like. However, when using RAG for real-world applications, things can quickly get more complex. Just take a look at this setup from one of our customers:

As the diagram shows, this customer is using a custom RAG setup that includes elements like query expansion, query transformation, and a hybrid retrieval setup consisting of sparse and dense retrieval. Keep in mind, however, that the goal is not to build the most complex system possible, but rather to create a RAG setup that efficiently and effectively meets your specific business needs.

A common way to do this is to take an iterative approach: Start with a basic RAG setup and evolve it through multiple cycles of prototyping, feedback, and adjustments to the system. Read O'Reilly's report on LLM Adoption in the Enterprise to learn more about best practices for product development with AI.

How RAG is evolving

In recent months, RAG has seen some exciting improvements that do not fit neatly into any one place in the basic RAG setup, because they span the entire RAG pipeline and often modify it in multiple ways at once. These approaches move beyond the unidirectional RAG graph to much more complex setups that can include loops and conditional branches. So in this final section, we'll look at perhaps the most advanced RAG setups available today: Agentic RAG and GraphRAG.

Agentic RAG: Dynamic problem-solving

We briefly discussed agents in the section on query classifiers. These AI systems are equipped with multiple tools that they can dynamically invoke based on the task at hand. Tools can be individual components, combinations of components, and even entire RAG pipelines! For example, a question-answering agent could be empowered to answer questions based on either a proprietary, static database of documents or, if the information needs to be current, by browsing the Web. It could then retrieve, combine, and synthesize information using an LLM.
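A heavily simplified sketch of this tool-selection logic might look as follows; `call_llm`, `static_db_pipeline`, and `web_search_pipeline` are hypothetical placeholders, the latter two standing in for entire RAG pipelines used as tools:

```python
# Agent sketch: the LLM decides per query which tool to invoke.
# `call_llm`, `static_db_pipeline`, and `web_search_pipeline` are
# hypothetical placeholders for whole RAG pipelines used as tools.

TOOLS = {
    "internal_docs": lambda q: static_db_pipeline(q),  # proprietary database
    "web_search": lambda q: web_search_pipeline(q),    # current information
}

def agent_answer(question: str) -> str:
    choice = call_llm(
        "Pick the best tool for this question, answering with exactly "
        f"one of {list(TOOLS)}.\n\nQuestion: {question}"
    ).strip()
    tool = TOOLS.get(choice, TOOLS["internal_docs"])  # safe fallback
    return tool(question)
```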

This approach excels at solving open-ended, complex problems that require adaptive reasoning, seamless integration of disparate information sources, and strategic use of multiple tools or methods. With Agentic RAG, AI teams can build more powerful systems that can dynamically adapt to different tasks.

GraphRAG: Connecting the dots for more insightful responses

GraphRAG extends RAG by using LLMs to construct knowledge graphs from document sets. It maps entities and their relationships, providing both high-level and granular views of information. GraphRAG excels at answering abstract questions, understanding cross-document relationships, and uncovering hidden insights. It's useful for complex domains such as financial analysis, legal document review, and medical research, where interconnected information is critical.
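As a toy illustration of the core idea, the sketch below builds a tiny knowledge graph from entity triples with the networkx library and answers a multi-hop question by traversing it. In a real GraphRAG setup, an LLM would extract the triples from your documents; here they are hard-coded:

```python
import networkx as nx

# GraphRAG sketch: store (entity, relation, entity) triples in a graph
# and answer relationship questions by traversing it. In a real setup,
# an LLM would extract the triples; here they are hard-coded examples.

triples = [
    ("Acme Corp", "acquired", "Beta Inc"),
    ("Beta Inc", "develops", "Widget X"),
]

graph = nx.DiGraph()
for head, relation, tail in triples:
    graph.add_edge(head, tail, relation=relation)

# Multi-hop question: how is Acme Corp connected to Widget X?
path = nx.shortest_path(graph, "Acme Corp", "Widget X")
print(" -> ".join(path))  # Acme Corp -> Beta Inc -> Widget X
```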

Let’s talk RAG and more

At deepset, we help AI teams build solutions to real-world problems. So naturally, RAG and its many flavors have quickly become our daily bread :) Talk to us if you want to learn how this awesome technology can be applied to your individual business use case.

Another important customization method related to RAG is the data preparation and indexing process. We'll take a closer look at this in our next blog. We also have an upcoming blog post on the role of metadata in RAG customization. Stay tuned!