Always Retrieval Augment Your Large Language Models
Why retrieval augmented generation (RAG) is the key to safer applications with large language models (LLMs)
07.07.23
Large Language Models (LLMs) are more powerful than anything we’ve seen before in Natural Language Processing (NLP). Models like GPT-4 and Falcon display an impressive command of language and knowledge, and can masterfully handle a wide range of tasks, from translation to essay writing. But they’re also volatile: an LLM’s output for the same prompt can vary from one run to the next, and these models often make up facts and present them to you with a straight face.
If you want to use generative AI in production – be it as a virtual customer assistant, a company-internal search interface, or a content generation engine – you cannot rely on the LLM black box alone.
In this blog post, we’ll talk about how retrieval augmented generation (RAG) can help you channel the power of large language models in a way that truly benefits users. So if you’re interested in using LLMs for business applications that are reliable, helpful, and safe, then this article is for you.
The problem with LLMs
When you use an LLM out of the box – for instance, in the ChatGPT browser interface, or by sending your queries to the OpenAI API – your instruction is embedded in a prompt and fed to the model. The model then generates a response based on the input.
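To make this concrete, here is roughly what that out-of-the-box setup looks like in code: a minimal sketch that assumes the OpenAI Python SDK (v1) and the gpt-4 model. The exact call will look different depending on your provider and library version.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

user_query = "What is the maximum takeoff weight of the A320?"

# The user query is wrapped in a simple prompt and sent to the model.
# With no extra context, the model answers purely from its training data.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_query},
    ],
)

print(response.choices[0].message.content)
```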
The output of these models is well-formed, conversational, and often helpful. But there’s a problem.
By default, the model bases its responses on the textual content it has ingested during training – and there’s no telling exactly what went into that data, or how the model recombines it to generate novel text. LLMs are also notoriously bad at recognizing when they don't know something. Since they're trained to complete text sequences, they'll try to answer anyway. The result is often the infamous “hallucination.”
Using LLMs in business applications
If you’re building an application that uses AI, you probably don’t want to leave the quality of its responses up to the model itself. And why should you? Most organizations own valuable text data to draw on. Here are just a few examples of the data that some companies have access to, inspired by existing deepset clients:
- A legal publishing house has assembled millions of legal documents, including court rulings and contractual clauses.
- An aircraft manufacturer owns thousands of pages from technical manuals for pilots.
- An acclaimed daily news publisher has stockpiled an ever-growing database of articles.
If you can get the LLM to base its answers on those documents, the quality of its output will increase dramatically. So, in addition to avoiding hallucinations in general, we also want to encourage responses that are grounded in the vetted content we control, rather than in some undefined source from the training data.
As we’ve seen above, the “prompt” is a bit more than just the user query – it’s a guideline that commonly embeds the query and instructs the LLM to answer it. And here’s the trick: we can include more than just the query in our prompt. While the exact length of context permitted depends on the LLM we’re using, every model provides some limited space for additional information.
Rather than only passing our user query to the LLM, we can therefore include more context in the prompt – for example, a legal document, a technical manual, or a collection of newspaper articles. We can then instruct the model to generate its answer based on that data.
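Here is a sketch of what such a prompt can look like: a simple template that injects a handful of documents alongside the user query. The template wording and the document snippets below are made up for illustration; any format that clearly separates context and question works.

```python
# A minimal prompt template that grounds the answer in provided documents.
# The documents and query below are hypothetical placeholders.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that you don't know.

Context:
{context}

Question: {query}
Answer:"""

documents = [
    "Section 4.2: The maximum takeoff weight of the aircraft is 79,000 kg.",
    "Section 4.3: The maximum landing weight is 66,000 kg.",
]

prompt = PROMPT_TEMPLATE.format(
    context="\n".join(documents),
    query="What is the maximum takeoff weight?",
)

# `prompt` can now be sent to the LLM in place of the bare user query.
print(prompt)
```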
And if we don’t know in advance which document contains the answer to our user’s query? Enter retrieval augmentation.
What is retrieval augmented generation (RAG)?
Let’s say your textual database of reviewed content contains around 1,000 documents. Not only would this amount of text exceed the context windows of most models, but only a few of those documents are likely to contain context relevant to any given query. Selecting them manually is tedious and impractical, especially for a production system. Instead, you can use a powerful document search technique called retrieval.
In RAG, you delegate the context-finding step to a retrieval module. Its task is to identify the documents that are most likely to contain the relevant information and pass those to the prompt together with the user’s query.
With this setup, you not only provide much better conditions for the model to come up with a well-grounded, factually correct answer; you also make it much easier for the user to fact-check that answer, since they can look at the documents it was based on.
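To illustrate the flow end to end, here is a minimal, self-contained sketch of a retrieval augmented pipeline. The word-overlap retriever is only a toy stand-in for a real retrieval technique such as BM25 or embedding-based vector search, and the final LLM call is left as a placeholder; none of this reflects a specific deepset implementation.

```python
documents = [
    "The court ruled in 2021 that the clause was unenforceable.",
    "Maximum takeoff weight: 79,000 kg, per the flight manual.",
    "The publisher's archive dates back to 1946.",
    # in practice: thousands of documents from your own database
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    A real system would use BM25 or embedding similarity instead."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(d.lower().split())), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Embed the retrieved documents and the query in a single prompt."""
    context = "\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

query = "What is the maximum takeoff weight?"
retrieved = retrieve(query, documents)
prompt = build_prompt(query, retrieved)

# `generate` is a placeholder for whatever LLM call you use (OpenAI, Falcon, etc.).
# answer = generate(prompt)
print(prompt)
```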
The advantages of combining generative AI and retrieval
As we’ve seen, the retrieval step acts as a filter: drawing on a variety of search techniques, it quickly identifies the documents that match the user’s query, no matter how big your database is.
The generative model then uses these documents as a sort of ground truth upon which to base its response. At the same time, it can still use its powerful abstractive abilities: it can rephrase, summarize, or even translate the information contained in your documents.
RAG thus turns your LLM into a highly skilled domain expert who can provide reliable information within seconds.
Leveraging the power of LLMs in real-world products
At deepset, we see LLMs as an – admittedly awesome – tool to revolutionize the handling of textual data.
Our managed LLM platform deepset Cloud takes the hassle of managing a complex AI project out of our customers’ hands. Even teams with limited knowledge of the complexities of modern LLMs can use deepset Cloud to build powerful, slick natural-language interfaces on top of their data.
To learn more about retrieval augmented generation, check out our CTO Malte's webinar on Connecting GPT with Your Data. In the session with O'Reilly Media, Malte talks about battling hallucinations with RAG and building production-ready applications with retrieval augmented LLMs.