Long-Context LLMs and RAG

Context windows are growing. What does that mean for retrieval augmented generation (RAG)?

Photo by Böhringer Friedrich on Wikimedia: https://commons.wikimedia.org/wiki/File:B%C3%B6dele_Bregenzerwald_Panorama.jpg

Large language models (LLMs) continue to evolve. Recently, we've seen a dramatic increase in the size of context windows – the space reserved for passing data to an LLM to base its response on. In this blog post, we explore how Long-Context Language Models (LCLMs) could impact approaches to retrieval augmented generation (RAG), which has been the de facto standard setup for eliciting useful and fact-based responses from LLMs. We've also set up a demo so you can see for yourself the difference between LCLM RAG and basic RAG with shorter text chunks.

Steadily Growing Context Windows

A context window is the maximum amount of text that an LLM can process at once. It determines how much information the model can consider when generating a response. Larger context windows allow LLMs to handle more extensive and complex inputs, potentially leading to more informed and coherent outputs.

In the past year, leading LLMs have substantially increased their context windows: GPT-4, Claude 3, and Gemini now accept hundreds of thousands to millions of tokens. A context window of 1 million tokens corresponds to roughly 750,000 words, or about a 1,500-page book. This is a big change from earlier LLMs, which could only handle a few thousand tokens at a time.
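
To get a feel for these numbers, it helps to count tokens on your own data. Below is a minimal sketch using the tiktoken library (the tokenizer family used by OpenAI models; other providers tokenize differently, so treat the result as a rough estimate). The input file name is just a placeholder.

```python
# Rough token-count estimate for a document, assuming the tiktoken library.
# Other model families use different tokenizers, so the numbers are approximate.
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return how many tokens `text` occupies under the given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

document = open("contract.txt").read()  # placeholder input file
n_tokens = count_tokens(document)
print(f"{n_tokens} tokens, roughly {int(n_tokens * 0.75)} words")
```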

Longer context windows mean that more documents, and more varied data formats, can be sent to the LLM with each request. A single prompt can now hold long texts, structured data, and code side by side, allowing the LLM to process more complex information in one go.

This has a particularly big impact on tasks that require understanding and combining large amounts of information. An LLM with a large context window could, for example, process an entire contract or body of legal precedent at once, making its answers more accurate and giving it a view of the bigger picture. In science, it could analyze many research papers simultaneously and surface connections and ideas across studies. For writing and content creation, it could maintain consistent themes and details across long narratives or technical documents.

RAG

Retrieval augmented generation (RAG) has established itself as the go-to solution for enhancing LLMs with external knowledge. RAG breaks large documents into smaller chunks, usually between 100 and 1,000 tokens, which are then embedded, indexed, and stored in a database. When a query comes in, the most relevant chunks are retrieved based on their semantic similarity to the query and passed to the LLM together with the original question. This chunking step is what allows RAG to handle datasets far larger than the LLM's context window.
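
As an illustration, here is a minimal sketch of that chunk-index-retrieve loop. It assumes the sentence-transformers library for embeddings and keeps the vectors in memory; a production RAG system would typically use a vector database, and the file name, model choice, and query are placeholders.

```python
# Minimal RAG sketch: chunk a document, embed the chunks, retrieve by similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 500  # tokens are approximated here by whitespace-separated words

def chunk(text: str, size: int = CHUNK_SIZE) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

model = SentenceTransformer("all-MiniLM-L6-v2")

# "Index": embed every chunk once and keep the vectors around.
document = open("report.txt").read()  # placeholder input file
chunks = chunk(document)
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector  # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# The retrieved chunks plus the question become the prompt for the LLM.
context = "\n\n".join(retrieve("What were the main findings?"))
```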

Long Context and RAG: A Demo

A recent study by Li et al. (2024) found that LCLMs slightly outperformed short-context RAG on multi-hop reasoning and on understanding implicit queries in long narratives. This is likely because LCLMs have access to a larger, more coherent context and can therefore draw connections between pieces of information scattered throughout a document. LCLMs also often provide more detailed answers than simple RAG implementations, especially for complex queries that require combining information from multiple sources. Check out our latest demo comparing long-context RAG and basic RAG and see how the two approaches handle queries side by side.

The Pros and Cons of Long Context

In an LCLM setup, the entire document or dataset can be fed directly into the model without any chunking. This allows the model to consider the full context of the information, potentially leading to more coherent and contextually accurate responses. The key difference lies in preprocessing: while RAG with smaller context windows requires significant upfront work to chunk and index documents, LCLMs often need much less preparation.
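
As a contrast to the RAG sketch above, here is what the long-context approach can look like: the whole document goes into a single prompt. This assumes an OpenAI-style chat completions client; the model name, file name, and question are placeholders to adapt to your provider.

```python
# Long-context sketch: no chunking, the full document travels in one prompt.
from openai import OpenAI

client = OpenAI()
document = open("report.txt").read()  # placeholder input file

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any long-context model
    messages=[
        {"role": "system", "content": "Answer using only the provided document."},
        {"role": "user", "content": f"Document:\n{document}\n\nQuestion: What were the main findings?"},
    ],
)
print(response.choices[0].message.content)
```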

But keeping token counts low isn't just a workaround for small context windows; it is also an effective way to keep cost and latency under control, even for long-context LLMs. Since processing more tokens means consuming more computational resources and spending more money and time, it would be a bad idea to dump an entire dataset into each API call to the LLM. However, the trend toward larger context models opens up the possibility of feeding more data into the LLM when traditional RAG isn't enough. LCLMs can be useful when multiple documents need to be compared, or when more continuous context is needed – for example, an entire document rather than just chunks of it.
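
To put the cost argument in perspective, here is a quick back-of-the-envelope comparison. The price and token counts below are purely hypothetical placeholders; the point is the ratio between sending the full dataset with every request and sending only a query plus a few retrieved chunks.

```python
# Hypothetical cost comparison; substitute your provider's actual pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.005  # placeholder price in USD

scenarios = {
    "long context (full dataset)": 200_000,  # input tokens per request
    "basic RAG (query + chunks)": 3_000,
}

for name, tokens in scenarios.items():
    cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    print(f"{name}: ~{tokens:,} input tokens, ~${cost:.2f} per request")
```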

Long Context on Demand

Long context isn’t always needed, but it can be beneficial for answering more complex questions. Recent research therefore proposes hybrid approaches that combine LCLMs and RAG in one setup. For instance, the paper by Li et al. mentioned above introduces a method called "Self-Route," which dynamically chooses between RAG and LCLM based on the model's self-assessment of whether it can answer a query using only the retrieved information. This approach achieved performance comparable to LCLMs while significantly reducing computational costs.
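
To make the idea concrete, here is a rough sketch of that routing logic. It is an illustration of the concept rather than the authors' implementation: `retrieve` and `ask_llm` are hypothetical placeholders for your retrieval step and LLM call.

```python
# Concept sketch of Self-Route-style routing: try RAG first, fall back to long context.
def self_route(query: str, full_document: str, retrieve, ask_llm) -> str:
    """Answer from retrieved chunks if possible; otherwise use the full document."""
    context = "\n\n".join(retrieve(query))
    rag_prompt = (
        "Answer the question using only the context below. "
        "If the context is not sufficient, reply with exactly 'UNANSWERABLE'.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    answer = ask_llm(rag_prompt)

    # Route to the long-context path only when retrieval was not enough.
    if "UNANSWERABLE" in answer:
        long_prompt = f"Document:\n{full_document}\n\nQuestion: {query}"
        answer = ask_llm(long_prompt)
    return answer
```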

Longer or Shorter Contexts?

As the LLM universe continues to grow, we encourage you to explore different approaches and see how they perform in different scenarios. Try it out for yourself and let us know what you think! 🚀