BEST PRACTICES

Leveraging Metadata in RAG Customization

Overlooked and underestimated: the many ways in which metadata can improve generative AI applications

16.10.24

Retrieval augmented generation (RAG) systems are a powerful approach to extending the capabilities of large language models (LLMs). They make LLM output more reliable by retrieving relevant, fact-checked data from a knowledge base in response to a user query and passing it to the model.

While the focus is usually on the data used in these systems, an important element is often overlooked: the metadata. This "data about data" has the potential to significantly improve the performance and versatility of RAG-based AI solutions, and the best part is that it's often readily available at no additional cost. In fact, when used properly, metadata can save you a lot of time and money!

Understanding how to identify valuable metadata and where to apply it in the RAG process is critical to realizing these benefits. In this post, we will explore the role of metadata in RAG customization and how it can be used to improve AI systems. We'll examine strategies for integrating metadata to improve preprocessing, document retrieval, answer quality, and overall system functionality in RAG implementations, using real-world examples from our customer base.

What is metadata?

Metadata is data that provides additional information about a data point itself. It is often a byproduct of the process that created the data in the first place. For example, a digital photo has a timestamp and location, while an online article has the date and time it was published, the author, and the department that published it. New metadata can also be created to enrich the dataset. For example, product reviews can be classified as negative or positive.

What is a RAG system?

Retrieval augmentation means including a document retrieval component before the generation step that uses an LLM. The retrieval component identifies documents that match the query and includes them in the prompt to the LLM, instructing it to base its answer on the information contained in the documents. If this is still a bit unclear, be sure to read our blog post about RAG!

A complete RAG system consists not only of the query pipeline itself, but also of the preprocessing step that precedes it. As we'll see, metadata can improve the system at both preprocessing and query time.

How can metadata improve RAG?

Metadata can be an asset to RAG systems at various stages of the process, from preprocessing to retrieval to final output generation by the LLM.

During preprocessing

The preprocessing phase is where we prepare our documents for future retrieval. Raw text data is extracted, cleaned, broken into chunks, and indexed in a database. One way metadata can be used in preprocessing is as a filter.

One of our customers in the legal industry uses document categories to discard documents that are unnecessary for their use case. This ensures that only valuable content is added to the database, saving costs and reducing noise in the dataset.
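A minimal sketch of such a category filter, assuming each raw document is a dictionary with `content` and `meta` keys. The field names and category values are hypothetical and only serve as illustration:

```python
# Hypothetical allow-list; the category names are illustrative only.
ALLOWED_CATEGORIES = {"contract", "court_ruling"}


def filter_by_category(documents, allowed=ALLOWED_CATEGORIES):
    """Keep only documents whose 'category' metadata is in the allow-list."""
    return [
        doc for doc in documents
        if doc.get("meta", {}).get("category") in allowed
    ]


raw_documents = [
    {"content": "Lease agreement text", "meta": {"category": "contract"}},
    {"content": "Office party invitation", "meta": {"category": "internal_memo"}},
    {"content": "Ruling on appeal", "meta": {"category": "court_ruling"}},
    {"content": "Scanned page without metadata", "meta": {}},
]

# Only the filtered documents would go on to cleaning, chunking and indexing.
to_index = filter_by_category(raw_documents)
print(len(to_index))
```

Documents without a category are dropped here as well. Whether that is the right default depends on how complete your metadata is.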

During retrieval

The retrieval step is where relevant data is pulled from the database to augment the query to the LLM. Metadata can be used in a number of ways to enhance this process. First, it can act as a filter to narrow the search results. Such a filter can be hard-coded into the system (for example, a news-oriented system could automatically exclude articles older than a certain date), or user-controlled (users specify metadata-based filters themselves, or the system extracts them from the query, providing more precise and relevant results).

One of our clients, a VC firm, has built metadata filtering into their internal search system for their investment opportunities. Users have the ability to add a filter to their query that limits the search to specific companies.
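The sketch below illustrates the idea of filtering on metadata before scoring. It is not the client's implementation: the `company` field, the sample data, and the naive term-overlap scoring are all assumptions that stand in for a real retriever.

```python
def matches_filters(meta, filters):
    """True if every filtered field equals the given value or is in the given collection."""
    for field, allowed in filters.items():
        value = meta.get(field)
        if isinstance(allowed, (set, list, tuple)):
            if value not in allowed:
                return False
        elif value != allowed:
            return False
    return True


def filtered_search(query, documents, filters):
    """Apply the metadata filter first, then rank by a naive term-overlap score."""
    query_terms = set(query.lower().split())
    candidates = [d for d in documents if matches_filters(d["meta"], filters)]
    scored = []
    for doc in candidates:
        overlap = len(query_terms & set(doc["content"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Hypothetical sample data.
documents = [
    {"content": "Series A funding round closed in March", "meta": {"company": "Acme"}},
    {"content": "Series B funding round under discussion", "meta": {"company": "Globex"}},
    {"content": "New office lease signed", "meta": {"company": "Acme"}},
]

results = filtered_search("funding round", documents, {"company": ["Acme"]})
print([doc["content"] for doc in results])
```

Filtering before scoring keeps the candidate set small, which is usually cheaper than retrieving everything and discarding results afterwards.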

In addition to acting as a filter, metadata can also be used to enrich the search process and find more precise matches from the database. In a keyword-based retrieval system (such as BM25), metadata can be embedded in specific search fields, allowing the system to find matches beyond the document content.
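To make the idea of searchable metadata fields concrete, here is a deliberately simplified sketch. It counts term hits per field and weights them, which is not BM25 but shows how a query can match on metadata even when the content does not. Field names, weights, and sample documents are hypothetical.

```python
def field_score(query, doc, field_weights):
    """Count query-term hits per field and combine them with per-field weights."""
    terms = query.lower().split()
    # Treat the content and every metadata field as separately searchable fields.
    searchable = {"content": doc["content"], **doc.get("meta", {})}
    score = 0.0
    for field, weight in field_weights.items():
        tokens = str(searchable.get(field, "")).lower().split()
        score += weight * sum(tokens.count(term) for term in terms)
    return score


# Hypothetical sample documents.
report = {
    "content": "Quarterly results exceeded expectations",
    "meta": {"title": "Earnings report", "author": "Jane Doe"},
}
calendar = {
    "content": "The earnings call is scheduled for Friday",
    "meta": {"title": "Calendar update", "author": "John Roe"},
}

weights = {"content": 1.0, "title": 2.0, "author": 2.0}
print(field_score("earnings doe", report, weights))
print(field_score("earnings doe", calendar, weights))
```

The first document matches only through its title and author, yet it outscores the second one, whose match comes from the content alone.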

In semantic search, metadata can be embedded along with the document content itself. This creates a richer representation of the documents, potentially leading to more accurate results.

One of our clients uses social media posts to measure sentiment across demographic groups. They embed not only the text, but also metadata that provides valuable context, such as time, date, region, and topics of the posts. 
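One simple way to do this, sketched here under assumptions, is to prepend selected metadata fields to the text before it is sent to the embedding model. The helper name, field names, and sample post are hypothetical, and the embedding call itself is left out because it depends on the model you use.

```python
def build_embedding_text(doc, fields):
    """Prefix the content with selected metadata so both end up in the embedding."""
    meta = doc.get("meta", {})
    parts = []
    for field in fields:
        if field in meta:
            value = meta[field]
            if isinstance(value, (list, tuple)):
                value = ", ".join(str(v) for v in value)
            parts.append(f"{field}: {value}")
    parts.append(doc["content"])
    return "\n".join(parts)


# Hypothetical social media post.
post = {
    "content": "Loving the new bike lanes downtown!",
    "meta": {
        "date": "2024-05-12",
        "region": "Berlin",
        "topics": ["mobility", "cycling"],
    },
}

text_to_embed = build_embedding_text(post, ["date", "region", "topics"])
print(text_to_embed)
```

The resulting string, rather than the bare post text, is what you would pass to the embedding model. Which fields actually help is worth testing on your own data.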

If you want to learn more about the different types of document search and how to combine them for better performance, read our article about hybrid retrieval.

During ranking

Many RAG configurations use a ranking component after retrieval. Rankers can reorganize the retrieved documents so that the most relevant ones are at the top of the list and the others can be discarded. Using metadata to rank retrieved documents can help surface the most relevant information for each query.

An NGO has built a tool that allows its employees to search over internal policy documents. Since different iterations of a policy may exist in the database, they use the "date" metadata on these documents to rank more recent versions of a policy higher.
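A recency-aware ranking step could look roughly like the sketch below. It assumes that each retrieved document carries a relevance `score` from the retriever and an ISO-formatted `date` in its metadata. The field names, the `recency_weight` parameter, and the sample values are hypothetical and would need tuning.

```python
from datetime import date


def rank_with_recency(documents, recency_weight=0.3):
    """Blend each document's relevance score with a normalized recency score."""
    if not documents:
        return []
    dates = [date.fromisoformat(d["meta"]["date"]) for d in documents]
    oldest, newest = min(dates), max(dates)
    span_days = (newest - oldest).days or 1  # avoid division by zero

    def combined(doc):
        doc_date = date.fromisoformat(doc["meta"]["date"])
        recency = (doc_date - oldest).days / span_days  # 0 = oldest, 1 = newest
        return (1 - recency_weight) * doc["score"] + recency_weight * recency

    return sorted(documents, key=combined, reverse=True)


# Hypothetical retrieval results.
retrieved = [
    {"content": "Travel policy v1", "score": 0.82, "meta": {"date": "2021-03-01"}},
    {"content": "Travel policy v2", "score": 0.80, "meta": {"date": "2023-09-15"}},
    {"content": "Expense policy", "score": 0.30, "meta": {"date": "2024-01-10"}},
]

ranked = rank_with_recency(retrieved)
print([doc["content"] for doc in ranked])
```

With these sample values, the newer version of the travel policy moves ahead of the older one even though its raw relevance score is slightly lower.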

In the LLM prompt

The prompt is the instruction that the LLM receives and uses to generate output. The more relevant information our prompt contains, the better, so naturally we want to use metadata at this point in the RAG pipeline as well. In addition to providing valuable context to the LLM, metadata can also help it generate more helpful responses. For example, when referencing documents, it can refer to their title, page number, publication date, and so on.

A news platform uses the genre of an article to help the LLM interpret it. In addition, the LLM is instructed to calculate the article's recency using metadata to better contextualize the information it contains.
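As an illustration, a prompt template along these lines might be assembled as follows. This is a sketch, not the platform's actual prompt: the metadata fields (`title`, `genre`, `published`), the wording of the instruction, and the sample article are assumptions.

```python
from datetime import date


def format_document(index, doc):
    """Render one retrieved document together with its metadata."""
    meta = doc["meta"]
    return (
        f"[{index}] {meta['title']}\n"
        f"Genre: {meta['genre']}\n"
        f"Published: {meta['published']}\n"
        f"{doc['content']}"
    )


def build_prompt(question, documents, today):
    """Combine instruction, today's date, documents and question into one prompt."""
    context = "\n\n".join(
        format_document(i, doc) for i, doc in enumerate(documents, start=1)
    )
    return (
        f"Today's date: {today.isoformat()}\n"
        "Answer the question using only the documents below. "
        "Take each document's genre and age into account, "
        "and cite sources by title and publication date.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved article.
articles = [
    {
        "content": "The council should reconsider the new parking rules.",
        "meta": {"title": "Parking chaos", "genre": "opinion", "published": "2024-09-30"},
    },
]

prompt = build_prompt("What are the new parking rules?", articles, date(2024, 10, 16))
print(prompt)
```

Passing today's date alongside the publication date gives the LLM what it needs to judge how recent each article is.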

MLOps and the evaluation process

It's hard to overstate the usefulness of metadata for the core RAG pipeline. Beyond that, it can play an important role in the broader operational and evaluation processes:

  • Performance tracking: You can analyze system performance on different segments of your data by document type, time period, or other relevant categories.
  • Continuous improvement: Insights from metadata analysis can help you prioritize areas for model fine-tuning or dataset enrichment.
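Performance tracking by segment can be sketched in a few lines, assuming you already have evaluation records that mark each answer as correct or not and carry the metadata of the document involved. The record layout and the `doc_type` field are hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(eval_records, field):
    """Group evaluation results by a metadata field and compute accuracy per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in eval_records:
        segment = record["meta"].get(field, "unknown")
        totals[segment] += 1
        correct[segment] += int(record["correct"])
    return {segment: correct[segment] / totals[segment] for segment in totals}


# Hypothetical evaluation results.
eval_records = [
    {"correct": True, "meta": {"doc_type": "policy"}},
    {"correct": True, "meta": {"doc_type": "policy"}},
    {"correct": True, "meta": {"doc_type": "faq"}},
    {"correct": False, "meta": {"doc_type": "faq"}},
]

per_type = accuracy_by_segment(eval_records, "doc_type")
print(per_type)
```

A breakdown like this shows which segments lag behind and are therefore good candidates for dataset enrichment or fine-tuning.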

Include metadata in your RAG strategy from the start

As we have seen, metadata can be extremely helpful in building high-performance RAG pipelines. So it pays to think about how to curate metadata along with your data points from the very beginning of your project. Here are a few tips:

  1. Data collection: Think not only about the source files you want to use in your project, but also about the metadata you want to collect along with them. Sometimes valuable metadata can be inferred from seemingly mundane details such as folder structure. Collecting such metadata after the fact may be painful or impossible.
  2. Data engineering: Think about what metadata could be retroactively added or inferred from other metadata fields. For example, you may want to include a "year" metadata field and realize that it can be extracted from the "published_date" metadata field. It's also important to think about the metadata generated during preprocessing and indexing itself. For example, when you're chunking the source files into smaller documents, as is often the case, you should retain information that links each chunk back to its source file, such as the page number.
  3. Metadata usage: Which metadata should you use at which points in the pipeline? As with most decisions, some of this can be based on obvious need, intuition, or experimentation. In general, it helps to lean toward adding and using more of the metadata fields, as they are more likely to help than hurt performance.
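The data engineering tip can be sketched as follows: derive a "year" field from "published_date" and carry the source metadata plus a page number into every chunk. The sketch assumes ISO-formatted dates and a source file that has already been split into page texts. The helper names, the `file_name` field, and the word-based chunking are hypothetical simplifications.

```python
from datetime import date


def add_year(meta):
    """Return a copy of the metadata with 'year' derived from 'published_date'."""
    enriched = dict(meta)
    enriched["year"] = date.fromisoformat(meta["published_date"]).year
    return enriched


def chunk_pages(pages, source_meta, max_words=50):
    """Split each page into word-based chunks that keep source metadata and page number."""
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for start in range(0, len(words), max_words):
            chunks.append({
                "content": " ".join(words[start:start + max_words]),
                "meta": {**source_meta, "page": page_number},
            })
    return chunks


# Hypothetical source file with two pages.
source_meta = add_year({"file_name": "policy.pdf", "published_date": "2023-09-15"})
pages = ["one two three four five six", "seven eight"]

chunks = chunk_pages(pages, source_meta, max_words=4)
for chunk in chunks:
    print(chunk["meta"]["page"], chunk["content"])
```

Because every chunk carries the file name, year, and page number, later stages such as filtering, ranking, and citation in the prompt can rely on them without going back to the source file.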

Metadata: A silver bullet for RAG system improvement

In information processing systems, context is everything. Metadata is a cheap and powerful tool for enhancing the context of your data across the board. In our experience, if your RAG system isn’t using any metadata, you’re doing something wrong.

Metadata offers improvements at every stage of the RAG process. By thoughtfully integrating metadata into your AI solution, you can create more robust applications that deliver higher-quality results to end users.
