AI in the Cloud and On-Prem: Going the Extra Mile for Our Customers

How our partnerships with NVIDIA and Quest1 allow us to implement customized solutions—from utilizing existing local infrastructure to our fully managed service—to meet each customer's needs

How we implement LLM-based systems—like Retrieval Augmented Generation (RAG), Agents, or Intelligent Document Processing (IDP)—in production depends a lot on the customer's specific situation. Whether they're using managed services from providers like Anthropic, OpenAI, or Amazon Bedrock, or hosting open-source LLMs themselves, we're here to make sure they get the best, most seamless implementation possible. These days, we're seeing more and more customers wanting to deploy all or part of their system on premises, and we're excited to help them tackle the unique challenges that come with it.

Together with our partners Quest1 and NVIDIA, we ran a benchmarking exercise to determine the best combination of text embedding model and hardware for a customer who was looking to embed billions of documents on premises. Let’s take a closer look at different implementation paradigms, and how we work with partners to ensure the most seamless experience possible for each and every customer.

Implementation Options between Cloud and Ground

Implementing an LLM-based solution in the cloud means running it on remote servers using a cloud provider like AWS, Google Cloud, or Microsoft Azure. Running it locally (on the ground) means deploying it in an organization's own infrastructure—giving them full control over the application. This latter implementation is often referred to as on premises or "on-prem". 

To understand the different solutions, we need to understand the difference between types of language models. While LLMs such as GPT-4 and Claude have been receiving massive attention online, they are often run in tandem with much smaller models, usually embedding models. 

Embedding models typically have far fewer parameters than LLMs, which makes them easier to store and cheaper to run. These smaller models help index, retrieve, rank, and classify data. Embedding models and LLMs work together in RAG, IDP, and most agentic applications. Thanks to the modular nature of modern composable AI systems, however, the two can run in different locations within an implementation and communicate remotely.
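To make this concrete, here is a minimal sketch of what an embedding model does, using the open-source sentence-transformers library. The model name and the example texts are arbitrary choices for illustration, not a recommendation:

```python
# Minimal sketch: a small, locally hosted embedding model turns text into vectors
# that can be compared for semantic similarity. The model name is just an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # ~22M parameters

docs = [
    "The warranty covers parts and labor for two years.",
    "Our headquarters are located in Berlin.",
]
query = "How long is the warranty?"

doc_embeddings = model.encode(docs)    # one vector per document
query_embedding = model.encode(query)  # one vector for the query

# Cosine similarity ranks the documents by relevance to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the warranty sentence scores highest
```

A model this size runs comfortably on modest local hardware, which is exactly why the embedding step is often the first candidate for on-prem deployment.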

100% cloud

Purely cloud-based deployment is the default in deepset Cloud. Having a managed service means you don't need to worry about infrastructure setup, maintenance, or scaling as your needs grow—the platform handles all of this automatically. Especially during development, working in the cloud is the only viable option because of the need to spin up different complex pipelines and run them side by side.

On-prem

In data-driven solutions like RAG, an organization indexes documents in vector databases for use with LLMs. While this is commonly done in the cloud, not all organizations can expose their data in this way. Instead, they keep their data on premises and limit what leaves their infrastructure to the retrieved context the external LLM needs to answer a given query.
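As an illustration of this pattern, here is a minimal sketch using the open-source Haystack framework: documents are embedded and retrieved entirely with local components, and only the retrieved context leaves the pipeline as part of the prompt sent to the remote LLM. The model names, the in-memory document store, and the OpenAI call are placeholder assumptions for the sketch, not any particular customer's setup:

```python
# Minimal sketch: embedding and retrieval run locally; only the retrieved
# context is sent to the remote LLM. Requires OPENAI_API_KEY for the last step.
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Index documents with a locally hosted embedding model.
store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
docs = [Document(content="Our warranty covers parts and labor for two years.")]
store.write_documents(doc_embedder.run(docs)["documents"])

# Query pipeline: local embedding and retrieval, remote LLM for generation.
template = """Answer based only on the context.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}"""

pipe = Pipeline()
pipe.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=store))
pipe.add_component("prompt", PromptBuilder(template=template))
pipe.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))  # the only remote call
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt.documents")
pipe.connect("prompt.prompt", "llm.prompt")

question = "How long does the warranty last?"
result = pipe.run({"embedder": {"text": question}, "prompt": {"question": question}})
print(result["llm"]["replies"][0])
```

In a production on-prem setup, the in-memory store would typically be replaced by a locally hosted vector database, but the division of labor stays the same: raw documents never leave the organization, only the handful of retrieved snippets do.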

To run any model on-prem, an organization must have the right hardware resources. As we'll see below, deepset provides the expertise and resources to methodically test our customers' hardware to ensure that it is up to the task they intend to run on it.

Cloud to ground

Developing a production pipeline consumes many times the computing resources needed once it's in production: you're constantly reindexing the same data with different models, and you want to evaluate and optimize several pipelines at the same time. That's why virtually all AI products are developed in the cloud, even if the production pipeline is later moved on premises in a "cloud to ground" scenario. deepset Cloud can be used to develop pipelines in the cloud before deploying them on-prem.

When you're working with cloud-to-ground, you often want to prevent data from leaving local silos during development. The solution is to use synthetic data that closely matches the form and content of the real data, so the system can be tuned without ever seeing the real records. This ensures that sensitive customer data doesn't touch the LLM pipeline until the pipeline runs in-house.
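One simple way to produce such data is to have an LLM generate fictional records that follow the same structure as the real ones. The sketch below illustrates the idea; the record schema, prompt, and model are hypothetical examples, not a description of how any specific project generates its synthetic data:

```python
# Minimal sketch: generate synthetic documents that mimic the structure of real
# records, so pipelines can be developed in the cloud without exposing real data.
# Schema, prompt, and model name are hypothetical; requires OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

SCHEMA_DESCRIPTION = """
Each record is a library catalog entry with the fields:
title, author, year, and a two-sentence abstract.
"""

def generate_synthetic_records(n: int = 5) -> list[str]:
    records = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You create realistic but entirely fictional sample data."},
                {"role": "user", "content": f"Create one fictional record.\n{SCHEMA_DESCRIPTION}"},
            ],
        )
        records.append(response.choices[0].message.content)
    return records

for record in generate_synthetic_records(3):
    print(record, "\n---")
```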

Open hybrid cloud

Many organizations need the scalability and advanced services of public clouds, but they also need tight control over sensitive data and legacy systems. They worry about becoming dependent on a single cloud provider's ecosystem, which can be costly and difficult to adapt as their needs change. Without the right approach, they end up locked into proprietary tools and APIs that only work with specific cloud providers, limiting their ability to move workloads or adopt better solutions as they emerge. That's where an open hybrid cloud strategy comes in. 

By combining private on-premises resources, public cloud services, and a foundation of open source software, this approach ensures that organizations can use the same tools and interfaces no matter where their applications run. deepset Cloud integrates seamlessly into such hybrid architectures.

Case study: Embedding 7 billion documents locally

One of our knowledge management customers had the following problem: They wanted to index their large knowledge base of seven billion library records using an embedding model. Since they already had NVIDIA GPUs—valuable hardware optimized for AI inference—they wanted to know if it was possible to perform this task locally rather than in the cloud. They also wanted to evaluate which embedding model would work best on their hardware. Unfortunately, there were no publicly available benchmarks for this generation of GPUs.

This is a very common "real world" problem for many organizations. If your hardware is a little outdated, does that mean you have to spend tens of thousands of dollars on new hardware (which will itself become obsolete)? Or that you can't use your local infrastructure for AI inference at all (even though it was built for it)?

Cloud solutions are often easier to use because the hardware is always up-to-date and the resources scale automatically in the background. But we believe that if an organization has an AI inference infrastructure in place, it should make the most of it.

Together with Quest1, we conducted a benchmarking exercise based on existing NVIDIA benchmarks to answer the customer's questions in a reliable and methodical way. Not only did the benchmark give the customer confidence that their hardware could handle the billions of documents, but we also helped them identify the most appropriate model for the task. We will soon be publishing the detailed results in an in-depth technical report.
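The detailed methodology will be in that report. As a rough illustration of the underlying idea, a minimal throughput measurement could look like the sketch below; the model name, batch size, and stand-in documents are placeholder assumptions, not the configurations we actually benchmarked:

```python
# Minimal throughput sketch (not the actual benchmark): measure how many documents
# per second a candidate embedding model processes on a local GPU, then extrapolate
# to the full corpus. A real benchmark would use representative documents.
import time
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/multilingual-e5-large"  # example candidate model
BATCH_SIZE = 128
NUM_DOCS = 10_000                # sample size for the measurement
CORPUS_SIZE = 7_000_000_000      # the full corpus from the case study

model = SentenceTransformer(MODEL_NAME, device="cuda")
sample_docs = ["A short library record used as a stand-in document."] * NUM_DOCS

# Warm-up run so model loading and CUDA initialization don't skew the timing.
model.encode(sample_docs[:BATCH_SIZE], batch_size=BATCH_SIZE)

start = time.perf_counter()
model.encode(sample_docs, batch_size=BATCH_SIZE, show_progress_bar=False)
elapsed = time.perf_counter() - start

docs_per_second = NUM_DOCS / elapsed
print(f"{docs_per_second:,.0f} docs/s")
print(f"Estimated time for full corpus: {CORPUS_SIZE / docs_per_second / 86_400:.1f} days on one GPU")
```

Repeating a measurement like this across candidate models, batch sizes, and GPUs is what lets you compare hardware and embedding models on equal footing before committing to a multi-week indexing job.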

Talk to an AI Expert

By combining our expertise in designing, building, and deploying custom LLM-based systems with Quest1's focus on helping customers integrate AI into their applications, deepset provides customers with powerful solutions that are perfectly tailored to their needs, both at the business level and at the implementation level.

Do you also have unique infrastructure or security requirements? Maybe you're sitting on some hardware you're not using and want to find out how best to use it to bring AI to your business? Or maybe you just want to learn more about the differences between the various deployment options we outlined above—and understand which one is best suited to your organization's unique needs and conditions. We'd be happy to help. Book a meeting here.