RAGs to Riches: The New Era of AI Inferencing

We are entering the age of AI inference, which, according to conventional wisdom, is destined to dwarf the size of the AI training market. NVIDIA's own estimates for AI's computing requirements grew roughly 100-fold in 12 months, as CEO Jensen Huang noted during his GTC 2025 keynote. That's thanks partly to the rise of agentic AI, expected to be a cornerstone of inference, he said.
Most inference does not require the massive datacenters that large language model (LLM) training does. Some workloads will, but many if not most will run comfortably in enterprise facilities, on laptops, or even on embedded devices. That means fewer eye-popping statistics about millions of GPUs and gigawatt-sized facilities. But inference matters, because it’s the way enterprises and employees wring value from AI and LLMs.
Inferencing Grows Its Influence on AI
NVIDIA and AMD both reasserted the importance of inference during their early 2025 product launch events. Huang’s 100X statement, although offered without hard-number context, is partly a reaction to the way reasoning models iterate, making multiple attempts to derive an answer. AMD expects the AI inference market to grow at 80% per year “for the next few years,” CEO Lisa Su said recently, although she offered neither market-size figures nor a more precise timeframe for context.
Retrieval-augmented generation (RAG) is a crucial element of inference, especially in generative AI (GenAI) use cases. It brings more information and context into consideration, adding to the knowledge already present in an LLM. RAG is in use today and will likely become more complex as AI agents become commonplace.
What is RAG?
RAG is a framework to improve the performance of AI models, including LLMs and small language models (SLMs), by integrating external data, making AI applications more accurate for specific use cases. This is not the same as training or fine-tuning a model on, say, the jargon specific to an industry. Training is slow and expensive, whereas RAG operates on the fly during inferencing, infusing a model with information relevant to the current query.
RAG is also useful for keeping up with changing data. An inherent problem with any trained model is that it was not trained right now. This doesn't always matter—such as when digging through years-old earnings reports—but for situations built around dynamic, real-time information, it's a shortcoming. Chances are your own interactions with LLMs have included RAG.
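To make that concrete, here is a minimal, illustrative sketch of the RAG pattern in Python. It is not drawn from any particular product: the embed() and generate() functions are toy stand-ins (a bag-of-words vector and a dummy model call), where a real deployment would use an embedding model, a vector database, and a hosted LLM.

```python
# Minimal, illustrative RAG loop: retrieve relevant snippets at query time,
# then prepend them to the prompt sent to a language model.
# embed() and generate() are toy stand-ins, not a real embedding model or LLM API.

import math
from collections import Counter

DOCUMENTS = [
    "Q3 revenue grew 12% year over year, driven by cloud services.",
    "The support portal is migrating to a new login system in June.",
    "Inference clusters in the Dallas facility were expanded last week.",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank stored documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call (e.g., a hosted model API)."""
    return f"[model response based on a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    # Augment the prompt with retrieved context, then ask the model.
    context = "\n".join(retrieve(query))
    prompt = f"Use this context to answer:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(answer("How did revenue change in Q3?"))
```

The essential move is the same at any scale: fetch the most relevant context at query time and attach it to the prompt, so the model answers from current data rather than only from what it was trained on.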
In our first report on RAG, Futuriom goes over some of the drivers for RAG as well as why it exists. We also detail how some key technologies are evolving to support RAG. These include:
- Inferencing approaches and architectures.
- Interest in smaller AI models.
- Databases, including vector databases and vector search.
- Networking and security.
So read up! You can download the report right here. Below are some of the key findings.
Some companies highlighted in this report: Aryaka, Amazon, Aviatrix, Chroma, Cisco, Cloudflare, Couchbase, Crunchy Data, Databricks, DDN, Dell, Deep Lake, Elastic, F5, Fortanix, Google, Hitachi Vantara, HPE, Infinidat, Juniper, LanceDB, Microsoft, MinIO, MongoDB, Neon, NetApp, Nuclia, NVIDIA, OpenSearch, Oracle, Pinecone, Pryon, Pure Storage, Qdrant, Ragie, Snowflake, TileDB, Timescale, VAST Data, Vectara, Veeam, Versa, Vultr, Weaviate, Weka, Yugabyte, and Zilliz.
Highlights and Key Findings
- Inferencing is the heart of enterprise AI. Enterprises will still train specialized models, but they can't reap the benefits of AI until they become experts at inference.
- Retrieval-Augmented Generation (RAG) is a practical way to infuse LLMs with additional data. Thanks to larger LLM context windows, users can stuff quite a bit of information directly into a query. RAG, however, is less costly in terms of tokens and better suited to dynamic, real-time data.
- Vector databases are having their moment. By storing documents, images, and other data as vectors, enterprises can use RAG to perform multimedia searches. This has led major data players such as Oracle, Databricks, and Snowflake to incorporate vector support into their products.
- Vector indexing and vector search are crucial database features. Any database can store vectors. But with vectors numbering in the billions for individual enterprises, it's the ability to search them instantly that gives vector databases their appeal (a minimal similarity-search sketch follows this list).
- RAG sprawl is an issue for early adopters. This has led to the rise of RAG-as-a-service (RaaS) and turnkey RAG—cloud-based options that can abstract away the details of different RAG approaches.
- Agentic AI will unlock more ambitious RAG and inference. AI-driven agents can satisfy more complex queries that require multiple steps.
- The Model Context Protocol (MCP) is accelerating the maturity of agentic AI. It's an esoteric under-the-covers protocol, but developers have leapt onto MCP as a way to make AI handle sophisticated tasks.
- Security is a major issue in all this. That's true of RAG and doubly so for MCP. The good news is that these issues are getting attention. Companies should strive to allay security concerns with the proper architecture and security tools.
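Picking up the vector-search point above: the sketch below shows the core operation in miniature, a brute-force nearest-neighbor search over in-memory embeddings using NumPy. The vectors and query are random placeholders standing in for document and query embeddings. At billions of vectors this brute-force scan becomes impractical, which is why purpose-built vector databases rely on approximate-nearest-neighbor indexes such as HNSW or IVF to return results in milliseconds.

```python
# Brute-force vector similarity search: the core operation that a vector
# database accelerates with approximate-nearest-neighbor indexing.
# The vectors here are random placeholders standing in for document embeddings.

import numpy as np

rng = np.random.default_rng(seed=0)

num_vectors, dim = 10_000, 384                # e.g., 384-dimensional text embeddings
vectors = rng.normal(size=(num_vectors, dim)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize rows

query = rng.normal(size=dim).astype(np.float32)
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity reduces to a dot product.
scores = vectors @ query

# Indices of the 5 most similar vectors, highest score first.
top_k = np.argsort(scores)[-5:][::-1]
print(top_k, scores[top_k])
```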