Why Inference Chips Are on the Rise

This week, at its Google Cloud Next event in Las Vegas, Google will unveil new tensor processing units (TPUs) designed specifically to handle AI inference workloads, Bloomberg reports. In doing so, Google will add significantly to a rising groundswell of activity in specialized AI chips.

The momentum is fueled by two things. First, hyperscalers, AI labs, and large enterprises want alternatives to NVIDIA’s graphics processing units (GPUs), which have dominated the market for high-end large language model (LLM) training.

Second, the widespread deployment of inferencing, or the process of running trained LLMs to generate outputs for specific applications, has given rise to chips specially designed to handle the latency and processing requirements of those workloads. The rise of agentic AI has only heightened the trend because agents often need continual inferencing to perform optimally.

To understand the need for inference chips, it’s important to note how they differ from GPUs, which are the chips most commonly used to train LLMs. GPUs can perform inferencing, but they provide functions that inference workloads don’t need. For example, GPUs are set up to perform more exacting (higher-precision) mathematics than inference requires. They are also designed to shuttle big batches of tokens, optimizing throughput, whereas inference chips sacrifice throughput for lower latency and omit many of the math options GPUs feature. As a result, GPUs can do inferencing, but inference chips aren’t optimized for training. Inference chips do, however, run their chosen tasks faster and consume power more efficiently than GPUs, resulting in lower operational costs. It all adds up to demand for chips that don’t come with some of the bells and whistles of GPUs but offer more focused functionality for inference.
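The throughput-versus-latency trade-off described above can be illustrated with a toy model. This is a minimal sketch, not a benchmark: the two cost constants and the function names are hypothetical, chosen only to show the shape of the trade-off, and the model assumes each batched step costs a fixed overhead plus a small amount per request.

```python
# Toy model of batched inference. All constants are illustrative assumptions,
# not measurements of any real GPU or inference chip.
FIXED_OVERHEAD_MS = 20.0  # hypothetical cost paid per step, whatever the batch size
PER_REQUEST_MS = 1.0      # hypothetical incremental cost of each request in a batch


def step_latency_ms(batch_size: int) -> float:
    """Time for one batched step; every request in the batch waits this long."""
    return FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size


def throughput_rps(batch_size: int) -> float:
    """Requests completed per second at a given batch size."""
    return batch_size / (step_latency_ms(batch_size) / 1000.0)


if __name__ == "__main__":
    for batch_size in (1, 8, 64):
        print(
            f"batch={batch_size:>3}  "
            f"latency={step_latency_ms(batch_size):6.1f} ms  "
            f"throughput={throughput_rps(batch_size):8.1f} req/s"
        )
```

In this toy model, larger batches complete more requests per second, but every request in the batch waits longer. That is the trade-off that inference-oriented designs aim to shift toward lower latency.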

Inference Chips on the Rise

Besides Google’s anticipated TPU news this week, several recent news items indicate that specialized chips are on the rise:

Google is reportedly in talks with Marvell to develop AI inference chips. Though a deal hasn't been finalized with Marvell, Google appears to want to expand its chip provider options beyond Broadcom, its chip partner up to now.
Late last year, Google announced that Anthropic agreed to purchase 1 million Google TPUs for training and inference processing.
Accelerator chipmaker Cerebras, which offers an Inference Cloud service, has filed an S-1 registration statement with the U.S. Securities and Exchange Commission to go public on Nasdaq under the ticker symbol CBRS.
Separately, Cerebras has signed a supply agreement with major customer OpenAI worth about $20 billion, as well as agreements with two other major buyers: the UAE’s G42 conglomerate and that country’s Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), an Abu Dhabi-based university dedicated to AI science. Both OpenAI and G42 hold warrants to purchase Cerebras common stock at the IPO.
In March, NVIDIA released its NVIDIA Groq 3 LPX Rack, which comprises 256 Groq Language Processing Units (LPUs), chips architected for what NVIDIA calls “the low-latency and large-context demands of agentic systems.” The product is the result of NVIDIA’s $20 billion purchase of inference technology from Groq last year.
In November 2025, AMD acquired MK1, a company that specializes in high-speed inference and enterprise AI. Separately, AMD offers its EPYC 9005 server CPU for enterprise inference.
Inference chipmaker SambaNova has teamed up with Intel, which invested in SambaNova’s recent Series E funding round of $350 million. Both SambaNova and Intel will market a reference architecture using Intel’s Xeon processors and a new SN50 inference chip from SambaNova.

A Strong and Growing Market

Clearly, the demand for inference has set the stage for a lot of activity in the chip market. And it’s likely that as agentic AI grows, demand for these specialized chips will grow as well. While NVIDIA holds a commanding position in GPUs, the growing roster of alternatives suggests there’s room for a number of players in a robust market.

A quote from Cerebras’s S-1 filing articulates this progression:

“Inference is no longer limited to answering questions; modern AI applications now perform actions on behalf of their users. They can directly book travel itineraries, code full web applications from scratch, help customers apply for mortgages, automatically analyze legal contracts for discrepancies, process insurance claims, and more. As a result, demand for AI inference has surged alongside the adoption of these smarter reasoning models that leverage more inference-time compute.

“Ultimately, inference compute demand is driven by the compounding effect of three forces: the number of users, the frequency of use, and the compute per use. Each of these forces is growing at an extraordinary rate, producing a geometric expansion of demand for inference and its underlying compute.”
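The compounding effect in the quote can be made concrete with a small calculation. This is a sketch under a stated assumption: it reads the three forces as a simple product (users times uses per user times compute per use), which is one interpretation of the filing's wording rather than a formula it provides. The function names and the growth figures are hypothetical.

```python
def inference_demand(users: float, uses_per_user: float, compute_per_use: float) -> float:
    """Total compute demand, assuming the three forces multiply together."""
    return users * uses_per_user * compute_per_use


def demand_growth(user_growth: float, frequency_growth: float, compute_growth: float) -> float:
    """Overall growth factor when each force grows by its own factor."""
    return user_growth * frequency_growth * compute_growth


if __name__ == "__main__":
    # Hypothetical example: each force doubles over the same period.
    factor = demand_growth(2.0, 2.0, 2.0)
    print(f"Demand grows by a factor of {factor}")
```

Under that assumption, three forces that each double yield an eightfold increase in demand, which shows how moderate growth in each factor can compound into a much larger total.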

Futuriom Take: Rising demand for inference is spurring a wave of activity in the alternative chip market, including IPOs, licensing agreements, and significant deals and partnerships. This is a strong and growing market with interesting prospects.