
WEKA's AI Push Targets Inferencing Tokenomics


By: Craig Matsumoto


AI inferencing is being overrun by tokens. While the cost per token has dropped considerably, the volume of tokens being used is increasing at a much faster rate. Inference is turning out to be expensive.

Val Bercovici, chief AI officer at venture-backed startup WEKA, explained the dynamic at an evening event held by VentureBeat in San Francisco earlier this month. Reasoning models are arriving from providers like Anthropic, and they generate on the order of 100 times more tokens than previous LLMs.

At the same time, developers really are building AI agents—"swarm coding" has overtaken vibe coding as the idea of the moment, Bercovici said—which can spur another 100x increase in token usage. Smaller AI models won't change this, he added.
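Bercovici's multipliers compound. As a rough back-of-the-envelope sketch (all figures here are illustrative assumptions, not numbers from WEKA or Anthropic):

```python
# Illustrative token math for the compounding effect described above.
# Every figure is an assumption chosen for round numbers.
baseline_tokens_per_task = 2_000   # assumed output of a conventional LLM task
reasoning_multiplier = 100         # reasoning models: ~100x more tokens
agent_multiplier = 100             # agent swarms: another ~100x
price_per_million_tokens = 0.50    # assumed price in USD; real prices vary widely

tokens = baseline_tokens_per_task * reasoning_multiplier * agent_multiplier
cost = tokens / 1_000_000 * price_per_million_tokens
print(f"Tokens per task: {tokens:,}")      # 20,000,000
print(f"Cost per task:   ${cost:,.2f}")    # $10.00 at the assumed price
```

At 10,000 times the token volume, even a steep drop in per-token pricing can't keep the cost of a task flat.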

All of this has happened during calendar 2025. It's a sudden shock to the system and implies that the future of enterprise AI is going to be expensive.

WEKA has steered its product plans to address this. It's part of a larger trend of storage-related vendors becoming broader data management players. (VAST Data says it had those ambitions all along.) On the surface, they're addressing the GPU-to-memory bottleneck, which has proven to be a major issue in AI training and even inference.

More deeply, it's about improving GPU utilization, which in turn improves the economics of running AI. It's gotten WEKA and VAST into partnerships with neoclouds and large AI model providers.

A Year of AI

WEKA started in 2013 with the intention of building a new storage architecture, working from scratch to fit the needs of high-performance computing. Nearly a decade later, that focus put WEKA in a good place to join the AI surge, targeting the infrastructure behind large language models (LLMs).

WEKA raised a $140 million Series E last year, bringing its post-money valuation to $1.6 billion. That was followed up by a busy 2025, as WEKA made it clear it's focusing on being a platform for AI, backing up that declaration with some major product announcements.

That started with the launch of the Augmented Memory Grid, which speeds up AI processing by pooling together memory across different servers. This saves time: cached tokens normally get pushed out when the GPU's own memory fills up, forcing the GPU to redundantly regenerate tokens it should have remembered. With pooled memory, those tokens can be retrieved instead of recomputed.
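The general idea resembles a tiered cache: entries evicted from fast local memory spill to a shared pool instead of being discarded. The sketch below is a minimal illustration of that pattern, not WEKA's implementation; the class and tier names are hypothetical.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier token cache: fast local HBM backed by a shared pool."""

    def __init__(self, hbm_capacity, pooled_tier):
        self.hbm = OrderedDict()      # local GPU memory, kept in LRU order
        self.hbm_capacity = hbm_capacity
        self.pool = pooled_tier       # memory pooled across servers (a dict here)

    def put(self, token_id, kv_entry):
        self.hbm[token_id] = kv_entry
        self.hbm.move_to_end(token_id)
        if len(self.hbm) > self.hbm_capacity:
            # Spill the least recently used entry instead of dropping it.
            evicted_id, evicted_entry = self.hbm.popitem(last=False)
            self.pool[evicted_id] = evicted_entry

    def get(self, token_id):
        if token_id in self.hbm:      # fast path: local hit
            self.hbm.move_to_end(token_id)
            return self.hbm[token_id]
        if token_id in self.pool:     # slower path: pooled hit, promote to HBM
            entry = self.pool.pop(token_id)
            self.put(token_id, entry)
            return entry
        return None                   # true miss: the GPU must recompute
```

Without the pooled tier, every eviction becomes a recompute; with it, an eviction is just a slower read.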

NeuralMesh came in June, adapting WEKA's parallel file system to a world of microservices and Kubernetes. That became the basis of NeuralMesh Axon, formally launched in July, although earlier mentions of it can be found. Like the Augmented Memory Grid, NeuralMesh Axon pools together resources, but at a scale intended for massive AI factories. It sits on the GPU server itself and lets GPUs treat disk storage as if it were local memory, storing information there for immediate retrieval.

What makes this feasible is that NeuralMesh Axon uses the server's east-west network, the one that connects GPUs to one another. With speeds approaching 8x800 Gbit/s, it's a much faster network than the north-south connections normally used for storage I/O. WEKA says it's fast enough that the latency of reaching a storage drive hasn't been a factor (helped by the fact that WEKA was engineered for large systems in the first place).
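The bandwidth arithmetic helps explain why. In the sketch below, the east-west figure comes from the article; the north-south comparison assumes a dual-100-Gbit/s storage link, which is an illustrative guess.

```python
# Rough bandwidth comparison. East-west figure is from the article;
# the north-south figure is an assumed dual-100-Gbit/s storage link.
east_west_gbits = 8 * 800          # 8 links x 800 Gbit/s per GPU server
north_south_gbits = 2 * 100        # assumed storage NICs

print(f"East-west:   {east_west_gbits:,} Gbit/s (~{east_west_gbits / 8:,.0f} GB/s)")
print(f"North-south: {north_south_gbits:,} Gbit/s (~{north_south_gbits / 8:,.0f} GB/s)")
print(f"Ratio:       {east_west_gbits / north_south_gbits:.0f}x")
```

At roughly 800 GB/s of aggregate east-west bandwidth per server, remote flash starts to behave like local memory.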

Pursuing Neocloud Tokenomics

The upshot of all this is that WEKA, in addition to being available on the major public clouds, has ensconced itself in the neocloud world. Its platform has been deployed by CoreWeave and AI model developer Cohere. The company also announced a partnership with Nebius in June, having been introduced to the neocloud by a common healthcare-industry customer.

In terms of competition, VAST Data has likewise gained traction in the neocloud world. WEKA is quick to point out that VAST uses some proprietary hardware. VAST is also more ambitious in terms of data services, now branding itself as an "operating system" for AI.

WEKA's strategy involves pressing the theme of token economics, emphasizing tokens' role as the real currency of AI inference. (This also positions the company to ride along as the AI buzz shifts its emphasis from training to inference.)

Inference clusters operate on different principles than training clusters, Bercovici argues. Inference needs elasticity. WEKA can't necessarily stop inference from needing more and more tokens, but the company argues it can help with the economics of inference by making more efficient use of GPUs and storage.

WEKA isn't alone in pitching ways to overcome GPU idle time. Its platform has found footing with some key customers, however, and its strategy of emphasizing token economics should resonate as the next wave of AI inferencing takes shape.
