AWS Challenges GPUs with Trainium3


By: Mary Jander


AWS announced general availability of its Amazon Elastic Compute Cloud (EC2) Trn3 UltraServers at its re:Invent conference in Las Vegas this week, spotlighting its new 3-nanometer chip, Trainium3. And while discussing future support for NVIDIA’s NVLink Fusion partner integration platform, AWS maintained its position that Trainium3 offers a cost-effective alternative to other AI accelerators. This puts AWS firmly in the growing league of players challenging NVIDIA’s reign in the accelerated computing market.

Trainium3 has been released just a year after Trainium2, a cadence that observers have noted is competitive with NVIDIA’s once-a-year chip cycle and reflects a general sense of urgency about upping the ante on AI processing power. And as it did last year, AWS is touting a massive clustering solution in the Trn3 UltraServer. Each UltraServer handles up to 144 Trainium3 chips, and via AWS instances, thousands of UltraServers can be linked to reach 1 million Trn3 chips.
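As a rough check on the scale those figures imply, the short Python sketch below estimates how many UltraServers it would take to reach 1 million chips. It assumes every UltraServer is populated to the 144-chip maximum; the function name is illustrative and not part of any AWS tooling.

```python
import math

CHIPS_PER_ULTRASERVER = 144   # maximum per Trn3 UltraServer, per AWS
TARGET_CHIPS = 1_000_000      # top-end cluster size cited by AWS


def ultraservers_needed(target_chips: int, chips_per_server: int) -> int:
    """Smallest number of fully populated servers that reaches target_chips."""
    return math.ceil(target_chips / chips_per_server)


servers = ultraservers_needed(TARGET_CHIPS, CHIPS_PER_ULTRASERVER)
print(f"About {servers:,} UltraServers for {TARGET_CHIPS:,} chips")
```

Under that assumption the answer comes to just under 7,000 UltraServers, which squares with the "thousands" AWS describes.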

The hyperscaler boasts a range of improvements for Trn3 over Trn2, including 4.4 times higher performance, 3.9 times higher memory bandwidth, and 4 times better performance per watt. Added to all this is a new NeuronSwitch-v1 based on technology from Astera Labs, which doubles the bandwidth within the UltraServer while keeping chip-to-chip link latency below 10 microseconds. These features suit Trn3 to a wide range of tasks in the development of agentic, reasoning, and video generation apps, AWS says.
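Read together, the performance and performance-per-watt figures hint at how power draw changed. The sketch below works through that arithmetic under one loud assumption: that both gains are measured on the same workload and the same comparison basis, which AWS does not spell out here. The names are illustrative.

```python
PERF_GAIN = 4.4           # Trn3 vs. Trn2 performance, as claimed by AWS
PERF_PER_WATT_GAIN = 4.0  # Trn3 vs. Trn2 performance per watt, as claimed


def implied_power_ratio(perf_gain: float, perf_per_watt_gain: float) -> float:
    """Power ratio implied by perf = (perf per watt) * watts.

    Assumes both gains refer to the same workload and comparison basis.
    """
    return perf_gain / perf_per_watt_gain


ratio = implied_power_ratio(PERF_GAIN, PERF_PER_WATT_GAIN)
print(f"Implied power draw: {ratio:.2f}x the Trn2 baseline")
```

If the assumption holds, the claimed gains would correspond to roughly 10% more power than the Trn2 baseline; if the two figures come from different benchmarks, the ratio says nothing reliable.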

Trainium3. Source: AWS

The NVIDIA Question

So far, Trainium chips have been doing well, as noted in Amazon’s latest earnings call, in which Trainium was described as a “multibillion-dollar franchise.” This may explain the accelerated release schedule.

But some observers say that despite its advantages, AWS Trn3 trails the competition in two key areas. First, there’s the obvious comparison of AWS XPUs to NVIDIA’s GPUs. “Roadmap highlights improved performance, but overall performance still trails other solutions,” wrote Jefferies analyst Blayne Curtis in a December 2 note. “From a TCO perspective, Trn3 appears to be roughly in line with the H200 in terms of TFLOPS/Watt but makes larger improvements on tokens/sec (no data yet) given the better scale-up network.”

Another issue is the lack of developer infrastructure comparable to NVIDIA’s CUDA libraries, which represent many years of innovation. Still, AWS offers its Neuron developer stack to optimize functions in Trainium and Inferentia environments, and the vendor claims it’s an outstanding alternative for AI researchers by virtue of its multicode support.

But AWS stands by its alliance with NVIDIA, despite its competitive stance with Trn3. AWS maintains that its cloud services are the best platform for NVIDIA. In an interview with Bloomberg this week, AWS CEO Matt Garman said:

“When you’re running a large cluster of NVIDIA GPUs, people will tell you, AWS is the best place, you get the best performance, the most stable cluster, the best capabilities out there…. There’s some use cases that are best for Trainium, there’s other use cases where NVIDIA GPUs are going to be your best option. We want to have all of those available.”

Matt Garman, CEO of AWS. Source: AWS

This week’s announcement also included a preview of Trainium4, which will support NVLink Fusion in NVIDIA MGX racks and UALink for dedicated UltraServer architectures. Add to this the promise that Trn4 will have “at least 6x the processing performance (FP4), 3x the FP8 performance, and 4x more memory bandwidth to support the next generation of frontier training and inference,” and AWS is building a solid case for competing aggressively with its partner NVIDIA.

Faster and Farther

At the heart of its pitch against NVIDIA and others, AWS is marketing Trn3 and its UltraServer cluster solution as delivering more cost-effective performance. As Blayne Curtis noted, that claim hasn’t been tested against market alternatives, but it nevertheless looks impressive on paper.

AWS claims Anthropic, Karakuri, Metagenomi, NetoAI, Ricoh, and Splash Music have “reduced their training costs by up to 50% compared to alternatives,” though AWS doesn’t say those firms are using Trn3. That said, AWS is using the new chip to power its Bedrock platform. And another company, AI lab Decart, is, according to AWS, “leveraging Trainium3's capabilities for demanding workloads like real-time generative video, achieving 4x faster frame generation at half the cost of GPUs.”

Ultimately, AWS’ Trn3 will prove its worth in the field over the next few quarters, presenting NVIDIA, AMD, and other chip companies with potentially significant competition.

Futuriom Take: AWS is joining Google and other companies in challenging NVIDIA’s dominance in the XPU market with a bid to offer faster and more efficient functionality with Trainium3.