Arista Unveils AI Agent for NVIDIA SmartNICs


By: Mary Jander

In a move that highlights the key role of SmartNICs in AI networks, networking vendor Arista Networks has created an Extensible Operating System (EOS) agent for NVIDIA’s BlueField-3 platform.

The problem this solves is acute, according to Arista CEO Jayshree Ullal. Today, the network interface cards (NICs) and switches in an AI-oriented datacenter are configured and managed separately, rather than having control, quality of service, telemetry, and monitoring coordinated through one system. This can cause misconfigurations and make it tough to identify the root causes of failures, which is a serious matter in an AI datacenter.

“AI networking demands consistent end-to-end Quality of Service for lossless transport,” wrote CEO Ullal in a recent blog post. She continued:

“This means that the NICs in a server, as well as networking platforms, must have uniform markers/mappings and accurate controls and congestion notifications (PFC & ECN with DCQCN) as well as appropriate buffer utilization thresholds so each component can react to network events like congestion promptly, ensuring the sender can precisely control the traffic flow rate to avoid packet drops. Today, the NICs and networking devices are configured separately. Any configuration mismatch can be extremely difficult to debug in large AI networks.”
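To make the mismatch problem concrete, here is a minimal sketch of the kind of consistency check that a single point of control could run across a switch port and its attached NIC. It is not Arista's or NVIDIA's API: the `QosProfile` class, its field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QosProfile:
    """Hypothetical lossless-transport settings (illustrative field names)."""
    roce_dscp: int      # DSCP value used to mark RoCE traffic
    pfc_priority: int   # priority on which PFC is enabled
    ecn_enabled: bool   # whether ECN marking/reaction is turned on
    cnp_dscp: int       # DSCP value for congestion notification packets


def find_mismatches(switch: QosProfile, nic: QosProfile) -> dict:
    """Return {field_name: (switch_value, nic_value)} for each differing field."""
    diffs = {}
    for f in fields(QosProfile):
        s, n = getattr(switch, f.name), getattr(nic, f.name)
        if s != n:
            diffs[f.name] = (s, n)
    return diffs


# Sample values are assumptions, not vendor defaults.
switch_port = QosProfile(roce_dscp=26, pfc_priority=3, ecn_enabled=True, cnp_dscp=48)
nic_port = QosProfile(roce_dscp=26, pfc_priority=4, ecn_enabled=True, cnp_dscp=48)

mismatches = find_mismatches(switch_port, nic_port)
for name, (s, n) in mismatches.items():
    print(f"{name}: switch={s} nic={n}")
```

In this sketch the two sides disagree only on the PFC priority, which is the sort of single-field drift that is easy to miss when NICs and switches are configured through separate tools.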

An Agent for a SmartNIC

To alleviate the control mismatch between switch and NIC, Arista has extended its EOS operating system via a remote AI agent that resides on directly attached NICs and servers—starting with NVIDIA’s BlueField-3 SuperNIC.

The BlueField-3 SuperNIC is an accelerator card designed to connect GPU servers via 400-Gb/s InfiniBand or RDMA over Converged Ethernet (RoCE). The SmartNIC hosts the BlueField-3 DPU, which offloads networking, storage, security, and management functions from the CPU.

The EOS agent coordinates the QoS and traffic tuning on behalf of the Arista switch attached to the AI GPUs/server. “The remote agent deployed on the AI NIC/server transforms the switch to become the epicenter of the AI network to configure, monitor and debug problems on the AI Hosts and GPUs. This allows a singular and uniform point of control and visibility,” wrote Ullal.

For RoCE Only

Significantly, Arista’s EOS agent works only with RoCE, though BlueField-3 SmartNICs can work with InfiniBand as well. Arista is a founding member of the Ultra Ethernet Consortium (UEC), which is creating a standardized version of Ethernet to compete directly with InfiniBand in AI datacenters, so its choice is no surprise. Also notably, while Arista worked with NVIDIA to extend the EOS agent to BlueField-3, NVIDIA remained silent on the announcement.

Up to now, Arista has been reticent about the use of SmartNICs, presumably due to the problems cited above. Now, it’s clear that Arista intends to use its AI agent to draw the world’s most popular SmartNICs into its switching fold.

Arista to Include a Broader Ecosystem

While starting with NVIDIA, Arista is casting a wider net for the future. In her blog post, Ullal cited ecosystem partners AMD, Broadcom, and Intel as well as NVIDIA. The post also names a broad range of components to be covered within the so-called AI Center managed from the Arista switch, including “GPUs, servers, cables, switches, and routers.”

Arista is set to demo the new agent at the 10th anniversary of its IPO, an invitational gathering of financial analysts set for June 5th at the New York Stock Exchange. After that, the vendor plans to start customer trials sometime during the second half of 2024.