2024 Trend Preview: AI Infrastructure Startups


By: Mary Jander


(Editor's note: This article is part of a multipart series in which Futuriom profiles the trends we expect to drive cloud technology in 2024, as well as the top private companies featured on our 2023 Futuriom 50 list or under consideration for 2024. The new report will be out at the end of January.)

The number of companies using machine learning (ML) and artificial intelligence (AI) to support cloud and communications infrastructure has exploded. Indeed, nearly every firm that monitors or controls any aspect of the IT estate claims some sort of intelligence in its wares.

At the same time, AI -- particularly generative AI (GenAI), which creates text, images, sounds, and other output from natural language queries -- is facilitating a new computing paradigm, one that runs on highly distributed and accelerated platforms. These new environments require a complex and powerful underlying infrastructure, one that addresses the full stack of functionality, from chips to specialized networking cards to distributed high-performance computing systems.

In short, AI is being used in nearly every aspect of cloud infrastructure, while it is also deployed as the foundation of a new era of compute and networking. The following key terms describe some of the microtrends associated with AI:

  • AI infrastructure
  • Machine learning
  • Natural language querying
  • Incident correlation
  • AI modeling

Let's dive in and look at some of the niche trends and markets emerging within the broader AI trend.

Infrastructure to Support AI

AI fuels these trends from both directions -- not only will companies use more AI to improve their own products, but new products are needed to build the infrastructure that delivers and supports AI services.

Building infrastructure for AI services is not a trivial game. It requires massive amounts of money and top-of-the-line engineering to minimize latency and maximize connectivity. In short, AI infrastructure makes traditional enterprise and cloud infrastructure look like child's play.

There has been a surge in companies contributing to the fundamental infrastructure of AI applications -- the full-stack transformation required to run large language models (LLMs) for GenAI. The giant in the space, of course, is NVIDIA, which has the most complete infrastructure stack for AI, including software, chips, SmartNICs, and networking. But there will be plenty of spots for emerging private companies to play as Ethernet-based networking solutions emerge as an alternative to InfiniBand, which has ruled the AI networking market so far. At the same time, specialized AI service providers are emerging to build AI-optimized clouds.

Arrcus, for instance, offers Arrcus Connected Edge for AI (ACE-AI), which uses Ethernet to support AI/ML workloads, including GPUs within the datacenter clusters tasked with processing LLMs. The vendor aims the solution at communications service providers, enterprises, and hyperscalers looking for a way to flexibly network compute resources for AI infrastructure using a software-based approach that avoids the costs and limitations of switching hardware. Arrcus recently joined the Ultra Ethernet Consortium, a band of companies targeting high-performance Ethernet-based solutions for AI.

DriveNets offers a Network Cloud-AI solution that deploys a Distributed Disaggregated Chassis (DDC) approach to interconnecting any brand of GPUs in AI clusters via Ethernet. This massively scalable platform is meant to be an InfiniBand alternative. Implemented via white boxes based on Broadcom Jericho 2C+ and Jericho 3-AI components, the product can link up to 32,000 GPUs at up to 800 Gb/s.

Enfabrica, a startup founded in 2020 that emerged from stealth early in 2023, has created an accelerated compute fabric switch (ACF-S) that replaces the SmartNICs and PCIe switches that connect Ethernet-linked servers with the GPUs and CPUs within the systems that process AI models. The switch chip offers faster connections from the network to the AI system and reduces latency associated with traffic flows between NICs and GPUs. All of this streamlines AI processing and lowers the total cost of ownership (TCO) for AI systems.

Enfabrica hasn’t released its ACF-S switch yet, but it is taking orders for shipment early this year, and the startup has been displaying a prototype at conferences and trade shows in recent months. While it can’t list customers yet, Enfabrica’s investor list is impressive, including Atreides Management, Sutter Hill Ventures, IAG Capital, Liberty Global, NVIDIA, Valor Equity Partners, Infinitum, and Alumni Ventures.

Software for Open Networking in the Cloud (SONiC) is an open networking platform built for the cloud -- and many enterprises see it as an economical solution for running AI networks, especially at the edge in private clouds.

Aviz Networks has built the Open Networking Enterprise Suite, a multivendor networking stack for the open-source network operating system SONiC, enabling datacenters and edge networks to deploy and operate SONiC regardless of the underlying ASIC, switch hardware, or SONiC distribution. The suite can also fold switches running NVIDIA Cumulus Linux, Arista EOS, or Cisco NX-OS into the same SONiC-based network.

Hedgehog is a cloud-native software company helping cloud-native application operators manage workloads and networking with the ease of use of the public cloud. This includes managing applications across edge compute, on-premises infrastructure, and distributed cloud infrastructure. CEO Marc Austin recently told us the technology is in early testing for some projects that need the scale and efficiency of cloud-native networking to implement AI at the edge.

Another startup with AI infrastructure in mind is WebAssembly (Wasm) developer Fermyon, which has created Spin, an open-source tool for software engineers, and Fermyon Cloud, a premium cloud service aimed at larger enterprises. Both products deploy the W3C Wasm standard to efficiently compile many different types of code down to the machine level, giving Web apps much faster startup times. The software also runs cloud apps securely in a Web sandbox separated at the code level from the rest of the infrastructure.

When it comes to AI, Fermyon is looking ahead. Thanks to its ability to shrink the amount of code required to run cloud applications on the Web, Fermyon's Wasm could be used to run AI in distributed Internet of Things (IoT) environments or in cloud infrastructure supporting edge data analytics and AI.
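
For readers new to Wasm, the compile-and-sandbox idea is easy to see with the open-source wasmtime runtime's Python bindings. The short sketch below is purely illustrative -- it uses generic tooling, not Fermyon's Spin or Fermyon Cloud -- and shows a module that exports one function and is granted no imports, so it cannot touch anything outside its sandbox.

    # Minimal Wasm illustration using the open-source wasmtime Python bindings
    # (pip install wasmtime). Generic tooling for illustration only,
    # not Fermyon's Spin API.
    from wasmtime import Engine, Store, Module, Instance

    # A tiny module in WebAssembly text format that exports an `add` function.
    WAT = """
    (module
      (func (export "add") (param i32 i32) (result i32)
        local.get 0
        local.get 1
        i32.add))
    """

    engine = Engine()
    store = Store(engine)
    module = Module(engine, WAT)            # compiled down to machine code by the runtime
    instance = Instance(store, module, [])  # no imports granted: the sandbox sees nothing else

    add = instance.exports(store)["add"]
    print(add(store, 2, 3))                 # -> 5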

Another company, Tecton, specializes in creating feature stores for machine learning. A feature store organizes curated, precomputed data attributes (features) so they can be stored, shared, and consistently reused to train and serve machine learning models. The founding team at Tecton came from Uber, where they helped develop Michelangelo, a feature store platform behind thousands of Uber's ML models. Tecton now seeks to "democratize" this technology.
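
To make the feature store concept concrete, here is a toy sketch in plain Python -- the class and feature names are hypothetical, not Tecton's SDK. The key idea it illustrates is a point-in-time lookup: features are written once with timestamps, and both training jobs and online models read them back the same way, so a model never trains on values from the future.

    # Toy sketch of a feature store: features are computed once, stored with a
    # timestamp, and served consistently to training and inference. The names
    # (FeatureStore, user_7d_ride_count) are hypothetical, not Tecton's API.
    from collections import defaultdict
    from datetime import datetime

    class FeatureStore:
        def __init__(self):
            # entity_id -> feature_name -> sorted list of (timestamp, value)
            self._rows = defaultdict(lambda: defaultdict(list))

        def write(self, entity_id, feature_name, value, ts):
            self._rows[entity_id][feature_name].append((ts, value))
            self._rows[entity_id][feature_name].sort()

        def read_latest(self, entity_id, feature_name, as_of):
            """Point-in-time lookup: the newest value at or before `as_of`."""
            history = self._rows[entity_id][feature_name]
            eligible = [v for ts, v in history if ts <= as_of]
            return eligible[-1] if eligible else None

    store = FeatureStore()
    store.write("user_42", "user_7d_ride_count", 11, datetime(2024, 1, 2))
    store.write("user_42", "user_7d_ride_count", 14, datetime(2024, 1, 9))

    # The same lookup serves an offline training job and an online model.
    print(store.read_latest("user_42", "user_7d_ride_count", datetime(2024, 1, 5)))  # -> 11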

The above is just a sampling of highly innovative firms in the AI space. Many more companies are continually emerging, as demand for AI keeps building and new use cases proliferate. And though many of the companies in this space are still in early stages, it seems realistic to expect sizable growth in the AI infrastructure arena.

AI-Enabled Observability and Automation

Several companies use AI for observability, defined as the capability to gather and analyze information about the status of IT elements. Kentik's Network Intelligence Platform, delivered as a service, uses AI and machine learning to monitor traffic throughout the IT infrastructure and correlate it with additional data from telemetry, traffic monitoring, performance testing, and other sources. The results are used for capacity planning, cloud cost management, and troubleshooting. Selector uses AI and ML to identify anomalies in the performance of applications, networks, and clouds by correlating data from metrics, logs, and alerts. A natural language query interface is integrated with messaging platforms such as Slack and Microsoft Teams.
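
A rough sense of what that correlation involves can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Kentik's or Selector's implementation: flag statistical outliers in a metric, then group anomalies and alerts that land in the same time window so an operator sees one incident instead of dozens of separate signals.

    # Simplified sketch of AI-assisted incident correlation (illustrative only).
    from statistics import mean, stdev

    def zscore_anomalies(samples, threshold=2.0):
        """Return indices of samples that deviate strongly from the series mean."""
        mu, sigma = mean(samples), stdev(samples)
        return [i for i, x in enumerate(samples) if sigma and abs(x - mu) / sigma > threshold]

    def correlate(events, window_seconds=300):
        """Group (timestamp, source, message) events that occur close together in time."""
        incidents, current = [], []
        for ts, source, msg in sorted(events):
            if current and ts - current[-1][0] > window_seconds:
                incidents.append(current)
                current = []
            current.append((ts, source, msg))
        if current:
            incidents.append(current)
        return incidents

    latency_ms = [102, 99, 101, 100, 98, 103, 240]    # last sample is a spike
    print(zscore_anomalies(latency_ms))               # -> [6]

    events = [
        (1000, "router-edge-1", "interface errors spiking"),
        (1090, "app-checkout", "p99 latency anomaly"),
        (1150, "k8s-cluster-a", "pod restarts"),
        (9000, "router-edge-2", "BGP flap"),
    ]
    print([len(i) for i in correlate(events)])        # -> [3, 1]: one incident, one separate event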

In addition to observability firms, there are companies using AI for cloud cost optimization. CAST AI, for example, uses artificial intelligence to optimize cloud spending and monitor Kubernetes costs. The startup claims to help IT teams realize up to 60% in savings through use of its platform, which also uses AI to recommend ways to optimize resources. Zesty uses an AI model to predict and provision, in real time, the levels of cloud storage, CPUs, and other resources an application requires, which the company says results in substantial savings.
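
The rightsizing logic behind such tools can be illustrated with a deliberately naive sketch. The instance catalog, prices, and forecasting rule below are invented for illustration; they are not how CAST AI or Zesty actually model demand.

    # Naive rightsizing sketch: forecast near-term demand from recent usage,
    # add headroom, and pick the cheapest capacity that still fits.
    CATALOG = {2: 0.08, 4: 0.15, 8: 0.29, 16: 0.55}    # vCPUs -> $/hr (hypothetical)

    def forecast_vcpus(usage_history, headroom=1.3):
        """Weighted recent average of vCPU usage plus headroom.
        A real optimizer would use a learned model and many more signals."""
        weights = range(1, len(usage_history) + 1)     # newer samples weigh more
        avg = sum(w * u for w, u in zip(weights, usage_history)) / sum(weights)
        return avg * headroom

    def recommend(usage_history):
        need = forecast_vcpus(usage_history)
        size = min((s for s in CATALOG if s >= need), default=max(CATALOG))
        return size, CATALOG[size]

    # Workload currently pinned to a 16-vCPU node but averaging about 5 vCPUs.
    size, price = recommend([4.8, 5.1, 5.3, 4.9, 5.0])
    print(f"recommend {size} vCPUs at ${price}/hr vs ${CATALOG[16]}/hr today")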

AI for Security

AI plays a key role in IT security by identifying threats from numerous alerts and offering automated solutions or remediation advice. Exabeam's Security Operations Platform provides security information and event management (SIEM) as well as an AI-driven Analytics Engine that consolidates and correlates alerts from hundreds of sources, including third-party event monitoring applications, in order to identify insider threats, compromised credentials, data exfiltration, and more.

Lacework offers a cloud-native application protection platform (CNAPP) that ingests data from numerous sources in public, private, and hybrid clouds and uses AI/ML to detect anomalous security events in AWS, Azure, GCP, Oracle OCI, multicloud, and Kubernetes environments. An AI Assist GenAI assistant provides guidance and recommendations for security operations.

Stellar Cyber's Open XDR platform combines many security capabilities, including SIEM; security orchestration, automation, and response (SOAR); and network detection and response (NDR), to name just a few. It uses AI and ML to correlate incidents and issue automatic responses to security breaches.

The Versa Unified SASE Platform gathers data from the wide-area network, the cloud, campus, branch, and individual users and devices into a data lake to which AI/ML is applied to generate alerts, identify anomalous user behavior, and protect against data loss.

AI for Data and Apps at the Edge

One key feature of AI is that it consumes, and generates, a lot of data. Learning models require vast collections of data from all types of devices and applications. As a result, there is a lot of overlap among AI, edge infrastructure, and networking.

AI will fuel the growing trend of multicloud networking. Networking companies targeting data and apps at the edge should benefit from the need for secure connectivity. Prosimo, for example, offers a cloud-native, multicloud infrastructure stack that delivers cloud networking, performance, security, observability, and cost management. AI and machine learning models provide data insights and monitor the network for opportunities to improve performance or reduce cloud egress costs. Graphiant's Network Edge tags traffic from remote devices with packet-level instructions to improve performance and agility at the edge compared with MPLS or even SD-WAN. A Graphiant Portal enables policy setup and connectivity to major public clouds.

Providers of AI for IT and cloud environments should also benefit. These include ClearBlade, whose Internet of Things (IoT) software facilitates stream processing from multiple edge devices to a variety of internal and external data stores. ClearBlade Intelligent Assets uses AI to create digital twins of a variety of IoT environments, which can be linked to real-time monitoring and operational functions.

In other scenarios, vendors have augmented their products with GenAI to help customers develop their own GenAI applications. Databricks, for example, offers a Data Intelligence Engine that enables customers to obtain information about specific data (which can then be used in AI model training) through natural language queries to the lakehouse (the company's flagship combination of data warehousing with data lakes). The company also uses this intelligence to manage data and optimize performance.
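
Conceptually, natural language querying of governed data boils down to handing a model the table schema plus the question, then executing the SQL it returns. The sketch below substitutes a hard-coded stub for the model call and an in-memory SQLite table for the lakehouse, so it illustrates only the pattern, not Databricks' Data Intelligence Engine.

    # Conceptual sketch of natural-language-to-SQL over a governed table.
    # The ask_llm() stub and schema are hypothetical; a real system would call
    # a language model and run the query against the lakehouse.
    import sqlite3

    SCHEMA = "CREATE TABLE rides (city TEXT, fare REAL, ride_date TEXT);"

    def ask_llm(question, schema):
        """Stand-in for a model call that turns a question plus schema into SQL.
        Hard-coded here so the sketch runs offline."""
        return "SELECT city, AVG(fare) FROM rides GROUP BY city ORDER BY AVG(fare) DESC;"

    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO rides VALUES (?, ?, ?)",
        [("Austin", 18.5, "2024-01-02"), ("Boston", 24.0, "2024-01-03"),
         ("Austin", 21.0, "2024-01-04")],
    )

    sql = ask_llm("Which city has the highest average fare?", SCHEMA)
    print(conn.execute(sql).fetchall())   # -> [('Boston', 24.0), ('Austin', 19.75)]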

Data backup expert Rubrik offers Ruby, which it calls a "generative AI companion" within the Rubrik Security Cloud that identifies threats to stored data and assists with remediation and recovery. Ruby, which was built using Microsoft Azure's OpenAI service, works with Rubrik's ML-powered data security products to streamline the discovery, investigation, and remediation of cyber threats.

Stay tuned as we update all of these trends in our forthcoming Futuriom 50 report!