With the rise of generative artificial intelligence (AI) applications and the massive buildout of AI infrastructure, the optics industry is at the forefront of this evolution, since improved optical interconnections can mitigate bandwidth constraints within an AI cluster. This was one of the hottest topics at OFC 2024, with LightCounting forecasting that total sales of optical transceivers for AI cluster applications may reach approximately $52 billion over the next five years.

While the near-term focus has been on how AI will affect the technology around short-reach interconnects, there will certainly be an impact on interconnections beyond the AI cluster and beyond the AI data center, in hyperscaler networks.

The question is: beyond the short-distance, high-bandwidth interconnections, how will AI traffic affect the optical transport environment outside the intra-building network, across the metro, long-haul, and longer-reach applications where coherent optical transmission is heavily utilized?

Figure 1.  Sales Forecast for Ethernet Optical Transceivers for AI Clusters (July 2024 LightCounting Newsletter, “A Soft Landing for AI Optics?”).

Effect of Past Applications on the Transport Network
Bandwidth-intensive computing applies to both AI training and AI inference, where inference refers to the post-training phase in which the model is “ready for the world,” producing inference-based outputs from input data using what it learned during training. In addition to AI training’s requirements for a large amount of computing power and a large number of high-bandwidth, short-reach connections, bandwidth requirements are also foreseen beyond the AI data center. To understand how network traffic patterns may evolve beyond the AI data center, let’s review some examples of how the wider transport network was affected by the past growth of various applications. Although these applications may not closely match the traffic profile of AI applications, they can provide some insight into the effects that the growth of AI applications may have on optical transport, and thus on the growth of coherent technology.

If we look at search applications, the AI training process is broadly analogous to a search engine’s crawling bots combing the internet to gather data to be indexed (with AI training being far more computationally intensive). The AI inference process is analogous to the search engine being queried by the end user, with results made available for retrieval with minimal latency. While the transport bandwidth required for search bots and user queries is minimal compared to higher-bandwidth applications, the cumulative effect of search-related traffic still contributes to overall transport traffic, including bandwidth from regional/local caching to minimize latency, as well as the subsequent traffic created by acting upon search results.

Understanding how network traffic was affected by the growth of video content delivery is another example that can inform potential AI transport traffic patterns. A main concern with video content distribution was the burden imposed on the network in delivering the content (especially high-resolution video) to the end user. To address this concern, content caching, where higher-demand content is cached closer to the end user, was implemented to reduce overall network traffic from the distribution source to the end user, as well as to reduce latency. While it is too early to predict how much network traffic will increase due to expansive queries to and responses from AI inference applications, the challenge is to keep the latency of this access minimal. One can draw an analogy between content caching and edge computing, where the AI inference model sits closer to the user and increased transport bandwidth is required at these edge computing sites. The open question, however, is how this distribution would affect the efficiency of the inference function.

Turning to cloud computing for insights on traffic patterns, the rise of (multi-)cloud computing resulted in intra- and inter-data center traffic (a.k.a. east-west traffic) increasing as workloads traversed the data center environment. There is a similar potential rise in this type of traffic with AI, as data for training could be dispersed among multiple cluster sites and inference models distributed to physically diverse sites to reduce latency to end users.

In each of these examples, as application demand increased, transport bandwidth requirements also grew, not only from the target data (e.g., search results, video) but also from the overhead or intra-data center traffic needed to support these applications (e.g., content caching, cloud computing, backend overhead). Traffic from aggregating AI training content, as well as from distributing AI inference models and their results, may follow similar patterns, pressuring network operators to increase capacity in their data center interconnect, metro, and regional networks. Long-haul and subsea networks may also need to expand to meet the demands of AI-related traffic.
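As a rough illustration of this effect, the sketch below (in Python, with purely hypothetical overhead factors rather than measured ratios) shows how supporting traffic such as cache fill and replication can multiply transport demand well beyond the user-facing payload.

```python
# Rough illustration only: how overhead traffic (cache fill, replication,
# east-west workload movement) multiplies transport demand beyond the
# user-facing payload. All factors are hypothetical assumptions.

def transport_demand_gbps(user_facing_gbps: float,
                          cache_fill_factor: float,
                          replication_factor: float) -> float:
    """Total transport demand = payload plus supporting background traffic."""
    overhead = user_facing_gbps * (cache_fill_factor + replication_factor)
    return user_facing_gbps + overhead

# Example: 1 Tbps of user-facing traffic with an assumed 30% cache-fill
# and 50% replication overhead -> 1,800 Gbps of transport demand.
print(transport_demand_gbps(1_000, cache_fill_factor=0.3, replication_factor=0.5))
```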

Figure 2.  A scenario in which the network fabric physically expands due to facility power constraints, requiring high-capacity optical interconnections.

The Balance of Power and Latency
While the application examples above relate to how AI applications themselves may affect bandwidth growth, what is becoming apparent is that the power required to run AI clusters and data centers is significant. In the past, as the demand for cloud services grew, access to localized, inexpensive power sources helped drive site selection for large-scale data centers. However, power facility and availability constraints helped drive the adoption of physically distributed architectures, which then relied on high-capacity transport interconnects between data centers to maintain the desired network architecture (Figure 2). We anticipate a similar situation with AI buildouts requiring distributed facilities to address power constraints, with potential trade-offs of reduced efficiency for both AI training and inference. The distributed network would then rely on high-capacity interconnect transport using coherent transmission to extend the AI network fabric. Unlike cloud applications, physical expansion of the network fabric for AI applications presents a different set of challenges due to the compute and latency requirements of both training and inference.

Figure 3.  Extremely low latency is required within the AI cluster to expeditiously process incoming datasets during training. Since datasets are collected before being fed into the training cluster, the collection process may not be as latency sensitive.

As we plan for AI buildouts, one common question is how the physical extension of an AI networking fabric may affect AI functions. While geographic distribution of AI training is not ideal, facility power constraints are certain to lead to growing adoption of distributed AI training techniques that attempt to mitigate the introduced latency effects. As part of the training process, sourcing the datasets fed into the training cluster may not be latency sensitive and would not be as affected by physical network extension (Figure 3). After training, when the inference model is complete, the goal is to minimize the latency between a user’s query to the inference model and the results transmitted back to the user (Figure 4). This latency is a combination of the complexity of the query and the number of “hops” between the inference model and the user. Latency reduction when accessing the inference model, as well as methods to effectively distribute both the training and inference functions beyond a centralized architecture to address single-site power constraints, are ongoing discussions within the industry.
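To make this concrete, below is a minimal latency sketch (in Python, with all distances, hop counts, and compute times being hypothetical assumptions) that treats round-trip inference latency as the sum of fiber propagation delay, per-hop switching delay, and model compute time, and compares a distant centralized site with an edge-hosted copy of the model.

```python
# Minimal, illustrative latency model; every number below is a hypothetical
# assumption, not a measurement of any real deployment.

FIBER_LATENCY_MS_PER_KM = 0.005  # roughly 5 microseconds per km, one way, in fiber

def inference_latency_ms(distance_km: float, hops: int,
                         per_hop_ms: float, compute_ms: float) -> float:
    """Estimate round-trip latency from the user to the inference model and back."""
    propagation = 2 * distance_km * FIBER_LATENCY_MS_PER_KM  # out and back over fiber
    switching = 2 * hops * per_hop_ms                        # routers/switches each way
    return propagation + switching + compute_ms              # plus model compute time

# Hypothetical comparison: distant centralized site vs. an edge-hosted model copy
print(inference_latency_ms(distance_km=2000, hops=8, per_hop_ms=0.05, compute_ms=50))
print(inference_latency_ms(distance_km=100,  hops=3, per_hop_ms=0.05, compute_ms=50))
```

In this simple model, moving the inference model from a distant hub to an edge site shrinks only the propagation and switching terms, while the compute term is unchanged, which is why hop count and physical distance set the achievable latency floor once the query itself has been served.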

Whether driven by power constraints, dataset sourcing, or inference response efficiency, the sheer growth of AI applications will drive network traffic growth beyond AI cluster sites into the wider network, requiring high-capacity interconnects.

Figure 4.  Minimizing latency for AI inference is a key objective.

Trading off power requirements, access to inexpensive and abundant power, and latency is familiar territory when it comes to bandwidth-intensive applications. The outcome that best balances these trade-offs is application dependent and can even vary deployment by deployment. We continue to watch the evolving AI space to see how these network architecture trade-offs play out and how they shape the design of the transport network. High-capacity coherent transport can certainly influence these trade-offs. As we have already seen with cloud architectures, coherent high-capacity transport allowed networks to physically expand and alleviate power source constraints by providing fat-pipe links between sites. We anticipate a similar scenario as AI network architectures expand.

The Ripple Effect
While the near-term focus on high-capacity interconnects for AI applications has been on short-reach connections within AI clusters, we are already seeing bandwidth requirements begin to increase, requiring additional coherent connectivity between data centers supporting AI. And while there is general agreement that the resulting bandwidth demand from AI applications translates to increased traffic across the network, we are in the early stages of understanding how specific segments of the network will be affected. Coherent optical interconnects for high-capacity transport beyond the data center already provide performance-optimized transponder solutions at 1.2T per wavelength, as well as 400G router-to-router wavelengths moving to 800G using MSA pluggable modules. This technology will continue to play a role in the transport solutions supporting AI applications, whether the expanding traffic is in the metro, data center interconnect, long-haul, or beyond.
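As a back-of-the-envelope illustration (the aggregate demand figure below is a hypothetical assumption, not a forecast), the number of wavelengths needed on an interconnect scales directly with the per-wavelength rate, which is why the progression from 400G pluggables to 800G and to 1.2T performance-optimized transponders matters for AI-driven transport growth.

```python
import math

# Back-of-the-envelope wavelength count for a data center interconnect.
# The 50 Tbps demand figure is a hypothetical assumption for illustration.

def wavelengths_needed(demand_gbps: int, rate_per_wavelength_gbps: int) -> int:
    """Number of wavelengths required to carry the given aggregate demand."""
    return math.ceil(demand_gbps / rate_per_wavelength_gbps)

demand_gbps = 50_000  # assumed 50 Tbps of AI-driven interconnect demand
for rate in (400, 800, 1200):  # 400G/800G pluggables, 1.2T transponders
    print(f"{rate}G wavelengths needed: {wavelengths_needed(demand_gbps, rate)}")
```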