At OFC last month there was a lot of discussion about the three architectural strategies for expanding system capacity: scale up, scale out, and scale across. Scale up refers to scaling within a rack, while scale out extends outside of the rack and throughout the data center. And when that’s not enough to handle AI workloads, providers look to scale across multiple data centers so that resources such as GPU compute can be distributed across multiple geographic locations.
While we discussed these new architectures in a previous blog, this blog will focus on scale across and how coherent optics are a key enabler.
What is Scale Across?
Algorithmic improvements have made it possible to expand a cluster beyond a single data center facility (to campus, metro, and even beyond), which helps alleviate the power and space constraints of any one site. The result is the scale-across network: a back-end network connected across multiple data centers, whether in close proximity or across regions. Reasons for doing this include geo-distributed AI training, capacity pooling, and connecting AI factories.
Figure 1. Scale-across architecture extends the AI Back-End network beyond the physical data center
In this new computing paradigm, data center operators need to confront the physical, operational, commercial, and regulatory challenges of delivering power and cooling to these sites. The world’s largest clusters are already operating in facilities designed to scale to gigawatts of electricity, comparable to the consumption of small cities. This has forced a rethinking of data center location strategy, with operators seeking regions that offer lower-cost and more reliable power.
Scale-Across Challenges
Scale-across use cases are estimated to require 14X the bandwidth between sites compared to cloud DCI. Instead of deploying bandwidth incrementally at the wavelength level, some operators are thinking in terms of fiber-level bandwidth granularity. Bandwidth at this scale between data centers will demand significant improvements in the power efficiency and reliability of optical networks. New strategies include multi-rail amplifiers that allow vendors to leverage common elements across multiple fibers, as well as media converters that integrate the client and line into a single DSP. Operating at this scale raises several new challenges, including:
- Operational scalability – Disaggregation is important because it allows operators to choose among different solutions and limit their exposure to supply-chain risk and component failures. The simpler we can make networks, the easier they will be to scale in the future, and this includes multi-vendor ecosystems and reduced component counts.
- Power efficiency – If less power is used for networking, that power can be applied to the GPUs that deliver compute performance. In addition, power availability will be a challenge, particularly at regeneration sites. Finding the right trade-off between power and performance is going to be critical.
- Reliability – AI training is very sensitive to any disruption to traffic flow. The problem is magnified in scale-across networks because issues may arise at remote regenerator sites, where operational support is harder to provide.
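To give a rough feel for the 14X estimate mentioned above, the following minimal sketch converts a cloud DCI baseline into a scale-across bandwidth target and a wavelength count. The 10 Tb/s baseline and the function names are hypothetical, chosen only for illustration; the 14X multiplier and the 800G per-wavelength rate come from this post.

```python
import math

# Multiplier quoted in the text: scale-across is estimated to need
# 14X the inter-site bandwidth of cloud DCI.
SCALE_ACROSS_MULTIPLIER = 14


def scale_across_gbps(cloud_dci_gbps: float) -> float:
    """Estimate scale-across bandwidth (Gb/s) from a cloud DCI baseline."""
    return cloud_dci_gbps * SCALE_ACROSS_MULTIPLIER


def wavelengths_needed(capacity_gbps: float, per_wavelength_gbps: float = 800) -> int:
    """Number of wavelengths needed to carry the capacity, rounded up."""
    if capacity_gbps <= 0 or per_wavelength_gbps <= 0:
        raise ValueError("capacity and per-wavelength rate must be positive")
    return math.ceil(capacity_gbps / per_wavelength_gbps)


# Hypothetical baseline: 10 Tb/s of cloud DCI between two sites.
baseline_gbps = 10_000
target_gbps = scale_across_gbps(baseline_gbps)
print(target_gbps, wavelengths_needed(target_gbps))
```

Counts of this size are one way to see why some operators plan in whole fibers rather than individual wavelengths. The sketch ignores spectrum limits, protection capacity, and regeneration.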
How Coherent Optics Enable Scale Across
Optical interconnects in a scale-across network are made possible today by coherent technology, which includes both pluggable coherent optics and transponder-based, performance-optimized coherent solutions. 800G ZR/ZR+ pluggable modules are well positioned to be the workhorse in these scale-across networks. Performance-optimized 1.2T modules such as Acacia’s CIM 8 are also being deployed, and 1600ZR/ZR+ coherent modules are expected to follow. Analysts such as Cignal AI are predicting that the ramp of 800G ZRx pluggables, driven by AI scale-across, will surpass that of all earlier coherent generations.
Pluggable coherent optics are critical components in scale-across networks because they reduce power consumption by eliminating unnecessary client interfaces and allowing direct router-to-router connectivity. The evolution from early ZR modules for metro DCI (~120 km) to ZR+ (~1000 km) and ultra-long-haul (3000+ km) now provides flexible, scalable solutions for both intra- and inter-data center links.
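As a simple illustration of how these reach classes might map to links, here is a minimal sketch that picks a module class from link distance. It uses only the approximate reach figures quoted above. The function name and the rule that anything beyond ZR+ reach falls to the ultra-long-haul class are assumptions for illustration; real link engineering depends on fiber type, amplification, and OSNR budget, not distance alone.

```python
# Approximate reach figures quoted in the text (km), shortest first.
REACH_CLASSES = [
    ("ZR", 120),    # metro DCI
    ("ZR+", 1000),  # regional
]


def pick_module_class(link_km: float) -> str:
    """Return the shortest-reach class whose quoted reach covers the link.

    Assumption: links longer than the ZR+ figure are assigned to the
    ultra-long-haul class described in the text.
    """
    if link_km <= 0:
        raise ValueError("link distance must be positive")
    for name, max_km in REACH_CLASSES:
        if link_km <= max_km:
            return name
    return "ultra-long-haul"


# Hypothetical link distances between sites, in km.
for distance in (80, 500, 2500):
    print(distance, pick_module_class(distance))
```

Choosing the shortest-reach class that still covers the link reflects the power and performance trade-off discussed earlier, since longer-reach options generally draw more power.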
Figure 2. Coherent pluggable and performance-optimized modules play a key role in scale-across networks
Interestingly, the capacity needed for these scale-across networks is separate from the capacity required for the traditional front-end DCI needs. Traditional coherent optics solutions utilize the C-band spectrum of the low-loss optical fiber transmission window. However, the tremendous capacity requirements for AI scale-across infrastructure are accelerating coherent optics adoption of the secondary low-loss transmission window, the L-band spectrum.
Scaling Across Today and in the Future
The industry is still in the early days of deploying AI scale-across architectures, but the trajectory is clear. Class 3 baud rate technologies (such as 800ZR/ZR+ and advanced transponders) will dominate near-term deployments, while Class 4 and higher baud rate technologies will see even more aggressive innovation.
Acacia’s industry-leading coherent optics are already being deployed in AI buildouts, and Acacia’s recently released port shipment numbers for its 800G, 400G, and 1.2T DSPs show continued market share leadership.
To learn more about AI architectures and Acacia’s leading coherent optics technology, contact us.
