Sharada Yeluri’s Post

Sharada Yeluri

Sr. Director of Engineering, Silicon and Systems Technology, @ Juniper Networks

I finally watched a few videos from this year's OCP conference. As expected, the networking sessions focused on improving performance, congestion, and resiliency for large GPU clusters. Meta and ByteDance presented Disaggregated Scheduled Fabric (DSF) as a cure-all solution. It's interesting that Meta is trying out DSF for its next-gen clusters after two generations of Ethernet fabrics with hop-by-hop scheduling.

Scheduled fabric isn't new; it's in all high-end modular switches. It is receiver-based scheduling between leaf switches in a leaf-spine topology. The sender requests credits for transmission. Once approved, data is fragmented into smaller cells, sprayed across fabric links, and reassembled by the destination leaf switch for delivery to the target GPU.

DSF offers hyperscalers several advantages. It enables them to treat the distributed chassis as a large pizza box/single router, shifting the burden of network utilization/performance to switch vendors. Resilience is often built into the hardware, enabling link switchovers without software intervention. DSF also eliminates the extensive tuning required for ECN and other schemes.

With DSF, packets are buffered in ingress leaf switches while waiting for grants, and packet buffering requirements increase linearly with scale. The egress leaf must have sufficient output buffering to hide the RTT (which depends on cable lengths) to avoid under-subscribed links. With limited on-chip SRAM buffering in ingress switches, traffic could spill to external memory, increasing tail latencies even as it prevents packet loss. ByteDance claims all traffic can remain on-chip in their 128-GPU DSF results, but this may not hold for larger systems.

Meta's DSF results at OCP with a 144-GPU cluster show a 10% improvement for bandwidth-intensive collectives like all-to-all. For other collectives, including all-reduce, the results are on par with the existing adaptive routing (with DLB) configuration used in Meta's other Ethernet fabrics. ByteDance saw similar results in their clusters. This raises the question of ROI for the expensive DSF fabrics. While the 10% advantage is significant, would it hold for larger clusters? We will have to wait for production network results.

DSF also adds control plane complexity. All DSF implementations are vendor-specific, requiring customers to understand each vendor's implementation. DSF, in a sense, is a closed vendor solution like InfiniBand. UEC also enables packet spray with a transport protocol that doesn't require strict ordering. It specifies combined sender- and receiver-based scheduling (similar to the VOQ scheduling in DSF) from the NICs. With AMD's announcement, UEC-compliant NICs (with programmable pipes) are on the horizon.

Time will tell whether DSF fizzles out in AI clusters when more economical UEC-compliant switches/NICs flood the market (wishful thinking 😊) or continues to coexist, with hyperscalers wanting turnkey solutions. Any thoughts? 🤔 #networkingforAI #AI #GPU #Junipernetworks
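For anyone who hasn't seen a scheduled fabric up close, here is a rough, purely illustrative sketch of that request/grant, cell-spray, and reassembly flow; the cell size, credit accounting, and link fan-out are simplifying assumptions, not any vendor's actual numbers:

```python
# Minimal sketch of the DSF-style request/grant, cell-spray, and reassembly
# flow described above. Purely illustrative: cell size, credit units, and the
# number of fabric links are assumptions, not any vendor's implementation.
from collections import defaultdict

CELL_SIZE = 256      # bytes per cell (assumed)
FABRIC_LINKS = 4     # parallel leaf-to-spine links to spray across (assumed)

class EgressScheduler:
    """Receiver-side grant logic on the egress leaf (greatly simplified)."""
    def __init__(self, credits_bytes: int):
        self.credits = credits_bytes

    def request(self, nbytes: int) -> int:
        # Grant only what the egress buffers can absorb right now.
        grant = min(nbytes, self.credits)
        self.credits -= grant
        return grant

def send_packet(payload: bytes, sched: EgressScheduler) -> bytes:
    # 1. The ingress leaf requests credits from the egress leaf scheduler.
    granted = sched.request(len(payload))
    if granted < len(payload):
        # In a real fabric the packet would wait in the ingress VOQ for more
        # credits; the sketch simply stops here.
        raise RuntimeError("insufficient credits; packet waits in ingress VOQ")

    # 2. Fragment the packet into cells and spray them across fabric links.
    cells = [(seq, payload[i:i + CELL_SIZE])
             for seq, i in enumerate(range(0, len(payload), CELL_SIZE))]
    per_link = defaultdict(list)
    for seq, cell in cells:
        per_link[seq % FABRIC_LINKS].append((seq, cell))   # round-robin spray

    # 3. The egress leaf reassembles by sequence number, whichever link each
    #    cell arrived on, before delivering to the target GPU's port.
    arrived = sorted(c for link_cells in per_link.values() for c in link_cells)
    return b"".join(cell for _, cell in arrived)

if __name__ == "__main__":
    sched = EgressScheduler(credits_bytes=64 * 1024)
    data = bytes(range(256)) * 8                # 2 KB toy "packet"
    assert send_packet(data, sched) == data     # reassembled in order
```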

  • diagram
Sharada Yeluri

Sr. Director of Engineering, Silicon and Systems Technology, @ Juniper Networks

4mo

Thanks, Aibing Zhou, for giving more insights on UEC congestion/incast control. The UEC NIC end-to-end congestion control is a combined sender- and receiver-based algorithm. The sender component deals with cases where the core of the fabric is oversubscribed (either by design or as a result of faulty links somewhere in the core). This, combined with no strict-ordering requirement, should result in a fabric that is as good as DSF when it comes to handling utilization and incast.
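Conceptually (this is my own simplified illustration, not the UEC specification), the NIC can be thought of as transmitting no more than the minimum of what the receiver has granted and what its sender-side window allows:

```python
# Conceptual sketch only -- not the UEC specification. It illustrates how a
# receiver-granted credit (pacing incast at the last hop) can be combined with
# a sender-side window (backing off when the fabric core is oversubscribed).
# All constants and method names here are assumptions for illustration.

class CombinedCongestionControl:
    def __init__(self, cwnd_bytes: int = 256 * 1024):
        self.cwnd = cwnd_bytes        # sender-side window (assumed starting size)
        self.receiver_credit = 0      # bytes granted by the destination NIC

    def on_grant(self, nbytes: int) -> None:
        # Receiver-based part: the destination paces sources into its buffers.
        self.receiver_credit += nbytes

    def on_congestion_signal(self) -> None:
        # Sender-based part: back off when the core marks/echoes congestion.
        self.cwnd = max(self.cwnd // 2, 4 * 1024)

    def on_ack(self, nbytes: int) -> None:
        # Gradually recover the sender window as acknowledged data drains.
        self.cwnd += nbytes // 8

    def sendable(self) -> int:
        # Transmit up to whichever constraint is tighter at this moment.
        return min(self.cwnd, self.receiver_credit)
```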

Chris Whyte

Principal Network Solutions Architect at Marvell Semiconductor

4mo

I have so many issues with DSFs, but let's first just focus on your statement that claims DSF offers hyperscalers several advantages: "It enables them to treat the distributed chassis as a large pizza box/single router, shifting the burden of network utilization/performance to switch vendors." From my experience, the one thing I've never heard a hyperscaler say is that they would like to shift the burden of some set of functions away from themselves and give it to switch vendors. Keep in mind, you're stating it as a "burden" when in fact it's not a burden but a function that allows the hyperscaler to have more control over the end-to-end solution. Certain switch vendors like DSFs because it gives them control by adding unnecessary complexity within the fabric. More complexity gives them the opportunity to sell more software and increase the COGS of the underlying hardware (e.g., by requiring deep buffers). In addition, it encourages them to build proprietary solutions that lock their customers into a single vendor, which is clearly evident in today's solutions.

Yossi Kikozashvili

Head of Product, AI Infrastructure

4mo

Hi :) The reasons scheduled fabric gains so much traction for GPU cluster deployments are pretty simple. First, it's really easy to use: unlike typical Ethernet, InfiniBand, or enhanced Ethernet (NIC-based scheduling), no tuning is needed; it simply works, regardless of the NIC you use and regardless of whether you invested days in tuning buffers and DCQCN thresholds. So ops simplicity is one thing. (BTW, that's one reason there's huge interest coming from enterprises that are short on the skill set and engineering force to tune buffers every time they deploy a model.) Second is the performance. It's simply better. ByteDance, at their OCP presentation, reported a 37% improvement for all-to-all-intensive workloads and a 20% improvement for all-reduce ones. That's of course compared to traditional Ethernet with hashing. Compared to packet-spraying-capable solutions, the advantage of scheduled fabric is reduced, but it is very far from being "on par". I had the luck and joy of working with ByteDance on this specific project, so I know :) Lastly, it's field proven. One fact some folks might not know: AT&T's next-gen core network is built with these scheduled fabrics, replacing legacy chassis with a resilient leaf-and-spine architecture.

Jeff Tantsura

Distinguished @Nvidia, building the best network and technologies to connect GPUs

4mo

100,000 GPUs, in production ;-)

Mike Reznikov

Network|Math|ML Acceleration (C/C++/RTL)

4mo

Sharada Yeluri I don't understand the numbers. Why are there so few GPUs (144) when testing large-scale functions? Thank you for the post.

David Williams

Senior Network Engineer

4mo

Love your posts. You always express complexity in a way that is consumable for those of us who have not tackled the kind of problems that are usually only handled by vendors or Goliath corporations. I have wanted to dig into the comparisons you make in this post. What jumped out at me was that "DSF, in a sense, is a closed vendor solution." Are there detailed documents available for the control plane solutions used with DSF, or is that all intellectual property that is not available to the public?


Super insightful analysis as always, Sharada. As you know, we (Juniper) have so much first-hand experience (the good and the bad) with DSFs from past projects. Can't wait to see how this all plays out; yet another reason why it's an exciting time to be a network technologist today!

Weiqiang Cheng

Research Institute of China Mobile - Technical Manager

4mo

This is an interesting topic! Last May, China Mobile initiated the establishment of a similar organization called GSE (Global Scheduling Ethernet). We have released a DSF that is similar to Broadcom's but is entirely based on an open Ethernet standard. Unlike Broadcom's DSF, GSE operates with Ethernet packets between the spine and leaf, rather than cells, and the VoQ is dynamically established, so there's no need to worry about excessive buffer usage. We have developed switches based on this standard and tested them in clusters with around one thousand GPUs. The all-to-all efficiency is comparable to ByteDance's test results, but during model training it shows even greater improvements, over 20% compared to traditional networks. Additionally, the overall cost seems to have an advantage: the combination of scheduled switches and simple NICs can be more cost-effective than using simple switches with smart NICs.
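Purely as an illustration of the "dynamically established VoQ" idea (my own sketch, not GSE's actual implementation), queues can be created on demand when traffic to a given egress appears and reclaimed once drained, instead of pre-allocating a deep queue per possible destination:

```python
# Illustrative sketch only -- not GSE's implementation. VOQs are created when
# the first packet for an (egress, priority) pair arrives and are reclaimed
# when they drain, so buffer usage tracks active destinations only.
from collections import deque

class DynamicVOQTable:
    def __init__(self):
        self.voqs = {}  # (egress_leaf, priority) -> deque of packets

    def enqueue(self, egress: int, prio: int, pkt: bytes) -> None:
        self.voqs.setdefault((egress, prio), deque()).append(pkt)

    def dequeue(self, egress: int, prio: int):
        q = self.voqs.get((egress, prio))
        if not q:
            return None
        pkt = q.popleft()
        if not q:                      # reclaim queue state once it is empty
            del self.voqs[(egress, prio)]
        return pkt
```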

Dennis Qin

AI/HPC infrastructure product manager at Supermicro

4mo

How about we step back a little bit? With two endpoints connected back to back with Ethernet, say 10G, the maximum you can get is 10G line rate; if you don't see throughput at 10G, then it's a problem with the endpoint application software stack. If you want to share this 10G Ethernet link among many apps and constantly keep it running at 10G line rate, then you need to work with the app stacks together to achieve this. Different apps may need different tweaks to make that happen, so how do we make Ethernet work with all the apps to achieve this? We developed TCP/IP on top of Ethernet, but then we found it doesn't work well for AI/LLM-type workloads, so we picked up IB, or DPUs with TCP/IP, etc., to achieve this. My 2 cents.

Nitin Kumar

Innovative Leader Driving Revenue Across Key Market Segments | Expert in Routing Solutions & Customer Engagement | Quality Champion & Diversity Advocate | Mentor & Coach

4mo

Thanks for the synopsis, Sharada. As usual, the battle between standards-compliant solutions (which are usually slower and, despite best efforts, take time to settle down) and vendor-specific solutions (agility being the advantage here, so long as they can strongly differentiate and showcase value compared to standards-based solutions) will continue until either (1) the standards-based solutions are so good that incremental improvements in vendor-specific solutions are not enough to warrant any consideration, or (2) vendor-specific solutions continue to clearly out-innovate the standards-based solutions and keep the lead. Both will continue to persist, maybe for different use cases. In any DC deployment where leaf and spine switches from different vendors are used, a standards-based solution, as you pointed out, is the only choice. But I believe there will be enough vendor-specific pod deployments (an entire pod built with leaves/spines from the same vendor) to justify DSF-like vendor-specific solutions. The push to commoditize any high-volume, high-cost devices/NPUs/NICs to continuously reduce costs will ultimately win for sure.
