A closer look at Nvidia's 120kW DGX GB200 NVL72 rack system

GTC Nvidia revealed its most powerful DGX server to date on Monday. The 120kW rack scale system uses NVLink to stitch together 72 of its new Blackwell accelerators into what's essentially one big GPU capable of more than 1.4 exaFLOPS of performance — at FP4 precision, anyway.

At GTC this week, we got a chance to take a closer look at the rack scale system, which Nvidia claims can support large training workloads as well as inference on models up to 27 trillion parameters — not that there are any models that big just yet.

Nvidia's DGX GB200 NVL72 is a rack scale system that uses NVLink to mesh 72 Blackwell accelerators into one big GPU (click to enlarge)

Dubbed the DGX GB200 NVL72, the system is an evolution of the Grace-Hopper Superchip-based rack systems Nvidia showed off back in November. However, this one is packing more than twice as many GPUs.

While the 1.36 metric ton (3,000 lb) rack system is marketed as one big GPU, it's assembled from eighteen 1U compute nodes, each of which is equipped with two of Nvidia's 2,700W Grace-Blackwell Superchips (GB200).

Here we see two GB200 Superchips, minus heat spreaders and cold plates, in a 1U liquid-cooled chassis (click to enlarge)

You can find more detail on the GB200 in our launch day coverage, but in a nutshell, the massive parts use Nvidia's 900GBps NVLink-C2C interconnect to mesh together a 72-core Grace CPU with a pair of top-specced Blackwell GPUs.

In total, each Superchip comes equipped with 864GB of memory — 480GB of LPDDR5x and 384GB of HBM3e — and according to Nvidia, can push 40 petaFLOPS of sparse FP4 performance. This means each compute node is capable of producing 80 petaFLOPS of AI compute and the entire rack can do 1.44 exaFLOPS of super-low-precision floating point mathematics.
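If you want to check those sums yourself, the arithmetic is straightforward. Here's a minimal sketch in Python using only the figures Nvidia quotes:

```python
# Back-of-the-envelope check of the quoted sparse FP4 figures.
SUPERCHIP_PFLOPS = 40      # petaFLOPS per GB200 Superchip (Nvidia's figure)
SUPERCHIPS_PER_NODE = 2    # two GB200s per 1U compute node
NODES_PER_RACK = 18        # eighteen compute nodes in the NVL72

node_pflops = SUPERCHIP_PFLOPS * SUPERCHIPS_PER_NODE   # 80 petaFLOPS per node
rack_eflops = node_pflops * NODES_PER_RACK / 1_000     # 1.44 exaFLOPS per rack

print(f"{node_pflops} PFLOPS per node, {rack_eflops:.2f} EFLOPS per rack")
```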

Nvidia's Grace-Blackwell Superchip, or GB200 for short, combines a 72-core Arm CPU with a pair of 1,200W GPUs (click to enlarge)

At the front of the system are the four InfiniBand NICs — note the four QSFP-DD cages on the left and center of the chassis' faceplate — which form the compute network. The systems are also equipped with a BlueField-3 DPU, which we're told is responsible for handling communications with the storage network.

In addition to a couple of management ports, the chassis also features four small form-factor NVMe storage caddies.

The NVL72's 18 compute nodes come as standard with four ConnectX InfiniBand NICs and a BlueField-3 DPU (click to enlarge)

With two GB200 Superchips and five NICs, we estimate the nodes are capable of consuming between 5.4kW and 5.7kW apiece. The vast majority of this heat will be carried away by direct-to-chip (DTC) liquid cooling. The DGX systems Nvidia showed off at GTC didn't have cold plates fitted, but we did get a look at a couple of prototype systems from partner vendors, like this one from Lenovo.
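For the curious, that per-node estimate comes from sums along these lines. The Superchip figure is Nvidia's; the per-NIC draw is our own ballpark assumption rather than a published spec:

```python
# Rough per-node power estimate. The Superchip figure is Nvidia's;
# the per-NIC/DPU envelope is our own assumption, not a published spec.
GB200_WATTS = 2_700                      # per Grace-Blackwell Superchip
SUPERCHIPS_PER_NODE = 2
NICS_PER_NODE = 5                        # four ConnectX NICs plus a BlueField-3 DPU
NIC_WATTS_LOW, NIC_WATTS_HIGH = 0, 60    # assumed draw per NIC/DPU

low_w = GB200_WATTS * SUPERCHIPS_PER_NODE + NIC_WATTS_LOW * NICS_PER_NODE    # 5,400 W
high_w = GB200_WATTS * SUPERCHIPS_PER_NODE + NIC_WATTS_HIGH * NICS_PER_NODE  # 5,700 W

print(f"per node: {low_w/1000:.1f}-{high_w/1000:.1f} kW; "
      f"18 nodes: {low_w*18/1000:.0f}-{high_w*18/1000:.0f} kW before switches")
```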

While the GB200 systems Nvidia had on display didn't have cold plates installed, this Lenovo prototype shows what it might look like in production (click to enlarge)

However, unlike some HPC-centric nodes we've seen from HPE Cray or Lenovo's Neptune line, which liquid-cool everything, Nvidia has opted to cool low-power peripherals like NICs and system storage with conventional 40mm fans.

During his keynote, CEO and leather jacket aficionado Jensen Huang described the NVL72 as one big GPU. That's because all 18 of those super dense compute nodes are connected to one another by a stack of nine NVLink switches situated smack dab in the middle of the rack.

In between the NVL72's compute nodes is a stack of nine NVLink switches, which provide 1.8TBps of bidirectional bandwidth to each of the system's 72 GPUs (click to enlarge)

This is the same tech Nvidia's HGX nodes have used to make their eight GPUs behave as one. But rather than baking the NVLink switch into the carrier board, as in the Blackwell HGX shown below, the NVL72 breaks it out into a standalone appliance.

The NVLink switch has traditionally been integrated into Nvidia's SXM carrier boards, like the Blackwell HGX board shown here (click to enlarge)

Inside each of these switch appliances is a pair of Nvidia's NVLink 7.2T ASICs, which provide a total of 144 100GBps links. With nine NVLink switches per rack, that works out to 1.8TBps — 18 links — of bidirectional bandwidth to each of the 72 GPUs in the rack.
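The link arithmetic is easy to verify. A quick sketch, using the counts above:

```python
# NVLink topology arithmetic for the NVL72, using the figures quoted above.
LINKS_PER_SWITCH = 144     # two NVLink ASICs per switch sled
LINK_GBPS = 100            # GBps per NVLink link
SWITCHES_PER_RACK = 9
GPUS_PER_RACK = 72

total_links = LINKS_PER_SWITCH * SWITCHES_PER_RACK    # 1,296 links in the rack
links_per_gpu = total_links // GPUS_PER_RACK          # 18 links per GPU
gpu_tbps = links_per_gpu * LINK_GBPS / 1_000          # 1.8 TBps per GPU

print(f"{total_links} links -> {links_per_gpu} links and {gpu_tbps:.1f} TBps per GPU")
```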

Shown here are the two 5th-gen NVLink ASICs found in each of the NVL72's nine switch sleds (click to enlarge)

Both the NVLink switch and compute sleds slot into a blind-mate backplane with more than 2 miles (3.2 km) of copper cabling. Peering through the back of the rack, you can vaguely make out the massive bundle of cables responsible for meshing the GPUs together so they can function as one.

If you look closely, you can see the massive bundle of cables that form the rack's NVLink backplane (click to enlarge)

The decision to stick with copper cabling over optical might seem like an odd choice, especially considering the amount of bandwidth we're talking about, but apparently all of the retimers and transceivers necessary to support optics would have added another 20kW to the system's already prodigious power draw.

This may explain why the NVLink switch sleds sit between the two banks of compute nodes, as doing so keeps cable lengths to a minimum.

At the very top of the rack sit a couple of 52-port Spectrum switches — 48 gigabit RJ45 ports plus four 100Gbps QSFP28 aggregation ports. From what we can tell, these switches are used for management and for streaming telemetry from the various compute nodes, NVLink switch sleds, and power shelves that make up the system.

At the top of the NVL72 we find a couple of switches and three of its six power shelves (click to enlarge)

Just below these switches are the first of six power shelves visible from the front of the NVL72 — three toward the top of the rack and three at the bottom. We don't know much about them other than that they're responsible for keeping the 120kW rack fed with power.

Based on our estimates, six 415V, 60A PSUs would be enough to cover that, though presumably Nvidia or its hardware partners have built some level of redundancy into the design. That leads us to believe these might be operating at more than 60A. We've asked Nvidia for more details on the power shelves; we'll let you know what we find out.
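To show our working — modeling the rack as six 415V, 60A feeds, per the estimate above — the sizing math looks like this:

```python
# Power-shelf sizing estimate. Modeling the rack as six 415V, 60A feeds is
# our simplification; Nvidia hasn't published the shelf specs.
VOLTS = 415
AMPS = 60
FEEDS = 6
RACK_KW = 120

capacity_kw = VOLTS * AMPS * FEEDS / 1_000   # ~149 kW nominal
spare_kw = capacity_kw - RACK_KW             # ~29 kW of headroom - not much for redundancy

print(f"{capacity_kw:.0f} kW of nominal capacity vs a {RACK_KW} kW load "
      f"({spare_kw:.0f} kW of headroom)")
```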

However they're doing it, the power is delivered by a hyperscale-style DC bus bar that runs down the back of the rack. If you look closely, you can just make out the bus bar running down the middle of the rack.

According to CEO Jensen Huang, coolant is designed to be pumped through the rack at 2 liters per second (click to enlarge)

Of course, cooling 120kW of compute isn't exactly trivial. But with chips getting hotter and compute demands growing, we've seen an increasing number of bit barns, including Digital Realty and Equinix, expand support for high-density HPC and AI deployments.

In the case of Nvidia's NVL72, both the compute and NVLink switches are liquid cooled. According to Huang, coolant enters the rack at 25C at two liters per second and exits 20 degrees warmer.
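Those figures line up reasonably well with the rack's power draw. A quick sanity check, assuming the coolant behaves roughly like water — our assumption, not a published spec:

```python
# Sanity check: can 2 L/s of coolant with a 20C temperature rise absorb ~120 kW?
# Water-like density and specific heat are our assumptions, not Nvidia's spec.
FLOW_L_PER_S = 2.0
DELTA_T_C = 20.0
DENSITY_KG_PER_L = 1.0      # roughly water
SPECIFIC_HEAT_KJ = 4.186    # kJ/(kg*K), roughly water

heat_kw = FLOW_L_PER_S * DENSITY_KG_PER_L * SPECIFIC_HEAT_KJ * DELTA_T_C
print(f"~{heat_kw:.0f} kW of heat carried away")   # ~167 kW, comfortably above 120 kW
```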

If the DGX GB200 NVL72's 13.5 TB of HBM3e and 1.44 exaFLOPS of sparse FP4 ain't cutting it, eight of them can be networked together to form one big DGX Superpod with 576 GPUs.
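Scaling those rack-level figures up to a Superpod is simple multiplication — a quick sketch using the numbers above:

```python
# Superpod scaling, using the rack-level figures quoted above.
RACKS_PER_SUPERPOD = 8
GPUS_PER_RACK = 72
RACK_EFLOPS_FP4 = 1.44      # sparse FP4
RACK_HBM3E_TB = 13.5

gpus = RACKS_PER_SUPERPOD * GPUS_PER_RACK          # 576 GPUs
eflops = RACKS_PER_SUPERPOD * RACK_EFLOPS_FP4      # ~11.5 exaFLOPS sparse FP4
hbm_tb = RACKS_PER_SUPERPOD * RACK_HBM3E_TB        # 108 TB of HBM3e

print(f"{gpus} GPUs, ~{eflops:.1f} EFLOPS sparse FP4, {hbm_tb:.0f} TB HBM3e")
```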

Eight DGX NVL72 racks can be strung together to form Nvidia's liquid cooled DGX GB200 Superpod (click to enlarge)

And if you need even more compute to support large training workloads, additional Superpods can be added to further scale out the system. This is exactly what Amazon Web Services is doing with Project Ceiba. Initially announced in November, the AI supercomputer is now using Nvidia's DGX GB200 NVL72 as a template. When complete, the machine will reportedly have 20,736 GB200 accelerators. However, that system is unique in that Ceiba will use AWS' homegrown Elastic Fabric Adapter (EFA) networking, rather than Nvidia's InfiniBand or Ethernet kit.

Nvidia says its Blackwell parts, including its rack scale systems, should start hitting the market later this year. ®
