A few months ago, we extended the JURECA Evaluation Platform1 at JSC with two nodes featuring AMD Instinct MI250 GPUs (four GPUs each). The nodes are Gigabyte G262-ZO0 servers, each with two AMD EPYC 7443 processors (24 cores per socket, SMT-2) and four MI250 GPUs (128 GB of memory each).

  1. OSU Bandwidth Micro-Benchmark
    1. A100 Comparison
  2. GPU STREAM Variant
    1. Data Size Scan
    2. Threads and Data Sizes
  3. Conclusion
  4. Technical Details
    1. OSU Microbenchmarks
    2. STREAM Variant
    3. Evaluation Notebooks
    4. Post Changelog

We’ve deployed the nodes somewhat silently in the spring and have been polishing them and getting to know them ever since. We started off with a pre-GA software stack; by now we run the publicly available ROCm 5.2. There are still some minor issues with the nodes, but the GPUs themselves are running well enough to finally show some very basic benchmarks!

Still on pre-GA software, we also held an AMD Porting Workshop, in which we worked together with application developers and AMD to enable the first users of the system. Despite the unfinished, preliminary software environment, we achieved some interesting results. Check them out on the workshop’s Indico!

But now, let’s understand the devices better by looking at the OSU bandwidth micro-benchmark and a GPU variant of the STREAM benchmark. Plenty of graphs follow; click on them to enlarge. Find some technical details at the end.

OSU Bandwidth Micro-Benchmark

First off, the one-directional bandwidth micro-benchmark from the OSU microbenchmark suite, osu_bw. It is usually used for testing MPI connections, but can also be abused to get a glimpse of inter-device bandwidths. See a dedicated section at the end for technical details.
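
For illustration, here is a minimal HIP sketch of a much simpler probe of a single GCD-to-GCD path. It is not what osu_bw does internally (osu_bw sends MPI messages between two ranks; the exact invocation is in the technical details at the end), just a timed hipMemcpyPeer between two device buffers:

#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const int src = 0, dst = 1;            // any two GCD IDs
    const size_t bytes = 64ull << 20;      // 64 MiB, the large message size used in the plots below

    void *s = nullptr, *d = nullptr;
    hipSetDevice(src); hipMalloc(&s, bytes);
    hipSetDevice(dst); hipMalloc(&d, bytes);

    hipMemcpyPeer(d, dst, s, src, bytes);  // warm-up
    hipDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    hipMemcpyPeer(d, dst, s, src, bytes);  // direct or staged device-to-device copy
    hipDeviceSynchronize();
    double t = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    printf("GCD %d -> GCD %d: %.1f GiB/s\n", src, dst, bytes / t / (1ull << 30));
    return 0;
}

A single, synchronous copy like this will typically report lower numbers than osu_bw, which keeps many messages in flight; it is only meant as a quick sanity check of one connection.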

Multiple measurements of the osu_bw microbenchmark on AMD MI250 GCDs

The picture shows bandwidth data for two message sizes, large (64 MiB, left) and small (4 MiB, right). Each color-coded box contains the bandwidth of a message going from the GPU with one ID to the GPU with another ID. Also included are messages going from a GPU to itself – for example from GPU 0 to GPU 02.

One immediately sees that there are not four GPU IDs but eight. That is a feature of the MI250 GPUs: Each MI250 is built as a multi-chip module (MCM), with two GPU dies contained in each MI250 device package. Each GPU die is very similar to an AMD Instinct MI100 GPU and has access to half (64 GB) of the total memory. From a software perspective, each MI250 GPU is presented as two GPUs and needs to be used as such. For most practical purposes, it is much simpler to think of the system with four MI250 GPUs as a system of eight MI250 GPUlets. The proper name for a GPUlet is Graphics Compute Die (GCD), which is the label used in the picture.
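
The eight-fold view is also exactly what the runtime reports. A minimal HIP snippet to make that visible – compiled with hipcc, it should list eight devices on these nodes, each with its own 64 GB share of memory:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);             // one device per GCD, i.e. 8 on a node with four MI250s
    for (int id = 0; id < count; ++id) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, id);
        printf("Device %d: %s, %.1f GiB\n", id, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}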

Even from a bird’s-eye view one can immediately see the clusters of two GCDs which belong together and form one GPU; like GCDs 0 and 1, displayed as a blue 2-by-2 box, GCDs 2 and 3, etc., all on the main diagonal. The reason: the two GCDs of one GPU are connected to each other with many links and reach great bandwidths; for the large message size usually around 155 GiB/s.

Implicitly, the clusters tell us even more about the inter-GPU connections: There are not only blue 2-by-2 boxes, but also green and yellow boxes. Focusing on the first row, with bandwidths from GCD 0 to the other GCDs, one can see that the bandwidths to GCDs 2+3 and GCDs 6+7 are each around 40 GiB/s, while the bandwidths to GCDs 4+5 are around 80 GiB/s.

Block diagram of AMD MI250 Node Topology for Mainstream HPC Installations.
Diagram shared by AMD.

The entire structure is the result of the complex connection topology of the GPUs. Each GCD has eight Infinity Fabric ports, with each Infinity Fabric link having a peak bandwidth of 50 GB/s3 in one direction. Within a GPU, the two GCDs are connected by four Infinity Fabric links, amounting to a peak bandwidth of 200 GB/s (or 400 GB/s, if you add up both directions). Going out of the MCM, things are a bit more convoluted. There are GCDs which are connected to other GCDs with two direct links (like GCD 1 → GCD 4), and GCDs connected to other GCDs with one direct link (like GCD 0 → GCD 2). Through their respective partner GCDs, there may be additional indirect links. In addition, there are Infinity Fabric links going to the PCIe switch and from there to the network or CPU. If you look closely, you can also see the indirect connections in the bandwidth pattern of the picture (like GCD 0 → GCD 4 being slightly faster than GCD 0 → GCD 5, although 4 and 5 are part of the same package).
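
If you want to poke at this topology from code, HIP exposes a few peer-to-peer attributes. A minimal sketch of my own (hipDeviceCanAccessPeer only tells you whether a direct peer path exists at all, and I have not checked how faithfully the performance rank reflects the actual number of Infinity Fabric links on these nodes):

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    hipGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) { printf("   . "); continue; }
            int access = 0, rank = 0;
            hipDeviceCanAccessPeer(&access, src, dst);
            hipDeviceGetP2PAttribute(&rank, hipDevP2PAttrPerformanceRank, src, dst);
            printf("%d/%-2d ", access, rank);   // peer access possible / relative performance rank
        }
        printf("\n");
    }
    return 0;
}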

All in all, it’s a hell of a complex pattern and I’m curious about the load imbalances of future Multi-GCD applications…

Now that we know how the patterns come to be, we can look at the bandwidth usage relative to the various peaks. Enable relative numbers by clicking on the “Relative” toggle below the picture up top. There is good utilization of around 90% for the direct connections, and 80% for the indirect connections. For the smaller message size the picture looks similar, albeit about 20 percentage points (pp) lower for the direct connections (indirect: 10 pp).

A100 Comparison

I also ran the micro-benchmark in the same fashion on a regular GPU node of JURECA DC with four NVIDIA A100 GPUs.

Multiple measurements of the osu_bw microbenchmark on NVIDIA A100 GPUs.

The first thing to notice is the uniformity of the connections. In the node design we deploy on JURECA DC, there are always four NVLink 3 connections between each pair of GPUs – 87 GiB/s for all possible connections (for large message sizes). Using the memory on the same GPU, 592 GiB/s are reached; roughly 130 GiB/s more than on an MI250 GCD. In terms of relative performance – which can be viewed by flipping the switch below the picture – the links to other GPUs are utilized to 93%, the own-memory accesses to 41%.

Time will tell if there is more software tuning room available for the MI250s or if the difference is part of the architectural choices. Noteworthy: One MI250 (i.e. two GCDs) has a TDP of 560 W, while one A100 has 400 W.

Expand here to display pictures comparing MI250 and A100 next to each other.

GPU STREAM Variant

Another simple benchmark to test certain aspects of a device’s memory is the STREAM benchmark, of which I ran my own GPU variant on the MI250s. I used an old CUDA code which I HIPified with the hipify-perl tool; it ran without a single further change. Quite amazing.
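
To give an idea of what hipify-perl does – a generic sketch, not the actual diff of my STREAM code – the tool essentially renames the CUDA runtime API calls to their HIP counterparts, while the kernel code and the launch syntax stay untouched:

#include <hip/hip_runtime.h>                    // was: #include <cuda_runtime.h>

__global__ void copy(const double* in, double* out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;   // kernel code is unchanged
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(double);
    double *a = nullptr, *b = nullptr;
    hipMalloc((void**)&a, bytes);               // was: cudaMalloc
    hipMalloc((void**)&b, bytes);               // was: cudaMalloc
    const unsigned blocks = (unsigned)((n + 255) / 256);
    copy<<<blocks, 256>>>(a, b, n);             // the <<<...>>> launch syntax stays as-is
    hipDeviceSynchronize();                     // was: cudaDeviceSynchronize
    hipFree(a); hipFree(b);                     // was: cudaFree
    return 0;
}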

Data Size Scan

Two plots (linear, logarithmic) with results of all four STREAM Microbenchmarks for increasing amounts of data. An inset in the linear plot focuses on the maximum bandwidths for very large data sizes.

One GCD reaches around 1.42 TB/s for the copy kernel and about 1.34 TB/s for the triad kernel when the data size is large enough, as the inset view of the above linear plot shows (left). For triad, this is about 82% of the theoretically available peak. The double-logarithmic plot (right) shows nicely that the bandwidth increases regularly (following a power law) and that the maximum is reached at around \(2^{26}\) Byte (64 MiB).
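
For reference: with the usual STREAM accounting – which this GPU variant presumably follows – copy and scale move two arrays per element, add and triad three. For vectors of \(N\) doubles the reported bandwidth is therefore \(\mathrm{BW} = k \cdot N \cdot 8\,\mathrm{Byte} / t\), with \(k = 2\) for copy/scale, \(k = 3\) for add/triad, and \(t\) the kernel runtime.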

Below the plot, there’s a switch to show results for the A100. The A100 has a slightly lower peak memory bandwidth than an MI250 GCD, but reaches nearly identical values for the copy (1.42 TB/s) and triad (1.35 TB/s) kernels of the benchmark – resulting in a utilization of 87% of the available peak. The data point at \(2^{23}\) Byte (8 MiB) is a weird, systematic outlier which reaches the peak (or even goes beyond it).

It is interesting how closely an MI250 GCD matches the performance of an A100 GPU. In the following plot, I compare the triad bandwidths directly.

Linear and double-logarithmic comparison of triad bandwidths. The A100 is always a tiny bit faster, except for the at-peak bandwidths. There, it is only a tiny-tiny bit faster (1.399 TB/s vs. 1.349 TB/s).

Especially in the double-log plot one can see that the A100 is always a tiny bit faster. After the weird outlier, it matches the MI250 GCD bandwidth much more closely. Still, for the final value, the A100 is about 3.6% faster than the MI250 GCD.

Threads and Data Sizes

STREAM Kernels for 3 data sizes and 4 numbers of threads per block on MI250 GCD

To understand how well the memory can be accessed depending on the number of threads per block (work items in a workgroup, in AMD terminology), the picture above shows four plots – one for each of the STREAM kernels. The x axis always shows three data sizes – 0.5 GiB, 2 GiB, and 8 GiB – values on the larger side of things and on the plateau of the previous STREAM plots. On the y axis, four typical values for threads per block are chosen.

It appears that 256 threads per block is always a good choice. So that’s going to be my go-to default for the future. You can view relative values for the bandwidth utilization by flipping the switch below the picture – the utilization is between 76% and 88%. It’s worthwhile to run a simple test like this once for your actual application, as the number of threads per block can in most cases be chosen somewhat freely and may offer an improvement of up to 7 pp (see the add kernel for 2 GiB).
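
Here is a minimal sketch of such a scan for the triad kernel (my illustration, not the exact benchmark code from the repository): the threads-per-block value is just a launch parameter, and the grid size follows from the vector length, so it can be varied freely.

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void triad(double* a, const double* b, const double* c, double scalar, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + scalar * c[i];     // STREAM triad: two loads, one store per element
}

int main() {
    const size_t bytes = 2ull << 30;            // 2 GiB per vector, one of the plotted sizes
    const size_t n = bytes / sizeof(double);
    double *a, *b, *c;
    hipMalloc((void**)&a, bytes); hipMalloc((void**)&b, bytes); hipMalloc((void**)&c, bytes);
    hipMemset(b, 0, bytes); hipMemset(c, 0, bytes);

    hipEvent_t start, stop;
    hipEventCreate(&start); hipEventCreate(&stop);

    for (int tpb : {64, 128, 256, 512}) {       // threads per block / work items per workgroup
        const unsigned blocks = (unsigned)((n + tpb - 1) / tpb);
        triad<<<blocks, tpb>>>(a, b, c, 3.0, n);              // warm-up
        hipEventRecord(start);
        triad<<<blocks, tpb>>>(a, b, c, 3.0, n);
        hipEventRecord(stop);
        hipEventSynchronize(stop);
        float ms = 0.0f;
        hipEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block: %.0f GiB/s\n", tpb,
               3.0 * bytes / (ms * 1e-3) / (1ull << 30));     // triad moves three arrays
    }
    hipFree(a); hipFree(b); hipFree(c);
    return 0;
}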

STREAM Kernels for 3 data sizes and 4 numbers of threads per block on A100 GPU

At first glance, the behavior of the A100 looks very similar. And – as expected – it achieves higher bandwidths and higher relative utilization. Note the different color scales: the lower bound for the A100 is 1270 GiB/s, not the 1140 GiB/s of the MI250. On a second look, there seems to be a different underlying trend in the behavior of the A100. For the kernels with one input vector (copy, scale), the A100 seems to prefer fewer threads and larger data sizes. For the kernels with two input vectors (add, triad), the last column for 8 GiB is interesting, as the bandwidth drops by 20 GiB/s when going from 128 threads to more threads. All of this is probably not very relevant for real-world applications, but fun to see!

Expand here to display pictures comparing MI250 and A100 directly via a toggle switch.
STREAM Kernels for 3 data sizes and 4 numbers of threads per block on MI250 GCD
STREAM Kernels for 3 data sizes and 4 numbers of threads per block on A100 GPU

Conclusion

The AMD Instinct MI250, the GPU design that breaks the Exascale barrier in Frontier4, is quite a powerful GPU, featuring up to 90 TFLOP/s of FP64 performance. We deployed two nodes with four MI250s each in JURECA DC as part of an Evaluation Platform at the beginning of 2022. After some setup time, the nodes can now be used for tests. Results from an early porting workshop can be found online, and Moritz Lehmann has just published a paper with results obtained on the machine.

I used the bandwidth experiment of the OSU Microbenchmarks to study the connections between the GPUs of a node with MPI. One can see that each MI250 consists of two Graphics Compute Dies (GCDs), which are basically two individual GPUlets on one GPU. The obtainable bandwidths are diverse, due to the complex connection matrix between the GCDs. Bandwidths between GCDs on the same GPU are usually about 150 GiB/s; between GCDs of different GPUs they range from 40 GiB/s to 80 GiB/s. I also showed results for A100 GPUs, which have much more homogeneous connections, with a constant 87 GiB/s between the GPUs.

As a second experiment, I ran a CUDA variant of the STREAM benchmark, which I HIPified easily for AMD. When increasing the data size, one can see that the memory bus saturates at data sizes around 64 MiB, eventually reaching a bandwidth of 1.42 TB/s – about 87% of the available peak of the GCD5. Looking at different numbers of threads per block, 256 threads seems to be a good choice, memory-wise. In comparison to the A100 GPU, the obtained bandwidth is surprisingly similar (the A100 is slightly faster, though) – even though the peak bandwidth of the A100 is a little lower.

Each GCD seems to be similar to an A100 in many ways. For the bandwidth-focused benchmarks shown, an MI250 GCD is usually a little slower and a little less efficient than the A100 – but it uses about 30% less power (280 W per GCD vs. 400 W). Quite interesting devices.

Technical Details

Benchmarks were performed on the AMD Instinct MI250 nodes of JURECA DC’s Evaluation Platform. While the systems run publicly available software and firmware versions, the benchmarks were run while we were still getting to know the systems. Please let me know if you discover errors or get significantly different results on another machine. The evaluation notebooks are linked below.

The following software and versions were used:

  • ROCm 5.2.0
  • ROCm driver 5.16.9.22.20
  • CUDA 11.5
  • CUDA driver 510.47.03.
  • UCX 1.12.1 (with UCX_TLS=rc_x,self,sm,rocm_copy,rocm_ipc for ROCm and UCX_TLS=rc_x,self,sm,cuda_ipc,gdr_copy,cuda_copy for CUDA)
  • OpenMPI 4.1.2

OSU Microbenchmarks

Version 5.9; compiled as per official OpenUCX instructions:

./configure --enable-rocm --with-rocm=/opt/rocm CC=$(which mpicc) CXX=$(which mpicxx) LDFLAGS="-L$EBROOTOPENMPI/lib/ -lmpi -L/opt/rocm/lib $(hipconfig -C)" CPPFLAGS="-std=c++11"

Run by setting HIP_VISIBLE_DEVICES=A,B, like:

HIP_VISIBLE_DEVICES=0,1 \
srun -n 2 mpi/pt2pt/osu_bw -d rocm -m 4194304:4194304 D D

STREAM Variant

Base code from my GitHub – github.com/AndiH/CUDA-Cpp-STREAM – and then compiled for AMD as follows:

hipify-perl CUDA-Cpp-STREAM/stream.cu > stream.cu.hip
HIP_PLATFORM=amd hipcc --offload-arch=gfx90a -o hip-stream stream.cu.hip

Run by looping through data sizes:

./stream -n $((2**0)) -t --csv -f | tee file.csv && \
for i in {1..28}; do \
	./stream -n $((2**$i)) --csv -f; \
done | tee -a file.csv

Evaluation Notebooks

The graphs presented here are created in Jupyter Notebooks with Pandas, Matplotlib, and Seaborn. Find the Notebooks here for reference, including the evaluation and raw data.

Post Changelog

Since publication of this blog post, the following edits were made:

  • 2022-Sep-21: Replaced MI250 node connections diagram by a corrected version, shared directly by AMD (not yet in a Whitepaper).
  1. Actually, the Evaluation Platform was created together with the AMD nodes! 

  2. The data rate on each GPU itself gives only a rough idea about the memory bandwidth; it’s not a proper memory benchmark because of the implementation and indirections – STREAM is much better suited for that. For STREAM, see further down in the text. 

  3. Infinity Fabric is also called xGMI. One xGMI lane can do 25 Gbit/s, and there seem to be 16 lanes per link. So, one Infinity Fabric connection can do 50 GB/s. 

  4. Actually, Frontier does not deploy MI250s but MI250Xs. The difference is mainly in the number of compute units: the MI250X has 220 and the MI250 has 208. There are performance differences because of this (like 95.7 TFLOP/s peak vs. 90.5 TFLOP/s), but no direct differences relating to memory. An additional difference in the design of Frontier relates to the CPU: the GPUs are directly connected via coherent Infinity Fabric links to a single CPU – no PCIe, no two CPU sockets. 

  5. The advertised 3276.8 GB/s peak memory bandwidth is actually for the full GPU. I divided by two to get the per-GCD bandwidth: 1638.4 GB/s.