NVIDIA GPU/InfiniBand HPC Setup

Caution

These are experimental instructions as of April 2021. This document is currently a work in progress.

These instructions are intended for multi-node, multi-GPU NVIDIA HPC clusters that use an InfiniBand-based network fabric for high-bandwidth applications. Topics such as resource management are out of scope for this document. This section covers how to install all the drivers and software needed to run M-Star CFD in a multi-node GPU environment. Where possible, you are encouraged to refer to the authoritative documentation indicated in each section.

Operating systems supported

  • CentOS/Red Hat 7.x

  • Ubuntu 18

Hardware requirements

Supported GPUs

GPUs should be connected via NVLink/NVSwitch.

  • NVIDIA® Tesla™ / Quadro K-Series or Tesla™ / Quadro™ P-Series GPU

Supported HCAs

  • ConnectX®-4 (VPI/EN)

  • ConnectX®-4 Lx

  • ConnectX®-5 (VPI/EN)

  • ConnectX®-6 (VPI/EN)

  • ConnectX®-6 Dx

  • ConnectX®-6 Lx

Note

To achieve the best performance for GPUDirect RDMA, it is required that both the HCA and the GPU be physically located on the same PCIe I/O root complex.
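
Once the NVIDIA driver is installed, one quick way to check this placement (a suggested check, not from the referenced vendor documents) is to inspect the topology reported by the driver:

# Show the PCIe/NVLink topology between GPUs and Mellanox HCAs.
# PIX/PXB between a GPU and an mlx5 device means they sit under the same PCIe host bridge;
# PHB or SYS means traffic crosses the host bridge or the inter-socket link.
nvidia-smi topo -m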

What is being installed

  • CUDA 11.1 driver

  • CUDA 11.1 toolkit

  • MLNX_OFED 5.2

  • NV PEER MEM 1.1

  • GDR COPY 2.2

  • HPC-X 2.8.1

Version interdependency

  • The CUDA version should match the version HPC-X was compiled against. For example, HPC-X 2.8.1 was compiled with CUDA 11.1, so you should install CUDA 11.1. If a newer driver version is already installed on your system, install the CUDA 11.1 toolkit alongside it so that it can be referenced in later stages of this installation; the version checks sketched after this list can confirm what is currently installed.

  • HPC-X requires MLNX_OFED 4.7 and above
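
As a quick sanity check of what is already on the system (a suggested step, not part of the vendor instructions), the following commands report the installed driver, toolkit, and MLNX_OFED versions:

# NVIDIA driver version currently loaded
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version (default install location shown; adjust if installed elsewhere)
/usr/local/cuda-11.1/bin/nvcc --version

# MLNX_OFED version (only available once MLNX_OFED is installed)
ofed_info -s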

CUDA 11.1

Install CUDA 11.1 per the NVIDIA documentation.
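
If the system already has a newer driver and you only need the 11.1 toolkit (see the version note above), the runfile installer can install the toolkit without touching the driver. A minimal sketch, with the runfile name left as a placeholder for the exact build you download:

# Install only the CUDA 11.1 toolkit, leaving the existing driver in place
sudo sh cuda_11.1.<build>_linux.run --silent --toolkit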

NV PEER MEM 1.1

Authoritative documents –

Download – https://www.mellanox.com/sites/default/files/downloads/ofed/nvidia-peer-memory_1.1.tar.gz

Build packages:

tar xzf nvidia-peer-memory_1.1.tar.gz
cd nvidia_peer_memory-1.1
./build_module.sh

# Expected output -- Note location of built packages
#
#   Building source rpm for nvidia_peer_memory...
#   Building debian tarball for nvidia-peer-memory...
#   Built: /tmp/nvidia_peer_memory-1.1-0.src.rpm
#   Built: /tmp/nvidia-peer-memory_1.1.orig.tar.gz

Install on an RPM-based OS:

rpmbuild --rebuild /tmp/nvidia_peer_memory-1.1-0.src.rpm
rpm -ivh  <path to generated binary rpm file>

Install on a DEB-based OS:

cd /tmp
tar xzf /tmp/nvidia-peer-memory_1.1.orig.tar.gz
cd nvidia-peer-memory-1.1
dpkg-buildpackage -us -uc
dpkg -i <path to generated deb files>
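
For reference, a stock 1.1 build typically produces packages named as below; substitute the actual files generated on your system:

# example only -- dpkg-buildpackage writes the .deb files to the parent directory
dpkg -i ../nvidia-peer-memory_1.1-0_all.deb ../nvidia-peer-memory-dkms_1.1-0_all.deb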

Check kernel module startup:

# Check kernel module status
service nv_peer_mem status

# alternative command
lsmod | grep nv_peer_mem

# Load the kernel module if it is not already loaded
service nv_peer_mem start

# alternative command
modprobe nv_peer_mem
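
To have the module come back after a reboot on systemd-based systems, one generic option (an assumption about your init setup, not part of the package instructions) is a modules-load.d entry:

# Load nv_peer_mem automatically at boot (systemd modules-load.d mechanism)
echo nv_peer_mem > /etc/modules-load.d/nv_peer_mem.conf

# After a reboot, confirm it was picked up
lsmod | grep nv_peer_mem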

GDR COPY 2.2

Install on an RPM-based OS:

# dkms can be installed from epel-release. See https://fedoraproject.org/wiki/EPEL.
sudo yum groupinstall 'Development Tools'
sudo yum install dkms rpm-build make check check-devel subunit subunit-devel

tar xzf gdrcopy-2.2.tar.gz
cd gdrcopy-2.2/packages

# specify the absolute path to the CUDA 11.1 top-level directory, e.g. /usr/local/cuda-11.1
CUDA=<cuda-install-top-dir> ./build-rpm-packages.sh
sudo rpm -Uvh gdrcopy-kmod-<version>.<platform>.rpm
sudo rpm -Uvh gdrcopy-<version>.<platform>.rpm
sudo rpm -Uvh gdrcopy-devel-<version>.<platform>.rpm

Install on a DEB-based OS:

sudo apt install build-essential devscripts debhelper libsubunit0 check libsubunit-dev fakeroot pkg-config dkms

tar xzf gdrcopy-2.2.tar.gz
cd gdrcopy-2.2/packages

# specify the absolute path to the CUDA 11.1 top-level directory, e.g. /usr/local/cuda-11.1
CUDA=<cuda-install-top-dir> ./build-deb-packages.sh
sudo dpkg -i gdrdrv-dkms_<version>_<platform>.deb
sudo dpkg -i gdrcopy_<version>_<platform>.deb

HPC-X 2.8.1

Install as shown in https://docs.mellanox.com/pages/viewpage.action?pageId=45194876
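
In practice the installation amounts to extracting the HPC-X archive and loading its environment module. A minimal sketch, with the archive name left generic (pick the build matching your OS, MLNX_OFED 5.2, and CUDA 11.1):

# Extract HPC-X and point HPCX_HOME at the extracted directory
tar -xjf hpcx-v2.8.1-<build>-x86_64.tbz
export HPCX_HOME=$PWD/hpcx-v2.8.1-<build>-x86_64

# The remaining steps on this page load the environment like this
module use $HPCX_HOME/modulefiles
module load hpcx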

System Verification

Check NVIDIA driver:

# verify that the output from the command shows the CUDA version and driver version number you installed
nvidia-smi

Check the MLNX_OFED InfiniBand stack:

# run as root
# refer to  docs/readme_and_user_manual/hca_self_test.ofed.readme for more information
hca_self_test.ofed

Check NV PEER MEM kernel module:

# verify that this indicates status is good
service nv_peer_mem status

# verify this shows module is loaded
lsmod | grep nv_peer_mem

Check GDRCOPY kernel module:

# verify this shows module is loaded
lsmod | grep gdrdrv
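
Optionally, the gdrcopy user-space package installs small test programs that exercise the gdrdrv driver; running one on a GPU node is a quick functional check (executable names changed between gdrcopy releases, so older builds may call this copybw):

# Copy-bandwidth test through gdrcopy; errors here usually point at the gdrdrv module or CUDA setup
gdrcopy_copybw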

Verify GPU-GPU bandwidth

The command below uses the OSU bandwidth utility, which ships with the HPC-X package in the directory ompi/tests/osu-micro-benchmarks-5.6.2-cuda under your HPC-X installation. Add that directory to your PATH or use the full absolute path to the utility. The command runs a test between 2 GPU nodes, measuring the achieved point-to-point bandwidth between 2 GPUs. This is important to ensure that all the required software and drivers are working together to provide the maximum available bandwidth. Verify that the output of this command shows you achieve the maximum theoretical bandwidth of your network fabric.
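
For example, after loading the HPC-X environment you can put the benchmark directory on your PATH (directory taken from the layout above; adjust the version suffix if yours differs):

# Make osu_bw and the other OSU benchmarks resolvable by name
export PATH=$HPCX_HOME/ompi/tests/osu-micro-benchmarks-5.6.2-cuda:$PATH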

Forced communication through UCX/gdrcopy/etc

# load the HPCX environment
module use $HPCX_HOME/modulefiles
module load hpcx

export UCX_MEMTYPE_CACHE=n
export UCX_TLS=rc_x,cuda_copy,gdr_copy,cuda_ipc
export UCX_MAX_EAGER_RAILS=3

# replace node1,node2 with two actual host names from your HPC
mpirun -np 2 --host node1,node2 --npernode 1 \
        -x PATH -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE -x UCX_TLS -x UCX_MAX_EAGER_RAILS \
        -mca pml ucx -bind-to core \
        -mca btl ^tcp \
        -x UCX_NET_DEVICES=mlx5_0:1 -x CUDA_VISIBLE_DEVICES=0 -x UCX_RNDV_SCHEME=get_zcopy \
        -x UCX_RNDV_THRESH=8192 \
        osu_bw -d cuda D D

Test without MPI/UCX tuning parameters

# Simple version of previous test
# This test verifies that you can run mpirun/ucx with no specific tuning parameters
# Open MPI should select the best protocols automatically, so the result should match the previous test
# Reference document -- https://docs.mellanox.com/display/GPUDirectRDMAv17/Benchmark+Tests

# load the HPCX environment
module use $HPCX_HOME/modulefiles
module load hpcx

export UCX_MEMTYPE_CACHE=n
# replace node1,node2 with two actual host names from your HPC
mpirun -np 2 --host node1,node2 --npernode 1 \
        -x PATH -x LD_LIBRARY_PATH -x UCX_NET_DEVICES=mlx5_0:1 -x CUDA_VISIBLE_DEVICES=0 -x UCX_RNDV_SCHEME=get_zcopy \
        osu_bw -d cuda D D

Further tests

See the ClusterKit documentation – https://docs.mellanox.com/display/HPCXv281/ClusterKit . This is a utility provided with HPC-X for more thorough HPC validation and testing. You are encouraged to work through these tests.

Running M-Star

Submit a job to your HPC that allocates 2 nodes with 4 GPUs each. The snippet below shows how one might execute M-Star in a job submission script. Be sure to adjust the environment loading for your specific system.

# load HPC-X environment
module use $HPCX_HOME/modulefiles
module load hpcx

# load mstar environment
source /path/to/mstar.sh

# mpirun   invoke the Open MPI mpirun
# -np 8   run 8 MPI ranks, one per GPU (2 nodes x 4 GPUs)
# -x UCX_MEMTYPE_CACHE=n   required for statically compiled applications
# --gpu-auto   automatically selects GPUs on each node
# --disable-ipc   improves performance
# If you run this directly on a node (outside a batch script), you will likely need to add "-x PATH -x LD_LIBRARY_PATH" as shown in the OSU bandwidth test above
mpirun -np 8 -x UCX_MEMTYPE_CACHE=n mstar-cfd-mgpu -i input.xml -o out --gpu-auto --disable-ipc
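
For completeness, below is a minimal sketch of a submission script wrapping the commands above, assuming a Slurm scheduler; scheduler specifics vary and resource management is otherwise out of scope for this page.

#!/bin/bash
#SBATCH --job-name=mstar-mgpu
#SBATCH --nodes=2              # 2 nodes ...
#SBATCH --gres=gpu:4           # ... with 4 GPUs each (8 GPUs total)
#SBATCH --ntasks-per-node=4    # one MPI rank per GPU

# load HPC-X and M-Star environments
module use $HPCX_HOME/modulefiles
module load hpcx
source /path/to/mstar.sh

mpirun -np 8 -x UCX_MEMTYPE_CACHE=n mstar-cfd-mgpu -i input.xml -o out --gpu-auto --disable-ipc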