Technical Diagrams¶
This document provides additional detail on the inner workings and data flow of M-Star. This is particularly important for the solver, which requires high data bandwidth between GPUs.
Pre-Processor¶
The Pre-Processor verifies the license at runtime. This may open a network connection to a floating license server, indicated by the dotted line. The license may also be configured locally as a file, so this network connection may or may not occur depending on the configuration.
At runtime, the Pre-Processor may invoke other M-Star tools to execute previewers or other utilities. Generally, these utilities work by writing files to the user's temporary directory, which are then read back into the Pre-Processor.
The document file (.msb) is the project file for the M-Star Pre-Processor and contains all data defined by the user. To run the solver, the solver files must first be generated by the Pre-Processor.
Solver¶
The solver technical architecture diagram is somewhat more complex. Being license controlled, the solver may contact a license server if configured.
The solver uses MPI to execute multiple processes that communicate with one another through various means. Each solver process is mapped to a single GPU. The GPU processes communicate with one another over the NVLINK/NVSWITCH hardware, which provides high-speed memory access between GPUs and avoids copying data to system/CPU memory. Correct setup of the NVLINK/NVSWITCH hardware is critical to running in multi-GPU environments. For more information on multi-GPU execution, please refer to the documentation.
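To illustrate this one-process-per-GPU mapping, the following is a minimal sketch, assuming mpi4py and CuPy are available; it shows only the rank-to-GPU binding and is not M-Star source code:

    # Minimal sketch of a one-MPI-process-per-GPU mapping (illustration only,
    # not M-Star source). Assumes mpi4py and CuPy are installed and the job is
    # launched with one rank per GPU, for example: mpiexec -n 8 python script.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Bind this rank to a single GPU: rank 0 -> GPU 0, rank 1 -> GPU 1, ...
    ngpus = cp.cuda.runtime.getDeviceCount()
    gpu_id = rank % ngpus
    cp.cuda.Device(gpu_id).use()

    print(f"Rank {rank} of {size} bound to GPU {gpu_id}")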
GPU Topology¶
Related:
Platform Diagnostics – For help obtaining system diagnostic information
One important consideration for multi-GPU simulations is the NVLINK/NVSWITCH connection topology of the GPUs. In general, M-Star requires full peer-to-peer access between all GPUs for optimum performance. Not all hardware configurations provide full peer access, so it is important to confirm this before running.
On existing hardware, you can run the command nvidia-smi topo -m. Confirm that the resulting table shows NV# between all GPUs. An example of complete peer access is shown in Figure 7. An example of incomplete peer access is shown in Figure 8. Note that every cell value must show “NV#”, where the number indicates how many NVLinks connect the two GPUs. If NV is not present in a cell, those GPUs will communicate via the PCI bus, which will reduce multi-GPU performance.
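Peer accessibility can also be queried programmatically through the CUDA runtime. The sketch below, assuming CuPy is installed (an assumption, not an M-Star utility), reports which GPU pairs can access each other's memory directly. This check complements, but does not replace, confirming NV links with nvidia-smi topo -m, since peer access may also be reported over other interconnects on some platforms:

    # Minimal sketch (assumes CuPy; not an M-Star tool): report CUDA peer
    # accessibility between every pair of GPUs.
    import cupy as cp

    n = cp.cuda.runtime.getDeviceCount()
    full_access = True
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # deviceCanAccessPeer wraps cudaDeviceCanAccessPeer and returns 0 or 1.
            if not cp.cuda.runtime.deviceCanAccessPeer(i, j):
                full_access = False
                print(f"GPU {i} cannot directly access memory on GPU {j}")

    print("Full peer-to-peer access" if full_access else "Incomplete peer access")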
On Windows, full peer access between all GPUs involved in the simulation is strictly required. For example, in the system shown in Figure 8, you could not run a multi-GPU simulation that includes both GPU 0 and GPU 5, since those two GPUs do not have peer access via NVLINK (the topology matrix shows PHB between them).
On Linux, full peer access is not strictly necessary, but running without it will reduce performance considerably. Linux supports CUDA-aware MPI, which allows the data communication to fall back on other means, such as copying GPU data over the PCI bus; however, this reduces overall performance. Therefore, it is recommended to always configure GPU resources with full peer-to-peer access between all GPUs. This way optimum performance is always achieved, and you retain the option of selecting either Windows or Linux as the operating system.
Partial peer access. Example of topology without full peer access. Notice that some elements in the topology matrix show PHB, which indicates that data will not travel over a fast NVLINK. This topology will not perform at optimum speed in M-Star.
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
GPU0  X     NV1   NV1   NV2   NV2   PHB   PHB   PHB   0-63          0-1
GPU1  NV1   X     NV2   NV1   PHB   NV2   PHB   PHB   0-63          0-1
GPU2  NV1   NV2   X     NV2   PHB   PHB   NV1   PHB   0-63          0-1
GPU3  NV2   NV1   NV2   X     PHB   PHB   PHB   NV1   0-63          0-1
GPU4  NV2   PHB   PHB   PHB   X     NV1   NV1   NV2   0-63          0-1
GPU5  PHB   NV2   PHB   PHB   NV1   X     NV2   NV1   0-63          0-1
GPU6  PHB   PHB   NV1   PHB   NV1   NV2   X     NV2   0-63          0-1
GPU7  PHB   PHB   PHB   NV1   NV2   NV1   NV2   X     0-63          0-1
Full peer access. Example of topology with full peer access. All elements show NV, indicating a fully NVLink-connected topology.
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
GPU0  X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  0-23,48-71    0
GPU1  NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  0-23,48-71    0
GPU2  NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  0-23,48-71    0
GPU3  NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  0-23,48-71    0
GPU4  NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  24-47,72-95   1
GPU5  NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  24-47,72-95   1
GPU6  NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  24-47,72-95   1
GPU7  NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     24-47,72-95   1
The NVIDIA DGX-1 system is known to have incomplete peer access. This system has two sets of 4 GPUs, with only partial connectivity between the two sets; each set of 4 GPUs has full peer access among itself. This is the so-called “8 GPU hybrid cube-mesh interconnection network topology”. These systems are becoming less common in new installations but are still found in service on existing resources. The AWS p3.16xlarge instance is one example of a DGX-1-style system with incomplete peer access that is still in use as of 2022. A comparison of the P3 and P4 instance families is available on the AWS website at https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/ . To run 8-GPU cases on AWS, you will need to use the p4d.24xlarge instance, which has full peer access as noted in the above article.
The NVIDIA DGX-2 system is known to have complete peer access.
Post¶
The M-Star Post-Processor is built on Kitware’s ParaView technology. It reads the solver output files and processes them. The user may then write out data, plots, images, videos, etc. from this program. Note that the Post-Processor is not license controlled, so it has no connections to license servers.
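Because the Post-Processor is built on ParaView, similar operations can be scripted with pvpython. The sketch below is a minimal illustration under the assumption that the solver output is a file ParaView can open; the file path is a hypothetical placeholder, not a documented M-Star path:

    # Minimal pvpython sketch (illustration only, not the M-Star Post workflow).
    # The input path is a hypothetical placeholder for a solver output file
    # that ParaView can read.
    from paraview.simple import OpenDataFile, Show, Render, SaveScreenshot

    source = OpenDataFile("out/example_output.vtm")  # placeholder path
    Show(source)
    Render()
    SaveScreenshot("snapshot.png")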