Technical Diagrams¶
This document provides additional detail on the inner workings and data flow of M-Star. This is particularly important for the solver, which requires high data bandwidth between GPUs.
Pre-Processor¶
The Pre-Processor verifies the license at runtime. This may open a network connection to a floating license server, indicated by the dotted line. The license may also be configured locally as a file, so this network connection may or may not occur depending on the configuration.
At runtime, the Pre-Processor may invoke other M-Star tools to execute previewers or other utilities. Generally, these utilities work by writing files to the user's temporary directory, which are then read back into the Pre-Processor.
The document file (.msb) is the project file for the M-Star Pre-Processor and contains all data defined by the user. To run the solver, the solver files must first be generated by the Pre-Processor.
Solver¶
The solver technical architecture diagram is somewhat more complex. Being license controlled, the solver may contact a license server if configured.
The solver uses MPI to execute multiple processes that communicate with one another through various means. Each solver process is mapped to a single GPU. The GPU processes communicate with one another over the NVLINK/NVSWITCH hardware, which provides high-speed memory access between GPUs and avoids copying data to system/CPU memory. Correct setup of the NVLINK/NVSWITCH hardware is critical to running in multi-GPU environments. For more information on multi-GPU execution, please refer to the documentation.
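To illustrate this one-process-per-GPU mapping, the following is a minimal sketch, assuming mpi4py and CuPy are available; it shows only the rank-to-GPU binding and is not M-Star source code:

    # Minimal sketch of a one-MPI-process-per-GPU mapping (illustration only,
    # not M-Star source). Assumes mpi4py and CuPy are installed and the job is
    # launched with one rank per GPU, for example: mpiexec -n 8 python script.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Bind this rank to a single GPU: rank 0 -> GPU 0, rank 1 -> GPU 1, ...
    ngpus = cp.cuda.runtime.getDeviceCount()
    gpu_id = rank % ngpus
    cp.cuda.Device(gpu_id).use()

    print(f"Rank {rank} of {size} bound to GPU {gpu_id}")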
GPU Topology¶
Related:
Platform Diagnostics – For help obtaining system diagnostic information
One important consideration for multi-GPU simulations is the NVLINK/NVSWITCH connection topology of the GPUs. In general, M-Star requires full peer-to-peer access between all GPUs for optimum performance. Not all hardware configurations provide full peer access, so it is important to confirm this before running.
On existing hardware, you can run the command nvidia-smi topo -m. Confirm that the resulting table shows NV# between all GPUs. An example of complete peer access is shown in Figure 7. An example of incomplete peer access is shown in Figure 8. Note that every cell value must show “NV#”, where the number indicates how many NVLinks connect the two GPUs. If NV is not present in a cell, those GPUs will communicate via the PCI bus, which will reduce multi-GPU performance.
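Peer accessibility can also be queried programmatically through the CUDA runtime. The sketch below, assuming CuPy is installed (an assumption, not an M-Star utility), reports which GPU pairs can access each other's memory directly. This check complements, but does not replace, confirming NV links with nvidia-smi topo -m, since peer access may also be reported over other interconnects on some platforms:

    # Minimal sketch (assumes CuPy; not an M-Star tool): report CUDA peer
    # accessibility between every pair of GPUs.
    import cupy as cp

    n = cp.cuda.runtime.getDeviceCount()
    full_access = True
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # deviceCanAccessPeer wraps cudaDeviceCanAccessPeer and returns 0 or 1.
            if not cp.cuda.runtime.deviceCanAccessPeer(i, j):
                full_access = False
                print(f"GPU {i} cannot directly access memory on GPU {j}")

    print("Full peer-to-peer access" if full_access else "Incomplete peer access")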
On Windows, full peer access between all GPUs involved in the simulation is strictly required. For example, in the system shown in Figure 8, you could not run a multi-GPU simulation that includes both GPU 0 and GPU 5, since those two GPUs do not have peer access via NVLINK (the topology matrix shows PHB between them).
On Linux, full peer access is not strictly necessary, but running without it will reduce performance considerably. Linux supports CUDA-aware MPI, which allows the data communication to fall back on other means, such as copying GPU data over the PCI bus; however, this reduces overall performance. Therefore, it is recommended to always configure GPU resources with full peer-to-peer access between all GPUs. This way optimum performance is always achieved, and you retain the option of selecting either Windows or Linux as the operating system.
Partial peer access. Example of topology without full peer access. Notice that some elements in the topology matrix show PHB, which indicates that data will not travel over a fast NVLINK. This topology will not perform at optimum speed in M-Star.
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
GPU0  X     NV1   NV1   NV2   NV2   PHB   PHB   PHB   0-63          0-1
GPU1  NV1   X     NV2   NV1   PHB   NV2   PHB   PHB   0-63          0-1
GPU2  NV1   NV2   X     NV2   PHB   PHB   NV1   PHB   0-63          0-1
GPU3  NV2   NV1   NV2   X     PHB   PHB   PHB   NV1   0-63          0-1
GPU4  NV2   PHB   PHB   PHB   X     NV1   NV1   NV2   0-63          0-1
GPU5  PHB   NV2   PHB   PHB   NV1   X     NV2   NV1   0-63          0-1
GPU6  PHB   PHB   NV1   PHB   NV1   NV2   X     NV2   0-63          0-1
GPU7  PHB   PHB   PHB   NV1   NV2   NV1   NV2   X     0-63          0-1
Full peer access. Example of topology with full peer access. All elements show NV, indicating a fully NVLink-connected topology.
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
GPU0  X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  0-23,48-71    0
GPU1  NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  0-23,48-71    0
GPU2  NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  0-23,48-71    0
GPU3  NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  0-23,48-71    0
GPU4  NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  24-47,72-95   1
GPU5  NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  24-47,72-95   1
GPU6  NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  24-47,72-95   1
GPU7  NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     24-47,72-95   1
The NVIDIA DGX-1 system is known to have incomplete peer access. This system has two sets of 4 GPUs, with only partial connectivity between the two sets; each set of 4 GPUs has full peer access among itself. This is the so-called “8 GPU hybrid cube-mesh interconnection network topology”. These systems are becoming less common in new installations but are still found in service on existing resources. The AWS p3.16xlarge instance is one example of a DGX-1-style system with incomplete peer access that is still in use as of 2022. A comparison of the P3 and P4 instance families is available on the AWS website at https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/ . To run 8-GPU cases on AWS, you will need to use the p4d.24xlarge instance, which has full peer access as noted in the above article.
The NVIDIA DGX-2 system is known to have complete peer access.
Post¶
The M-Star Post-Processor is built on Kitware’s ParaView technology. It reads the solver output files and processes them. The user may then write out data, plots, images, videos, etc. from this program. Note that the Post-Processor is not license controlled, so it has no connections to license servers.
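Because the Post-Processor is built on ParaView, similar operations can be scripted with pvpython. The sketch below is a minimal illustration under the assumption that the solver output is a file ParaView can open; the file path is a hypothetical placeholder, not a documented M-Star path:

    # Minimal pvpython sketch (illustration only, not the M-Star Post workflow).
    # The input path is a hypothetical placeholder for a solver output file
    # that ParaView can read.
    from paraview.simple import OpenDataFile, Show, Render, SaveScreenshot

    source = OpenDataFile("out/example_output.vtm")  # placeholder path
    Show(source)
    Render()
    SaveScreenshot("snapshot.png")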