Hardware Guide¶
Updated August 2025
GPU technology is constantly changing, so it can be confusing to know what hardware to purchase. The purpose of this document is to explain the current state of GPU technology and to provide purchasing recommendations.
tl;dr¶
Here are some options to consider for the impatient. Many users will land on a variation of the High End Workstation, which provides a great price-performance ratio without jumping to more expensive server hardware. However, users with complex modeling needs and large models may need to consider server-class hardware.
- Entry Level PC
1x NVidia GeForce 40-series or 50-series, 32–64GB System memory, 1TB Disk
- Middle-of-the-Road Workstation
1x NVidia RTX 6000 Ada, 64–128GB System memory, 2TB Disk
- High End Workstation
1x NVidia RTX PRO 6000, 256GB System memory, 2TB Disk
Note: Add another GPU dedicated to driving the display when running Microsoft Windows.
- Entry Level Server
4x H100 SXM with full NVLINK/NVSWITCH
- High End Server
8x H100 SXM or B100 SXM with full NVLINK/NVSWITCH
- Ultra Mega High End Server
16x or more B200 SXM with full NVLINK/NVSWITCH
First Considerations¶
The first consideration in designing a computer to run M-Star should be which GPUs you want to use; you can then design the rest of the computer around those GPUs. To choose GPUs, consider the typical problem size you need to solve. Larger simulations with additional physics require more GPU memory; simple simulations with moderate-to-coarse fluid resolution require less.
While you can speed up M-Star by running across more GPUs, bear in mind that each GPU needs a sufficient amount of work to run efficiently. For example, a simulation with 1M nodes run on 8x A100s would not run much faster than on a single A100, because GPU-to-GPU communication becomes the bottleneck when the simulation is spread too thinly across the GPUs. It is best to design the computing resource based on how much GPU memory you actually need. This behavior is further discussed in Scaling Performance.
Important
Design your computing resource based on the GPU memory requirement for typical M-Star models you will run.
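As a rough illustration of the load-per-GPU point above, the sketch below divides a total lattice count across candidate GPU counts and flags configurations that leave each GPU underutilized. The 1-million-nodes-per-GPU threshold is an illustrative assumption, not a measured M-Star figure; see Scaling Performance for real data.

```python
# Rough sketch: check whether a proposed GPU count gives each GPU enough work.
# The nodes-per-GPU threshold is an illustrative assumption, not an official
# M-Star recommendation; see Scaling Performance for measured behavior.
MIN_NODES_PER_GPU = 1_000_000  # assumed minimum load for worthwhile scaling

def nodes_per_gpu(total_lattice_nodes: int, gpu_count: int) -> float:
    """Average number of lattice nodes assigned to each GPU."""
    return total_lattice_nodes / gpu_count

# The 1M-node example from the text: adding GPUs spreads the work too thin.
for gpus in (1, 2, 4, 8):
    per_gpu = nodes_per_gpu(1_000_000, gpus)
    status = "OK" if per_gpu >= MIN_NODES_PER_GPU else "likely underutilized"
    print(f"{gpus} GPU(s): {per_gpu:,.0f} nodes/GPU -> {status}")
```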
How much GPU memory do I need?¶
The size of the simulation, in terms of lattice density and particle count, is limited by the local GPU RAM. As a first-order approximation, 1 GB of GPU RAM can support 2–4 million grid points and 1 million particles. Adding scalar fields or custom variables may change this scaling.
Loosely speaking, most simulations contain 1–100 million lattice points and/or 1–10 million particles. These simulations can usually be performed on a single high-performance GPU, which typically contains 16–80 GB of RAM. Simulations with larger memory requirements may require a multi-GPU configuration.
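The rule of thumb above can be turned into a quick sizing estimate. The sketch below is a minimal calculation assuming the 2–4 million grid points and 1 million particles per GB figures quoted above; the function name and the safety margin are illustrative choices, not part of M-Star.

```python
# Minimal GPU-memory sizing sketch based on the rule of thumb above:
# 1 GB of GPU RAM ~ 2-4 million grid points and ~1 million particles.
# The 20% safety margin is an illustrative assumption.

def estimate_gpu_memory_gb(grid_points: float, particles: float,
                           points_per_gb: float = 2e6,   # conservative end of 2-4M/GB
                           particles_per_gb: float = 1e6,
                           margin: float = 1.2) -> float:
    """Rough GPU memory estimate in GB for a lattice + particle simulation."""
    return margin * (grid_points / points_per_gb + particles / particles_per_gb)

# Example: 100 million lattice points and 10 million particles
print(f"~{estimate_gpu_memory_gb(100e6, 10e6):.0f} GB of GPU RAM")  # ~72 GB
```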
What memory bandwidth do I need?¶
Bandwidth is the rate at which data can be transferred between the GPU’s processing cores and its on‑board memory (VRAM). For NVidia GPUs it is reported in gigabytes per second (GB/s). Higher bandwidth allows the GPU to access and process data faster, improving overall performance.
After total memory capacity and raw compute performance (see TFlops), bandwidth is the next specification to compare. When two GPUs have similar memory size and computational speed, prefer the one with the higher memory bandwidth.
Note
Bandwidth matters most for memory‑bound workloads (large lattices or particle counts with comparatively low arithmetic intensity). If your models fit comfortably in memory and are compute‑bound, differences in bandwidth may have less impact.
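As a back-of-the-envelope illustration of why bandwidth matters for memory-bound runs, the sketch below estimates the minimum time per time step imposed by memory traffic alone. The bytes-per-node figure is a placeholder assumption, not an M-Star constant; the point is only that step time scales inversely with bandwidth.

```python
# Back-of-the-envelope: lower bound on time per time step set by memory
# bandwidth alone. BYTES_PER_NODE_PER_STEP is a placeholder assumption.
BYTES_PER_NODE_PER_STEP = 500  # assumed read+write traffic per lattice node

def min_step_time_ms(lattice_nodes: float, bandwidth_gb_s: float) -> float:
    """Bandwidth-limited lower bound on step time, in milliseconds."""
    bytes_moved = lattice_nodes * BYTES_PER_NODE_PER_STEP
    return 1e3 * bytes_moved / (bandwidth_gb_s * 1e9)

# 100M-node lattice on a 2000 GB/s GPU vs. a 1000 GB/s GPU
print(min_step_time_ms(100e6, 2000))  # 25.0 ms
print(min_step_time_ms(100e6, 1000))  # 50.0 ms
```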
GPU Spec Tables¶
Always reference the datasheet provided by NVidia for official specifications. When multiple variants of a GPU are available, the variant with more memory and/or cores is listed.
Tables are grouped into three main types: Data Center, Workstation, and Consumer. Each table is sorted by memory capacity, then by TFlops, in descending order.
- Name
Name of the GPU.
- TFlops
Theoretical single-precision teraflops (trillions of floating-point operations per second) of a single GPU, based on the boost clock frequency and CUDA core count. Typically reported as FP32 TFlops in NVidia data sheets and other sources.
- Memory
Amount of memory of a single GPU in gigabytes.
- Memory Bandwidth
Peak memory bandwidth of a single GPU in gigabytes per second (GB/s). Higher is better for memory‑bound workloads. See What memory bandwidth do I need?.
- NVLink N
The number of GPUs that may be connected to each other via NVLink. A value of zero indicates NVLink is not supported. For more information, see GPU Topology.
- ECC
Error Correcting Code memory. A value of ‘y’ indicates the feature is supported. ECC protects against data corruption in memory.
Note
Regarding NVLink:
Most PCIe-based GPUs allow for either zero or two GPU connections. For example, if NVLink N = 2, this means a single NVLink bridge may be used to connect two GPUs.
In contrast, SXM-based GPUs typically allow for many NVLink connections to be made via NVLINK/NVSWITCH hardware.
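To compare the tables below against hardware you already have, one option is to query the driver with nvidia-smi. The sketch below is a minimal Python wrapper around that command; it assumes nvidia-smi is on the PATH and that your driver reports the ecc.mode.current field.

```python
# Minimal sketch: list installed NVidia GPUs with name, total memory, and
# current ECC mode by calling nvidia-smi. Assumes nvidia-smi is on the PATH.
import subprocess

def list_gpus() -> list[dict]:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,ecc.mode.current",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        name, memory, ecc = (field.strip() for field in line.split(","))
        gpus.append({"name": name, "memory": memory, "ecc": ecc})
    return gpus

if __name__ == "__main__":
    for gpu in list_gpus():
        print(gpu)
```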
Data Center GPUs¶
This category contains the top performing GPUs. These are typically recommended for server class hardware in a data center and for solving the largest problems.
| Name | TFlops | Memory (GB) | Memory Bandwidth (GB/s) | NVLink N | ECC |
|---|---|---|---|---|---|
| B200 PCIe | 80 | 192 | 8000 | 576 | y |
| B100 SXM | 60 | 192 | 8000 | 576 | y |
| H100 SXM | 67 | 80 | 3360 | 8 | y |
| H100 PCIe | 51 | 80 | 2000 | 2 | y |
| A100 SXM | 19.5 | 80 | 1935 | 16 | y |
| A100 PCIe | 19.5 | 80 | 1935 | 2 | y |
| L40S | 91.6 | 48 | 864 | 0 | y |
| L40 | 90.5 | 48 | 864 | 0 | y |
| A40 | 37.4 | 48 | 696 | 2 | y |
| A10 | 31.2 | 24 | 600 | 2 | y |
| L4 | 30.3 | 24 | 300 | 0 | y |
| A30 | 10.3 | 24 | 933 | 2 | y |
Workstation GPUs¶
PCIe-based GPUs intended for workstations and servers. These are recommended for middle range memory capacity.
| Name | TFlops | Memory (GB) | Memory Bandwidth (GB/s) | NVLink N | ECC |
|---|---|---|---|---|---|
| RTX PRO 6000 Workstation | 125 | 96 | 1792 | 0 | y |
| RTX PRO 6000 Server | 120 | 96 | 1597 | 0 | y |
| RTX PRO 6000 Max-Q Workstation | 110 | 96 | 1792 | 0 | y |
| RTX 6000 Ada | 91.1 | 48 | 960 | 0 | y |
| RTX PRO 5000 | 73.2 | 48 | 1344 | 0 | y |
| RTX A6000 | 38.7 | 48 | 768 | 2 | y |
| RTX 5000 Ada | 65.3 | 32 | 576 | 0 | y |
| Quadro GV100 | 14.8 | 32 | 870 | 2 | y |
| RTX PRO 4000 | 46.9 | 24 | 672 | 0 | y |
| RTX A5500 | 34.1 | 24 | 768 | 2 | y |
| RTX A5000 | 27.8 | 24 | 768 | 2 | y |
| RTX 4000 Ada | 26.7 | 20 | 360 | 0 | y |
| RTX A4500 | 23.7 | 20 | 640 | 2 | y |
| RTX A4000 | 19.2 | 16 | 448 | 0 | y |
| RTX A2000 | 8 | 12 | 288 | 0 | y |
Consumer/Gaming GPUs¶
Consumer (gaming) GPUs are PCIe boards intended primarily for gaming and general desktop workloads. They offer a lower cost per unit of theoretical compute, but typically lack data-center/workstation features such as NVLink, ECC memory, and broader validation and support. For modest single-GPU simulations that fit comfortably in memory, a high-end consumer GPU can provide excellent price/performance.
| Name | TFlops | Memory (GB) | Memory Bandwidth (GB/s) | NVLink N | ECC |
|---|---|---|---|---|---|
| RTX 5090 | 106.1 | 32 | 1792 | 0 | n |
| RTX 4090 | 82.6 | 24 | 1008 | 0 | n |
| RTX 3090 Ti | 40 | 24 | 1008 | 2 | n |
| RTX 3090 | 35.7 | 24 | 936 | 2 | n |
| RTX 5080 | 56.8 | 16 | 960 | 0 | n |
| RTX 4080 | 48.7 | 16 | 717 | 0 | n |
| RTX 5070 Ti | 43.9 | 16 | 896 | 0 | n |
| RTX 5060 Ti (16GB) | 23.7 | 16 | 448 | 0 | n |
| RTX 4070 Ti | 40.1 | 12 | 504 | 0 | n |
| RTX 3080 Ti | 34.1 | 12 | 912 | 0 | n |
| RTX 5070 | 30.9 | 12 | 672 | 0 | n |
| RTX 3080 | 29.8 | 12 | 912 | 0 | n |
| RTX 3060 | 12.7 | 12 | 360 | 0 | n |
| RTX 5060 Ti (8GB) | 23.7 | 8 | 448 | 0 | n |
| RTX 3070 Ti | 21.7 | 8 | 608 | 0 | n |
| RTX 3070 | 20.4 | 8 | 448 | 0 | n |
| RTX 5060 | 19.2 | 8 | 448 | 0 | n |
| RTX 3060 Ti | 16.2 | 8 | 448 | 0 | n |
| RTX 3050 | 9.1 | 8 | 224 | 0 | n |
CPU¶
M-Star CFD is not a CPU-bound process, so CPU selection is left to the user.
System Memory¶
We recommend 1.5–2x the total GPU memory in the machine. For example, two GPUs with 48GB of memory each give 96GB of total GPU memory, so target roughly 144–192GB of system memory. ECC memory should be preferred for shared workstations and server-class hardware.
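For quick sizing, the 1.5–2x rule is easy to script. The helper below is a minimal sketch; the function name and the default factor are illustrative choices.

```python
# Minimal sketch of the system-memory rule of thumb above:
# recommended system RAM = 1.5-2x total GPU memory.

def recommended_system_memory_gb(gpu_memory_gb: list[float],
                                 factor: float = 2.0) -> float:
    """System RAM (GB) suggested for a machine with the given GPUs."""
    return factor * sum(gpu_memory_gb)

# Example from the text: two 48 GB GPUs -> 96 GB total GPU memory
print(recommended_system_memory_gb([48, 48], factor=1.5))  # 144.0 GB
print(recommended_system_memory_gb([48, 48], factor=2.0))  # 192.0 GB
```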
Disk Storage¶
The requirements for disk storage can vary wildly depending on how a simulation is configured. A good starting point is to have 1–2TB of working storage for M-Star. This should be fast storage, preferably local SSD-based storage to optimize the write speed of large output files.