CFD on AWS (PCluster, EFS, NICE DCV)

Overview

This guide walks through setting up a cluster for running M-Star CFD on AWS. The general steps are:

  • Set up the EFS storage backend

  • Set up AWS ParallelCluster

  • Install CUDA, OpenMPI, and M-Star CFD on the EFS share

  • Set up the remote NICE DCV GUI

  • Run a test case to verify the workflow

Note

Some AWS regions may have insufficient GPU compute node capacity at times. If GPU nodes fail to come up, this may be the reason. Refer to the troubleshooting section in the AWS ParallelCluster documentation and contact your AWS support representative for further assistance.
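
If GPU nodes are stuck pending, two places to check (a sketch, assuming the SLURM-based cluster built in this guide) are the SLURM reason codes and ParallelCluster's node-management log on the head node:

# Show why SLURM considers nodes unavailable; capacity errors show up as reason codes
sinfo -R

# ParallelCluster's clustermgtd log records failed instance launches
sudo tail -n 50 /var/log/parallelcluster/clustermgtd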

Prerequisites

  • AWS account

  • Service quotas that cover the resources you plan to use (for example, the GPU compute instances used below)

  • Local Python environment

  • Local SSH client

  • Local SCP client
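
A quick way to confirm the local tooling is in place (a sketch, assuming a Unix-like shell; adjust for your operating system):

# Local Python for the pcluster CLI
python3 --version

# SSH/SCP clients (scp prints its usage text if an SCP client is present)
ssh -V
scp

# The AWS CLI is optional but handy for the EFS and security-group steps below
aws --version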

Expected workflow

  • Preprocess an M-Star case using a remote desktop

  • Log in to the ParallelCluster head node and submit the job to the SLURM queue

  • Post-process the results using the remote desktop

EFS

We will define an EFS volume with the following characteristics:

  • No encryption

  • In the same VPC as the parallel cluster

  • Security group modified to allow all inbound traffic (required so the ParallelCluster resources can reach the share; you can tighten this by subnet or security group if you prefer)

Follow these steps:

  1. Go to the AWS web console and navigate to the EFS console

  2. Click Create file system

  3. Click Customize

  4. Set name to pcluster

  5. Select the VPC you are using for pcluster

  6. Disable encryption

  7. Set other options according to your needs (Lifecycle management, data redundancy, etc.)

  8. Click Create

  9. Wait for the file system to be created

  10. Click on the new file system in the list

  11. Go to the Network tab

  12. Take note of the security group used (e.g., sg-##########)

  13. Go to the EC2 tab in the AWS console

  14. Click on Security Groups

  15. Find the security group noted above and click on it

  16. Go to Inbound rules tab

  17. Click Edit inbound rules

  18. Add a new rule that allows all traffic from any IP. This is required so any instance in the VPC can access the EFS share.
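
If you prefer the AWS CLI to the console for this last step, the inbound rule can be added with something like the following (a sketch; sg-XXXXXXXXXXXXXXXXX stands for the security group noted in step 12):

# Allow all inbound traffic on the EFS security group so instances in the VPC can mount the share
aws ec2 authorize-security-group-ingress \
    --group-id sg-XXXXXXXXXXXXXXXXX \
    --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'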

Parallel Cluster

Reference Documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/

We will be creating an AWS ParallelCluster version 3 setup. This creates a cluster that keeps a t2.medium head node active at all times. When jobs are submitted, p3.2xlarge (1x V100) instances are spun up to accommodate the resource requirements, and compute resources are spun down when no longer needed. A maximum of 10 compute instances is allowed. Shared storage is provided by the EFS volume created above, mounted at /w on all instances. If the cluster is ever deleted, the shared volume is not deleted with it.

Anticipated costs:

  • Cost to keep t2.medium head node alive

  • Cost to spin up compute nodes and execute jobs

  • EFS share storage costs

  • EBS costs associated with systems

  • Data transfer out

Install ParallelCluster

Follow steps here: https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-parallelcluster.html

Notes:

  • Use Anaconda on Windows to set up a new conda environment for pcluster

  • Use conda install nodejs to perform the Node.js post-installation step
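
A minimal sketch of those notes, assuming Anaconda is installed and a Python 3.9 environment is acceptable:

conda create -n pcluster python=3.9
conda activate pcluster
pip install aws-parallelcluster
conda install nodejs          # Node.js is needed for the post-installation step
pcluster version              # confirm the CLI is available in the environment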

Configure ParallelCluster

Follow the steps here: https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-configuring.html

Here is our configuration. It keeps one GPU compute node active at all times, which is useful when first setting up so you do not have to wait for compute nodes to be spun up by SLURM. If you need to change MinCount later, edit the value and then update the cluster (see Make Changes to the cluster below).

Region: us-east-1

Image:
  Os: ubuntu2004

HeadNode:
  InstanceType: t2.medium
  Networking:
    SubnetId: subnet-###################
  Ssh:
    KeyName: #####YOURKEY#######
  LocalStorage:
    RootVolume:
      Size: 100

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue
      ComputeResources:
        - Name: compute
          InstanceType: p3.2xlarge
          MinCount: 1
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-###################

SharedStorage:
  - MountDir: /w
    Name: Shared
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-###################
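
Before creating the cluster, you can optionally validate the configuration without building any resources (a sketch; --dryrun runs the validators only):

pcluster create-cluster --cluster-name test --cluster-configuration cluster-config.yaml --dryrun true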

Create the cluster

pcluster create-cluster --cluster-name test --cluster-configuration cluster-config.yaml

Monitor the cluster

pcluster describe-cluster --cluster-name test --region us-east-1

Wait for the cluster to come up before continuing with this guide.

Obtain the head node's public IP address from the AWS console; it is needed for SSH/SCP access.
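
The same information is available from the CLI (a sketch, using the cluster name and region from this guide):

pcluster describe-cluster-instances --cluster-name test --region us-east-1 --node-type HeadNode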

Make Changes to the cluster

Any change that impacts the compute nodes requires that the compute fleet be stopped before the change is applied:

pcluster update-compute-fleet --region us-east-1 --status STOP_REQUESTED --cluster-name test
# .. wait (may take a few minutes)
pcluster update-cluster --cluster-name test --cluster-configuration cluster-config.yaml
# .. wait (may take a few minutes)
pcluster update-compute-fleet --region us-east-1 --status START_REQUESTED --cluster-name test
# .. wait (may take a few minutes)

Login to the cluster

Find the public IP address in the AWS console and log in using an SSH client and your private key file.
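
For example, with the Ubuntu 20.04 image used above (default user ubuntu) and a hypothetical key path:

ssh -i ~/keys/YOURKEY.pem ubuntu@<head-node-public-ip>

# Alternatively, let the pcluster CLI resolve the address for you
pcluster ssh --cluster-name test --region us-east-1 -i ~/keys/YOURKEY.pem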

Install CUDA toolkit

This step is required so we can compile OpenMPI with CUDA support.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda-toolkit-11-4
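
To confirm the toolkit installed, query the compiler directly (a sketch; the packaged toolkit normally lands under /usr/local/cuda-11.4, and nvcc runs fine on the GPU-less head node):

/usr/local/cuda-11.4/bin/nvcc --version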

Install OpenMPI

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar xzf openmpi-4.1.1.tar.gz
cd openmpi-4.1.1/
mkdir build
cd build
../configure --prefix=/w/a/openmpi --with-cuda
make -j $(nproc) && make install

Create an environment file for OpenMPI. Paste the following into the file /w/a/openmpi/env.sh:

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
export PATH=$DIR/bin:$PATH
export LD_LIBRARY_PATH=$DIR/lib:$LD_LIBRARY_PATH
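
A quick sanity check that the build is on the PATH and was compiled with CUDA support (a sketch; run after sourcing the environment file):

source /w/a/openmpi/env.sh
which mpirun        # should resolve to /w/a/openmpi/bin/mpirun
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value    # expect ...:value:true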

Install M-Star CFD

# Run from the shared install area so the files land at /w/a/3.3.53 (the path used later in this guide)
cd /w/a
wget #### Paste download link here ####
mkdir 3.3.53
cd 3.3.53/
tar xzf ../mstarcfd-solver-3.3.53-oracle7_cuda-11.3_openmpi-4.1.tar.gz

Install license for M-Star

Install a license file on the shared volume or define the PORT@HOST information needed for jobs.
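
For example, either copy a license file to the shared volume (the path matches the job script later in this guide) or point the environment variable at a license server (the port and host shown are hypothetical):

# Option 1: file-based license on the EFS share (assumes mstar.lic is in the current directory)
cp mstar.lic /w/a/mstar.lic

# Option 2: license server
export mstar_LICENSE=27000@license.example.com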

NICE DCV Setup

  1. Follow the quick start instructions for the standard NICE DCV AWS AMI (see Remote Visualization (NICE DCV)), then return here to continue the setup.

  2. Log in to the NICE DCV instance with SSH, or use a NICE DCV session with a terminal, to continue with the following steps.

  3. Install EFS utils:

    sudo yum install -y amazon-efs-utils
    
  4. Create a directory and mount the EFS file system. Copy/paste the file system ID you created in the EFS instructions above:

    cd
    mkdir efs
    sudo mount -t efs fs-@@@@@@@@:/ efs
    
  5. Check the mounted file system:

    cd efs
    ls
    touch testfile
    rm testfile
    
  6. At this point you may want to edit your /etc/fstab file so the EFS share is mounted permanently (an example entry is sketched after this list)

  7. Connect and log in to your NICE DCV instance using the UI client. You may need to set a password and create a DCV session if you have not done so already:

    sudo passwd ec2-user
    sudo dcv create-session test --owner ec2-user      # See the DCV documentation for creating sessions at boot time
    
  8. Open the M-Star GUI:

    source efs/a/3.3.53/mstar.sh
    export mstar_LICENSE=efs/a/mstar.lic  # Or set mstar_LICENSE=port@host for license servers
    mstar-gui
    
  9. Open M-Star Post:

    source efs/a/3.3.53/mstar.sh
    MStarPost
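
If you chose to mount the share permanently in step 6, an /etc/fstab entry using the EFS mount helper looks roughly like this (a sketch; replace the file system ID and mount point with your own):

# file system ID        mount point            type   options             dump pass
fs-@@@@@@@@:/           /home/ec2-user/efs     efs    defaults,_netdev    0    0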
    

Workflow Example

  • Connect to the head node via SSH (Head Node)

  • Connect to the GUI node via NICE DCV Client (GUI Node)

On GUI Node

  1. Open M-Star GUI

  2. Create a new case from template. Pick the Agitated Case

  3. Save the case files to the EFS volume under a new case directory, e.g., ~/efs/r/mycase

On Head Node

  1. Navigate to the case directory

  2. Create a run.sh script for SLURM. Edit the environment file paths and license information to match your setup:

    #!/bin/bash
    #SBATCH --ntasks=2
    #SBATCH --gpus-per-task=1
    
    source /w/a/openmpi/env.sh
    source /w/a/3.3.53/mstar.sh
    export mstar_LICENSE=/w/a/mstar.lic
    
    mpirun mstar-cfd-mgpu -i input.xml -o out --gpu-auto > log.txt 2>&1
    
  3. Submit the job to the SLURM queue:

    sbatch run.sh
    
  4. Monitor for job startup:

    # Watch log file
    tail -f log.txt
    
    # Queue/job information
    squeue
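
    # Compute-node states (new p3.2xlarge nodes appear here as SLURM spins them up)
    sinfo

    # Details for a specific job; replace <jobid> with the ID printed by sbatch
    scontrol show job <jobid>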
    

On GUI Node

  1. Open a terminal window

  2. Source M-Star CFD:

    source efs/a/3.3.53/mstar.sh
    
  3. Navigate to the case directory:

    cd efs/r/mycase
    
  4. Open the results in M-Star Post. Note that you need to pass the case's out directory as a command-line argument here:

    MStarPost out
    
  5. Use M-Star Post to view the results

Next steps