AWS Advanced (EFS, PCluster, NICE DCV)

Overview

This guide will help you set up an AWS ParallelCluster environment for running M-Star CFD. These are the general steps:

  • Set up EFS storage backend.

  • Set up AWS ParallelCluster.

  • Install CUDA, Open MPI, M-Star CFD on EFS share.

  • Set up remote DCV GUI.

  • Run test case to verify workflow.

Note

Some AWS regions may have insufficient GPU compute node capacity at times. If you run into issues with GPU nodes not coming up, this may be a reason why. Refer to the troubleshooting section in the AWS ParallelCluster documentation, and contact your AWS support rep for further assistance.

Prerequisites

  • AWS account

  • Service quotas sufficient for the resources you want to access (e.g., GPU instances)

  • Local Python environment

  • Local SSH client

  • Local SCP client

Expected Workflow

  • Pre-process an M-Star case using a remote desktop.

  • Log in to the pcluster head node and submit the job to the SLURM queue for processing.

  • Post-process the results using remote desktop.

EFS

We will define an EFS volume with the following characteristics:

  • No encryption

  • Same VPC as parallel cluster

  • Modified security group to allow all traffic from any inbound source (This is required so the pcluster resources can have access. You can potentially limit this access by subnet, if you prefer.)

Follow these steps:

  1. Go to the AWS web console and navigate to the EFS console.

  2. Select Create file system.

  3. Select Customize, set name to pcluster.

  4. Select the VPC you are using for pcluster.

  5. Disable encryption.

  6. Set other options according to your needs (lifecycle management, data redundancy, etc.).

  7. Select Create, wait for the file system to be created.

  8. Click on the new file system in the list.

  9. Go to the Network tab and take note of the security group used (e.g., sg-##########).

  10. Go to the EC2 tab in the AWS console, select Security Groups.

  11. Find the security group noted above and click on it.

  12. Go to the Inbound rules tab, click Edit inbound rules.

  13. Add a new rule which allows all traffic from any IP. This is required so any instance in the VPC can access the EFS share.
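
If you prefer the command line, an equivalent inbound rule can be added with the AWS CLI. A sketch (the security group ID is a placeholder, and the shorthand syntax assumes a reasonably recent CLI version):

aws ec2 authorize-security-group-ingress \
    --group-id sg-########## \
    --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'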

Parallel Cluster

Reference documentation: AWS ParallelCluster version 3 User Guide.

We will be creating an AWS ParallelCluster version 3 setup. This creates a cluster that keeps a t2.medium head node instance active at all times. When jobs are submitted, p3.2xlarge (1x V100) compute instances are spun up to accommodate the resource requirements, and they are spun down when no longer needed. A maximum of 10 compute instances is allowed. Shared storage is provided by the EFS file system created above, mounted at /w on all instances. If we ever delete this cluster, the shared volume will not be deleted.

Anticipated costs:

  • Cost to keep t2.medium head node alive

  • Cost to spin up compute nodes and execute jobs

  • Cost of EFS shared storage

  • EBS costs associated with systems

  • Data transfer out

Install ParallelCluster

Follow the installation steps in the AWS ParallelCluster documentation.

Notes:

  • Use Anaconda on Windows to set up a new conda environment for pcluster.

  • Use conda install nodejs to perform the post-installation step (the pcluster version 3 CLI requires Node.js).
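
A minimal sketch of the install in a fresh conda environment (the Python version here is an assumption; follow the AWS instructions for specifics):

conda create -n pcluster python=3.9 -y
conda activate pcluster
pip install aws-parallelcluster
conda install -y nodejs
pcluster version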

Configure ParallelCluster

Follow the configuration steps in the AWS ParallelCluster documentation.

Here is our configuration. It keeps one GPU compute node active at all times (MinCount: 1), which is useful when first setting up so you don’t have to wait for compute nodes to be spun up by SLURM. If you need to change MinCount later, edit the value and then update the cluster configuration (see Make Changes to the Cluster below).

Region: us-east-1

Image:
  Os: ubuntu2004

HeadNode:
  InstanceType: t2.medium
  Networking:
    SubnetId: subnet-###################
  Ssh:
    KeyName: #####YOURKEY#######
  LocalStorage:
    RootVolume:
      Size: 100

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue
      ComputeResources:
        - Name: compute
          InstanceType: p3.2xlarge
          MinCount: 1
          MaxCount: 10
      Networking:
        SubnetIds:
          - subnet-###################

SharedStorage:
  - MountDir: /w
    Name: Shared
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-###################
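
If your pcluster release supports it, a dry run validates the configuration file without creating any resources (a useful sanity check; the flag may not exist in older 3.x versions):

pcluster create-cluster --cluster-name test --cluster-configuration cluster-config.yaml --dryrun true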

Create the Cluster

pcluster create-cluster --cluster-name test --cluster-configuration cluster-config.yaml

Monitor the Cluster

pcluster describe-cluster --cluster-name test --region us-east-1

Wait for the cluster status to reach CREATE_COMPLETE before continuing through this guide.

Obtain the head node's public IP address from the AWS console. This is needed for SSH/SCP access.
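
You can also pull it from the describe-cluster output, which includes the head node details:

pcluster describe-cluster --cluster-name test --region us-east-1 | grep -i publicIpAddress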

Make Changes to the Cluster

Any changes that impact compute nodes require that the compute fleet itself be stopped prior to the change.

pcluster update-compute-fleet --region us-east-1 --status STOP_REQUESTED --cluster-name test
# .. wait (may take a few minutes)
pcluster update-cluster --cluster-name test --cluster-configuration cluster-config.yaml
# .. wait (may take a few minutes)
pcluster update-compute-fleet --region us-east-1 --status START_REQUESTED --cluster-name test
# .. wait (may take a few minutes)

Login to the Cluster

Find the public IP address from the AWS console and log in using an SSH client and your private key file.
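
For example (the key file name is a placeholder; ParallelCluster Ubuntu images use the ubuntu user):

ssh -i ~/.ssh/yourkey.pem ubuntu@<head-node-public-ip>

# Or let the pcluster CLI wrap the SSH call
pcluster ssh --cluster-name test --region us-east-1 -i ~/.ssh/yourkey.pem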

Install CUDA Toolkit

This step is required so we can compile Open MPI with CUDA support.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda-toolkit-11-4
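
The head node has no GPU, but only the compiler is needed here. A quick sanity check that the toolkit landed in its default location (the path may differ on your system):

/usr/local/cuda-11.4/bin/nvcc --version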

Install Open MPI

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar xzf openmpi-4.1.1.tar.gz
cd openmpi-4.1.1/
mkdir build
cd build
../configure --prefix=/w/a/openmpi --with-cuda
make -j "$(nproc)" && make install

Create an environment file for Open MPI. Paste the following into the file: /w/a/openmpi/env.sh

# Resolve the directory containing this script, then add its bin/ and lib/ directories to the environment
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
export PATH=$DIR/bin:$PATH
export LD_LIBRARY_PATH=$DIR/lib:$LD_LIBRARY_PATH
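
After sourcing the file, a standard Open MPI check confirms the build picked up CUDA support (it should report a value of true):

source /w/a/openmpi/env.sh
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value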

Install M-Star CFD

# Run these from the shared application directory (e.g., /w/a, matching the paths used later in this guide)
wget #### Paste download link here ####
ls
mkdir 3.3.53
cd 3.3.53/
tar xzf ../mstarcfd-solver-3.3.53-oracle7_cuda-11.3_openmpi-4.1.tar.gz

Install License for M-Star

Install the license file or note the PORT@HOST license server information needed for jobs.
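
For example (the file-based path matches the run script later in this guide; adjust to your licensing setup):

# Option 1: copy the license file to the shared volume
cp mstar.lic /w/a/mstar.lic

# Option 2: use a license server instead; jobs will set
#   export mstar_LICENSE=PORT@HOST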

NICE DCV Setup

  1. Follow the quick start instructions for using the standard NICE DCV AWS AMI: Remote Visualization (NICE DCV). Then return here to continue the setup.

  2. Log in to the NICE DCV Instance with SSH or use a NICE DCV session with a terminal to continue the instructions.

  3. Install EFS utils.

    sudo yum install -y amazon-efs-utils
    
  4. Create a directory and mount the EFS file system. Substitute the file system ID of the EFS volume you created in the EFS instructions above.

    cd
    mkdir efs
    sudo mount -t efs fs-@@@@@@@@:/ efs
    
  5. Check the mounted file system.

    cd efs
    ls
    touch testfile
    rm testfile
    
  6. At this point you may want to edit your /etc/fstab file to permanently mount the EFS share.
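
     For example, an entry might look like this (a sketch only; the file system ID is the placeholder from above and the mount point assumes the ec2-user home directory):

    fs-@@@@@@@@:/ /home/ec2-user/efs efs defaults,_netdev 0 0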

  7. Connect and log in to your NICE DCV instance using the UI client. You may need to set a password and create a DCV session if you have not done so already.

    sudo passwd ec2-user
    sudo dcv create-session test --owner ec2-user      # See the DCV documentation for creating sessions at boot time
    
  8. Open the M-Star GUI.

    # Paths are relative to the home directory, where the EFS share is mounted at ~/efs
    source efs/a/3.3.53/mstar.sh
    export mstar_LICENSE=efs/a/mstar.lic  # Or set mstar_LICENSE=port@host for license servers
    mstar-gui
    
  9. Open M-Star Post.

    source efs/a/3.3.53/mstar.sh
    MStarPost
    

Workflow Example

  • Connect to the head node via SSH (Head Node).

  • Connect to the GUI node via NICE DCV Client (GUI Node).

Starting on GUI Node

  1. Open M-Star Pre.

  2. Create a new case from template. Pick the Agitated Case.

  3. Save the case files to the EFS volume under a new case directory, e.g., ~/efs/r/mycase.

On to Head Node

  1. Navigate to the case directory. On the head node the EFS share is mounted at /w, so a case saved to ~/efs/r/mycase on the GUI node appears at /w/r/mycase here (see the example below).
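
     Continuing with the example path used above:

    cd /w/r/mycase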

  2. Create a run.sh script for SLURM. Edit the environment file paths and license information for your setup.

    #!/bin/bash
    #SBATCH --ntasks=2            # one MPI rank per GPU
    #SBATCH --gpus-per-task=1
    
    source /w/a/openmpi/env.sh
    source /w/a/3.3.53/mstar.sh
    export mstar_LICENSE=/w/a/mstar.lic
    
    mpirun mstar-cfd-mgpu -i input.xml -o out --gpu-auto > log.txt 2>&1
    
  3. Submit the job to the SLURM queue.

    sbatch run.sh
    
  4. Monitor for job startup.

    # Watch log file
    tail -f log.txt
    
    # Queue/job information
    squeue
    

Returning to GUI Node

  1. Open a terminal window.

  2. Source M-Star CFD.

    source efs/a/3.3.53/mstar.sh
    
  3. Navigate to the case directory.

    cd efs/r/mycase
    
  4. Open the results in M-Star Post. Note that you need to pass the case out directory as a command line parameter here.

    MStarPost out
    
  5. Use M-Star Post to view the results.

Next steps

  • Kill a SLURM job. Identify the job ID and kill the job.

# look for your job id
squeue

# kill the job with id=5
scancel 5