
Amazon EC2 P4d instances deliver the highest performance for machine learning (ML) training and high-performance computing (HPC) applications in the cloud. They are deployed in hyperscale clusters called EC2 UltraClusters, which combine high-performance compute, networking, and storage. Each EC2 UltraCluster of P4d instances comprises more than 4,000 of the latest NVIDIA A100 GPUs, petabit-scale non-blocking networking infrastructure, and high-throughput, low-latency storage with FSx for Lustre. Customers can easily scale from a few to thousands of NVIDIA A100 GPUs in the EC2 UltraClusters based on their ML or HPC project needs. Unlike with on-premises systems, customers can access virtually unlimited compute and storage capacity, scale their infrastructure based on business needs, and spin up a multi-node ML training job or a tightly coupled distributed HPC application in minutes, without any setup or maintenance costs. This blog will help you launch a high-performance HPC cluster in the cloud using EC2 UltraClusters of P4d instances. You will set up the underlying networking for the cluster, deploy an FSx for Lustre file system and a P4d cluster, and delete your AWS resources when you are done.

In this blog, we will cover:

Amazon EC2 P4d Instances
Features of Amazon EC2 P4d Instances
Performance of Amazon EC2 P4d instances
Use cases of Amazon EC2 P4d instances
Hands-on – Deploy Amazon EC2 P4d instances in EC2 UltraClusters
Conclusion

Amazon EC2 P4d Instances

Amazon EC2 P4d instances, the latest generation of GPU-based instances, deliver the best performance for machine learning training and high-performance computing in the cloud.

P4d instances are deployed in EC2 UltraClusters, which are hyperscale clusters. Each EC2 UltraCluster has over 4,000 NVIDIA A100 Tensor Core GPUs, petabit-scale networking, and scalable low-latency storage using FSx for Lustre. Each EC2 UltraCluster is one of the most powerful supercomputers in the world. In EC2 UltraClusters, anyone can quickly spin up P4d instances.


The key specifications of P4d instances are:

Elastic Fabric Adapter (EFA)
Enhanced Networking
EBS Optimized
Intel AVX, Intel AVX2, Intel AVX-512, and Intel Turbo
3.0 GHz 2nd Generation Intel Xeon Scalable (Cascade Lake) processors

Features of Amazon EC2 P4d Instances

The following are the features of P4d Instances:

Up to 8 NVIDIA A100 Tensor Core GPUs
400 Gbps instance networking with support for Elastic Fabric Adapter (EFA) and NVIDIA GPUDirect RDMA (remote direct memory access)
600 GB/s peer-to-peer GPU communication with NVIDIA NVSwitch
3.0 GHz 2nd Generation Intel Xeon Scalable (Cascade Lake) processors
Deployed in EC2 UltraClusters consisting of more than 4,000 NVIDIA A100 Tensor Core GPUs, Petabit-scale networking, and scalable low latency storage with Amazon FSx for Lustre
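
As a rough sizing aid, the GPU counts above can be translated into instance counts. The following is a minimal sketch, not an AWS API: the names GPUS_PER_P4D_INSTANCE and instances_needed are illustrative, and it assumes every instance contributes the full 8 GPUs listed above.

```python
import math

# Each P4d instance provides up to 8 NVIDIA A100 GPUs (from the feature list).
GPUS_PER_P4D_INSTANCE = 8


def instances_needed(gpu_count: int) -> int:
    """Return the smallest number of P4d instances that provides gpu_count GPUs."""
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    return math.ceil(gpu_count / GPUS_PER_P4D_INSTANCE)


if __name__ == "__main__":
    for gpus in (8, 64, 4000):
        print(gpus, "GPUs ->", instances_needed(gpus), "instances")
```

Under that assumption, an UltraCluster with more than 4,000 GPUs corresponds to more than 500 P4d instances.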

Performance of Amazon EC2 P4d instances

Natural Language Processing: For a BERT-Large model using the TensorFlow framework trained on the Wikipedia-corpus dataset, P4d instances are 3x faster than the P3 instances.

Image Classification: For a ResNet50 model using the TensorFlow framework trained on the Imagenet2012 dataset, P4d instances are 2.1x faster than the P3 instances.

Speech To Text: For a Jasper model using the PyTorch framework trained on the LibriSpeech dataset, P4d instances are 2.3x faster than the P3 instances.
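
To make these speedups concrete, the sketch below converts a training time measured on P3 into an estimated P4d time. It only divides by the figures quoted above; the dictionary keys, the function name estimated_p4d_hours, and the 24-hour baseline are made up for illustration, and real results depend on the model, data pipeline, and cluster size.

```python
# Speedups of P4d over P3 as quoted in the text above.
P4D_SPEEDUP_OVER_P3 = {
    "bert-large": 3.0,
    "resnet50": 2.1,
    "jasper": 2.3,
}


def estimated_p4d_hours(model: str, p3_hours: float) -> float:
    """Scale a measured P3 training time by the quoted speedup for the model."""
    if p3_hours < 0:
        raise ValueError("p3_hours must be non-negative")
    return p3_hours / P4D_SPEEDUP_OVER_P3[model]


if __name__ == "__main__":
    # 24 hours on P3 is a hypothetical baseline, not a measured value.
    for model in P4D_SPEEDUP_OVER_P3:
        print(f"{model}: {estimated_p4d_hours(model, 24.0):.1f} h on P4d")
```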

Use cases of Amazon EC2 P4d instances

P4d instances suit machine learning workloads such as natural language understanding, perception model training for autonomous vehicles, image classification, object detection, and recommendation engines.

Higher GPU performance lets customers train models in less time, while the additional GPU memory lets them train larger, more complex models. HPC customers can use P4d's increased processing performance and GPU memory for seismic analysis, drug discovery, DNA sequencing, and insurance risk modeling.

Hands-on – Deploy Amazon EC2 P4d instances in EC2 UltraClusters

We will demonstrate how to launch a high-performance HPC cluster in the cloud using EC2 UltraClusters of P4d instances. Amazon EC2 P4d instances deliver the highest performance for machine learning (ML) training and high-performance computing (HPC) applications in the cloud, and they are the first instances in the cloud to support 400 Gbps instance networking. P4d instances provide up to 60% lower cost to train ML models, with an average of 2.5x better performance for deep learning models compared to the previous-generation P3 and P3dn instances. Researchers, data scientists, and developers can leverage P4d instances to train ML models for use cases such as natural language processing, object detection and classification, and recommendation engines, as well as run HPC applications such as pharmaceutical discovery, seismic analysis, and financial modeling. EC2 UltraClusters of P4d instances combine high-performance compute, networking, and storage into one of the most powerful supercomputers in the world.

Starting from the VPC dashboard, we will first create a private subnet in a VPC. Then we will navigate to the NAT Gateway dashboard and create a new NAT gateway with the private subnet attached. Next, we will create a route table and edit its routes and subnet associations. To access the EC2 UltraCluster, we will create 2 security groups, one for EFA and the other for external SSH. We will then launch an FSx for Lustre file system, followed by a cluster of EC2 P4d instances with 4 EFA ENIs. Finally, we will look at the steps for mounting the file system on the EC2 instance.

To implement this, we will do the following:

Log in to your AWS console and navigate to the dashboard.
Navigate to the VPC dashboard.
Create a private subnet in the selected VPC.
Create a NAT Gateway attaching the private subnet to it.
Navigate to the Route Tables dashboard and create a new Route Table.
Add routes to the route table and edit the subnet associations.
Navigate to the Security Groups dashboard in the EC2 console.
Create 2 security groups, one for EFA and the other one for External SSH with the shown Inbound and Outbound rules.
Navigate to the S3 console and create a new bucket that is to be linked to the File System.
Navigate to the File System dashboard and create a new file system with the configuration steps shown below.
Once the file system becomes available, navigate to the EC2 dashboard and launch a P4d instance following the steps shown below.
Add 3 more network interfaces with Elastic Fabric Adapter checked while creating the instance.
Once launched, you can access the instance via the command line using SSH and mount the file system using the steps provided.
Finally, you can start training the models on the P4d instance.
If you are following the hands-on only for learning purposes, make sure to delete all the resources you created along the way.

Search for the VPC service and navigate to the VPC dashboard.

Select subnets from the left navigation pane.

Click on Create Subnet.

Choose the VPC from the dropdown.

Enter a name for the subnet and select an availability zone. In the IPv4 CIDR Block field, enter a CIDR range that falls within the selected VPC's CIDR range. Enter Tags if needed. Click on Create Subnet.
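
Before submitting the form, it can help to check that the subnet range really falls inside the VPC range. The following is a small sketch using Python's standard ipaddress module; the function name subnet_fits_vpc and the example CIDR values are illustrative assumptions, not values from this walkthrough.

```python
import ipaddress


def subnet_fits_vpc(vpc_cidr: str, subnet_cidr: str) -> bool:
    """Return True if subnet_cidr lies entirely inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


if __name__ == "__main__":
    # Hypothetical ranges; substitute the CIDR of your own VPC and subnet.
    print(subnet_fits_vpc("10.0.0.0/16", "10.0.1.0/24"))
    print(subnet_fits_vpc("10.0.0.0/16", "10.1.0.0/24"))
```

Note that ip_network raises ValueError when the address has host bits set (for example 10.0.1.5/24), which also catches a common typing mistake.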

You will be able to see the created subnet in the list of subnets. Select NAT gateways from the left navigation pane.

Click on Create NAT gateway.

Enter a name for your NAT gateway and choose the options as shown in the image below.

Enter Tags for your NAT gateway if needed. Click on Create NAT gateway.

On success, you will see the screen as shown below.

Select Route Tables from the left navigation pane and click on Create route table.

Enter a name for your route table. Select the VPC in which you created your subnets. Enter Tags if needed and click on Create route table.

On success, you will see the screen as shown in the image below.

Scroll down to the bottom and click on Edit Routes.

Make an entry in the table as shown in the image below. Click on Save Changes.

On success, you will see the entry created in the Routes.

Now, select the Subnet associations tab.

Or you can select the route table and click on Edit Subnet associations.

Select the private subnet and click on Save association.

On success, you will see the message as shown in the image below.

Now, search for the EC2 service and navigate to its dashboard.

Select Security Groups from the left navigation pane.

Click on Create security group.

Enter a name for the first security group and select the VPC in which the subnet exists. Add an inbound rule as shown in the image below.

Add the outbound rule and tags if needed and click on Create security group.

Similarly, create another security group for SSH access and add the inbound rule as shown in the image below.

Add the outbound rule and tags if needed and click on Create security group.

On success, you will see the message as shown in the image below.
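
The exact rules from the screenshots are not reproduced in this text, so the sketch below is an assumption based on the general EFA requirement that the security group allows all traffic to and from itself, plus a plain SSH rule on TCP port 22. The helper names and the example group ID and CIDR are hypothetical. The returned lists follow the IpPermissions shape that boto3's authorize_security_group_ingress and authorize_security_group_egress accept; no AWS call is made here.

```python
import ipaddress


def efa_self_referencing_rule(group_id: str) -> list:
    """All protocols and ports, allowed only from members of the same group."""
    return [{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": group_id}]}]


def ssh_ingress_rule(source_cidr: str) -> list:
    """TCP port 22 from a single source range (validated before use)."""
    ipaddress.ip_network(source_cidr)  # raises ValueError on a malformed CIDR
    return [{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": source_cidr, "Description": "External SSH"}],
    }]


if __name__ == "__main__":
    # Hypothetical identifiers; replace them with your own group ID and range.
    print(efa_self_referencing_rule("sg-0123456789abcdef0"))
    print(ssh_ingress_rule("203.0.113.0/24"))
```

For the EFA group, the same self-referencing rule would be applied in both the inbound and the outbound direction.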

Navigate to the S3 console. Click on Create Bucket.

Enter a name for the bucket.

Block all public access for your bucket.

Make the changes as shown in the image below and click on Create bucket.

On success, you will see the message as shown in the image below.
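
For readers who script this step, "Block all public access" corresponds to four boolean settings. The sketch below builds them in the shape boto3's put_public_access_block expects for its PublicAccessBlockConfiguration argument; the helper names are illustrative, and the code makes no AWS call.

```python
PUBLIC_ACCESS_KEYS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def block_all_public_access() -> dict:
    """Return a configuration with every public-access block enabled."""
    return {key: True for key in PUBLIC_ACCESS_KEYS}


def is_fully_blocked(config: dict) -> bool:
    """Check that all four settings are present and set to True."""
    return all(config.get(key) is True for key in PUBLIC_ACCESS_KEYS)


if __name__ == "__main__":
    config = block_all_public_access()
    print(config, is_fully_blocked(config))
```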

Now, search for the service FSx.

Click on Create file system.

Select the Amazon FSx for Lustre option.

Fill in the form similar to the screenshot shown below with the following parameters:

Deployment & storage type: Scratch, SSD

Throughput per unit of storage: 200 MB/s/TiB

Storage Capacity: 2.4 TiB

For Network and Security:

Virtual Private Cloud: VPC of the private subnet created earlier

VPC Security Groups: Choose the EFA security group you created earlier

Subnet: Private subnet created earlier

For data ingestion, choose the S3 bucket we created above and set the options as shown in the screenshot below.

Review the settings, scroll down, and click on Create File System.

The creation might take a few minutes.
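
The console settings above can also be expressed as a request for boto3's fsx create_file_system call. The following is a sketch under stated assumptions: it maps "Scratch, SSD" at 200 MB/s/TiB to the SCRATCH_2 deployment type, expresses 2.4 TiB as 2400 GiB (the unit the API uses), and uses hypothetical subnet, security group, and bucket names. It only builds the argument dictionary and does not call AWS.

```python
def lustre_scratch_request(subnet_id: str, security_group_id: str,
                           bucket_name: str, capacity_gib: int = 2400) -> dict:
    """Build keyword arguments for an FSx for Lustre scratch file system."""
    # For SSD scratch file systems the API accepts 1200 GiB or multiples of 2400 GiB.
    if capacity_gib != 1200 and (capacity_gib <= 0 or capacity_gib % 2400 != 0):
        raise ValueError("capacity must be 1200 GiB or a multiple of 2400 GiB")
    return {
        "FileSystemType": "LUSTRE",
        "StorageType": "SSD",
        "StorageCapacity": capacity_gib,
        "SubnetIds": [subnet_id],
        "SecurityGroupIds": [security_group_id],
        "LustreConfiguration": {
            "DeploymentType": "SCRATCH_2",         # assumption, see lead-in
            "ImportPath": f"s3://{bucket_name}",   # bucket linked for ingestion
        },
    }


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    print(lustre_scratch_request("subnet-0abc", "sg-0efa", "my-p4d-training-data"))
```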

Once the status changes to Available, click on the File system name.

Note down the Mount name; you can also review all the configurations made.

Now, navigate back to the EC2 dashboard and click on Launch instances.

Select an AMI that supports A100 GPUs and has the FSx for Lustre client installed.

For Instance Type, choose p4d.24xlarge. Click on Next: Configure Instance Details.

For the instance details, enter the number of instances you want in the count field. Choose the VPC and private subnet created earlier. Select a placement group that uses the cluster strategy.

For network interfaces, add 3 more network interfaces with “Elastic Fabric Adapter” checked by clicking on Add Device.

Set the NetworkCardIndex of the four EFA interfaces to 0, 1, 2, and 3, respectively. Click on Next: Add Storage.

You can add additional storage if needed. Click on Next: Add Tags.

Add tags for your EC2 instance if needed. Click on Next: Configure Security Group.

Choose the security groups created earlier for SSH and EFA access. Click on Review and Launch.

Review all the settings and click on Launch.

Create a new key pair or select an existing one. Click on Launch instance.

In a few minutes, the instance will launch and then you can connect to it via the command line and start training your models.
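
The launch settings chosen in the console can be summarised as arguments for boto3's ec2 run_instances call. This is a sketch, not the exact request the console sends: the function name and all identifiers are hypothetical, and pairing each NetworkCardIndex with the same DeviceIndex is an assumption that should be checked against the current EFA documentation. The code only builds the dictionary and does not call AWS.

```python
def p4d_launch_request(image_id: str, key_name: str, placement_group: str,
                       subnet_id: str, security_group_ids: list,
                       count: int = 1, card_count: int = 4) -> dict:
    """Build keyword arguments for launching P4d instances with EFA interfaces."""
    interfaces = [
        {
            "NetworkCardIndex": card,   # 0, 1, 2, 3 as set in the console
            "DeviceIndex": card,        # assumption, see lead-in
            "InterfaceType": "efa",
            "SubnetId": subnet_id,
            "Groups": list(security_group_ids),
        }
        for card in range(card_count)
    ]
    return {
        "ImageId": image_id,
        "InstanceType": "p4d.24xlarge",
        "MinCount": count,
        "MaxCount": count,
        "KeyName": key_name,
        "Placement": {"GroupName": placement_group},  # cluster placement group
        # Subnet and security groups are set per interface, not at the top level.
        "NetworkInterfaces": interfaces,
    }


if __name__ == "__main__":
    request = p4d_launch_request("ami-0123456789abcdef0", "my-key", "p4d-cluster",
                                 "subnet-0abc", ["sg-0efa", "sg-0ssh"], count=2)
    for nic in request["NetworkInterfaces"]:
        print(nic)
```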

To mount the file system, go to the file system console, select the file system, and click on Attach. Follow the steps shown there to mount the file system.
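
The Attach dialog shows the exact commands for your file system. As a rough illustration of their general form, the sketch below assembles them from the DNS name and the Mount name noted earlier. The function name, the /fsx mount point, the relatime,flock options, and the example values are assumptions; prefer the commands the console displays.

```python
def lustre_mount_commands(dns_name: str, mount_name: str,
                          mount_point: str = "/fsx") -> list:
    """Return shell commands that create a mount point and mount the file system."""
    return [
        f"sudo mkdir -p {mount_point}",
        f"sudo mount -t lustre -o relatime,flock "
        f"{dns_name}@tcp:/{mount_name} {mount_point}",
    ]


if __name__ == "__main__":
    # Hypothetical DNS name and Mount name; copy the real ones from the console.
    for command in lustre_mount_commands(
            "fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com", "abcdefgh"):
        print(command)
```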

Conclusion

In this blog, we explored EC2 P4d instances, their features, benefits, performance, and use cases. We also saw how to launch a high-performance HPC cluster in the cloud using EC2 UltraClusters of P4d instances. Starting from the VPC dashboard, we first created a private subnet in a VPC. Then we navigated to the NAT Gateway dashboard and created a new NAT gateway with the private subnet attached. Next, we created a route table and edited its routes and subnet associations. To access the EC2 UltraCluster, we created 2 security groups, one for EFA and the other for external SSH. We then launched an FSx for Lustre file system, followed by a cluster of EC2 P4d instances with 4 EFA ENIs. Finally, we looked at the steps for mounting the file system on the EC2 instance. We will discuss more use cases of P4d EC2 UltraClusters in our upcoming blogs. Stay tuned for updates about our upcoming blogs on AWS and relevant technologies.

For any further queries, feel free to post your comments; we are happy to help!

Meanwhile …

Keep Exploring -> Keep Learning -> Keep Mastering

This blog is part of our effort towards building a knowledgeable and kick-ass tech community. At Workfall, we strive to provide the best tech and pay opportunities to AWS-certified talents. If you’re looking to work with global clients, build kick-ass products while making big bucks doing so, give it a shot at workfall.com/partner today.
