Getting Started with AWS EMR

Amazon Web Services (AWS) offers EMR, a service that launches clusters in minutes without requiring the user to manage the cluster or provision nodes by hand. Because storage and compute are decoupled, each can grow independently, which leads to better resource utilization: the user can keep data in an Amazon S3 bucket and process it with the compute services of the platform.

Let’s start with the Amazon EMR service.

Amazon EMR is a managed service that runs big data analysis frameworks on clusters built from Amazon EC2 instances. Its workflow is explained below:

Plan & Configure: To create an EMR cluster, the user first plans the storage required for the big data and then chooses the frameworks that will analyze it.

Manage: The user manages the cluster by connecting to it, submitting work to it, and checking the results before terminating the cluster.

Clean Up: This step terminates the cluster and its resources. It is important because an idle cluster keeps incurring charges.
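The Manage and Clean Up stages above can be sketched with the AWS SDK for Python (boto3). This is only an illustrative sketch under stated assumptions: the helper names, the S3 script path, and the default region are hypothetical, and the Spark step assumes Spark was installed on the cluster. The boto3 import is kept inside the function that talks to AWS, so the two pure helpers can be tried without credentials.

```python
def build_spark_step(name, script_uri):
    """Describe one unit of work (a "step") to submit to a running cluster.

    script_uri is a hypothetical S3 path to a PySpark script.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def idle_cluster_ids(clusters):
    """Return the IDs of clusters in the WAITING state (up, but running no steps).

    `clusters` has the shape of the "Clusters" list returned by list_clusters.
    """
    return [c["Id"] for c in clusters if c["Status"]["State"] == "WAITING"]


def run_step_then_terminate(cluster_id, step, region="us-east-1"):
    """Submit a step, wait for it to finish, then terminate the cluster.

    Requires boto3 and valid AWS credentials; the region is an assumption.
    """
    import boto3

    emr = boto3.client("emr", region_name=region)
    step_id = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])["StepIds"][0]
    emr.get_waiter("step_complete").wait(ClusterId=cluster_id, StepId=step_id)
    emr.terminate_job_flows(JobFlowIds=[cluster_id])  # Clean Up: stop paying for it
    return step_id
```

For example, `idle_cluster_ids` could be fed the `Clusters` list from a `list_clusters` call to find clusters that are candidates for termination.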

Node in EMR

An EMR cluster is a group of EC2 instances. Each instance is called a node, and the node types are explained below:

Master Node: It is the main node or leader node (also called the primary node), which is responsible for managing all the resources of the cluster.

Core Node: It hosts Hadoop Distributed File System (HDFS) data and runs the tasks that the primary node assigns to it.

Task Node: These nodes do not host data; they only run tasks. A task node is a helper node, so creating one is optional when launching the EMR cluster.
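The three node types map onto the `InstanceRole` values `MASTER`, `CORE`, and `TASK` that the EMR API uses for instance groups. The sketch below builds such a list; the function name, the instance type, and the default counts are assumptions chosen for illustration, not recommendations.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Build an InstanceGroups list with one primary node, core nodes,
    and (optionally) task nodes. The instance type is a placeholder."""
    if core_count < 1:
        raise ValueError("this sketch expects at least one core node")

    groups = [
        {   # manages the cluster's resources
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {   # hosts HDFS data and runs tasks
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {   # runs tasks only; stores no HDFS data, so it is optional
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups
```

Calling `build_instance_groups()` with the defaults yields only the primary and core groups, which reflects that task nodes are not mandatory.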

Create EMR Cluster

To create a cluster on the AWS EMR service, open the EMR dashboard by searching for the service in the AWS Management Console.

On this page, select “Clusters” from the left panel and click on the “Create cluster” button.

On the cluster creation page, click on the “Go to advanced options” link.

Software Configuration: On the advanced options page, the user can choose from various open-source data processing frameworks to install on the cluster. The service also offers the option to create multiple nodes on EC2 instances.

Hardware Configuration: On this page, the user can configure the cloud resources required for the EMR cluster.

Cluster Nodes and Instances: This section lets the user configure the node types. EMR then creates EC2 instances with the configured resources.

Security: On the last page, select the EC2 key pair that will be used to connect to the nodes. A key pair can be created on the Key Pairs page of the EC2 dashboard.

The new EMR cluster will then be displayed on the Clusters page.

You have successfully created an EMR cluster on AWS.
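The same console choices (software, hardware, node types, and key pair) can also be expressed as a single request to the EMR API. The following is a hedged sketch using boto3's `run_job_flow`; the cluster name, key pair name, release label, instance type, and region are placeholders, and the role names assume the default EMR roles already exist in the account.

```python
def build_cluster_request(name, key_name, applications=("Hadoop", "Spark"),
                          release_label="emr-6.15.0", instance_type="m5.xlarge",
                          core_count=2):
    """Assemble run_job_flow arguments that mirror the console pages.

    All default values here are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Software Configuration: frameworks to install
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            # Hardware Configuration / Cluster Nodes and Instances
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_count},
            ],
            # Security: EC2 key pair used to connect to the nodes
            "Ec2KeyName": key_name,
            # Keep the cluster running after steps finish (remember to terminate it)
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(request, region="us-east-1"):
    """Send the request and wait until the cluster is up.

    Requires boto3 and valid AWS credentials; the region is an assumption.
    """
    import boto3

    emr = boto3.client("emr", region_name=region)
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    return cluster_id
```

A call such as `create_cluster(build_cluster_request("demo-cluster", "my-key-pair"))` would launch a billable cluster, so it should be paired with a termination step once the work is done.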

Conclusion

The AWS EMR service creates clusters that store and process big data with the help of a distributed file system. Each cluster consists of multiple nodes, and each node is an EC2 instance, that is, a virtual machine created in the cloud that the user can connect to. These clusters can manage big data in the cloud without using any resources from your own system.

About the author

Talha Mahmood

As a technical author, I am eager to learn about writing and technology. I have a degree in computer science which gives me a deep understanding of technical concepts and the ability to communicate them to a variety of audiences effectively.