Saturday, May 1, 2021

System design: Amazon EMR | My first 20-minute study


Here is the article. 

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Amazon EMR makes it easy to set up, operate, and scale your big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters. With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. You can run workloads on Amazon EC2 instances, on Amazon Elastic Kubernetes Service (EKS) clusters, or on-premises using EMR on AWS Outposts.

Easy to use

You can use EMR Studio, an integrated development environment (IDE), to easily develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio uses AWS Single Sign-On and allows you to log in directly with your corporate credentials. It provides fully managed Jupyter Notebooks and collaboration with peers using code repositories such as GitHub and Bitbucket.

Low cost

EMR pricing is simple and predictable: You pay a per-instance rate for every second used, with a one-minute minimum charge. You can launch a 10-node EMR cluster for as little as $0.15 per hour. You can save 50-80% on the cost of the instances by selecting Amazon EC2 Spot for transient workloads and Reserved Instances for long-running workloads. You can also use Savings Plans.
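To make the billing rule concrete for myself, here is a minimal Python sketch of per-second billing with a one-minute minimum. The per-node rate of $0.015 per hour is only an assumption, obtained by dividing the quoted $0.15 for a 10-node cluster by ten; real rates depend on instance type and region, and the function names are my own.

```python
import math

SECONDS_PER_HOUR = 3600
MINIMUM_BILLED_SECONDS = 60  # one-minute minimum charge, as stated in the text


def billed_seconds(runtime_seconds: float) -> int:
    """Round the runtime up to whole seconds and apply the one-minute minimum."""
    return max(MINIMUM_BILLED_SECONDS, math.ceil(runtime_seconds))


def cluster_cost(runtime_seconds: float, nodes: int, hourly_rate_per_node: float) -> float:
    """Dollar cost for `nodes` instances that are each billed per second."""
    seconds = billed_seconds(runtime_seconds)
    return nodes * hourly_rate_per_node * seconds / SECONDS_PER_HOUR


# Hypothetical rate: $0.15/hour for 10 nodes implies $0.015 per node-hour.
ASSUMED_RATE = 0.015

one_hour = cluster_cost(3600, nodes=10, hourly_rate_per_node=ASSUMED_RATE)
twenty_seconds = cluster_cost(20, nodes=10, hourly_rate_per_node=ASSUMED_RATE)

print(f"1 hour on 10 nodes:     ${one_hour:.4f}")
print(f"20 seconds on 10 nodes: ${twenty_seconds:.4f} (billed as 60 seconds)")
```

The point of the sketch is the minimum: a job that finishes in 20 seconds is still charged for a full minute on every node, so very short transient clusters cost a little more per second of useful work than long ones.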

Elastic

Unlike the rigid infrastructure of on-premises clusters, EMR decouples compute and storage, giving you the ability to scale each independently and take advantage of the tiered storage of Amazon S3. With EMR, you can provision one, hundreds, or thousands of compute instances or containers to process data at any scale. The number of instances can be increased or decreased automatically using Auto Scaling (which manages cluster sizes based on utilization) and you only pay for what you use.
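The idea of sizing a cluster from its utilization can be sketched with a toy function. This is not EMR's actual scaling algorithm, which the article does not describe; the target utilization, the node limits, and the function name are all assumptions I picked for illustration.

```python
import math


def target_node_count(current_nodes: int, utilization: float,
                      target_utilization: float = 0.7,
                      min_nodes: int = 1, max_nodes: int = 100) -> int:
    """Return a node count that would bring utilization near the target.

    `utilization` is the fraction of cluster capacity in use (0.0 means idle,
    1.0 means fully busy). Every default here is an illustrative assumption.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    # Work currently being done, expressed in "fully busy node" units.
    busy_nodes = current_nodes * utilization
    # Nodes needed so that the same work lands at the target utilization.
    desired = math.ceil(busy_nodes / target_utilization)
    # Clamp to the configured limits so the cluster never shrinks to zero
    # or grows without bound.
    return max(min_nodes, min(max_nodes, desired))


print(target_node_count(10, 0.9))  # busy cluster: scale out
print(target_node_count(10, 0.3))  # quiet cluster: scale in
```

Because compute is decoupled from storage in S3, shrinking the node count in a scheme like this does not put the stored data at risk, which is what makes scaling in as cheap as scaling out.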

Reliable

Spend less time tuning and monitoring your cluster. EMR is tuned for the cloud and constantly monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances. Clusters are highly available and automatically fail over in the event of a node failure. EMR provides the latest stable open source software releases, so you don’t have to manage updates and bug fixes, which leads to fewer issues and less effort to maintain your environment.

Secure

EMR automatically configures EC2 firewall settings to control network access to instances, and launches clusters in an Amazon Virtual Private Cloud (VPC). Server-side encryption or client-side encryption can be used with the AWS Key Management Service or your own customer-managed keys. EMR makes it easy to enable other encryption options, like in-transit and at-rest encryption, and strong authentication with Kerberos. You can use AWS Lake Formation or Apache Ranger to apply fine-grained data access controls for databases, tables, and columns.

Flexible

You have complete control over your EMR clusters and your individual EMR jobs. You can launch EMR clusters with custom Amazon Linux AMIs and easily configure the clusters using scripts to install additional third-party software packages. EMR enables you to reconfigure applications on running clusters on the fly without the need to relaunch clusters. Also, you can customize the execution environment for individual jobs by specifying the libraries and runtime dependencies in a Docker container and submitting it with your job.
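As far as I understand it, application settings on EMR are expressed as a list of configuration objects, each with a `Classification` name and a `Properties` map. The Python sketch below only builds and prints that JSON; it does not call AWS. The helper name and the two example settings are my own choices for illustration, so check the EMR release guide for the classifications your release actually supports.

```python
import json


def classification(name: str, properties: dict) -> dict:
    """Build one configuration object in the Classification/Properties shape."""
    return {"Classification": name, "Properties": dict(properties)}


# Hypothetical settings, chosen only to show the structure.
configurations = [
    classification("spark-defaults", {"spark.executor.memory": "4g"}),
    classification("hive-site", {"hive.exec.parallel": "true"}),
]

# Print the JSON document that would be supplied as the cluster's
# application configuration.
print(json.dumps(configurations, indent=2))
```

Keeping the settings as plain data like this makes it easy to store them in version control and to reuse the same list for every cluster you launch.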
