Migrating the Elephant: From On-Premise Hadoop to Amazon EMR - TriNimbus

The continued growth of Amazon’s Big Data analytics portfolio is not surprising given the breadth of services it offers, such as Amazon Athena, Amazon Redshift, Amazon Kinesis, and AWS Glue. While these services are all capable, they are not always a fit for large on-prem Hadoop migrations. This is mainly because the on-prem analytics pipeline relies on existing Hadoop and Spark jobs, and changing that code is not always a viable option due to various constraints.

This blog post is a summary of our longer white paper on the same topic. The complete paper is available for download.

Enter Amazon EMR, one of Amazon’s earliest analytics offerings. EMR has gone through numerous improvements since its original release in 2009. Today, it’s a mature and reliable platform for running Hadoop, Hive, and Spark workloads on the Cloud, in many cases without changes to the original code.

What is Amazon EMR?

Amazon EMR is a scalable, easy-to-use, fully managed service for running Apache Hadoop and associated frameworks such as Spark on the Cloud in a simple and cost-efficient way. Given these capabilities, migrating your on-prem OLAP Hive/Spark workloads to Amazon EMR can be appealing.

Some of the advantages that Amazon EMR offers over on-prem Hadoop include:

  1. The ability to use S3 for storing raw and processed data in a reliable, cost-efficient way, which separates the storage and compute layers and reduces reliance on HDFS.
  2. Built-in auto-scaling for worker nodes, based on an existing (or custom) CloudWatch metric. More load? No problem.
  3. Easy-to-use, fully automated cluster provisioning.
  4. Cost efficiency, especially for batch analytics and ad-hoc query workloads. No more costly idle clusters: on long-running clusters, worker nodes can be scaled out as required and scaled back in when no longer needed, while an entire cluster can be spun up and shut down for shorter, ad-hoc workloads.
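
To make the auto-scaling point more concrete, here is a minimal sketch of a scaling policy for a worker instance group, written as the plain dictionary that boto3's EMR client accepts in put_auto_scaling_policy. The function name, the thresholds, the capacity limits, and the choice of the YARNMemoryAvailablePercentage metric are assumptions for illustration only, not recommendations from the white paper.

```python
def build_scaling_policy(min_nodes=2, max_nodes=10,
                         low_memory_pct=15.0, high_memory_pct=75.0):
    """Build an EMR auto-scaling policy keyed on available YARN memory.

    All thresholds and capacities are hypothetical; tune them per workload.
    """

    def rule(name, comparison, threshold, adjustment):
        return {
            "Name": name,
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": adjustment,  # nodes to add (+) or remove (-)
                    "CoolDown": 300,  # seconds to wait between scaling activities
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "Namespace": "AWS/ElasticMapReduce",
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "ComparisonOperator": comparison,
                    "Threshold": threshold,
                    "Statistic": "AVERAGE",
                    "Unit": "PERCENT",
                    "Period": 300,  # seconds per evaluation window
                    "EvaluationPeriods": 1,
                    # Placeholder that EMR resolves to the cluster's own ID.
                    "Dimensions": [
                        {"Key": "JobFlowId", "Value": "${emr.clusterId}"}
                    ],
                }
            },
        }

    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            # Little free memory means the cluster is busy: add two nodes.
            rule("ScaleOutOnLowMemory", "LESS_THAN", low_memory_pct, 2),
            # Plenty of free memory means the cluster is idle: remove one node.
            rule("ScaleInOnHighMemory", "GREATER_THAN", high_memory_pct, -1),
        ],
    }


policy = build_scaling_policy()

# With AWS credentials configured, the policy could be attached like this
# (both IDs below are placeholders):
#   import boto3
#   boto3.client("emr").put_auto_scaling_policy(
#       ClusterId="j-XXXXXXXX",
#       InstanceGroupId="ig-XXXXXXXX",
#       AutoScalingPolicy=policy,
#   )
```

Scaling out faster than scaling in (two nodes added, one removed) is a deliberately conservative choice in this sketch, so that a brief lull does not shrink the cluster right before the next burst of jobs.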

Despite these immediate advantages, migrating a live production cluster (potentially multi-petabyte) to AWS is never an easy task, given the sheer number of components and services involved.

Tempting as the flexibility and scalability of AWS may be, there are several migration considerations to take into account before starting your journey on the Cloud. These are covered in detail in the complete on-prem Hadoop to Amazon EMR white paper, which is available for download.

Big Data Requires Big Expertise

Migrating data processing pipelines is a complex task. It requires a thorough understanding of your business logic (the associated jobs and flows), as well as open source, custom ad-hoc, or enterprise tooling for migration and verification.

When running your workloads on the Cloud, plan and architect for failure, and treat all of your data processing resources as ephemeral.
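
One way to picture "ephemeral" is a transient cluster: it starts, runs its steps against data kept in S3, and terminates on its own. The sketch below builds the request dictionary that boto3's EMR client accepts in run_job_flow. The cluster name, release label, instance types and counts, and the example-bucket paths are all hypothetical, and the default EMR roles are assumed to exist in the account.

```python
def build_transient_cluster_request(job_script_uri, log_uri):
    """Build a request for a cluster that runs one Spark step, then terminates.

    Instance types, counts, and the release label are hypothetical choices.
    """
    return {
        "Name": "nightly-batch",        # hypothetical cluster name
        "ReleaseLabel": "emr-5.12.0",   # assumption: pick the release you have tested
        "LogUri": log_uri,              # logs go to S3 so they outlive the cluster
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m4.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m4.xlarge", "InstanceCount": 2},
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "spark-job",
                # If the step fails, do not leave an idle cluster running.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             job_script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    job_script_uri="s3://example-bucket/jobs/etl.py",  # placeholder path
    log_uri="s3://example-bucket/emr-logs/",           # placeholder path
)

# With AWS credentials configured, the cluster could be launched like this:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Because the job script, its input and output, and the logs all live in S3 in this sketch, nothing of value is lost when the cluster disappears, whether it terminates normally or fails partway through.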

Given the mammoth proportions of such a migration, consider seeking help from an AWS partner with expertise in Big Data migrations. At TriNimbus, we’re proud to be recognized as an AWS Big Data Competency Partner, and would love to hear from you, not only to help with your migration but also to add value to your long-term success on AWS.

In addition to the considerations above, your Data Infrastructure team should be ready to ramp up on new tools that will help them manage their data processing platform reliably and efficiently on AWS. If you have any questions or would like help with migrating your workloads, please contact us.