AWS Data Warehouse and Analytics Services
In this final article of our three-part blog series, we introduce two popular data services from Amazon Web Services (AWS): Redshift and Elastic MapReduce (EMR). These services let AWS customers store large volumes of structured, semi-structured or unstructured data and query them quickly.
Amazon Redshift
Amazon Redshift is a fully-managed data warehouse platform from AWS. Customers can store large volumes of structured, relational datasets in Redshift tables and run analytical workloads on those tables. This can be an ideal solution for processing and summarizing high-volume sales, clickstream or other large datasets.
Although you can create data warehouses in RDS, Redshift would be a better choice for the following reasons:
- Amazon Redshift was built from the ground up as a Massively Parallel Processing (MPP) data warehouse. This means data is distributed across more than one node in a Redshift cluster (although you can create one-node clusters too). Redshift uses the combined power of all the computers in a cluster to process this data quickly and efficiently.
- A Redshift cluster can be scaled up from a few gigabytes to more than a petabyte. That’s not possible with RDS. With RDS, you can create a single, large instance with multi-AZ deployment and one or more read-replicas. The read-replicas can help increase read performance, and the multi-AZ secondary node will keep the database online during failovers, but the actual data processing still happens in one node only. With Redshift, it’s not uncommon to see 50 to 100-node clusters, with all the nodes taking part in data storage and processing.
- Redshift can use storage space more efficiently than RDS. With suitable column encoding and data distribution styles, Redshift can fit large amounts of data into fewer data pages, dramatically reducing table sizes. Also, a Redshift data page (block) is 1 MB, compared to the typical 8 KB of a relational database. The larger page holds more data per page and improves read performance.
- Amazon Redshift offers a number of ways to monitor cluster and query performance. It’s simple to see each individual running query and its query plan from the Redshift console. It’s also very easy to see how many resources a running query is consuming. This feature is not readily available in RDS yet.
- Finally, Redshift offers a way to prioritize different types of analytic workloads in the cluster. This allows specific types of data operations to have more priority than others. This also ensures any single query or data load doesn’t bring down the entire system.
- This prioritization is made possible with the Workload Management (WLM) configuration. With WLM, administrators can assign groups of similar queries to different workload queues. Each queue is then assigned a portion of the cluster’s resources. When a queue has used up its resources or reached its concurrency limit, further queries sent to that queue must wait. Meanwhile, queries in other queues can still run.
Typical use cases for Amazon Redshift:
- Data warehouses hosting very large amounts of data
- Part of an enterprise data lake
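To make the Workload Management queues described above more concrete, here is a minimal sketch of a three-queue layout expressed as JSON. The property names (`user_group`, `query_group`, `query_concurrency`, `memory_percent_to_use`) follow the format of Redshift's `wlm_json_configuration` parameter as we understand it; the group names, concurrency levels and memory percentages are purely hypothetical, and the `build_wlm_json` helper is our own illustration, not part of any AWS library.

```python
import json

# Hypothetical queues: the group names and percentages are illustrative only.
WLM_QUEUES = [
    # Data loads submitted by members of an assumed "etl_users" user group.
    {"user_group": ["etl_users"], "query_concurrency": 3, "memory_percent_to_use": 50},
    # Queries tagged with an assumed "dashboards" query group.
    {"query_group": ["dashboards"], "query_concurrency": 10, "memory_percent_to_use": 30},
    # A queue with no groups serves as the default for everything else.
    {"query_concurrency": 5, "memory_percent_to_use": 20},
]


def build_wlm_json(queues):
    """Sanity-check the queue definitions and serialize them to JSON."""
    total_memory = sum(q.get("memory_percent_to_use", 0) for q in queues)
    if total_memory > 100:
        raise ValueError(f"queues claim {total_memory}% of memory, more than 100%")
    for q in queues:
        if q["query_concurrency"] < 1:
            raise ValueError("each queue needs a concurrency level of at least 1")
    return json.dumps(queues)


wlm_json = build_wlm_json(WLM_QUEUES)
print(wlm_json)
```

The resulting string is the kind of value you would supply for the `wlm_json_configuration` parameter in the cluster's parameter group. In this layout, a burst of dashboard queries can fill only its own ten slots, so data loads in the first queue keep running.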
Elastic MapReduce (EMR)
Amazon Elastic MapReduce (EMR) is AWS’ managed Hadoop environment in the cloud. We have already seen managed systems such as RDS, DynamoDB and Redshift, and EMR is no different. As with RDS, customers can spin up Apache Hadoop clusters in EMR by selecting a few options in a series of wizard-like screens.
Anyone with experience manually installing a multi-node Hadoop cluster will appreciate the time and effort it takes to install all the prerequisites, the core software and any additional components, and then finalize the configuration. With EMR, all this is done behind the scenes, so users don’t need to worry.
EMR also has the ability to make its clusters “transient.” This means an EMR cluster doesn’t have to run when it’s not needed. A cluster can be spun up, made to process data in a series of one or more “steps” and then spun down. The results of the processing can be written to S3 for later consumption. Traditional Hadoop installations are quite monolithic in nature, sometimes with hundreds of nodes sitting idle when no jobs are running. With EMR, this waste can be minimized.
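As a rough illustration of a transient cluster, the sketch below builds the arguments for the `run_job_flow` call in boto3 (the AWS SDK for Python). Setting `KeepJobFlowAliveWhenNoSteps` to `False` asks EMR to shut the cluster down once its steps finish. The bucket names, script path, release label, instance types and cluster size are all hypothetical, and the two role names assume the default EMR roles already exist in the account.

```python
def build_transient_cluster_request(script_uri, output_uri, log_uri):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "nightly-summary-transient",   # hypothetical cluster name
        "ReleaseLabel": "emr-5.36.0",          # assumed release; pick a current one
        "LogUri": log_uri,
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False makes the cluster transient: it terminates after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "summarize-clickstream",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Run a Hive script stored in S3; results go to output_uri.
                    "Args": [
                        "hive-script", "--run-hive-script", "--args",
                        "-f", script_uri,
                        "-d", f"OUTPUT={output_uri}",
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request):
    """Submit the request. Needs boto3 installed and AWS credentials configured."""
    import boto3
    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


request = build_transient_cluster_request(
    script_uri="s3://example-bucket/scripts/summary.q",
    output_uri="s3://example-bucket/output/",
    log_uri="s3://example-bucket/logs/",
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Calling `launch(request)` would start the cluster, run the single Hive step and let EMR terminate the cluster afterwards, leaving the results in S3.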
Finally, EMR adds a new type of file system for Hadoop: the EMR File System (EMRFS). EMRFS lets the Hadoop cluster use Amazon S3 as its file system. Because the data lives in S3 rather than on the cluster nodes, it is not lost when the cluster is terminated.
Typical use cases for EMR:
- Any processing workload requiring a Hadoop back-end (e.g. Hive, HBase, Pig or Sqoop)
- Enterprise data lakes
In this three-part blog series, we have briefly introduced some of the most commonly used AWS services. The storage, database and analytics services have evolved over time and have become more robust and scalable as customers have tested them with a multitude of use cases.
The following table shows a “cheat sheet” of the various AWS technologies, their core functions and where you would implement each. Take a look:
|AWS Technology|What is it?|Where do you use it?|
|---|---|---|
|Simple Storage Service|A highly available and durable file system||
|Amazon Glacier|A file system for long-term data storage||
|Relational Database Service|A fully managed database service for Oracle, Microsoft SQL Server, Aurora, PostgreSQL, MySQL, MariaDB, etc.||
|DynamoDB|A fully managed NoSQL database||
|Elastic Compute Cloud with Elastic Block Storage|A virtual host with attached storage||
|Elastic Compute Cloud with Elastic File System|A virtual host with a mounted file system||
|Amazon Redshift|A petabyte-scale data warehouse||
|Amazon ElastiCache|A high-performance in-memory database (Redis or Memcached)||
|Elastic MapReduce|A fully managed Hadoop environment||
There are also a number of auxiliary services that work as “glue” between these primary services. They include AWS Database Migration Service (DMS), Amazon Elasticsearch Service, Data Pipeline, AWS Glue, Athena, Kinesis and Lambda. Using these tools, customers can build complex data pipelines with relative ease. Several of them, such as Glue, Athena and Lambda, are also serverless, which means they scale up or down automatically as needed.
Also, please note that we have not provided pricing details for any of the services we discussed, nor did we talk about EC2 or RDS instance classes or their capacities. That’s because pricing varies over time and between regions, and Amazon brings out new classes of servers at regular intervals.