Sadequl Hussain

Sadequl Hussain is an information technologist, trainer, and guest blogger for Machine Data Almanac. He comes from a strong database background and has 20 years of experience in development, infrastructure engineering, database management, training, and technical authoring. He loves working with cloud technologies and anything related to databases and big data. When he is not blogging or making training videos, he can be found spending time with his young family.

Posts by Sadequl Hussain

How to Monitor AWS CloudTrail Logs with Sumo Logic

How to Read, Search, and Analyze AWS CloudTrail Logs

What is AWS CloudTrail?

What is Serverless Architecture?

A Primer on Building a Monitoring Strategy for Amazon RDS

In a previous blog post, we talked about Amazon Relational Database Service (RDS). RDS is one of the most popular cloud-based database services today, used extensively by Amazon Web Services (AWS) customers for its ease of use, cost-effectiveness, and simple administration. As a managed service, RDS frees database administrators (DBAs) from many day-to-day tasks, but it still needs to be monitored for performance and availability. That's because Amazon doesn't auto-tune database performance; under the shared responsibility model, that part falls to the customer. DBAs and operations teams therefore need a monitoring strategy and processes in place to keep an eye on their RDS fleet. In this blog post, we will talk about an overall best-practice approach for doing this.

Why Database Monitoring

Keeping a database monitoring regimen in place, no matter how simple, can help address potential issues proactively, before they become incidents that cost additional time and money. Most AWS infrastructure teams typically have decent monitoring in place for resources like EC2, ELB, Auto Scaling groups, and logs. Database monitoring often comes at a later stage or is ignored altogether. With RDS, it's also easy to overlook because of the low-administration nature of the service. The DBA or the infrastructure manager should therefore invest some time in formulating and implementing a database monitoring policy.

Please note that designing an overall monitoring strategy is an involved process and is not just about defining database counters to monitor. It also includes areas like:

- Service level agreements
- Classifying incident types (Critical, Serious, Moderate, Low, etc.)
- Creating a RACI (Responsible, Accountable, Consulted, Informed) matrix
- Defining escalation paths

A detailed discussion of all these topics is beyond the scope of this article, so we will concentrate on the technical part only.

What to Monitor

Database monitoring, or RDS monitoring in this case, is not about monitoring only database performance. A monitoring strategy should include the following broad categories and their components:

- Availability: Is the RDS instance or cluster endpoint accessible from client tools? Has any instance been stopped, started, failed over, or deleted? Has a multi-AZ instance failed over?
- Recoverability: Is the RDS instance being backed up, both automatically and manually? Are individual databases being backed up successfully?
- Health and performance: How much CPU, memory, and disk space is currently in use? What is the query latency? What is the disk read/write latency? What is the disk queue length? How many database connections are active? Are there any blocking and waiting tasks? Are there any errors or warnings reported in the database log files? Are they related to application queries, or to non-optimal configuration values? Are any of the scheduled jobs failing?
- Manageability: Have there been any changes to an RDS instance's tags, security groups, instance properties, or parameter and option groups? Who made those changes, and when?
- Security: Which users are connecting to the database instance? What queries are they running?
- Cost: How much is each RDS instance costing every month?

While many of these things can be monitored directly in AWS, Sumo Logic can greatly help with understanding all of the logs and metrics that RDS produces. In this article, we will talk about what AWS offers for monitoring RDS. As we go along, we will point out where we think Sumo Logic can make the work easier.
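To make the availability and recoverability categories concrete, even a small script can form the basis of a check. The following is a minimal sketch, assuming the boto3 library, configured AWS credentials, and a hypothetical instance identifier prod-mysql-01; it is illustrative, not a production-grade health check.

```python
import boto3

# Hypothetical instance identifier; replace with your own.
DB_INSTANCE_ID = "prod-mysql-01"

rds = boto3.client("rds")

# Availability: is the instance in the "available" state?
instance = rds.describe_db_instances(
    DBInstanceIdentifier=DB_INSTANCE_ID
)["DBInstances"][0]
print(f"Status:   {instance['DBInstanceStatus']}")
print(f"Multi-AZ: {instance['MultiAZ']}")

# Recoverability: when was the most recent automated snapshot taken?
snapshots = rds.describe_db_snapshots(
    DBInstanceIdentifier=DB_INSTANCE_ID,
    SnapshotType="automated",
)["DBSnapshots"]
if snapshots:
    latest = max(s["SnapshotCreateTime"] for s in snapshots)
    print(f"Latest automated snapshot: {latest}")
else:
    print("No automated snapshots found")
```

Run on a schedule and wired to an alert, checks like these cover the first two categories; the remaining ones are better served by the metrics and logs discussed next.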
Monitoring Amazon CloudWatch

You can start monitoring RDS using metrics from Amazon CloudWatch. Amazon RDS, like any other AWS service, exposes a number of metrics through CloudWatch. There are three ways to access these metrics:

- From the AWS Console
- Using the AWS CLI
- Using the REST APIs

Amazon CloudWatch shows two types of RDS metrics: built-in metrics and enhanced monitoring metrics.

Built-in Metrics

These metrics are available from any RDS instance. They are collected from the hypervisor of the host running the RDS virtual machine. Some of the metrics may not be available for all database engines, but the important ones are common, and some are visible right from the RDS console. We recommend monitoring the following RDS metrics from CloudWatch:

- CPUUtilization: The percentage of CPU in use on the RDS instance. A consistently high value means one or more processes are monopolizing CPU time while others wait for it.
- DiskQueueDepth: The number of input and output requests waiting for the disk resource. A consistently high value means disk resource contention, perhaps due to locking or long-running update queries.
- DatabaseConnections: The number of database connections against the RDS instance. A sudden spike should be investigated immediately. It may not mean a DDoS attack, but it can indicate an application issue, such as generating multiple connections per request.
- FreeableMemory: The amount of RAM available on the RDS instance, in bytes. A very low value means the instance is under memory pressure.
- FreeStorageSpace: The amount of disk storage available, in bytes. A small value means disk space is running out.
- ReadIOPS: The average number of disk read operations per second. Monitor for sudden spikes, which can mean runaway queries.
- WriteIOPS: The average number of disk write operations per second. Monitor for sudden spikes, which can mean a very large data modification.
- ReadLatency: The average time, in milliseconds, to perform a read operation from disk. A higher value may mean a slow disk operation, probably caused by locking.
- WriteLatency: The average time, in milliseconds, to perform a write operation to disk. A higher value may mean disk contention.
- ReplicaLag: How far, in time, the read replica of a MySQL, MariaDB, or PostgreSQL instance is lagging behind its master. A high lag value means read operations from the replica are not serving current data.

The Amazon RDS Aurora engine also exposes some extra counters that are very useful for troubleshooting. At the time of writing, Aurora supports MySQL and PostgreSQL only. We recommend monitoring these counters:

- DDLLatency: The average time, in milliseconds, to complete Data Definition Language (DDL) commands like CREATE, DROP, and ALTER. A high value means the database is having performance issues running DDL commands, which can be due to exclusive locks on objects.
- SelectLatency: The average time, in milliseconds, to complete SELECT queries. A high value may mean disk contention, poorly written queries, or missing indexes.
- InsertLatency: The average time, in milliseconds, to complete INSERT commands. A high value may mean locking or a poorly written INSERT command.
- DeleteLatency: The average time, in milliseconds, to complete DELETE commands. A high value may mean locking or a poorly written DELETE command.
- UpdateLatency: The average time, in milliseconds, to complete UPDATE commands. A high value may mean locking or a poorly written UPDATE command.
- Deadlocks: The average number of deadlocks happening per second in the database. Anything above zero should be a concern; it means the application's queries are running in a way that frequently blocks one another.
- BufferCacheHitRatio: The percentage of queries that can be served by data already stored in memory. It should be a high value, near 100, meaning queries don't have to access the disk to fetch data.
- Queries: The average number of queries executed per second. This should hold a steady average value; any sudden spike or dip should be investigated.

You can use the AWS documentation for a complete list of built-in RDS metrics.
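Whichever access method you choose, pulling a built-in metric is a single API call. Here is a minimal sketch that reads CPUUtilization for one instance with boto3, assuming configured credentials and the same hypothetical instance identifier as before:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Average CPUUtilization over the last hour, in 5-minute periods,
# for a hypothetical RDS instance.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-mysql-01"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")
```

The AWS CLI offers the equivalent aws cloudwatch get-metric-statistics command with the same parameters, which is handy for ad hoc checks.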
Enhanced Monitoring Metrics

RDS also exposes "enhanced monitoring" metrics. These are collected by an agent running on the RDS instance's operating system. Enhanced monitoring can be enabled when an instance is first created, or it can be enabled later. We recommend enabling it because it offers a better view of the database engine.

Like built-in metrics, enhanced metrics are available from the RDS console. Unlike built-in metrics, though, enhanced metrics are not readily accessible from the CloudWatch Metrics console. When enhanced monitoring is enabled, CloudWatch creates a log group called RDSOSMetrics in CloudWatch Logs. Under this log group, there is a log stream for each RDS instance with enhanced monitoring, and each log stream contains a series of JSON documents as records (a sketch of reading these records directly follows the counter list below). Each JSON document shows a series of metrics collected at regular intervals (by default, every minute). Here is a sample excerpt from one such JSON document:

{
  "engine": "Aurora",
  "instanceID": "prodataskills-mariadb",
  "instanceResourceID": "db-W4JYUYWNNIV7T2NDKTV6WJSIXU",
  "timestamp": "2018-06-23T11:50:27Z",
  "version": 1,
  "uptime": "2 days, 1:31:19",
  "numVCPUs": 2,
  "cpuUtilization": {
    "guest": 0,
    "irq": 0.01,
    "system": 1.72,
    "wait": 0.27,
    "idle": 95.88,
    "user": 1.91,
    "total": 4.11,
    "steal": 0.2,
    "nice": 0
  },
  ...
}

It's possible to create custom CloudWatch metrics from these logs and view those metrics from the CloudWatch console, although this requires some extra work. However, both built-in and enhanced metrics can be streamed to Sumo Logic, where you can build your own charts and alarms. Regardless of platform, we recommend monitoring the enhanced metrics for a more complete view of the RDS database engine. The following counters should be monitored for Amazon Aurora, MySQL, MariaDB, PostgreSQL, or Oracle:

- cpuUtilization (user): The percentage of CPU used by user processes.
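Because the enhanced metrics live in CloudWatch Logs rather than CloudWatch Metrics, reading them means reading log events. The following is a minimal sketch, assuming boto3, configured credentials, and at least one instance with enhanced monitoring enabled; the stream selection is simplified for illustration.

```python
import json

import boto3

logs = boto3.client("logs")

# Each enhanced-monitoring instance gets its own stream under RDSOSMetrics;
# for brevity we just take the most recently active stream.
streams = logs.describe_log_streams(
    logGroupName="RDSOSMetrics",
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)["logStreams"]

events = logs.get_log_events(
    logGroupName="RDSOSMetrics",
    logStreamName=streams[0]["logStreamName"],
    limit=5,
)["events"]

for event in events:
    doc = json.loads(event["message"])  # one JSON document per record
    cpu = doc["cpuUtilization"]
    print(doc["instanceID"], doc["timestamp"], f"user CPU: {cpu['user']}%")
```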

AWS | July 17, 2018

Comparing AWS Data Warehouse & Analytics Services — Migrating to AWS Part 3

AWS Data Warehouse and Analytics Services

In this final article of our three-part blog series, we will introduce you to two popular data services from Amazon Web Services (AWS): Redshift and Elastic MapReduce (EMR). These services are ideal for AWS customers who need to store large volumes of structured, semi-structured, or unstructured data and query it quickly.

Amazon Redshift

Amazon Redshift is a fully managed data warehouse platform from AWS. Customers can store large volumes of structured, relational datasets in Redshift tables and run analytical workloads on those tables. This can be an ideal solution for processing and summarizing high-volume sales, clickstream, or other large datasets. Although you can create data warehouses in RDS, Redshift is a better choice for the following reasons:

- Amazon Redshift was built from the ground up as a Massively Parallel Processing (MPP) data warehouse. This means data is distributed across more than one node in a Redshift cluster (although you can create one-node clusters too). Redshift uses the combined power of all the computers in a cluster to process this data quickly and efficiently.
- A Redshift cluster can be scaled up from a few gigabytes to more than a petabyte. That's not possible with RDS. With RDS, you can create a single large instance with a multi-AZ deployment and one or more read replicas. The read replicas can help increase read performance, and the multi-AZ secondary node keeps the database online during failovers, but the actual data processing still happens on a single node. With Redshift, it's not uncommon to see 50- to 100-node clusters, with all nodes taking part in data storage and processing (see the sketch after this list).
- Storage space in Redshift can be used more efficiently than in RDS with suitable column encodings and data distribution styles. With proper column encoding and data distribution, Redshift can squeeze large amounts of data into fewer data pages, dramatically reducing table sizes. Also, a Redshift data page is 2 MB, compared to the typical 8 KB of a relational database. This also helps store larger amounts of data per page and increases read performance.
- Amazon Redshift offers a number of ways to monitor cluster and query performance. It's simple to see each running query and its query plan from the Redshift console, and it's also very easy to see how many resources a running query is consuming. This feature is not readily available in RDS yet.
- Finally, Redshift offers a way to prioritize different types of analytic workloads in the cluster, so that specific types of data operations take priority over others and no single query or data load can bring down the entire system. This prioritization is made possible by the Workload Management (WLM) configuration. With WLM, administrators can assign groups of similar queries to different workload queues. Each queue is then assigned a portion of the cluster's resources. When a query running in a queue uses up all its resources or reaches the queue's concurrency limit, it must wait; meanwhile, unblocked queries in other queues can still run.

Use Cases

- Data warehouses hosting very large amounts of data
- Part of an enterprise data lake
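To make the cluster-versus-instance point concrete, here is a minimal sketch that inspects a cluster's node layout with boto3, assuming configured credentials and a hypothetical cluster identifier analytics-cluster:

```python
import boto3

redshift = boto3.client("redshift")

# Hypothetical cluster identifier; replace with your own.
cluster = redshift.describe_clusters(
    ClusterIdentifier="analytics-cluster"
)["Clusters"][0]

print(f"Node type:  {cluster['NodeType']}")
print(f"Node count: {cluster['NumberOfNodes']}")
print(f"Status:     {cluster['ClusterStatus']}")

# Scaling out is a single API call; Redshift redistributes the data
# across the new node set while the resize runs.
# redshift.modify_cluster(
#     ClusterIdentifier="analytics-cluster",
#     NumberOfNodes=cluster["NumberOfNodes"] * 2,
# )
```

Contrast this with RDS, where adding capacity means a bigger instance class or more read replicas rather than more nodes sharing the same workload.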
Elastic MapReduce (EMR)

Amazon Elastic MapReduce (EMR) is AWS's managed Hadoop environment in the cloud. We have already seen managed systems like RDS, DynamoDB, and Redshift, and EMR is no different. Like RDS, customers can spin up Apache Hadoop clusters in EMR by selecting a few options in a series of wizard-like screens. Anyone who has manually installed a multi-node Hadoop cluster will appreciate the time and effort it takes to install all the prerequisites and the core software, add any extra components, and finalize the configuration. With EMR, all of this is done behind the scenes, so users don't need to worry about it.

EMR also has the ability to make its clusters "transient." This means an EMR cluster doesn't have to run when it's not needed: a cluster can be spun up, made to process data in one or more series of "steps," and then spun down, with the results written to S3 for later consumption. Traditional Hadoop installations are quite monolithic in nature, sometimes with hundreds of nodes sitting idle when no jobs are running. With EMR, this waste can be minimized.

Finally, EMR adds a new type of file system for Hadoop: the EMR File System (EMRFS). EMRFS extends Amazon S3 as the file system for the Hadoop cluster, so data in a cluster is not lost when the cluster is terminated.

Use Cases

- Any processing workload requiring a Hadoop back end (e.g., Hive, HBase, Pig, Sqoop)
- Enterprise data lakes
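The transient pattern maps directly onto the EMR API. Here is a minimal sketch with boto3 that launches a small cluster, runs one step, and lets the cluster terminate itself when the step finishes; the bucket, script path, and release label are hypothetical, and the default EMR IAM roles are assumed to already exist in the account.

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="transient-nightly-aggregation",
    ReleaseLabel="emr-5.15.0",                  # illustrative release label
    LogUri="s3://my-bucket/emr-logs/",          # hypothetical bucket
    ServiceRole="EMR_DefaultRole",              # default EMR roles assumed
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        # False makes the cluster transient: it shuts down after its steps.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "aggregate-clickstream",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Hypothetical Hive script; results land in S3 via EMRFS.
                "Args": [
                    "hive-script", "--run-hive-script",
                    "--args", "-f", "s3://my-bucket/scripts/aggregate.hql",
                ],
            },
        }
    ],
)
print("Cluster ID:", response["JobFlowId"])
```

Because the results and logs go to S3, nothing of value lives on the cluster itself, which is exactly what makes tearing it down after each run safe.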
Conclusion

In this three-part blog series, we gave a brief introduction to some of the most commonly used AWS services. The storage, database, and analytics services have evolved over time and have become more robust and scalable as customers have tested them with a multitude of use cases. The following "cheat sheet" summarizes the various AWS technologies, their core functions, and where you would implement each:

- Simple Storage Service (S3): A highly available and durable file system. Use it for static website hosting, backup storage, log file storage, and as a data pipeline source and destination.
- Amazon Glacier: A file system for long-term data storage. Use it for backup archival and critical file archiving.
- Relational Database Service (RDS): A fully managed database service for Oracle, Microsoft SQL Server, Aurora, PostgreSQL, MySQL, MariaDB, etc. Use it as the backend for content management systems and enterprise applications.
- DynamoDB: A fully managed NoSQL database. Use it for user preferences, clickstream data, and games and IoT data.
- Elastic Compute Cloud with Elastic Block Storage: A virtual host with attached storage. Use it for hosting databases of any kind.
- Elastic Compute Cloud with Elastic File System: A virtual host with a mounted file system. Use it for data analytics and media servers.
- Amazon Redshift: A petabyte-scale data warehouse. Use it as an enterprise data warehouse or as part of a data lake.
- Amazon ElastiCache: A high-performance in-memory database (Redis or Memcached). Use it for mobile games and IoT applications.
- Elastic MapReduce (EMR): A fully managed Hadoop environment. Use it for any application that requires a Hadoop or Apache Spark back end, or as part of a data lake.

There are also a number of auxiliary services that work as "glue" between these primary services, including AWS Database Migration Service (DMS), Elasticsearch, Data Pipeline, AWS Glue, Athena, Kinesis, and Lambda. Using these tools, customers can build complex data pipelines with relative ease. These tools are also serverless, which means they can scale up or down automatically as needed.

Also, please note that we have not provided pricing details for any of the services we discussed, nor did we talk about EC2 or RDS instance classes or their capacities. That's because pricing varies over time and differs between regions, and Amazon brings out new classes of servers at regular intervals.

Additional Resources

- Comparing AWS S3 and Glacier Data Storage Services – Migrating to AWS Part 1
- Comparing RDS, DynamoDB & Other Popular Database Services – Migrating to AWS Part 2
- AWS 101: An Overview of Amazon Web Services

AWS | July 12, 2018

Comparing RDS, DynamoDB & Other Popular Database Services – Migrating to AWS Part 2

Comparing AWS S3 and Glacier Data Storage Services - Migrating to AWS Part 1