A Primer on Building a Monitoring Strategy for Amazon RDS

In a previous blog post, we talked about Amazon Relational Database Service (RDS). RDS is one of the most popular cloud-based database services today, used extensively by Amazon Web Services (AWS) customers for its ease of use, cost-effectiveness, and simple administration.

Although RDS, as a managed service, relieves database administrators (DBAs) of many day-to-day tasks, it still needs to be monitored for performance and availability. Amazon doesn't auto-tune database performance; under the shared responsibility model, that tuning remains the customer's job.

That's why DBAs and operations teams should have a monitoring strategy and processes in place to keep an eye on their RDS fleet. In this blog post, we will talk about an overall best-practice approach for doing this.

Why Database Monitoring

Keeping a database monitoring regimen in place, no matter how simple, can help address potential issues proactively before they become incidents and cost additional time and money. Most AWS infrastructure teams have decent monitoring in place for resources like EC2, ELB, Auto Scaling groups, and logs. Database monitoring, however, often comes at a later stage or is ignored altogether. With RDS, it's also easy to overlook because of the low-administration nature of the service. DBAs and infrastructure managers should therefore invest some time in formulating and implementing a database monitoring policy.

Please note that designing an overall monitoring strategy is an involved process and is not just about defining database counters to monitor. It also includes areas like:

  • Defining service level agreements (SLAs)
  • Classifying incident types (critical, serious, moderate, low, etc.)
  • Creating a RACI (Responsible, Accountable, Consulted, Informed) matrix
  • Defining escalation paths

A detailed discussion of all these topics is beyond the scope of this article, so we will concentrate on the technical part only.

What to Monitor

Database monitoring, or RDS monitoring in this case, is not about monitoring only database performance. A monitoring strategy should include the following broad categories and their components:

Monitoring category Examples of what to monitor
Availability
  • Is the RDS instance or cluster endpoint accessible from client tools?
  • Are any instances being stopped, started, failed over, or deleted?
  • Has a Multi-AZ instance failed over?
Recoverability
  • Is the RDS instance being backed up – both automatically and manually?
  • Are individual databases being backed up successfully?
Health and Performance
  • What’s the CPU, memory and disk space currently in use?
  • What’s the query latency?
  • What’s the disk read/write latency?
  • What’s the disk queue length?
  • How many database connections are active?
  • Are there any blocking and waiting tasks?
  • Are there any errors or warnings reported in database log files?
    • Are these related to application queries?
    • Are they related to non-optimal configuration values?
  • Are any of the scheduled jobs failing?
Manageability
  • Are there any changes in the RDS instances’
    • Tags
    • Security groups
    • Instance properties
    • Parameter and option groups?
  • Who made those changes and when?
Security
  • Which users are connecting to the database instance?
  • What queries are they running?
Cost
  • How much is each RDS instance costing every month?

While many of these things can be monitored directly in AWS, Sumo Logic can greatly help with understanding all of the logs and metrics that RDS produces.

In this article, we will talk about what AWS offers for monitoring RDS. As we go along, we will point out where we think Sumo Logic can make the work easier.

Monitoring with Amazon CloudWatch

You can start monitoring RDS using metrics from Amazon CloudWatch. Amazon RDS, like any other AWS service, exposes a number of metrics which are available through CloudWatch. There are three ways to access these metrics:

  • From AWS Console
  • Using AWS CLI
  • Using REST APIs
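
For example, here is a minimal sketch of pulling a built-in RDS metric through the CloudWatch API using boto3, the AWS SDK for Python (the CLI exposes the same call as aws cloudwatch get-metric-statistics). The region and the instance identifier "my-rds-instance" are placeholders; substitute your own.

# Minimal sketch: fetch the last hour of CPUUtilization for one RDS instance.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                # one data point per 5 minutes
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")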

The image below shows some of these metrics from the RDS console:

Amazon CloudWatch shows two types of RDS metrics:

  • Built-in Metrics
  • Enhanced Monitoring Metrics

Built-in Metrics

These metrics are available from any RDS instance. They are collected from the hypervisor of the host running the RDS virtual machine. Some of the metrics may not be available for all database engines, but the important ones are common.

We recommend monitoring the following RDS metrics from CloudWatch:

Metric What it means Why you should monitor it
CPUUtilization % CPU load on the RDS instance. A consistently high value means one or more processes are waiting for CPU time while other processes consume it.
DiskQueueDepth The number of input and output requests waiting for the disk resource. A consistently high value means disk resource contention, perhaps due to locking, long-running update queries, etc.
DatabaseConnections The number of database connections against the RDS instance. A sudden spike should be investigated immediately. It may not be a DDoS attack, but it could indicate an issue with the application generating multiple connections per request.
FreeableMemory The amount of RAM available in the RDS instance, expressed in bytes. A very low value means the instance is under memory pressure.
FreeStorageSpace The amount of disk storage available in bytes. A small value means disk space is running out.
ReadIOPS The average number of disk read operations per second. Should be monitored for sudden spikes, which can mean runaway queries.
WriteIOPS The average number of disk write operations per second. Should be monitored for sudden spikes, which can mean a very large data modification.
ReadLatency The average time in milliseconds to perform a read operation from disk. A higher value may mean a slow disk operation, probably caused by locking.
WriteLatency The average time in milliseconds to perform a write operation to disk. A higher value may mean disk contention.
ReplicaLag How far in time the read replica of a MySQL, MariaDB, or PostgreSQL instance is lagging behind its master. A high lag value means read operations from the replica are not serving current data.

 

The Amazon Aurora engine also exposes some extra counters that are very useful for troubleshooting. At the time of writing, Aurora supports MySQL and PostgreSQL compatibility only. We recommend monitoring these counters:

Metric What it means Why you should monitor it
DDLLatency The average time in milliseconds to complete Data Definition Language (DDL) commands like CREATE, DROP, ALTER etc. A high value means the database is having performance issues running DDL commands. This can be due to exclusive locks on objects.
SelectLatency The average time in milliseconds to complete SELECT queries. A high value may mean disk contention, poorly written queries, missing indexes etc.
InsertLatency The average time in milliseconds to complete INSERT commands. A high value may mean locking or poorly written INSERT command.
DeleteLatency The average time in milliseconds to complete DELETE commands. A high value may mean locking or poorly written DELETE command.
UpdateLatency The average time in milliseconds to complete UPDATE commands. A high value may mean locking or poorly written UPDATE command.
Deadlocks The average number of deadlocks happening per second in the database. More than 0 should be a concern – it means the application queries are running in such a way that they are blocking each other frequently.
BufferCacheHitRatio The percentage of queries that can be served by data already stored in memory. It should be a high value, near 100, meaning queries don't have to access the disk to fetch data.
Queries The average number of queries executed per second. This should have a steady, average value. Any sudden spike or dip should be investigated.

You can use the AWS documentation for a complete list of built-in RDS metrics.

Enhanced Monitoring Metrics

RDS also exposes "enhanced monitoring metrics." These are collected by an agent running on the RDS instance's operating system. Enhanced monitoring can be enabled when an instance is first created or at any later point. We recommend enabling it because it offers a closer view of the operating system the database engine runs on.

Like built-in metrics, enhanced metrics are available from the RDS console. Unlike built-in metrics, though, enhanced metrics are not readily accessible from the CloudWatch Metrics console. When enhanced monitoring is enabled, CloudWatch creates a log group called RDSOSMetrics in CloudWatch Logs:

Under this log group, there will be a log stream for each RDS instance with enhanced monitoring. Each log stream will contain a series of JSON documents as records. Each JSON document will show a series of metrics collected at regular intervals (by default every minute). Here is a sample excerpt from one such JSON document:

{
  "engine": "Aurora",
  "instanceID": "prodataskills-mariadb",
  "instanceResourceID": "db-W4JYUYWNNIV7T2NDKTV6WJSIXU",
  "timestamp": "2018-06-23T11:50:27Z",
  "version": 1,
  "uptime": "2 days, 1:31:19",
  "numVCPUs": 2,
  "cpuUtilization": {
    "guest": 0,
    "irq": 0.01,
    "system": 1.72,
    "wait": 0.27,
    "idle": 95.88,
    "user": 1.91,
    "total": 4.11,
    "steal": 0.2,
    "nice": 0
  },
  ...

It’s possible to create custom CloudWatch metrics from these logs and view those metrics from CloudWatch console. This will require some extra work. However, both built-in and enhanced metrics can be streamed to Sumo Logic from where you can build your own charts and alarms. Regardless of platform, it is recommended to monitor the enhanced metrics for a more complete view of the RDS database engine.
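
If you want to work with the raw enhanced monitoring data yourself, here is a minimal sketch (using boto3, and assuming the default RDSOSMetrics log group shown above) that reads the latest JSON record for each instance and prints a couple of the counters discussed below:

# Minimal sketch: read the most recent enhanced monitoring sample per instance.
import json
import boto3

logs = boto3.client("logs", region_name="us-east-1")   # assumed region

streams = logs.describe_log_streams(
    logGroupName="RDSOSMetrics",
    orderBy="LastEventTime",
    descending=True,
)["logStreams"]

for stream in streams:
    events = logs.get_log_events(
        logGroupName="RDSOSMetrics",
        logStreamName=stream["logStreamName"],
        limit=1,                 # most recent record only
        startFromHead=False,
    )["events"]
    if not events:
        continue
    sample = json.loads(events[0]["message"])
    print(
        sample.get("instanceID"),
        "cpu user %:", sample.get("cpuUtilization", {}).get("user"),
        "free memory KB:", sample.get("memory", {}).get("free"),
    )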

The following counters should be monitored for Amazon Aurora, MySQL, MariaDB, PostgreSQL, or Oracle:

Metric Group Metric What it means and why you should monitor it
cpuUtilization user % of CPU used by user processes. A high percentage means the CPU is under pressure from user processes.
cpuUtilization wait % of CPU not being used while processes wait for IO. A high value can mean an IO bottleneck.
diskIO avgQueueLen The number of IO requests waiting in the disk queue (not available for the Aurora engine). A high queue length (above 2) means IO contention.
diskIO readIOsPS The number of read operations per second (not available for the Aurora engine). A high value means good performance.
diskIO writeIOsPS The number of write operations per second (not available for the Aurora engine). A high value means good performance.
fileSys usedPercent % of file system disk space used. This can be used to monitor any disk pressure.
loadAverageMinute one, five, fifteen The number of processes that tried to access the CPU in the last one, five, or fifteen minutes. This also shows how busy the system is; a continued spike needs to be investigated.
memory free The amount of unassigned memory in KB. A low value means the system is under memory pressure.
tasks blocked The number of blocked tasks. Ideally this should be very low (0 or 1). A high number should be investigated, as OS processes should not get blocked under normal operation.
processList name The name of each process running in the OS.
processList memoryUsedPc % of memory used by the named process.
processList cpuUsedPc % of CPU used by the named process.

For SQL Server RDS, we recommend monitoring the following:

Metric Group Metric What it means and why you should monitor it
cpuUtilization user % of CPU used by user programs. A high percentage means the CPU is under pressure from user processes.
disks usedPc % of disk space used. A high value means a capacity issue.
disks availPc % of disk space available. A low value means a capacity issue.
memory physAvailKb The amount of physical memory available in KB. A low value means the system is under memory pressure.
processList name The name of each process running in the OS.
processList cpuUsedPc % of CPU used by the named process.
processList memUsedPc % of memory used by the named process.

Performance Insights

Performance Insights is a query monitoring tool for RDS. At the time of writing, it's available for the Aurora PostgreSQL-compatible edition, but Amazon plans to ship it for other database engines over time. Performance Insights collects metric data from queries running on the RDS instance and can help identify slow-running queries. We recommend enabling it for RDS because:

  • It provides a real-time, rolling one-hour performance profile of the instance
  • It shows an instance’s load by plotting its Average Active Sessions (AAS) against time. This load can be further sliced by:
    • SQL queries
    • Hosts
    • Users
    • Waits

Any spike in the load by any of these can indicate a bottleneck.
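
When Performance Insights is enabled, its data can also be queried programmatically. The following is a minimal sketch (boto3; the resource ID shown is a placeholder, reused from the enhanced monitoring sample above) that pulls the last hour of database load, sliced by SQL statement:

# Minimal sketch: query database load (average active sessions) from
# Performance Insights, grouped by SQL statement.
from datetime import datetime, timedelta, timezone
import boto3

pi = boto3.client("pi", region_name="us-east-1")        # assumed region
now = datetime.now(timezone.utc)

response = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-W4JYUYWNNIV7T2NDKTV6WJSIXU",          # placeholder resource ID
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    PeriodInSeconds=60,
    MetricQueries=[
        {"Metric": "db.load.avg", "GroupBy": {"Group": "db.sql", "Limit": 5}}
    ],
)

for metric in response["MetricList"]:
    dims = metric["Key"].get("Dimensions", {})
    latest = [p.get("Value") for p in metric["DataPoints"][-3:]]
    print(dims, latest)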

Automating Monitoring with Alarms and Dashboards

So far, we have talked about which RDS counters should be monitored. Now we will talk about how to monitor them. It would be extremely time-consuming, if not impractical, to manually watch every RDS instance for each metric listed above. Instead, we recommend using alarms and dashboards.

  • Some alarms should be used for alerting on critical and high-priority events, such as those relating to a system's stability. They should be attended to immediately.
  • Some alarms should be used for alerting on changes in metric values. These alerts are not critical, but they require attention.
  • Dashboards should be used for long-term trend analysis, ongoing troubleshooting, and retrospective investigation.

We recommend creating alarms for the following types of events as “critical.”

Trigger alarm when Comments
An instance or cluster endpoint cannot be contacted This can be achieved by running a Lambda function that checks the availability of an RDS instance or cluster (a minimal sketch follows this table). When the instance or cluster is unresponsive, the function can write a CloudWatch log event, which can be monitored.
One or more databases in the instance is inaccessible This can be done by monitoring RDS logs and recording events in the CloudWatch Logs.
An instance has been stopped This can be monitored through RDS events subscription.
An instance has failed over to secondary This can be monitored through RDS events subscription.
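
Here is a minimal sketch of the availability-check Lambda function mentioned in the first row above. It attempts a TCP connection to the instance or cluster endpoint and records the result; the endpoint, port, namespace, and metric name are placeholders supplied through environment variables. Besides the log line (Lambda output lands in CloudWatch Logs automatically), it also publishes a custom metric so a CloudWatch alarm can watch it.

# Minimal sketch: Lambda that checks whether an RDS endpoint accepts connections.
import os
import socket
import boto3

cloudwatch = boto3.client("cloudwatch")

def lambda_handler(event, context):
    endpoint = os.environ["DB_ENDPOINT"]            # e.g. mydb.xxxx.us-east-1.rds.amazonaws.com
    port = int(os.environ.get("DB_PORT", "3306"))

    reachable = 1
    try:
        with socket.create_connection((endpoint, port), timeout=5):
            pass
    except OSError as err:
        reachable = 0
        print(f"ALARM: cannot reach {endpoint}:{port} - {err}")   # becomes a CloudWatch log event

    cloudwatch.put_metric_data(
        Namespace="Custom/RDS",                                   # placeholder namespace
        MetricData=[{
            "MetricName": "EndpointReachable",
            "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
            "Value": reachable,
        }],
    )
    return {"endpoint": endpoint, "reachable": bool(reachable)}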

The following types of alarms should be treated as “high priority.” These alarms are triggered for changes in metric values:

Trigger alarm when Priority
CPUUtilization > 90% for more than 10 minutes High
fileSys.usedPercent > 80 or disks.usedPc > 80 High
DiskQueueDepth > 2 for more than 10 minutes or diskIO.avgQueueLen > 2 for more than 10 minutes High
ReadLatency > 10 milliseconds for more than 10 minutes High
WriteLatency > 10 milliseconds for more than 10 minutes High
BufferCacheHitRatio < 100 for more than 10 minutes High
memory.free < 20480 KB for more than 10 minutes or memory.physAvailKb < 20480 KB for more than 10 minutes High
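
As an illustration, here is a minimal sketch of creating the first alarm in the table above ("CPUUtilization > 90% for more than 10 minutes") with boto3. The instance identifier and the SNS topic ARN that receives the alert are placeholders.

# Minimal sketch: high-priority CPU alarm on a single RDS instance.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # assumed region

cloudwatch.put_metric_alarm(
    AlarmName="rds-my-rds-instance-high-cpu",
    AlarmDescription="High priority: sustained CPU pressure on the RDS instance",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}],
    Statistic="Average",
    Period=300,                # 5-minute averages
    EvaluationPeriods=2,       # 2 x 5 minutes = 10 minutes
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dba-alerts"],  # placeholder SNS topic
)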

Dashboards should be created for a bird's-eye view of the overall health status across different metrics. While dashboards can include a variety of metrics, we recommend creating two types of dashboards:

  • One with widgets created from the high-priority metrics listed above. This is the dashboard a DBA should check first thing in the morning.
  • One with general health related metrics like
    • ReadIOPS, WriteIOPS
    • Latency (SELECT, INSERT, UPDATE, DELETE)
    • DatabaseConnections etc.

You can, of course, create alarms and dashboards in Amazon CloudWatch. If you are streaming CloudWatch Logs to Sumo Logic, you can build the dashboards and alarms there, which we will discuss in the next article in the series.

Monitoring RDS Logs

Every RDS instance generates its own log files. Database logs are invaluable for troubleshooting performance problems because they report on events, warnings and errors in the database engine itself, not the RDS service.

Different database engines have different types of log files:

  • MS SQL Server
    • Error Log
    • Agent Log
  • MySQL / MariaDB
    • General Log
    • Error Log
    • Slow query Log
  • PostgreSQL
    • Query and Error Log
  • Oracle
    • Alert Log

We recommend collecting logs from your RDS instance and parsing them for important events like:

  • Inaccessible or offline databases
  • Blocking queries
  • Slow or long-running queries
  • Startup and shutdown of daemons and any failures
  • Deadlocks
  • Repeated unsuccessful login attempts
  • High replication latency
  • Instance failovers
  • Instance shutdowns
  • Database maintenance plans, indexing, vacuums, and statistics updates

RDS log files can be accessed in a number of ways:

  • RDS console
  • RDS CLI commands
  • REST API
  • Database-specific commands
  • CloudWatch Logs (depending on the engine)

The image below shows a number of SQL Server logs available from the RDS console:

From the console, it's possible to select a log and view a static copy (View), watch it "live" (Watch), or download it (Download). The downloaded copy can then be viewed with any text editor. Some database engines, like MySQL, MariaDB, or Aurora MySQL, can send their logs to CloudWatch. Here is an example of a MariaDB error log file in CloudWatch Logs:

We recommend using an automated method for monitoring RDS logs. If your RDS fleet is using only one type of database engine and streams logs to CloudWatch, the process is straightforward.

For a heterogeneous RDS fleet, we recommend creating a custom application to download all logs locally and upload them to a specific S3 location. Such an application can be built around RDS CLI commands or API calls like describe-db-log-files and download-db-log-file-portion (a minimal sketch follows).
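
A minimal sketch of that approach, using the boto3 equivalents of the two commands just mentioned, could look like the following. The instance identifier and S3 bucket name are placeholders.

# Minimal sketch: copy every log file of one RDS instance to S3.
import boto3

rds = boto3.client("rds", region_name="us-east-1")     # assumed region
s3 = boto3.client("s3")

instance_id = "my-rds-instance"                        # placeholder
bucket = "my-rds-log-archive"                          # placeholder

for log in rds.describe_db_log_files(DBInstanceIdentifier=instance_id)["DescribeDBLogFiles"]:
    name = log["LogFileName"]
    marker, chunks, pending = "0", [], True
    while pending:                                     # page through the file in portions
        portion = rds.download_db_log_file_portion(
            DBInstanceIdentifier=instance_id,
            LogFileName=name,
            Marker=marker,
        )
        chunks.append(portion.get("LogFileData") or "")
        marker = portion["Marker"]
        pending = portion["AdditionalDataPending"]
    s3.put_object(
        Bucket=bucket,
        Key=f"rds-logs/{instance_id}/{name}",
        Body="".join(chunks).encode("utf-8"),
    )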

Sumo Logic can ingest your RDS logs directly from CloudWatch Logs, an S3 bucket, or through a script that will pull the logs from an API and forward over to Sumo Logic. This will be covered in the next entry in the series.

Monitoring RDS Events

Amazon RDS also keeps track of “events” related to database instances. These events are generated by the RDS service, not by the database engine. The events are generated in response to actions in the console, API calls or CLI commands. Some examples of RDS events are:

  • When a new instance is created.
  • When an existing instance is shut down or terminated.
  • When an instance's configuration changes.

The image below shows an example:

There are different categories of RDS events:

Event Source Types of events covered
Instances Events related to database instances.
Security Groups Events related to security groups attached to DB instances.
Parameter Groups Events related to RDS parameter groups.
Snapshots Events related to instance snapshots.
DB Clusters Events related to multi-instance clusters.
DB Cluster Snapshots Events related to multi-instance cluster snapshots.

There are many events across all these categories. The AWS documentation lists all the event IDs and their meanings in detail. We recommend monitoring the following types of events from that list:

  • Events related to the availability of RDS instances.
  • Events related to failures in RDS instances.
  • Events related to disk space problems in RDS instances.
  • Events related to failovers, startups, and shutdowns of RDS instances.
  • Events related to backups and snapshots of RDS instances.
  • Events related to any configuration changes in RDS instances.

To automate the process, DBAs should create subscriptions for the events they want to monitor (a minimal sketch follows). A subscription sends a notification to a recipient when an event occurs. For example, a manual instance snapshot event can be logged when a scheduled backup job starts. If the DBA does not receive the corresponding notification the next morning, they can investigate the issue and take remediation steps.
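
Here is a minimal sketch of creating such a subscription with boto3. The subscription name, SNS topic ARN, and instance identifier are placeholders, and the event categories shown are only a reasonable starting set; consult the RDS documentation for the full list.

# Minimal sketch: subscribe an SNS topic to critical instance events.
import boto3

rds = boto3.client("rds", region_name="us-east-1")     # assumed region

rds.create_event_subscription(
    SubscriptionName="rds-critical-instance-events",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:dba-alerts",   # placeholder SNS topic
    SourceType="db-instance",
    SourceIds=["my-rds-instance"],      # omit SourceIds to cover every instance
    EventCategories=[
        "availability",
        "failure",
        "failover",
        "low storage",
        "backup",
        "configuration change",
    ],
    Enabled=True,
)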

Monitoring CloudTrail

AWS CloudTrail is a service that, when enabled for an account, logs every API call made to any AWS service endpoint in that account. CloudTrail logs are saved in an S3 bucket and can be monitored as well. CloudTrail is mainly used for security auditing and compliance, and we recommend making it part of your RDS monitoring. Among other things, a CloudTrail log event shows:

  • The event name.
  • The event date and time.
  • The IP address from which the event originated.
  • The AWS user name or access key ID that generated the event.

CloudTrail logs are JSON documents with different fields. These documents can be consumed and parsed by a tool like Sumo Logic. We recommend monitoring CloudTrail for the following RDS API calls:

RDS API Event Why it should be monitored
DeleteDBInstance To audit when a DB instance is deleted.
DeleteDBCluster To audit when a DB cluster is deleted.
ModifyDBInstance To audit when a DB instance is modified.
ModifyDBCluster To audit when a DB cluster is modified.
FailoverDBCluster To audit when a cluster has a forced failover.
ModifyDBParameterGroup To audit when any parameter group is changed.
RebootDBInstance To audit when a DB instance is rebooted.
PromoteReadReplica To audit when a DB read replica is promoted to a DB instance.
StopDBInstance To audit when a DB instance is stopped.
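
If you want to check CloudTrail directly, here is a minimal sketch (boto3, assumed region) that scans the last 24 hours for a few of the API calls in the table above and prints who made them and when:

# Minimal sketch: look up recent RDS-related API calls in CloudTrail.
from datetime import datetime, timedelta, timezone
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")   # assumed region
now = datetime.now(timezone.utc)

for event_name in ("DeleteDBInstance", "ModifyDBInstance", "RebootDBInstance"):
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=now - timedelta(hours=24),
        EndTime=now,
    )["Events"]
    for event in events:
        print(event["EventTime"], event_name, "by", event.get("Username"))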

Monitoring with AWS Trusted Advisor

Finally, we will talk about AWS Trusted Advisor, a service that can analyze an account's resources for best practices in security, performance, fault tolerance, and cost. Although Trusted Advisor is not strictly a performance monitoring tool, we recommend periodically running it for the following types of audits:

  • Idle instances. Identifying such instances can save costs.
  • Security groups allowing unrestricted access. This can identify vulnerable instances.
  • Instances with no automatic backup enabled. This can identify recoverability issues.
  • Instance snapshots marked as public. This can expose data to unauthorized access.

Conclusion

Hopefully, this article has given you some solid ground for monitoring RDS. We recommend designing your monitoring solution before implementing anything. This can involve:

  • Creating a list of RDS instances to be monitored. For example, low-usage, non-production instances may not need monitoring.
  • Creating a list of common metrics to monitor across all instances.
  • Assessing monitoring prerequisites (e.g. S3 buckets, IAM roles and privileges, SNS topics, etc.).

We also mentioned Sumo Logic in a few places. Be on the lookout for the next entry in the series that will show how Sumo Logic can help make RDS monitoring a very simple matter.
