Monitoring AWS Elastic Load Balancing (AWS ELB) with CloudWatch

Quick Refresher – What is AWS Elastic Load Balancing?

A key part of any modern application is the ability to spread the load of user requests across multiple resources, which makes it much easier to scale as traffic naturally rises and falls over the day and the week. Amazon Web Services' answer to load balancing in the cloud is the Elastic Load Balancing (AWS ELB) service, which comes in three flavors: Classic, Application, and Network Load Balancers. AWS ELB integrates seamlessly with Amazon's other cloud services, automatically spinning up new ELB instances without manual intervention to meet high-demand periods and scaling them back in off-peak hours, getting the most out of your IT budget while still providing a great experience to your users. AWS lets you monitor your ELB configuration through Amazon CloudWatch, with detailed metrics about the requests made to your load balancers. There is a wealth of data in these metrics, they are extremely simple to set up, and best of all, they are included with the service!

Understanding AWS CloudWatch metrics for AWS ELB

First, you need to understand the concept of a "Namespace". For every service monitored by AWS CloudWatch, there is a Namespace that tells you where the data is coming from. Each of the three ELB services has a corresponding namespace:

| Load Balancer Type | Namespace |
|---|---|
| Classic Load Balancers | AWS/ELB |
| Application Load Balancers | AWS/ApplicationELB |
| Network Load Balancers | AWS/NetworkELB |
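As a quick sketch of how these namespaces are used in practice, the helper below builds the parameters for CloudWatch's `list_metrics` API for each ELB type. The helper itself is illustrative; the commented-out call assumes boto3 is installed and AWS credentials are configured.

```python
# Sketch: enumerate the CloudWatch metrics available for each ELB namespace.
# The helper only builds request parameters; the actual API call (commented
# out) assumes boto3 is installed and AWS credentials are configured.

ELB_NAMESPACES = {
    "classic": "AWS/ELB",
    "application": "AWS/ApplicationELB",
    "network": "AWS/NetworkELB",
}

def list_metrics_params(lb_type):
    """Build kwargs for CloudWatch's list_metrics API for an ELB type."""
    return {"Namespace": ELB_NAMESPACES[lb_type]}

# import boto3
# cw = boto3.client("cloudwatch")
# for metric in cw.list_metrics(**list_metrics_params("classic"))["Metrics"]:
#     print(metric["MetricName"], metric["Dimensions"])
```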

One of the most important aspects to understand with CloudWatch metrics is the "dimensions". Dimensions tell you the identity of what is being monitored – what it is and where it is from. For ELB metrics, there are two key dimensions:

| Dimension | Description |
|---|---|
| AvailabilityZone | The Availability Zone the ELB instance is in |
| LoadBalancerName | The name of the ELB instance |

Note: AWS automatically provides rollup metrics over dimensions as well. For example, if you see a measurement with no LoadBalancerName dimension but with an Availability Zone (AZ), that is a rollup over all of the load balancers in that AZ.

Another part of each metric is the "Statistic". CloudWatch metrics are not raw measurements; they are aggregated up to more digestible data volumes. So that the behavior of the underlying data is not lost, CloudWatch provides several statistics you can choose from depending on what you need:

| Statistic | Description |
|---|---|
| Minimum | The minimum value over the reporting period (typically 1 min) |
| Maximum | The maximum value over the reporting period (typically 1 min) |
| Sum | The sum of all values over the reporting period (typically 1 min) |
| Average | The average value over the reporting period (typically 1 min) |
| SampleCount | The number of samples over the reporting period (typically 1 min) |
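Putting namespaces, dimensions, and statistics together, here is a hedged sketch of a `get_metric_statistics` query for per-load-balancer RequestCount. The load balancer name `my-elb` is a placeholder, and the API call itself is commented out because it requires boto3 and AWS credentials.

```python
# Sketch: query the Sum of RequestCount for one Classic Load Balancer
# over the last hour, at one-minute resolution. "my-elb" is a placeholder.
import datetime

def request_count_query(lb_name, minutes=60):
    """Build kwargs for CloudWatch's get_metric_statistics API."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/ELB",
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,             # the typical 1-minute reporting period
        "Statistics": ["Sum"],    # Sum is the useful statistic for counts
    }

# import boto3
# cw = boto3.client("cloudwatch")
# datapoints = cw.get_metric_statistics(**request_count_query("my-elb"))["Datapoints"]
```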

 

What are the key metrics to watch?

There are a lot of metrics gathered by CloudWatch, but we can divide them into two main categories: metrics about the load balancer, and metrics about the backend instances. We will show you the key ones to watch, and which statistics are appropriate when analyzing each metric.

Key performance indicators for the load balancer

The key performance indicators (KPIs) will help you understand how the actual ELB instances are performing and how they are interacting with the incoming requests, as opposed to how your backend instances may be responding to the traffic.

| Metric | What it Means and How to Use it | Statistics to Use |
|---|---|---|
| RequestCount | The number of requests that the load balancer, or group of load balancers, has received. This is the baseline metric for any kind of traffic analysis, particularly if you don't have auto-scaling enabled. | Sum (other statistics aren't useful) |
| SurgeQueueLength | The number of inbound requests waiting to be accepted and processed by a backend instance. This can tell you if you need to scale out your backend resources. | Maximum is the most useful, but Average and Minimum can be helpful in addition |
| SpilloverCount | The number of requests rejected because the surge queue is full. When the queue length reaches 1,024, clients receive 503 errors and requests are denied. This is obviously bad, and in a healthy environment this metric should be zero. | Sum (other statistics aren't useful) |
| HTTPCode_ELB_4XX | The number of 4XX status codes generated. These indicate a client/browser error (e.g. page doesn't exist). You can look in the ELB logs for more details about the client errors and take corrective action (usually a redirect for a common 404 Page Not Found error). | Sum (other statistics aren't useful) |
| HTTPCode_ELB_5XX | The number of 5XX status codes generated. These indicate a server error because the request couldn't be properly handled – either because there were no healthy backend instances or because the surge queue is full. A few key codes indicate different problems: 502 (Bad Gateway) – the backend instance responded, but the load balancer could not parse the response, because of a load balancer error or a malformed response; 503 (Service Unavailable) – returned either because the backend instances returned an error or because the surge queue is full (see above); 504 (Gateway Timeout) – the response time exceeded the ELB instance's idle timeout; look at scaling up or tuning your backend instances, and if you support slow operations (e.g. file uploads), consider increasing the idle timeout. | Sum (other statistics aren't useful) |
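One useful way to combine these counters is to derive an ELB-level 5XX error rate from the Sum statistics of HTTPCode_ELB_5XX and RequestCount. A minimal sketch, with made-up datapoints in the shape `get_metric_statistics` returns:

```python
# Sketch: compute a 5XX error rate from CloudWatch "Sum" datapoints.
# The datapoint dicts mirror the shape returned by get_metric_statistics.

def error_rate(elb_5xx_datapoints, request_datapoints):
    """Fraction of requests that resulted in an ELB-generated 5XX."""
    errors = sum(dp["Sum"] for dp in elb_5xx_datapoints)
    requests = sum(dp["Sum"] for dp in request_datapoints)
    return errors / requests if requests else 0.0

# Example with made-up datapoints:
fivexx = [{"Sum": 2.0}, {"Sum": 3.0}]
reqs = [{"Sum": 500.0}, {"Sum": 500.0}]
print(error_rate(fivexx, reqs))  # 0.005, i.e. 0.5% of requests failed
```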

 

Key performance indicators for the backend

These metrics will tell you how your backend instances are handling the incoming load, and whether you need to adjust capacity or fix issues.

| Metric | What it Means and How to Use it | Statistics to Use |
|---|---|---|
| HealthyHostCount and UnHealthyHostCount | The number of backend instances registered to your ELB instance that have passed the health checks (HealthyHostCount) and the number that have failed them (UnHealthyHostCount). When a backend instance fails a health check, it is flagged and taken out of rotation, most often because the instance exceeded the idle timeout. You can correlate this with the Latency and SurgeQueueLength metrics to balance the latency experienced by your users against backend capacity. Note: if you have cross-zone load balancing enabled, the number of healthy instances is calculated across all Availability Zones; otherwise, it is measured per Availability Zone. | Average and Maximum are the most useful |
| Latency | For an HTTP listener (web requests), this is the total time, in seconds, from the load balancer sending the request to a backend instance until the instance initiated a response. For a TCP listener, it is the time required to establish a connection to a backend instance. This metric tracks the latency created by your backend instances, not the load balancers themselves. It is a good indicator of your application performance as experienced by your users. High latency can indicate overloaded backend instances, a misconfigured idle timeout, or network issues. | Average is the most useful statistic, though Maximum can indicate issues to investigate. Minimum is not as useful |
| HTTPCode_Backend_2XX and HTTPCode_Backend_3XX | The number of successful (2XX) and redirected (3XX) requests. These are useful as a comparison to the error codes below. | Sum (other statistics aren't useful) |
| HTTPCode_Backend_4XX and HTTPCode_Backend_5XX | The number of client errors (4XX) and server errors (5XX). You will typically want to correlate these with the web access logs of your application to understand what the actual errors were. | Sum (other statistics aren't useful) |
| BackendConnectionErrors | The number of attempted but unsuccessful connections to backend instances. Since the load balancer may retry connections when there are errors, this number may exceed the request rate. It also includes errors from unsuccessful health checks. This metric is a secondary indicator of serious backend issues, in addition to the other measurements above. | Sum (other statistics aren't useful) |
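As an example of applying the Average statistic, the sketch below scans Latency datapoints (in the shape `get_metric_statistics` returns) and flags reporting periods above a threshold. The 0.5-second threshold is illustrative; in practice you would baseline it per application.

```python
# Sketch: find reporting periods where backend Latency (Average statistic)
# exceeded a threshold. Datapoints mirror get_metric_statistics output.

LATENCY_THRESHOLD_SECONDS = 0.5  # illustrative; baseline per application

def slow_periods(datapoints, threshold=LATENCY_THRESHOLD_SECONDS):
    """Return the timestamps whose Average latency exceeds the threshold."""
    return sorted(
        dp["Timestamp"] for dp in datapoints if dp["Average"] > threshold
    )

# Example with made-up datapoints:
points = [
    {"Timestamp": "2017-01-01T00:00:00Z", "Average": 0.12},
    {"Timestamp": "2017-01-01T00:01:00Z", "Average": 0.81},
]
print(slow_periods(points))  # ['2017-01-01T00:01:00Z']
```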

What KPIs should you monitor and alert on?

As with most systems, you do not want to alert on all ELB metrics, but rather focus on the few that are good indicators of customer facing issues that need to be resolved quickly.

As mentioned above, for front-end ELB performance you want to track whether your incoming request volume is outpacing your backend capacity. You can do this by monitoring the SurgeQueueLength and SpilloverCount metrics, which tell you whether user requests are being denied because of a lack of capacity or other serious issues. You should also watch the Latency metric, since it is a direct measurement of user experience. It usually makes sense to compare it to a historical baseline rather than a static value, since each application behaves differently.

For the backend, you should monitor UnHealthyHostCount and the 5XX metrics – HTTPCode_ELB_5XX and HTTPCode_Backend_5XX. If you correlate these with Latency and SurgeQueueLength, you can get ahead of issues and make sure that server problems are not affecting user experience.
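To wire these KPIs into alerts, you can create CloudWatch alarms on them. Below is a hedged sketch of building `put_metric_alarm` parameters for SpilloverCount and UnHealthyHostCount; the alarm names, thresholds, and SNS topic ARN are all placeholders, and the actual calls (commented out) assume boto3 and credentials.

```python
# Sketch: parameters for CloudWatch alarms on two key ELB KPIs.
# Alarm names, thresholds, and the SNS topic ARN are placeholders.

def elb_alarm_params(metric_name, lb_name, threshold, topic_arn):
    """Build kwargs for CloudWatch's put_metric_alarm API."""
    return {
        "AlarmName": f"{lb_name}-{metric_name}",
        "Namespace": "AWS/ELB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        # Counters use Sum; gauge-like host counts use Average
        "Statistic": "Sum" if metric_name == "SpilloverCount" else "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }

# import boto3
# cw = boto3.client("cloudwatch")
# cw.put_metric_alarm(**elb_alarm_params(
#     "SpilloverCount", "my-elb", 0, "arn:aws:sns:us-east-1:123456789012:ops"))
# cw.put_metric_alarm(**elb_alarm_params(
#     "UnHealthyHostCount", "my-elb", 0, "arn:aws:sns:us-east-1:123456789012:ops"))
```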

Together is better

As with any application performance monitoring scenario, it is essential to monitor the whole application stack, as well as the KPIs. In this guide you will find details on how to monitor other services, and how to approach monitoring your core application. By pulling all of this data together, you can track issues from the front-end experience down to the core back-end components, and resolve issues quickly.
