
April 9, 2018 By Ben Newton

Monitoring AWS Elastic Load Balancing with CloudWatch

Quick Refresher – What is AWS Elastic Load Balancing?

A key part of any modern application is the ability to spread the load of user requests across multiple resources, which makes it much easier to scale as traffic rises and falls over the day and the week. Amazon Web Services’ answer to load balancing in the cloud is the Elastic Load Balancing (AWS ELB) service, which comes in three flavors: Classic, Application, and Network Load Balancers. AWS ELB integrates seamlessly with Amazon’s other cloud services, automatically spinning up new ELB instances to meet periods of high demand and scaling them back during off-peak hours, so you get the most out of your IT budget while still providing a great experience to your users. AWS also lets you monitor your ELB configuration through AWS CloudWatch, with detailed metrics about the requests made to your load balancers. There is a wealth of data in the metrics generated by ELB, setup is extremely simple, and, best of all, the metrics are included with the service.

Understanding AWS CloudWatch metrics for AWS ELB

First, you need to understand the concept of a “Namespace”. For every service monitored by AWS CloudWatch, there is a Namespace dimension that tells you where the data is coming from. Each of the three ELB services has a corresponding namespace.

Load Balancer Type            Namespace
Classic Load Balancers        AWS/ELB
Application Load Balancers    AWS/ApplicationELB
Network Load Balancers        AWS/NetworkELB
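
If you want to see exactly which metrics a namespace exposes in your account, you can list them with the AWS SDK. The following is a minimal sketch using boto3 in Python; it assumes your AWS credentials are already configured, and the region and namespace are just example values.

# Minimal sketch: list the ELB metrics CloudWatch has collected in one namespace.
# Assumes AWS credentials are configured; the region is an example value.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Swap in AWS/ApplicationELB or AWS/NetworkELB for the other load balancer types.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AWS/ELB"):
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric["Dimensions"])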

One of the most important aspects to understand with CloudWatch metrics is the “dimensions”. Dimensions tell you the identity of what is being monitored: what it is and where it comes from. For this type of metric, there are two key dimensions:

Dimension           Description
AvailabilityZone    The Availability Zone the ELB instance is in
LoadBalancerName    The name of the ELB instance

Note: AWS automatically provides rollup metrics over dimensions as well. For example, if you see a measurement with no LoadBalancerName dimension but an Availability Zone (AZ) dimension, it is a rollup over all of the load balancers in that AZ.
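
To make dimensions concrete, here is a small boto3 sketch that filters a metric by the LoadBalancerName dimension. The load balancer name “my-classic-elb” is a placeholder, not a real resource.

# Sketch: find RequestCount metrics reported for one specific load balancer.
# "my-classic-elb" is a placeholder name; substitute one of your own.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.list_metrics(
    Namespace="AWS/ELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-classic-elb"}],
)
# Each returned metric lists its full set of dimensions, e.g. the AZ it came from.
for metric in response["Metrics"]:
    print(metric["Dimensions"])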

Another important part of the metrics is the “Statistic”. CloudWatch metrics are not raw measurements; they are aggregated into more digestible data volumes. So that the behavior of the underlying data is not lost, CloudWatch provides several statistics you can choose from depending on what you need:

Statistic     Description
Minimum       The minimum value over the reporting period (typically 1 min)
Maximum       The maximum value over the reporting period (typically 1 min)
Sum           The sum of all values over the reporting period (typically 1 min)
Average       The average value over the reporting period (typically 1 min)
SampleCount   The number of samples over the reporting period (typically 1 min)
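
Putting namespace, dimensions, and statistics together, a query for aggregated datapoints might look like the boto3 sketch below. The load balancer name and the one-hour window are placeholder values.

# Sketch: pull one hour of Latency datapoints at one-minute resolution.
# The load balancer name is a placeholder; the window and period are examples.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-classic-elb"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,  # one datapoint per minute, matching the typical reporting period
    Statistics=["Minimum", "Maximum", "Average", "SampleCount"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])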

What are the key metrics to watch?

There are a lot of metrics gathered by CloudWatch, but they fall into two main categories: metrics about the load balancer and metrics about the backend instances. We will show you the key ones to watch and which statistics are appropriate when analyzing each metric.

Key performance indicators for the load balancer

The key performance indicators (KPIs) help you understand how the ELB instances themselves are performing and how they are handling incoming requests, as opposed to how your backend instances are responding to the traffic.

Metric | What it Means and How to Use it | Statistics to Use
RequestCount | The number of requests that the load balancer, or group of load balancers, has received. This is the baseline metric for any kind of traffic analysis, particularly if you don’t have auto-scaling enabled. | Sum (other statistics aren’t useful)
SurgeQueueLength | The number of inbound requests waiting to be accepted and processed by a backend instance. This can tell you whether you need to scale out your backend resources. | Maximum is the most useful; Average and Minimum can be helpful in addition to Maximum.
SpilloverCount | The number of requests rejected because the surge queue is full. Any spillover means requests are being dropped. | Sum (other statistics aren’t useful)
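
As a rough illustration of pulling these KPIs programmatically, the boto3 sketch below requests each metric with the statistic suggested in the table above. The load balancer name and the 15-minute window are placeholder values.

# Sketch: check the three load balancer KPIs for one Classic Load Balancer.
# The load balancer name and the 15-minute window are placeholder values.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
dimensions = [{"Name": "LoadBalancerName", "Value": "my-classic-elb"}]
now = datetime.utcnow()

# Statistic choices follow the table: Sum for the counters, Maximum for the queue.
for metric_name, stat in [("RequestCount", "Sum"),
                          ("SurgeQueueLength", "Maximum"),
                          ("SpilloverCount", "Sum")]:
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=60,
        Statistics=[stat],
    )
    values = [point[stat] for point in response["Datapoints"]]
    print(metric_name, max(values) if values else "no data in window")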


Ben Newton

Ben is a veteran of the IT Operations market, with a two decade career across large and small companies like Loudcloud, BladeLogic, Northrop Grumman, EDS, and BMC. Ben got to do DevOps before DevOps was cool, working with government agencies and major commercial brands to be more agile and move faster. More recently, Ben spent 5 years in product management at Sumo Logic, and is now running product marketing for Operations Analytics at Sumo Logic. His latest project, Masters of Data, has let him combine his love of podcasts and music with his love of good conversations.

