Kubernetes is a powerful tool that greatly simplifies the process of deploying complex containerized applications at scale. Yet, like any powerful software platform, Kubernetes must be monitored effectively in order to do its job well. Without the right Kubernetes monitoring tools and procedures in place, teams risk sub-optimal performance or even complete failure of their Kubernetes-based workloads.
To help get started building and implementing a Kubernetes monitoring strategy, this page breaks down the different parts of Kubernetes and explains how to monitor each. It also discusses key Kubernetes monitoring metrics and offers tips on best practices that will help you get the most out of your Kubernetes monitoring strategy.
Introduction to Kubernetes
Kubernetes is composed of many different parts. To monitor Kubernetes successfully, you first need to understand what each of these parts does and how they fit together to comprise the complete Kubernetes platform.
Anatomy of Kubernetes
Here's a breakdown of the various pieces that form Kubernetes:
- Nodes: Individual servers running within a Kubernetes environment. Nodes provide the basic host infrastructure for Kubernetes. They can be physical or virtual machines.
- Clusters: A cluster is a set of servers that host a Kubernetes workload. Each Kubernetes cluster must contain at least one so-called master node (which runs the services that glue the rest of the cluster together). The rest of the cluster consists of worker nodes, which provide additional infrastructure resources. Typically, there is one cluster per Kubernetes installation, although it's possible to have multiple clusters within the same installation using the federation model.
- Namespaces: Namespaces allow you to define which users (or groups) can access which resources within a cluster. By dividing a cluster into namespaces, multiple users can securely share the same cluster instead of requiring a separate cluster for each user.
- Pods: Groups of containers that are deployed to host a specific application. A pod could consist of just one container, or it could include multiple containers (such as one that hosts an application frontend, and another that facilitates access to the application data).
- ReplicaSet: A policy that you can optionally set to tell Kubernetes to maintain a certain number of copies of the same pod.
- Deployment: A set of configuration options that tell Kubernetes how a given workload should run. Deployments include ReplicaSet specifications as well as other options.
- DaemonSets: A configuration policy that can be set to tell Kubernetes to run a copy of a pod on all nodes, or only on certain nodes. Normally, Kubernetes makes determinations about pod placement automatically, but you can use a DaemonSet to govern this behavior.
- Services: Groups of pods that host copies of the same workload. By mapping different workloads to each other based on Services instead of individual pods, you ensure that your workloads keep communicating with each other even if one pod instance is replaced by another.
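To make several of these concepts concrete, here is a minimal, hypothetical Deployment manifest paired with a Service: the Deployment asks Kubernetes to maintain three replicas of a pod (which it manages via a ReplicaSet created behind the scenes), and the Service groups those pods by label. All names and the container image are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # hypothetical name
  namespace: default
spec:
  replicas: 3                # Kubernetes maintains 3 pod copies via a ReplicaSet
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: frontend
        image: nginx:1.25    # placeholder image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
spec:
  selector:
    app: example-app         # routes traffic to any pod carrying this label
  ports:
  - port: 80
```

Because the Service selects pods by label rather than by name, traffic keeps flowing even as individual pod instances are replaced.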
Kubernetes can run on many different types of infrastructure. It works with physical servers or virtual servers running in an on-premises data center. Kubernetes can also be hosted in the cloud. You can even run it on a PC or laptop using a specialized distribution like MicroK8s or K3s, which are designed to let you set up Kubernetes environments on a local workstation for testing or learning purposes. (You would not want to host a production workload this way.)
Kubernetes is primarily a Linux technology, and master nodes have to run on Linux-based servers. However, worker nodes can be Windows or Linux systems.
Each node in a Kubernetes cluster runs a few key pieces of software, which serve to allow nodes to communicate with each other and host containerized applications:
- Kubelet: An agent that allows worker nodes to communicate with the master node. The agent lets the master check to see which worker nodes are running, and it lets nodes send and receive data.
- Kube Proxy: This component provides the basis for Kubernetes Services, enabling reliable and consistent communication between pods.
- Container runtime software: The software that executes individual containers. Docker is the most widely known container runtime, but Kubernetes can work with a variety of other runtimes, such as rkt and runc.
Kubernetes Control Plane
Beyond software running on individual nodes, there are a few other essential software components in a Kubernetes installation. Typically, these run on the master node:
- API Server: This server exposes the Kubernetes API, which lets users interact with the Kubernetes cluster.
- Etcd: A key-value store that houses all of the data the Kubernetes cluster needs to run.
- Scheduler: This service watches for new pods and assigns them to nodes.
- Controller Manager: Software that continuously monitors Kubernetes configurations and checks them against the actual state of the cluster. When these items do not match, the Controller takes steps to correct them. In this way, Kubernetes enables a declarative approach to software management in which users create configurations that specify how the software should behave, and the software conforms to those configurations automatically.
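The Controller Manager's reconciliation pattern can be sketched in a few lines. This is an illustration of the declarative idea only, not the actual Kubernetes source: desired state comes from user configuration, actual state is observed from the cluster, and the controller computes corrective actions to close the gap.

```python
# A minimal sketch of a declarative reconcile loop, in the spirit of
# the Controller Manager. All names and data here are illustrative.

def reconcile(desired: dict, actual: dict) -> list:
    """Compare desired state to actual state and return corrective actions."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(f"create {want - have} replica(s) of {name}")
        elif have > want:
            actions.append(f"delete {have - want} replica(s) of {name}")
    return actions

# Desired state comes from user-supplied configuration; actual state
# is what is currently observed on the cluster.
desired = {"web": 3, "worker": 2}
actual = {"web": 2, "worker": 3}
print(reconcile(desired, actual))
```

The real control plane runs loops like this continuously, so the cluster converges on the declared configuration without manual intervention.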
Benefits of Kubernetes
Kubernetes offers a range of benefits. The most obvious is that it automates complex tasks that human admins would otherwise have to perform by hand, such as starting and stopping containers or deciding which nodes to host containers on. In a large-scale environment composed of dozens of containers and servers, an automation tool like Kubernetes is the only way to manage the environment effectively.
But Kubernetes's features go beyond simple orchestration. As noted above, it also allows admins to take a declarative approach to software management. This means that admins can simply specify what they want to happen, and Kubernetes figures out how to make it happen on its own. This approach is simpler and less prone to oversights than a so-called imperative approach, in which admins would have to specify manually how each component of the software environment should operate.
Kubernetes also offers some security benefits. It provides certain features, including role-based access control and pod security settings, that allow admins to build in security protections for workloads running on a Kubernetes cluster. This does not mean that Kubernetes is a security tool, or that Kubernetes alone delivers full security; it doesn't, and admins should take extra steps to protect against vulnerabilities that Kubernetes can't address (such as malicious code inside containers, or vulnerabilities on host operating systems). Still, the security features that Kubernetes offers provide some advantages that would not exist if you attempted to orchestrate your containers manually.
Finally, it is arguably an advantage that Kubernetes is a fully open-source platform, and as such is compatible with a wide range of host environments and third-party tools. The risk of being locked into one vendor is low if you use Kubernetes. Even if you adopt a Kubernetes distribution that is tied to a particular vendor (like OpenShift, which is a Red Hat platform; or Rancher), you can always migrate to a different Kubernetes distribution if you choose while taking most of your existing configuration with you.
Monitoring Kubernetes
Now that we know what goes into Kubernetes, let's discuss how to monitor all of the pieces. To simplify your Kubernetes monitoring strategy, it's helpful to break monitoring operations down into different parts, each focused on a different "layer" of your Kubernetes environment or part of the overall workload (clusters, pods, applications and the end-user experience).
Monitoring Kubernetes Clusters
The highest-level component of Kubernetes is the cluster. As noted above, in most cases each Kubernetes installation consists of only one cluster. Thus, by monitoring the cluster, you gain an across-the-board view of the overall health of all of the nodes, pods and applications that form your cluster. (If you use federation to maintain multiple clusters, you'll have to monitor each cluster separately, but the process is the same for each cluster.)
Specific areas to monitor at the cluster level include:
- Cluster usage: Which portion of your cluster infrastructure is currently in use? Cluster usage lets you know when it's time to add more nodes so that you don't run out of resources to power your workloads. Or, if your cluster is significantly under-utilized, tracking cluster usage will help you know it's time to scale down so that you're not paying for more infrastructure than you need.
- Node consumption: You should also track the load on each node. If some nodes are experiencing much more usage than others, you may want to rebalance the distribution of workloads using DaemonSets.
- Failed pods: Pods are destroyed naturally as part of normal operations. But if there is a pod that you think should be running but is not active anywhere on your cluster, that's an important issue to investigate.
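To make cluster usage and node consumption concrete, here is a small sketch that aggregates utilization from per-node figures. The numbers are made up; in practice they would come from your monitoring tooling (or a command like `kubectl top nodes`).

```python
# Aggregate cluster-level utilization from per-node usage/capacity
# figures, and flag uneven node consumption. All numbers are illustrative.

nodes = [
    # (name, cpu_used_cores, cpu_capacity_cores, mem_used_gib, mem_capacity_gib)
    ("node-1", 3.2, 4.0, 12.0, 16.0),
    ("node-2", 1.1, 4.0, 6.0, 16.0),
    ("node-3", 3.8, 4.0, 15.5, 16.0),
]

cpu_used = sum(n[1] for n in nodes)
cpu_cap = sum(n[2] for n in nodes)
mem_used = sum(n[3] for n in nodes)
mem_cap = sum(n[4] for n in nodes)

print(f"CPU utilization: {cpu_used / cpu_cap:.0%}")
print(f"Memory utilization: {mem_used / mem_cap:.0%}")

# Uneven node consumption is worth flagging separately, since the
# cluster-wide average can hide a single overloaded node.
for name, cu, cc, *_ in nodes:
    if cu / cc > 0.9:
        print(f"{name} is running hot ({cu / cc:.0%} CPU)")
```

Watching both the aggregate and the per-node figures covers the first two bullets above: the aggregate tells you when to scale the cluster, and the per-node spread tells you when to rebalance.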
Monitoring Kubernetes Pods
While cluster monitoring provides a high-level overview of your Kubernetes environment, you should also collect monitoring data from individual pods. This data will provide much deeper insight into the health of pods (and the workloads that they host) than you can glean by simply identifying whether or not pods are running at the cluster level.
When monitoring pods, you'll want to focus on:
- Pod deployment patterns: Monitoring how pods are being deployed – which nodes they are running on and how resources are distributed to them – helps identify bottlenecks or misconfigurations that could compromise the high availability of pods.
- Total pod instances: You want enough instances of a pod to ensure high availability, but not so many that you waste hosting resources running more pod instances than you need.
- Expected vs. actual pod instances: You should also monitor how many instances for each pod are actually running, and compare it to how many you expected to be running. If you find that these numbers are frequently different, it could be a sign that your ReplicaSets are misconfigured and/or that your cluster does not have enough resources to achieve the desired state regarding pod instances.
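The expected-vs.-actual comparison can be sketched as a simple diff between desired and ready replica counts. The deployment data below is made up; in practice these numbers would come from the Kubernetes API or kube-state-metrics.

```python
# Compare desired replica counts against ready replicas and report
# drift. The deployment data here is illustrative.

deployments = {
    # name: (desired_replicas, ready_replicas)
    "frontend": (3, 3),
    "api":      (5, 3),
    "worker":   (2, 0),
}

def find_drift(deps: dict) -> dict:
    """Return deployments whose ready count differs from the desired count."""
    return {name: (want, ready)
            for name, (want, ready) in deps.items()
            if ready != want}

for name, (want, ready) in find_drift(deployments).items():
    print(f"{name}: expected {want} replicas, only {ready} ready")
```

Persistent drift in this comparison is exactly the signal described above: either a ReplicaSet is misconfigured, or the cluster lacks the resources to reach the desired state.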
Monitoring Applications Running in Kubernetes
Although applications are not a specifically defined component within Kubernetes, hosting applications is ultimately the reason you use Kubernetes in the first place. Thus, it's important to monitor the applications being hosted on your cluster (or clusters) by checking:
- Application availability: Are your apps actually up and responding?
- Application responsiveness: How long do your apps take to respond to requests? Do they struggle to maintain acceptable rates of performance under heavy load?
- Transaction traces: If your apps are experiencing performance or availability problems, transaction traces can help to troubleshoot them.
- Errors: When application errors occur, it's important to get to the bottom of them before they impact end-users.
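Responsiveness data like this is usually summarized as averages and percentiles. Below is a minimal nearest-rank percentile sketch over made-up latency samples; real monitoring tools compute these for you, but the math is worth seeing once.

```python
# Summarize sampled request latencies (milliseconds) as an average
# and nearest-rank percentiles. The samples are illustrative.

latencies_ms = [12, 15, 11, 230, 14, 18, 13, 950, 16, 12]

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"avg={avg:.1f}ms p50={percentile(latencies_ms, 50)}ms "
      f"p95={percentile(latencies_ms, 95)}ms")
```

Note how the two slow outliers barely move the median but dominate the average and p95; this is why percentiles, not averages alone, are the standard way to judge responsiveness under load.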
Problems revealed by application monitoring in Kubernetes could be the result of an issue with your Kubernetes environment, or they could be rooted in your application code itself. Either way, you'll want to be sure to identify the problem so you can correct it.
Monitoring End-user Experience when Running Kubernetes
Like applications, the end-user experience is not a technical part of the Kubernetes platform. But delivering a positive experience for end-users – meaning the people who use the applications hosted on Kubernetes – is a critical consideration for any successful Kubernetes strategy.
Toward that end, it's important to collect data that provides insight into the performance and usability of applications. We discussed some of this above in the context of monitoring for application responsiveness, which provides insight into performance. When it comes to assessing usability, performing both synthetic and real-user monitoring is critical for understanding how users are interacting with Kubernetes workloads and whether there are any adjustments you can make within Kubernetes (such as enhancing your application frontend) to improve usability.
Monitoring Kubernetes in a Cloud Environment
In addition to the various Kubernetes monitoring considerations described above, which apply to any type of Kubernetes environment, there are some special factors to weigh when you're running Kubernetes in the cloud.
In a cloud-based installation, you'll also need to monitor for:
- Cloud APIs: Your cloud provider has its own APIs, which your Kubernetes installation will use to request resources.
- IAM events: Monitoring for IAM activity, like logins or permissions changes, is important for staying on top of security in a cloud-based environment.
- Cost: Cloud bills can get large quickly. Performing cost monitoring will help ensure you are not overspending on your cloud-based Kubernetes service.
- Network performance: In the cloud, the network is often the biggest performance bottleneck for your applications. Monitoring the cloud network to ensure that it is moving data as quickly as you need helps to safeguard against network-related performance issues.
Metrics to Monitor in Kubernetes
Now that we know which types of monitoring to perform for Kubernetes, let's discuss the specific metrics to collect in order to achieve visibility into a Kubernetes installation.
Common Metrics
Common metrics refer to metrics you can collect from the code of Kubernetes itself, which is written in Golang. This information helps you understand what is happening deep under the hood of Kubernetes.

| Metric description | Component(s) |
| --- | --- |
| A summary of the GC invocation durations | All components |
| Number of OS threads created | All components |
| Number of goroutines that currently exist | All components |
| Counter of etcd helper cache hits | API Server, Controller Manager |
| Counter of etcd helper cache misses | API Server, Controller Manager |
| Latency in microseconds of adding an object to the etcd cache | API Server, Controller Manager |
| Latency in microseconds of getting an object from the etcd cache | API Server, Controller Manager |
| Etcd request latency summary in microseconds for each operation and object type | API Server, Controller Manager |
API Server Metrics
Since APIs serve as the glue that binds the Kubernetes frontend together, API metrics are crucial for achieving visibility into the API Server – and, by extension, into your entire frontend.
- Count of apiserver requests, broken out for each verb, API resource, client, and HTTP response contentType and code
- Response latency distribution in microseconds for each verb, resource and subresource
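These metrics are exposed in the Prometheus text format at each component's `/metrics` endpoint. The sketch below parses a small, made-up sample of that format and sums request counts per verb; it assumes the `apiserver_request_total` metric name used by recent Kubernetes versions, and in practice the payload would come from something like `kubectl get --raw /metrics`.

```python
# Parse Prometheus text-format metrics and aggregate API server
# request counts per HTTP verb. The sample payload is illustrative.
import re
from collections import defaultdict

SAMPLE = """\
# HELP apiserver_request_total Counter of apiserver requests
apiserver_request_total{verb="GET",resource="pods",code="200"} 1042
apiserver_request_total{verb="GET",resource="nodes",code="200"} 310
apiserver_request_total{verb="POST",resource="pods",code="201"} 87
"""

def requests_per_verb(payload: str) -> dict:
    counts = defaultdict(float)
    for line in payload.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        m = re.match(r'(\w+)\{([^}]*)\}\s+([\d.]+)', line)
        if not m:
            continue
        name, labels, value = m.groups()
        if name != "apiserver_request_total":
            continue
        verb = dict(re.findall(r'(\w+)="([^"]*)"', labels)).get("verb")
        counts[verb] += float(value)
    return dict(counts)

print(requests_per_verb(SAMPLE))
```

A sudden shift in the per-verb mix (for example, a spike in writes) is often an early sign of a misbehaving controller or client hammering the API Server.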
Etcd Metrics
Since Etcd stores all of the configuration data for Kubernetes itself, Etcd metrics deliver critical visibility into the state of your cluster.

- 1 if a leader exists, 0 if not
- Number of leader changes
- Number of proposals that have been applied
- Number of proposals that have been committed
- Number of proposals that are pending
- Number of proposals that have failed
- Actual size of database usage after a history compaction
- Latency distributions of commit calls made by the backend
- Latency distributions of fsync calls made by the WAL
- Total number of bytes received by gRPC clients
- Total number of bytes sent by gRPC clients
- Total number of gRPCs started on the server
- Total number of gRPCs handled on the server
Scheduler Metrics
Monitoring latency in the Scheduler helps identify delays that could prevent Kubernetes from deploying pods smoothly.

- The end-to-end scheduling latency, which is the sum of the scheduling algorithm latency and the binding latency
Controller Manager Metrics
Watching the requests that the Controller makes to external APIs helps ensure that workloads can be orchestrated successfully, especially in cloud-based Kubernetes deployments.
- The latency of cloud provider API calls
- Cloud provider API request errors
Kube-State-Metrics
Kube-State-Metrics is an optional Kubernetes add-on that generates metrics from the Kubernetes API. These metrics cover a range of resources; the following are among the most valuable:

- The current phase of the pod
- The limit on CPU cores that can be used by the container
- The limit on the amount of memory that can be used by the container
- The number of CPU cores requested by the container
- The number of memory bytes requested by the container
- 1 if the container is ready, 0 if it is in a not-ready state
- The total number of restarts of the container
- The reason that the container is in a terminated state
- The reason that the container is in a waiting state
- The number of nodes that should be running the pod
- The number of nodes that should be running the pod but are not able to
- The number of desired pod replicas for the Deployment
- The number of unavailable replicas per Deployment
- Whether a node can schedule new pods or not
- The total CPU resources available on the node
- The total memory resources available on the node
- The number of pods the node can schedule
- The current status of the node
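As one example of putting kube-state-metrics data to work, the sketch below flags containers whose restart counts exceed a threshold. The data is made up; in practice the counts would come from the container restart counter that kube-state-metrics exposes.

```python
# Flag containers that are restarting too often, using
# kube-state-metrics-style restart counts. All data is illustrative.

RESTART_THRESHOLD = 5

restarts = {
    # (namespace, pod, container): total restarts
    ("prod", "api-7f9c", "app"):    12,
    ("prod", "web-5d2a", "nginx"):   0,
    ("dev",  "job-x1",   "runner"):  6,
}

def noisy_containers(counts: dict, threshold: int = RESTART_THRESHOLD) -> list:
    """Return (namespace, pod, container) keys whose restart count exceeds the threshold."""
    return [key for key, n in counts.items() if n > threshold]

for ns, pod, ctr in noisy_containers(restarts):
    print(f"{ns}/{pod}/{ctr} has restarted more than {RESTART_THRESHOLD} times")
```

A high restart count usually points at a crash-looping container, which ties back to the "failed pods" and "expected vs. actual instances" checks described earlier.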
Kubelet Metrics
Monitoring the Kubelet agent will help ensure that the Control Plane can communicate effectively with each of the nodes that Kubelet runs on. Beyond the common Golang runtime metrics described above, Kubelet exposes some internals about its actions that are good to track as well:

- The number of containers that are currently running
- The cumulative number of runtime operations, broken out by operation type
- The latency of each operation, by type, in microseconds
Node Metrics
Monitoring standard metrics from the operating systems that power Kubernetes nodes provides insight into the health of each node. Common node metrics to monitor include CPU load, memory consumption, filesystem activity and usage, and network activity.
Container Metrics
While metrics from Kubernetes can provide insight into many parts of your workload, you should also home in on individual containers to monitor their resource consumption. cAdvisor, which analyzes resource usage inside containers, is helpful for this purpose.
Logs
When you need to investigate an issue revealed by metrics, logs are invaluable for diving deeper by collecting information that goes beyond the metrics themselves. Kubernetes offers a range of logging facilities for most of its components. Applications themselves also typically generate log data.
Monitoring Kubernetes-Powered Apps with Sumo Logic
With the Sumo Logic Kubernetes App, collecting and analyzing monitoring data from across your Kubernetes environment is simple. Sumo Logic does the hard work of collecting metrics from Kubernetes’s many different parts, then aggregates them and provides you with data analytics tools for making sense of all of the data.
Just as Kubernetes makes it practical to manage complex containerized applications at scale, Sumo Logic makes it possible to monitor Kubernetes itself – a task that would be all but impossible without Sumo Logic’s ability to streamline the monitoring and analytics process.