Monitoring Kubernetes: Best Practices and Methods

Kubernetes is a powerful tool that greatly simplifies the process of deploying complex containerized applications at scale. Yet, like any powerful software platform, Kubernetes must be monitored effectively in order to do its job well. Without the right Kubernetes monitoring tools and procedures in place, teams risk sub-optimal performance or even complete failure of their Kubernetes-based workloads.

To help get started building and implementing a Kubernetes monitoring strategy, this page breaks down the different parts of Kubernetes and explains how to monitor each. It also discusses key Kubernetes monitoring metrics and offers tips on best practices that will help you get the most out of your Kubernetes monitoring strategy.

Introduction to Kubernetes

Kubernetes is composed of many different parts. To monitor Kubernetes successfully, you first need to understand what each of these parts does and how they fit together to form the complete Kubernetes platform.

Anatomy of Kubernetes

Here's a breakdown of the various pieces that form Kubernetes:

  • Nodes: Individual servers running within a Kubernetes environment. Nodes provide the basic host infrastructure for Kubernetes. They can be physical or virtual machines.
  • Clusters: A cluster is a set of servers that host a Kubernetes workload. Each Kubernetes cluster must contain at least one so-called master node (which runs the services that glue the rest of the cluster together). The rest of the cluster consists of worker nodes, which provide additional infrastructure resources. Typically, there is one cluster per Kubernetes installation, although it's possible to have multiple clusters within the same installation using the federation model.
  • Namespaces: Namespaces allow you to define which users (or groups) can access which resources within a cluster. By dividing your Kubernetes cluster into namespaces, you can securely share a single cluster among multiple users instead of having to create a separate cluster for each one.
  • Pods: Groups of containers that are deployed to host a specific application. A pod could consist of just one container, or it could include multiple containers (such as one that hosts an application frontend, and another that facilitates access to the application data).
  • ReplicaSet: A policy that you can optionally set to tell Kubernetes to maintain a certain number of copies of the same pod.
  • Deployment: A set of configuration options that tell Kubernetes how a given workload should run. Deployments include ReplicaSet specifications as well as other options. (See the sketch after this list for a Deployment defined in code.)
  • DaemonSets: A configuration policy that can be set to tell Kubernetes to run a copy of a pod on all nodes, or only on certain nodes. Normally, Kubernetes makes determinations about pod placement automatically, but you can use a DaemonSet to govern this behavior.
  • Services: Groups of pods that host copies of the same workload. By mapping different workloads to each other based on Services instead of individual pods, you ensure that your workloads keep communicating with each other even if one pod instance is replaced by another.
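To make these abstractions concrete, here is a minimal sketch, using the official Kubernetes Python client, of a Deployment whose ReplicaSet specification asks for three copies of a single-container pod. The name, labels and image are illustrative assumptions, not values prescribed by Kubernetes:

```python
# A minimal sketch using the official `kubernetes` Python client.
# The name, labels and image below are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig pointing at a reachable cluster

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,  # the ReplicaSet spec: keep three copies of this pod running
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="frontend", image="nginx:1.25")]
            ),
        ),
    ),
)

# Equivalent to applying a Deployment manifest in the default namespace.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

If the cluster later has fewer than three healthy copies of this pod (because a node failed, for example), the Deployment's ReplicaSet creates replacements automatically.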

Infrastructure Layer

Kubernetes can run on many different types of infrastructure. It works with physical servers or virtual servers running in an on-premises data center. Kubernetes can also be hosted in the cloud. You can even run it on a PC or laptop using a specialized distribution like MicroK8s or K3s, which are designed to let you set up Kubernetes environments on a local workstation for testing or learning purposes. (You would not want to host a production workload this way.)

Kubernetes is primarily a Linux technology, and master nodes have to run on Linux-based servers. However, worker nodes can be Windows or Linux systems.

Node Components

Each node in a Kubernetes cluster runs a few key pieces of software, which allow nodes to communicate with each other and host containerized applications:

  • Kubelet: An agent that allows worker nodes to communicate with the master node. The agent lets the master check which worker nodes are running, and it lets nodes send and receive data.
  • Kube Proxy: This component provides the basis for Kubernetes Services, enabling reliable and consistent network communication between pods.
  • Container runtime: The software that executes individual containers. Docker is the most widely known container runtime, but Kubernetes also works with others, such as containerd and CRI-O.

Kubernetes Control Plane

Beyond the software running on individual nodes, there are a few other essential software components in a Kubernetes installation. Typically, these run on the master node (a quick health-check sketch follows this list):

  • API Server: This server exposes the Kubernetes API, which lets users interact with the Kubernetes cluster.
  • Etcd: A key-value store that houses all of the data the Kubernetes cluster needs to run.
  • Scheduler: This service watches for new pods and assigns them to nodes.
  • Controller Manager: Software that continuously monitors Kubernetes configurations and checks them against the actual state of the cluster. When these items do not match, the Controller takes steps to correct them. In this way, Kubernetes enables a declarative approach to software management in which users create configurations that specify how the software should behave, and the software conforms to those configurations automatically.
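As a quick illustration of interacting with the Control Plane, the sketch below asks the API Server to report the health of components such as the Scheduler, the Controller Manager and etcd. Note that the ComponentStatus API used here is deprecated in newer Kubernetes releases, so treat this as an illustrative example rather than a recommended monitoring method:

```python
# Illustrative sketch: query control-plane component health via the API Server.
# Caveat: the ComponentStatus API is deprecated in newer Kubernetes releases.
from kubernetes import client, config

config.load_kube_config()

for cs in client.CoreV1Api().list_component_status().items:
    conditions = cs.conditions or []
    healthy = any(c.type == "Healthy" and c.status == "True" for c in conditions)
    print(f"{cs.metadata.name}: {'Healthy' if healthy else 'UNHEALTHY'}")
```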

Benefits of Kubernetes

Kubernetes offers a range of benefits. The most obvious is that it automates complex tasks that human admins would otherwise have to perform by hand, such as starting and stopping containers or deciding which nodes to host containers on. In a large-scale environment composed of dozens of containers and servers, an automation tool like Kubernetes is the only way to manage the environment effectively.

But Kubernetes's features go beyond simple orchestration. As noted above, it also allows admins to take a declarative approach to software management: admins simply specify what they want to happen, and Kubernetes figures out how to make it happen on its own. This approach is simpler and less prone to oversights than a so-called imperative approach, in which admins must specify manually how each component of the software environment should operate.

Kubernetes also offers some security benefits. It provides features, including role-based access control and pod security settings, that allow admins to build security protections into workloads running on a Kubernetes cluster. This does not mean that Kubernetes is a security tool, or that Kubernetes alone delivers full security; it doesn't, and admins should take extra steps to protect against vulnerabilities that Kubernetes can't address (such as malicious code inside containers, or vulnerabilities on host operating systems). Still, the security features that Kubernetes offers provide advantages that would not exist if you attempted to orchestrate your containers manually.

Finally, it is arguably an advantage that Kubernetes is a fully open source platform, and as such is compatible with a wide range of host environments and third-party tools. The risk of being locked into one vendor is low if you use Kubernetes. Even if you adopt a Kubernetes distribution that is tied to a particular vendor (like Red Hat's OpenShift, or Rancher), you can migrate to a different Kubernetes distribution if you choose, taking most of your existing configuration with you.

How to Monitor Kubernetes

Now that we know what goes into Kubernetes, let's discuss how to monitor all of the pieces. To simplify your Kubernetes monitoring strategy, it's helpful to break monitoring operations down into parts, each focused on a different "layer" of your Kubernetes environment or part of the overall workload: clusters, pods, applications and the end-user experience.

Monitoring Kubernetes Clusters

The highest-level component of Kubernetes is the cluster. As noted above, in most cases each Kubernetes installation consists of a single cluster. Thus, by monitoring the cluster, you gain an across-the-board view of the overall health of all of the nodes, pods and applications that it contains. (If you use federation to maintain multiple clusters, you'll have to monitor each cluster separately, but the same approach applies to each.)

Specific areas to monitor at the cluster level include:

  • Cluster usage: How much of your cluster infrastructure is currently in use? Cluster usage lets you know when it's time to add more nodes so that you don't run out of resources to power your workloads. Or, if your cluster is significantly under-utilized, tracking cluster usage will tell you when it's time to scale down so that you're not paying for more infrastructure than you need.
  • Node consumption: You should also track the load on each node. If some nodes are experiencing much more usage than others, you may want to rebalance the distribution of workloads using DaemonSets.
  • Failed pods: Pods are destroyed naturally as part of normal operations. But if a pod that you think should be running is not active anywhere on your cluster, that's an important issue to investigate. (The sketch after this list shows one way to spot such pods.)
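As a starting point, here is a minimal sketch (using the official Kubernetes Python client, and assuming a working kubeconfig) that surfaces two of the cluster-level signals above: nodes that are not reporting Ready, and pods that have failed or are stuck Pending:

```python
# Minimal cluster-level check: not-Ready nodes and Failed/Pending pods.
# Assumes a kubeconfig pointing at a reachable cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Flag nodes whose kubelet is not reporting a Ready condition of "True".
for node in v1.list_node().items:
    conditions = node.status.conditions or []
    ready = next((c.status for c in conditions if c.type == "Ready"), "Unknown")
    if ready != "True":
        print(f"node {node.metadata.name} is not Ready (Ready={ready})")

# Flag pods that have failed outright or are stuck waiting to be scheduled.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.status.phase in ("Failed", "Pending"):
        print(f"pod {pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")
```

A real deployment would feed these signals into an alerting pipeline rather than printing them, but the underlying API calls are the same.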

Monitoring Kubernetes Pods

While cluster monitoring provides a high-level overview of your Kubernetes environment, you should also collect monitoring data from individual pods. This data will provide much deeper insight into the health of pods (and the workloads that they host) than you can glean by simply identifying whether or not pods are running at the cluster level.

When monitoring pods, you'll want to focus on:

  • Pod deployment patterns: Monitoring how pods are being deployed – which nodes they are running on and how resources are distributed to them – helps identify bottlenecks or misconfigurations that could compromise the high availability of pods.
  • Total pod instances: You want enough instances of a pod to ensure high availability, but not so many that you waste hosting resources running more pod instances than you need.
  • Expected vs. actual pod instances: You should also monitor how many instances of each pod are actually running, and compare that to how many you expected to be running. If these numbers frequently differ, it could be a sign that your ReplicaSets are misconfigured and/or that your cluster does not have enough resources to achieve the desired number of pod instances. (A minimal sketch of this check follows the list.)
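A minimal sketch of the expected-vs.-actual check, again assuming kubeconfig access to the cluster:

```python
# Compare desired replicas against actually-ready replicas for each Deployment.
from kubernetes import client, config

config.load_kube_config()

for dep in client.AppsV1Api().list_deployment_for_all_namespaces().items:
    desired = dep.spec.replicas or 0
    ready = dep.status.ready_replicas or 0  # None when no replicas are ready
    if ready < desired:
        print(f"{dep.metadata.namespace}/{dep.metadata.name}: "
              f"only {ready} of {desired} replicas ready")
```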

Monitoring Applications Running in Kubernetes

Although applications are not a specifically defined component within Kubernetes, hosting applications is ultimately why you use Kubernetes in the first place. Thus, it's important to monitor the applications being hosted on your cluster (or clusters) by checking:

  • Application availability: Are your apps actually up and responding?
  • Application responsiveness: How long do your apps take to respond to requests? Do they struggle to maintain acceptable rates of performance under heavy load?
  • Transaction traces: If your apps are experiencing performance or availability problems, transaction traces can help to troubleshoot them.
  • Errors: When application errors occur, it's important to get to the bottom of them before they impact end-users.

Problems revealed by application monitoring in Kubernetes could be the result of an issue with your Kubernetes environment, or they could be rooted in your application code itself. Either way, you'll want to identify the problem so you can correct it. (A basic availability-and-latency probe is sketched below.)
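As a simple example of availability and responsiveness checks, the sketch below probes an application endpoint and reports status and latency. The URL is a placeholder assumption; a production setup would run probes like this on a schedule, from multiple locations:

```python
# A basic synthetic probe: check availability and measure response time.
import time
import requests  # third-party: pip install requests

APP_URL = "https://example.com/healthz"  # placeholder; point at your app's endpoint

start = time.monotonic()
try:
    resp = requests.get(APP_URL, timeout=5)
    latency_ms = (time.monotonic() - start) * 1000
    print(f"status={resp.status_code} latency={latency_ms:.0f}ms")
    if resp.status_code >= 500:
        print("availability problem: server-side error")
except requests.RequestException as exc:
    print(f"availability problem: {exc}")
```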

Monitoring End-user Experience when Running Kubernetes

Like applications, the end-user experience is not a technical part of the Kubernetes platform. But delivering a positive experience for end-users – meaning the people who use the applications hosted on Kubernetes – is a critical consideration for any successful Kubernetes strategy.

Toward that end, it's important to collect data that provides insight into the performance and usability of applications. We discussed some of this above in the context of monitoring for application responsiveness, which provides insight into performance. When it comes to assessing usability, performing both synthetic and real-user monitoring is critical for understanding how users are interacting with Kubernetes workloads and whether there are any adjustments you can make within Kubernetes (such as enhancing your application frontend) to improve usability.

Monitoring Kubernetes in a Cloud Environment

In addition to the various Kubernetes monitoring considerations described above, which apply to any type of Kubernetes environment, there are some special factors to weigh when you're running Kubernetes in the cloud.

In a cloud-based installation, you'll also need to monitor for:

  • Cloud APIs: Your cloud provider has its own APIs, which your Kubernetes installation will use to request resources.
  • IAM events: Monitoring for IAM activity, like logins or permissions changes, is important for staying on top of security in a cloud-based environment.
  • Cost: Cloud bills can get large quickly. Performing cost monitoring will help ensure you are not overspending on your cloud-based Kubernetes service.

  • Network performance: In the cloud, the network is often the biggest performance bottleneck for your applications. Monitoring the cloud network to ensure that it is moving data as quickly as you need helps safeguard against network-related performance issues.

Metrics to Monitor in Kubernetes

Now that we know which types of monitoring to perform for Kubernetes, let's discuss the specific metrics to collect in order to achieve visibility into a Kubernetes installation.

Common Metrics

Common metrics are metrics you can collect from the code of Kubernetes itself, which is written in Go. This information helps you understand what is happening deep under the hood of Kubernetes.

| Metric | Components | Description |
| --- | --- | --- |
| go_gc_duration_seconds | All | A summary of GC invocation durations |
| go_threads | All | Number of OS threads created |
| go_goroutines | All | Number of goroutines that currently exist |
| etcd_helper_cache_hit_count | API Server, Controller Manager | Counter of etcd helper cache hits |
| etcd_helper_cache_miss_count | API Server, Controller Manager | Counter of etcd helper cache misses |
| etcd_request_cache_add_latencies_summary | API Server, Controller Manager | Latency in microseconds of adding an object to the etcd cache |
| etcd_request_cache_get_latencies_summary | API Server, Controller Manager | Latency in microseconds of getting an object from the etcd cache |
| etcd_request_latencies_summary | API Server, Controller Manager | Etcd request latency summary in microseconds for each operation and object type |
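All of these metrics are exposed in the Prometheus text format on each component's /metrics endpoint. As a sketch of how to read them, the example below scrapes the API Server through `kubectl proxy` (the localhost URL is an assumption based on the proxy's default port) and parses the result with the `prometheus_client` library:

```python
# Scrape and parse Kubernetes component metrics (Prometheus text format).
# Assumes `kubectl proxy` is running locally on its default port (8001).
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8001/metrics"  # API Server metrics via kubectl proxy

text = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    if family.name in ("go_goroutines", "go_threads"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```

The same scrape-and-parse pattern works for every metrics table in the sections that follow; only the endpoint and the metric names change.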

API Server Metrics

Since APIs serve as the glue that binds the Kubernetes frontend together, API metrics are crucial for achieving visibility into the API Server – and, by extension, into your entire frontend.

| Metric | Description |
| --- | --- |
| apiserver_request_count | Count of apiserver requests, broken out by verb, API resource, client, and HTTP response contentType and code |
| apiserver_request_latencies | Response latency distribution in microseconds for each verb, resource and subresource |

Etcd Metrics

Since Etcd stores all of the configuration data for Kubernetes itself, Etcd metrics deliver critical visibility into the state of your cluster.

| Metric | Description |
| --- | --- |
| etcd_server_has_leader | 1 if a leader exists, 0 if not |
| etcd_server_leader_changes_seen_total | Number of leader changes |
| etcd_server_proposals_applied_total | Number of proposals that have been applied |
| etcd_server_proposals_committed_total | Number of proposals that have been committed |
| etcd_server_proposals_pending | Number of proposals that are pending |
| etcd_server_proposals_failed_total | Number of proposals that have failed |
| etcd_debugging_mvcc_db_total_size_in_bytes | Actual size of database usage after a history compaction |
| etcd_disk_backend_commit_duration_seconds | Latency distributions of commits called by the backend |
| etcd_disk_wal_fsync_duration_seconds | Latency distributions of fsync calls made by the WAL (write-ahead log) |
| etcd_network_client_grpc_received_bytes_total | Total number of bytes received from gRPC clients |
| etcd_network_client_grpc_sent_bytes_total | Total number of bytes sent to gRPC clients |
| grpc_server_started_total | Total number of RPCs started on the server |
| grpc_server_handled_total | Total number of RPCs handled (completed) on the server |
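For instance, following the scrape-and-parse pattern shown earlier, a simple watchdog could alert whenever an etcd member reports having no leader. The endpoint below is a placeholder for wherever your etcd metrics are reachable:

```python
# Alert if an etcd member reports having no leader.
import requests
from prometheus_client.parser import text_string_to_metric_families

ETCD_METRICS_URL = "http://localhost:2379/metrics"  # placeholder etcd endpoint

text = requests.get(ETCD_METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    if family.name == "etcd_server_has_leader":
        for sample in family.samples:
            if sample.value == 0:
                print("ALERT: etcd member has no leader")
```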

Scheduler Metrics

Monitoring latency in the Scheduler helps identify delays that may arise and prevent Kubernetes from deploying pods smoothly.

| Metric | Description |
| --- | --- |
| scheduler_e2e_scheduling_latency_microseconds | The end-to-end scheduling latency, which is the sum of the scheduling algorithm latency and the binding latency |

Controller Manager Metrics

Watching the requests that the Controller makes to external APIs helps ensure that workloads can be orchestrated successfully, especially in cloud-based Kubernetes deployments.

| Metric | Description |
| --- | --- |
| cloudprovider_*_api_request_duration_seconds | The latency of cloud provider API calls |
| cloudprovider_*_api_request_errors | Count of cloud provider API request errors |

Kube-State-Metrics

Kube-State-Metrics is an optional Kubernetes add-on that listens to the Kubernetes API and generates metrics about the state of the objects it tracks. These metrics cover a range of resources; the following are among the most valuable.

| Metric | Description |
| --- | --- |
| kube_pod_status_phase | The current phase of the pod |
| kube_pod_container_resource_limits_cpu_cores | Limit on CPU cores that can be used by the container |
| kube_pod_container_resource_limits_memory_bytes | Limit on the amount of memory that can be used by the container |
| kube_pod_container_resource_requests_cpu_cores | The number of CPU cores requested by the container |
| kube_pod_container_resource_requests_memory_bytes | The number of memory bytes requested by the container |
| kube_pod_container_status_ready | 1 if the container is ready, 0 if it is in a not-ready state |
| kube_pod_container_status_restarts_total | Total number of restarts of the container |
| kube_pod_container_status_terminated_reason | The reason that the container is in a terminated state |
| kube_pod_container_status_waiting | Whether the container is in a waiting state |
| kube_daemonset_status_desired_number_scheduled | The number of nodes that should be running the pod |
| kube_daemonset_status_number_unavailable | The number of nodes that should be running the pod but are not able to |
| kube_deployment_spec_replicas | The number of desired pod replicas for the Deployment |
| kube_deployment_status_replicas_unavailable | The number of unavailable replicas per Deployment |
| kube_node_spec_unschedulable | Whether a node can schedule new pods |
| kube_node_status_capacity_cpu_cores | The total CPU resources available on the node |
| kube_node_status_capacity_memory_bytes | The total memory resources available on the node |
| kube_node_status_capacity_pods | The number of pods the node can schedule |
| kube_node_status_condition | The current status of the node |
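As a sketch of putting these metrics to work, the example below scrapes a kube-state-metrics endpoint and flags Deployments with unavailable replicas. The in-cluster service address shown is a conventional default, so treat it as an assumption; adjust it or port-forward for your environment:

```python
# Flag Deployments that kube-state-metrics reports as having unavailable replicas.
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed conventional in-cluster address; adjust or port-forward as needed.
KSM_URL = "http://kube-state-metrics.kube-system.svc:8080/metrics"

text = requests.get(KSM_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    if family.name == "kube_deployment_status_replicas_unavailable":
        for sample in family.samples:
            if sample.value > 0:
                ns = sample.labels["namespace"]
                name = sample.labels["deployment"]
                print(f"{ns}/{name}: {int(sample.value)} replicas unavailable")
```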

Kubelet Metrics

Monitoring the Kubelet agent helps ensure that the Control Plane can communicate effectively with each of the nodes that Kubelet runs on. Beyond the common Go runtime metrics described above, Kubelet exposes some internal metrics about its actions that are worth tracking as well.

| Metric | Description |
| --- | --- |
| kubelet_running_container_count | The number of containers currently running |
| kubelet_runtime_operations | The cumulative number of runtime operations, broken out by operation type |
| kubelet_runtime_operations_latency_microseconds | The latency of each operation type, in microseconds |

Node Metrics

Monitoring standard metrics from the operating systems that power Kubernetes nodes provides insight into the health of each node. Common node metrics to monitor include CPU load, memory consumption, filesystem activity and usage, and network activity.

Container Metrics

While metrics from Kubernetes can provide insight into many parts of your workload, you should also home in on individual containers to monitor their resource consumption. cAdvisor, which analyzes resource usage inside containers, is helpful for this purpose. (A sketch of reading cAdvisor data through the kubelet appears below.)
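One way to get at cAdvisor's data is through the kubelet, which republishes it on the /metrics/cadvisor endpoint. The sketch below reads per-container memory usage through `kubectl proxy`; the node name is a placeholder, and the exact label names can vary between Kubernetes versions:

```python
# Read per-container memory usage from cAdvisor data via the kubelet.
# Assumes `kubectl proxy` on its default port; "node-1" is a placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

URL = "http://localhost:8001/api/v1/nodes/node-1/proxy/metrics/cadvisor"

text = requests.get(URL, timeout=5).text
for family in text_string_to_metric_families(text):
    if family.name == "container_memory_working_set_bytes":
        for sample in family.samples:
            container = sample.labels.get("container", "")
            pod = sample.labels.get("pod", "")
            if container:  # skip pod- and cgroup-level aggregate series
                print(f"{pod}/{container}: {sample.value / 1024**2:.1f} MiB")
```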

Log Data

When you need to investigate an issue revealed by metrics, logs are invaluable for diving deeper, because they capture information that goes beyond the metrics themselves. Kubernetes offers a range of logging facilities for most of its components, and applications themselves also typically generate log data. (A minimal sketch of pulling pod logs through the API follows.)
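Pod logs are retrievable programmatically through the Kubernetes API, which is what `kubectl logs` does under the hood. A minimal sketch, where the pod name and namespace are placeholders:

```python
# Fetch the last 50 log lines from a pod, equivalent to `kubectl logs --tail=50`.
from kubernetes import client, config

config.load_kube_config()

logs = client.CoreV1Api().read_namespaced_pod_log(
    name="web-0", namespace="default", tail_lines=50  # placeholder pod/namespace
)
print(logs)
```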

Monitoring Kubernetes-Powered Apps with Sumo Logic

With the Sumo Logic Kubernetes App, collecting and analyzing monitoring data from across your Kubernetes environment is simple. Sumo Logic does the hard work of collecting metrics from Kubernetes’s many different parts, then aggregates them and provides you with data analytics tools for making sense of all of the data.

Just as Kubernetes makes it practical to manage complex containerized applications at scale, Sumo Logic makes it possible to monitor Kubernetes itself – a task that would be all but impossible without Sumo Logic’s ability to streamline the monitoring and analytics process.