

June 12, 2018 By Frank Reno

Monitoring Kubernetes: What to Monitor (Crash Course, Part 2)

In Part 1 of our three-part series on monitoring Kubernetes, we dove deep into the Kubernetes architecture to build a firm foundation in how Kubernetes works. Understanding what an application does and how it functions is critical to monitoring it effectively. Now it’s time to journey a bit deeper into each of those components and understand what you need to keep a close eye on.

What It Means to Monitor an Application

When monitoring any application, you always need to be able to answer two critical questions: What is happening? Why is it happening? Access to machine data from your applications ensures that you can quickly identify issues and take action to remedy them. This machine data comes in two forms: metrics and logs.

Time series metrics tell you what is happening, for example how much load is on the system or what the current resource consumption is. Metrics are like the indicators in your car, such as the check engine light that comes on when there is a problem. They provide insight into the current state of behavior and can be a warning sign of issues to come.

Logs tell you why something is happening. They provide a record of the events that are occurring in your application. They can be informative, providing context around the actions the app is taking, or they can be errors indicating why something is breaking. Logs are like the results of the diagnostic tool your mechanic plugs in to determine why the check engine light is on.


Machine Data From Kubernetes And What It Orchestrates

Like any application, Kubernetes has a comprehensive set of machine data to enable you to always answer the what and the why when it comes to monitoring it. You also need machine data for the applications running inside of Kubernetes. Let’s dive into the critical data you need to collect from every Kubernetes cluster.

Common Metrics

Kubernetes is written in Go and exposes some essential metrics about the Go runtime. These metrics are necessary to keep an eye on the state of your Go processes. There are also critical metrics related to etcd: multiple components interact with etcd, and keeping an eye on those interactions gives you insight into potential etcd issues. Below are some of the top Go runtime stats and common etcd metrics, exposed by most Kubernetes components.

go_gc_duration_seconds (All): A summary of the GC invocation durations.
go_threads (All): Number of OS threads created.
go_goroutines (All): Number of goroutines that currently exist.
etcd_helper_cache_hit_count (API Server, Controller Manager): Counter of etcd helper cache hits.
etcd_helper_cache_miss_count (API Server, Controller Manager): Counter of etcd helper cache misses.
etcd_request_cache_add_latencies_summary (API Server, Controller Manager): Latency in microseconds of adding an object to the etcd cache.
etcd_request_cache_get_latencies_summary (API Server, Controller Manager): Latency in microseconds of getting an object from the etcd cache.
etcd_request_latencies_summary (API Server, Controller Manager): Etcd request latency summary in microseconds for each operation and object type.
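To make this concrete, every sample these components serve is a single line in the Prometheus text exposition format, of the shape name{labels} value. The following sketch uses made-up sample values and skips the # HELP and # TYPE metadata lines; it is an illustration of parsing that format, not a full Prometheus client:

```python
import re

# Hypothetical sample of the exposition format served by a Kubernetes component.
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 85
go_threads 18
etcd_helper_cache_hit_count 12345
etcd_helper_cache_miss_count 67
"""

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                  r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_metrics(text):
    """Parse exposition-format text into a {metric_name: value} dict.

    Labeled samples are keyed by their full name{labels} string; comment
    lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE.match(line)
        if m:
            key = m.group("name")
            if m.group("labels"):
                key += "{" + m.group("labels") + "}"
            samples[key] = float(m.group("value"))
    return samples

metrics = parse_metrics(SAMPLE)
print(metrics["go_goroutines"])  # 85.0

# Once parsed, related counters combine into derived signals,
# e.g. the etcd helper cache hit ratio:
hit = metrics["etcd_helper_cache_hit_count"]
miss = metrics["etcd_helper_cache_miss_count"]
print(round(hit / (hit + miss), 3))  # 0.995
```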

Kubernetes Control Plane

As we learned, the Kubernetes Control Plane is the engine that powers Kubernetes. It consists of multiple parts working together to orchestrate your containerized applications. Each piece serves a specific function and exposes its own set of metrics to monitor the health of that component. To effectively monitor the Control Plane, visibility into each component’s health and state is critical.

API Server

The API Server provides the front end for the Kubernetes cluster and is the central point with which all components interact. The following are the top metrics you need for clear visibility into the state of the API Server.

apiserver_request_count: Count of apiserver requests, broken out for each verb, API resource, client, and HTTP response contentType and code.
apiserver_request_latencies: Response latency distribution in microseconds for each verb, resource and subresource.
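Both of these apiserver metrics are cumulative, so a single sample is rarely meaningful on its own; what you typically chart is the rate of change between scrapes. A minimal sketch of that conversion, with hypothetical numbers:

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Convert two samples of a cumulative counter (such as
    apiserver_request_count) into a per-second rate.

    A drop in value means the process restarted and the counter reset,
    so in that case we fall back to treating the current value as the
    increase since the reset.
    """
    increase = curr_value - prev_value
    if increase < 0:  # counter reset, e.g. on apiserver restart
        increase = curr_value
    return increase / interval_seconds

# Hypothetical samples taken 30 seconds apart.
print(counter_rate(10_000, 10_600, 30))  # 20.0 requests/sec
```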


Etcd

Etcd is the backend for Kubernetes: a consistent and highly available key-value store where all the data representing the state of the Kubernetes cluster resides. The following are some of the top metrics to watch in etcd.

etcd_server_has_leader: 1 if a leader exists, 0 if not.
etcd_server_leader_changes_seen_total: Number of leader changes.
etcd_server_proposals_applied_total: Number of proposals that have been applied.
etcd_server_proposals_committed_total: Number of proposals that have been committed.
etcd_server_proposals_pending: Number of proposals that are pending.
etcd_server_proposals_failed_total: Number of proposals that have failed.
etcd_debugging_mvcc_db_total_size_in_bytes: Actual size of database usage after a history compaction.
etcd_disk_backend_commit_duration_seconds: Latency distribution of commits called by the backend.
etcd_disk_wal_fsync_duration_seconds: Latency distribution of fsync calls made by the WAL.
etcd_network_client_grpc_received_bytes_total: Total number of bytes received from gRPC clients.
etcd_network_client_grpc_sent_bytes_total: Total number of bytes sent to gRPC clients.
grpc_server_started_total: Total number of gRPC calls started on the server.
grpc_server_handled_total: Total number of gRPC calls handled on the server.
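As an illustration of how these metrics combine into health checks, the sketch below evaluates a few common alert conditions over a dict of sampled values. The threshold numbers are illustrative choices, not official etcd recommendations:

```python
def etcd_alerts(metrics):
    """Return alert strings derived from a dict of etcd metric samples.

    The metric names mirror the table above; the cutoff values are
    illustrative, not official recommendations.
    """
    alerts = []
    if metrics.get("etcd_server_has_leader", 1) == 0:
        alerts.append("etcd has no leader: cluster cannot commit writes")
    if metrics.get("etcd_server_leader_changes_seen_total", 0) > 3:
        alerts.append("frequent leader changes: check network/disk latency")
    failed = metrics.get("etcd_server_proposals_failed_total", 0)
    committed = metrics.get("etcd_server_proposals_committed_total", 0)
    if committed and failed / committed > 0.01:
        alerts.append("high proposal failure ratio")
    if metrics.get("etcd_server_proposals_pending", 0) > 10:
        alerts.append("proposal backlog: etcd is falling behind")
    return alerts

# A hypothetical unhealthy sample: no leader and a proposal backlog.
print(etcd_alerts({"etcd_server_has_leader": 0,
                   "etcd_server_proposals_pending": 25}))
```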


Scheduler

The Scheduler watches the Kubernetes API for newly created pods and determines which node should run them. It makes this decision based on the data available to it, including collective resource availability as well as the resource requirements of each pod. Monitoring scheduling latency ensures you have visibility into any delays the Scheduler is facing.

scheduler_e2e_scheduling_latency_microseconds: The end-to-end scheduling latency, which is the sum of the scheduling algorithm latency and the binding latency.

Controller Manager

The Controller Manager is a daemon that embeds all the various control loops which run to ensure the desired state of your cluster is met. It watches the API Server and takes action based on the difference between the current state and the desired state. It’s important to keep an eye on the requests it makes to your cloud provider to ensure the Controller Manager can successfully orchestrate. Currently, these metrics are available for AWS, GCE, and OpenStack.

cloudprovider_*_api_request_duration_seconds: The latency of cloud provider API calls.
cloudprovider_*_api_request_errors: Cloud provider API request errors.


Kube-State-Metrics

Kube-State-Metrics is a Kubernetes add-on that provides insight into the state of Kubernetes. It watches the Kubernetes API and generates metrics so you know what is currently running. Metrics are generated for just about every Kubernetes resource, including pods, deployments, daemonsets, and nodes. Numerous metrics are available; below are some of the key ones.

kube_pod_status_phase: The current phase of the pod.
kube_pod_container_resource_limits_cpu_cores: Limit on the CPU cores that can be used by the container.
kube_pod_container_resource_limits_memory_bytes: Limit on the amount of memory that can be used by the container.
kube_pod_container_resource_requests_cpu_cores: The number of CPU cores requested by the container.
kube_pod_container_resource_requests_memory_bytes: The number of memory bytes requested by the container.
kube_pod_container_status_ready: 1 if the container is ready, 0 if it is in a not-ready state.
kube_pod_container_status_restarts_total: Total number of restarts of the container.
kube_pod_container_status_terminated_reason: The reason the container is in a terminated state.
kube_pod_container_status_waiting: The reason the container is in a waiting state.
kube_daemonset_status_desired_number_scheduled: The number of nodes that should be running the pod.
kube_daemonset_status_number_unavailable: The number of nodes that should be running the pod but are not able to.
kube_deployment_spec_replicas: The number of desired pod replicas for the Deployment.
kube_deployment_status_replicas_unavailable: The number of unavailable replicas per Deployment.
kube_node_spec_unschedulable: Whether a node can schedule new pods or not.
kube_node_status_capacity_cpu_cores: The total CPU resources available on the node.
kube_node_status_capacity_memory_bytes: The total memory resources available on the node.
kube_node_status_capacity_pods: The number of pods the node can schedule.
kube_node_status_condition: The current status of the node.
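To show how these state metrics translate into actionable checks, the sketch below flags workloads that look unhealthy from a handful of kube-state-metrics samples. The sample data and thresholds are invented for illustration:

```python
def unhealthy_workloads(samples):
    """Flag pods, deployments, and nodes that look unhealthy.

    Takes kube-state-metrics samples as (metric_name, labels_dict, value)
    tuples; the restart threshold is an illustrative choice.
    """
    problems = []
    for name, labels, value in samples:
        if name == "kube_pod_container_status_restarts_total" and value >= 5:
            problems.append(f"pod {labels['pod']}: {int(value)} restarts")
        if name == "kube_deployment_status_replicas_unavailable" and value > 0:
            problems.append(
                f"deployment {labels['deployment']}: "
                f"{int(value)} unavailable replicas")
        if name == "kube_node_spec_unschedulable" and value == 1:
            problems.append(f"node {labels['node']}: unschedulable")
    return problems

samples = [  # hypothetical scrape results
    ("kube_pod_container_status_restarts_total", {"pod": "api-7d9f"}, 8),
    ("kube_deployment_status_replicas_unavailable", {"deployment": "web"}, 2),
    ("kube_node_spec_unschedulable", {"node": "node-1"}, 0),
]
print(unhealthy_workloads(samples))
```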

Node Components

As we learned in Part 1 of our journey into Kubernetes, the Nodes of a Kubernetes cluster are made up of multiple parts, and as such you have numerous pieces to monitor.


Kubelet

Keeping a close eye on the Kubelet ensures that the Control Plane can always communicate with the node it runs on. In addition to the common Go runtime metrics, the Kubelet exposes some internals about its actions that are good to track.

kubelet_running_container_count: The number of containers currently running.
kubelet_runtime_operations: The cumulative number of runtime operations, broken out by operation type.
kubelet_runtime_operations_latency_microseconds: The latency of each operation by type, in microseconds.

Node Metrics

Visibility into the standard host metrics of a node ensures you can monitor the health of each node in your cluster, avoiding any downtime resulting from an issue with a particular node. You need visibility into all aspects of the node, including CPU and memory consumption, system load, filesystem activity and network activity.

Container Metrics

Monitoring all of the Kubernetes metrics is just one piece of the puzzle. It is imperative that you also have visibility into the containerized applications that Kubernetes is orchestrating. At a minimum, you need access to the resource consumption of those containers. The Kubelet obtains container metrics from cAdvisor, a tool that analyzes the resource usage of containers and makes it available. These include the standard resource metrics: CPU, memory, filesystem and network usage.

As we can see, there are many vital metrics inside your Kubernetes cluster. These metrics ensure you can always answer what is happening, not only in Kubernetes and its components, but also in the applications running inside it.


Logs

Logs are how we answer why something is happening. They provide information about what the code is doing and the actions it is taking. Kubernetes produces a wealth of logs for each of its components, giving you insight into the decisions it is making. All of your containerized workloads also generate logs of their own. Access to these logs ensures you have comprehensive visibility to monitor and troubleshoot your applications.

How Do I Collect This Data?

Now that we understand what machine data is available to us, how do we get to this data? The good news is that Kubernetes makes most of this data readily available; you just need the right tool to gather and view it.

Collecting Logs

As containers run inside of Kubernetes, their log files are written to the node the container is running on. Every node in your cluster holds the logs from every container that runs on it. We developed an open-source FluentD plugin that runs on every node in your cluster as a DaemonSet. The plugin is responsible for reading the logs and securely sending them to Sumo Logic, and it enriches the logs with valuable metadata from Kubernetes.
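For a sense of what gets read on the node, container runtimes such as Docker write each log line as a small JSON record (commonly surfaced under /var/log/containers/). The exact layout varies by runtime, so treat this sample record as illustrative:

```python
import json

# A made-up container log line in the Docker JSON-file format:
# one record per line, with the message, stream, and timestamp.
raw = ('{"log":"GET /healthz 200\\n",'
       '"stream":"stdout",'
       '"time":"2018-06-12T14:03:07.123Z"}')

record = json.loads(raw)
print(record["stream"], record["log"].strip())  # stdout GET /healthz 200
```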

When you create objects in Kubernetes, you can assign custom key-value pairs, called labels, to each of those objects. These labels help you organize and track additional information about each object. For example, you might have a label that represents the application name, the environment the pod is running in, or perhaps which team owns the resource. These labels are entirely flexible and can be defined as you need. Our FluentD plugin ensures those labels are captured along with the logs, giving you continuity between the resources you have created and the log files they produce.
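For example, a pod carrying such labels might be declared like this (all names below are invented for illustration):

```yaml
# A hypothetical pod spec: the metadata.labels block holds the custom
# key-value pairs described above, which the plugin captures with each log.
apiVersion: v1
kind: Pod
metadata:
  name: checkout-service
  labels:
    app: checkout
    environment: production
    team: payments
spec:
  containers:
    - name: checkout
      image: example/checkout:1.4.2
```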

The plugin can be used on any Kubernetes cluster, whether a managed service like Amazon EKS or a cluster you run entirely on your own. Using it is as simple as deploying a DaemonSet to Kubernetes. We provide a default configuration that works out of the box with nearly any Kubernetes cluster, plus a rich set of configuration options so you can fine-tune it to your needs.

Collecting Metrics

Every component of Kubernetes exposes its metrics in the Prometheus format. The processes behind those components serve the metrics on an HTTP URL that you can access. For example, the Kubernetes API Server serves its metrics on https://$API_HOST:443/metrics. You simply need to scrape these URLs to obtain the metrics.
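The scraping step itself is just an HTTP GET. The self-contained sketch below stands in for a real component by serving a made-up metrics payload from a local toy endpoint, then fetches it the way any scraper would; against a real cluster you would hit the component's actual /metrics URL with appropriate authentication:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up payload standing in for what a component serves on /metrics.
SAMPLE = (b"# HELP go_goroutines Number of goroutines that currently exist.\n"
          b"# TYPE go_goroutines gauge\n"
          b"go_goroutines 85\n")

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the sample payload on /metrics, as a component would.
        self.send_response(200 if self.path == "/metrics" else 404)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        if self.path == "/metrics":
            self.wfile.write(SAMPLE)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The actual "scrape": one HTTP GET against the metrics URL.
body = urllib.request.urlopen(
    f"http://127.0.0.1:{server.server_port}/metrics").read().decode()
server.shutdown()
print(body.splitlines()[-1])  # go_goroutines 85
```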

We developed a new tool specifically designed to ingest Prometheus-formatted metrics into Sumo Logic. It can run standalone as a script or as a container, and it can ingest data from any target that provides Prometheus-formatted metrics, including those from Kubernetes. Today, we are open-sourcing that tool for all of our customers to use to start ingesting Prometheus-formatted metrics.

The Sumo Logic Prometheus Scraper can be configured to point at multiple targets serving Prometheus metrics. It supports including or excluding metrics and gives you full control over the metadata you send to Sumo Logic. That means that for Kubernetes, you can capture the same valuable metadata on your metrics that you have on your logs. You can use this metadata when searching through your logs and your metrics, and use them together for a unified experience when navigating your machine data.

Next Steps

Great, now that we know how to get this data into Sumo Logic, let’s see these tools in action and start making this data work for us to gain deep insights into everything in our Kubernetes clusters.

In the third and final post in this series, we will go through the steps to use these tools to collect all of this data into Sumo Logic and show how Sumo Logic can help you monitor and troubleshoot everything in your Kubernetes Cluster.

Additional Resources

  • Read the press release on our latest product enhancements unveiled at DockerCon
  • Read the first blog in our three-part Kubernetes series to learn about the K8s anatomy
  • Sign up for Sumo Logic for free


Frank Reno

Principal Product Manager

Frank Reno is a Principal Product Manager at Sumo Logic, where he leads product for data collection. He also serves as Sumo Logic's Open Source Ambassador, co-leading all efforts around open source, and is an active contributor to Sumo Logic's open source solutions and the broader open source community.

