June 12, 2018 By Frank Reno

Kubernetes Monitoring: What to Monitor (Crash Course)

Anatomy of Kubernetes

I like to think about Kubernetes like a vehicle. You can use a vehicle to get you to places you want to go, and Kubernetes helps you get to places where you have better control of your containers. Just like a car, k8s consists of multiple layers of components working together to get you to those places.

Let’s dive into some of those individual components, or “car parts” if you will…

Kubernetes Abstractions

Kubernetes provides multiple abstractions that play an integral role in the orchestration of your containers. These abstractions are like the various components of a car that you use when driving such as the radio, the speedometer or the gas pedal.

  • Pods are the lowest level of compute in Kubernetes. A pod consists of one or more co-located and co-scheduled containers. They share the same network namespace, allowing them to communicate over localhost, and they can share storage volumes.
  • ReplicaSets ensure that the specified number of pod replicas are running. Should a pod crash, the ReplicaSet controller in the Controller Manager will replace it.
  • Deployments give you declarative management of your ReplicaSets. You describe your desired state, and the deployment handles creating the ReplicaSets.
  • DaemonSets ensure that all desired Nodes are running a copy of a Pod. When a new Node joins the cluster, the DaemonSet will ensure a copy of that Pod is running there.
  • Services provide reliable communication between pods. Since pods can come and go, relying on a pod's IP address is insufficient, and services solve this problem by giving a group of pods a stable address.
  • Namespaces are virtual clusters which are backed by the physical cluster. They provide a way to group logical components and define the resources that grouping can use.

Infrastructure Layer

At an infrastructure level, a Kubernetes cluster is made up of two types of components: the Master, and the Nodes. These may be virtual or physical machines as Kubernetes can run anywhere. The Master is the brains of the operation, overseeing the Nodes which are the collective resources the Master can use to run your containers. Think of the entire vehicle as a Kubernetes cluster. The Master is like the engine that powers the vehicle, and the wheels are like the Nodes that allow the engine to push it forward.

Node Components

Just as the tires of a vehicle are made up of multiple parts, so are the Nodes of a Kubernetes cluster. These components work together to provide the Kubernetes Control Plane with a resource it can use to schedule your containers.

  • Kubelet is an agent running on every node in the cluster that ensures the containers are running and is the mechanism by which the Master communicates with the Node.
  • Kube Proxy is the component that enables the Kubernetes Service abstraction, providing reliable and consistent communication between pods.
  • Container Runtime Software is responsible for running containers. While Docker is the most prevalent, there are several other supported runtimes such as rkt and runc.

Kubernetes Control Plane

The Master runs all the components that make up the Kubernetes Control Plane. The Control Plane components work together to orchestrate your containers. In the same way that an engine has a lot of pieces that work together to power your vehicle, the Kubernetes Control Plane is no different. Each piece is responsible for a specific area that powers k8s.

  • API Server is the component that exposes the Kubernetes API, which is the front-end for the Kubernetes cluster where all Control Plane components and users will interact.
  • Etcd is a consistent and highly-available key-value store where all Kubernetes cluster data resides.
  • Scheduler watches for new pods that need to be scheduled and identifies which Node they should run on.
  • Controller Manager runs multiple controllers responsible for handling pod replication, endpoint creation, taking action when a Node goes down and many other actions. Controllers are loops that continuously check the desired state and take action to ensure that Kubernetes is running in the desired state.
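
To make the idea of a control loop concrete, here is a minimal sketch in Python. It is not how the Controller Manager is implemented; the `Cluster` class and `reconcile` function are hypothetical stand-ins that only illustrate the pattern of comparing desired state with actual state and acting on the difference.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy in-memory cluster state (hypothetical, not a Kubernetes API)."""
    desired_replicas: int
    running_pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile(cluster):
    """Run one pass of the loop and return the actions it took."""
    actions = []
    # Too few pods: create replacements until the desired count is met.
    while len(cluster.running_pods) < cluster.desired_replicas:
        name = f"pod-{cluster.next_id}"
        cluster.next_id += 1
        cluster.running_pods.append(name)
        actions.append(("create", name))
    # Too many pods: remove the surplus.
    while len(cluster.running_pods) > cluster.desired_replicas:
        actions.append(("delete", cluster.running_pods.pop()))
    return actions


cluster = Cluster(desired_replicas=3)
print(reconcile(cluster))              # first pass creates the pods
cluster.running_pods.remove("pod-1")   # simulate a crashed pod
print(reconcile(cluster))              # next pass replaces it
```

A real controller runs this kind of pass continuously and reads its state from the API Server rather than from a local object.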

Putting these pieces together, we can start to see the architecture of Kubernetes:

Monitoring Kubernetes

Now that we know what’s under the hood of Kubernetes, we can see it is a complex machine with a lot of components operating together. This machine gives you the ability to manage, deploy and scale your containerized applications with ease. You might even be using a managed service such as Amazon EKS or Google Kubernetes Engine (GKE) to provide you with a highly available, secure, and managed control plane, letting you focus on your containerized applications. Whether you run the cluster yourself or use a managed service, the ability to monitor and troubleshoot your applications, and potentially the Kubernetes Control Plane, is critical to delivering reliable, fault-tolerant distributed applications running in containers.

So how do we monitor all of these pieces? In our next post in this series, we are going to go through each component and identify what is critical to monitor, and how you can use Sumo Logic’s machine data analytics platform to stay on top of what’s happening in your Kubernetes clusters.

What It Means to Monitor an Application with Kubernetes

When monitoring any application, you always need to be able to answer two critical questions: What is happening? Why is this happening? Access to machine data from your applications to address these questions ensures that you can quickly identify issues and take action to remedy them. This machine data comes in two forms: metrics and logs.

Time series metrics tell you what is happening, for example how much load is on the system or what the current resource consumption is. Metrics are like the indicators in your car, such as the check engine light that comes on when there is a problem. They give you insight into the current state of the system and can be a warning sign of issues to come.

Logs tell you why something is happening. They provide a record of the events that are occurring in your application. They can be informative, providing context around the actions the app is taking, or they can be errors indicating why something is breaking. Logs are like the results of the diagnostic tool your mechanic plugs in to determine why the check engine light is on.

Machine Data From Kubernetes And What It Orchestrates

Like any application, Kubernetes has a comprehensive set of machine data that enables you to always answer the what and the why when it comes to monitoring it. You also need machine data for the applications running inside of Kubernetes. Let’s dive into the critical data you need to collect from every Kubernetes cluster.

Common Metrics

Kubernetes is written in Go and exposes some essential metrics about the Go runtime. These metrics let you keep an eye on the state of your Go processes. There are also critical metrics related to etcd. Multiple components interact with etcd, and keeping an eye on those interactions gives you insight into potential etcd issues. Below are some of the top Go runtime stats and common etcd metrics to collect, which are exposed by most Kubernetes components.

| Metric | Components | Description |
| --- | --- | --- |
| go_gc_duration_seconds | All | A summary of the GC invocation durations. |
| go_threads | All | Number of OS threads created. |
| go_goroutines | All | Number of goroutines that currently exist. |
| etcd_helper_cache_hit_count | API Server, Controller Manager | Counter of etcd helper cache hits. |
| etcd_helper_cache_miss_count | API Server, Controller Manager | Counter of etcd helper cache misses. |
| etcd_request_cache_add_latencies_summary | API Server, Controller Manager | Latency in microseconds of adding an object to etcd cache. |
| etcd_request_cache_get_latencies_summary | API Server, Controller Manager | Latency in microseconds of getting an object from etcd cache. |
| etcd_request_latencies_summary | API Server, Controller Manager | Etcd request latency summary in microseconds for each operation and object type. |
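
The hit and miss counters above are cumulative, so they are most useful when compared across two scrapes. The sketch below shows one way to turn them into a hit ratio. The function name and the dictionary layout are assumptions made for illustration; only the two metric names come from the table.

```python
HIT = "etcd_helper_cache_hit_count"
MISS = "etcd_helper_cache_miss_count"


def cache_hit_ratio(previous, current):
    """Return the cache hit ratio between two scrapes, or None if unknown.

    `previous` and `current` map metric names to counter values taken from
    two scrapes of the same component.
    """
    hits = current[HIT] - previous[HIT]
    misses = current[MISS] - previous[MISS]
    # A negative delta means the counter was reset (for example, a restart).
    if hits < 0 or misses < 0:
        return None
    total = hits + misses
    if total == 0:
        return None  # no cache lookups happened in this interval
    return hits / total


earlier = {HIT: 100, MISS: 20}
later = {HIT: 190, MISS: 30}
print(cache_hit_ratio(earlier, later))
```

A falling ratio over time is a hint that more requests are going all the way to etcd.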

Kubernetes Control Plane

As we learned, the Kubernetes Control Plane is the engine that powers Kubernetes. It consists of multiple parts working together to orchestrate your containerized applications. Each piece serves a specific function and exposes its own set of metrics to monitor the health of that component. To effectively monitor the Control Plane, visibility into each component’s health and state is critical.

API Server

The API Server provides the front-end for the Kubernetes cluster and is the central point with which all components interact. The following table presents the top metrics you need for clear visibility into the state of the API Server.

| Metric | Description |
| --- | --- |
| apiserver_request_count | Count of apiserver requests broken out for each verb, API resource, client, and HTTP response contentType and code. |
| apiserver_request_latencies | Response latency distribution in microseconds for each verb, resource and subresource. |
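
Because apiserver_request_count is broken out by response code, one simple health signal is the share of requests that ended in a server error. The following is a sketch under assumptions: the samples are already parsed into `(labels, value)` pairs, and `server_error_fraction` is a hypothetical helper, not part of any Kubernetes or Sumo Logic tool.

```python
def server_error_fraction(samples):
    """Return the fraction of requests answered with a 5xx code.

    `samples` is a list of (labels, value) pairs for apiserver_request_count,
    where `labels` is a dict that includes the HTTP response "code".
    """
    total = sum(value for _, value in samples)
    if total == 0:
        return 0.0
    errors = sum(
        value for labels, value in samples
        if labels.get("code", "").startswith("5")
    )
    return errors / total


samples = [
    ({"verb": "GET", "resource": "pods", "code": "200"}, 90),
    ({"verb": "GET", "resource": "pods", "code": "500"}, 5),
    ({"verb": "POST", "resource": "pods", "code": "503"}, 5),
]
print(server_error_fraction(samples))
```

Since the counter is cumulative, in practice you would apply this to the difference between two scrapes rather than to the raw totals.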

Etcd

Etcd is the backend for Kubernetes. It is a consistent and highly-available key-value store where all the data representing the state of the Kubernetes cluster resides. The following are some of the top metrics to watch in Etcd.

| Metric | Description |
| --- | --- |
| etcd_server_has_leader | 1 if a leader exists, 0 if not. |
| etcd_server_leader_changes_seen_total | Number of leader changes. |
| etcd_server_proposals_applied_total | Number of proposals that have been applied. |
| etcd_server_proposals_committed_total | Number of proposals that have been committed. |
| etcd_server_proposals_pending | Number of proposals that are pending. |
| etcd_server_proposals_failed_total | Number of proposals that have failed. |
| etcd_debugging_mvcc_db_total_size_in_bytes | Actual size of database usage after a history compaction. |
| etcd_disk_backend_commit_duration_seconds | Latency distributions of commit called by the backend. |
| etcd_disk_wal_fsync_duration_seconds | Latency distributions of fsync called by wal. |
| etcd_network_client_grpc_received_bytes_total | Total number of bytes received from gRPC clients. |
| etcd_network_client_grpc_sent_bytes_total | Total number of bytes sent to gRPC clients. |
| grpc_server_started_total | Total number of RPCs started on the server. |
| grpc_server_handled_total | Total number of RPCs handled on the server. |
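
As a rough illustration of how a few of these metrics can be combined, the sketch below flags an etcd member that has no leader, that is falling behind on applying committed proposals, or that has a backlog of pending proposals. The function name and both thresholds are assumptions chosen for the example, not recommended values.

```python
def etcd_warnings(metrics, max_apply_lag=50, max_pending=10):
    """Return a list of warnings for one scrape of etcd metrics.

    `metrics` maps metric names to their current values. The thresholds
    are illustrative assumptions, not tuning advice.
    """
    warnings = []
    if metrics.get("etcd_server_has_leader", 0) != 1:
        warnings.append("no leader")
    # Committed proposals that have not been applied yet.
    lag = (metrics.get("etcd_server_proposals_committed_total", 0)
           - metrics.get("etcd_server_proposals_applied_total", 0))
    if lag > max_apply_lag:
        warnings.append(f"apply lag of {lag} proposals")
    pending = metrics.get("etcd_server_proposals_pending", 0)
    if pending > max_pending:
        warnings.append(f"{pending} proposals pending")
    return warnings


print(etcd_warnings({
    "etcd_server_has_leader": 1,
    "etcd_server_proposals_committed_total": 1000,
    "etcd_server_proposals_applied_total": 998,
    "etcd_server_proposals_pending": 0,
}))
```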

Scheduler

Scheduler watches the Kubernetes API for newly created pods and determines which node should run those pods. It makes this decision based on the data it has available including the collective resource availability as well as the resource requirements of the pod. Monitoring scheduling latency ensures you have visibility into any delays the Scheduler is facing.

| Metric | Description |
| --- | --- |
| scheduler_e2e_scheduling_latency_microseconds | The end-to-end scheduling latency, which is the sum of the scheduling algorithm latency and the binding latency. |
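
To illustrate the kind of decision the Scheduler makes, here is a deliberately simplified sketch: it keeps only the nodes whose free resources cover the pod's requests, then picks the one with the most free CPU. The real Scheduler weighs many more factors; `pick_node`, the dictionary layout, and the tie-breaking rule are all assumptions made for this example.

```python
def pick_node(pod_request, nodes):
    """Return the name of a node that can fit the pod, or None.

    `pod_request` holds the pod's "cpu" (cores) and "memory_mib" requests.
    `nodes` maps node names to their free "cpu" and "memory_mib".
    """
    feasible = [
        name for name, free in nodes.items()
        if free["cpu"] >= pod_request["cpu"]
        and free["memory_mib"] >= pod_request["memory_mib"]
    ]
    if not feasible:
        return None  # the pod stays pending until resources free up
    # Prefer the node with the most free CPU; sort first so ties are stable.
    return max(sorted(feasible), key=lambda name: nodes[name]["cpu"])


nodes = {
    "node-a": {"cpu": 0.5, "memory_mib": 2048},
    "node-b": {"cpu": 2.0, "memory_mib": 1024},
    "node-c": {"cpu": 4.0, "memory_mib": 8192},
}
print(pick_node({"cpu": 1.0, "memory_mib": 2048}, nodes))
```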

Controller Manager

Controller Manager is a daemon which embeds all the various control loops that run to ensure the desired state of your cluster is met. It watches the API server and takes action depending on the current state versus the desired state. It’s important to keep an eye on the requests it is making to your Cloud provider to ensure the controller manager can successfully orchestrate. Currently, these metrics are available for AWS, GCE, and OpenStack.

| Metric | Description |
| --- | --- |
| cloudprovider_*_api_request_duration_seconds | The latency of the cloud provider API call. |
| cloudprovider_*_api_request_errors | Cloud provider API request errors. |

Kube-State-Metrics

Kube-State-Metrics is a Kubernetes add-on that provides insights into the state of Kubernetes. It watches the Kubernetes API and generates various metrics, so you know what is currently running. Metrics are generated for just about every Kubernetes resource including pods, deployments, daemonsets, and nodes. Numerous metrics are available, capturing various information and below are some of the key ones.

| Metric | Description |
| --- | --- |
| kube_pod_status_phase | The current phase of the pod. |
| kube_pod_container_resource_limits_cpu_cores | Limit on CPU cores that can be used by the container. |
| kube_pod_container_resource_limits_memory_bytes | Limit on the amount of memory that can be used by the container. |
| kube_pod_container_resource_requests_cpu_cores | The number of requested cores by a container. |
| kube_pod_container_resource_requests_memory_bytes | The number of requested memory bytes by a container. |
| kube_pod_container_status_ready | Will be 1 if the container is ready, and 0 if it is in a not ready state. |
| kube_pod_container_status_restarts_total | Total number of restarts of the container. |
| kube_pod_container_status_terminated_reason | The reason that the container is in a terminated state. |
| kube_pod_container_status_waiting | Whether the container is currently in a waiting state. |
| kube_daemonset_status_desired_number_scheduled | The number of nodes that should be running the pod. |
| kube_daemonset_status_number_unavailable | The number of nodes that should be running the pod, but are not able to. |
| kube_deployment_spec_replicas | The number of desired pod replicas for the Deployment. |
| kube_deployment_status_replicas_unavailable | The number of unavailable replicas per Deployment. |
| kube_node_spec_unschedulable | Whether a node can schedule new pods or not. |
| kube_node_status_capacity_cpu_cores | The total CPU resources available on the node. |
| kube_node_status_capacity_memory_bytes | The total memory resources available on the node. |
| kube_node_status_capacity_pods | The number of pods the node can schedule. |
| kube_node_status_condition | The current status of the node. |
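
As an example of putting these metrics to work, the sketch below lists pods whose current phase is something other than Running or Succeeded. It assumes kube_pod_status_phase samples have already been parsed into `(labels, value)` pairs, with one series per pod and phase and a value of 1 marking the current phase; the helper name is hypothetical.

```python
def pods_needing_attention(samples):
    """Return (namespace, pod, phase) for pods not Running or Succeeded.

    `samples` is a list of (labels, value) pairs for kube_pod_status_phase.
    A value of 1 marks the phase the pod is currently in.
    """
    healthy = {"Running", "Succeeded"}
    return sorted(
        (labels["namespace"], labels["pod"], labels["phase"])
        for labels, value in samples
        if value == 1 and labels["phase"] not in healthy
    )


samples = [
    ({"namespace": "shop", "pod": "cart-1", "phase": "Running"}, 1),
    ({"namespace": "shop", "pod": "cart-1", "phase": "Pending"}, 0),
    ({"namespace": "shop", "pod": "cart-2", "phase": "Pending"}, 1),
    ({"namespace": "ops", "pod": "backup-9", "phase": "Failed"}, 1),
]
print(pods_needing_attention(samples))
```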

Node Components

As we learned in Part 1 of our journey into Kubernetes, the Nodes of a Kubernetes cluster are made up of multiple parts, and as such you have numerous pieces to monitor.

Kubelet

Keeping a close eye on Kubelet ensures that the Control Plane can always communicate with the node that Kubelet is running on. In addition to the common Go runtime metrics, Kubelet exposes some internals about its actions that are good to track.

| Metric | Description |
| --- | --- |
| kubelet_running_container_count | The number of containers that are currently running. |
| kubelet_runtime_operations | The cumulative number of runtime operations, broken out by operation type. |
| kubelet_runtime_operations_latency_microseconds | The latency of each operation by type in microseconds. |

Node Metrics

Visibility into the standard host metrics of a node ensures you can monitor the health of each node in your cluster, avoiding any downtime as a result of an issue with a particular node. You need visibility into all aspects of the node, including CPU and Memory consumption, System Load, filesystem activity and network activity.

Container Metrics

Monitoring all of the Kubernetes metrics is just one piece of the puzzle. It is imperative that you also have visibility into the containerized applications that Kubernetes is orchestrating. At a minimum, you need access to the resource consumption of those containers. Kubelet accesses the container metrics from cAdvisor, a tool that analyzes the resource usage of containers and makes those metrics available. These include the standard resource metrics like CPU, Memory, File System and Network usage.

As we can see, there are many vital metrics inside your Kubernetes cluster. These metrics ensure you can always answer what is happening not only in Kubernetes and the components of it, but also your applications running inside of it.

Logs

Logs are how we can answer why something is happening. They provide information regarding what the code is doing and the actions it is taking. Kubernetes delivers a wealth of logging for each of its components giving you insights into the decisions it is making. All of your containerized workloads are also generating logs, providing information into the decisions the code is making and actions it is taking. Access to these logs ensures you have comprehensive visibility to monitor and troubleshoot your applications.

How Do I Collect This Data?

Now that we understand what machine data is available to us, how do we get to this data? The good news is that Kubernetes makes most of this data readily available; you just need the right tool to gather and view it.

Collecting Logs

As containers run inside of Kubernetes, their log files are written to the Node that the container is running on. Every node in your cluster will have the logs from every container that runs on that node. We developed an open-source Fluentd plugin that runs on every node in your cluster as a DaemonSet. The plugin is responsible for reading the logs and securely sending them to Sumo Logic. It also enriches the logs with valuable metadata from Kubernetes.

When you create objects in Kubernetes, you can assign custom key-value pairs to each of those objects, called labels. These labels can help you organize and track additional information about each object. For example, you might have a label that represents the application name, the environment the pod is running in or perhaps what team owns this resource. These labels are entirely flexible and can be defined as you need. Our Fluentd plugin ensures those labels get captured along with the logs, giving you continuity between the resources you have created and the log files they are producing.
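
The idea of enriching a log line with pod labels can be sketched in a few lines of Python. This is not the plugin's code or its output format; `enrich_log` and the record's field names are hypothetical and only show how labels can travel alongside the raw log line.

```python
import json


def enrich_log(line, pod):
    """Wrap a raw log line in a JSON record that carries pod metadata.

    `pod` is a dict with the pod's "name", "namespace" and optional
    "labels" (the key-value pairs assigned when the object was created).
    """
    record = {
        "log": line.rstrip("\n"),
        "namespace": pod["namespace"],
        "pod": pod["name"],
        "labels": dict(pod.get("labels", {})),
    }
    return json.dumps(record, sort_keys=True)


pod = {
    "name": "checkout-7d9f",
    "namespace": "shop",
    "labels": {"app": "checkout", "environment": "prod", "team": "payments"},
}
print(enrich_log("payment accepted\n", pod))
```

With the labels attached, a search for everything owned by one team or running in one environment no longer depends on knowing pod names.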

The plugin can be used on any Kubernetes cluster, whether using a managed service like Amazon EKS or a cluster you are running entirely on your own. Using it is as simple as deploying a DaemonSet to Kubernetes. We provide a default configuration that works out of the box with nearly any Kubernetes cluster and a rich set of configuration options so you can fine-tune it to your needs.

Collecting Metrics

Every component of Kubernetes exposes its metrics in a Prometheus format. The running processes behind those components serve up the metrics on an HTTP URL that you can access. For example, the Kubernetes API Server serves its metrics on https://$API_HOST:443/metrics. You simply need to scrape these various URLs to obtain the metrics.
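
To give a feel for what a scrape returns, here is a minimal sketch that parses Prometheus-formatted text into `(name, labels, value)` tuples. It is a simplified parser written for illustration: it skips comment lines, ignores timestamps, and does not unescape label values. The function name and sample payload are assumptions, and fetching the text over HTTPS with credentials is left out.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text output into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


payload = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 42
apiserver_request_count{verb="GET",code="200"} 1027
"""
for sample in parse_metrics(payload):
    print(sample)
```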

We developed a new tool specifically designed to ingest Prometheus formatted metrics into Sumo Logic. This tool can run standalone as a script on some box or can run as a container. It can ingest data from any target that provides Prometheus formatted metrics, including those from Kubernetes. And today, we are open-sourcing that tool for all of our customers to use to start ingesting Prometheus formatted metrics.

The Sumo Logic Prometheus Scraper can be configured to point to multiple targets serving up Prometheus metrics. It supports the ability to include or exclude metrics and gives you full control over the metadata that you send to Sumo Logic. The ability to include metadata ensures that when it comes to Kubernetes, you can capture the same valuable metadata in your metrics that you have in your logs. You can use this metadata when searching through your logs and your metrics, and use them together to have a unified experience when navigating your machine data.
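
The include, exclude and metadata ideas can be sketched generically as follows. This is not the scraper's implementation or its configuration format; `select_metrics`, its parameters and the glob-style patterns are assumptions used only to illustrate the concept.

```python
from fnmatch import fnmatchcase


def select_metrics(samples, include=("*",), exclude=(), metadata=None):
    """Filter samples by name pattern and attach extra metadata.

    `samples` is a list of (name, labels, value) tuples. A sample is kept
    if its name matches any `include` pattern and no `exclude` pattern.
    `metadata` is merged into the labels of every sample that is kept.
    """
    metadata = metadata or {}
    selected = []
    for name, labels, value in samples:
        if not any(fnmatchcase(name, pattern) for pattern in include):
            continue
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            continue
        selected.append((name, {**labels, **metadata}, value))
    return selected


samples = [
    ("go_goroutines", {}, 42.0),
    ("etcd_request_latencies_summary", {"operation": "get"}, 3.0),
    ("apiserver_request_count", {"code": "200"}, 10.0),
]
print(select_metrics(
    samples,
    include=("etcd_*", "apiserver_*"),
    exclude=("*_summary",),
    metadata={"cluster": "prod"},
))
```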

The Challenges Of Kubernetes Monitoring

Kubernetes is complex. It consists of multiple components working together to orchestrate your containerized applications seamlessly. However, it’s not always smooth sailing with Kubernetes. Consider the story of a Kubernetes Platform Engineer trying to diagnose an issue with her application:

  • It all started with a PagerDuty alert, indicating that one of her team’s core applications was having an outage affecting their customers’ experience.
  • She first dove into the application logs looking for an explanation.
  • Next she went to the Kubernetes events, to see what had changed on the cluster.
  • Next she started to look at the Kubernetes configuration for the affected pods, making sure they had the correct resource requests and limits settings.
  • Next she went to GitHub. Have there been any config changes? Maybe new code was pushed out?
  • She started to think about the applications, their dependencies and how the pieces fit together.
  • She continued down this rabbit hole, starting to look at the underlying infrastructure and network layers.

This is a story we hear over and over again from our customers. When things go wrong in Kubernetes, you have to navigate a complex web of dependencies, from Kubernetes to the underlying infrastructure to the application layer.

Sumo Logic: The DevSecOps Platform For Kubernetes

In order to have observability of a modern application like Kubernetes, you need a Continuous Intelligence platform that gives you discoverability, observability and security of your Kubernetes clusters. I am excited to announce the release of our new solution for Kubernetes. Let’s walk through some of these new capabilities.

Discoverability: Explore Your Kubernetes Clusters

Explore is a single pane of glass that lets you discover your Kubernetes clusters no matter where they are running. Explore is an out of the box, context rich experience that allows our customers to map their Kubernetes objects into Sumo Logic and get immediate rich visibility about the behavior of those objects. This allows customers to perform troubleshooting tasks with less friction. Explore comes with a curated set of views that organize a customer’s metadata, so that they can drill down into their services and applications or into their infrastructure.

Discoverability: Next Generation Dashboards (Closed Beta)

Sumo Logic’s new dashboarding framework is optimized for data dense, interactive visualizations for a unified metrics and logs experience. From extensible variables to brand new visualizations like honeycombs, the new dashboard framework lays the foundation for expressive observability across data streams. Our unified logs and metrics panel builders allow you to layer logs and metrics data on the same panels and make it easy to find your data. Contact your account team to request access to the beta today!

Observability: Comprehensive Collection and Data Enrichment

Sumo Logic collects data from your clusters leveraging cloud native technologies tightly coupled with Kubernetes. We collect logs, metrics, events and security events to ensure you have complete observability of your clusters. Sumo Logic enriches these streams with comprehensive metadata. Metadata drives our new Explore experience and makes it easy to pivot between streams of data.

Observability: Data Enrichment of Logs

Log metadata allows customers to freely tag their logs with simple key-value pairs. Any Sumo Logic collector and log source will now support adding key-value pair fields. These fields can be used everywhere in Sumo Logic, from searching logs to securing access to your logs via RBAC. A new fields management page brings proper management of all fields, whether they were created from our new log metadata capability or field extraction rules. Log metadata is integral to our new collection process for Kubernetes. Sumo Logic automatically captures well-known metadata such as pod, container, namespace, cluster, service and deployment with your log streams via the new log metadata feature.

Observability: Transform Your Metrics

Metrics Transformation rules enable you to aggregate and transform your raw time series data into new time series. In addition, they give you flexibility in the retention of the original and the newly transformed data. Transformation rules let you decide how much each time series is worth keeping: you can keep high cardinality, high volume data for a shorter retention period and aggregate the raw data to higher level business KPIs for long term trending and storage.
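
The aggregation idea can be illustrated with a small sketch that collapses high cardinality series into coarser ones by dropping labels and summing values. It does not represent how Sumo Logic's transformation rules are defined or executed; `aggregate` and the sample metric name are hypothetical.

```python
from collections import defaultdict


def aggregate(samples, keep_labels):
    """Sum samples that share a name and the same values for `keep_labels`.

    `samples` is a list of (name, labels, value) tuples. Labels not listed
    in `keep_labels` (for example, a per-pod label) are dropped, which
    reduces the number of distinct time series.
    """
    totals = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted(
            (key, val) for key, val in labels.items() if key in keep_labels
        ))
        totals[(name, kept)] += value
    return [
        (name, dict(kept), total)
        for (name, kept), total in sorted(totals.items())
    ]


samples = [
    ("requests_handled", {"namespace": "shop", "pod": "cart-1"}, 1.0),
    ("requests_handled", {"namespace": "shop", "pod": "cart-2"}, 2.0),
    ("requests_handled", {"namespace": "ops", "pod": "backup-9"}, 4.0),
]
print(aggregate(samples, keep_labels={"namespace"}))
```

Here three per-pod series become two per-namespace series, which is the kind of reduction that makes long term retention cheaper.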

Security: Out of the Box

Observability of Kubernetes is not just about being able to monitor and troubleshoot. You cannot have an observable system without understanding the security of it as well. Sumo Logic is the first DevSecOps platform that delivers continuous intelligence for your Kubernetes clusters with security built right in. Sumo Logic’s solution provides out-of-the-box support for security events with Falco.

Security: Integrated with the Ecosystem

We are launching a suite of new apps built for Kubernetes. Our Kubernetes apps are built for wherever Kubernetes is running. We have refreshed our Kubernetes apps for non-managed, Amazon EKS and GKE and we are also introducing a new App for AKS. We have also partnered with the leading vendors in Kubernetes Security and have apps for Aqua, JFrog Xray, StackRox and Twistlock.

We have also partnered with the leaders in the CI/CD ecosystem and are happy to announce new integrations with Armory, CircleCI and Codefresh.

Sumo Logic: Continuous Intelligence For Kubernetes

Sumo Logic is the first DevSecOps platform that delivers continuous intelligence for your Kubernetes clusters no matter where they run. Our solution gives you the discoverability you need to understand your Kubernetes deeply and with the context needed to navigate at the infrastructure and service and application level. The discoverability is powered by comprehensive observability, capturing all the critical signals from Kubernetes and enriching that data with complete metadata. We provide you with integrated security out of the box and through deep integrations with the wider Kubernetes ecosystem. Be sure to check out more details about our Kubernetes Observability Solution, watch a demo, and sign up for a 30-day free trial!

Next Steps

Great, now that we know how we can get this data into Sumo Logic, let’s see these tools in action and start to make this data work for us to gain profound insights into everything in our Kubernetes Clusters.

In the third and final post in this series, we will go through the steps to use these tools to collect all of this data into Sumo Logic and show how Sumo Logic can help you monitor and troubleshoot everything in your Kubernetes Cluster.

Additional Resources

  • Read the press release on our latest product enhancements unveiled at DockerCon
  • Read the first blog in our three-part Kubernetes series to learn about the K8s anatomy
  • Sign up for Sumo Logic for free

Frank Reno

Principal Product Manager

Frank Reno is a Principal Product Manager at Sumo Logic, where he leads Product for Data Collection. He also serves as Sumo Logic's Open Source Ambassador co-leading all efforts around Open Source. He is also an active contributor to Sumo Logic's open source solutions and the general open source community.
