---
title: "Why prometheus isn’t enough to monitor complex environments"
page_name: "Why Prometheus isn’t enough to monitor complex environments"
type: "blog"
slug: "prometheus-monitoring"
published_at: "2025-08-07"
modified_at: "2026-02-17"
url: "https://www.sumologic.com/blog/prometheus-monitoring"
canonical: "https://www.sumologic.com/blog/prometheus-monitoring"
markdown_url: "https://www.sumologic.com/blog/prometheus-monitoring.md"
lang: "en"
excerpt: "Learn Prometheus monitoring best practices and architecture, and why it isn’t enough for modern Kubernetes environments alone."
taxonomy_blog_category:
  - "DevOps &amp; IT Operations"
---


# Why Prometheus isn’t enough to monitor complex environments

David Girvin

August 7, 2025


Modern systems look very different from those of just a few years ago. Development organizations have moved away from building traditional monoliths toward developing [containerized applications](https://www.sumologic.com/glossary/application-containerization/) that run across highly distributed infrastructure.

While this shift has made systems inherently more resilient, the added complexity makes it both more important and more difficult to identify and address the root cause of a problem when an issue occurs.

Part of the solution to this challenge lies in tools and platforms that can effectively monitor the health of services and infrastructure. To that end, this post explains how Prometheus monitors services and infrastructure and what best practices look like. It also outlines why Prometheus alone is not enough to monitor the complex, highly distributed environments in use today.

## What is Prometheus?

Prometheus is an [open-source monitoring and alerting toolkit](https://github.com/prometheus) that was first developed by SoundCloud in 2012 for cloud-native metrics monitoring.

In [monitoring and observability](https://www.sumologic.com/blog/beyond-monitoring-power-observability), we have three primary data types: logs, metrics, and traces. Metrics act as a stopwatch for your systems, recording values as time series data that help you track [service level objectives](https://www.sumologic.com/glossary/slo-service-level-objective/) (SLOs) and [service level indicators](https://www.sumologic.com/glossary/sli-service-level-indicator/) (SLIs).
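
As a concrete illustration of tracking an SLI against an SLO, the sketch below treats availability as the ratio of good requests to total requests and reports how much of the error budget (1 minus the SLO) remains. The request counts, the 99.9% target, and the function names are hypothetical; in practice the counts would come from request and error counters.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that met the service's definition of "good"."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent, where budget = 1 - SLO.

    Returns 1.0 when no errors occurred, 0.0 when the budget is exactly
    used up, and a negative number once the SLO has been breached.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be strictly between 0 and 1")
    budget = 1.0 - slo
    spent = 1.0 - sli
    return (budget - spent) / budget


# Hypothetical window: 99,950 good requests out of 100,000 against a 99.9% SLO.
sli = availability_sli(99_950, 100_000)
remaining = error_budget_remaining(sli, 0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
# SLI=99.9500%, error budget remaining=50%
```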

Metrics have limits, though. High-cardinality metrics, which have many unique combinations of label values (e.g., `user_id`, `region`), can strain Prometheus’ storage and query performance. Beyond that, many customers need more out of their [observability](https://www.sumologic.com/glossary/observability/) environments, and these days most teams have adopted [OpenTelemetry](https://www.sumologic.com/guides/opentelemetry/) to unify collectors and gather all three data types.
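
To see why a label such as `user_id` is costly, note that every unique combination of label values becomes its own time series. The sketch below uses hypothetical label values for a hypothetical request metric and computes the worst-case series count as the product of each label’s distinct values (an upper bound, since not every combination necessarily appears).

```python
from math import prod

# Hypothetical label values observed on one request-counting metric.
labels = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "region": ["us-east", "eu-west"],
}

# One time series exists per unique label combination, so the worst case
# is the product of the number of distinct values for each label.
worst_case = prod(len(values) for values in labels.values())
print(worst_case)  # 12

# Adding a hypothetical user_id label with 10,000 distinct users multiplies it.
labels["user_id"] = [f"user-{i}" for i in range(10_000)]
with_user_id = prod(len(values) for values in labels.values())
print(with_user_id)  # 120000
```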

## What can be monitored with Prometheus?

Organizations use Prometheus to collect metrics on service and infrastructure performance. Depending on the use case, Prometheus metrics may include performance markers such as CPU utilization, memory usage, total request count, requests per second, exception count, and more. Used effectively, this metrics data can help organizations identify system issues in a timely manner.
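
Some of these markers are derived rather than collected directly: requests per second, for instance, is usually computed from a steadily increasing request counter. The sketch below is a simplified version of that idea that treats any drop in value as a counter reset (for example, a process restart). It is not PromQL’s `rate()` function, which also extrapolates to the edges of its time window, and the scrape data is hypothetical.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a counter over the given samples.

    samples: (unix_timestamp_seconds, counter_value) pairs in time order.
    A drop in value is treated as a counter reset, so the post-reset
    value itself is counted as the increase for that interval.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes of a request counter every 15 seconds, with a restart at t=45.
scrapes = [(0, 100), (15, 160), (30, 220), (45, 30), (60, 90)]
print(per_second_rate(scrapes))  # 3.5
```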

## Prometheus server architecture

Prometheus architecture centers on the Prometheus server, which performs the actual monitoring functions. The Prometheus server is made up of three major components: a time series database, a worker for data retrieval, and an HTTP server.

### Time series database

This component is responsible for storing metrics data. This data is stored as a time series, meaning that the data is represented in the database as a series of timestamped data points belonging to the same metric and set of labeled dimensions.
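
The identity rule behind this data model can be sketched as a toy in-memory store, assuming a series is keyed by its metric name plus its sorted label set, so listing the same labels in a different order does not create a new series. This only illustrates the rule; Prometheus’ actual TSDB uses compressed on-disk blocks and an index. The metric and label names mirror the node exporter’s `node_cpu_seconds_total`, but the timestamps and values are made up.

```python
from collections import defaultdict

# A series is identified by its metric name plus its full, sorted label set.
SeriesKey = tuple[str, tuple[tuple[str, str], ...]]


def series_key(name: str, labels: dict[str, str]) -> SeriesKey:
    return (name, tuple(sorted(labels.items())))


# Toy store: series key -> list of (timestamp_ms, value) samples.
store: dict[SeriesKey, list[tuple[int, float]]] = defaultdict(list)


def append(name: str, labels: dict[str, str], ts_ms: int, value: float) -> None:
    store[series_key(name, labels)].append((ts_ms, value))


append("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"}, 1_000, 12.5)
append("node_cpu_seconds_total", {"mode": "idle", "cpu": "0"}, 16_000, 27.5)  # same series
append("node_cpu_seconds_total", {"cpu": "1", "mode": "idle"}, 1_000, 11.0)   # new series
print(len(store))  # 2
```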

### Worker for data retrieval

This component does exactly what its name implies: it pulls metrics from “targets,” which can be applications, services, or other system infrastructure components, and then writes the collected metrics to the time series database. The worker collects these metrics by scraping HTTP endpoints on the targets; in Prometheus terminology, each scrapable endpoint is called an instance.
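
At its simplest, a scrape is an HTTP GET of a target’s metrics endpoint followed by parsing the plain-text exposition format. The sketch below assumes simple sample lines (no escaped quotes or commas inside label values) and skips `# HELP`/`# TYPE` metadata; the real scraper also handles scrape intervals, timeouts, relabeling, and staleness. The target URL in the docstring is a hypothetical example.

```python
import re
import urllib.request

# Matches simple sample lines such as: http_requests_total{method="GET",code="200"} 1027
# Simplified: escaped quotes or commas inside label values are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Turn exposition-format text into (metric_name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))  # float() accepts +Inf and NaN
    return samples


def scrape(target: str) -> list[tuple[str, dict[str, str], float]]:
    """Pull one target's /metrics page, e.g. scrape("http://localhost:9100")."""
    with urllib.request.urlopen(f"{target}/metrics", timeout=10) as resp:
        return parse_exposition(resp.read().decode("utf-8"))
```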

By default, the metrics endpoint is `<hostaddress>/metrics`. To monitor a target that does not natively expose Prometheus metrics, you configure Prometheus to scrape a Prometheus [exporter](https://prometheus.io/docs/instrumenting/exporters/). At its core, an exporter is a service that fetches metrics from the target, formats them properly, and exposes them on a /metrics endpoint so that the data retrieval worker can pull the data into the time series database. For jobs that cannot be scraped, such as short-lived, service-level batch jobs, the Prometheus Pushgateway lets the job push its time series to an intermediary service, which Prometheus then scrapes.
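
An exporter can be as small as an HTTP handler that renders current values in the text exposition format. The sketch below uses only Python’s standard library; the metric names and values are hypothetical. The `push_to_gateway` helper assumes a Pushgateway reachable at the given URL and uses its `/metrics/job/<job>` push path, and it assumes the job name is already URL-safe.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler

CONTENT_TYPE = "text/plain; version=0.0.4"  # text exposition format


def render_metrics(queue_depth: int, jobs_processed: int) -> str:
    """Format two hypothetical metrics in the Prometheus text exposition format."""
    return (
        "# HELP app_queue_depth Items currently waiting in the work queue.\n"
        "# TYPE app_queue_depth gauge\n"
        f"app_queue_depth {queue_depth}\n"
        "# HELP app_jobs_processed_total Jobs processed since startup.\n"
        "# TYPE app_jobs_processed_total counter\n"
        f"app_jobs_processed_total {jobs_processed}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A real exporter would read these values from the target it wraps.
        body = render_metrics(queue_depth=7, jobs_processed=1542).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def push_to_gateway(gateway: str, job: str, body: str) -> None:
    """Push a batch job's metrics, e.g. gateway="http://localhost:9091"."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body.encode("utf-8"),
        headers={"Content-Type": CONTENT_TYPE},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=10).close()
```

To try the exporter locally, start it with `HTTPServer(("", 8000), MetricsHandler).serve_forever()` (port 8000 is arbitrary, and `HTTPServer` also comes from `http.server`) and point a scrape job at `http://<host>:8000/metrics`.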

### HTTP server 

This server accepts queries written in the Prometheus query language ([PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/)) and uses them to pull data from the time series database. The Prometheus graph UI and other data visualization tools, such as [Grafana](https://www.sumologic.com/blog/prometheus-vs-grafana/), use this HTTP server to give developers and IT personnel an interface for querying and visualizing metrics in a useful, human-friendly format.
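
Tools like Grafana send PromQL to the server’s HTTP API. The sketch below issues an instant query against the `/api/v1/query` endpoint and flattens a vector result into a dictionary keyed by the `instance` label. The server URL is an assumption (9090 is Prometheus’ default port), and `parse_vector` assumes the query returns a vector rather than a matrix or scalar.

```python
import json
import urllib.parse
import urllib.request


def query_url(server: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{server}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict) -> dict[str, float]:
    """Map each result's 'instance' label to its value for a vector result."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    results = payload["data"]["result"]
    # Each vector element carries its labels and a [timestamp, "value"] pair.
    return {r["metric"].get("instance", ""): float(r["value"][1]) for r in results}


def instant_query(server: str, promql: str) -> dict[str, float]:
    with urllib.request.urlopen(query_url(server, promql), timeout=10) as resp:
        return parse_vector(json.load(resp))
```

For example, `instant_query("http://localhost:9090", "up")` would map each scraped instance to 1.0 if its last scrape succeeded and 0.0 if it failed.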

## Managing Prometheus alerts

The Prometheus Alertmanager is also worth mentioning here. Alerting rules, defined in rule files that the Prometheus configuration loads, specify conditions that trigger an alert, such as a metric exceeding a limit. When an alert fires, the Prometheus server pushes it to the Alertmanager, which deduplicates, groups, and routes alerts to the proper personnel via email or other alerting integrations.
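
To illustrate what deduplication and grouping mean, the sketch below treats each alert as a plain dictionary of labels, collapses alerts with identical label sets, and groups the rest by `alertname` and `cluster`. It is a simplified model under those assumptions, not Alertmanager’s implementation: routing to receivers, timing settings such as `group_wait`, silences, and inhibition are all omitted, and the alert names are hypothetical.

```python
from collections import defaultdict


def group_alerts(
    alerts: list[dict[str, str]],
    group_by: tuple[str, ...] = ("alertname", "cluster"),
) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Deduplicate alerts by full label set, then group them by selected labels."""
    # Alerts with identical label sets collapse into one entry.
    unique = {tuple(sorted(a.items())): a for a in alerts}
    groups: dict[tuple[str, ...], list[dict[str, str]]] = defaultdict(list)
    for alert in unique.values():
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts, including one exact duplicate.
alerts = [
    {"alertname": "HighLatency", "cluster": "prod", "instance": "api-1"},
    {"alertname": "HighLatency", "cluster": "prod", "instance": "api-2"},
    {"alertname": "HighLatency", "cluster": "prod", "instance": "api-1"},
    {"alertname": "DiskFull", "cluster": "prod", "instance": "db-1"},
]
groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

Grouping is what lets one notification say "two API instances have high latency in prod" instead of paging someone twice.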

## Why Prometheus on its own isn’t enough

Modern development architectures are far more complex than those of a decade ago. Today’s systems contain many servers running containerized applications and services, often orchestrated in a Kubernetes cluster. These services are loosely coupled, calling one another to provide functionality to the end user, and they may run across multiple cloud environments. The complex nature of these systems can obscure the causes of failures.

To address this challenge, organizations need granular insight into system behavior, and collecting and [aggregating log event data](https://www.sumologic.com/glossary/log-aggregation) is critical to this pursuit. Correlating log data with performance metrics gives organizations the insight and context necessary for efficient [root cause analysis](https://www.sumologic.com/glossary/root-cause-analysis/). Prometheus collects metrics, but it does not collect log data. On its own, therefore, it does not provide the level of detail necessary to support effective [incident response](https://www.sumologic.com/glossary/incident-response/#:~:text=incident%20response%20functionality-,What%20is%20incident%20response%3F,of%20IT%20or%20security%20incidents.).
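
The simplest form of this correlation is pulling the log lines written around the moment a metric misbehaves. The sketch below assumes you already know the spike’s timestamp and have logs as `(timestamp, message)` pairs; the two-minute window and the log messages are hypothetical. Real correlation usually also joins on shared metadata such as pod or service names, not on time alone.

```python
from datetime import datetime, timedelta


def logs_near(
    spike_time: datetime,
    logs: list[tuple[datetime, str]],
    window: timedelta = timedelta(minutes=2),
) -> list[str]:
    """Return log messages within +/- window of a metric spike, oldest first."""
    return [msg for ts, msg in sorted(logs) if abs(ts - spike_time) <= window]


spike = datetime(2025, 8, 7, 12, 0)  # hypothetical error-rate spike
logs = [
    (datetime(2025, 8, 7, 11, 50), "INFO deploy started"),
    (datetime(2025, 8, 7, 11, 59, 30), "ERROR connection pool exhausted"),
    (datetime(2025, 8, 7, 12, 1), "ERROR upstream timeout"),
    (datetime(2025, 8, 7, 12, 10), "INFO recovered"),
]
print(logs_near(spike, logs))
# ['ERROR connection pool exhausted', 'ERROR upstream timeout']
```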

Furthermore, Prometheus faces challenges at significant scale, a situation that is often unavoidable in highly distributed modern systems. Prometheus wasn’t originally built to query and aggregate metrics across multiple Prometheus servers, and configuring it to do so (for example, through federation) adds complexity to the organization’s Prometheus deployment. That complexity makes it harder to attain a holistic view of the entire system, which is critical to performing incident response efficiently.
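
Part of that complexity comes from the fact that each Prometheus server only answers for the series it scraped, so something has to fan out queries and merge the answers. The sketch below assumes each hypothetical server has already returned a per-job request rate and simply sums them; the server and job names are made up.

```python
from collections import Counter


def merge_shard_results(per_server: dict[str, dict[str, float]]) -> dict[str, float]:
    """Sum per-job values returned by several independent Prometheus servers."""
    total: Counter = Counter()
    for results in per_server.values():
        total.update(results)  # adds values for jobs seen on more than one server
    return dict(total)


# Hypothetical per-job request rates (requests/second) from two regional servers.
shards = {
    "prom-us": {"checkout": 120.0, "search": 80.0},
    "prom-eu": {"checkout": 45.0},
}
print(merge_shard_results(shards))  # {'checkout': 165.0, 'search': 80.0}
```

Summing only works for counts and rates: averages and quantiles cannot be merged by addition, and two redundant servers scraping the same targets would be double-counted. Handling those cases correctly is part of the added complexity.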

Finally, Prometheus wasn’t built to retain metrics data for long periods of time; by default, its local storage keeps data for only 15 days. Yet historical data can be invaluable for organizations managing complex environments. For example, teams may want to analyze metrics across several months or a full year to detect patterns in system usage during specific periods. Such insights can inform scaling strategies for times when systems are pushed to their limits.
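
Retention matters because local disk needs grow linearly with it. The Prometheus storage documentation gives a rough capacity formula: retention time in seconds × ingested samples per second × bytes per sample, with samples typically taking 1 to 2 bytes each. The sketch below applies that formula to a hypothetical ingestion rate of 100,000 samples per second, assuming 2 bytes per sample.

```python
def tsdb_disk_bytes(
    retention_days: float, samples_per_second: float, bytes_per_sample: float = 2.0
) -> float:
    """Rough local-storage estimate: retention_seconds * samples/s * bytes/sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Compare the 15-day default retention with a full year at the same hypothetical load.
for days in (15, 365):
    print(days, f"{tsdb_disk_bytes(days, 100_000) / 1e9:.0f} GB")
# 15 259 GB
# 365 6307 GB
```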

## Unified collection for Kubernetes monitoring 

While Prometheus is a great tool for gathering high-level metrics for SLOs and SLIs, [site reliability engineers](https://www.sumologic.com/blog/sre-how-the-role-is-evolving) and security analysts must drill down into logs to find out exactly what went wrong. That’s why unified telemetry collection across all data types (logs, metrics, and traces) is key. We need to shed legacy processes and mindsets and adopt current best practices to ensure the best possible [digital customer experience](https://www.sumologic.com/solutions/digital-customer-experience/).

All of these challenges are best addressed with unified [Kubernetes monitoring](https://www.sumologic.com/solutions/kubernetes-monitoring) using [Sumo Logic’s OpenTelemetry Collector](https://www.sumologic.com/blog/opentelemetry-the-future-of-sumo-logic-observability/) and the [latest Helm Chart](https://help.sumologic.com/release-notes-service/2023/10/25/collection/). Additionally, [Sumo Logic’s Otel Remote Management](https://www.sumologic.com/blog/otel-remote-management) can save you time setting up and managing your collectors. You can still aggregate Prometheus data alongside this collector, but there is no reason to use Prometheus as middleware for metrics unless your setup depends on something specific to it, such as your team’s familiarity with PromQL or specific histograms that are unavailable in the Sumo Logic monitoring environment.

Standardizing on OpenTelemetry helps you achieve a smaller collection footprint and save time on instrumentation, supporting effective security and monitoring practices.

Curious to learn more? [Check out best practices for Kubernetes monitoring](https://www.sumologic.com/briefs/kubernetes-observability).

### FAQs

**Are the results of a root cause analysis always accurate?**

Not always. The accuracy of the analysis depends on data quality, the expertise of the individuals conducting the analysis, and the thoroughness of the investigation process. Log data is the [atomic level](https://www.sumologic.com/blog/future-sumo-logic-atomic-level-logs) of data, which makes it the most helpful and accurate source for root cause analysis.

**How is Sumo Logic different from other Kubernetes monitoring solutions?**

A Kubernetes workload can have many problems, and modern application monitoring tools must pinpoint which combination of pod and node is having issues, then drill into the associated container logs to identify the root cause. Ideally, Kubernetes infrastructure failures should be visualized in a monitoring tool that can capture container metrics, node metrics, resource metrics, Kubernetes cluster logs, and trace data in histograms and charts.

Legacy monitoring tools impose a server-based model on a microservices problem. Your team wastes precious minutes correlating serious customer and security issues with infrastructure problems at the pod, container, and node levels. Sumo Logic has turned this model on its head.

With Sumo Logic, you can view your Kubernetes environment as logs, metrics, and events in various hierarchies, allowing you to view your cluster through the lens of your choice. For example, you can use native Kubernetes metadata, such as a namespace, to visualize the performance of all pods associated with that namespace.

**Are there any specific challenges or limitations to monitoring a Kubernetes deployment in a multi-cloud environment?**

- [Resource utilization](https://www.sumologic.com/blog/monitoring-host-process-metrics) across different clouds
- [Cloud security](https://help.sumologic.com/docs/security/cloud-infrastructure-security/introduction/): managing personal data across multiple clouds
- Integrating [monitoring tools](https://www.sumologic.com/solutions/infrastructure-monitoring/) across different cloud platforms, which can be complex and can hinder a [unified view of the entire deployment](https://www.sumologic.com/brief/accelerate-your-sdlc-with-devsecops/)
- Ensuring consistent metrics across diverse cloud environments

**What Kubernetes metrics should you measure?**

There are many critical metrics for monitoring Kubernetes clusters, and monitoring occurs at two levels: cluster and pod. Cluster monitoring tracks the health of the entire Kubernetes cluster: whether nodes function properly and at the right capacity, how many applications run on each node, and how the cluster utilizes resources. Pod monitoring tracks issues affecting individual pods, such as resource utilization, application and pod replication, and autoscaling metrics.

At the cluster level, you want to measure how many nodes are available and healthy to determine the cloud resources you need to run the cluster. You also need to measure which computing resources your nodes use (including memory, CPU, bandwidth, and disk utilization) to know whether you should decrease or increase the size or number of nodes in a cluster.

At the pod level, there are three key categories of metrics:

**Container**: network, CPU and memory usage

**Application**: specific to the application and related to its business logic

**Pod health and availability**: how the orchestrator handles a specific pod, health checks, network data, and in-progress deployments.
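
To make the cluster-level sizing question concrete, here is a minimal sketch that turns average node CPU and memory utilization into a scaling hint. The 80% and 30% thresholds and the utilization figures are hypothetical, and averaging can hide a single overloaded node, so treat this as an illustration of the reasoning rather than an autoscaling policy.

```python
def node_scaling_hint(
    cpu_util: list[float],
    mem_util: list[float],
    scale_up_at: float = 0.80,   # hypothetical threshold
    scale_down_at: float = 0.30,  # hypothetical threshold
) -> str:
    """Suggest a node-count change from per-node utilization values (0.0 to 1.0)."""
    avg_cpu = sum(cpu_util) / len(cpu_util)
    avg_mem = sum(mem_util) / len(mem_util)
    busiest = max(avg_cpu, avg_mem)  # the scarcer resource drives the decision
    if busiest >= scale_up_at:
        return "add nodes"
    if busiest <= scale_down_at:
        return "remove nodes"
    return "keep current size"


# Three hypothetical nodes running hot on CPU.
print(node_scaling_hint([0.85, 0.90, 0.78], [0.60, 0.65, 0.70]))  # add nodes
```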

 


David Girvin

Lead Technical Advocate

David Girvin is a Technical Advocate at Sumo Logic, facilitating technical accuracy in the cloud of marketing. Previously, he was an AppSec / offensive security architect for places like 1Password and Red Canary. When not working, David travels to surf destinations for surfing and foiling.

