Log management is the practice of facilitating the transmission, analysis, storage, and archiving of large sets of log data. In DevOps and site reliability engineering, log management gathers and analyzes log data from different application systems to monitor and improve performance, identify issues and bugs, and improve security. In a security context, this process is also known as security log management.
Logs are continuous digital records of events generated by every component of your software stack, including:
Message bus (Kafka)
In other words, logs are everywhere. Every supporting service or component of a modern cloud application logs every action or event that takes place within it. An example of log data from a client machine is shown below. Each line includes a timestamp followed by the hostname, process type, log type, application, action, and TCP socket status.
Dec 10 12:58:39 Client-mac socketfilterfw : openvpn-service: Allow TCP LISTEN (in:0 out:2)
Dec 10 13:01:09 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:2)
Dec 11 18:40:03 Client-mac socketfilterfw : Office365Service: Allow TCP CONNECT (in:1 out:0)
Dec 12 17:17:25 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:4)
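As a rough illustration of how lines like these can be turned into structured fields, here is a minimal Python sketch. The regular expression and the field names (`process`, `application`, `socket_state`, and so on) are assumptions based only on the four sample lines above, not an official format definition.

```python
import re

# Pattern written against the sample lines above; field names are illustrative.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) "
    r"(?P<process>\S+) : "
    r"(?P<application>[^:]+): "
    r"(?P<action>\w+) "
    r"(?P<protocol>\w+) "
    r"(?P<socket_state>\w+) "
    r"\(in:(?P<inbound>\d+) out:(?P<outbound>\d+)\)$"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the connection counters from strings to integers.
    fields["inbound"] = int(fields["inbound"])
    fields["outbound"] = int(fields["outbound"])
    return fields

sample = (
    "Dec 10 12:58:39 Client-mac socketfilterfw : "
    "openvpn-service: Allow TCP LISTEN (in:0 out:2)"
)
record = parse_line(sample)
```

Lines that do not fit the pattern return `None` rather than raising, so a collector built this way could keep unparsed lines for later inspection.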
Logs provide visibility into the health of the application and infrastructure stacks. A lack of log visibility creates operational challenges when modern applications rely on the cloud, on infrastructure their operators do not own, and on microservices architecture, in which three-tier architecture is transformed into n-tier architecture with many-to-many communication between services.
Without that visibility, when a security incident or operational outage occurs, your DevOps, ITOps, and SecOps teams lack the insight they need to resolve the issue quickly. This often leads to higher application latency and system outages, resulting in poor customer experience and customer churn.
Logs exist primarily so that you know where to look to pinpoint the problems behind customer satisfaction issues, application slowdowns, system-wide outages, and security threats. Each log contains a stream of events and includes a wealth of data about software and related infrastructure performance, availability, user access, and behavior. By analyzing these logs, you can proactively detect and resolve issues that impact the business.
Log event examples
Some examples of such log events are:
Step 1: Instrument and collect
Install a collector to gather data from any part of your stack. You can collect logs from operating systems, containers, network devices, AWS infrastructure, application access logs, and custom events. Collection can be done over Syslog, or applications can write logs directly to the centralized log management system over HTTP. Schema-on-read saves considerable time because it eliminates the need to pre-parse a log before ingesting it into a log management system.
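The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `raw_store` is a hypothetical in-memory stand-in for a central store, and the "schema" is a regular expression with named groups supplied at query time.

```python
import re

raw_store = []  # hypothetical stand-in for a central store of unparsed lines

def ingest(line):
    """Store the raw line as-is; no schema is applied at write time."""
    raw_store.append(line)

def query(pattern):
    """Apply a schema (a regex with named groups) only when reading."""
    compiled = re.compile(pattern)
    results = []
    for line in raw_store:
        match = compiled.search(line)
        if match:
            results.append(match.groupdict())
    return results

ingest("Dec 10 13:01:09 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:2)")
ingest("Dec 11 18:40:03 Client-mac socketfilterfw : Office365Service: Allow TCP CONNECT (in:1 out:0)")

# The field "app" is defined by the query, not by the ingestion step.
listeners = query(r": (?P<app>[^:]+): Allow TCP LISTEN")
```

Because nothing is parsed at ingestion, a new question about old data only needs a new pattern, not a re-ingest.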
Step 2: Centralize and index
Centralize the logs for easy access and visibility into relevant modules. With centralized logs, users never have to hop from one server to another and manually “grep” logs of interest on multiple systems, for example to search for a particular string or pattern of text within a log. Indexing allows ITOps and SecOps to quickly search for any term within a log, similar to a Google search.
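To make the indexing step concrete, here is a minimal sketch of an inverted index in Python. It is an assumption about how term search could work in principle, not a description of any particular product's index; the tokenization rule and function names are hypothetical.

```python
import re
from collections import defaultdict

def build_index(lines):
    """Map each lowercase token to the set of line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in re.findall(r"[A-Za-z0-9_-]+", line.lower()):
            index[token].add(line_no)
    return index

def search(index, lines, term):
    """Return matching lines in their original order, without scanning every line."""
    return [lines[i] for i in sorted(index.get(term.lower(), set()))]

logs = [
    "Dec 10 12:58:39 Client-mac socketfilterfw : openvpn-service: Allow TCP LISTEN (in:0 out:2)",
    "Dec 10 13:01:09 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:2)",
    "Dec 11 18:40:03 Client-mac socketfilterfw : Office365Service: Allow TCP CONNECT (in:1 out:0)",
    "Dec 12 17:17:25 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:4)",
]
index = build_index(logs)
hits = search(index, logs, "dropbox")
```

The index is built once, after which each lookup is a dictionary access instead of a pass over every log line.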
Step 3: Search and analyze
After indexing is done, ITOps can search and analyze the information and apply schema on read. Analysis can be done manually, or with native machine learning for advanced analytics that identifies and compares patterns or spots outliers.
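One simple way to think about pattern identification is to collapse the variable parts of each line so that similar events share a signature. The sketch below does this by replacing digit runs with a placeholder; it is a deliberately naive stand-in for the machine-learning approaches mentioned above, and the `signature` helper is hypothetical.

```python
import re
from collections import Counter

def signature(line):
    """Collapse variable parts (digit runs) so similar events share one pattern."""
    return re.sub(r"\d+", "<N>", line)

lines = [
    "Dec 10 12:58:39 Client-mac socketfilterfw : openvpn-service: Allow TCP LISTEN (in:0 out:2)",
    "Dec 10 13:01:09 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:2)",
    "Dec 12 17:17:25 Client-mac socketfilterfw : Dropbox: Allow TCP LISTEN (in:0 out:4)",
]

# Count how many raw lines fall under each pattern.
patterns = Counter(signature(line) for line in lines)
```

Here the two Dropbox lines fall under one pattern and the openvpn-service line under another, so rare patterns stand out as candidates for closer inspection.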
Step 4: Monitor and alert
Monitoring and alerting is the next phase. A log management solution should integrate with commonly used collaboration software such as HipChat, Slack, and PagerDuty to alert users. Continuous monitoring of a large volume of data and logs is unavoidable, so it is necessary to ensure that users are alerted in time, using dynamic thresholds and advanced analytics powered by machine learning.
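A dynamic threshold can be approximated very simply: instead of a fixed limit, compare the current value with the recent mean plus a multiple of the standard deviation. The sketch below assumes per-interval error counts as input and a multiplier `k` of 3; both are illustrative choices, and delivering the alert to a chat or paging tool is left out.

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    """Threshold = mean + k * sample standard deviation of recent counts."""
    return mean(history) + k * stdev(history)

def should_alert(history, current, k=3.0):
    """Alert only when the current count exceeds the threshold derived from history."""
    if len(history) < 2:
        return False  # not enough data to estimate variability
    return current > dynamic_threshold(history, k)

# Hypothetical error counts per interval for one service.
error_counts = [4, 6, 5, 7, 5, 6]

quiet = should_alert(error_counts, 8)
spike = should_alert(error_counts, 20)
```

With this history a count of 8 stays under the threshold while 20 exceeds it, and the threshold moves as the history window changes.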
Step 5: Log report and dashboard
It is important to share reports and dashboards so that the entire team can access the same data. An added benefit is that you create these reports and dashboards just once and then reuse them, so other users do not have to recreate them. Relatedly, RBAC (Role-Based Access Control) is mandatory for providing need-to-know access to the team.
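At its core, RBAC is a mapping from roles to permitted actions. The following sketch shows that idea with made-up role and permission names; real systems typically add users, groups, and scoping by data source.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"view_dashboard"},
    "analyst": {"view_dashboard", "run_search"},
    "admin": {"view_dashboard", "run_search", "edit_dashboard", "manage_users"},
}

def is_allowed(role, action):
    """Return True only if the role explicitly grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

can_view = is_allowed("viewer", "view_dashboard")
can_edit = is_allowed("viewer", "edit_dashboard")
```

Unknown roles get an empty permission set, so access is denied by default.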
The volume of logs that need to be searched is massive and grows continuously. A single log file provides just one point of view, such as that of an application or a server. Without the ability to correlate views across multiple logs from different components, it is very difficult to get a full picture of the problems that occur. Some compliance regulations require companies to store all logs and prohibit developers from accessing production machines. A real-time, centralized log management solution addresses these requirements.
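Correlating views across logs often starts with merging them into one timeline. The sketch below merges two already-sorted event streams by timestamp using `heapq.merge`; the events themselves are invented sample data, and ISO 8601 timestamps are assumed so that string order matches time order.

```python
import heapq

# Invented sample events: (timestamp, source, message), each list sorted by time.
app_events = [
    ("2024-01-01T10:00:05", "app", "checkout request failed"),
    ("2024-01-01T10:00:09", "app", "retry succeeded"),
]
db_events = [
    ("2024-01-01T10:00:03", "db", "connection pool exhausted"),
    ("2024-01-01T10:00:08", "db", "connections recovered"),
]

# Merge the sorted streams into a single timeline ordered by timestamp.
timeline = list(heapq.merge(app_events, db_events, key=lambda event: event[0]))
```

Read in order, the merged timeline places the database event just before the application failure, which neither log shows on its own.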
Ideally, you need a log management solution that centralizes, correlates, and analyzes all of these logs to provide meaningful insights that help IT resolve SLA, performance, and availability problems.
For example, a query against AWS Config logs can count the total number of deleted resources that AWS Config reports.
To draw meaningful insights from these logs that are everywhere and ever-growing, you need a scalable platform that centralizes all these logs, provides a simple search interface for users to look for common exceptions, applies machine learning to detect patterns in behaviors, and helps users with insightful information to not only reactively fix the issues but also to prevent them from recurring.
Common use cases of log management solutions for developers, IT operators, and security professionals include:
Troubleshooting and root cause analysis of modern applications: Get alerted on potential operational problems with modern application stacks or any underlying infrastructure component such as performance degradation, outages, or exceptions causing user experience to suffer.
Improve code quality: Troubleshoot development and deployment issues before code rolls into production.
Security analytics and compliance: Ensure security and regulatory compliance through cloud audits, detection of potential threats or malicious access, and retaining logs based on the compliance mandate.
Business insight: Identify behavior patterns, such as feature usage, and gain insight from user actions rather than user opinions.
Centralizing your log data means you'll need to consolidate and process information from multiple platforms. Below is information on how to manage log files across different systems, platforms, and use cases.
Sumo Logic is a cloud-native, secure, centralized log analytics service that provides insights into logs through pre-built applications, identifying patterns to show outliers in the behavior of applications and systems. IT teams can then act on these outliers immediately, get to the root cause, and prevent any future impact on the business.
Sumo Logic can collect logs from almost any system in nearly any format, and our centralized log management service analyzes over 100 PB of data on an average day.
Sumo Logic provides everything you need to conduct real-time forensics and log management for your IT data without performing complex installations or upgrades and without needing to manage and scale any hardware or storage. With fully elastic scalability, Sumo Logic is a fit for any size deployment.
The following table lists data types and some of the more common sources that produce logs that Sumo Logic can collect. This list is a sample only to provide a general idea of the possible log data sources. It is by no means complete. Learn how Sumo Logic can help you manage log data.