In a rush?
Take it to go by downloading the free white paper, Key Issues Scaling an ELK Stack.
This is a two-part series on the ELK stack vs. Sumo Logic. In part one, we follow the journey of a DevOps engineer moving the Elastic Stack from development into production and highlight the typical effort involved in managing and scaling the ELK stack (also known as the Elastic Stack) in production. In part two, we will summarize total cost of ownership considerations from customers who have migrated from the ELK stack to Sumo Logic.
The ELK stack is a very popular open-source log management toolset. ELK stands for Elasticsearch, Logstash, and Kibana: three separate projects that work well together.
The ELK stack, or Elastic Stack, can be used to aggregate logs from your applications and infrastructure, and there is tremendous value in aggregating and analyzing those logs. Bringing innovative products and features to market faster is one of the key objectives in the new world of cloud applications, and in a data-driven digital economy, extracting value from machine data by aggregating and analyzing logs, metrics, and events is essential to achieving it. In this article, we will walk through a hypothetical scenario and explore a day in the life of a DevOps engineer responsible for managing an ELK stack. The story is hypothetical, but the lessons are drawn from working closely with over 40 customers. You will likely recognize yourself in this story and, even better, be able to predict the next phase of your own ELK stack evolution.
Here is how the story starts. Meet Mike, our DevOps engineer. Mike is very bright and is entrusted with solving some of the most complicated bugs. Currently, he is tasked with troubleshooting a critical issue in one of the microservices. He cannot reproduce the issue, so he needs logs from the production server from around the time the issue was observed.
Summary: The troubleshooting process is extremely reactive; taking too long to identify the root cause results in a poor customer experience.
After some research, he quickly downloaded the ELK stack and installed it on a few AWS EC2 instances. This is a typical first step in the ELK stack journey: it starts with one developer. To troubleshoot the issue, Mike created an ELK stack with minimal data ingestion and retention. The architecture looks something like Figure 1 below: Logstash for ingestion, Elasticsearch nodes for indexing, and Kibana for visualization. The good news is that Mike was able to identify the issue within a few days and start working on a patch.
Summary: At this point, a few AWS EC2 instances are added, and some additional costs are incurred.
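As a sketch, a minimal single-node pipeline like Mike's first setup might use a Logstash configuration along these lines (the log path, grok pattern, and index name are illustrative, not from the story):

```conf
# logstash.conf - minimal pipeline: tail application logs, index into Elasticsearch
input {
  file {
    path => "/var/log/myservice/*.log"   # hypothetical application log path
    start_position => "beginning"
  }
}
filter {
  grok {
    # assumes a common "timestamp level message" log layout
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "myservice-logs-%{+YYYY.MM.dd}"  # daily indices, the usual log pattern
  }
}
```

Everything runs on a handful of EC2 instances, which is exactly why this stage is cheap to start and fragile to grow.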
Sure enough, the value was visible to Sam, the Director of DevOps, and he wanted to adopt the ELK stack for his team, which was responsible for 10+ microservices. A single point of failure and the lack of monitoring in the above architecture were unacceptable to him, so Mike was tasked with devising a solution to remove the single point of failure at Logstash and to add monitoring for the ELK stack. Mike came up with the architecture in Figure 2 below: he added a couple of Logstash servers behind load balancers and purchased Elastic support to get Monitoring (previously Marvel) for the ELK stack.
Summary: More AWS EC2 instances were added for the load balancers and Logstash servers, and the Elastic support subscription added further cost.
On a Friday night (I'm not sure why the IT gods always choose Friday), there was a huge spike in logs due to a failed release. Logstash and the rest of the data pipeline were not designed to handle such a surge in log volume, and data was lost at the critical time when the team needed it most. Somehow, Mike and two other engineers managed to save the day, or should I say night. After the postmortem analysis, the team decided to reduce the impact of spikes on the ELK stack, so Mike added a messaging system to buffer the data pipeline; Apache Kafka, distributed software running on multiple servers, was chosen as the message bus. However, it was now getting to the point where Mike was investing a significant amount of time maintaining the ELK stack and spending far less on new features or patching application issues.
Summary: The possibility of data loss was reduced, but it added considerable work for Mike, the DevOps engineer, and additional cost to run Kafka on EC2 instances.
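The buffering step can be sketched as a shipper/indexer split: edge Logstash instances publish events to a Kafka topic, and the indexing tier consumes them at its own pace, so a spike fills the topic instead of overwhelming Elasticsearch. The broker addresses, topic, and consumer group names below are illustrative:

```conf
# Shipper tier: push events to Kafka instead of straight to Elasticsearch
output {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topic_id => "app-logs"            # hypothetical topic name
  }
}

# Indexer tier: consume from Kafka, decoupled from ingest spikes
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics => ["app-logs"]
    group_id => "logstash-indexers"   # indexers share the backlog as one consumer group
  }
}
output {
  elasticsearch {
    hosts => ["http://es-client:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

The trade-off is exactly the one in the story: a Kafka cluster is itself distributed software on multiple servers that now also needs capacity planning, monitoring, and upgrades.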
Meanwhile, Sam, the Director of DevOps, struggled to meet the uptime commitment for the ELK stack on the current budget and resources; the budget was already significant, but he needed more headcount and dollars to keep the stack operational. He pitched an idea to Didi, the VP of Engineering, to roll out the ELK stack to the entire engineering organization. Sure enough, Didi was concerned about security and scalability: how are you securing the ELK stack, and how can we ensure 200+ engineers can log in to the system without it getting crushed?
Sam went back to Mike and the two other engineers who had already gathered enough tribal knowledge of the ELK stack. After initial investigation and testing, they developed the architecture below. They added Shield (now known as Security) so that users could log in with their own credentials. In the past, the Kibana dashboard link had been shared with other engineering and support teams, resulting in performance issues: the ELK stack would get crushed by too many users running queries, or by a support engineer running a bad query that brought down the entire cluster. To avoid that problem, Mike and the team assigned different roles to the Elasticsearch nodes: master nodes, client nodes, and data nodes. They ensured an odd number of master-eligible nodes, with at least three dedicated master nodes in the cluster; dedicated master nodes reduce the occurrence of split-brain issues (see the Jepsen GitHub repository and blog for more information on split brain). Data nodes handle indexing, and client nodes handle scatter-gather operations: when a client node receives a query, it knows which specific data nodes should service it, reducing the load on the other data nodes. As soon as this architecture was opened to the entire engineering organization, data exploded in size, and the team again faced the choice of increasing the budget or reducing the data.
Summary: More cost, complexity and work for Mike the DevOps engineer.
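The role separation described above can be sketched with a few elasticsearch.yml settings per node type. This uses the pre-7.x boolean form the team would likely have used at the time; newer Elasticsearch releases express the same idea with a single node.roles list:

```yaml
# Dedicated master-eligible node: manages cluster state only, no data, no queries
node.master: true
node.data: false
node.ingest: false

# Data node would instead set:
#   node.master: false
#   node.data: true

# Client (coordinating-only) node would set all three to false and
# just route queries and merge scatter-gather results.

# Quorum of dedicated masters to avoid split brain (pre-7.x setting):
# with three master-eligible nodes, require a majority of two
discovery.zen.minimum_master_nodes: 2
```

With three dedicated masters and a quorum of two, a network partition cannot produce two halves that each elect their own master, which is the split-brain failure mode the Jepsen analyses describe.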
To reduce cost while keeping access to the data, Mike and the team partitioned the data nodes into hot and cold tiers. Hot data nodes store the last 30 days of logs and, for better performance, use provisioned-IOPS EBS volumes (SSD drives if on-prem). For the cold data nodes, the team chose lower-cost storage, such as EBS without provisioned IOPS (or SATA on-premises), accepting slower performance for searches beyond 30 days. Had S3 been used instead of EBS volumes (for more information on S3, see Top 10 Things You Didn't Know About S3 by chief architect Stefan Zier and Joshua Levy), the cold data would not have been directly searchable. Given that logs have a recency bias and most queries covered the last 30 days, using EBS without provisioned IOPS to reduce cost was a good compromise.
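The recency-biased routing can be sketched in a few lines of Python. This is an illustrative helper, not part of any Elastic API, using the 30-day hot window from the story: a query that fits entirely inside the hot window touches only the fast hot nodes, while anything older must also fan out to the slower cold nodes.

```python
from datetime import datetime, timedelta

HOT_RETENTION_DAYS = 30  # logs newer than this live on hot (SSD-backed) data nodes

def tiers_for_query(start, end, now):
    """Return which storage tiers a time-bounded log query must touch.

    Hypothetical helper mirroring the hot/cold split: 'hot' if any of the
    range is newer than the cutoff, 'cold' if any of it is older.
    """
    hot_cutoff = now - timedelta(days=HOT_RETENTION_DAYS)
    tiers = []
    if end > hot_cutoff:
        tiers.append("hot")    # recent data: fast provisioned-IOPS volumes
    if start <= hot_cutoff:
        tiers.append("cold")   # older data: cheaper, slower storage
    return tiers
```

Because most queries land entirely in the hot tier, the expensive storage stays small while historical data remains searchable, which is the compromise the team settled on.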
Mike is extremely unhappy about spending his time on mundane ELK stack management tasks instead of contributing to new, innovative features. Sam, the Director of DevOps, and Didi, the VP of Engineering, struggle to keep ELK up and running without significant resource and capital investment.
Summary: ELK has grown into a cluster of different products that can only be managed by a couple of team members. Cost and complexity have grown significantly compared to stage 1 architecture, where Mike started with ELK Stack to solve his one problem.
At this stage, just when we thought everything had settled, Didi received an email from Peter, the GM of one of the lines of business. After a strategy session with his team, Peter had decided to target retail customers, and he wanted Didi and her team to certify the entire infrastructure and application as PCI DSS 3.2 Level 1 compliant.
Didi calls for an urgent meeting with Mike and Sam to review the impact analysis and timeline to respond to Peter. In Part 2, we will go further on this journey... Stay tuned...
Build versus buy is not a new problem: you can build the value in-house with an ELK stack, or buy it using a secure SaaS-based machine data analytics platform like Sumo Logic. It is the same argument as building your own power grid versus getting service from a utility provider. The cost-complexity curve for each stage looked something like Figure 6 below.
The question is not whether you can manage and scale ELK; the more important question is whether you should.