This is a two-part series comparing the ELK stack with Sumo Logic. In Part 1 we follow the journey of a DevOps engineer moving the Elastic Stack from development into production, and we highlight the typical effort involved in managing and scaling the ELK stack (or Elastic Stack) once it is there. In Part 2 we will summarize total cost of ownership considerations from customers who have migrated from the ELK stack to Sumo Logic.
Part 1: A day in the life of an engineer managing the ELK Stack
The ELK Stack is a very popular open source log management tool. ELK stands for Elasticsearch, Logstash, and Kibana. These are three separate projects that work well with each other.
The ELK stack, or Elastic Stack, can be used for aggregating logs from your applications and infrastructure. There is tremendous value in aggregating and analyzing your logs. Bringing innovative products and features to market faster is one of the key objectives in the new world of cloud applications. To innovate faster in a data-driven digital economy, it is essential to extract value from machine data by aggregating and analyzing logs, metrics, and events. In this article, we walk through a hypothetical scenario and explore a day in the life of a DevOps engineer responsible for managing the ELK stack. The story is hypothetical, but the lessons are based on working closely with more than 40 customers. You will probably recognize yourself in this story and, even better, be able to predict the next phase of your own ELK stack evaluation.
Stage 0: No Logging Solution
Here is how the story starts. Meet Mike, our DevOps engineer. Mike is very bright and is trusted to solve some of the most complicated bugs. Currently, he is tasked with troubleshooting a critical issue in one of the microservices. He cannot reproduce the issue, so he needs logs from the production server from around the time the issue was observed.
Summary: An extremely reactive troubleshooting process that takes too long to identify the root cause and results in a poor customer experience.
Stage 1: Starting with a small instance of ELK
After some research, Mike quickly downloads the ELK stack and installs it on a few AWS EC2 instances. This is a very typical first step in the ELK Stack journey, and it usually starts with one developer. To troubleshoot the issue, Mike has set up the ELK stack with minimal data ingestion and retention. The architecture looks something like Figure 1 below: Logstash for ingestion, Elasticsearch nodes for indexing, and Kibana for visualization. The good news is that Mike was able to identify the issue within a few days and start working on the patch.
Summary: At this point, a few AWS EC2 instances have been added and some additional cost incurred.
Figure 1: ELK architecture at the end of Stage 1
Stage 2: Removing the single point of failure
Sure enough, the value is visible to Sam, the Director of DevOps, and he wants to roll out the ELK stack to his entire team, which is responsible for 10+ microservices. The single point of failure and the lack of monitoring in the architecture above were unacceptable to him. Mike was tasked with removing the single point of failure for Logstash and adding a way to monitor the ELK stack itself. Mike came up with the architecture in Figure 2 below. He added a couple of Logstash servers and load balancers, and purchased Elastic support to get Monitoring (previously Marvel) for the ELK stack.
Summary: More AWS EC2 instances were added for the load balancers, which adds to the cost.
Figure 2: ELK architecture with ELB at the end of Stage 2
Stage 3: Adding Kafka to avoid data loss during log spikes
On Friday night (not too sure why Friday is the day chosen by the IT gods), there was a huge spike in logs due to a failed release. Logstash, along with the rest of the data pipeline, was not designed to handle such a surge in log volume. Data was lost at the critical time when the team needed it the most. Somehow Mike and two other engineers were able to save the day, or should I say night? After the postmortem analysis, the team decided to reduce the impact of spikes on the ELK stack. Mike added a messaging system in front of the data pipeline to buffer incoming logs during spikes, and Apache Kafka was chosen as the messaging bus. Think of Kafka as distributed software running on multiple servers, which means more machines to provision and operate. By now, Mike was investing a significant amount of time in maintaining the ELK stack and spending far less time on new features or patching application issues.
Summary: The possibility of data loss was reduced, but it added considerable work for Mike, the DevOps engineer, and additional cost to run Kafka on EC2 instances.
Figure 3: Apache Kafka to avoid data loss during bursts
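To see why a buffer in front of the pipeline helps, here is a minimal toy model in Python. It is only a sketch and does not use Kafka or any Kafka API; the function name `simulate`, the tick-based arrival numbers, and the capacities are all made up for illustration. It compares how many log events are lost during a burst with and without a bounded buffer.

```python
def simulate(arrivals, capacity_per_tick, buffer_size):
    """Return (processed, dropped) for a pipeline fed through a bounded buffer.

    arrivals          -- number of log events arriving at each tick (hypothetical)
    capacity_per_tick -- events the indexing tier can absorb per tick
    buffer_size       -- events the buffer can hold between ticks (0 = no buffer)
    """
    backlog = 0
    processed = 0
    dropped = 0
    for incoming in arrivals:
        available = backlog + incoming
        done = min(available, capacity_per_tick)
        processed += done
        leftover = available - done
        # Whatever does not fit in the buffer is lost.
        backlog = min(leftover, buffer_size)
        dropped += leftover - backlog
    # Once arrivals stop, the remaining backlog is eventually drained.
    processed += backlog
    return processed, dropped


# A failed release causes a burst in the middle of otherwise quiet traffic.
burst = [100, 100, 500, 500, 100, 100]

print(simulate(burst, capacity_per_tick=200, buffer_size=0))     # (800, 600)
print(simulate(burst, capacity_per_tick=200, buffer_size=1000))  # (1400, 0)
```

In this model the buffer does not make indexing any faster; it only lets the pipeline catch up after the burst instead of discarding events, which is the role the messaging bus plays in the story above.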
Stage 4: Adding Security to get enterprise-wide adoption of ELK
Meanwhile, Sam, the Director of DevOps, was struggling to meet the uptime commitment for the ELK Stack with his current budget and resources. The budget was already significant, but he needed more headcount and dollars to keep the ELK stack operational. He decided to pitch an idea to Didi, the VP of Engineering: roll out the ELK stack to the entire engineering organization. Sure enough, Didi was concerned about security and scalability. She asked, "How are you securing the ELK stack? How can we make sure 200+ engineers can log in to the system without it getting crushed?"
Sam went back to Mike and the two other engineers, who had already gathered enough tribal knowledge on the ELK stack. After an initial investigation and testing, Mike and his team came up with the architecture below. They added Shield (now known as Security) so that users could log in using their own credentials.
There had been times when a Kibana dashboard link was shared with other engineering and support teams, which resulted in ELK performance issues. The ELK stack used to get crushed by too many users running queries, or by a support team running one bad query that brought down the entire cluster. To avoid that problem, Mike and the team decided to assign different roles to nodes in the Elasticsearch cluster: master nodes, client nodes, and data nodes. They made sure the cluster had an odd number of dedicated master nodes, with at least three, to reduce the occurrence of split-brain issues. For more information on split-brain issues, please check the Jepsen GitHub repository and blog. Data nodes handle indexing, and client nodes handle scatter-gather operations. When a client node receives a query, it knows exactly which data nodes should service it, thereby reducing the load on the other data nodes.
As soon as the architecture below was opened to the entire engineering organization, data exploded in size. The team was again faced with a choice between asking for more budget and reducing the data.
Summary: More cost, more complexity, and more work for Mike the DevOps engineer.
Figure 4: Adding Shield (now known as Security) to the stack
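The "odd number of master nodes, at least three" rule comes down to simple majority arithmetic. The sketch below is illustrative only: the helper names `quorum` and `tolerated_failures` are hypothetical, and it assumes the majority rule that Elasticsearch versions of this era (before 7.0) expected operators to configure through the `discovery.zen.minimum_master_nodes` setting, namely (master-eligible nodes / 2) + 1.

```python
def quorum(master_eligible_nodes):
    """Smallest number of master-eligible nodes that forms a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 4 3 1
# 5 3 2
```

Three is the smallest count that survives the loss of one master node, and a fourth node adds cost without tolerating any additional failure, which is why odd counts are preferred. Because two disjoint halves of the cluster can never both hold a strict majority, requiring a quorum prevents two masters from being elected at once.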
Stage 5: Optimizing storage on ELK Stack
To reduce cost and still have access to the data, Mike and the team partitioned the data nodes into hot and cold data nodes. Hot data nodes store the most recent 30 days of logs and use SSD drives for better performance. For cold data nodes, the team chose lower-cost hard disks such as SATA drives, compromising on performance. It was a good compromise considering that most queries were for the last 30 days.
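The placement rule behind the hot and cold split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's actual tooling: it assumes daily indices named in the `logstash-YYYY.MM.DD` pattern, treats "30 days" as index ages 0 through 29, and the names `tier_for` and `HOT_RETENTION_DAYS` are hypothetical. Actually moving an index between node groups would be done with Elasticsearch's own allocation settings, which are not shown here.

```python
from datetime import date, datetime

HOT_RETENTION_DAYS = 30  # assumption: indices younger than this stay on SSD nodes


def tier_for(index_name, today, hot_days=HOT_RETENTION_DAYS):
    """Return "hot" or "cold" for a daily index named logstash-YYYY.MM.DD."""
    index_date = datetime.strptime(index_name, "logstash-%Y.%m.%d").date()
    age_in_days = (today - index_date).days
    return "hot" if age_in_days < hot_days else "cold"


today = date(2017, 6, 30)
print(tier_for("logstash-2017.06.30", today))  # hot  (age 0)
print(tier_for("logstash-2017.06.01", today))  # hot  (age 29)
print(tier_for("logstash-2017.05.31", today))  # cold (age 30)
```

A scheduled job could run a check like this once a day and relocate any index that has crossed the threshold, so that the expensive SSD capacity only ever holds the window most queries touch.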
Mike is extremely unhappy about having to work on mundane ELK stack management tasks instead of contributing innovative new features. Sam, the Director of DevOps, and Didi, the VP of Engineering, are now struggling to keep ELK up and running without significant resource and capital investments.
Summary: ELK has grown into a cluster of different products that can only be managed by a couple of people on the team. Cost and complexity have both grown significantly from where Mike started when he set out to solve his one problem.
Figure 5: Adding different data zones to reduce the cost
At this stage, just when we thought everything had settled, Didi received an email from Peter, the GM of one of the lines of business. After a strategy session with his team, Peter has decided to target customers in the retail space. He wants Didi and her team to certify the entire infrastructure and application for PCI DSS 3.2 Level 1.
Didi calls an urgent meeting with Mike and Sam to go over the impact analysis and the timeline so she can respond to Peter. In Part 2 we will go further on this journey… Stay tuned…
Build versus buy is not a new problem. Think about building value in-house with the ELK stack versus buying value with a secure SaaS-based machine data analytics platform such as Sumo Logic. It is exactly the same argument as building your own power grid versus getting service from a utility provider. The cost-complexity curve across the stages looked something like Figure 6 below.
Figure 6: Cost-Complexity Curve by different stages of ELK implementation
The question is not whether you can manage and scale ELK. The most important question to ask is whether you should.