In the era of data abundance, there exists a significant need for database systems that can effectively manage large quantities of data. For certain types of applications, an oft-considered option is Apache Cassandra. Like any other piece of software, however, Cassandra has issues that could potentially impact performance. When this happens, it’s critical to know where to look and what to look for in the effort to quickly restore service to an acceptable level.
Keep reading for a primer on Apache Cassandra, and some steps that can be taken to effectively isolate and resolve problems with cluster performance when they occur.
Apache Cassandra is an open-source NoSQL distributed database regularly used by organizations that need to collect and work with massive amounts of data. Adopted by both small companies and large enterprises, Cassandra provides several key benefits that make it a highly reliable and effective solution for many use cases:
Cassandra is scalable - Cassandra’s node-based architecture enables it to scale with ease. This simplifies the process for increasing capacity and throughput, as necessary.
Cassandra is fault-tolerant - With Cassandra, data can be replicated across nodes and data centers, helping to eliminate the possibility of a single point of failure. This provides a high level of fault tolerance. Consider the scenario in which node A goes down. If node B contains a replica of the data from the now-unavailable node A, the impact of the failure is lessened and availability is maintained.
In any distributed system, an important objective when searching for root cause is to narrow down where the problem is occurring. In the context of Cassandra, this means identifying the instance or instances (node or nodes) that are causing the problem. It’s possible that the issue is occurring for the entire cluster. But it’s also possible that the problem is only present within a single data center, or with a specific set of nodes on which the same data has been replicated. Without knowing exactly where the problem is occurring, it can be difficult to gather the details necessary to formulate a fix and reach a resolution.
An effective strategy for understanding the scope of a problem in a Cassandra database is to monitor metrics and use them to gain insight into the issue at hand. This enables more targeted log analysis, which makes it easier to identify the root cause.
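To make the idea of scoping concrete, here is a minimal Python sketch. It assumes you have already collected one latency figure per node from your monitoring system; the function name `classify_scope`, the node addresses, the data center names, and the threshold are all hypothetical and only serve the illustration. It sorts a problem into the three cases described above: the whole cluster, a single data center, or a specific set of nodes.

```python
from collections import defaultdict


def classify_scope(latency_ms_by_node, dc_by_node, threshold_ms):
    """Return (scope, affected_nodes) for nodes whose latency exceeds threshold_ms.

    scope is "none", "cluster", "datacenter:<name>", or "nodes".
    """
    affected = sorted(
        node for node, value in latency_ms_by_node.items() if value > threshold_ms
    )
    if not affected:
        return "none", affected
    if len(affected) == len(latency_ms_by_node):
        return "cluster", affected

    # Group all known nodes by data center so we can check whether one
    # data center accounts for exactly the set of affected nodes.
    nodes_by_dc = defaultdict(set)
    for node, dc in dc_by_node.items():
        nodes_by_dc[dc].add(node)

    affected_dcs = {dc_by_node[node] for node in affected}
    if len(affected_dcs) == 1:
        dc = next(iter(affected_dcs))
        if nodes_by_dc[dc] == set(affected):
            return "datacenter:" + dc, affected
    return "nodes", affected


# Hypothetical per-node read latencies in milliseconds.
latency = {"10.0.0.1": 12.0, "10.0.0.2": 11.5, "10.0.1.1": 240.0, "10.0.1.2": 310.0}
datacenters = {"10.0.0.1": "dc1", "10.0.0.2": "dc1", "10.0.1.1": "dc2", "10.0.1.2": "dc2"}

scope, nodes = classify_scope(latency, datacenters, threshold_ms=100.0)
print(scope, nodes)
```

With the sample numbers, both slow nodes sit in the same data center and no other node there is healthy, so the sketch reports a data-center-level problem. A fixed threshold is a simplification; in practice you would likely compare each node against its own baseline.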
Cassandra exposes a wide range of metrics that can be of great value in incident response. Monitoring these metrics makes it easier to identify both the existence of a cluster performance problem and its location within the cluster.
Some of the metrics made available by Cassandra are client request metrics, which provide insight into timeouts, failures, and request statistics (type, latency, throughput). Table metrics help track the performance of tables within the distributed database, and keyspace metrics provide similar insight for each keyspace.
These metrics categories represent just a small portion of what’s available and valuable for monitoring and incident response. For a more complete list of Cassandra metrics, their descriptions, and the data types associated with each, you are encouraged to visit the official Cassandra monitoring documentation.
Additionally, Cassandra comes packaged with a utility known as nodetool. This command line tool comes with a set of commands for viewing node and cluster status, viewing compaction information, gaining visibility into node-level statistics (such as load, memory usage, and cache effectiveness), and more. This tool can prove useful in debugging, enabling team members to quickly view information that may help narrow down problems to specific Cassandra instances.
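As a rough illustration of how nodetool output can help narrow a problem down to specific instances, the following Python sketch parses the text printed by `nodetool status` and lists every node that is not both up and in the normal state (`UN`). The sample text is made up for the example (including the shortened host IDs), and the helper name `unhealthy_nodes` is hypothetical; the sketch assumes the usual layout in which each node row starts with a two-letter status/state code followed by the node address.

```python
import re

# Illustrative output only; real host IDs are full UUIDs.
SAMPLE_STATUS = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns (effective)  Host ID    Rack
UN  10.0.0.1   1.2 GiB  256     50.0%             aaaa-1111  rack1
DN  10.0.0.2   1.1 GiB  256     50.0%             bbbb-2222  rack1

Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns (effective)  Host ID    Rack
UJ  10.0.1.1   0.3 GiB  256     ?                 cccc-3333  rack1
UN  10.0.1.2   1.0 GiB  256     50.0%             dddd-4444  rack1
"""

DC_LINE = re.compile(r"^Datacenter:\s*(\S+)")
# First letter: Up/Down. Second letter: Normal/Leaving/Joining/Moving.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def unhealthy_nodes(status_text):
    """Return (datacenter, address, code) for every node whose code is not UN."""
    datacenter = None
    flagged = []
    for line in status_text.splitlines():
        dc_match = DC_LINE.match(line)
        if dc_match:
            datacenter = dc_match.group(1)
            continue
        node_match = NODE_LINE.match(line)
        if node_match:
            status, state, address = node_match.groups()
            code = status + state
            if code != "UN":
                flagged.append((datacenter, address, code))
    return flagged


for datacenter, address, code in unhealthy_nodes(SAMPLE_STATUS):
    print(datacenter, address, code)
```

To run this against a live cluster, one option is to capture the command's output with `subprocess.run(["nodetool", "status"], capture_output=True, text=True)` and pass its `stdout` to the function.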
All in all, careful monitoring and analysis of performance metrics can help to identify and isolate performance problems within a Cassandra cluster. Once this has been done, targeted analysis of Cassandra log data can assist in narrowing down the root cause.
Cassandra writes to several log files that can be of great help when troubleshooting problems, including system.log, debug.log and gc.log. Uncaught exceptions, information about table or keyspace alterations, information about compactions, and more, can be gathered by visiting and analyzing the messages in these log files. For a more in-depth look at Cassandra logging and what each of these log files contains, take a look at the Apache Cassandra common log documentation.
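For a first pass over system.log, a small script can count warnings and errors by the source file that emitted them. The Python sketch below assumes the log uses a pattern in which each entry starts with the level, followed by the thread name in brackets, a timestamp, and `SourceFile.java:line - message`; if your logback configuration differs, the regular expression will need adjusting. The sample lines, class names, and the helper `summarize_levels` are invented for illustration and are not real Cassandra output.

```python
import re
from collections import Counter

# Assumed layout: LEVEL [thread] date time Source.java:line - message
LOG_LINE = re.compile(
    r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\[([^\]]+)\]\s+(\S+ \S+)\s+(\S+?):\d+\s+-\s+(.*)$"
)

# Made-up entries that follow the assumed layout.
SAMPLE_LOG = [
    "INFO  [main] 2024-01-01 12:00:00,000 ExampleStartup.java:10 - illustrative startup message",
    "WARN  [Service Thread] 2024-01-01 12:00:05,123 ExampleMonitor.java:20 - illustrative warning",
    "ERROR [ReadStage-1] 2024-01-01 12:00:06,456 ExampleHandler.java:30 - illustrative error",
    "java.lang.RuntimeException: illustrative stack trace line",
    "WARN  [Service Thread] 2024-01-01 12:00:09,789 ExampleMonitor.java:20 - illustrative warning",
]


def summarize_levels(lines, levels=("WARN", "ERROR")):
    """Count entries per (level, source file) for the requested levels.

    Lines that do not match the assumed layout, such as stack trace
    continuation lines, are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1) in levels:
            counts[(match.group(1), match.group(4))] += 1
    return counts


summary = summarize_levels(SAMPLE_LOG)
for (level, source), count in summary.most_common():
    print(level, source, count)
```

Because the function accepts any iterable of lines, an open file handle for the log can be passed in directly. Running it only on the nodes that the metrics have already flagged keeps the analysis targeted.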
Developers and IT folks know that logs and metrics tell the story when things go awry. This holds true in the case of Apache Cassandra. That said, as with any distributed system, monitoring performance and performing root cause analysis are easier with a tool that is built to centralize and visualize metrics and log data. For Cassandra, Sumo Logic has an app for just that. With step-by-step instructions for setting up log and metrics collection, along with pre-packaged dashboards within the app itself, Sumo Logic makes it easier to monitor your Cassandra cluster and troubleshoot performance issues should they arise.