January 20, 2012 By Christian Beedgen

Log Data is Big Data

Nearly all of today’s most successful businesses rely on data to make smart decisions. Information technology provides the platform for processing this data to the business. It should stand to reason then that IT should be making decisions based on data just the same, in order to optimize and secure the data processing infrastructure. Welcome to the tautology club [http://xkcd.com/703/].

The single biggest data set that IT can use for monitoring, planning, and optimization is log data. After all, logs are what the IT infrastructure generates while it is going about its business (pun intended). Log data is generally the most detailed data available for analyzing the state of the business systems, whether it be for operations, application management, or security. Best of all, the log data is being generated whether it is being collected or not. It’s free data, really. But in order to use it, some non-trivial additional infrastructure has to be put in place. And even with that infrastructure in place, first-generation log management tools ran into problems scaling to the required amount of data, well before the data explosion of the last couple of years really took off.

How can the problems of the first generation be overcome? Ironically, the answer lies in following the trends in business data processing. Big Data has emerged in the recent past as a term to describe systems that have non-traditional scaling requirements, and those problems are increasingly being solved with non-traditional solutions. Once one starts to dissect what Big Data really represents, it starts to become clear that log data has always been Big Data – except that it wasn’t understood or treated as such. Doug Laney has been providing the canonical definition of Big Data since 2001, in observing that the increase in data volume, velocity, and variety is creating challenges for businesses today and in the future. “Just as often what’s new is old (e.g. the 3Vs’ of “big data” I first wrote about at Meta Group over 10 years ago seemingly have become in vogue this year)” [http://blogs.gartner.com/doug-laney].

Those familiar with collecting, analyzing, and operationalizing insight from network, system, and application logs will recognize that even today we are still struggling with many issues represented in Doug Laney’s 3V framework. The sheer volume of logs generated by the typical IT infrastructure has long been an insurmountable problem. Even modest-sized businesses can easily generate half a terabyte of logs each and every day, and the volume of generated data increases as the size of the business grows. The real-time, time-series nature of logs essentially in itself defines the term velocity in this context: logs arrive every second of every minute of every day. Analyzing logs is not a data warehousing exercise; its fundamental value lies in its real-time proposition.
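To get a feel for what half a terabyte a day means as a sustained stream, here is a rough back-of-the-envelope sketch. The average event size of 200 bytes is purely an assumption for illustration (real log lines vary widely), and the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(bytes_per_day: float, avg_event_bytes: float = 200.0):
    """Convert a daily log volume into average bytes/s and events/s.

    avg_event_bytes is an assumed average size of one log line,
    not a measured figure.
    """
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    events_per_second = bytes_per_second / avg_event_bytes
    return bytes_per_second, events_per_second


# Half a terabyte per day, using decimal units (1 TB = 10**12 bytes).
half_terabyte = 0.5 * 10**12
bps, eps = sustained_rate(half_terabyte)
print(f"{bps / 1e6:.1f} MB/s, about {eps:,.0f} events/s")
# prints "5.8 MB/s, about 28,935 events/s"
```

These are averages over a full day. Actual traffic is bursty, so peak rates will typically be higher than this figure suggests.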

Log data does not fall into the convenient schemas required by relational databases. Log data is, at its core, unstructured, or, in fact, semi-structured, which leads to a deafening cacophony of formats; the sheer variety in which logs are being generated is presenting a major problem in how they are analyzed. The emergence of Big Data has not only been driven by the increasing amount of unstructured data to be processed in near real-time, but also by the availability of new toolsets to deal with these challenges.
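To make the variety problem concrete, here is a minimal sketch of what handling just two formats looks like. The two formats (an Apache-style access log line and a free-form key=value line), the field names, and the `parse_line` helper are all assumptions chosen for illustration; a real deployment faces many more formats than this.

```python
import re

# Illustrative pattern for an Apache common-log-style access line.
APACHE_COMMON = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)
# Illustrative pattern for key=value pairs, with optional double quotes.
KEY_VALUE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_line(line: str) -> dict:
    """Try each known format in turn; keep the raw text if none fits."""
    match = APACHE_COMMON.match(line)
    if match:
        return {"format": "apache_common", **match.groupdict()}
    pairs = KEY_VALUE.findall(line)
    if pairs:
        fields = {key: value.strip('"') for key, value in pairs}
        return {"format": "key_value", **fields}
    # Unknown formats are kept verbatim rather than dropped.
    return {"format": "unknown", "raw": line}


access = parse_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
app = parse_line('ts=2012-01-20T10:00:00Z level=error msg="disk full" host=web01')
other = parse_line("something went wrong")
```

Each new source adds another pattern to the chain, and no fixed set of columns fits all of the resulting records. That is the mismatch with relational schemas described above.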

Classic relational data management solutions simply are not built for this data, as every single legacy vendor in the SIEM and Log Management category has painfully experienced. Web-scale properties such as Google, Yahoo, Amazon, LinkedIn, Facebook and many others have faced the challenges embodied in the 3Vs first. At the same time, some of these companies have decided to turn what they learned in building large scale infrastructures to run their own business into a strategic product asset itself. The need to solve planetary-scale problems has led to the invention of Big Data tools, such as Hadoop, Cassandra, HBase, Hive, and the lot. And so today it is possible to leverage offerings such as Amazon AWS in combination with the aforementioned Big Data tools to build platforms that can address the challenges – and opportunities – of Big Data head on and without requiring a broader IT footprint.

Christian Beedgen

As co-founder and CTO of Sumo Logic, Christian Beedgen brings 18 years of experience creating industry-leading enterprise software products. Since 2010 he has been focused on building Sumo Logic’s multi-tenant, cloud-native machine data analytics platform, which is widely used today by more than 1,600 customers and 50,000 users. Prior to Sumo Logic, Christian was an early engineer, engineering director and chief architect at ArcSight, contributing to ArcSight’s SIEM and log management solutions.
