04.10.2014 | Posted by Stefan Zier, Lead Architect
By now, you have likely read about the security vulnerability known as the Heartbleed bug. It is a vulnerability in the widespread OpenSSL library. It allows stealing the information protected, under normal conditions, by the SSL/TLS encryption used to encrypt traffic on the Internet (including Sumo Logic).
How did we eliminate the threat?
When we were notified about the issue, we quickly discovered that our own customer-facing SSL implementation was vulnerable to the attack — thankfully, the advisory was accompanied by some scripts and tools to test for the vulnerability. Mitigation happened in four steps:
Fix vulnerable servers. As a first step, we needed to close the information leak. In some cases, that meant working with third-party vendors (most notably Amazon Web Services, which runs our Elastic Load Balancers) to get all servers patched. This step was concluded once we confirmed that all of the load balancers in the DNS rotation were no longer vulnerable.
Replace SSL key pairs. Even though we had no reason to believe there was any actual attack against our SSL private keys, it was clear all of them had to be replaced as a precaution. Once we had them deployed out to all the servers and load balancers, we revoked all previous certificates with our CA, GeoTrust. All major browsers perform revocation checks against OCSP responders or CRLs.
Notify customers. Shortly after we resolved the issues, we sent an advisory to all of our customers, recommending a password change. Again, this was a purely precautionary measure, as there is no evidence of any passwords leaking.
Update Collectors. We have added a new feature to our Collectors that will automatically replace the Collector’s credentials. Once we complete testing, we will recommend that all customers upgrade to the new version. We also enabled support for certificate revocation checking, which wasn’t enabled previously.
How has this affected our customers?
Thus far, we have not seen any signs of unusual activity, nor have we seen any customers lose data due to this bug. Unfortunately, we’ve had to inconvenience our customers with requests to change passwords and update Collectors, but given the gravity of the vulnerability, we felt the inconvenience was justified.
Our intranet is hosted on AWS and our users use OpenVPN to connect to it. The version of OpenVPN we had been running needed to be updated to a version that was released today. Other servers behind OpenVPN also needed updating.
Sumo Logic uses on the order of 60-70 SaaS services internally. Shortly after we resolved our customer-facing issues, we performed an assessment of all those SaaS services. We combined the vulnerability test scripts with DNS lookups. If a service looked like it was hosted with a provider or service that was known to have been vulnerable (such as AWS ELB), we added it to our list. We are now working our way through the list and changing passwords on all affected applications, starting with the most critical ones. Unfortunately, manually changing passwords in all of the affected applications takes time and places a burden on our internal IT users. We plan to have completed this process by the end of the week.
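The DNS half of that assessment can be sketched roughly as follows. This is a minimal illustration rather than the scripts we actually ran: the `.elb.amazonaws.com` suffix check, the function names and the example hostnames are all assumptions made for the sketch, and a real triage would pair it with one of the published vulnerability test tools.

```python
import socket

# Assumption: Elastic Load Balancer endpoints resolve to names under this suffix.
ELB_SUFFIX = ".elb.amazonaws.com"


def resolved_names(host):
    """Return the canonical name and aliases the resolver reports for host."""
    canonical, aliases, _addresses = socket.gethostbyname_ex(host)
    return [canonical, *aliases]


def looks_like_elb(names):
    """True if any name in the resolution chain sits under the ELB suffix."""
    return any(n.rstrip(".").lower().endswith(ELB_SUFFIX) for n in names)


def triage(services, resolver=resolved_names):
    """Return the hostnames whose DNS chain points at an ELB."""
    flagged = []
    for host in services:
        try:
            names = resolver(host)
        except OSError:
            continue  # could not resolve; this one needs a manual look
        if looks_like_elb(names):
            flagged.append(host)
    return flagged
```

Passing the resolver in as a parameter keeps the classification logic testable without touching the network; calling `triage(["app.example.com"])` with the default resolver performs real lookups.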
Overall, the past few days were pretty interesting on the internet. Many servers (as many as 66% of SSL servers on the net) are running OpenSSL, and most were affected. Big sites, including Yahoo Mail and many others, were affected. The pace of exploitation and spread of the issue were both concerning. Thankfully, Codenomicon, the company that discovered this vulnerability, did an amazing job handling and disclosing it in a pragmatic and responsible fashion. This allowed everybody to fix the issue rapidly and minimize impact on their users.
03.19.2014 | Posted by Ariel Smoliar, Senior Product Manager
Last July we launched our Applications webpage, and have been constantly adding new applications to this list. This week we are excited to announce a major step in delivering a better application user experience. We have now integrated the Sumo Logic Application Library directly into our core service and have made it available for both trial users and paying customers.
The initial Library rollout includes the following applications: Active Directory, Apache, AWS CloudTrail, Cisco, IIS, Linux, Log Analysis QuickStart, Nginx, Palo Alto Networks, VMware and Windows. We updated the Sumo Logic user interface with a new “Apps” tab. You can install applications from the menu for a true self-service experience, without downloading any files. The dashboards for the applications you choose will be visible after following a few simple steps.
Over the coming weeks, we will add the remainder of the Sumo Logic Applications to the Library, including ones for Akamai Cloud Monitor, AWS Elastic Load Balancing, Snort, OSSEC, and more. Until then, we will manually load these applications for our customers.
What’s Next and Feedback
This is just the first phase in the rollout of our Application Library. We will continue to deliver more applications that provide critical insights into your operational and security use cases. In addition, we will continue to enhance the Library itself as a system to share relevant insights across your organization.
We are eager to hear your feedback on this initial phase. Please fill out this form if you would like to meet with us and share your experience using the Sumo Application Library.
03.06.2014 | Posted by Ariel Smoliar, Senior Product Manager
After the successful launch of the Sumo Logic Application for AWS CloudTrail last November and with numerous customers now using this application, we were really excited to work again on a new logging service from AWS, this time providing analytics around log files generated by the AWS Load Balancers.
Our integration with AWS CloudTrail targets use cases relevant to security, usage and operations. With our new application for AWS Elastic Load Balancing, we give our customers dashboards that provide real-time insights into operational data. You will also be able to add use cases based on your requirements by parsing the log entries and visualizing the data using our visualization tools.
Insights from ELB Log Data
Sumo Logic runs natively on the AWS infrastructure and uses AWS load balancers, so we had plenty of raw data to work with during the development of the content. You will find 12 fields in the ELB logs covering the entire request/response lifecycle. By adding the request, backend and response processing time, we can highlight the total time (latency) from when the load balancer started reading the request headers to when the load balancer started sending the response headers to the client. The Latency Analysis dashboard presents a granular analysis per domain, client IP and backend instance (EC2).
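As a rough illustration of that latency calculation outside of Sumo Logic, the sketch below splits one access log line into its 12 fields and adds up the three processing times. The field names follow the naming in the AWS ELB access log documentation, the sample line is made up, and the helper names are hypothetical; this is not the query the Application itself uses.

```python
import shlex

# The 12 space-separated fields of an ELB access log entry, in order.
FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request",
]


def parse_elb_line(line):
    """Split one log line into a dict keyed by field name."""
    values = shlex.split(line)  # keeps the quoted request as a single field
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def total_latency(entry):
    """Sum of request, backend and response processing time, in seconds."""
    return sum(
        float(entry[name])
        for name in (
            "request_processing_time",
            "backend_processing_time",
            "response_processing_time",
        )
    )


sample = (
    '2014-02-15T23:39:43.945958Z my-test-loadbalancer 192.168.131.39:2817 '
    '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://www.example.com:80/ HTTP/1.1"'
)
entry = parse_elb_line(sample)
latency = total_latency(entry)  # roughly 0.001178 seconds for this sample
```

Grouping `latency` by domain, client IP or backend instance then gives the same breakdown the Latency Analysis dashboard presents.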
The Application also provides analysis of the status codes based on the ELB and backend instances status codes. Please note that the total count for the status codes will be similar for both the ELB and the instances most of the time, unless there are issues, such as no backend response or client rejected request. Additionally, for ELBs that have been configured with a TCP listener (layer 4) rather than HTTP, the TCP requests will be logged. In this case, you will see that the URL has three dashes and there are no values for the HTTP status codes.
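The two cases described above can be told apart with a couple of small checks. This is only a sketch under assumptions: it expects each log entry as a dict with `request`, `elb_status_code` and `backend_status_code` keys (names borrowed from the AWS documentation), and the sample values are invented.

```python
def is_tcp_entry(entry):
    """TCP (layer 4) listeners log three dashes as the request and no HTTP status codes."""
    return entry["request"] == "- - -" and entry["elb_status_code"] == "-"


def status_codes_differ(entry):
    """True when the ELB and the backend instance report different HTTP status codes."""
    if is_tcp_entry(entry):
        return False  # nothing to compare for TCP requests
    return entry["elb_status_code"] != entry["backend_status_code"]


http_ok = {"request": "GET http://www.example.com:80/ HTTP/1.1",
           "elb_status_code": "200", "backend_status_code": "200"}
http_bad = {"request": "GET http://www.example.com:80/ HTTP/1.1",
            "elb_status_code": "504", "backend_status_code": "0"}
tcp = {"request": "- - -", "elb_status_code": "-", "backend_status_code": "-"}

mismatches = [e for e in (http_ok, http_bad, tcp) if status_codes_differ(e)]
```

Entries that end up in `mismatches` are the ones worth a closer look, since the two counts normally track each other.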
Often during my discussions with Sumo Logic users, the topic of scheduled searches and alerting comes up. Based on our work with ELB logs, there is no single threshold we can recommend that covers every use case. The threshold should be based on the application – e.g., tiny beacon requests and huge file downloads produce very different latencies. Sumo Logic gives you the flexibility to set a threshold in the scheduled search or just to change the color in the graph, based on the value range, for monitoring purposes.
I want to talk a little bit about machine data visualization. While skiing last week in Steamboat, Colorado, I kept thinking about how the beautiful Rocky Mountain landscape relates to the somewhat more mundane world of load balancer data visualization. So here is what we did to present the load balancer data in a more compelling way:
You can slice and dice the data using our Transpose operator as we did in the Latency by Load Balancer monitor, but I would like to focus on a different feature that was built by our UI team and share how we used it in this application. This feature combines the number of requests, the total size of the requests and the client IP address into the Total Requests and Data Volume monitor.
We first used this visualization approach in our Nginx app (Traffic Volume and Bytes Served monitor). We received very positive feedback and decided it made sense to incorporate this approach into this application as well.
Combining three fields in a single view gives you a faster overview of your environment and also lets you drill down and investigate any activity.
It reminds one of the landscape above, right?
To get this same visualization, click on the gear icon in the Search screen and choose the Change Series option.
For each data series, you can choose how you would like to represent the data. We used Column Chart for the total requests and Line Chart for the received and sent data.
I find it beautiful and useful. I hope you plan to use this visualization approach in your dashboards, and please let us know if any help is required.
One more thing…
Please stay tuned and check our posts next week… we can’t wait to share with you where we’re going next in the world of Sumo Logic Applications.
03.05.2014 | Posted by David Andrzejewski, Data Sciences Engineer
A few weeks ago I had the pleasure of hosting the machine data track of talks at Strata Santa Clara. Like “big data”, the phrase “machine data” is associated with multiple (sometimes conflicting) definitions; two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or collected automatically by machines. This includes software logs and sensor measurements from systems as varied as mobile phones, airplane engines, and data centers. The concept is closely related to the “internet of things”, which refers to the trend of increasing connectivity and instrumentation in existing devices, like home thermostats.
More data, more problems
This data can be useful for the early detection of operational problems or the discovery of opportunities for improved efficiency. However, the decoupling of data generation and collection from human action means that the volume of machine data can grow at machine scales (i.e., Moore’s Law), an issue raised by both Monash and Abadi. This explosive growth rate amplifies existing challenges associated with “big data.” In particular, two common motifs among the talks at Strata were the difficulties around:
- mechanics: the technical details of data collection, storage, and analysis
- semantics: extracting understandable and actionable information from the data deluge
The talks covered applications involving machine data from both physical systems (e.g., cars) and computer systems, and highlighted the growing fuzziness of the distinction between the two categories.
Steven Gustafson and Parag Goradia of GE discussed the “industrial internet” of sensors monitoring heavy equipment such as airplane engines or manufacturing machinery. One anecdotal data point was that a single gas turbine sensor can generate 500 GB of data per day. Because of the physical scale of these applications, using data to drive even small efficiency improvements can have enormous impacts (e.g., in amounts of jet fuel saved).
Moving from energy generation to distribution, Brett Sargent of LumaSense Technologies presented a startling perspective on the state of the power grid in the United States, stating that the average age of an electrical distribution substation in the United States is over 50 years, while its intended lifetime was only 40 years. His talk discussed remote sensing and data analysis for monitoring and troubleshooting this critical infrastructure.
Ian Huston, Alexander Kagoshima, and Noelle Sio from Pivotal presented analyses of traffic data. The talk revealed both common-sense (traffic moves more slowly during rush hour) and counterintuitive (disruptions in London tended to resolve more quickly when it was raining) findings.
My presentation showed how we apply machine learning at Sumo Logic to help users navigate machine log data (e.g., software logs). The talk emphasized the effectiveness of combining human guidance with machine learning algorithms.
Krishna Raj Raja and Balaji Parimi of Cloudphysics discussed how machine data can be applied to problems in data center management. One very interesting idea was to use data and modeling to predict how different configuration changes would affect data center performance.
The amount of data available for analysis is exploding, and we are still in the very early days of discovering how to best make use of it. It was great to hear about different application domains and novel techniques, and to discuss strategies and design patterns for getting the most out of data.
02.23.2014 | Posted by Bruno Kurtic, Founding Vice President of Product and Strategy
Security is a tricky thing and it means different things to different people. It is truly in the eye of the beholder. There is the checkbox kind, there is the “real” kind, there is the checkbox kind that holds up, and there is the “real” kind that is circumvented, and so on. Don’t kid yourself: the “absolute” kind does not exist.
I want to talk about security solutions based on log data. This is the kind of security that kicks in after the perimeter security (firewalls), intrusion detection (IDS/IPS), vulnerability scanners, and dozens of other security technologies have done their thing. It ties all of these technologies together, correlates their events, reduces false positives and enables forensic investigation. Sometimes this technology is called Log Management and/or Security Information and Event Management (SIEM). I used to build these technologies years ago, but it seems like decades ago.
A typical SIEM product is a hulking appliance, sharp edges, screaming colors – the kind of design that instills confidence and says “Don’t come close, I WILL SHRED YOU! GRRRRRRRRRR”.
Ahhhh, SIEM, makes you feel safe, doesn’t it? It should not. I proclaim this at the risk of being yet another one of those guys who wants to rag on SIEM, but I built one, and beat many, so I feel I’ve got some ragging rights. So, what’s wrong with SIEM? Where does it fall apart?
SIEM does not scale
It is hard enough to capture a terabyte of daily logs (40,000 Events Per Second, 3 Billion Events per Day) and store them. It is a couple of orders of magnitude harder to run correlation in real time and alert when something bad happens. SIEM tools are extraordinarily difficult to run at scales above 100GB of data per day. This is because they are designed to scale by adding more CPU, memory, and fast spindles to the same box. The exponential growth of data over the two decades since those SIEM tools were designed has outpaced the ability to add CPU, memory, and fast spindles into the box.
Result: Data growth outpaces capacity → Data dropped from collection → Significant data dropped from correlation → Gap in analysis → Serious gap in security
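For a feel of the numbers quoted above, here is the back-of-the-envelope arithmetic. It assumes a decimal terabyte (10^12 bytes) and a perfectly steady event rate, neither of which is stated in the figures themselves.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_day(events_per_second):
    """Daily event count at a constant rate."""
    return events_per_second * SECONDS_PER_DAY


def average_event_size(bytes_per_day, events_per_second):
    """Average bytes per event implied by a daily volume and an event rate."""
    return bytes_per_day / events_per_day(events_per_second)


daily_events = events_per_day(40_000)              # 3,456,000,000 events
avg_bytes = average_event_size(10**12, 40_000)     # a bit under 290 bytes
```

So 40,000 events per second works out to roughly 3.5 billion events a day, and a terabyte spread over that many events implies log lines of a few hundred bytes each.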
SIEM normalization can’t keep pace
SIEM tools depend on normalization (shoehorning) of all data into one common schema so that you can write queries across all events. That worked fifteen years ago when sources were few. These days sources and infrastructure types are expanding like never before. One enterprise might have multiple vendors and versions of network gear, many versions of operating systems, open source technologies, workloads running in infrastructure as a service (IaaS), and many custom written applications. Writing normalizers to keep pace with changing log formats is not possible.
Result: Too many data types and versions → Falling behind on adding new sources → Reduced source support → Gaps in analysis → Serious gaps in security
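To make the maintenance problem concrete, here is a toy normalizer. The source types, log formats and schema fields are all hypothetical, invented for this sketch, and no real SIEM product is being modeled; the point is only that each format and version needs its own hand-written mapping.

```python
import re

# Hypothetical common schema that every event is shoehorned into.
COMMON_FIELDS = ("src_ip", "dst_ip", "user")

# One hand-written pattern per source type AND version (formats are made up).
NORMALIZERS = {
    "firewall_v1": re.compile(r"DENY src=(?P<src_ip>\S+) dst=(?P<dst_ip>\S+)"),
    "webapp_v1": re.compile(r"login failed user=(?P<user>\S+) ip=(?P<src_ip>\S+)"),
}


def normalize(source_type, line):
    """Map a raw line onto the common schema, or None if no normalizer fits."""
    pattern = NORMALIZERS.get(source_type)
    match = pattern.search(line) if pattern else None
    if match is None:
        return None  # unknown source or changed format: the event is lost
    found = match.groupdict()
    return {field: found.get(field) for field in COMMON_FIELDS}


ok = normalize("firewall_v1", "DENY src=10.0.0.5 dst=10.0.0.9")
# A firmware upgrade renames the keys and the old pattern silently stops matching.
lost = normalize("firewall_v1", "action=deny source=10.0.0.5 dest=10.0.0.9")
```

Every new vendor, version or custom application means another entry in `NORMALIZERS`, and every missed entry is data that never reaches correlation.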
SIEM is rule-only based
This is a tough one. Rules are useful, even required, but not sufficient. Rules only catch the thing you express in them, the things you know to look for. To be secure, you must be ahead of new threats. A million monkeys writing rules in real-time: not possible.
Result: Your rules are stale → You hire a million monkeys → Monkeys eat all your bananas → You analyze only a subset of relevant events → Serious gap in security
SIEM is too complex
It is way too hard to run these things. I’ve had too many meetings and discussions with my former customers on how to keep the damned things running and too few meetings on how to get value out of the fancy features we provided. In reality most customers get to use only 20% of the features because the rest is not reachable. It is like putting your best tools on the shelf just out of reach. You can see them, you could do oh so much with them, but you can’t really use them because they are out of reach.
Result: You spend a lot of money → Your team spends a lot of time running SIEM → They don’t succeed on leveraging the cool capabilities → Value is low → Gaps in analysis → Serious gaps in security
So, what is an honest, forward-looking security professional who does not want to duct tape a solution to do? What you need is what we just started: Sumo Logic Enterprise Security Analytics. No, it is not absolute security, it is not checkbox security, but it is a more real security because it:
Processes terabytes of your data per day in real time. Evaluates rules regardless of data volume and does not restrict what you collect or analyze. Furthermore, no SIEM style normalization, just add data, a pinch of savvy, a tablespoon of massively parallel compute, and voila.
Result: you add all relevant data → you analyze it all → you get better security
It is SaaS, there are no appliances, there are no servers, there is no storage, there is just a browser connected to an elastic cloud.
Result: you don’t have to spend time on running it → you spend time on using it → you get more value → better analysis → better security
Rules, check. What about that other unknown stuff? Answer: machine that learns from data. It detects patterns without human input. It then figures out baselines and normal behavior across sources. In real-time it compares new data to the baseline and notifies you when things are sideways. Even if “things” are things you’ve NEVER even thought about and NOBODY in the universe has EVER written a single rule to detect. Sumo Logic detects those too.
Result: Skynet … nah, benevolent overlord, nah, not yet anyway. New stuff happens → machines go to work → machines notify you → you provide feedback → machines learn and get smarter → bad things are detected → better security
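The baseline-and-compare idea can be illustrated with a deliberately simple stand-in. To be clear, this is not how Sumo Logic's anomaly detection works; it is a plain mean and standard deviation check with an assumed threshold of three deviations, shown only to make the concept tangible.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """Summarize past observations as (center, spread)."""
    return mean(history), pstdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the baseline."""
    center, spread = baseline
    if spread == 0:
        return value != center  # flat history: any change stands out
    return abs(value - center) / spread > threshold


# Made-up counts of some event per minute.
history = [100, 102, 98, 101, 99]
baseline = build_baseline(history)

normal = is_anomalous(101, baseline)   # within the usual range
spike = is_anomalous(150, baseline)    # far outside the usual range
```

No rule about the value 150 was ever written; the spike is flagged purely because it departs from what the data has looked like so far.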
Read more: Sumo Logic Enterprise Security Analytics
02.06.2014 | Posted by Sanjay Sarathy, CMO
It has been an extremely productive start to 2014 at Sumo Logic. Over the past few weeks I’ve talked with a number of customers about their current and planned use of Sumo Logic. From these conversations three distinct patterns have emerged around the core value of what we provide which I’d like to share.
Scale. With workloads increasingly moving to the Cloud, the rapid growth of mobile applications and the speed with which companies can add and remove computing capacity, dynamism is the word of the day. Customers who are going through significant architectural changes and have chosen Sumo Logic often talk positively about the concept of scale with us. This refers to more than just our ability to scale with their data growth. It also refers to how customers use Sumo Logic to scale their machine data analytics *without* having to scale their operational overhead. A customer once said, “hosting software is for suckers”, and I have to agree.
Time to Value. Every customer wishes to minimize time figuring out what’s relevant in their data sets. Customers love our advanced analytics and how it enables them to be more proactive with their IT and security environments. Just as important to them as our LogReduce and Anomaly Detection capabilities is our ability to shorten the front end of the process – the ability to instantly query large amounts of log data based on very rapid collection and indexing. Their ability to get to production with Sumo Logic in a day or less is a huge win for them and drives immediate business value. No longer do companies have to spend weeks or months configuring an environment to start a project.
Content is King. We’ve spent a lot of time over the past months releasing new application content that supports a multitude of use cases, from IT Ops to Security. Included on this list are apps for Akamai, AWS CloudTrail, Palo Alto Networks and even a QuickStart application for newbies to log management. The customer response has been extremely positive – the ability to easily incorporate knowledge about these data sources and generate additional business insights has significantly helped with the previous point around creating rapid business value. These apps combined with our recently launched content library are just the beginning of the machine data economy that is built around sharing insights and communal knowledge.
01.23.2014 | Posted by Sanjay Sarathy, CMO
Remember Moneyball? Moneyball is the story of how the performance of the Oakland A’s skyrocketed when they started to vet players based on sabermetrics principles, a data-driven solution that defied conventional wisdom. The team’s success with a metrics-driven approach only came about because GM Billy Beane and one of his assistants, Paul DePodesta, identified the value in player statistics and trusted these insights over what other baseball teams had accepted as true. Any business can learn a significant lesson from Billy Beane and Paul DePodesta, and it is a lesson that speaks volumes about the future of data in business.
If a business wants their data to drive innovation, they need to manage that data like the Oakland A’s did. Data alone does not reveal actionable business insights; experienced analysts and IT professionals must interpret it. Furthermore it’s up to business leaders to put their faith in their data, even if it goes against conventional wisdom.
Of course, the biggest problem companies confront with their data is the astronomical volume. While the A’s had mere buckets of data to pore through, the modern enterprise has to deal with a spewing fire hose of data. This constant influx of data generated by both humans and machines has paralyzed many companies, who often never analyze the data available to them or analyze it only reactively. Reactive data analysis, while useful to interpret what happened in the past, can’t necessarily provide insights into what might occur in the future. Remember your mutual fund disclaimer?
Innovation in business will stem from companies creating advantages via proactive use of that data. Case in point: Amazon’s new initiative to anticipate customers’ purchases and prepare shipping and logistics “ahead of time.”
The ability to be proactive with machine data won’t be driven simply by technology. It will instead stem from companies implementing their own strategic combination of machine learning and human knowledge. Achieving this balance to generate proactive data insights has been the goal of Sumo Logic since day one. While we have full confidence in our machine data intelligence technologies to do just that, we also know that is not the only solution that companies require. The future of data in the enterprise depends on how companies manage their data. If Billy Beane and Paul DePodesta effectively managed their data to alter the trajectory of the Oakland A’s, there is no reason that modern businesses cannot do the same.
This blog was published in conjunction with ‘Data Innovation Day’
01.16.2014 | Posted by Jim Wilson
Today I joined Sumo Logic, a cloud-based company that transforms Machine Data into new sources of operations, security, and compliance insights. I left NICE Systems, a market leader and successful organization that had acquired Merced Systems, where I led the Sales Organization for the past 6 years. I had a good position and enjoyed my role, so why leave? And why go to Sumo Logic versus many other options I considered? Many of my friends and colleagues have asked me this, so I wanted to summarize my thinking here.
First, I believe the market that Sumo Logic is trying to disrupt is massive. Sumo Logic, like many companies in Silicon Valley these days, manages Big Data. As Gartner recently noted, the concept of Big Data has now reached the peak of the Hype Cycle. The difference is that Sumo Logic actually does this by generating valuable insights from machine data (primarily log files). As a board member told me, people don’t create Big Data nearly as much as machines do. The emergence in the last 10+ years of cloud solutions, and the proliferation of the Internet and web-based technologies in everything we do, in every aspect of business, has created an industry that did not exist 10 years ago. By now it’s a foregone conclusion that cloud technologies and cloud vendors like Amazon Web Services and Workday will ultimately be the solution of choice for all companies, whether they are small mom-and-pop shops or large Global Enterprises. I wanted to join a company that was solving a problem that every company has, and doing it using the most disruptive platform, Software-as-a-Service.
Equally important is my belief that it’s possible to build a better sales team that can make a difference in the traditional Enterprise Sales Process. Sumo Logic competes in a massive market with only one established player, Splunk. I believe that our capabilities, specifically Machine Data Analytics, are truly differentiated in the market. However, I am also excited to build a sales team that customers and prospects will actually want to work with. Just like technology has evolved (client server, web, cloud) I believe the sales profession needs to as well. Today’s sales organization needs to add value to the sales process, not just get in the way. This means we need to understand more about the product than we describe on the company’s website, be able to explain how our product is different from other choices, and how our service will uniquely solve the complex problems companies face today. I am excited to build an organization that will have a reputation of being knowledgeable about the industry and its ecosystem, will challenge customer thinking while understanding their requirements, and will also be fun to work with. The team at Sumo Logic understands this, and I look forward to delivering on this promise.
Finally, I think Sumo Logic has a great product. I started my sales career at Parametric Technology Corporation (PTC). Selling Pro/ENGINEER was a blast and set the gold standard for great products – everything from watching reactions during demos to hearing loyal customers rave about the innovative work they were doing with the product. I had a similar experience at Groove Networks watching Ray Ozzie and his team build a great product that was ultimately acquired by Microsoft. Sumo Logic seems to be generating that same product buzz. We have some amazing brand names like Netflix, Orange, McGraw-Hill, and Scripps Networks as our customers. These and the other customers we have are generating significant benefits from using our machine data intelligence service. The best measure of a company is the passion of their customer base. The energy and loyalty that our customer base exhibits for the Sumo Logic service is a critical reason why I’m very bullish about the long-term opportunity.
I am fired up to be a part of this organization. The management team and in particular Vance, Mark, and the existing sales team are already off to a great start and have grown sales significantly. I hope to build on their early success, and I will also follow the advice a good friend recently gave me when he heard the news: “You found something good – don’t screw it up!”
01.14.2014 | Posted by Joan Pepin, Director of Security
Today we announced that Sumo Logic has successfully completed the Service Organization Controls (SOC) Type 2 examination of the Trust Service Principles: Security, Availability and Confidentiality. Frankly, this is a pretty big deal and something we have been working towards for a while (we achieved our SOC 2 Type 1 in August of 2012) so I’m here to explain a little bit about what that means for you.
In case you’re not familiar with the SOC 2 Type 2, it may help you to know that the SOC family of reports was implemented by the American Institute of Certified Public Accountants (the AICPA) as a replacement for the venerable old SAS-70 report back in 2011. (So if you’re still asking your vendors for their SAS-70, you’re a bit behind the times. I get this a lot, and it’s usually followed by questions about our backup tapes on security assessment paperwork that hasn’t been updated since it was noisily written in Lotus Notes(™) on this bad-boy…)
The main purpose of the SOC 2 Type 2 report is to show our customers that an independent third party has evaluated our controls and our adherence to those controls over a period of time. In the words of the AICPA, a SOC 2 report is ideal for:
“A Software-as-a-Service (SaaS) or Cloud Service Organization that offers virtualized computing environments or services for user entities and wishes to assure its customers that the service organization maintains the confidentiality of its customers’ information in a secure manner and that the information will be available when it is needed. A SOC 2 report addressing security, availability and confidentiality provides user entities with a description of the service organization’s system and the controls that help achieve those objectives. A type 2 report also helps user entities perform their evaluation of the effectiveness of controls that may be required by their governance process.”
The major areas of the SOC report are called “Trust Service Principles” because Trust is what this is all about. Once again in the words of the AICPA:
“Trust Services helps differentiate entities from their competitors by demonstrating to stakeholders that the entities are attuned to the risks posed by their environment and equipped with the controls that address those risks. Therefore, the potential beneficiaries of Trust Services assurance reports are consumers, business partners, creditors, bankers and other creditors, regulators, outsourcers and those using outsourced services, and any other stakeholders who in some way rely on electronic commerce (e-commerce) and IT systems.”
You know how you handle your data, but before you hand it over to someone else, you should know a good deal about how they are going to handle it. Because trust is based on openness, your data services vendors should be extremely open about that.
Because trust is an important factor in any business relationship, our report lists 263 controls around Security, Availability and Confidentiality put into effect at Sumo Logic and the tests that our examiners (the wonderful people at Brightline CPAs & Associates) performed. This is an extremely thorough overview of what we do to ensure that we deserve your trust, and if you are considering sending us your data, you should ask us for a copy and look it over. And if you are considering any of our competitors, you should also ask to see their third-party assessment. (Hint: They don’t have one.)
01.08.2014 | Posted by Manish Khettry
Here at Sumo Logic, we run a log management service that ingests and indexes many terabytes of data a day; our customers then use our service to query and analyze all of this data. Powering this service are a dozen or more separate programs (each of which I will call an assembly from now on), running in the cloud and communicating with one another. For instance, the Receiver assembly is responsible for accepting log lines from collectors running on our customers’ host machines, while the Index assembly creates text indices for the massive amount of data that the Receivers constantly feed into our system.
We deploy to our production system multiple times each week, while our engineering teams are constantly building new features, fixing bugs, improving performance, and, last but not least, working on infrastructure improvements to help in the care and wellbeing of this complex big-data system. How do we do it? This blog post tries to explain our (semi)-continuous deployment system.
Running through hoops
In any continuous deployment system, you need multiple hoops that your software must pass through before you deploy it for your users. At Sumo Logic, we have four well-defined tiers, Night, Stage, Long and Production, with clear deployment criteria for each. A tier is an instance of the entire Sumo Logic service where all the assemblies are running in concert, along with all the monitoring infrastructure (health checks, internal administrative tools, auto-remediation scripts, etc.) watching over it.
Night is the first step in the sequence of steps that our software goes through. Originally intended as a nightly deploy, we now automatically deploy the latest clean builds of each assembly on our master branch several times every day. A clean build means that all the unit tests for the assemblies pass. In our complex system, however, it is the interaction between assemblies which can break functionality. To test these interactions, we have a number of integration tests running against Night regularly. Any failures in these integration tests are an early warning that something is broken. We also have a dedicated person troubleshooting problems with Night, whose responsibility it is, at the very least, to identify and file bugs for problems.
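The "latest clean build of each assembly" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Build` record, its field names and the `latest_clean_builds` helper are hypothetical and are not taken from Sumo Logic's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    # Hypothetical record of one CI build; field names are assumptions.
    assembly: str            # e.g. "receiver" or "index"
    number: int              # increases with every build of that assembly
    unit_tests_passed: bool  # True only if every unit test passed


def latest_clean_builds(builds):
    """Return the newest build per assembly whose unit tests all passed."""
    latest = {}
    for build in builds:
        if not build.unit_tests_passed:
            continue  # a build with failing unit tests is never a candidate
        current = latest.get(build.assembly)
        if current is None or build.number > current.number:
            latest[build.assembly] = build
    return latest


if __name__ == "__main__":
    history = [
        Build("receiver", 101, True),
        Build("receiver", 102, False),  # newest receiver build is not clean
        Build("index", 57, True),
        Build("index", 58, True),
    ]
    for name, build in sorted(latest_clean_builds(history).items()):
        print(name, build.number)
```

In this sketch a broken newer build simply leaves the previous clean build as the deploy candidate for that assembly.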
We cut a release branch once a week and use Stage to test this branch much as we use Night to keep master healthy. The same set of integration tests that run against Night also run against Stage and the goal is to stabilize the branch in readiness for a deployment to production. Our QA team does ad-hoc testing and runs their manual test suites against Stage.
Right before Production is the Long tier. We consider this almost as important as our Production tier. The interaction between Long and Production is well described in this webinar given by our founders. Logs from Long are fed to Production and vice versa, so Long is used to monitor and troubleshoot problems with Production.
Deployments to Long are done manually a few days before a scheduled deployment to Production from a build that has passed all automated unit tests as well as integration tests on Stage. While the deployment is manually triggered, the actual process of upgrading and restarting the entire system is about as close to a one-button-click as you can get (or one command on the CLI)!
After Long has soaked for a few days, we manually deploy the software running on Long to Production, the last hoop our software has to jump through. We aim for a full deployment every week and will often do smaller upgrades of our software between full deploys.
Being Production, this deployment is closely watched and there are a fair number of safeguards built into the process. Most notably, we have two dedicated engineers who manage this deployment, with one acting as an observer. We also have a teleconference with screen sharing that anyone can join to observe the deploy process.
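The promotion criteria for the four tiers can be summarized as a small gate function. The sketch below is a simplified reading of the description above, not the real deployment shell: the `Candidate` fields, the `can_deploy` and `furthest_tier` helpers and the `MIN_SOAK_DAYS` value are assumptions made for illustration (the post only says "a few days"), and it ignores the master versus release branch distinction.

```python
from dataclasses import dataclass

# Tier names come from the post; the order is the order of the hoops.
TIERS = ["Night", "Stage", "Long", "Production"]

# Assumption: the post says "a few days" without giving a number.
MIN_SOAK_DAYS = 2


@dataclass
class Candidate:
    # Hypothetical summary of what is known about one build.
    unit_tests_passed: bool = False
    stage_integration_passed: bool = False
    long_soak_days: int = 0


def can_deploy(tier, candidate):
    """Return True if the candidate meets the criteria for the given tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    if not candidate.unit_tests_passed:
        return False  # every tier requires a clean build
    if tier in ("Night", "Stage"):
        return True
    if tier == "Long":
        return candidate.stage_integration_passed
    # Production: passed on Stage and soaked on Long for long enough.
    return (candidate.stage_integration_passed
            and candidate.long_soak_days >= MIN_SOAK_DAYS)


def furthest_tier(candidate):
    """Return the last tier the candidate qualifies for, or None."""
    allowed = [tier for tier in TIERS if can_deploy(tier, candidate)]
    return allowed[-1] if allowed else None


if __name__ == "__main__":
    print(furthest_tier(Candidate(True, True, 3)))
```

Deployments to Long and Production are still triggered by a person, so a check like this would only tell the engineer whether a build is eligible, not start the deploy.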
Closely associated with the software infrastructure are the social aspects of keeping this system running. These are:
We have well-defined ownership of these tiers within engineering and devops, and that ownership rotates weekly. An engineer is designated Primary and is responsible for Long and Production. Similarly, we have a designated Jenkins Cop role to keep our continuous integration system and Night and Stage healthy.
Group decision making and notifications
We have a short standup every day before lunch, which everyone in engineering attends. The Primary and Jenkins Cop update the team on any problems or issues with these tiers from the previous day.
In addition to a physical meeting, we use Campfire to discuss ongoing problems and notify others of changes to any of these tiers. If someone wants to change a configuration property on Night to test a new feature, that person would update everyone else on Campfire. Everyone (and not just the Primary or Jenkins Cop) is in the loop about these tiers and can jump in to troubleshoot problems.
Automate almost everything. A checklist for the rest.
There are certain things that are done or triggered manually. In cases where humans operate something (a deploy to Long or Production for instance), we have a checklist for engineers to follow. For more on checklists, I refer you to an excellent book, The Checklist Manifesto.
This system has been in place since Sumo Logic went live and has served us well. It bears mentioning that the key to all of this is automation, uniformity, and well-delineated responsibilities. For example, spinning up a complete system takes just a couple of commands in our deployment shell. Also, any deployment (even a personal one for development) comes up with everything pre-installed and running, including health checks, monitoring dashboards and auto-remediation scripts. Identifying and fixing a problem on Production is no different from doing so on Night. In almost every way (except for the sizing and for waking up the Jenkins Cop in the middle of the night), these are identical tiers!
While automation is key, it doesn’t take away from the fact that it is people who run the system and keep things healthy. A deployment to Production can be stressful, more so for the Primary than for anyone else, and having a well-defined checklist can take away some of the stress.
Any system like this needs constant improvement, and since we are not sitting idle, there are dozens of features, big and small, that need to be worked on. Two big ones are:
Red-Green deployments, where new releases are rolled out to a small set of instances and, once we are confident they work, pushed to the rest of the fleet.
More frequent deployments of smaller parts of the system. Smaller, more frequent deployments are less risky.
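The first of these ideas, upgrading a small set of instances before the rest of the fleet, might look roughly like the Python sketch below. It is a hypothetical illustration only: `staged_rollout`, its `deploy` and `is_healthy` callbacks and the `first_batch` parameter are assumed names, and a real health check would watch the instances for some time rather than ask once.

```python
def staged_rollout(instances, deploy, is_healthy, first_batch=1):
    """Upgrade a small first batch, then the rest only if the batch is healthy.

    Returns (upgraded_instances, completed). `deploy` and `is_healthy` are
    caller-supplied callbacks that each take one instance.
    """
    if first_batch < 1:
        raise ValueError("first_batch must be at least 1")
    pilot, rest = instances[:first_batch], instances[first_batch:]

    for instance in pilot:
        deploy(instance)

    # Stop before touching the rest of the fleet if any pilot looks unhealthy.
    if not all(is_healthy(instance) for instance in pilot):
        return list(pilot), False

    for instance in rest:
        deploy(instance)
    return list(instances), True


if __name__ == "__main__":
    fleet = ["node-1", "node-2", "node-3"]  # made-up instance names
    upgraded, completed = staged_rollout(
        fleet,
        deploy=lambda name: print("deploying to", name),
        is_healthy=lambda name: True,
    )
    print(upgraded, completed)
```

If the first batch fails its check, the rest of the fleet keeps running the previous release, which is what makes this kind of rollout less risky.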
In other words, there is a lot of work to do. Come join us at Sumo Logic!