What are some of the key skills that highly effective operations teams possess? What makes one operations team much more effective than another? Here is my take on a few of the key traits highly effective operations teams have.

Trait 1: React with speed and effectiveness

The quality of service depends on how quickly you can identify and resolve an issue. Mean Time to Identify (MTTI) and Mean Time to Resolution (MTTR) are two key metrics that will tell you how you are doing on this dimension. Your team should track those two numbers according to the impact the issues have on customers. It’s okay if the numbers are higher for low-impact issues, but for high-impact issues they should be low. Having good visibility into what is going on deep within the system is key to resolving issues within minutes as opposed to hours or days.

Trait 2: Proactively monitor – Actively look for failures

Teams at this level are hypersensitive about failures that went undetected. The goal is to lower the frequency of issues that are missed by the monitoring and alerting framework. Rather than tracking a single number, break this metric down by event severity. Over time, the number of really high severity issues missed by monitoring and alerting should go down from “frequently” to “very rare”. If your team is small, you might want to focus on a particular set of KPIs and develop full-loop capability for detecting, alerting on and remediating those issues.

Trait 3: Build really good playbooks – Know how to respond when an alert happens

High levels of monitoring and alerting can quickly lead to “alert fatigue.” To reduce this, create playbooks that are easy to find and execute. Playbooks are simplified lists of steps that tell an operator how to respond when an alert happens. They should be written so that executing them requires zero thinking from the operator. And remember to make those playbooks really easy to discover. Or heck, put a link to the playbook in the alert itself.
Trait 4: Do retrospectives – Learn from failures

Failures are inevitable. Something will always go wrong; your goal is to avoid repeating the same failure. To go one step beyond Trait 2, look at each issue and ask what it was about the process, architecture or people that led to the failure. Was the time it took to resolve the issue acceptable? If not, what can be done to reduce it? Can we automate some of the identification and resolution steps to reduce MTTI and MTTR?

Teams can get really good at this by building a culture of blameless post mortems that focus relentlessly on finding the root cause. If a team doesn’t truly understand the root cause, they can’t be sure that the issue is fixed. And if you aren’t sure that you have fixed the issue, you cannot be sure that it won’t happen again. Ask yourself the five whys until you get to the root cause. Sometimes five is not enough. You have to really get down to the core issue, an issue that you can fix. If you cannot fix it right away, at least make sure you can detect it and recover from it very quickly, hopefully without any impact to the service.

Trait 5: Build resiliency into the system – Make use of auto-healing systems

Having said all of the above, many of the issues that turn into operational nightmares can be caught and taken care of at design time. Make the ability to run the service at high quality a key requirement from the get-go. Otherwise you will pay for bad design and architectural choices several times over by the time the service is generally available and in use.
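To make the two metrics from Trait 1 concrete, here is a minimal sketch of how a team might compute MTTI and MTTR broken down by severity. Everything in it is an assumption made for illustration: the `Incident` record, the severity labels, the sample timestamps, and the choice to measure both metrics from the moment the problem started.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    severity: str          # hypothetical labels such as "high" or "low"
    started: datetime      # when the problem began
    identified: datetime   # when the team recognized the problem
    resolved: datetime     # when service was restored


def mean_times_by_severity(incidents):
    """Return {severity: (MTTI, MTTR)}, both expressed in minutes."""
    grouped = defaultdict(list)
    for incident in incidents:
        grouped[incident.severity].append(incident)

    result = {}
    for severity, group in grouped.items():
        # Assumption: both metrics are measured from the incident start.
        mtti = mean((i.identified - i.started).total_seconds() / 60 for i in group)
        mttr = mean((i.resolved - i.started).total_seconds() / 60 for i in group)
        result[severity] = (mtti, mttr)
    return result


# Made-up sample data, for illustration only.
incidents = [
    Incident("high", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 30)),
    Incident("high", datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 15), datetime(2024, 1, 2, 10, 0)),
    Incident("low", datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 12, 0)),
]

for severity, (mtti, mttr) in mean_times_by_severity(incidents).items():
    print(f"{severity}: MTTI={mtti:.0f} min, MTTR={mttr:.0f} min")
```

Splitting the numbers by severity keeps a long-running low-impact issue from hiding a slow response to a high-impact one.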
In 1965, Dr. Hubert Dreyfus, a professor of philosophy at MIT and later at Berkeley, was hired by RAND Corporation to explore the issue of artificial intelligence. He wrote a 90-page paper called “Alchemy and Artificial Intelligence” (later expanded into the book What Computers Can’t Do) questioning the computer’s ability to serve as a model for the human brain. He also asserted that no computer program could defeat even a 10-year-old child at chess.

Two years later, in 1967, several MIT students and professors challenged Dreyfus to play a game of chess against MacHack (a chess program that ran on a PDP-6 computer with only 16K of memory). Dreyfus accepted. Dreyfus found a move that could have captured the enemy queen. The only way the computer could get out of this was to keep Dreyfus in check with its own queen until it could fork his queen and king, and then exchange them. And that’s what the computer did. The computer checkmated Dreyfus in the middle of the board.

I’ve brought up this “man vs. machine” story because I see another domain where a similar change is underway: the field of Machine Data. Businesses run on IT, and IT infrastructure is getting bigger by the day, yet IT operations still remain very dependent on analytics tools with very basic monitoring logic. As systems become more complex (and more agile), simple monitoring just doesn’t cut it. We cannot support or sustain the necessary speed and agility unless the tools become much more intelligent.

We believed in this when we started Sumo Logic and, with what we have learned from running a large-scale system ourselves, we continue to invest in making operational tooling more intelligent. We knew the market needed a system that complemented human expertise. Humans don’t scale that well, and our memory is imperfect, so the ideal tools should pick up on signals that humans cannot, at a scale that matches the needs of the business and today’s volume of IT data exhaust.
Two years ago we launched our service with a pattern recognition technology called LogReduce, and about five months ago we launched Structure Based Anomaly Detection. The last three months of the journey have been a lot like teaching a chess program new tricks: the game remains the same, but the system keeps getting better at it and more versatile.

We are now extending our Structure Based Anomaly Detection capabilities with Metric Based Anomaly Detection. A metric is just that: a time series of numerical values. You can take any log and filter, aggregate and pre-process it however you want. If you can turn it into a number with a time stamp, we can baseline it and automatically alert you when the current value of the metric goes outside an expected range based on its history. We developed this new engine in collaboration with the Microsoft Azure Machine Learning team, who have some really compelling models for detecting anomalies in a time series of metric data.

The hard part of Anomaly Detection is not detecting anomalies; it is detecting anomalies that are actionable. Making an anomaly actionable begins with making it understandable. Once an analyst or an operator can grok the anomalies, they are much more willing to alert on them, build a playbook around them, or even hook up automated remediation to the alert, which is the Holy Grail.

And not all Anomaly Detection engines are equal. As with chess programs, there are ones that can beat a 5-year-old and others that can beat even the grandmasters. We are well on our way to building a comprehensive Anomaly Detection engine that becomes a critical tool in every operations team’s arsenal. The key question to ask is: does the engine tell you something that is insightful, actionable, and that you could not have found with standard monitoring tools?
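To illustrate what “baselining” a metric and alerting when it leaves an expected range can mean, here is a minimal sketch. It is not the Sumo Logic engine or the Azure Machine Learning models; a plain rolling mean and standard deviation band is swapped in as a stand-in. The function name `find_anomalies`, its parameters, and the sample garbage-collection numbers are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0, min_band=0.0):
    """Flag points outside mean +/- threshold * stdev of the previous
    `window` values. `series` is a list of (timestamp, value) pairs."""
    anomalies = []
    for idx in range(window, len(series)):
        # Baseline is built only from the points that came before this one.
        history = [value for _, value in series[idx - window:idx]]
        baseline = mean(history)
        # min_band keeps a perfectly flat history from producing a zero-width band.
        band = max(threshold * stdev(history), min_band)
        timestamp, value = series[idx]
        if abs(value - baseline) > band:
            anomalies.append((timestamp, value))
    return anomalies


# Made-up metric: percent of time spent in garbage collection, per minute.
gc_time = [(0, 2.0), (1, 2.2), (2, 1.9), (3, 2.1), (4, 2.0),
           (5, 2.1), (6, 9.5), (7, 2.0)]

print(find_anomalies(gc_time))  # [(6, 9.5)]
```

A band this simple flags any spike, whether or not anyone can act on it, which is exactly the gap between detecting anomalies and detecting actionable ones.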
Below is an example of an actual Sumo production use case where some of our nodes were spending a lot of time in garbage collection, impacting dashboard refresh rates for some of our customers.

If this looks interesting, our Metric Based Anomaly Detection service based on Azure Machine Learning is being offered to select customers in a limited beta release and will be coming soon to machines... err... a browser near you (we are a cloud-based service, after all).

P.S. If you like stories, here is another one for you. Thirty years after MacHack beat Dreyfus, in 1997, Kasparov (arguably one of the best human chess players) played the Caro-Kann Defence. He then allowed Deep Blue to commit a knight sacrifice, which wrecked his defenses and forced him to resign in fewer than twenty moves. Enough said.

References
http://www.chess.com/article/view/machack-attack
Ever since I was a kid, I have been fascinated by chess-playing programs, until it got to the point where it became impossible for me to beat a good one. And years ago, not long after I gave up my personal fight with them, the last man standing lost to the best chess-playing program. Clearly, for chess, machine intelligence overtook human intelligence that day.

Another area where machine intelligence has evolved to the point where it is better than human intelligence is maps. I used to have to carry a road atlas with me or risk spending a lot of time just finding my way back on track. It got a little better when you could take a printout, but if I missed an exit or wanted to take a scenic detour, I was on my own again. Not any more. Now I can simply plug in my phone, speak the next destination, and it guides me patiently there, recalculating the route if I miss an exit and, heck, even warning me when the route is blocked with traffic.

These are just a couple of examples of how technology evolves to a point that would have seemed a sci-fi fantasy 10-15 years ago. And it fundamentally changes how we all go about our lives.

Machine Data Analytics seems like another area desperately in need of a similar evolution. Machine Data Analytics has to evolve into Machine Data Science, and it has to evolve to the point where we depend on it and use it just as I rely on maps and navigation on my cell phone. Sumo Logic is at the forefront of making that change happen, and there are some fundamental shifts in computing technology that bring that breakthrough within reach.

Cloud has become as mainstream as video streaming. And just as video streaming completely disrupted brick-and-mortar DVD rental businesses, Cloud has already brought, and continues to bring, fundamentally disruptive technologies to life. So what does Cloud mean for Machine Data? It will be about generating sophisticated insights from the data generated by IT today.
Machine data is already one of the biggest sources of “Big Data” in enterprises. It will be about delivering smarter analytics at scale with the simplicity of a service.

As the new year begins, I feel proud and satisfied with what we have accomplished in the last two and a half years. And I am super excited about the journey ahead: a future is waiting to be invented. 🙂