Brandon Mensing

Posts by Brandon Mensing


No. More. Noise.

756 ‘real’ emails this week, 3,200+ more generated by machines, and approximately 78,000 push notifications, to-dos, Jira tasks, and text messages – all asking for a piece of your attention. Oh, did you want to get some work done? There’s also a line of people waiting to interrupt you or ask someone near you a question while you’re mentally skydiving through the 17-layer dip that is your codebase. One more little sound and it all vanishes from your memory. But we’re not actually going to talk about cats (much) or their noises. Today let’s battle the noise that is your infrastructure.

Have you ever tried setting up an alert on some key performance indicator (or KPI, as so many like to say)? It’s easy – alert me when the request volume goes over 6,000. Per node. Wait, no – per cluster. OK, 14,000 per cluster. No, 73,333 per balancer. OK, now 87,219. Never mind – I never want to wake up for this ever again; just alert me if the entire service stops existing. Done.

Luckily, I have a great solution for you! Today, right now even, you can parse out your favorite KPI and run a simple operator to find points in time where that indicator exceeds a dynamic and mathematically awesome boundary, telling you with high likelihood that something important actually happened. That new operator is called Outlier, and it does exactly what you hope it does. Let’s look at an example:

parse "bytes: *," as bytes
| timeslice 5m
| sum(bytes) as sumbytes by _timeslice
| outlier sumbytes

You’ve already written the first 3 lines yourself many times, but that last line does all the extra magic for you, showing you exactly where the thing went off the rails. What’s more, you can even tune this to meet your needs by looking for a sustained change (the ‘consecutive’ parameter), a positive change only (the ‘direction’ parameter), and even the size of the thresholds shown here with blue shading (the ‘threshold’ parameter).
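If you’re curious what a dynamic boundary like this can look like under the hood, here is a minimal Python sketch of one common approach – a rolling mean/stddev band. This is an illustrative re-implementation of the general idea, not Sumo Logic’s actual algorithm, and the function and parameter names are made up for the example.

```python
import statistics

# Illustrative sketch only (not Sumo Logic's actual implementation): flag a
# point as an outlier when it falls outside mean +/- threshold * stddev,
# both computed over a trailing window of earlier points.
def rolling_outliers(values, window=10, threshold=3.0):
    flags = []
    for i, value in enumerate(values):
        past = values[max(0, i - window):i]
        if len(past) < 2:
            flags.append(False)  # not enough history to form a band yet
            continue
        mean = statistics.mean(past)
        stdev = statistics.stdev(past)
        flags.append(abs(value - mean) > threshold * stdev)
    return flags
```

In this sketch, a ‘direction’ knob would compare the signed difference (value - mean) instead of the absolute one, and a ‘consecutive’ knob would require several flagged points in a row before alerting.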
Our documentation will help you get the most out of this great new operator, but before we move on, you should also note that you can use it to identify outliers across multiple streams of data with a single query. This means you can get an alert if one of your deployments or hosts goes outside of its thresholds – where those thresholds are dynamic and specific to that deployment/host!

parse "bytes: *," as bytes
| timeslice 5m
| sum(bytes) as sumbytes by _timeslice, _sourceHost
| outlier sumbytes by _sourceHost
| where _sumbytes_violation=1

That last line is the logic you need to eliminate the non-outliers; then all you need to do is set up a saved search alert to get your noise-free alerts going. Use the conditional alert where the number of results > 0 and you’ll only see alerts when you have a real problem! And when it turns out that you get a spurious alert anyway, you can always come back and adjust threshold, consecutive, direction, and window to make things right again.

And now, with no shyness about the abrupt segue, how would you like to see into the future as well? I can help with that too – the new Predict operator will show you a linear projection into the future. Now that you’ve become a master of alerts with Outlier, just imagine what sort of power you can wield by getting an alert before the disk fills up on your DB box. Just as with Outlier, you can configure a scheduled search alert to become the ultimate master of DevOps while befriending unicorns and defying flying tacos. But just to be clear – you’ll get that alert before the outage begins, so you can avoid posting awful things on your status page and filing post-mortems with your managers. This is why people will assume you’re magical. As always, the documentation is an amazing place to get started, but for now I’ll leave you with this example and kindly ask that you get in touch to let us know what you think after you’ve tried out these two fantastic operators!
…| parse "diskfree: *," as diskfree
| timeslice 5m
| sum(diskfree) as sum_disk by _timeslice
| predict sum_disk by 5m forecast=20
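Since Predict is described as a linear projection, the idea can be sketched as an ordinary least-squares fit extrapolated forward. The code below is a hypothetical re-implementation for illustration – the operator’s real algorithm may differ, and the function name is invented.

```python
# Hypothetical sketch of a linear projection like Predict's (the operator's
# real algorithm may differ): fit y = intercept + slope * x by ordinary least
# squares over (timeslice index, value) pairs, then extrapolate forward.
def linear_forecast(values, steps_ahead):
    n = len(values)
    mean_x = (n - 1) / 2            # mean of the indices 0..n-1
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
        / sum((x - mean_x) ** 2 for x in range(n))
    )
    intercept = mean_y - slope * mean_x
    # project the next `steps_ahead` timeslices past the last observed one
    return [intercept + slope * (n - 1 + k) for k in range(1, steps_ahead + 1)]
```

With steadily shrinking free disk – say [100, 90, 80, 70, 60] – the projected values continue the downward trend, which is exactly the sort of slope you would want to alert on before the disk actually fills.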

March 11, 2015


A New Look for Your Data

As some of you may have seen by now, we’ve launched a fantastic new look and feel for our Dashboards. From top to bottom, we’ve added more unicorns, rainbows, and cat-meme-powered visualizations. The end result? We hope you’ll agree: it’s a breath of fresh air that brings a bit more power to your analytics inside Sumo Logic.

Do you find yourself accidentally moving Panels around when all you wanted to do was look at your Dashboard? Worry no more. We have introduced a clear Edit mode that simplifies the whole process of making your Dashboard look perfect. With the new Edit mode, you can unleash your creative side and move and resize the Panels on your Dashboard. You can make one Panel take up most of the screen and surround it with smaller Panels. Mix and match big Panels and small until you get the perfect balance of DevOps goodness (it’s also great feng shui).

And for those of you who want that edgy, uber-geek look – meet our new Light and Dark themes. Use the gear icon and select the menu item Toggle Theme to switch over to this great new option, pick your side, and may the force be with you.

Over the years as we’ve gathered feedback, we kept hearing over and over how much teams wanted to be able to add simple blobs of text next to their charts. For some, this was a matter of providing really important references to SOPs and other important-sounding documents. For others, they really just needed a note to “Sound all the alarms and pray to your various gods if this ever goes above 17.8756”. You get the idea – a little extra context makes all the difference – and now you can put that right in your Dashboards! Just click the Add Panel button while in Edit mode and you can add a Panel just for title text or a text block. And did you want icing on this cake? Markdown. That’s right – you’re just a few dozen asterisks away from the perfect nested list.

We also took some time to brush up some of our favorite chart types. Been wishing for an easier-to-read Single Value Monitor? Done. Ever wished your pie charts looked cooler? Well, we added Donut charts to spice things up.

Our colleagues in the apps department couldn’t wait to get their hands on this. Since we all know and love AWS and the essential functionality that our AWS apps provide, we decided those were a great place to start with a bit of a refresh as well. These apps now feature a more uniform Overview dashboard and better visualizations for key data points – and they also look pretty cool. So what do you think of the new Dashboards and AWS apps? Love it or hate it – let us know!

February 2, 2015


Use AWS CloudTrail to Avoid a Console Attack

Our app for AWS CloudTrail now offers a dashboard specifically for monitoring console login activity. In the months since the AWS team added this feature, we decided to break out these user activities in order to provide better visibility into what’s going on with your AWS account. Many of you might think of this update as incremental and not newsworthy, but I’m actually writing here today to tell you otherwise! More and more people are using APIs and CLIs (and third parties) to work with AWS outside the console. As console logins become rarer and more business-critical assets are deployed in AWS, it’s critical to always know who’s logged into your console and when.

For a great and terrifying read about just how badly things can go wrong when someone gains access to your console, look no further than the story of Code Spaces. With one story opening with “was a company” and another “abruptly closed,” there isn’t exactly a lot of suspense about how things turned out for this company. After attackers managed to gain access to Code Spaces’ AWS console, they built themselves a stronghold of backdoors and began an attempt to extort money from the company. When the attackers’ accounts were removed, they quickly used the additional users they had created to get back in and begin taking out infrastructure and data. With the service down and their customers’ data in disarray, all trust in the product was lost. The company was effectively destroyed in a matter of hours.

The new dashboard in our updated CloudTrail app allows you to quickly see who’s attempting to log in to your console, from where, and whether or not they’re using multi-factor authentication (which we highly recommend). If you haven’t installed the app previously, be sure to follow the simple steps in our documentation to set up the appropriate permissions in AWS. If you have already installed the app, you can install it again to get a new copy with the additional dashboard included. From there, we encourage you to customize the queries for your specific situation and even consider setting up a scheduled search to alert you to a problematic situation.

Keeping an eye out for suspicious activity on your AWS console can provide invaluable insight. As attackers get more sophisticated, it’s harder and harder to keep your business secure and operational. With the help of Sumo Logic and logs from AWS CloudTrail, you can stay ahead of the game by preventing the most obvious (and most insidious) types of breaches. With functionality like this, perhaps Code Spaces would still be in business.
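To make the underlying check concrete, here is a small Python sketch of the kind of filtering such a query performs: scanning CloudTrail records for ConsoleLogin events made without MFA. The field names (eventName, additionalEventData.MFAUsed, userIdentity) follow the CloudTrail event format; the function itself is a hypothetical illustration, not part of the app.

```python
# Hypothetical illustration of the dashboard's logic: given parsed CloudTrail
# records (dicts), return the user names behind console logins without MFA.
def non_mfa_console_logins(records):
    suspicious = []
    for event in records:
        if event.get("eventName") != "ConsoleLogin":
            continue  # ignore API calls and other event types
        mfa = event.get("additionalEventData", {}).get("MFAUsed")
        if mfa != "Yes":
            suspicious.append(event.get("userIdentity", {}).get("userName"))
    return suspicious
```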


November 11, 2014


Using the Join Operator

The powerful analytics capabilities of the Sumo Logic platform have always provided great insight into your machine data. Recently we added a new operator – bringing the essence of a SQL JOIN to your stream of unstructured data and giving you even more flexibility.

In a standard relational join, the datasets in the tables to be joined are fixed at query time. However, matching up IDs between log messages from different days within your search timeframe likely produces the wrong result, because actions performed yesterday should not be associated with a login event that occurred today. For this reason, our Join operator provides a specified moving timeframe within which to join log messages. In the diagram below, the pink and orange represent two streams of disparate log messages. They both contain a key/value pair that we want to match on, and the messages are only joined on that key/value when they both occur within the time window indicated by the black box.

Now let’s put this to use. Suppose an application has both real and machine-controlled users. I’m interested in knowing which users are which so that I can keep an eye out for any machine-controlled users that are impacting performance, so I have to find a way to differentiate between real and machine-controlled users. As it turns out, the human users create requests at a reasonably low rate, while the machine-controlled users (accessing via an API) are able to generate several requests per second, always immediately after the login event. In these logs, there are several different messages coming in with varying purposes and values. Using Join, I can query for both the logins and the requests and then restrict the time window of the matching logic to combine the two message streams. The two subqueries in my search will look for request/query events and login events, respectively.

I’ve restricted the match window to just 15 seconds so that I’m finding the volume of requests that occur very close to the login event. Then I’m filtering out users who made fewer than 10 requests in that 15-second window following a login. The result is a clear view of the users that are actively issuing a large volume of requests via the API immediately upon logging in. Here is my example query:

(login or (creating query))
| join
    (parse "Creating query: '*'" as query, "auth=User:*:" as user) as query,
    (parse "Login success for: '*'" as user) as login
  on query.user = login.user
  timewindow 15s
| count by query_user
| where _count > 10
| sort _count

As you can see from the syntax above, the subqueries are written with the same syntax as ordinary searches and even support the use of aggregates (count, sum, average, etc.) so that you can join complex results together and achieve the insights you need. And of course, we support joining more than just two streams of logs – combining all your favorite data into one query!
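To see why the timewindow matters, here is a minimal Python sketch of a time-windowed join – a hypothetical illustration of the concept, not Sumo Logic’s engine: pair each login with the same user’s requests that arrive within the window, then count matches per user.

```python
# Hypothetical sketch of a time-windowed join: only requests by the same user
# within `window` seconds after a login are matched, mirroring "timewindow 15s".
def join_within_window(logins, requests, window=15):
    # logins and requests are lists of (timestamp_seconds, user) tuples
    counts = {}
    for login_ts, user in logins:
        for req_ts, req_user in requests:
            if req_user == user and login_ts <= req_ts <= login_ts + window:
                counts[user] = counts.get(user, 0) + 1
    return counts
```

Filtering the result for counts above 10 mirrors the "where _count > 10" step: an API client firing a dozen requests right after login stands out, while a human issuing one or two does not.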