
Stefan Zier

Stefan was Sumo’s first engineer and Chief Architect. He enjoys working on cloud plumbing and is plotting to automate his job fully, so he can spend all his time skiing in Tahoe.

Posts by Stefan Zier

Blog

Benchmarking Microservices for Fun and Profit

Why should I benchmark microservices?

The ultimate goal of benchmarking is to better understand the software and to test the effects of various optimization techniques for microservices. In this blog, we describe our approach to benchmarking microservices here at Sumo Logic.

Create a spreadsheet for tracking your benchmarking

We found that a convenient way to document a series of benchmarks is a Google Spreadsheet. It allows collaboration and provides the necessary features to analyze and sum up your results. Structure your spreadsheet as follows:
- Title page: goals, methodology, the list of planned and completed experiments (evolving as you learn more), and insights.
- Additional pages: detailed benchmark results for the various experiments.

Be clear about your benchmark goals

Before you engage in benchmarking, clearly state (and document) your goal. Examples of goals are:
- "I am trying to understand how input X affects metric Y."
- "I am running experiments A, B and C to increase/decrease metric X."

Pick one key metric (Key Performance Indicator, or KPI)

State clearly which one metric you are concerned about and how that metric affects users of the system. If you choose to capture additional metrics for your test runs, ensure that the key metric stands out.

Think like a scientist

You're going to be performing a series of experiments to better understand which inputs affect your key metric, and how. Consider and document the variables you devise, and create a standard control set to compare against. Design your series of experiments so that it produces that understanding with the least amount of time and effort.

Define, document and validate your benchmarking methodology

Define a methodology for running your benchmarks. It is critical that your benchmarks be:
- Fairly fast (several minutes, ideally)
- Reproducible in the exact same manner, even months later
- Documented well enough that another person can repeat them and get identical results

Document your methodology in detail, and also document how to re-create your environment. Include all details another person needs to know: versions used, feature flags and other configuration, instance types and any other environmental details.

Use load generation tools, and understand their limitations

In most cases, to accomplish repeatable, rapid-fire experiments, you need a synthetic load generation tool. Find out whether one already exists; if not, you may need to write one. Understand that load generation tools are at best an approximation of what is going on in production. The better the approximation, the more relevant the results you're going to obtain. If you find yourself drawing insights from benchmarks that do not translate into production, revisit your load generation tool.

Validate your benchmarking methodology

Repeat a baseline benchmark at least 10 times and calculate the relative standard deviation over the results. You can use the following spreadsheet formula: =STDEV(<range>)/AVERAGE(<range>). Format this number as a percentage, and you'll see how big the relative variance in your result set is. Ideally, you want this value to be below 10%. If your benchmarks have larger variance, revisit your methodology. You may need to tweak factors like:
- Increasing the duration of the tests.
- Eliminating variance from the environments.
- Ensuring all benchmarks start in the same state (e.g. cold caches, freshly launched JVMs, etc.).
- Considering the effects of HotSpot/JIT compilation.
- Simplifying or stubbing components and dependencies on other microservices that add variance but aren't key to your benchmark. Don't be shy about making hacky code changes and pushing binaries you'd never ship to production.
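The same check can be scripted if a spreadsheet is inconvenient. Below is a minimal Scala sketch (the object name and the sample numbers are made up for illustration) that computes the same relative standard deviation as the spreadsheet formula above:

object VarianceCheck {
  // Relative standard deviation: sample standard deviation divided by the mean,
  // equivalent to =STDEV(<range>)/AVERAGE(<range>) in a spreadsheet.
  def relativeStdDev(results: Seq[Double]): Double = {
    val mean = results.sum / results.size
    val variance = results.map(r => math.pow(r - mean, 2)).sum / (results.size - 1)
    math.sqrt(variance) / mean
  }

  def main(args: Array[String]): Unit = {
    // Ten hypothetical baseline runs (e.g. queries per second).
    val baseline = Seq(1023.0, 997.0, 1011.0, 1048.0, 989.0, 1002.0, 1030.0, 995.0, 1019.0, 1007.0)
    println(f"Relative standard deviation: ${relativeStdDev(baseline) * 100}%.1f%%")
  }
}

If this number creeps above roughly 10%, tighten the methodology before trusting any comparisons.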
Important: Determine the number of results you need to get the standard deviation below a good threshold, and run each of your actual benchmarks at least that many times. Otherwise, your results may be too random.

Execute the benchmark series

Now that you have developed a sound methodology, it's time to gather data. Tips:
- Only vary one input/knob/configuration setting at a time.
- For every run of the benchmark, capture the start and end time. This will help you correlate it to logs and metrics later.
- If you're unsure whether the input will actually affect your metric, try extreme values to confirm it's worth running a series.
- Script the execution of the benchmarks and the collection of metrics.
- Interleave your benchmarks to make sure that what you're observing isn't a slow drift in your test environment. Instead of running AAAABBBBCCCC, run ABCABCABCABC (see the sketch below).
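Generating that interleaved order is simple to script. Here is a minimal Scala sketch (the experiment names and the print placeholder are illustrative) of a driver that schedules runs ABCABC-style:

object InterleavedSchedule {
  // Produce an interleaved run order: ABCABC... rather than AAAABBBBCCCC.
  def schedule(experiments: Seq[String], repetitions: Int): Seq[String] =
    Seq.fill(repetitions)(experiments).flatten

  def main(args: Array[String]): Unit = {
    val runs = schedule(Seq("baseline", "double-threads", "larger-heap"), repetitions = 4)
    runs.zipWithIndex.foreach { case (experiment, i) =>
      // Placeholder: kick off the load generator here and record start/end timestamps.
      println(s"run ${i + 1}: $experiment")
    }
  }
}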
Create enough load to be able to measure a difference

There are two different strategies for generating load.

Strategy 1: Redline it! In most cases, you want to ensure you're creating enough load to saturate your component. If you do not manage to accomplish that, how would you see that you increased its throughput? If your component falls apart at redline (i.e. OOMs, throughput drops, or otherwise spirals out of control), understand why and fix the problem.

Strategy 2: Measure machine resources. In cases where you cannot redline the component, or you have reason to believe it behaves substantially differently in less-than-100%-load situations, you may need to resort to OS metrics such as CPU utilization and IOPS to determine whether you've made a change. Make sure your load is large enough for changes to be visible. If your load causes 3% CPU utilization, a 50% improvement in performance will be lost in the noise. Try different amounts of load and find a sweet spot where your OS metric measurement is sensitive enough.

Add new benchmarking experiments as needed

As you execute your benchmarks and develop a better understanding of the system, you are likely to discover new factors that may impact your key metric. Add new experiments to your list and prioritize them over the previous ones if needed.

Hack the code

In some instances, the code may not have configuration or control knobs for the inputs you want to vary. Find the fastest way to change the input, even if it means hacking the code, commenting out sections or otherwise manipulating the code in ways that wouldn't be "kosher" for merges into master. Remember: the goal here is to get answers as quickly as possible, not to write production-quality code; that comes later, once we have our answers.

Analyze the data and document your insights

Once you've completed a series of benchmarks, take a step back and think about what the data is telling you about the system you're benchmarking. Document your insights and how the data backs them up. It may be helpful to:
- Calculate the average for each series of benchmarks you ran and use it to calculate the difference (in percent) between series, e.g. "when I doubled the number of threads, QPS increased by 23% on average."
- Graph your results: is the relationship between your input and the performance metric linear? Logarithmic? Bell-shaped?

Present your insights

When presenting your insights to management and/or other engineering teams, apply the Pyramid Principle. Engineers often make the mistake of explaining methodology and results first and concluding with the insights. It is preferable to reverse the order and start with the insight; then, if needed or requested, explain the methodology and how the data supports the insight. Omit the nitty-gritty details of any experiments that didn't lead to interesting insights. Avoid jargon, and if you cannot, explain it; don't assume your audience knows it. Make sure your graphs have meaningful, human-readable units, and make sure they can be read when projected onto a screen or TV.

September 2, 2016

Blog

Change Management, the ChatOps Way

All changes to production environments at Sumo Logic follow a well-documented change management process. While generally a sound practice, it is also specifically required for PCI, SOC 2, HIPAA, ISO 27001 and CSA STAR compliance, amongst others. Traditional processes never seemed like a suitable way to implement change management at Sumo Logic. Even a Change Management Board (CMB) that meets daily is much too slow for our environment, where changes are implemented every day, at any time of the day. In this blog, I'll describe our current solution, which we have iterated towards over the past several years.

The goals for our change management process are:
- Anybody can propose a change to the production system, at any time, and anybody can follow what changes are being proposed.
- A well-known set of reviewers can quickly and efficiently review changes and decide whether to implement them.
- Any change to production needs to leave an audit trail to meet compliance requirements.

Workflow and Audit Trail

We used Atlassian JIRA to model the workflow for any System Change Request (SCR). Not only is JIRA a good tool for workflows, but we also use it for most of our other bug and project tracking, making it trivial to link to relevant bugs or issues. A typical system change request goes through these steps:
1. Create the JIRA issue.
2. Propose the system change request to the Change Management Board.
3. Get three approvals from members of the Change Management Board.
4. Implement the change.
5. Close the JIRA issue.

If the CMB rejects the change request, we simply close the JIRA issue. The SCR type in JIRA has a number of custom fields, including:
- Environments to which the change needs to be applied
- Scheduled date for the change
- Justification for the change (for emergency changes only)
- Risk assessment (Low/Medium/High)
- Customer-facing downtime?
- Implementation steps, back-out steps and verification steps
- CMB meeting notes
- Names of CMB approvers

These details allow CMB members to quickly assess the risk and effects of a proposed change.

Getting to a decision quickly

To get from a proposal to an approved change in the most expedient manner, we have a dedicated #cmb-public channel in Slack. The typical sequence is:
1. Somebody proposes a system change in the Slack channel, linking to the JIRA ticket.
2. If needed, there is a brief discussion around the risk and details of the change.
3. Three of the members of the CMB approve the change in JIRA.
4. The requester or on-calls implement the change and mark the SCR implemented.

In the past, we manually tied together JIRA and Slack, without any direct integration. As a result, it often took a long time for SCRs to get approved, and there was a good amount of manual legwork to find the SCR in JIRA and see the details.

Bender to the rescue

In order to tie together the JIRA and Slack portions of this workflow, we built a plugin for our sumobot Slack bot. In our Slack instance, sumobot goes by the name of Bender Bending Rodriguez, named for the robot in Futurama. As engineers and CMB members interact with an SCR, Bender provides helpful details from JIRA. Bender listens to messages containing both the word "proposing" and a JIRA link, and then provides a helpful summary of the request. As people vote, he checks the status of the JIRA ticket, and once it moves into the Approved state, he lets the channel know.
Additionally, he posts a list of currently open SCRs into the channel three times a day, to remind CMB members of items they still need to decide on. The same list can also be requested manually by asking for "pending scrs" in the channel. Since this sumobot plugin is specific to our use case, I have decided not to include it in the open source repository, but I have made the current version of the source code available as part of this blog post here.
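For readers who don't want to dig through the linked source, here is a rough, standalone sketch of just the message-matching logic. This is not the actual sumobot plugin API; the object name, JIRA URL and channel name are invented for illustration:

object ScrMatcher {
  // Detect messages that propose a system change request (SCR) and pull out the JIRA key.
  // The URL pattern is illustrative; adjust it to your own JIRA instance.
  private val Proposal = """(?i).*proposing.*https?://\S*/browse/([A-Z]+-\d+).*""".r

  def jiraKey(message: String): Option[String] = message match {
    case Proposal(key) => Some(key)
    case _             => None
  }

  def main(args: Array[String]): Unit = {
    val msg = "Proposing a config change: https://jira.example.com/browse/SCR-1234"
    jiraKey(msg) match {
      case Some(key) => println(s"Would look up $key in JIRA and post a summary to #cmb-public")
      case None      => println("Not an SCR proposal")
    }
  }
}

A real plugin would then call the JIRA REST API with the extracted key and post the summary back to the channel.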

Blog

Using Sleeper Cells to Load Test Microservices

Blog

Optimizing Selectivity in Search Queries

While we at Sumo Logic constantly work to improve search performance automatically, there are some improvements that can only be made by humans who understand what question a search query answers and what the data looks like. In a Sumo Logic query, operators are chained together with the pipe symbol ( | ) to calculate the result. Each operator sends its output to the next operator in the chain. Certain operators, such as where and parse (except with nodrop), drop certain messages. Dropped messages do not need to be processed by any subsequent operators in the chain. This is called the selectivity of an operator: the fewer messages that "make it" through the operator, the more selective the operator is.

Operator Ordering

For optimal query performance, move the most selective operators to the earliest positions in your query. This reduces the amount of data subsequent operators need to process, and thereby speeds up your query.

Query 1

error | parse "ip=*, errorcode=*" as ip, errorcode | lookup ip from /my/whitelisted_ips on ip=ip | where errorcode="failed_login"

In the example above, Query 1 performs a lookup on all log lines, just to discard all lines where errorcode isn't "failed_login". Query 2 below is much more efficient, since it only performs the lookup on log lines that match the overall selectivity criteria of the query.

Query 2

error | parse "ip=*, errorcode=*" as ip, errorcode | where errorcode="failed_login" | lookup ip from /my/whitelisted_ips on ip=ip

Data Knowledge/Result Predictions

To optimize queries and predict results, you can use knowledge of your data. Consider the following example.

Query 3

error failed_login | parse "ip=*, errorcode=*" as ip, errorcode | where errorcode="failed_login" | lookup ip from /my/whitelisted_ips on ip=ip | if( isNull(ip), "unsafe", "safe") as ip_status | where ip_status="unsafe" | count by ip | top 10 newip, ip by _count

You may know that your top 10 values are all measured in the thousands or tens of thousands. Based on that knowledge, you can optimize this query to not evaluate any IP addresses that occur less frequently than what you expect:

Query 4

error failed_login | parse "ip=*, errorcode=*" as ip, errorcode | where errorcode="failed_login" | count by ip | where _count > 1000 | lookup ip from /my/whitelisted_ips on ip=ip | if( isNull(ip), "unsafe", "safe") as ip_status | where ip_status="unsafe" | top 10 newip, ip by _count

February 17, 2015

Blog

Debugging Amazon SES Message Delivery Using Sumo Logic

We at Sumo Logic use Amazon SES (Simple Email Service) for sending thousands of emails every day for things like search results, alerts and account notifications. We need to monitor SES to ensure timely delivery and to know when emails bounce. Amazon SES provides notifications about the status of email via Amazon SNS (Simple Notification Service), and Amazon SNS allows you to send these notifications to any HTTP endpoint. We ingest these messages using Sumo Logic's HTTP Source. Using these logs, we have identified problems such as scheduled searches that always send results to an invalid email address, and a Microsoft Office 365 outage, discovered when a customer reported not having received the sign-up email. Here's a step-by-step guide on how to send your Amazon SES notifications to Sumo Logic.

1. Set Up Collector. The first step is to set up a hosted collector in Sumo Logic which can receive logs via an HTTP endpoint. While setting up the hosted collector, we recommend providing an informative source category name, like "aws-ses".

2. Add HTTP Source. After adding a hosted collector, you need to add an HTTP Source. Once an HTTP Source is added, it will generate a URL which will be used to receive notifications from SNS. The URL looks like https://collectors.sumologic.com/receiver/v1/http/ABCDEFGHIJK.

3. Create SNS Topic. In order to send notifications from SES to SNS, we need to create an SNS topic on the SNS console. We use "SES-Notifications" as the name of the topic in our example.

4. Create SNS Subscription. SNS allows you to send a notification to multiple HTTP endpoints by creating multiple subscriptions within a topic. In this step we will create one subscription for the SES-Notifications topic created in step 3 and send notifications to the HTTP endpoint generated in step 2.

5. Confirm Subscription. After a subscription is created, Amazon SNS will send a subscription confirmation message to the endpoint. This subscription confirmation notification can be found in Sumo Logic by searching for: _sourceCategory=<name of the sourceCategory provided in step 1>. For example: _sourceCategory=aws-ses. Copy the confirmation link from the logs and paste it into your browser.

6. Send SES notifications to SNS. Finally, configure SES to send notifications to SNS. Go to the SES console and select the verified senders option on the left-hand side. In the list of verified email addresses, select the email address for which you want to configure the logs, expand the notifications section, click edit notifications, and select the SNS topic you created in step 3.

7. Switch message format to raw (optional). SES sends notifications to SNS in a JSON format, and any notification sent through SNS is by default wrapped into a JSON message. In this case, that creates nested JSON, resulting in a nearly unreadable message. To avoid nested JSON messages, we highly recommend configuring SNS to use the raw message delivery option.

With raw message delivery enabled, the json operator makes it easy to parse the messages, as shown in the queries below.

1. Retrieve general information out of messages:

_sourceCategory=aws-ses | json "notificationType", "mail", "mail.destination", "mail.destination[0]", "bounce", "bounce.bounceType", "bounce.bounceSubType", "bounce.bouncedRecipients[0]" nodrop
2. Identify the most frequently bounced recipients:

_sourceCategory=aws-ses AND !"notificationType":"Delivery" | json "notificationType", "mail.destination[0]" as type, destination nodrop | count by destination | sort by _count
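Steps 3 and 4 can also be scripted instead of clicked through in the console. Here is a rough Scala sketch using the AWS SDK for Java; the endpoint URL is the placeholder from step 2, and credentials are assumed to come from the default provider chain:

import com.amazonaws.services.sns.AmazonSNSClient
import com.amazonaws.services.sns.model.{CreateTopicRequest, SubscribeRequest}

object SesNotificationSetup {
  def main(args: Array[String]): Unit = {
    val sns = new AmazonSNSClient()

    // Step 3: create the topic.
    val topicArn = sns.createTopic(new CreateTopicRequest("SES-Notifications")).getTopicArn

    // Step 4: subscribe the Sumo Logic HTTP Source URL (placeholder) to the topic.
    val endpoint = "https://collectors.sumologic.com/receiver/v1/http/ABCDEFGHIJK"
    sns.subscribe(new SubscribeRequest(topicArn, "https", endpoint))

    // Step 5 still applies: confirm the subscription from the link that shows up in Sumo Logic.
    // Once confirmed, raw message delivery (step 7) can be enabled on the subscription, e.g.:
    // sns.setSubscriptionAttributes(subscriptionArn, "RawMessageDelivery", "true")

    println(s"Created topic $topicArn and requested an HTTPS subscription for $endpoint")
  }
}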

AWS

October 2, 2014

Blog

We are Shellshock Bash Bug Free Here at Sumo Logic, but What about You?

Blog

Mitigating the Heartbleed Vulnerability

By now, you have likely read about the security vulnerability known as the Heartbleed bug. It is a vulnerability in the widespread OpenSSL library. It allows stealing the information protected, under normal conditions, by the SSL/TLS encryption used to encrypt traffic on the Internet (including Sumo Logic).

How did we eliminate the threat?

When we were notified about the issue, we quickly discovered that our own customer-facing SSL implementation was vulnerable to the attack. Thankfully, the advisory was accompanied by some scripts and tools to test for the vulnerability. Mitigation happened in four steps:

1. Fix vulnerable servers. As a first step, we needed to make sure we closed the information leak. In some cases, that meant working with third-party vendors (most notably Amazon Web Services, which runs our Elastic Load Balancers) to get all servers patched. This step was concluded once we confirmed that all of the load balancers in the DNS rotation were no longer vulnerable.

2. Replace SSL key pairs. Even though we had no reason to believe there was any actual attack against our SSL private keys, it was clear all of them had to be replaced as a precaution. Once the new certificates were deployed to all the servers and load balancers, we revoked all previous certificates with our CA, GeoTrust. All major browsers perform revocation checks against OCSP responders or CRLs.

3. Notify customers. Shortly after we resolved the issues, we sent an advisory to all of our customers, recommending a password change. Again, this was a purely precautionary measure, as there is no evidence of any passwords leaking.

4. Update Collectors. We have added a new feature to our Collectors that will automatically replace the Collector's credentials. Once we complete testing, we will recommend that all customers upgrade to the new version. We also enabled support for certificate revocation checking, which wasn't enabled previously.

How has this affected our customers?

Thus far, we have not seen any signs of unusual activity, nor have we seen any customers lose data due to this bug. Unfortunately, we've had to inconvenience our customers with requests to change passwords and update Collectors, but given the gravity of the vulnerability, we felt the inconvenience was justified.

Internal impact

Our intranet is hosted on AWS and our users use OpenVPN to connect to it. The version of OpenVPN we had been running needed to be updated to a version that was released today. Other servers behind OpenVPN also needed updating. Sumo Logic uses on the order of 60-70 SaaS services internally. Shortly after we resolved our customer-facing issues, we performed an assessment of all those SaaS services. We used the scripts that test for the vulnerability, combined with DNS lookups. If a service looked like it was hosted with a provider or service that was known to have been vulnerable (such as AWS ELB), we added it to our list. We are now working our way through the list and changing passwords on all affected applications, starting with the most critical ones. Unfortunately, manually changing passwords in all of the affected applications takes time and presents a burden on our internal IT users. We plan to have completed this process by the end of the week.

Interesting Days

Overall, the past few days were pretty interesting on the Internet. Many servers (as many as 66% of SSL servers on the net) run OpenSSL, and most were affected. Big sites, including Yahoo Mail and many others, were affected.
The pace of exploitation and the spread of the issue were both concerning. Thankfully, Codenomicon, the company that discovered this vulnerability, did an amazing job handling and disclosing it in a pragmatic and responsible fashion. This allowed everybody to fix the issue rapidly and minimize the impact on their users.

Blog

AWS re:Invent - The Future is Now

Blog

Pragmatic AWS: 3 Tips to Enhance the AWS SDK with Scala

At Sumo Logic, most backend code is written in Scala. Scala is a newer JVM (Java Virtual Machine) language created by Martin Odersky, who also co-founded our Greylock sister company, Typesafe. Over the past two years at Sumo Logic, we've found Scala to be a great way to use the AWS SDK for Java. In this post, I'll explain some use cases.

1. Tags as fields on AWS model objects

Accessing AWS resource tags can be tedious in Java. For example, to get the value of the "Cluster" tag on a given instance, something like this is usually needed:

String deployment = null;
for (Tag tag : instance.getTags()) {
  if (tag.getKey().equals("Cluster")) {
    deployment = tag.getValue();
  }
}

While this isn't horrible, it certainly doesn't make code easy to read. Of course, one could turn this into a utility method to improve readability. The set of tags used by an application is usually known and small in number. For this reason, we found it useful to expose tags with an implicit wrapper around the EC2 SDK's Instance, Volume, etc. classes. With a little Scala magic, the above code can now be written as:

val deployment = instance.cluster

Here is what it takes to make this magic work:

object RichAmazonEC2 {
  implicit def wrapInstance(i: Instance) = new RichEC2Instance(i)
}

class RichEC2Instance(instance: Instance) {
  // getTags returns a java.util.List; the JavaConversions import from section 2 makes find available.
  private def getTagValue(tag: String): String =
    instance.getTags.find(_.getKey == tag).map(_.getValue).getOrElse(null)

  def cluster = getTagValue("Cluster")
}

Whenever this functionality is desired, one just has to import RichAmazonEC2._

2. Work with lists of resources

Scala 2.8.0 included a very powerful new set of collections libraries, which are very useful when manipulating lists of AWS resources. Since the AWS SDK uses Java collections, to make this work, one needs to import scala.collection.JavaConversions._, which transparently "converts" (wraps implicitly) the Java collections. Here are a few examples to showcase why this is powerful.

Printing a sorted list of instances, by name:

ec2.describeInstances().                                 // Get the reservations.
  getReservations.
  map(_.getInstances).
  flatten.                                               // Translate reservations to instances.
  sortBy(_.sortName).                                    // Sort the list (sortName comes from a rich wrapper, as above).
  map(i => "%-25s (%s)".format(i.name, i.getInstanceId)). // Create a String per instance.
  foreach(println(_))                                    // Print the strings.

Grouping a list of instances in a deployment by cluster (returns a Map from cluster name to the list of instances in the cluster):

ec2.describeInstances().            // Get the reservations.
  getReservations.
  map(_.getInstances).
  flatten.                          // Translate reservations to instances.
  filter(_.deployment == "prod").   // Filter the list to the prod deployment.
  groupBy(_.cluster)                // Group by cluster.

You get the idea – this makes it trivial to build very rich interactions with EC2 resources.

3. Add pagination logic to the AWS SDK

When we first started using AWS, we had a utility class to provide some commonly repeated functionality, such as pagination for S3 buckets and retry logic for calls. Instead of embedding functionality in a separate utility class, implicits allow you to pretend that the functionality you want exists in the AWS SDK.
Here is an example that extends the AmazonS3 class to allow listing all objects in a bucket:

object RichAmazonS3 {
  implicit def wrapAmazonS3(s3: AmazonS3) = new RichAmazonS3(s3)
}

class RichAmazonS3(s3: AmazonS3) {
  def listAllObjects(bucket: String, cadence: Int = 100): Seq[S3ObjectSummary] = {
    var result = List[S3ObjectSummary]()

    def addObjects(objects: ObjectListing) = result ++= objects.getObjectSummaries

    var objects = s3.listObjects(new ListObjectsRequest().withMaxKeys(cadence).withBucketName(bucket))
    addObjects(objects)

    while (objects.isTruncated) {
      objects = s3.listNextBatchOfObjects(objects)
      addObjects(objects)
    }
    result
  }
}

To use this:

val objects = s3.listAllObjects("mybucket")

There is, of course, a risk of running out of memory, given a large enough number of object summaries, but in many use cases this is not a big concern.

Summary

Scala enables programmers to implement expressive, rich interactions with AWS and greatly improves readability and developer productivity when using the AWS SDK. It's been an essential tool to help us succeed with AWS.
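The post mentions retry logic as another candidate for the same trick. As a hedged sketch (not the code we actually ship), a small helper can wrap any AWS call in simple retries, and an implicit wrapper could expose it in the same way as the examples above:

object RichAwsRetries {
  // Retry an arbitrary AWS call a few times with a fixed delay between attempts.
  // Purely illustrative; production code should be selective about which failures to retry.
  def withRetries[T](attempts: Int = 3, delayMillis: Long = 1000)(call: => T): T = {
    try {
      call
    } catch {
      case _: com.amazonaws.AmazonClientException if attempts > 1 =>
        Thread.sleep(delayMillis)
        withRetries(attempts - 1, delayMillis)(call)
    }
  }
}

// Usage:
// val result = RichAwsRetries.withRetries() { ec2.describeInstances() }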

AWS

July 12, 2012

Blog

Nine 1Password Power Tips

Blog

Security-Gain without Security-Pain

AWS

June 21, 2012

Blog

Pragmatic AWS: Principle of Least Privilege with IAM

(Image: "Lock and Chain" by Martin Magdalene)

One of the basic principles in information security is the Principle of Least Privilege. The idea is simple: give every user/process/system the minimal amount of access required to perform its tasks. In this post, I'll describe how this principle can be applied to applications running in a cluster of EC2 instances that need access to AWS resources.

What are we protecting?

The AWS Access Key ID and Secret are innocent-looking strings. I've seen people casually toss them around in scripts and bake them into AMIs. When compromised, however, they give the attacker full control over all of your resources in AWS. This goes beyond root access on a single box – it's "god mode" for your entire AWS world! Needless to say, it is critical to limit both the likelihood of a successful attack and the exposure in case of a successful attack against one part of your application.

Why do we need to expose AWS credentials at all?

Since our applications run on EC2 instances and access other AWS services, such as S3, SQS and SimpleDB, they need AWS credentials to run and perform their functions.

Limiting the likelihood of an attack: protecting AWS credentials

In an ideal world, we could pass the AWS credentials into applications without ever writing them to disk and encrypt them in application memory. Unfortunately, this would make for a rather fragile system: after a restart, we'd need to pass the credentials into the application again. To enable automated restarts, recovery, etc., most applications store the credentials in a configuration file. There are many other methods for doing this; Shlomo Swidler compared the tradeoffs between different methods for keeping your credentials secure in EC2 instances. At Sumo Logic, we've picked what Shlomo calls the SSH/On Disk method. The concerns around forgetting credentials during AMI creation don't apply to us: our AMI creation is fully automated, and AWS credentials never touch those instances. The AWS credentials only come into play after we boot from the AMI. Each application in our stack runs as a separate OS user, and the configuration file holding the AWS credentials for the application can only be read by that user. We also use file system encryption wherever AWS credentials are stored. To add a twist, we obfuscate the AWS credentials on disk by encrypting them with a hard-coded, symmetric key. This obfuscation, an additional defense-in-depth measure, makes it a little more difficult to get the plain text credentials in the case of instance compromise. It also makes shoulder surfing much more challenging.

Limiting exposure in case of a successful attack: restricted-access AWS credentials

Chances are that most applications only need a very small subset of the AWS portfolio of services, and only a small subset of resources within them. For example, an application using S3 to store data will likely only need access to a few buckets, and only perform a limited set of operations against them. AWS's IAM service allows us to set up users with limited permissions, using groups and policies. Using IAM, we can create a separate user for every application in our stack, limiting the policy to the bare minimum of resources and actions required by the application. Fortunately, the actions available in policies correspond directly to AWS API calls, so one can simply analyze which calls an application makes to the AWS API and derive the policy from that list.
For every application-specific user, we create a separate set of AWS credentials and store them in the application's configuration file.

In Practice – Automate, automate, automate!

If your stack consists of more than one or two applications or instances, the most practical option for configuring IAM users is automation. At Sumo Logic, our deployment tools create a unique set of IAM users: one set of users per deployment, and one user per application within the deployment. Each user is assigned a policy that restricts access to only those of the deployment's resources that are required by the application. If the policies change, the tools update them automatically. The tools also configure per-application OS-level users and restrict file permissions for the configuration files that contain the AWS credentials for the IAM user. The configuration files themselves store the AWS credentials as obfuscated strings. One wrinkle in this scheme is that the AWS credentials created for the IAM users need to be stored somewhere after their initial creation; they can never be retrieved from AWS again. Since many of our instances are short-lived, we needed to make sure we could use the credentials again later. To solve this particular issue, we encrypt the credentials, then store them in SimpleDB. The key used for this encryption does not live in AWS and is well protected on hardware tokens.

Summary

It is critical to treat your AWS credentials as secrets and to assign point-of-use-specific credentials with minimal privileges. IAM and automation are essential enablers to make this practical.

Update (6/12/2012): AWS released a feature named IAM Roles for EC2 Instances today. It makes a temporary set of AWS credentials available via instance metadata, and the credentials are rotated multiple times a day. IAM Roles add a lot of convenience, especially in conjunction with the AWS SDK for Java. Unfortunately, this approach has an Achilles heel: any user with access to the instance can now execute a simple HTTP request and get a valid set of AWS credentials. To mitigate some of the risk, a local firewall, such as iptables, can be used to restrict HTTP access to a subset of users on the machine.

Comparing the two approaches:
+ User privileges and obfuscation offer a stronger defense in scenarios where a single (non-root) user is compromised.
+ Per-application (not per-instance) AWS credentials are easier to reason about.
– The rotation of IAM keys performed transparently by IAM Roles adds security: an attacker has to maintain access to a compromised machine to maintain access to valid credentials.

Best of Both Worlds

AWS's approach could be improved upon with a small tweak: authenticate access to the temporary/rotating credentials T in instance metadata using another pair of credentials A. A itself would not have any privileges other than accessing T from within an instance. This approach would be a "best of both worlds": access to A could be restricted using the methods described above, but keys would still be rotated on an ongoing basis.
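As a rough illustration of the automation described under "In Practice" (the user name, policy name and bucket ARNs are invented, and the real policies are derived from each application's actual API calls), the IAM portion can be driven from Scala with the AWS SDK for Java:

import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClient
import com.amazonaws.services.identitymanagement.model.{CreateAccessKeyRequest, CreateUserRequest, PutUserPolicyRequest}

object LeastPrivilegeSetup {
  def main(args: Array[String]): Unit = {
    val iam = new AmazonIdentityManagementClient()

    // One IAM user per application per deployment, e.g. "prod-frontend".
    val userName = "prod-frontend"
    iam.createUser(new CreateUserRequest(userName))

    // Minimal inline policy: this hypothetical application only reads one S3 bucket.
    val policy =
      """{
        |  "Statement": [{
        |    "Effect": "Allow",
        |    "Action": ["s3:GetObject", "s3:ListBucket"],
        |    "Resource": ["arn:aws:s3:::prod-frontend-config", "arn:aws:s3:::prod-frontend-config/*"]
        |  }]
        |}""".stripMargin
    iam.putUserPolicy(new PutUserPolicyRequest(userName, "prod-frontend-s3-read", policy))

    // The access key is only available at creation time; store it securely (we encrypt ours).
    val key = iam.createAccessKey(new CreateAccessKeyRequest().withUserName(userName)).getAccessKey
    println(s"Created ${key.getAccessKeyId} for $userName")
  }
}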

AWS

June 12, 2012

Blog

Pragmatic AWS: Data Destroying Drones

As we evolve our service, we occasionally delete EBS (Elastic Block Store) volumes. This releases the disk space back to AWS to be assigned to another customer. As a security precaution, we have decided to perform a secure wipe of the EBS volumes. In this post, I'll explain how we implemented the wipe.

Caveats

Wiping EBS volumes may be slightly paranoid and not strictly needed, since AWS guarantees never to return a previous user's data via the hypervisor (as mentioned in their security white paper). We also understand that the secure wipe is not perfect: EBS is able to move our data around in the background and leave behind blocks that we didn't wipe. Still, we felt that this additional precaution was worth the bit of extra work and cost – better safe than sorry.

Drones

We wanted to make sure secure wiping did not have any performance impact on our production deployment. Therefore, we decided it would be best to perform the secure wipe from a different set of AWS instances: Data Destroying Drones. We also wanted them to be fire-and-forget, so we wouldn't have to manually check up on them. To accomplish all this, we built a tool that:
- Finds to-be-deleted EBS volumes matching a set of tag values (we tag the volumes to mark them for wiping).
- Launches one t1.micro instance per EBS volume that needs wiping (using an Ubuntu AMI).
- Passes a cloud-init script with the volume ID and (IAM-limited) AWS credentials into the instance.

The Gory Details

Ubuntu has a mechanism named cloud-init. It accepts a shell script via EC2's user data, which is passed in as part of the RunInstances API call to EC2. Here is the script (destroy_volume.sh) we use for the Data Destroying Drones, with the volume ID and credentials filled in per volume:

#!/bin/bash
set -e
export INSTANCE_ID=`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`
export VOLUME_ID=v-12345678
export EC2_URL=https://ec2.us-east-1.amazonaws.com
export EC2_ACCESS_KEY=[key id]
export EC2_SECRET_KEY=[key]

sudo apt-get install scrub
euca-attach-volume -i $INSTANCE_ID -d /dev/sdj $VOLUME_ID
sudo scrub -b 50M -p dod /dev/sdj > ~/sdj.scrub.log 2>&1
sleep 30

euca-detach-volume $VOLUME_ID
euca-delete-volume $VOLUME_ID
halt

This script automates the entire process:
1. Attach the volume.
2. Perform a DoD 5220.22-M secure wipe of the volume using scrub.
3. Detach and delete the volume.
4. Halt the instance.

The instances are configured to terminate on halt, which results in all involved resources disappearing once the secure wipe completes. The scrub can take hours or even days, depending on the size of the EBS volume, but the cost of the t1.micro instances makes this a viable option. Even if the process takes 48 hours, it costs less than $1 to wipe the volume.

Summary

Aside from being a fun project, the Data Destroying Drones have given us additional peace of mind and confidence that we've followed best practice and made a best effort to secure our customers' data by not leaving any of it behind in the cloud.
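The launcher tool itself isn't shown in the post. As a rough sketch of what it might look like (the AMI ID, tag name and script file name are placeholders), the EC2 SDK can find the tagged volumes and start one t1.micro drone per volume:

import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.{DescribeVolumesRequest, Filter, RunInstancesRequest}
import javax.xml.bind.DatatypeConverter
import scala.collection.JavaConversions._

object DataDestroyingDroneLauncher {
  def main(args: Array[String]): Unit = {
    val ec2 = new AmazonEC2Client()

    // Find EBS volumes tagged for wiping (the tag name is illustrative).
    val volumes = ec2.describeVolumes(new DescribeVolumesRequest()
      .withFilters(new Filter("tag:WipeMe").withValues("true"))).getVolumes

    // The destroy_volume.sh template shown above, with a placeholder volume ID.
    val template = scala.io.Source.fromFile("destroy_volume.sh").mkString

    for (volume <- volumes) {
      // Fill the volume ID into the cloud-init script and base64-encode it as user data.
      val userData = template.replace("v-12345678", volume.getVolumeId)
      ec2.runInstances(new RunInstancesRequest("ami-xxxxxxxx", 1, 1) // placeholder Ubuntu AMI
        .withInstanceType("t1.micro")
        .withUserData(DatatypeConverter.printBase64Binary(userData.getBytes("UTF-8"))))
    }
  }
}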

AWS

June 5, 2012

Blog

Pragmatic AWS: 4 Ideas for Using EC2 Tags

At Sumo Logic, we use Amazon Web Services (AWS) for everything. Our product, as well as all of our internal infrastructure, lives in AWS. In this series of posts, we'll share some useful practices around using AWS. In the first installment, I'll outline some useful things we do with tags.

1. Organize resources

We've decided on a hierarchical way of managing our EC2 (Elastic Compute Cloud) resources: Deployment, then Cluster, then Instance/Node. Within an AWS account, we can have multiple "deployments". A deployment is a complete, independent copy of our product and uses the same architecture as our production service. Besides production, we use several smaller-scale deployments for development, testing and staging. Each deployment consists of a number of clusters, and each cluster of one or more instances.

Instances and their corresponding EBS (Elastic Block Store) volumes are tagged with Deployment, Cluster and Node tags. As an example, the third frontend node of our production deployment would be tagged like so:

Deployment=prod
Cluster=frontend
NodeNumber=3

There is also a direct mapping to DNS names: the DNS name for this node would be prod-frontend-3. Combined with the filtering features in the AWS Console (you can make any tag a column in the resource listings), this makes it very easy to navigate to a particular set of resources.

2. Display Instance Status

Tags can also be used as an easy way to display status information in the AWS console. Simply update a tag with the current status whenever it changes. The code that deploys our instances into EC2 updates a DeployStatus tag whenever it progresses from one step to another. For example, it could read:

2012-05-10 17:53 Installing Cassandra

This allows you to see what's going on with instances at a glance.

3. Remember EBS Volume Devices

For EC2 instances that have multiple EBS volumes, our tools need to know which volume gets mapped to which device on the instance when the volumes are attached. When we first create a volume, for example /dev/sdj, we add a DeviceName tag to the volume with a value of /dev/sdj to track where it needs to be attached. The next time we attach the volume, we know its "proper place".

4. Attribute and remind people of costs

All our developers are empowered to create their own AWS resources. This is a huge benefit for full-scale testing, performance evaluations, and many other use cases. Since AWS is not a charity, however, we need to manage costs tightly. In order to do this, we tag all AWS resources with an Owner tag (either by hand, or via our automated deployment tool). To consume this tag, we have a cron job that runs daily and emails users who have active resources in AWS to remind them to shut down what they no longer require. The subject line of the email reads "[AWS] Your current burn rate is $725.91/month!", and the body contains a table with a more detailed cost breakdown. In addition, there is also a rollup email that goes out to the entire development team.

Summary

EC2 tags are extremely useful tools to track state, organize resources and store relationships between resources like instances and EBS volumes. There are myriad more ways to use them. I hope these tips have been helpful.
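Applying the tagging scheme from tip 1 is straightforward with the AWS SDK for Java. Here is a minimal Scala sketch (the instance ID is a placeholder):

import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.{CreateTagsRequest, Tag}

object TagInstance {
  // Apply the Deployment/Cluster/NodeNumber scheme described above to one instance.
  def main(args: Array[String]): Unit = {
    val ec2 = new AmazonEC2Client()
    ec2.createTags(new CreateTagsRequest()
      .withResources("i-0123abcd") // placeholder instance ID
      .withTags(new Tag("Deployment", "prod"), new Tag("Cluster", "frontend"), new Tag("NodeNumber", "3")))
  }
}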

AWS

May 15, 2012

Blog

Sumo on Sumo, Part 2: User Signups

In part 1, we mentioned that we're big on "dogfooding". In this short post we'll run you through a very simple use case we solve with our Sumo Logic product. When we launched in January, everybody here was extremely excited, and we wanted to know who signed up for our demo. Solving this problem with the product was a matter of two minutes.

The component responsible for account creation logs a line like the one below whenever somebody signs up for the demo:

2012-04-13 10:31:58,917 -0700 INFO [hostId=prod-bill-1] [module=BILL] [logger=bill.signup.DemoAccountRequestHandler] [thread=btpool0-19] [10.10.10.10] Successful request: Stefan Zier (stefan+blog@sumologic.com) http://www.sumologic.com/free-trial/?email=stefan%2Bblog%40sumologic.com&firstName=Stefan&lastName=Zier&organizationName=Sumo&key=nnnnnnnnnnnnnnnnnn

Looking carefully, this has all the information we're interested in. Let me walk you through the process of building up the search, refining it iteratively. First, let's find the lines we're interested in by picking two keywords we think uniquely identify this type of message:

DemoAccountRequestHandler Successful

Ok, this gives us the lines. Now, let's parse out the first name, last name and email:

DemoAccountRequestHandler Successful | parse "request: * (*)" as name, email

Now we have the data we care about, all parsed out. Note the simplified regular expression in the parse section of the query: simply find a good anchor ("request:" in this case) and put * for the values you want parsed out. Looking at the results, we see that our own sign-ups for testing and QA are included. Let's get rid of them:

DemoAccountRequestHandler Successful !sumologic.com | parse "request: * (*)" as name, email

The only task that remains is to schedule it. Let's say we want the results sent to employees every hour. We only want an email when there are results, not at 4am, when people are asleep. We save the search, schedule it, and configure the conditions for the schedule, as well as the destination of the notification. That's all – everybody at Sumo Logic now gets an hourly summary of demo signups! Here is what the email looks like (ok, I cheated and removed the "!sumologic.com" for this demo).

May 1, 2012

Blog

RAFC: Internet Connectivity for the cloud age

Blog

Sumo on Sumo, Part 1: Collection

At Sumo Logic, we strongly believe in using our own service, sometimes called "dogfooding." The primary reason for doing this is that Sumo Logic is a great fit for our environment. We run a mix of on-premise and cloud appliances, services and applications for which we need troubleshooting and monitoring capabilities:
- Our service, the Sumo Logic Log Management and Analytics Service, a distributed, complex SaaS application
- A heterogeneous, on-premise office network
- Our development infrastructure, which lives in Amazon Web Services (AWS)
- Our website

In short: we are like many other companies out there, with a mix of needs and use cases for the Sumo Logic service. In this post, I'll explain how we've deployed our Sumo Logic Collectors in our environment. Collectors are small software components that gather logs from various sources and send them to the Sumo Logic Cloud in a reliable and secure fashion. They can be deployed in a variety of ways; this post provides some real-world examples.

The Sumo Logic Service

Our service is deployed across a large number of servers that work in concert to accept logs, store them in NoSQL databases, index them, manage retention, and provide all the search and analytics functionality in the service. Any interaction with our system is almost guaranteed to touch more than one of these machines. As a result, debugging and monitoring based on log files alone would be impractical, bordering on impossible. The solution to this, of course, is the Sumo Logic service. After deciding that we wanted our own service to monitor itself, we weighed several different deployment options:
- A centralized Collector, pulling logs via SSH
- A centralized Collector, receiving logs via syslog
- Collectors on each machine in the deployment, reading local files

In the end, we went with the third option: both our test and production environments run Sumo Logic Collectors on each machine. The primary motivator for this choice was that it was best for testing the service, since running a bigger number of Collectors is more likely to surface any issues. This decision made it a priority to enable automatic deployment features in the Collector, which is why our Collectors can now be deployed both by hand and in a scripted fashion. Every time we deploy the service, a script installs and registers the Collector. Using the JSON configuration, we configure collection of files from our own application, third-party applications, and system logs. So there you have it: the Sumo Logic service monitors itself.

The Office

Sumo Logic's main office is located in downtown Mountain View, on Castro Street. As strong proponents of cloud-based technologies, we've made an effort to keep the amount of physical infrastructure inside our office to a minimum. But there are some items any office needs, including ours:
- Robust Internet connectivity (we run a load-balanced setup with two ISPs)
- Network infrastructure (we have switches, WiFi access points and firewalls)
- Storage for workstation backups (we have a small NAS)
- Phones (we just deployed some IP phones)
- Security devices (we run an IDS with taps into multiple points in our network, and a web proxy)
- DHCP/DNS/directory services (we run a Mac server with Open Directory)

Some of these devices log to files, others to syslog, while yet others are Windows machines. Whenever a new device is added to the network, we make sure logs are being collected from it.
This has been instrumental in debugging tricky WiFi issues (Castro Street is littered with tech companies running WiFi), figuring out login issues, troubleshooting Time Machine problems with our NAS, and many other use cases.

Development Infrastructure

Our bug tracker, CI cluster and other development infrastructure live in AWS. In order to monitor this infrastructure, we run Sumo Logic Collectors on all nodes, picking up system log files, web server logs, and application logs from the various commercial and open source tools we run. We use these logs for troubleshooting and to monitor trends.

Our Website

The web server on our public-facing website logs lots of interesting information about visitors and how they interact with the site. Of course, we couldn't resist dropping in a Collector to pick up these log files. We use a scheduled query that runs hourly and tells us who signed up for demo and trial accounts.

In Summary

We eat our own dog food and derive a lot of value from our own service, using it to troubleshoot and monitor all of our infrastructure, both on-premise and in the cloud. Our Collector's ability to collect from a rich set of common source types (files, syslog, Windows), as well as the automatic, scripted installation, makes it very easy to add new logs into the Sumo Logic Cloud.

– Stefan Zier, Cloud Infrastructure Architect