A couple of weeks ago I gave a cool little web presentation about cloud security best practices and design principles. (I say cool because I like doing those decently more than I like sitting at my desk, and I say little because I went for 33 minutes and I know I could have gone for 90. I will be giving this talk again, BTW, on January 9th for the Amazon Web Services Ecosystem.) I got a pretty good question from one of the viewers, who wanted to know, “What mistakes do people make when they utilize cloud-based infrastructure providers?” I thought that was an excellent question, and since not all of you were there to hear my answer, I’m here to share it and expound on it a little bit.

In my opinion, the biggest mistake you can make in adopting Infrastructure as a Service (IaaS) providers is to just move your data-center into the cloud wholesale, in basically the same shape it is already in. Now, certainly I understand this temptation! You have probably spent a lot of time creating scripts and setting up access controls, logging mechanisms, and everything else that comes with building your deployment in a traditional (what I call ‘data-center-centric’) way. So it may seem that the best, fastest, easiest and cheapest thing to do is to simply pick it up and move it, as it were, to your new “hosting provider”. But this approach may well leave you missing out on some of the best reasons to run in the cloud.

IaaS providers such as Amazon Web Services offer a multitude of services and features that can vastly improve your operational efficiency, scalability, and security, but they must be properly leveraged. Cloud computing, while similar in some respects to hosting, is an entirely different paradigm, and in order to take full advantage of its benefits, some time and care need to be taken in the design phase of such a project.
I like to compare the differences between cloud and data-center configurations in terms of two types of gambling/entertainment most of you will be familiar with: playing Three Card Monte on the street (your data-center) and going to gamble in a major casino (the cloud). In a Three Card Monte scenario, the ‘house’ makes its money by keeping a tight level of control over the game. They know exactly which card the token is under at all times, they can palm or move the token at will, and they will have one or more shills in the audience to help them control the crowd’s reactions. This can be a very profitable endeavor for the ‘house’, but it is not scalable to large crowds, multiple dealers, or environments where there is a high degree of scrutiny.

In contrast, a casino is designed to achieve the same ends (to take your money and provide some entertainment in the process) but does so in a very different way. The casino relies on statistics in order to win over any given day. The casino can’t control (due to regulators) exactly which slot machines will pay out, how much, or when, nor can it control which blackjack dealers will have good or bad nights, and it cannot ‘fix’ the roulette wheel, and yet the house always wins. This model is scalable to large crowds, multiple dealers and games, and even high degrees of scrutiny. It is through exercising control at a higher level and giving up control at the lower level that casinos are able to achieve this scalability and profitability.

It works the same in the cloud. Rather than having precise control over your hardware and network connections, you exercise control at the design level by creating feedback loops and auto-scaling triggers, and by catching and reacting to exceptions. This allows you, much the same as the casino, to give up control over many of the details and still ensure you always win at the end of the day.
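To make the idea of control at the design level a little more concrete, here is a minimal sketch of an auto-scaling trigger written as a feedback loop step. Everything in it is an assumption for illustration: the `ScalingPolicy` name, the thresholds, and the one-instance-at-a-time step are made up, and this is not any provider's actual API. It only shows the shape of the decision: you stop caring which machine is busy and react to the aggregate instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalingPolicy:
    """Hypothetical policy; all thresholds are illustrative assumptions."""
    min_instances: int = 2
    max_instances: int = 20
    scale_out_above: float = 0.75  # average utilization that triggers growth
    scale_in_below: float = 0.25   # average utilization that triggers shrinkage


def desired_capacity(current: int, utilizations: List[float],
                     policy: ScalingPolicy) -> int:
    """Return the instance count the fleet should move toward."""
    if not utilizations:
        # Missing metrics are an exception to react to: hold capacity steady.
        target = current
    else:
        average = sum(utilizations) / len(utilizations)
        if average > policy.scale_out_above:
            target = current + 1
        elif average < policy.scale_in_below:
            target = current - 1
        else:
            target = current
    # Never leave the bounds the design allows, whatever the metrics say.
    return max(policy.min_instances, min(target, policy.max_instances))
```

Called periodically with fresh per-instance utilization figures, a function like this is the whole "house edge": no single instance is managed by hand, but the fleet as a whole stays inside the limits you designed.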
So just as it would be impractical to set up a Three Card Monte table in a modern casino, simply hauling your existing design into the cloud is not the best approach. Take the time to re-design your system to utilize all of the great advantages that IaaS providers such as Amazon provide. Tune in later for Part II.
As I mentioned in one of my previous posts, here at Sumo Logic we believe cloud-based services provide excellent value due to their ease of setup, convenience and scalability, and we leverage them extensively to provide internal services that would be far more time-, labor- and cash-intensive to manage ourselves. Today I’m going to talk about some of the services we use for collaboration, operations and IT, why we use them, and how they simplify our lives.

Campfire

Campfire is a huge part of our productivity and culture at Sumo Logic. While I would lump this and Skype together under something like “Managed Corporate Messaging”, they fill two very different niches in our environment. Campfire from 37signals is a fantastic tool for group conversations. Using the Campfire service, we have set up multiple chat rooms for various types of issues, including Production Issues, Development Issues, Sales/Customer-Support Issues, and of course, a free-for-all chat-room where we try to make one another spontaneously erupt into chaotic LOLs. These group-chats provide a critical space where we can work together to troubleshoot and solve problems cooperatively. Campfire makes it very easy to upload pictures and share large amounts of information in real-time with co-workers who can be anywhere. The conversations are all archived for later reference, which allows us to use the Production Incidents room as a 24×7 conference call and canonical forum of record for anything happening to production systems. Our Production on-call devs are expected to echo their actions into the Production channel and keep up with events there as they transpire. Campfire also has a cool feature which allows you to start a voice conference with participants if needed, which is a great option in certain situations. These calls can also be archived for later reference.
One downside to the text and audio archives is that they are not easily searchable, so it helps to know approximately when something happened, and we have found it necessary to consult other records to determine where to look.

Skype

Skype is, of course, the very popular IM and VoIP service that was purchased by Microsoft a while back. We use Skype extensively for 1:1 chatting and easy and secure file-transfers throughout the company. We also make extensive use of the wide array of available emoticons. (Stefan Zier is a particularly prolific and artistic user of these.) We also use Skype video chat for interviews and to collaborate with team members abroad. We have a conference room with a TV and Skype camera just for this application.

Cloudkick

Running a large-scale cloud-based service requires a lot of operational awareness. One of the ways we achieve this is through Cloudkick. Cloudkick was recently acquired by Rackspace and is evolving into Rackspace Cloud Monitoring. We are still on the legacy Cloudkick service, which we have come to use heavily. We automatically install Cloudkick agents on all of our production instances and use them to collect a wide array of status codes from the OS and through JMX, as well as by running our own custom scripts, which check for the existence of critical processes and detect whether things like HPROF files exist. The Cloudkick website has a “show only failures” mode which we call the “What’s Wrong? Page”. This is a very helpful tool that allows our EverybodyOps team to quickly assess issues with our production environment.

PagerDuty

Of course, we also need to be proactively alerted to failures and crossed thresholds that could indicate trouble, and for this we rely on PagerDuty (affectionately known as P. Diddy to many of us, a nickname coined by Christian). PagerDuty is another great tool which allows us to maximize the benefits of our EverybodyOps culture. Within PagerDuty we have a number of on-call rotations: one for our Production Primary role and one for the Secondary role, as well as another for monitoring test failures and a lesser-known one for those of us who monitor the temperature in the one small server room we do have. P. Diddy allows us to easily cover for each other using exceptions or by simply switching the Primary and Secondary roles on the fly if the Primary needs to go AFK for a while. P. Diddy allows each user to set their own personal escalation policy, which can include texting, calling, and emailing with a configurable number of re-tries and timeouts. Another nice touch is that the rotation calendars can be imported into our personal calendars to remind us of when we are up next. This all makes the on-call rotation run pretty flawlessly from an administrative perspective, with no gnarly configuration and management on our end. I must admit, I do have a personal habit of “Joaning” my secondary when I am on call… To properly “Joan” your secondary, you accidentally escalate an alert to them that you meant to resolve. (I blame the comma after “Resolv”!)

Google Apps

Like many companies of all sizes, we rely on Google for our email service. While some Sumos (like myself and Stefan) use mail clients to read our email, most Sumos are happy with the standard web interface from Google. We also heavily use internal groups for team communications, and we make good use of Google Docs for document authoring and sharing. (This blog post was written and communally edited using Google Docs. In fact, due to the impressive real-time collaboration, Stefan Zier is watching me add this bit in order to resolve his comment right now!) We use Google Calendar for our scheduling needs (and calendar-stalking exercises!). We also use Google Analytics to obsess over you.
Also, as Sumo Logic’s Director of Security (which makes me partially responsible for managing the users and groups in Google Apps), I appreciate the richness of their security settings, especially the two-factor authentication and mobile device policy management.

There’s more!

These are just some of our SaaS providers. In an upcoming post I’ll talk more about some of the services that help us support and bill our customers and test and develop our product. We have found that all of these providers deliver valuable and even crucial services that would be far more expensive and time-consuming for us to manage ourselves. We hope you may find some of them helpful too!
For the last two years, Sumo Logic has been (quietly) building a secure, massively scalable, multi-tenant data management and analytics platform in the cloud. For us at Sumo Logic, the Cloud is a concept we believe in and have internalized deeply into our culture, our processes and infrastructure. In our office we own only two equipment racks, and they are less than half full. The boxes there are our back-up server, some security gear, a VoIP box, and a single small server to provide network and AAA services for the LAN. We have ‘dogfooded’ not only our own product here (we make extensive use of our own product for troubleshooting and operations; see Stefan Zier’s series “Sumo on Sumo”), but the entire idea of the cloud itself. From our email and build environment, through our CRM and our product itself, we live in the Cloud. Through adopting best practices and developing some of our own, we operate there in a way that is designed to be secure, and I’d like to share some of the insights we’ve picked up along the way.

Of course, the “Cloud” is a nebulous term 😉 and here at Sumo Logic we use several different types of cloud-based services, which mostly break down into two categories: SaaS and IaaS. On the SaaS side, we have our email and CRM, testing, support and billing, as well as a number of services we use to monitor and alert on our service availability. On the IaaS side, we use AWS to host our build environment and its associated bug-tracker, wiki and code-repository. One of the many advantages to this model is exemplified by our build environment (Hudson). Hosting this in EC2 provides us with great flexibility in bringing up new build-slaves at peak times, such as before a major release or branch.

In general, the SaaS providers we use provide excellent security features.
For instance, we mandate the use of two-factor authentication and strong passwords for access to Sumo Logic email, and our provider has a rich variety of security controls and features, such as the two-factor authentication, that we can (and do) leverage. This is much simpler than keeping this level of security would be if we ran the whole mess ourselves 🙂

On the IaaS side, Stefan Zier has done an amazing job of setting us up in AWS. One example of how IaaS features can be leveraged is the way in which he handled access to our AWS-hosted resources. In addition to username and password authentication to our cloud-based services (more on that later), Amazon “security groups” are used to limit network-level access to these services to only certain IP addresses on a whitelist. In order to automate that whitelist, we make use of a dynamic DNS provider that assigns hostnames to authorized systems. Stefan wrote a program which polls for the addresses of those authorized hosts and updates the corresponding security group in AWS. We plan on getting this set up on AWS’ Virtual Private Cloud sometime soon, which will allow us to layer a VPN on top of this already very secure solution.

Another layer of protection we incorporate is anonymity. All of our cloud-based company infrastructure is attached to a domain which is not connected to Sumo Logic in any way. Similarly, we have used anonymized labels for our private git repositories, etc. The public cloud allows for some obscurity and anonymity, and we leverage that.

Of course, there is still work involved in keeping things secure. Living in the cloud means having a lot of accounts. A LOT of them. Our process for on-boarding and off-boarding employees requires the creation or deletion of a very large number of accounts and the adding or removing of a lot of tags, groups, lists and checkboxes. Having solid documented procedures for this is the only way to keep it straight and running smoothly.
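As an aside on the whitelist automation mentioned above (polling dynamic DNS hostnames and updating a security group), the core of such a program can be sketched in a few lines. This is not Stefan's actual program, which is not shown here; the function names `resolve_hosts` and `plan_update` are hypothetical, and the sketch assumes each authorized host is whitelisted as a single /32 address. It deliberately stops short of calling AWS: it only computes which addresses would need to be authorized or revoked.

```python
import socket
from typing import Callable, Iterable, Set, Tuple


def resolve_hosts(hostnames: Iterable[str],
                  resolver: Callable[[str], str] = socket.gethostbyname
                  ) -> Set[str]:
    """Resolve each authorized hostname to a /32 CIDR (assumed format)."""
    cidrs = set()
    for name in hostnames:
        try:
            cidrs.add(resolver(name) + "/32")
        except OSError:
            # An unresolvable host is left out, so any address it held
            # before will show up in the revoke set on this pass.
            continue
    return cidrs


def plan_update(desired: Set[str],
                current: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (to_authorize, to_revoke) so that current becomes desired."""
    return desired - current, current - desired
```

A polling loop would call `resolve_hosts` on a schedule, read the security group's current rules through whatever AWS client is in use, and apply the two sets returned by `plan_update`. Dropping hosts that fail to resolve is a fail-closed choice; a gentler variant might keep the last known address for a grace period.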
We also have to host our own LDAP server for AAA to some of our tools, and we also use this for our VPN authentication. So we have to manage that, and it is a pain. Centralized AAA and policy/group management services exist for cloud-based services, and we’ve looked at some. Unfortunately, none of them also supported hosting or managing an LDAP instance for us, and keeping that synced up and tied into the rest of the mess would be a killer feature. We certainly feel there is a gap in the market here that we wish somebody would fill.

From an end-user perspective, there are a lot of accounts to keep straight and a lot of passwords to remember. In order to make this both secure and manageable, we provide (and mandate the use of) a password management tool that runs on both Mac and Windows (and has a usable web interface for Linux and others) and also runs on Android and iPhone. It uses a cloud-based file-storage service to sync its encrypted password database between devices. This allows us to mandate that users have extremely strong passwords that are different for every account, and it gives our users the tools to actually comply with that rule 🙂

Of course, building a secure cloud-based service ourselves requires a lot of thought and engineering well beyond just leveraging our provider’s consoles. We have put a lot of thought into how to build a secure service leveraging IaaS, and we have written a paper about some of the design principles and practices we employ. If you are interested you can download it here.