Posts by Joan Pepin
LogReduce vs Shellshock
Don't Just Move your Data-Center (Part I)
A couple of weeks ago I gave a cool little web presentation about cloud security best practices and design principles. (I say cool because I like doing those a good deal more than I like sitting at my desk, and I say little because I went for 33 minutes and I know I could have gone for 90. I will be giving this talk again, BTW, on January 9th for the Amazon Web Services Ecosystem.) I got a pretty good question from one of the viewers, who wanted to know: “What mistakes do people make when they utilize cloud-based infrastructure providers?” I thought that was an excellent question, and since not all of you were there to hear my answer, I’m here to share it and expound on it a little bit.
In my opinion, the biggest mistake you can make in adopting an Infrastructure as a Service (IaaS) provider is to just move your data-center into the cloud wholesale, in basically the same shape it is already in. Now, certainly I understand this temptation! You have probably spent a lot of time creating scripts and setting up access controls and logging mechanisms, and everything else that comes with building your deployment in a traditional (what I call ‘data-center-centric’) way. And so it may seem that the best, fastest, easiest and cheapest thing to do is to simply pick it up and move it, as it were, to your new “hosting provider”. But this approach may well leave you missing out on some of the best reasons to run in the cloud.
IaaS providers, such as Amazon Web Services, offer a multitude of services and features that can vastly improve your operational efficiency, scalability, and security, but they must be properly leveraged. Cloud computing, while similar in some respects to hosting, is an entirely different paradigm, and in order to take full advantage of its benefits, some time and care needs to be taken in the design phase of such a project.
I like to compare the differences between cloud and data-center configurations in terms of two types of gambling/entertainment most of you will be familiar with: playing Three Card Monte on the street (your data-center) and going to gamble in a major casino (the cloud). In a Three Card Monte scenario, the ‘house’ makes its money by keeping a tight level of control over the game. They know exactly which card the token is under at all times, they can palm or move the token at will, and they will have one or more shills in the audience to help them control the crowd’s reactions. This can be a very profitable endeavor for the ‘house’, but it is not scalable to large crowds, multiple dealers, or environments where there is a high degree of scrutiny.
In contrast, a casino is designed to achieve the same ends (to take your money and provide some entertainment in the process) but does so in a very different way. The casino relies on statistics in order to win over any given day. The casino can’t control (due to regulators) exactly which slot machines will pay out, how much, or when, nor can they control which blackjack dealers will have good or bad nights, and they cannot ‘fix’ the roulette wheel, and yet the house always wins. This model is scalable to large crowds, multiple dealers and games, and even high degrees of scrutiny. It is through exercising control at a higher level and giving up control at the lower level that they are able to achieve this scalability and profitability.
It works the same in the cloud. Rather than having precise control over your hardware and network connections, you exercise control at the design level by creating feedback loops and auto-scaling triggers, and by catching and reacting to exceptions. This allows you, much the same as the casino, to give up control over many of the details and still ensure you always win at the end of the day.
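To make the idea of control at the design level a little more concrete, here is a minimal sketch of an auto-scaling trigger written as a feedback rule. It is not tied to any provider's API: the `ScalingPolicy` fields, the thresholds, and both function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy values; real thresholds depend on the workload."""
    min_instances: int = 2
    max_instances: int = 20
    scale_out_above: float = 0.75  # average CPU fraction that triggers growth
    scale_in_below: float = 0.25   # average CPU fraction that triggers shrinking


def desired_instances(current: int, avg_cpu: float,
                      policy: ScalingPolicy = ScalingPolicy()) -> int:
    """One step of the feedback loop: observe the load, adjust the fleet size."""
    if avg_cpu > policy.scale_out_above:
        target = current + 1
    elif avg_cpu < policy.scale_in_below:
        target = current - 1
    else:
        target = current
    # Clamp so the rule can never shrink below the floor or grow past the cap.
    return max(policy.min_instances, min(policy.max_instances, target))


def replacements_needed(healthy: int,
                        policy: ScalingPolicy = ScalingPolicy()) -> int:
    """React to failures: how many instances to launch to get back to the floor."""
    return max(0, policy.min_instances - healthy)
```

Nothing in the rule cares which machine is busy or which one failed. Like the casino, it only looks at the aggregate and corrects toward a target.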
So just as it would be impractical to set up a Three Card Monte table in a modern casino, simply hauling your existing design into the cloud is not the best approach. Take the time to re-design your system to utilize all of the great advantages that IaaS providers such as Amazon provide. Tune in later for Part II.
Some of Our Essential Service Providers
As I mentioned in one of my previous posts, here at Sumo Logic we believe cloud-based services provide excellent value due to their ease of setup, convenience and scalability, and we leverage them extensively to provide internal services that would be far more time-, labor- and cash-intensive to manage ourselves. Today I’m going to talk about some of the services we use for collaboration, operations and IT, why we use them, and how they simplify our lives.
Campfire
Campfire is a huge part of our productivity and culture at Sumo Logic. While I would lump this and Skype together under something like “Managed Corporate Messaging”, they fill two very different niches in our environment.
Campfire from 37 Signals is a fantastic tool for group conversations. Using the Campfire service, we have set up multiple chat rooms for various types of issues, including Production Issues, Development Issues, Sales/Customer-Support Issues, and of course, a free-for-all chat-room where we try to make one another spontaneously erupt into chaotic LOLs.
These group-chats provide a critical space where we can work together to troubleshoot and solve problems cooperatively. Campfire makes it very easy to upload pictures and share large amounts of information in real-time with co-workers who can be anywhere. The conversations are all archived for later reference, which allows us to use the Production Incidents room as a 24×7 conference call and canonical forum of record for anything happening to production systems. Our Production on-call devs are expected to echo their actions into the Production channel and keep up with events there as they transpire.
Campfire also has a cool feature which allows you to start a voice conference with participants if needed, which is a great option in certain situations. These calls can also be archived for later reference.
One downside to the text and audio archives is that they are not easily searchable, so it helps to know approximately when something happened, and we have found it necessary to consult other records to determine where to look.
Skype
Skype is, of course, the very popular IM and VoIP service that was purchased by Microsoft a while back. We use Skype extensively for 1:1 chatting and easy and secure file-transfers throughout the company. We also make extensive use of the wide array of available emoticons. (Stefan Zier is a particularly prolific and artistic user of these.)
We also use Skype video chat for interviews and to collaborate with team members abroad. We have a conference room with a TV and Skype camera just for this application.
Cloudkick
Running a large-scale cloud-based service requires a lot of operational awareness. One of the ways we achieve this is through Cloudkick. Cloudkick was recently acquired by Rackspace and is evolving into a Cloud Monitoring tool. We are still on the legacy Cloudkick service, which we have come to use heavily.
We automatically install Cloudkick agents on all of our production instances and use them to collect a wide array of status codes from the OS and through JMX, as well as by running our own custom scripts, which we use to check for the existence of critical processes and to detect if things like HPROF files exist.
The Cloudkick website has a “show only failures” mode which we call the “What’s Wrong? Page”. This is a very helpful tool that allows our EverybodyOps team to quickly assess issues with our production environment.
PagerDuty
Of course, we also need to be proactively alerted to failures and crossed thresholds that could indicate trouble, and for this we rely on PagerDuty. (Affectionately known as P. Diddy to many of us, a nickname coined by Christian.) PagerDuty is another great tool which allows us to maximize the benefits of our EverybodyOps culture.
Within PagerDuty we have a number of on-call rotations.
There is one for our Production Primary role and one for the Secondary role, as well as another role for monitoring test failures and a lesser-known role for those of us who monitor the temperature in the one small server room we do have. P. Diddy allows us to easily cover for each other using exceptions or by simply switching the Primary and Secondary roles on the fly if the Primary needs to go AFK for a while.
P. Diddy allows each user to set their own personal escalation policy, which can include texting, calling, and emailing with a configurable number of re-tries and timeouts. Another nice touch is that the rotation calendars can be imported into our personal calendars to remind us of when we are up next. This all makes the on-call rotation run pretty flawlessly from an administrative perspective, with no gnarly configuration and management on our end.
I must admit, I do have a personal habit of “Joaning” my secondary when I am on call… To properly “Joan” your secondary, you accidentally escalate an alert to them that you meant to resolve. (I blame the comma after “Resolv”!)
Google Apps
Like many companies of all sizes, we rely on Google for our email service. While some Sumos (like myself and Stefan) use mail clients to read our email, most Sumos are happy with the standard web interface from Google. We also heavily use internal groups for team communications.
We also make good use of Google Docs for document authoring and sharing. (This blog post was written and communally edited using Google Docs. In fact, due to the impressive real-time collaboration, Stefan Zier is watching me add this bit in order to resolve his comment right now!)
We use Google Calendar for our scheduling needs (and calendar-stalking exercises!).
We also use Google Analytics to obsess over you.
Also, as Sumo Logic’s Director of Security (which makes me partially responsible for managing the users and groups in Google Apps), I appreciate the richness of their security settings, and especially the two-factor authentication and mobile device policy management.
There’s more!
These are just some of our SaaS providers. In an upcoming post I’ll talk more about some of the services that help us support and bill our customers and test and develop our product.
We have found all of these providers deliver valuable and even crucial services that it would be far more expensive and time-consuming for us to manage ourselves. We hope you may find some of them helpful too!
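As a postscript to the Cloudkick section above: a custom check of the kind described there (are the critical processes running, and have any HPROF heap dumps appeared?) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Sumo Logic's actual scripts and not the Cloudkick plugin format. All function names are hypothetical, and the `("ok" | "err", problems)` return value is a placeholder for whatever output the monitoring agent expects.

```python
import subprocess
from pathlib import Path


def find_hprof_files(directory):
    """Return heap-dump files (*.hprof) found directly under `directory`, sorted."""
    return sorted(Path(directory).glob("*.hprof"))


def running_process_names():
    """Names of running processes as reported by POSIX `ps`.

    Assumption: `ps -A -o comm=` is available; the exact form of the names
    (short name vs. full path) varies by platform.
    """
    result = subprocess.run(["ps", "-A", "-o", "comm="],
                            capture_output=True, text=True, check=True)
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_processes(required, running):
    """Required process names that do not appear in the running set."""
    return sorted(set(required) - set(running))


def check(required, running, dump_dir):
    """Combine both checks into a (status, problems) pair."""
    problems = [f"process not running: {name}"
                for name in missing_processes(required, running)]
    problems += [f"heap dump present: {path.name}"
                 for path in find_hprof_files(dump_dir)]
    return ("ok" if not problems else "err"), problems
```

On a real host this might be called as `check(["java"], running_process_names(), "/path/to/dumps")`, where both the process name and the directory are placeholders.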