
October 28, 2015 | By Stefan Zier

Change Management, the ChatOps Way

All changes to production environments at Sumo Logic follow a well-documented change management process. While generally a sound practice, it is also specifically required for PCI, SOC 2, HIPAA, ISO 27001 and CSA STAR compliance, among others. Traditional processes never seemed like a suitable way to implement change management at Sumo Logic. Even a Change Management Board (CMB) that meets daily is much too slow for our environment, where changes are implemented every day, at any time of the day. In this blog post, I’ll describe our current solution, which we have iterated towards over the past several years.

The goals for our change management process are:

  • Anybody can propose a change to the production system, at any time, and anybody can follow what changes are being proposed.
  • A well-known set of reviewers can quickly and efficiently review changes and decide on whether to implement them.
  • Any change to production needs to leave an audit trail to meet compliance requirements.

Workflow and Audit Trail

We use Atlassian JIRA to model the workflow for any System Change Request (SCR). Not only is JIRA a good tool for workflows, but we also use it for most of our other bug and project tracking, making it trivial to link to relevant bugs or issues. Here’s what the current workflow for a system change request looks like:

[Figure: workflow diagram for a system change request]

A typical system change request goes through these steps:

  1. Create the JIRA issue.
  2. Propose the system change request to the Change Management Board.
  3. Get three approvals from members of the Change Management Board.
  4. Implement the change.
  5. Close the JIRA issue.

If the CMB rejects the change request, we simply close the JIRA issue. The SCR type in JIRA has a number of custom fields, including:

  • Environments to which the change needs to be applied
  • Schedule date for the change
  • Justification for the change (for emergency changes only)
  • Risk assessment (Low/Medium/High)
  • Customer facing downtime?
  • Implementation steps, back-out steps and verification steps
  • CMB meeting notes
  • Names of CMB approvers

These details allow CMB members to quickly assess the risk and effects of a proposed change.
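The workflow and the fields above can be sketched as a small data model. This is only an illustration, not our JIRA configuration: the status names, the `SystemChangeRequest` class, and the selection of fields are assumptions made for the example, and only the three-approval rule and the "rejection closes the issue" rule come from the process described here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status names; the real JIRA workflow may use others.
    OPEN = "Open"
    PROPOSED = "Proposed"
    APPROVED = "Approved"
    IMPLEMENTED = "Implemented"
    CLOSED = "Closed"


# Allowed transitions, following the five steps listed earlier.
# A rejected request simply goes to CLOSED.
TRANSITIONS = {
    Status.OPEN: {Status.PROPOSED, Status.CLOSED},
    Status.PROPOSED: {Status.APPROVED, Status.CLOSED},
    Status.APPROVED: {Status.IMPLEMENTED, Status.CLOSED},
    Status.IMPLEMENTED: {Status.CLOSED},
    Status.CLOSED: set(),
}

REQUIRED_APPROVALS = 3


@dataclass
class SystemChangeRequest:
    key: str
    summary: str
    environments: list
    risk: str  # "Low", "Medium" or "High"
    customer_facing_downtime: bool
    status: Status = Status.OPEN
    approvers: set = field(default_factory=set)

    def transition(self, new_status):
        """Move to new_status, refusing moves the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"{self.status.value} -> {new_status.value} is not allowed"
            )
        self.status = new_status

    def approve(self, member):
        """Record one CMB approval; the third distinct approver approves the SCR."""
        if self.status is not Status.PROPOSED:
            raise ValueError("only proposed changes can be approved")
        self.approvers.add(member)  # a set, so repeat votes count once
        if len(self.approvers) >= REQUIRED_APPROVALS:
            self.transition(Status.APPROVED)
```

Storing approvers in a set means a CMB member who votes twice still counts as one approval.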

Getting to a decision quickly

To get from a proposal to an approved change in the most expedient manner, we have a dedicated #cmb-public channel in Slack. The typical sequence is:

  1. Somebody proposes a system change in the Slack channel, linking to the JIRA ticket.
  2. If needed, there is a brief discussion around the risk and details of the change.
  3. Three of the members of the CMB approve the change in JIRA.
  4. The requester or the on-call engineers implement the change and mark the SCR as implemented.

In the past, we manually tied together JIRA and Slack, without any direct integration. As a result, it often took a long time for SCRs to get approved, and there was a good amount of manual legwork to find the SCR in JIRA and see the details.

Bender to the rescue

To tie together the JIRA and Slack portions of this workflow, we built a plugin for our sumobot Slack bot. In our Slack instance, sumobot goes by the name of Bender Bending Rodriguez, named for the robot in Futurama. As engineers and CMB members interact with an SCR, Bender provides helpful details from JIRA. Here’s an example of an interaction:

[Figure: example Bender conversation]

As you can see, Bender listens for messages containing both the word “proposing” and a JIRA link. He then provides a helpful summary of the request. As people vote, he checks the status of the JIRA ticket, and once it moves into the Approved state, he lets the channel know.
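The trigger rule described above can be sketched in a few lines. This is an illustrative sketch rather than the plugin's actual code: the JIRA base URL, the `SCR` project key, and the function name `find_proposed_scr` are all assumptions made for the example.

```python
import re

# Hypothetical JIRA base URL and project key; substitute your own.
JIRA_LINK = re.compile(r"https://jira\.example\.com/browse/(SCR-\d+)")
# The trigger word, matched as a whole word and case-insensitively.
TRIGGER = re.compile(r"\bproposing\b", re.IGNORECASE)


def find_proposed_scr(message):
    """Return the issue key if the message proposes an SCR, else None.

    A message qualifies only if it contains both the trigger word
    and a link to a JIRA issue.
    """
    if not TRIGGER.search(message):
        return None
    match = JIRA_LINK.search(message)
    return match.group(1) if match else None
```

For example, `find_proposed_scr("Proposing https://jira.example.com/browse/SCR-123")` returns `"SCR-123"`, while a message with only the link or only the word returns `None`. The bot would then look up the returned key in JIRA to build its summary.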

Additionally, he posts a list of currently open SCRs into the channel three times a day, to remind CMB members of items they still need to decide on. The same list can also be manually requested by asking for “pending scrs” in the channel.

[Figure: Bender conversation listing pending SCRs]
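A minimal sketch of the reminder logic might look like the following. The reminder times, the message layout, and the function names are assumptions for illustration; the only details taken from the process above are the “pending scrs” request, the three daily reminders, and the three required approvals.

```python
from datetime import datetime, time, timedelta

# Assumed reminder times, in ascending order; the schedule only needs
# to fire three times a day.
REMINDER_TIMES = [time(9, 0), time(13, 0), time(17, 0)]


def is_pending_request(message):
    """True if a channel message asks for the list of open SCRs."""
    return "pending scrs" in message.lower()


def format_pending(scrs):
    """Render open SCRs, given as (key, summary, approvals) tuples."""
    if not scrs:
        return "No pending SCRs."
    lines = ["Pending SCRs:"]
    for key, summary, approvals in scrs:
        lines.append(f"- {key}: {summary} ({approvals}/3 approvals)")
    return "\n".join(lines)


def next_reminder(now):
    """Return the next scheduled reminder strictly after `now`."""
    for t in REMINDER_TIMES:
        candidate = datetime.combine(now.date(), t)
        if candidate > now:
            return candidate
    # All of today's reminders have passed; use tomorrow's first one.
    return datetime.combine(now.date() + timedelta(days=1), REMINDER_TIMES[0])
```

The same `format_pending` output serves both the scheduled reminder and the on-demand request, so the two paths cannot drift apart.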

Since this sumobot plugin is specific to our use case, I have decided not to include it in the open source repository, but I have made the current version of the source code available as part of this blog post here.


Stefan Zier

Stefan was Sumo’s first engineer and Chief Architect. He enjoys working on cloud plumbing and is plotting to automate his job fully, so he can spend all his time skiing in Tahoe.
