This essay explores AIOps and investigates if machine intelligence applies to IT operations (ITOps). I will dive into objection handling around artificial intelligence (AI) in pop culture and address the limitations around data sets and implicit bias coded into machines. Then, I will delve into what this means for ITOps and the ways AI-based parsing utilities can help operators and developers alike. How does Sumo Logic enable anomaly detection and identify threats? What are some examples of how we implement AIOps? What do we do in the world of OpenTelemetry? Read on to find out!
The irony: Why is artificial intelligence so dumb?
According to this blog post from Sumo Logic on AI vs. machine learning vs. deep learning, “AI systems use detailed algorithms to perform computing tasks much faster and more efficiently than human minds.” AI is therefore an umbrella term: it can include machine learning and deep learning, but it also covers systems that cannot rationalize or learn from data sets at all. Popular movies and media often use these three terms without clearly defining them, which leads to misunderstanding of the limitations and capabilities of the machines they describe.
We often think of AI as a machine that is autonomous or that can interact with other machines dynamically; something that can reason on its own or even carry on a conversation. In reality, even the simplest example of the human ability to rationalize and apply logic is quite complex, far beyond the pre-programmed tasks or algorithms that qualify as AI.
Moreover, data sets themselves can be biased and lead to catastrophic results in the lives of humans. Many examples of this come to mind, including automation around decisions such as loans, college admissions, sentencing and healthcare. Professor Meredith Broussard, a computer scientist and data journalist, touches upon this idea in her book, Artificial Unintelligence. She says that to understand artificial intelligence, we need to accept that “real AI is computational statistics on steroids.”
The issues embedded in AI tend to be overlooked, as the logic and data sets collected for these computations come from homogeneous groups. The naive application of machine intelligence methods creates mistakes, such as biased outcomes due to lower minority representation in data samples. Fortunately, there is a counterculture effort to disrupt problematic AI authority over people’s autonomy.
We need to consider the day-to-day ways that dumb, weak AI governs our lives. It is also nearly impossible to live our lives without engaging with tech giants and, as a result, losing track of who owns our individual data. Fortunately for those in the EU, Article 22 of the General Data Protection Regulation (GDPR), which took effect in 2018, restricts decisions about people’s lives that are made solely by automated processing: “The data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.” Unfortunately for those of us not living in this region, similar data privacy and protection acts do not necessarily apply. However, some are popping up as the general public becomes savvier. The California Privacy Rights Act (CPRA) of 2020 expands the California Consumer Privacy Act (CCPA) by allowing consumers greater control of their data and establishing the California Privacy Protection Agency.
We must not fall into the trap of chasing easier, faster or cheaper automated decision-making, given its tendency toward algorithmic bias. Providing more transparency into the computational logic behind decisions in insurance models and item pricing, for example, is a step in the right direction. Take the “magical” obfuscating elements away from terms like “artificial intelligence” and “machine learning” and instead lean into better communication about the mechanisms behind mathematical models and how they impact our lives and data privacy.
AI in ITOps
One definition of AIOps at Sumo Logic is the use of artificial intelligence, machine learning and pattern recognition to perform and automate tasks normally executed by IT operations. This means that the mechanisms behind procedurally generated policy lie with the implementation of AIOps. When many DevOps practitioners think of “AIOps,” they see a closed-source black box solution, an off-the-shelf algorithm into which they have no visibility.
Labeling a piece of intellectual property as “AIOps” obfuscates the complexity of the underlying functionality.
Discussing machine intelligence in the context of IT operations raises eyebrows. As a result, many IT and DevOps practitioners are more comfortable using entirely different terms for “AIOps.” I think it is more accurate to describe implementations of “AIOps” as trigger-based response algorithms that can start subroutines and react based on criteria that humans set up ahead of time. Often, the intelligence the machine provides DevOps practitioners comes from parsing through large data sets and flagging data according to predetermined factors in the code base.
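To make “trigger-based” concrete, here is a minimal Python sketch of flagging log lines against criteria a human wrote ahead of time. The rule names, patterns, severities and sample log lines are all hypothetical, invented for illustration; this is not how any particular product implements its detection.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A human-authored criterion: flag any line matching `pattern`."""
    name: str
    pattern: re.Pattern
    severity: str


# Hypothetical rules, written ahead of time by an operator.
RULES = [
    Rule("failed_login", re.compile(r"authentication failed", re.IGNORECASE), "warning"),
    Rule("server_error", re.compile(r"\bHTTP 5\d\d\b"), "critical"),
]


def flag_lines(lines, rules=RULES):
    """Return (rule name, severity, line) for every rule a line matches."""
    flagged = []
    for line in lines:
        for rule in rules:
            if rule.pattern.search(line):
                flagged.append((rule.name, rule.severity, line))
    return flagged


# Hypothetical sample input.
sample_logs = [
    "2024-01-01T03:00:01Z GET /health HTTP 200",
    "2024-01-01T03:00:02Z POST /login authentication failed for user=alice",
    "2024-01-01T03:00:03Z GET /checkout HTTP 503",
]

flags = flag_lines(sample_logs)
for name, severity, line in flags:
    print(f"{severity:8} {name:13} {line}")
```

Nothing here is opaque: every flag traces back to a named rule that someone can read, question and change.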
Muddying the logic behind AIOps can be troubling to IT practitioners and developers who want to understand everything about how their system operates. They speak the language of bots, daemons, subroutines, events and triggers. Perhaps, collectively, academics can call it “Artificial Intelligence.” However, when companies call this “artificial intelligence,” it implies to these same developers that the solution is not open source. Moreover, the media and TV tend to portray AI as something that gets to be so complex that even the machine's creator does not fully understand its capabilities. “AI,” as a term, makes things more vague, as it removes specificity and obfuscates the interconnected disparate processes involved in producing a result.
Mathematical pursuits in the field of DevOps are at the core of the logic behind what we call “AIOps.” Technical chauvinism is responsible for the idea that AIOps is somehow a superior, magic black box that is beyond the ability of humans to understand. This concept is touched on in Human Compatible: Artificial Intelligence and the Problem of Control by Stuart Russell. He poses the question of whether the computer system is a tool of humans or if humans have become the tools of the computer system, supplying the information and fixing bugs when necessary but no longer understanding how the whole thing is working. Russell references a computer glitch on April 3, 2018 that caused 15,000 flights in Europe to be delayed or canceled, and another in 2010, when a trading algorithm caused the infamous “flash crash” on the NYSE, wiping out one trillion dollars and shutting down the exchange.
What happened in these incidents is still not well understood or discussed in detail, but each was likely a failure of the AIOps pipeline of logic. If the humans operating the exchange or the airport terminals had understood the underlying operational mechanisms that led to these incidents, response times could have been significantly reduced. Moreover, if we retain sufficient understanding of the technological systems we use, we will be able to retain autonomy. AI could be an enhancement of our capabilities, not a baffling hindrance. DevOps engineers, site reliability managers and security analysts need to understand the underlying infrastructure, from the data collected to the logic used in decision-making processes.
AI-based parsing utilities can help quickly identify threats, particularly if you must address incident response scenarios late at night.
>>Learn more: “Troubleshooting outages at 3 AM with Alert Response” details our troubleshooting and response capabilities.
Essentially, you create scripts to automate incident response processes via webhook connections. When an alert is raised, your system opens a ServiceNow ticket with enriched data on the incident. Monitoring logs track the incidents raised and the watchdog cycle begins anew, even while you sleep soundly. Best of all, the logic in these scripts is widely understood (assuming you don’t suck at documenting code).
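As a rough illustration of that flow, the Python sketch below turns a webhook body into an enriched ticket payload. The alert field names, the ticket field names and the runbook URL are assumptions made up for this example; they are not the Sumo Logic webhook schema or the ServiceNow API, and the step that actually submits the ticket is deliberately left out.

```python
import json
from datetime import datetime, timezone


def build_ticket(alert: dict, runbook_urls: dict) -> dict:
    """Turn a raw alert into a ticket payload enriched with context.

    All field names here are illustrative assumptions, not a vendor schema.
    """
    service = alert.get("service", "unknown")
    severity = alert.get("severity", "unknown").upper()
    name = alert.get("name", "unnamed alert")
    return {
        "short_description": f"[{severity}] {name} on {service}",
        # Keep the full alert so the on-call engineer sees what the script saw.
        "description": json.dumps(alert, sort_keys=True),
        "service": service,
        "runbook": runbook_urls.get(service, "no runbook registered"),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }


def handle_webhook(body: bytes, runbook_urls: dict) -> dict:
    """Entry point a webhook receiver would call with the request body."""
    alert = json.loads(body)
    return build_ticket(alert, runbook_urls)


# Hypothetical alert body and runbook registry.
example_body = b'{"name": "High error rate", "service": "checkout", "severity": "critical"}'
runbooks = {"checkout": "https://wiki.example.com/runbooks/checkout"}

ticket = handle_webhook(example_body, runbooks)
print(ticket["short_description"])
```

Because the enrichment is a plain function, anyone on the team can read exactly what ends up in the ticket and why.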
Applying data collection best practices to inform AIOps
Sumo Logic is committed to data collection best practices. We know that a vendor-neutral format is important for observability and for approximating system-wide health. Your machine intelligence is only as good as the data you collect via telemetry pipelines. A telemetry pipeline routes data (logs, metrics and traces) from where it is generated to wherever it needs to go, often to disparate backends. This also intersects with how that routing is expressed via monitoring. Telemetry pipelines filter this data and enrich it with metadata that the various backends can use, such as Kubernetes (K8s) container information, region information, GeoIP information and other logging and trace data.
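The filter, enrich and route stages described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the record shape, the backend names and the metadata keys are hypothetical, and a real pipeline (an OpenTelemetry Collector, for instance) is configured rather than hand-written like this.

```python
from collections import defaultdict

# Hypothetical mapping from signal type to backend name.
BACKEND_BY_TYPE = {
    "log": "log_backend",
    "metric": "metric_backend",
    "trace": "trace_backend",
}


def enrich(record: dict, metadata: dict) -> dict:
    """Attach shared metadata; values already on the record win."""
    return {**metadata, **record}


def route(record: dict) -> str:
    """Pick a backend from the record's signal type."""
    return BACKEND_BY_TYPE.get(record.get("type"), "dead_letter")


def run_pipeline(records, metadata, keep=lambda record: True):
    """Filter, enrich and route records; return {backend: [records]}."""
    backends = defaultdict(list)
    for record in records:
        if not keep(record):
            continue  # filtered out before it reaches any backend
        backends[route(record)].append(enrich(record, metadata))
    return dict(backends)


# Hypothetical records and metadata.
records = [
    {"type": "log", "body": "checkout failed", "level": "error"},
    {"type": "log", "body": "heartbeat", "level": "debug"},
    {"type": "metric", "name": "http.requests", "value": 42},
]
metadata = {"k8s.pod.name": "checkout-7d9f", "region": "eu-west-1"}

routed = run_pipeline(records, metadata, keep=lambda r: r.get("level") != "debug")
print(sorted(routed))
```

The debug heartbeat is dropped, and each surviving record arrives at its backend carrying the pod and region it came from.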
The best emerging standard for this is OpenTelemetry, or OTel, which originated from unifying the OpenCensus and OpenTracing projects under Cloud Native Computing Foundation (CNCF) governance. This open standard lets you focus more on analytics by unifying data collection. CNCF helps prevent the problem identified by the following XKCD comic from emerging in the AIOps space.
As touched upon earlier, AIOps in practice must work around compliance and data protection, especially when dealing with information that could be personally identifiable. How does this relate to data collection or OTel? The answer is that it becomes impossible to approximate system-wide health across disparate backends without a way to unify collection methods. You don’t want to store user data in the same place as you run a video game. User IP addresses, API keys and passwords should be stored separately from one another. GDPR, COPPA and FedRAMP requirements lead to disparate tracing and routing to backends. In gaming, teams will often be working with personally identifiable information for children; applications need to log their data in a way that is compliant with COPPA (Children’s Online Privacy Protection Act). In this use case, there are limits on where telemetry data is stored, who has access to it and how long it is stored before it must be deleted. The same applies to social media platforms, streaming services and a wide variety of other applications.
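Two of those limits, what gets stored and for how long, can be illustrated with a short Python sketch. The patterns below are deliberately simplistic and the 30-day retention window is a hypothetical value chosen for the example, not a statement of what COPPA, GDPR or FedRAMP actually requires.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplistic, illustrative patterns; real PII detection needs far more care.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SECRET = re.compile(r"\b(api_key|password)=\S+", re.IGNORECASE)


def redact(message: str) -> str:
    """Mask IPv4 addresses and key=value secrets before a log line is exported."""
    message = IPV4.sub("[REDACTED_IP]", message)
    return SECRET.sub(lambda m: f"{m.group(1)}=[REDACTED]", message)


def is_expired(stored_at: datetime, retention_days: int, now=None) -> bool:
    """True when a record has outlived its (hypothetical) retention window."""
    now = now or datetime.now(timezone.utc)
    return now - stored_at > timedelta(days=retention_days)


clean = redact("login from 203.0.113.7 api_key=abc123")
print(clean)

stored = datetime(2024, 1, 1, tzinfo=timezone.utc)
check_time = datetime(2024, 2, 15, tzinfo=timezone.utc)
print(is_expired(stored, retention_days=30, now=check_time))
```

Redacting before export means the sensitive values never reach a backend that is not cleared to hold them, and the retention check gives a deletion job a single, readable rule to apply.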
OTel allows for format standardization and the ability to use the same agent to collect logs, metrics and traces. Unifying data collection lets you focus more on analytics. This is why OTel is core to Sumo Logic’s collection strategy.
There is the potential for AI and its underlying applications to help us work smarter and do more with the preponderance of data at our fingertips. But it cannot remain a black box. Technical teams need to understand the data and algorithms that drive action so that AI lives up to the potential of its name. In this way, teams can build intelligent AI solutions that enhance efficiency and respect the privacy and diversity of individuals who may shape or use them.