April 9, 2024 By Bashyam Anant, David Andrzejewski and Aaishwarya Bansal

Crossing the machine learning pilot to product chasm through MLOps

Numerous companies keep launching AI/ML features, particularly “ChatGPT for XYZ”-style productizations. Given the buzz around large language models (LLMs), consumers and executives alike are coming to assume that building AI/ML-based products and features is easy; LLMs can appear magical as users experiment with them.

However, in our experience, building production-grade AI/ML features requires forming hypotheses and framing the problem, curating training datasets, defining success metrics, choosing models and algorithms, DevOps and, most importantly, doing all of these activities at scale in a performant, reliable and cost-efficient manner.

Let us walk you through an MLOps framework that has helped Sumo Logic deliver several recent AI/ML-based features, using our AI-driven alerting capability as the illustrative example. We are also making our MLOps Observability dashboards available; they are designed to work with log telemetry from your MLOps stack.

MLOps framework for production-grade features

[Figure: MLOps framework]

Gathering customer requirements and success metrics

Our MLOps framework begins with understanding the customer problem we want to solve and the outcomes we want to deliver. In the AI-driven alerting example, our customers reported receiving alerts from their modern app stacks at rates higher than they could attend to. Such noisy monitors are common: 5% of monitors used by Sumo Logic customers trigger more than four times per day, 60-90% of those triggers are false alarms, and, anecdotally, half of these alerts fire outside normal work hours. In short, for AI-driven alerting, our success criterion was to cut the noise from such monitors by 50% (i.e. no more than two triggers per day for the top 5% of monitors by detection rate).
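To make this concrete, here is a minimal sketch of how a noisy-monitor criterion like this could be measured from alert telemetry. The table layout (monitor_id, triggered_at columns) and the four-triggers-per-day threshold are illustrative assumptions, not Sumo Logic's internal schema.

```python
# Illustrative sketch: flag "noisy" monitors from a table of alert events.
# Column names (monitor_id, triggered_at) and the 4/day threshold are
# assumptions for this example, not Sumo Logic's internal schema.
import pandas as pd

def noisy_monitors(alerts: pd.DataFrame, max_triggers_per_day: int = 4) -> pd.Series:
    """Return monitors whose average daily trigger rate exceeds the threshold."""
    alerts = alerts.copy()
    alerts["day"] = pd.to_datetime(alerts["triggered_at"]).dt.date
    daily = alerts.groupby(["monitor_id", "day"]).size()      # triggers per monitor per day
    avg_daily = daily.groupby(level="monitor_id").mean()      # average daily trigger rate
    return avg_daily[avg_daily > max_triggers_per_day]

# Example usage with toy data
alerts = pd.DataFrame({
    "monitor_id": ["checkout-errors"] * 10 + ["cpu-high"] * 2,
    "triggered_at": pd.date_range("2024-04-01", periods=10, freq="2h").tolist()
                    + pd.date_range("2024-04-01", periods=2, freq="12h").tolist(),
})
print(noisy_monitors(alerts))  # "checkout-errors" fires >4 times/day and is flagged
```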

While the success criterion was defined in aggregate customer terms, such a metric is unsuitable for continuous model evaluation in our MLOps framework. As a result, we came up with a proxy for it, which we describe later.

As part of understanding the customer problem, we also suggest envisioning the ideal user experience. For AI-driven alerting, we assessed very early on that customers’ on-call engineers wanted to set up monitors without spending much time tweaking monitor thresholds or the parameters of an AI-driven alerting feature. As a result, we looked for an approach that could support a “set it and forget it” experience with minimal user input.

Collect data

Next, you must identify the data sources relevant to your problem. For AI-driven alerting, customer logs (and eventually, time series) from their apps and infrastructure stacks were our focus.

Just as importantly, at this stage you also want to preprocess the data to account for data aggregation, data privacy and compliance concerns (e.g. data minimization), data augmentation (e.g. joining datasets based on downstream modeling requirements) and other factors. For AI-driven alerting, we only require a time-sliced signal from the customer’s logs (e.g. latency or throughput at five-minute intervals); this aggregation is defined by the user as part of authoring a monitor. The aggregation also ensures that we do not collect raw logs and that only the signal of interest is collected (rather than everything else mentioned in the logs).
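As a rough illustration of this kind of time-sliced aggregation, the sketch below reduces parsed log records to five-minute latency aggregates; the field names (timestamp, latency_ms) are assumptions made for the example.

```python
# Sketch: reduce parsed log records to a time-sliced signal (5-minute latency
# aggregates), so the raw logs themselves never need to be collected.
# Field names (timestamp, latency_ms) are illustrative assumptions.
import pandas as pd

def time_slice(records: pd.DataFrame, interval: str = "5min") -> pd.Series:
    """Aggregate a latency field into fixed-width windows (mean per window)."""
    records = records.set_index(pd.to_datetime(records["timestamp"]))
    return records["latency_ms"].resample(interval).mean()

# Example: three hours of synthetic per-request latencies
raw = pd.DataFrame({
    "timestamp": pd.date_range("2024-04-01", periods=180, freq="1min"),
    "latency_ms": 100 + pd.Series(range(180)) % 7,
})
signal = time_slice(raw)
print(signal.head())  # one aggregated value per 5-minute window
```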

Another consideration at this stage is figuring out how much data you need, based on model requirements, to strike a good balance of model accuracy, cost and performance. For AI-driven alerting, we determined that 60 days of time-sliced logs were more than adequate for most of our customer use cases. You also have to set up data pipelines to automate data collection, which in turn require their own monitoring and operations.

Building the model

The model, with its train-test-predict loop, is the heart of any AI/ML feature. Understanding the archetype of the problem and formulating the best-fit model for it is a crucial challenge. For AI-driven alerting, while the user can specify the signal of interest, their challenge is figuring out what makes a particular value of that signal unusual and worthy of an alert in the middle of the night.

AI-driven alerting uses a forecasting approach: the model predicts the expected value of the signal in a given detection window (say, five minutes) in order to detect unusual data points. This is augmented with additional context provided by the user, such as the percentage of data points in a detection window that must be unusual before an alert is triggered. At this stage, you also have to translate the original success criteria into a measurement that can be evaluated continuously; for our example, we chose the prediction error in a given detection window.
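The sketch below illustrates the forecast-and-compare pattern and the two ideas above: alerting only when a user-specified fraction of points in the detection window is unusual, and measuring prediction error over the window as a continuously evaluable proxy metric. The naive same-time-yesterday forecast and the tolerance value are stand-ins, not the production model.

```python
# Simplified sketch of forecast-based alerting: predict the expected value of
# the signal, mark points whose deviation exceeds a tolerance, and alert only
# if the fraction of unusual points in the detection window crosses a
# user-supplied threshold. The naive seasonal forecast and the tolerance are
# illustrative stand-ins for the production model.
import numpy as np

def naive_forecast(history: np.ndarray, horizon: int, period: int = 288) -> np.ndarray:
    """Predict each future point as the value one seasonal period earlier
    (288 five-minute slices = one day)."""
    return history[-period:][:horizon]

def should_alert(actual: np.ndarray, predicted: np.ndarray,
                 tolerance: float, max_unusual_fraction: float) -> bool:
    """Alert if too many points in the detection window deviate from the forecast."""
    unusual = np.abs(actual - predicted) > tolerance
    return unusual.mean() > max_unusual_fraction

def prediction_error(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Mean absolute error over the detection window: a proxy metric that can
    be evaluated continuously."""
    return float(np.mean(np.abs(actual - predicted)))

# Example: one day of history, a one-hour detection window (12 five-minute slices)
history = 100 + 10 * np.sin(np.linspace(0, 2 * np.pi, 288))
window = history[:12] + np.array([0] * 9 + [40, 45, 50])  # last 3 points spike
forecast = naive_forecast(history, horizon=12)
print(should_alert(window, forecast, tolerance=15, max_unusual_fraction=0.2))  # True
print(prediction_error(window, forecast))
```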

In addition, we run the train-test-evaluate process in a closed loop to pick the best model hyperparameters. These automated machine learning (AutoML) capabilities cut down the time-consuming, iterative process behind machine learning model development and are key to achieving the “set it and forget it” experience for users. A crucial and time-consuming aspect of modeling is figuring out the model’s failure modes: how does it handle incomplete or sparse data (e.g. a monitor set on an error signal that only appears intermittently), what happens when AutoML cannot converge on a well-tuned model in a reasonable amount of time, and so on. In our experience, previewing AI/ML features with customers and exposing the work-in-progress models to numerous diverse customer datasets is the only reliable way to iron out these failure modes.
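The closed loop can be pictured as a search over candidate hyperparameters scored on a held-out slice of the signal, with a budget guard for the case where tuning does not converge in reasonable time. The simple exponential-smoothing model and the candidate grid below are illustrative stand-ins for the production AutoML.

```python
# Sketch of a closed train-test-evaluate loop: try candidate hyperparameter
# settings, score each on a held-out slice of the signal, and keep the best
# within a time budget. The model and candidate grid are illustrative.
import time
import numpy as np

def fit_predict(train: np.ndarray, horizon: int, alpha: float) -> np.ndarray:
    """Exponentially smoothed level, held flat over the forecast horizon."""
    level = train[0]
    for x in train[1:]:
        level = alpha * x + (1 - alpha) * level
    return np.full(horizon, level)

def select_model(signal: np.ndarray, holdout: int = 12, budget_seconds: float = 5.0):
    train, test = signal[:-holdout], signal[-holdout:]
    best = None
    deadline = time.monotonic() + budget_seconds        # guard against non-convergence
    for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):             # candidate hyperparameters
        if time.monotonic() > deadline:
            break                                       # fall back to best-so-far
        error = float(np.mean(np.abs(test - fit_predict(train, holdout, alpha))))
        if best is None or error < best[1]:
            best = (alpha, error)
    return best

signal = 100 + 10 * np.sin(np.linspace(0, 8 * np.pi, 400))
print(select_model(signal))  # (best alpha, held-out mean absolute error)
```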

While LLMs may seem like a discontinuous break from prior ML technologies, similar practices can be adapted to help deploy and manage product features powered by LLMs. Ultimately, the modeling phase involves identifying a step in the customer experience where classic deterministic software cannot solve the target problem, but an ML/AI approach (whether Classical ML or LLMs) based on patterns or regularities in the data can deliver the desired results. As before, this problem formulation and framing exercise is foundational to the feature development process and heavily influences all subsequent design decisions.

The dataset and training aspects may manifest differently in the LLM setting, depending on whether you are doing fine-tuning (actually updating model weights, as in Classical ML) or in-context learning (dynamically supplying training data in the LLM prompt at query time). In either case, it is still necessary to test the performance of a given configuration, which may be considered to include the base model, the prompt itself, as well as the specific learning procedure and hyperparameters (when fine-tuning) or the training instance population and selection procedure (for in-context learning).
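One way to picture this is to treat the configuration as a unit under test and score it against a small labeled set. In the sketch below, the prompt template, the word-overlap example selection, and the llm callable (stubbed so the example runs) are all assumptions made for illustration.

```python
# Sketch: treat an LLM "configuration" (base model + prompt template +
# in-context example selection) as a unit under test and score it on a labeled
# set. The llm callable, prompt template, and word-overlap selection are
# illustrative assumptions.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)

def select_examples(query: str, pool: List[Example], k: int = 2) -> List[Example]:
    """Pick the k pool examples sharing the most words with the query."""
    def overlap(ex: Example) -> int:
        return len(set(query.lower().split()) & set(ex[0].lower().split()))
    return sorted(pool, key=overlap, reverse=True)[:k]

def build_prompt(query: str, examples: List[Example]) -> str:
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nLabel:"

def evaluate(llm: Callable[[str], str], pool: List[Example],
             test_set: List[Example]) -> float:
    """Accuracy of the configuration (prompt + selection procedure) on test_set."""
    correct = 0
    for query, expected in test_set:
        prompt = build_prompt(query, select_examples(query, pool))
        if llm(prompt).strip() == expected:
            correct += 1
    return correct / len(test_set)

# A stub LLM so the sketch runs end-to-end; swap in a real client in practice.
def stub_llm(prompt: str) -> str:
    last_input = prompt.rstrip().rsplit("Input:", 1)[-1]
    return "error" if "timeout" in last_input.lower() else "ok"

pool = [("request timeout talking to db", "error"), ("request served", "ok")]
tests = [("upstream timeout on checkout", "error"), ("healthy response", "ok")]
print(evaluate(stub_llm, pool, tests))  # accuracy of this configuration
```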

Deploy and Observe

Finally, once a model has been tested and validated, you must determine how to expose it to consumers, deploy it via your CI/CD process and observe it for accuracy, reliability and cost. At Sumo Logic, ML model predictions are served internally to other microservices via an API endpoint. The early stages of a rollout are also a good time to collect cost metrics, compare them with earlier estimates for the AI/ML feature and construct SLO targets for ML-as-a-service endpoints.
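As a minimal illustration of this last step, the sketch below serves a placeholder model behind an HTTP endpoint with Flask and emits one structured log line per request (status, latency), the kind of telemetry that can feed observability dashboards and SLO targets. The route, log fields, and model are assumptions, not Sumo Logic's actual service.

```python
# Minimal sketch: serve model predictions over an internal HTTP endpoint and
# emit one structured log line per request that an observability pipeline can
# ingest for reliability/cost dashboards and SLO tracking. The route, fields,
# and placeholder model are illustrative assumptions.
import json
import logging
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("ml_service")

def predict(signal):
    """Placeholder model: forecast the next value as the mean of the window."""
    return sum(signal) / len(signal)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    start = time.monotonic()
    payload = request.get_json(force=True)
    try:
        body, status = {"forecast": predict(payload["signal"])}, 200
    except Exception as exc:  # log and surface the failure
        body, status = {"error": str(exc)}, 500
    # Structured telemetry per request; latency and error rate feed SLO targets.
    log.info(json.dumps({
        "endpoint": "/predict",
        "status": status,
        "latency_ms": round((time.monotonic() - start) * 1000, 2),
    }))
    return jsonify(body), status

if __name__ == "__main__":
    app.run(port=8080)
```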

Bashyam Anant, David Andrzejewski and Aaishwarya Bansal

Sr Director, Advanced Analytics | Director, Engineering | Software Engineer II
