
September 12, 2019 By Himanshu Pal and Rishi Divate

A 360 degree view of the performance, health and security of MongoDB Atlas

We're happy to announce that the MongoDB Atlas app is now available in the Sumo Logic app catalog. MongoDB Atlas is a globally distributed cloud database service and the easiest way to run MongoDB in the cloud. Atlas offers best-in-class automation, built-in security controls, and proven best practices to protect your data and scale your applications. Through this seamless MongoDB Atlas-Sumo Logic partnership, we're bringing complete visibility into performance, operations, and user activity across your databases.

In this post, we’ll provide an overview of how Sumo Logic’s integration with MongoDB Atlas works and how to use it to address the following goals:

  • Gain comprehensive visibility into the operations and health of Atlas clusters
  • Optimize the performance of Atlas clusters by identifying slow/inefficient queries and monitoring key database and system metrics
  • Maintain security visibility of your Atlas environment by monitoring user logins, audit events, and project and organizational activity

We already have an integration for the MongoDB on-premises database that helps monitor, optimize, and secure those deployments in real time, and it is used by customers all over the world. Adding this integration for Atlas gives customers the option to use Sumo Logic for on-premises, cloud, or hybrid scenarios.

“Increasingly, our customers are choosing to migrate existing on-prem workloads to the cloud or to launch new projects in a cloud-native environment on MongoDB Atlas,” said Sahir Azam, SVP of Cloud Products & GTM, MongoDB. “This latest expansion of Sumo Logic’s MongoDB integration will give joint customers valuable insights into the performance, security and operations of their Atlas workloads.”

How Does It Work?

In this section, we first talk about how Atlas logs and metrics can be collected and sent to Sumo Logic, and then we discuss how to best make use of the data via the app dashboards.

Collecting Atlas Data and Installing the Atlas App

To understand how the integration works, let’s first look into how we collect logs and metrics data via the Atlas APIs.

Sumo Logic uses a collector agent to do so. The collector can be deployed either as an AWS Lambda function or as a script running on a Linux machine, as shown below.

The collector lets you configure the types of logs and metrics you want collected. The integration supports the following data types:

  1. Database Logs
  2. Audit Logs
  3. Alerts
  4. Project Events
  5. Organization Events
  6. Process Metrics
  7. Disk Metrics
  8. Database Metrics

Once configured, the collector then sends data periodically to Sumo Logic via an HTTP Source.
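The final step of that flow, sending a batch of collected records to an HTTP Source, can be sketched roughly as follows. This is a minimal illustration and not the collector's actual code: the URL is a placeholder (a real HTTP Source URL is generated by Sumo Logic when you create the source), and the record shape, the `type` key, the function names, and the `Content-Type` choice are all assumptions made for the example.

```python
import json
import urllib.request

# Placeholder only. A real HTTP Source URL is issued by Sumo Logic.
HTTP_SOURCE_URL = "https://example.invalid/receiver/v1/http/PLACEHOLDER"

# Hypothetical labels for the data types listed above.
ENABLED_TYPES = {"database_logs", "audit_logs", "alerts"}


def select_enabled(records, enabled=ENABLED_TYPES):
    """Keep only records whose (hypothetical) 'type' key is enabled."""
    return [r for r in records if r.get("type") in enabled]


def build_payload(records):
    """Serialize records as newline-delimited JSON, one message per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


def build_request(records, url=HTTP_SOURCE_URL):
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=build_payload(records),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def send(records, url=HTTP_SOURCE_URL):
    """POST one batch and return the HTTP status code."""
    with urllib.request.urlopen(build_request(records, url), timeout=30) as resp:
        return resp.status


batch = select_enabled([
    {"type": "alerts", "msg": "example alert"},
    {"type": "disk_metrics", "value": 42},
])
request = build_request(batch)
```

Building the request separately from sending it keeps the serialization easy to check without any network access. Call `send` only once a real source URL is in place.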

After collection is configured, you can then install the MongoDB Atlas app from the Sumo Logic app catalog. For additional details, please see the help page on how to collect data and these instructions on how to install the application.

App Use Cases

Now let’s take a look at some examples of how to make use of the dashboards in the application.

Query Optimization and Identifying Atlas Issues

System performance and capacity planning are two areas that must be closely monitored as part of any MongoDB Atlas deployment. Using the app’s performance dashboards, you can optimize cluster performance by identifying slow/inefficient queries and monitoring key database and system metrics, including disk space used, cache usage, memory, disk utilization, replication headroom, open connections, queued operations, and page faults, as shown below:

To identify slow queries, database administrators frequently monitor two key metrics:

Scanned Objects / Returned: This is the ratio of the number of documents examined to fulfill a query to the actual number of returned documents.

Scanned / Returned: This is the ratio of the number of index keys examined to fulfill a query to the actual number of returned documents.
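As a rough illustration of the two ratios, the sketch below computes them from per-query counters. The function name is hypothetical, and treating a query that returns zero documents as if it returned one (to avoid division by zero) is an assumption of this example, not something the dashboards are documented to do.

```python
def scan_ratios(docs_examined, keys_examined, n_returned):
    """Return (scanned objects / returned, scanned keys / returned)."""
    # Assumption: treat zero returned documents as one to avoid dividing by zero.
    returned = max(n_returned, 1)
    return docs_examined / returned, keys_examined / returned


# A query that examined 5000 documents and no index keys to return 5 documents.
objects_ratio, keys_ratio = scan_ratios(5000, 0, 5)

# A well-indexed query examines about as many keys and documents as it returns.
indexed_ratios = scan_ratios(10, 10, 10)
```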

You can monitor these two metrics in the panels from the MongoDB Atlas - Metrics dashboard as shown below.

You can drill down to the panel and use the log overlay feature to correlate metrics performance with queries that could be responsible for such behavior. By clicking on the vertical band you can narrow down to a particular interval associated with the spike and see the relevant log messages.

The above example shows that no index keys were examined, so MongoDB scanned every document in the collection. This is known as a collection scan (COLLSCAN), as indicated by the “plansummary” attribute.

Ideally, the ratio of scanned documents to returned documents should be close to 1. A high ratio negatively impacts query performance. To remediate this, index the fields that the query filters on. You can also set up alerts in Sumo Logic that fire when the number of documents scanned exceeds a certain threshold, say 1000.
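The threshold check described above can be sketched outside of Sumo Logic as a small log filter. This is only an illustration: the sample line is made up, its layout follows the plain-text MongoDB slow-query log style, and the function name and regular expressions are assumptions for this example rather than part of the app.

```python
import re

PLAN_RE = re.compile(r"planSummary: (\w+)")
DOCS_RE = re.compile(r"docsExamined:(\d+)")


def flag_heavy_scan(line, threshold=1000):
    """Return plan and documents examined if the line exceeds the threshold."""
    plan = PLAN_RE.search(line)
    docs = DOCS_RE.search(line)
    if not plan or not docs:
        return None  # not a line this sketch understands
    docs_examined = int(docs.group(1))
    if docs_examined <= threshold:
        return None
    return {"plan": plan.group(1), "docs_examined": docs_examined}


# Illustrative log line, not taken from a real cluster.
sample = (
    '2019-09-10T12:00:00.000+0000 I COMMAND [conn42] command shop.orders '
    'command: find { find: "orders", filter: { status: "open" } } '
    'planSummary: COLLSCAN keysExamined:0 docsExamined:5000 nreturned:5 120ms'
)
finding = flag_heavy_scan(sample)
```

A line that reports `COLLSCAN` with a large `docsExamined` value is a natural candidate for a new index on the filtered field.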

You can also investigate this further using the panel below in the Slow Queries dashboard.

The panel shows statistics by database, collection, and database command, from which you can determine how your queries are performing. For example, 95% of all queries complete within 377 ms.
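To make a statement like "95% of all queries complete in 377 ms" concrete, here is a minimal sketch of a nearest-rank percentile over query durations. The durations are invented so that the 95th percentile lands on 377 ms, and the nearest-rank method is one common definition, not necessarily the one the panel uses.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented query durations in milliseconds.
durations_ms = [12, 15, 18, 20, 22, 25, 30, 35, 40, 48,
                55, 60, 75, 90, 120, 150, 210, 300, 377, 950]

p95 = percentile(durations_ms, 95)
p50 = percentile(durations_ms, 50)
```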

Troubleshooting Errors

With the MongoDB Atlas - Errors and Warnings dashboard, you can determine the distribution of errors and warnings by component. You can also use the Error - Time Compare panel to review historic performance and create the necessary operational baselines.

As you can see in the panel below, there is a significantly higher number of replication errors. You can drill down to the MongoDB Atlas - Replication dashboard to find the root cause.

In the Replication Error Summary panel shown below, you can see the error in the “msg” column. In this example, most of the errors are heartbeat-related.

Another important metric to monitor is replication lag, which is the amount of time it takes a write operation on a primary replica set member to replicate to a secondary member. Lag can be caused by network latency, disk throughput, write load, or other factors.
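One common way to approximate replication lag is to compare the timestamp of the last operation on the primary with the last operation applied on each secondary. The sketch below illustrates that idea with plain timestamps. The member names, the timestamps, and the 10-second threshold are invented for the example, and this approximation is an assumption rather than a description of how the Replication Lag panel computes its values.

```python
from datetime import datetime, timedelta


def replication_lag(primary_optime, secondary_optime):
    """Approximate lag as the gap between primary and secondary last-applied times."""
    return max(primary_optime - secondary_optime, timedelta(0))


def lagging_members(primary_optime, secondary_optimes, threshold):
    """Return secondaries whose approximate lag exceeds the threshold."""
    result = {}
    for name, optime in secondary_optimes.items():
        lag = replication_lag(primary_optime, optime)
        if lag > threshold:
            result[name] = lag
    return result


primary = datetime(2019, 9, 12, 10, 0, 30)
secondaries = {
    "secondary-1": datetime(2019, 9, 12, 10, 0, 28),  # 2 seconds behind
    "secondary-2": datetime(2019, 9, 12, 10, 0, 5),   # 25 seconds behind
}
behind = lagging_members(primary, secondaries, threshold=timedelta(seconds=10))
```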

Replication lag can be improved by moving to larger Atlas instances or adding shards. You can set up an alert in Sumo Logic to get notified when a threshold is exceeded in the Replication Lag panel as shown below.

Monitoring Security Alerts and Maintaining Visibility

These days your data is your most important asset, and protecting it is paramount.

Potential threats to a database could include:

  1. Unauthorised modifications
  2. Unauthorised disclosure
  3. Loss of availability

The security dashboards in this application provide a comprehensive view and analysis of the security posture of Atlas using project and organization level events, alerts, and audit logs.

The MongoDB Atlas - Audit dashboard gives you an at-a-glance view of the overall security posture with a number of single value panels shown below.

You can also monitor and detect recent read and write events, and potential cases of dropped collections or databases via the Database Read/Write Operations panel.
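As an illustration of what this kind of detection looks for, the sketch below classifies JSON audit records by their `atype` field. The action types used (`dropCollection`, `dropDatabase`, `createUser`, `dropUser`, `updateUser`) are MongoDB audit event names. The sample records, the category labels, and the function name are invented for this example, and real audit records carry more fields than shown here.

```python
import json

DROP_EVENTS = {"dropCollection", "dropDatabase"}
USER_EVENTS = {"createUser", "dropUser", "updateUser"}


def classify_audit_line(line):
    """Return a small summary for drop or user-change events, else None."""
    event = json.loads(line)
    atype = event.get("atype")
    if atype in DROP_EVENTS:
        category = "drop"
    elif atype in USER_EVENTS:
        category = "user_change"
    else:
        return None
    actors = [u.get("user") for u in event.get("users", [])]
    return {"category": category, "atype": atype, "actors": actors}


# Invented, trimmed-down audit records.
sample_lines = [
    '{"atype": "dropCollection", "users": [{"user": "alice", "db": "admin"}], "param": {"ns": "shop.orders"}}',
    '{"atype": "authenticate", "users": [{"user": "bob", "db": "admin"}]}',
    '{"atype": "createUser", "users": [{"user": "alice", "db": "admin"}], "param": {"user": "svc_tmp"}}',
]
findings = [f for f in map(classify_audit_line, sample_lines) if f]
```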

To find potential backdoors, check whether users have been added, removed, or modified by using the Audit Events by User panel as shown below.

The MongoDB Atlas - Events dashboard can help monitor project and organizational activity. A key compliance use case is to detect users who are logging in to Atlas without MFA, and this can be flagged via this dashboard as shown in the figure below.

Finally, the MongoDB Atlas - Alerts dashboard enables users to keep track of open and closed alerts, compare alert trends over time, identify abnormal alert activity, and determine the alert distribution per host so as to deduce which hosts need the most attention.


Key Takeaways

In this blog post, we showed examples of how to use the Sumo Logic MongoDB Atlas app to monitor the performance, health, and security of Atlas clusters.

Get Started

If you don’t have a Sumo Logic account yet, you can sign up for a free trial today. To get started with our MongoDB Atlas app, check out the app help doc.
