
Vijay Upadhyaya

Director of Competitive Strategy. Two decades of experience in the security space. Three Black Hat certifications (shellcode writing, web application hacking, and network forensics) along with CISSP, CISA, and CRISC. Able to translate and simplify complex technical problems.

Posts by Vijay Upadhyaya


Detecting Insider Threats with Okta and Sumo Logic

Security intelligence for SaaS and AWS workloads is different from your traditional on-prem environment.

Based on Okta's latest Businesses @ Work report, organizations use between 16 and 22 SaaS applications in their environment. In the report, Office 365 comes out as the top business application, followed by Box and G Suite. These business-critical SaaS applications hold sensitive and valuable company information such as financial data, employee records, and customer data.

While everyone understands that SaaS applications provide immediate time-to-value and are being adopted faster than ever before, what many fail to consider is that these applications also create a new attack surface, one that represents substantial risk because security operations teams lack the visibility they would typically have with traditional, on-prem applications. If employee credentials are compromised, the exposure is enormous: the attacker can access every application just as an insider would. In that case, timely detection and containment of an insider threat become extremely important. Sumo Logic's security intelligence allows security operations to address these challenges around SaaS and cloud workload security.

Incident management and security operations teams face several challenges when organizations use SaaS applications. How do you make sure that users across SaaS applications can be uniquely identified? How can you track anomalies in user behavior? After exploiting a vulnerability, an attacker's first step is to steal an employee's identity and move laterally through the organization; in the process, the attacker's behavior will differ considerably from the normal user's behavior. Second, it is critical that the entire incident response and management process is automated, so that such attacks are detected and contained quickly enough to minimize damage or data leakage.
Most organizations moving to the cloud have legacy solutions such as Active Directory and on-prem SIEM products. While traditional SIEM products can integrate with Okta, they cannot integrate effectively with other SaaS applications to provide complete visibility into user activities. Since there are no collectors to install to get logs out of SaaS applications, traditional SIEM vendors cannot provide the required insight into modern SaaS applications and AWS workloads.

To solve these problems, Okta and Sumo Logic have partnered to provide better visibility and faster detection of insider threats. Okta ensures that every user is uniquely identified across multiple SaaS applications. Sumo Logic ingests those authentication logs from Okta and correlates them with user activities across SaaS applications such as Salesforce, Box, and Office 365. Sumo Logic's machine learning operators, such as multi-dimensional Outlier, LogReduce, and LogCompare, quickly surface anomalies in user activity by correlating identity from Okta with user activity in Salesforce and Office 365. Once abnormal activity has been identified, Sumo Logic can take multiple actions: sending a Slack message, creating a ServiceNow ticket, disabling the user in Okta, or triggering actions within a customer's automation platform.

The use case: Okta + Sumo Logic = accurate incident response for cloud workloads and SaaS applications

How many times have you fat-fingered your password and gotten an authentication failure? Don't answer that. Authentication failures are a part of life. You cannot launch an investigation every time one occurs; that would result in too many false positives and an overload of wasted effort for your security operations team. Okta and Sumo Logic allow you to detect multiple authentication failures followed by a successful authentication.
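To make the detection pattern concrete, here is a minimal Python sketch of the "many failures, then a success" logic described above. This is not Sumo Logic's query language or Okta's API; the event format, function name, and thresholds are all illustrative assumptions.

```python
from collections import deque
from datetime import datetime, timedelta

def flag_suspicious_logins(events, threshold=5, window=timedelta(minutes=10)):
    """Flag users whose successful login follows >= threshold failures
    within the preceding window -- a brute-force-then-success pattern.

    events: iterable of (timestamp, user, outcome) tuples, where outcome
    is "FAILURE" or "SUCCESS". All names here are hypothetical.
    """
    recent_failures = {}  # user -> deque of failure timestamps
    flagged = []
    for ts, user, outcome in sorted(events):
        failures = recent_failures.setdefault(user, deque())
        # Drop failures that fell out of the sliding window
        while failures and ts - failures[0] > window:
            failures.popleft()
        if outcome == "FAILURE":
            failures.append(ts)
        elif outcome == "SUCCESS":
            if len(failures) >= threshold:
                flagged.append((user, ts, len(failures)))
            failures.clear()  # reset the streak after any success
    return flagged
```

In a real deployment this logic runs inside the analytics platform over Okta's authentication logs; the sketch only shows why a success preceded by a burst of failures is a stronger signal than failures alone.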
That pattern may be enough to launch an investigation, but we all know it could also be user error: Caps Lock is on, the keyboard is misbehaving, or we just changed the password and forgot! To give security operations more intelligent and actionable insight into such events, Sumo Logic provides additional context by correlating authentication failure logs from Okta with user activity across multiple SaaS applications.

For example, suppose I changed my password and am now getting authentication failures in Okta. I realize the mistake, correct it, and authenticate successfully. I then log into Box, work on a few documents, and sign off. Sumo Logic will take the Okta events and correlate them with the Box activity. Had an attacker logged in instead of me, there would be anomalies in behavior: an attacker might download all the documents or change their ownership. Sumo Logic can spot these anomalies in near real time and take a variety of automated actions, from creating a ServiceNow ticket to disabling the user in Okta.

You can start ingesting your Okta logs and correlating them with user activity logs across multiple SaaS applications now. Sign up for your free Sumo Logic trial that never expires!

Co-author Matt Egan is a Partner Solutions Technical Architect in Business Development at Okta. In this role, he works closely with ISV partners like Sumo Logic to develop integrations and joint solutions that increase customer value. Prior to joining Okta, Matt held roles ranging from software development to information security over an 18-year career in technology.
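The "attacker downloads everything" anomaly above boils down to comparing today's activity against a per-user baseline. Here is a minimal sketch of that idea using a simple z-score; the function name, data shapes, and threshold are illustrative assumptions, not how Sumo Logic's Outlier operator is actually implemented.

```python
import statistics

def download_anomalies(history, today, z_threshold=3.0):
    """Flag users whose download count today is far above their baseline.

    history: dict of user -> list of past daily download counts
    today:   dict of user -> today's download count
    Returns a list of (user, z_score) for users above z_threshold.
    """
    flagged = []
    for user, counts in history.items():
        mean = statistics.mean(counts)
        # Flat baselines have zero spread; fall back to 1.0 to avoid
        # dividing by zero (a modeling shortcut, not a recommendation)
        stdev = statistics.pstdev(counts) or 1.0
        z = (today.get(user, 0) - mean) / stdev
        if z > z_threshold:
            flagged.append((user, z))
    return flagged
```

A user who normally touches two or three documents a day and suddenly downloads fifty produces a huge z-score, which is the kind of signal that would drive the automated ServiceNow or Okta-disable actions described above.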


Graphite vs. Sumo Logic: Building vs. Buying Value

No no no NOOOOO NOOOOOOOO… one could hear Mike almost yelling while staring at his computer screen. He had suddenly lost his SSH connection to one of the core billing servers, in the middle of a manual backup he was running before upgrading the system with the latest monkey patch.

Mike had gained a lot of visibility, and a promotion, after his last initiative: migrating an ELK Stack to Sumo Logic's SaaS-based machine data analytics platform. He improved MTTI/MTTR by 90% and the uptime of the log analytics service by 70% in less than a month. With the promotion, he was put in charge of the newly formed site reliability engineering (SRE) team, with four people reporting to him. It was a big deal. This was his first major project since the promotion, and he wanted everything to go well. But something had just happened to the billing server, and Mike had a bad feeling about it.

He waited a few minutes to see whether the billing server would start responding again. It had happened before: the SSH client would temporarily lose its connection to the server. The root cause back then was the firewall at corporate headquarters, which had to be upgraded to fix the issue. This time Mike was convinced it was not the firewall; something else had happened to the billing server, and now there was a way to confirm his hunch. To see what happened, he runs a query in Sumo Logic:

_sourceHost=billingserver AND "shut*"

He quickly realizes that the server was rebooted. He broadens the search to a +/- 5 minute range around that log message and identifies that the disk was full. He added more disk space to the server so the billing server would not restart again for lack of drive space.

However, Mike had no visibility into host metrics such as CPU, disk, and memory usage. He needed a solution to gather host and custom metrics; he couldn't believe the application had been managed without them.
He knew very well that metrics must be captured to get visibility into system health, so he reprioritized his metrics project over making the ELK stack and the rest of the infrastructure PCI compliant.

Stage 1: Installing Graphite

After a quick search, he identifies Graphite as one of his options. ELK had left a bad taste in his mouth, costing him an arm and a leg for just a search feature. This time, he thought, it would be different. A metric data point is only 12 bytes! How hard could it be to store 12 bytes of data for 200 machines? He chose Graphite as his open-source host metrics system.

He downloads and installs the latest Graphite on an AWS t2.medium; at $0.016 per hour, he gets 4 GB of RAM with 2 vCPUs. For less than $300, Mike is ready to test his new metrics system. Graphite has three main components: Carbon, Whisper, and Graphite-Web. Carbon listens on a TCP port and expects time series metrics. Whisper is a flat-file database, while Graphite-Web is a Django application that can query the carbon-cache and Whisper. He installs all of this on a single server. The logical architecture looks something like Figure 1 below.

Figure 1: Simple Graphite Logical Architecture on a Single Server

Summary: At the end of stage 1, Mike had a working solution with a couple of servers on AWS.

Stage 2: New metrics stopped updating – the first issue with Graphite

On a busy day, new metrics suddenly stopped showing up in the UI. This was the first issue Graphite had faced after a few months of operation. Careful analysis showed that existing metrics were still being written to their Whisper files. Mike thought for a second and remembered that Whisper pre-allocates disk space for each file based on the retention configured in storage-schemas.conf. To make it concrete, Whisper pre-allocates about 31.1 MB for one metric collected every second from one host and retained for 30 days:
Total storage = 1 host × 1 metric/sec × 60 sec × 60 min × 24 hr × 30 days retention × 12 bytes per point ≈ 31.1 MB.

He realized that he might have run out of disk space, and sure enough, that was the case. He doubled the disk space and restarted the Graphite server, and new data points started showing up. Mike was happy to have resolved the issue before it escalated, but his mind started inventing what-if scenarios. What if the application he is monitoring goes down exactly when Graphite gives up? He parks that scenario in the back of his head and goes back to other priorities.

Summary: At the end of stage 2, Mike had already incurred additional storage cost, ending up with an EBS Provisioned IOPS volume. SSD would have been better, but this was the best he could do within the allocated budget.

Stage 3: And again, new metrics stopped updating

At 10 PM on a Saturday night, a marketing promotion suddenly went viral and a flood of users logged into the application. Engineering had auto-scaling enabled on the front end, and Mike had ensured that new images would automatically enable StatsD. The metric data points per minute (DPM) suddenly grew well above average. Mike had no idea about this series of events; the only information in the ticket he received was "New metrics are not showing up, AGAIN!"

He quickly found that write activity was pinned at MAX_UPDATES_PER_SECOND, which caps how many Whisper updates are performed per second, and that MAX_CREATES_PER_MINUTE, which caps how many new Whisper files are created per minute, was also maxed out. Mike quickly realized the underlying problem: disk I/O throttling was causing the server to run out of memory and crash. Here is how he connected the dots. Auto-scaling kicked in, and suddenly 800 servers were sending metrics to Graphite, four times the average number of hosts running at any given time, which quadrupled the metrics ingested as well.
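Mike's back-of-the-envelope math is easy to check. Assuming Whisper's 12 bytes per stored data point, a short calculation reproduces the 31.1 MB figure and shows how quickly it multiplies across a fleet (the 40-metrics-per-host figure below is an illustrative assumption, not from the original post):

```python
POINT_SIZE_BYTES = 12  # Whisper stores each archived data point in 12 bytes

def whisper_file_size(points_per_second, retention_days):
    """Approximate pre-allocated Whisper file size for one metric,
    ignoring the small file header."""
    points = points_per_second * 60 * 60 * 24 * retention_days
    return points * POINT_SIZE_BYTES

one_metric = whisper_file_size(1, 30)
print(one_metric / 1e6)   # MB per metric, 1-second resolution, 30-day retention

# Scale to Mike's fleet: 200 hosts x, say, 40 host metrics each (assumption)
fleet = 200 * 40 * one_metric
print(fleet / 1e9)        # GB pre-allocated across the fleet
```

That single "12 bytes" data point becomes roughly 31 MB per metric, and hundreds of gigabytes of pre-allocated disk fleet-wide, which is exactly the trap Mike walked into.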
The Graphite settings MAX_UPDATES_PER_SECOND and MAX_CREATES_PER_MINUTE reduce the load on disk I/O, but they have an upstream impact: carbon-cache starts using more and more memory. Because MAX_CACHE_SIZE was set to infinite, carbon-cache kept holding metrics in memory while they waited to be written to Whisper on disk. When the carbon-cache process ran out of memory, it crashed, and sure enough, metrics stopped updating. So Mike added an EBS volume with Provisioned IOPS and upgraded the server to an m3.medium instead of the t2.

Summary: At the end of stage 3, Mike had already performed two migrations: first, changing the hard drive meant transferring the Graphite data; second, changing the machine meant reinstalling and repopulating the data. Not to mention that this time he had to reconfigure all the clients to send metrics to the new server.

Figure 2: Single Graphite M3 Medium Server after Stage 3

Stage 4: Graphite gets resiliency, but at what cost?

From his earlier ELK experience, Mike had learned one thing: he cannot have any single point of failure in his data ingest pipeline. At the same time, he has to solve the memory-driven crash in his Carbon ingest path. Before anything else happens, he has to resolve the single point of failure in the architecture above and allocate more memory to carbon-relay. He decided to replicate a similar Graphite deployment in a different availability zone. This time he turns on replication in the configuration file and creates the architecture below, which ensures replication and gives the carbon-relay process more memory so that it can hold metrics while Whisper is busy writing them to disk.

Summary: At the end of stage 4, Mike has resiliency through replication and more memory for the carbon-relay process. This change has doubled the Graphite cost.

Figure 3: Two Graphite M3 Medium Servers with replication after Stage 4

Stage 5: And another one bites the dust… yet another carbon-relay issue.
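The settings involved in this failure mode all live in the [cache] section of carbon.conf. A sketch of what a bounded configuration might look like (the values are illustrative assumptions, not tuning advice):

```ini
# carbon.conf, [cache] section -- illustrative values, not a recommendation
MAX_CACHE_SIZE = 2000000        # cap cached data points; 'inf' lets memory grow unbounded
MAX_UPDATES_PER_SECOND = 500    # throttle Whisper writes to protect disk I/O
MAX_CREATES_PER_MINUTE = 50     # throttle creation of new Whisper files
```

The trade-off is unavoidable: tighter I/O throttles protect the disk but push the backlog into carbon-cache memory, while a finite MAX_CACHE_SIZE protects memory but drops data points once the cache is full.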
Mike was standing in line for a hot breakfast. At this deli, you pay first and then pick up your food. He saw a huge line at the register; the cashier seemed to be new, he was slow, and the line kept getting longer. It was morning and everyone wanted to get back quickly. Suddenly Mike's brain drew an analogy: the cashier is carbon-relay, the person serving the breakfast is carbon-cache, and the chef is Whisper. The chef takes the longest because he has to cook.

Suddenly he saw the flaw in his earlier design. Carbon-relay exposes a line port (TCP 2003) and a pickle port (TCP 2004), and every host is configured to throw metrics at those ports. The moment carbon-relay gets saturated, there is no way to scale it up without adding new servers plus network and host configuration changes. To avoid that kind of disruption, he quickly comes up with a new design he calls the relay sandwich: HAProxy moves onto its own dedicated server, and carbon-relay gets its own server as well, so it can scale horizontally without changing the configuration at the host level.

Summary: Each Graphite instance now has four servers, for a total of eight servers across the two instances. At this point, the system is resilient, with headroom to scale carbon-relay.

Figure 4: Adding more servers with HA Proxy and Carbon Relay

Stage 6: Where is my UI?

As you may have noticed, this is just the backend architecture. Mike was the only person running the show, but if he wants more users to have access to the system, he must scale the front end as well. He ends up installing Graphite-Web, and the final architecture becomes as shown in Figure 5.

Summary: Graphite evolved from a single server into a 10-machine cluster managing metrics for only a fraction of the infrastructure.

Figure 5: Adding more servers with HA Proxy and Carbon Relay

Conclusion: It was déjà vu for Mike.
He had seen this movie before with ELK. Twenty servers into Graphite, he was just getting started. He quickly realizes that if he enables custom metrics, he will have to double the size of his Graphite cluster. And Graphite's metrics only indicate "what" is wrong with the system, whereas the Sumo Logic platform, with correlated logs and metrics, indicates not only "what" is wrong but also "why." Mike turns on Sumo Logic metrics on the same collectors already collecting logs and gets correlated logs and metrics on the Sumo Logic platform. Best of all, he is no longer on the hook to manage the management system.

May 31, 2017


ELK Stack vs. Sumo Logic: Building or Buying Value?