Spot worker nodes on EKS (Elastic Kubernetes Service) are a great way to save costs by taking advantage of unused EC2 capacity. At Sumo Logic, we have experimented with and adopted spot worker nodes for some of our EKS clusters to see if we can pass along the same benefits. In this post we share some of the learnings, challenges, and caveats of using Spot Instances, along with our AWS monitoring setup.
We gradually migrated from using only reserved EC2 instances in our EKS clusters to a mix of reserved and Spot Instances. All the stateless apps were moved to Spot Instances, while stateful applications stayed on reserved instances. This reduced our EC2 costs by over 50% on top of the already discounted reserved instances from AWS!
Why do we need to monitor spot instances?
Cost savings are great, but the trade-off we make by adopting Spot Instances is that AWS can take the instances away at any time with only a two-minute notice. Obviously not ideal. We needed a way to monitor the interruptions and found that AWS records them in CloudTrail under the name BidEvictedEvent. We looked at these events, but they didn't contain the node name, instance type, or availability zone, which made it tricky to correlate interruptions with the affected pods in the cluster. It was also difficult to work out which instance types we could use to reduce interruption rates.
At the same time, we started noticing that many spot nodes were being taken away by AWS with no corresponding events in CloudTrail. This is when we realized that, apart from spot interruptions, there is also the concept of rebalance recommendations. These events are logged in the autoscaler activity logs and indicate that nodes were removed in response to a rebalance recommendation, which AWS sends when an instance is at increased risk of being interrupted.
How do you monitor spot nodes?
At this point, we realized we needed a complete monitoring solution, so we set out to explore the available tools and found that AWS Node Termination Handler (NTH) running with the IMDS (Instance Metadata Service) processor works well for us. It installs a DaemonSet in the cluster, which runs a daemon on every node that watches IMDS for spot events. It automatically emits Kubernetes events, so no additional setup is required to track the data.
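To make the IMDS mechanism a little more concrete, here is a minimal Python sketch of the kind of check a per-node daemon could run. It is not NTH's actual implementation. It assumes IMDSv2 and the standard metadata paths for spot instance actions and rebalance recommendations; the function names (`imds_token`, `fetch_notice`, `classify`, `poll_once`) are our own illustrative choices.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"
SPOT_ACTION_PATH = "/meta-data/spot/instance-action"
REBALANCE_PATH = "/meta-data/events/recommendations/rebalance"


def imds_token(ttl_seconds=21600, timeout=2):
    """Request an IMDSv2 session token (only works from inside an EC2 instance)."""
    req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def fetch_notice(path, token, timeout=2):
    """Return the raw body at an IMDS path, or None when IMDS answers 404 (no notice)."""
    req = urllib.request.Request(
        IMDS + path, headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def classify(spot_body, rebalance_body):
    """Turn raw IMDS bodies into a list of (kind, timestamp) tuples."""
    notices = []
    if spot_body:
        action = json.loads(spot_body)
        notices.append(("spot-interruption:" + action["action"], action["time"]))
    if rebalance_body:
        notice = json.loads(rebalance_body)
        notices.append(("rebalance-recommendation", notice["noticeTime"]))
    return notices


def poll_once():
    """Query both IMDS paths once and return any pending notices."""
    token = imds_token()
    return classify(
        fetch_notice(SPOT_ACTION_PATH, token),
        fetch_notice(REBALANCE_PATH, token),
    )
```

A daemon would call `poll_once()` every few seconds and react to a non-empty result, for example by cordoning and draining the node. Keeping the parsing in `classify` separate from the HTTP calls makes the logic easy to test off-instance.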
We use sumologic-kubernetes-collection on all our clusters, which can be easily installed via Helm. It ingests pod logs, Kubernetes metrics, application traces, and Kubernetes events, which include the spot-related events thanks to AWS NTH. We created a dashboard in Sumo Logic that tracks the number of interruptions and rebalance recommendations, grouped by cluster, instance type, and so on.
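The grouping behind such a dashboard can be sketched in a few lines of Python. This is only an illustration of the aggregation, not our actual Sumo Logic query. The event reasons (`SpotInterruption`, `RebalanceRecommendation`) and the flat field names (`cluster`, `reason`, `instance_type`) are assumptions; check what your NTH version and collector actually emit.

```python
from collections import Counter

# Assumed reason strings; verify against the events in your own cluster.
SPOT_REASONS = {"SpotInterruption", "RebalanceRecommendation"}


def count_spot_events(events):
    """Count spot-related events per (cluster, reason, instance_type)."""
    counts = Counter()
    for event in events:
        reason = event.get("reason")
        if reason not in SPOT_REASONS:
            continue  # ignore unrelated Kubernetes events
        key = (
            event.get("cluster", "unknown"),
            reason,
            event.get("instance_type", "unknown"),
        )
        counts[key] += 1
    return counts


# Made-up sample events, for illustration only.
sample_events = [
    {"cluster": "prod-a", "reason": "RebalanceRecommendation", "instance_type": "r5.xlarge"},
    {"cluster": "prod-a", "reason": "RebalanceRecommendation", "instance_type": "r5.xlarge"},
    {"cluster": "prod-a", "reason": "SpotInterruption", "instance_type": "m5.large"},
    {"cluster": "prod-a", "reason": "Scheduled", "instance_type": "m5.large"},
]

for key, n in sorted(count_spot_events(sample_events).items()):
    print(key, n)
```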
We also set up various monitors on some key metrics such as the number of nodes in the cluster and cluster-autoscaler health to monitor any potential issues in the clusters.
We were surprised at what we found. Rebalances occurred almost seven times more frequently than actual spot interruptions. Even though nodes are gracefully drained during rebalancing and their pods are moved to a newer node, from the pod's perspective the effect is similar to a spot interruption. This is what caused excessive pod movement in our clusters despite the low count of BidEvictedEvents in CloudTrail. In our case, we disabled rebalancing to minimize disruptions, and pod movement was reduced by at least 60%.
Another observation we found interesting was that Spot Instance Advisor might be misleading. We use a mix of m5 and r5 instance types, and according to Spot Instance Advisor, m5 instances are interrupted more frequently than r5.
However, in practice what we observed was quite the opposite over one month.
The r5 & r5a instances received way more (almost 80% of all) interruptions and rebalance recommendations compared to m5 & m5a – contrary to what AWS claims.
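A share like this is straightforward to derive once events are counted per instance type. The Python sketch below shows one way to do it; the function name `family_share` and the input counts are hypothetical and are not our measured numbers.

```python
from collections import defaultdict


def family_share(counts_by_type):
    """Return each instance family's fraction of all events.

    The family is taken as the part of the instance type before the dot,
    so "r5a.large" belongs to family "r5a".
    """
    totals = defaultdict(int)
    for instance_type, n in counts_by_type.items():
        totals[instance_type.split(".")[0]] += n
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {family: n / grand_total for family, n in totals.items()}


# Hypothetical counts, for illustration only.
shares = family_share({"r5.xlarge": 3, "r5a.large": 1, "m5.large": 1})
for family, share in sorted(shares.items()):
    print(f"{family}: {share:.0%}")
```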
Overall, with sumologic-kubernetes-collection and Sumo Logic Application Observability, we now have deep insight into the health of our Kubernetes clusters, which has enabled us to reduce disruptions to our workloads and speed up our CI/CD pipelines.