Today I am happy to be able to talk about something that has been rattling around in my head for the last six months or so. I've been thinking about this ever since we started looking into Docker and how it applies to what we are doing here at Sumo.
There are many different and totally valid ways to get logs and statistics out of Docker. Options are great, but I have concluded that the ultimate goal should be a solution that doesn't require users to have in-depth knowledge about all the things that are available for monitoring and the various methods to get to them. Instead, I want something that just pulls all the monitoring data out of the containers and Docker daemons with minimal user effort. In my head, I have been calling this "a comprehensive solution". Let me introduce you to the components that I think need to be part of a comprehensive monitoring solution for Docker:
- Docker events, to track container lifecycles
- Configuration info on containers
- Logs, naturally
- Statistics on the host and the containers
- Other host stuff (daemon logs, host logs, ...)
Let's start with events. The Docker API makes it trivial to subscribe to the event stream. Events contain lots of interesting information. The full list is well described in the Docker API doc, but let’s just say you can watch containers come and go, observe containers getting killed, and catch other interesting things such as out-of-memory situations. Docker has consistently added new events with every version, so this is a gift that will keep on giving in the future.
I think of Docker events as nothing but logs. And they are very nicely structured: it's all just JSON. If, for example, I can load this into my log aggregation solution, I can now track which container is running where. I can also track trends: for example, which images are run in the first place, and how often they are being run. Or why suddenly 10x more containers are being started in this period than before, and so on. This probably doesn't matter much for personal development, but once you have fleets, this is a super juicy source of insight. Lifecycle tracking for all your containers will matter a lot.
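To make the "events are just logs" idea a bit more concrete, here is a minimal sketch of the trend question above: how often is each image being started? It assumes the events arrive as one JSON object per line, and that the fields `status` and `from` carry the lifecycle action and the image name. Those field names and the sample data are assumptions for illustration; check them against the event schema of your Docker version.

```python
import json
from collections import Counter


def count_starts_by_image(event_lines):
    """Count container 'start' events per image from JSON event lines."""
    starts = Counter()
    for line in event_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the stream
        event = json.loads(line)
        # 'status' and 'from' are assumed field names (action and image).
        if event.get("status") == "start":
            starts[event.get("from", "<unknown>")] += 1
    return starts


# Hypothetical sample of what an event stream might deliver.
sample_events = [
    '{"status": "create", "id": "a1", "from": "nginx:latest", "time": 1434000000}',
    '{"status": "start", "id": "a1", "from": "nginx:latest", "time": 1434000001}',
    '{"status": "start", "id": "b2", "from": "redis:3", "time": 1434000002}',
    '{"status": "die", "id": "b2", "from": "redis:3", "time": 1434000009}',
    '{"status": "start", "id": "c3", "from": "nginx:latest", "time": 1434000010}',
]

if __name__ == "__main__":
    for image, n in count_starts_by_image(sample_events).most_common():
        print(image, n)
```

In a real setup the lines would come from the event stream rather than a list, and the counting would more likely happen in the log aggregator than in the collector.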
Docker events, among other things, allow us to see containers come and go. What if we also wanted to track the configurations of those containers? Maybe we want to track drift of run parameters, such as volume settings, or capabilities and limits. The container image is immutable, but what about the invocation? To my mind, having detailed records of the configuration each container was started with is another piece of the puzzle on the way to total visibility. Orchestration solutions will provide those settings, sure, but who is telling those solutions what to do?
From our own experience, we know that deployment configurations inevitably drift, and more than once we have found the root cause of otherwise inscrutable problems there. Docker allows us to use the inspect API to get the container configuration. Again, in my mental model, that's just a log. Send it to your aggregator. Alert on deviations, and use the data after the fact for troubleshooting. Docker provides this info in a clean and convenient format.
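As a rough illustration of "alert on deviations", the sketch below compares two configuration snapshots and reports every setting that differs. The snapshots are hand-written stand-ins for inspect output; the keys shown (`HostConfig`, `Binds`, `CapAdd`, `Memory`) are assumptions about what such a document contains, and `flatten` and `config_drift` are hypothetical helper names.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into {'A.B.C': value} pairs."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(baseline, current):
    """Return {path: (baseline_value, current_value)} for each difference.

    Note: a missing key and a key set to None are treated as equal.
    """
    a, b = flatten(baseline), flatten(current)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


# Hypothetical snapshots of the same service, taken at two deployments.
baseline = {"HostConfig": {"Binds": ["/data:/data:ro"], "CapAdd": None,
                           "Memory": 536870912}}
current = {"HostConfig": {"Binds": ["/data:/data:rw"], "CapAdd": ["NET_ADMIN"],
                          "Memory": 536870912}}

if __name__ == "__main__":
    for path, (old, new) in config_drift(baseline, current).items():
        print(f"{path}: {old!r} -> {new!r}")
```

Here the volume went from read-only to read-write and a capability was added, which is exactly the kind of quiet change that is hard to spot without a record of both invocations.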
Well, obviously, it would be great to have logs, right? Turns out there are many different ways to deal with logs in Docker, and new options are being enabled by the new log driver API. Not everybody is quite there yet in 12-factor land, but then again there are workarounds for when you need fat containers and have to collect logs from files inside of containers.
More and more I see people following the best practice of writing logs to standard out and standard error, and it is pretty straightforward to grab those logs from the logs API and forward them from there. The Logspout approach, for example, is really neat. It uses the event API to watch which containers get started, then turns around and attaches to the log endpoint, and then pumps the logs somewhere. Easy and complete, and you have all the logs in one place for troubleshooting, analytics, and alerting.
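One detail worth knowing when pumping logs this way: for containers started without a TTY, the log and attach endpoints multiplex standard out and standard error into one byte stream, with each chunk prefixed by an 8-byte header (one byte for the stream type, three padding bytes, and a 4-byte big-endian payload length). The sketch below splits such a buffer back into frames. It is only a sketch under that assumption about the framing; `demux_frames` and `make_frame` are hypothetical names, and the sample bytes are fabricated.

```python
import struct

STREAM_NAMES = {0: "stdin", 1: "stdout", 2: "stderr"}


def demux_frames(raw):
    """Split a multiplexed byte buffer into (stream_name, payload) frames."""
    frames = []
    offset = 0
    while offset + 8 <= len(raw):
        # Header: 1 byte stream type, 3 padding bytes, 4-byte big-endian length.
        stream_type, length = struct.unpack_from(">BxxxI", raw, offset)
        offset += 8
        payload = raw[offset:offset + length]
        if len(payload) < length:
            break  # incomplete frame; a real reader would wait for more bytes
        frames.append((STREAM_NAMES.get(stream_type, "unknown"), payload))
        offset += length
    return frames


def make_frame(stream_type, payload):
    """Build one frame; used here only to fabricate sample input."""
    return struct.pack(">BxxxI", stream_type, len(payload)) + payload


sample_stream = (make_frame(1, b"listening on :8080\n")
                 + make_frame(2, b"warn: slow request\n"))

if __name__ == "__main__":
    for stream, payload in demux_frames(sample_stream):
        print(stream, payload.decode().rstrip())
```

Keeping the stream name with each line is handy later, because it lets the aggregator treat standard error differently from standard out.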
Since the release of Docker 1.5, container-level statistics are exposed via a new API. Now you can alert on the "throttling_data" information, for example. How about that? Again (and at this point, this is getting repetitive, perhaps), this data should be sucked into a centralized system. Ideally, this is the same system that already has the events, the configurations, and the logs! Logs can be correlated with the metrics and events. Now, this is how I think we are getting to a comprehensive solution. There are many pieces to the puzzle, but all of this data can already be extracted from Docker pretty easily today. I am sure that as we all keep learning more about this, it will get even easier and more efficient.
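Here is a minimal sketch of what alerting on CPU throttling could look like. It assumes each stats sample is a JSON document with cumulative counters under `cpu_stats.throttling_data` (`periods` and `throttled_periods`), so the interesting number is the change between two samples. The field names reflect my understanding of the stats payload, and the 25% threshold and the function names are made up for illustration.

```python
def throttled_ratio(previous, current):
    """Fraction of scheduler periods in which the container was throttled
    between two stats samples; None if no periods elapsed."""
    prev = previous["cpu_stats"]["throttling_data"]
    curr = current["cpu_stats"]["throttling_data"]
    periods = curr["periods"] - prev["periods"]
    if periods <= 0:
        return None  # no CPU limit in effect, or counters were reset
    return (curr["throttled_periods"] - prev["throttled_periods"]) / periods


def should_alert(previous, current, threshold=0.25):
    """True when the throttled fraction exceeds the (assumed) threshold."""
    ratio = throttled_ratio(previous, current)
    return ratio is not None and ratio > threshold


# Hypothetical consecutive samples for one container.
sample_prev = {"cpu_stats": {"throttling_data": {
    "periods": 100, "throttled_periods": 10, "throttled_time": 5000000}}}
sample_curr = {"cpu_stats": {"throttling_data": {
    "periods": 200, "throttled_periods": 60, "throttled_time": 90000000}}}

if __name__ == "__main__":
    print(throttled_ratio(sample_prev, sample_curr))
    print(should_alert(sample_prev, sample_curr))
```

With the samples above, the container was throttled in 50 of the last 100 periods, which is the kind of signal that becomes much more useful once it sits next to the logs and events from the same container.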
In all the excitement around APIs for monitoring data, let's not forget that we also need to have host level visibility. A comprehensive solution should therefore also work hard to get the Docker daemon logs, and provide a way to get any other system level logs that factor into the way Docker is being put to use on the hosts of the fleet. Add host level statistics to this and now performance issues can be understood in a holistic fashion - on a container basis, but also related to how the host is doing. Maybe there's some intricate interplay between containers based on placement that pops up on one host but not the other? Without quick access to the actual data, you will scratch your head all day.
What's the desirable user experience for a comprehensive monitoring solution for Docker? I think it needs to be brain-dead easy. Thanks to the API-based approach that allows us to get to all the data either locally or remotely, it should be easy to encapsulate all the monitoring data acquisition and forwarding into a container that can either run remotely, if the Docker daemons support remote access, or as a system container on every host. Depending on how the emerging orchestration solutions approach this, it might not even be too crazy to assume that the collection container could simply attach to a master daemon. It seems Swarm might make this possible. Super simple, just add the URL to the collector config and go.
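To sketch the "just add the URL" idea: a collector could accept a single daemon URL and work out from it whether to talk to a local socket or a remote daemon. The example below assumes the common `unix://` and `tcp://` URL forms; `parse_daemon_url` is a hypothetical helper, and the default port of 2375 is an assumption (the port conventionally used for unencrypted remote access), not something a collector has to adopt.

```python
from urllib.parse import urlsplit

DEFAULT_TCP_PORT = 2375  # assumed default for unencrypted remote access


def parse_daemon_url(url):
    """Turn a daemon URL into a simple connection description."""
    parts = urlsplit(url)
    if parts.scheme == "unix":
        # Local mode: the collector runs on the host with the socket mounted.
        return {"transport": "unix", "path": parts.path}
    if parts.scheme == "tcp":
        # Remote mode: the collector attaches to a daemon over the network.
        return {"transport": "tcp", "host": parts.hostname,
                "port": parts.port or DEFAULT_TCP_PORT}
    raise ValueError(f"unsupported daemon URL: {url!r}")


if __name__ == "__main__":
    print(parse_daemon_url("unix:///var/run/docker.sock"))
    print(parse_daemon_url("tcp://10.0.0.5:2375"))
```

The same collector image can then run as a system container on every host (pointed at the local socket) or centrally (pointed at one or more remote daemons), with the URL as the only difference in configuration.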
I really like the idea of being able to do all of this through the API because now I don't need to introduce other requirements on the hosts. Do they have syslog? journald? Those are of course all great tools, but as the levels of abstraction keep rising, we will be able to make fewer and fewer assumptions about the hosts. So the API-based access provides decoupling and allows for composition.
All For One
So, to be completely honest, there's a little bit more going on here on our end than just thinking about this solution. We have started to implement almost all of the ideas into a native Sumo Logic collection Source for Docker. We are not ready to make it generally available just yet, but we will be showing it off next week at DockerCon (along with another really cool thing I am not going to talk about here). Email firstname.lastname@example.org to get access to a beta version of the Sumo Logic collection Source for Docker.