The Sumo Logic Log Management and Analytics Service allows any organization to easily store, search, and analyze the log data produced by its applications. But even with the very best tools, the quality of the insights uncovered will depend on the raw materials: the logs themselves.
At Sumo Logic we make extensive use of our own product for developing, debugging, and troubleshooting our own product, which is itself a distributed and complex cloud-based SaaS application. This “dogfooding” process has helped enormously to improve the product, both in terms of features and maturity. It has also taught us a few things about writing logging into our own application. In this open-ended series we plan to share some logging patterns and strategies that we have found to be helpful.
General Logging Tips
When in doubt, log it
This one is quite simple: you won’t know about the things you don’t log. While pairing every single line of code with a corresponding log statement may be taking things too far, on the balance we have found that it is more common to regret something you didn’t log than something you did. This advice may be a bit self-serving, but when an elusive bug rears its ugly head on your production system at 3 A.M. you will be glad you logged that crucial extra piece of information.
Use logging tools
There are a variety of logging libraries that can give you a high degree of control over different logging levels, output locations, and formats. Log4j is a good example.
Periodically log the progress of long-running operations
Many data processing tasks involve a “bulk processor” that chews through many individual pieces of data, one at a time. Logging every item processed may be overly verbose, but logging at regular intervals (say, every second or every 10 seconds) may be quite helpful:
If our logs contain a regular stream of these messages indicating that 5000 items were processed per second and we observe an unexplained drop to 5 items per second, this may be an excellent clue that something unusual is happening.
Log for readability and parsability
Logging statements can be directed at many audiences: operational personnel maintaining the system, other developers reading your code, or computer programs parsing out structured data fields. Each of these audiences will appreciate logs that are self-documenting and easily consumed.
Records request received – email@example.com, userId=77A2063948571CC6, requestId=5B54D5E1D59E9850, recordsRequested=50
NOT AS GOOD
firstname.lastname@example.org | 77A2063948571CC6 | 5B54D5E1D59E9850 | 50
Logging for Distributed Systems
A common characteristic of modern distributed applications is that any single event touches many different system components which themselves may reside on many different machines. These systems can be especially challenging to build, operate, and maintain. Problems may manifest only under particular circumstances, and often application logs act as the “black box”, containing the only evidence available to uncover the root causes of system misbehavior. Given this setting, it is of the utmost importance to ensure that your logging is sufficient to help you understand your system and keep things running smoothly.
Use unique IDs to track an event across multiple system components
Unique IDs (such as the hexademical representation of a random number) can be helpful to pinpoint relevant logs when investigating a problem. For example, when a user logs into your website, you can associate that visit with a unique session ID that is embedded in all logs resulting from this visit. Searching your logs for this session ID should allow you to easily to reconstruct all of the system events associated with this user’s session.
This pattern can also be nested for finer-grained searchability. For example, during a visit to your website a user may access their account preferences. When the frontend of your site issues a internal request for this information, this request itself can be associated with both the user session ID and a new request ID. If initial troubleshooting of the user session reveals that this particular request was unsuccessful, this request ID can be then used to drill deeper into why the request failed.
A reasonable rule of thumb is to use a new ID for any event, transaction, or session that looks like “one thing” to other components of the system.
Log each side of a remote message or request
The interfaces between different components, especially those residing on different machines, can be a rich source of potential problems. Your message may be delivered to the wrong recipient, the remote recipient may fail to process your request, your message may be eaten by a grue, and so on.
One strategy for dealing with this issue is to put matching send/receive logging statements on each side of any remote message or request. A search for these message pairs that returns an odd number of logs would then be a red flag indicating that a sent message was never received, was received twice, or suffered some other mischief.