Apache Server Traffic Analysis - Sumo Logic

Apache access log analytics is about getting different kinds of visibility into your web operations. When it comes to traffic, we’re primarily concerned with two metrics: the number of HTTP requests (hits) and the total bytes served (volume). You can find all sorts of actionable insights by plotting these values against the other information in your access logs.

For example, comparing traffic to request URL identifies your most popular content, while visualizing hits and volume against referrer URL helps you pinpoint hotlinked media resources.

[Figure: Analyzing Apache traffic metrics with Sumo Logic]

In this article, we’ll learn how to identify these kinds of optimization opportunities by analyzing Apache traffic with Sumo Logic. You can follow along by signing up for a free Sumo Logic account.

Total Traffic by Hits

Let’s start by getting a high-level look at how much content you’re serving:

_sourceCategory=Apache/Access
| timeslice 5m
| count as hits by _timeslice

Sumo Logic adds a _sourceCategory metadata field to logs as it collects them. This lets us limit the scope of queries to access logs, error logs, or custom log files that you’ve configured in httpd.conf. The timeslice operator groups log messages into 5-minute intervals and stores each interval in the _timeslice field, which the count operator then groups by.

Running this query returns a table counting the number of hits every 5 minutes. You can visualize this information by clicking the Line Chart button in the Aggregates tab. This gives you a much clearer view of traffic spikes.
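
To make the grouping step concrete, here is a minimal Python sketch of the same idea applied to raw access log lines outside Sumo Logic. It assumes Apache's default timestamp layout (for example 10/Oct/2023:13:51:36 +0000); the function names and sample lines are hypothetical and only for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# The timestamp sits inside [...] in Apache's common and combined log formats.
TIME_RE = re.compile(r"\[([^\]]+)\]")


def bucket(ts, minutes=5):
    """Round a timestamp down to the start of its N-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def hits_per_interval(lines, minutes=5):
    """Count requests per interval, similar to timeslice followed by count."""
    hits = Counter()
    for line in lines:
        match = TIME_RE.search(line)
        if not match:
            continue  # skip lines that are not access log entries
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        hits[bucket(ts, minutes)] += 1
    return {start: hits[start] for start in sorted(hits)}


# Hypothetical sample lines.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2023:13:51:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2023:13:54:02 +0000] "GET /logo.png HTTP/1.1" 200 10240',
    '198.51.100.7 - - [10/Oct/2023:13:56:10 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

for start, count in hits_per_interval(SAMPLE).items():
    print(start.strftime("%H:%M"), count)
```

The first two requests fall into the interval starting at 13:50 and the third into the one starting at 13:55, which is the shape of the table the query returns.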

[Figure: Visualizing total hits over time]

You can save this chart into a dashboard by clicking the Create Panel button in the Aggregates tab. Dashboards are automatically updated in real time, so you’ll always know exactly what’s going on in your Apache infrastructure.

Alternatively, if you have a Sumo Logic Professional account, you can set up a real-time alert to receive an email when hits pass a certain threshold. In either case, the idea is to keep tabs on key performance indicators with Sumo Logic’s monitoring capabilities, then dig deeper with ad-hoc queries when something needs attention.

Total Traffic by Volume

Hits are only half the story when it comes to analyzing web server traffic. A single hit could result in anywhere from a few kilobytes to hundreds of megabytes or more. To get a complete picture of our Apache traffic, we need to modify the above query to display both hits and volume side-by-side:

_sourceCategory=Apache/Access
| parse regex "HTTP/1\.1\"\s+\d+\s+(?<size>\d+)"
| where size != "-"
| timeslice 5m
| (size/1024) as kbytes
| count as hits, sum(kbytes) as kbytes by _timeslice

In addition to counting hits, this query extracts the response size from each access log entry, converts it to kilobytes, and adds the values up to get the total volume for each 5-minute interval. With a few tweaks to the column chart settings, we have an at-a-glance view of both traffic metrics:

[Figure: Visualizing both total volume and total hits over time]

This gives you more insight into what kind of content you’re serving. If a small change in hits is accompanied by a big shift in volume, it means you’re serving a small number of large responses. If your web application processes many small HTTP requests, you’ll see a closer correlation between hits and volume.

These charts can help guide optimization efforts. In the former case, reducing response size will have the biggest impact on your web application’s performance. On the other hand, if you’re handling unnecessary HTTP requests, you’re better off refactoring your content to reduce the number of requests.
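
As a rough illustration of how hits and volume can diverge, the following Python sketch totals both per interval from raw log lines. It is only a sketch under assumptions: the regular expression targets the default common/combined log layout, entries whose size is "-" are skipped, and the sample lines are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Timestamp, quoted request, status code, and numeric response size.
# A size of "-" does not match \d+, so those entries are skipped,
# much like a non-matching parse regex drops the message.
ENTRY_RE = re.compile(r'\[([^\]]+)\] "[^"]*" \d{3} (\d+)')


def traffic_per_interval(lines, minutes=5):
    """Return {interval_start: {"hits": n, "kbytes": total}} sorted by time."""
    totals = defaultdict(lambda: {"hits": 0, "kbytes": 0.0})
    for line in lines:
        match = ENTRY_RE.search(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        start = ts.replace(minute=ts.minute - ts.minute % minutes, second=0)
        totals[start]["hits"] += 1
        totals[start]["kbytes"] += int(match.group(2)) / 1024
    return {start: totals[start] for start in sorted(totals)}


# Hypothetical sample lines: one large download dominates the first interval.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2023:13:51:36 +0000] "GET /index.html HTTP/1.1" 200 2048',
    '203.0.113.5 - - [10/Oct/2023:13:54:02 +0000] "GET /video.mp4 HTTP/1.1" 200 1048576',
    '198.51.100.7 - - [10/Oct/2023:13:56:10 +0000] "GET /style.css HTTP/1.1" 200 1024',
    '198.51.100.7 - - [10/Oct/2023:13:57:45 +0000] "GET /style.css HTTP/1.1" 304 -',
]

for start, row in traffic_per_interval(SAMPLE).items():
    print(start.strftime("%H:%M"), row["hits"], f'{row["kbytes"]:.1f} KB')
```

In this sample the first interval has only one more hit than the second but roughly a thousand times the volume, which is the "few large responses" pattern described above.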

Hits and Volume by Server

If you’re maintaining multiple Apache servers, you’re probably also interested in how traffic is distributed amongst your servers. We can break down hits and volume by server using the _sourceHost metadata field. Like _sourceCategory, this value is set while configuring a source.

_sourceCategory=Apache/Access
| parse regex "HTTP/1\.1\"\s+\d+\s+(?<size>\d+)"
| where size != "-"
| timeslice 5m
| (size/1024) as kbytes 
| count as hits, sum(kbytes) as kbytes by _timeslice, _sourceHost
| transpose row _timeslice column _sourceHost

There are two important additions to the above query. First, we’ve grouped hits and volume not only by _timeslice, but also by _sourceHost. Second, the transpose operator acts as a pivot, which lets us display each server as a section in a stacked column chart:

[Figure: Separating traffic by server]

This visualization provides a clear window into your entire web application infrastructure. For instance, if you’re monitoring a load-balanced cluster, this panel immediately tells you whether traffic is distributed correctly. Or, if you’re reacting to a denial-of-service attack, all you have to do is glance at your dashboard to see which servers require attention.
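
The pivot step can be easier to picture with a small example. This Python sketch turns one (interval, host) pair per request into a table with one row per interval and one column per server. The host names and the input shape are assumptions for illustration; in Sumo Logic the host comes from the _sourceHost metadata rather than from the log line itself.

```python
from collections import defaultdict


def pivot_hits(records):
    """Pivot (interval, host) pairs into rows of per-host hit counts."""
    counts = defaultdict(lambda: defaultdict(int))
    hosts = set()
    for interval, host in records:
        counts[interval][host] += 1
        hosts.add(host)
    columns = sorted(hosts)
    # One row per interval, one column per host; missing cells become 0.
    rows = [
        [interval] + [counts[interval][host] for host in columns]
        for interval in sorted(counts)
    ]
    return ["interval"] + columns, rows


# Hypothetical records: one entry per request.
RECORDS = [
    ("13:50", "web-1"),
    ("13:50", "web-1"),
    ("13:50", "web-2"),
    ("13:55", "web-2"),
]

header, rows = pivot_hits(RECORDS)
print(header)
for row in rows:
    print(row)
```

Each row corresponds to one stacked column in the chart, and a server that drops to 0 while its peers stay busy is the kind of imbalance the panel makes visible.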

Hits and Volume by URL

So far, we’ve only been examining hits and volume over time. To get another perspective on your web traffic, we can also plot traffic against request URLs:

_sourceCategory=Apache/Access
| parse regex "[A-Z]+ (?<url>/[^ ]*) HTTP/1\.1\"\s+\d+\s+(?<size>\d+)"
| where size != "-"
| (size/1024) as kbytes 
| count as hits, sum(kbytes) as kbytes by url
| sort hits
| limit 20

This calculates the volume and hit count for the top 20 URLs, which tells us which content is worth optimizing. To reduce HTTP requests, you can combine small images with high hit counts into spritesheets or cache dynamically generated content. To reduce bandwidth usage, you can compress high-volume media resources, gzip your HTML, or minify your CSS and JavaScript.

[Figure: Inspecting the highest-traffic URLs]

Note that this is the same type of information we saw in the last two sections, but breaking it down by URL provides a new set of actionable insights.
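
For a quick local check of the same ranking, a minimal Python sketch might look like the following. It assumes the default request line layout and ranks by hits only; the function name and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Method, URL, protocol, status code, and numeric response size.
REQUEST_RE = re.compile(r'"[A-Z]+ (?P<url>\S+) HTTP/[\d.]+" \d{3} (?P<size>\d+)')


def top_urls(lines, limit=20):
    """Return (url, hits, kbytes) tuples for the most requested URLs."""
    stats = defaultdict(lambda: [0, 0.0])  # url -> [hits, kbytes]
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        entry = stats[match.group("url")]
        entry[0] += 1
        entry[1] += int(match.group("size")) / 1024
    ranked = sorted(stats.items(), key=lambda item: item[1][0], reverse=True)
    return [(url, hits, kbytes) for url, (hits, kbytes) in ranked[:limit]]


# Hypothetical sample lines.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2023:13:51:36 +0000] "GET / HTTP/1.1" 200 1024',
    '198.51.100.7 - - [10/Oct/2023:13:52:10 +0000] "GET /video.mp4 HTTP/1.1" 200 5242880',
    '198.51.100.7 - - [10/Oct/2023:13:53:40 +0000] "GET / HTTP/1.1" 200 1024',
]

for url, hits, kbytes in top_urls(SAMPLE):
    print(url, hits, f"{kbytes:.1f} KB")
```

Here the home page leads on hits while the video leads on volume, so the two would call for different optimizations.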

Identifying Hotlinked Resources

Comparing traffic metrics to referrer URL identifies a different optimization opportunity: hotlinked resources. The following query will identify the top 20 websites that are direct-linking to your image files:

_sourceCategory=Apache/Access
| parse regex "\"[A-Z]+ .+\.(?<type>jpg|jpeg|png|gif) HTTP/1\.1\" \d+ (?<size>\d+) \"(?<referrer>\S+)\""
| where !(referrer matches "*example.com*")
| (size/1024) as kbytes 
| count as hits, sum(kbytes) as kbytes by referrer
| sort hits
| limit 20

Note that the parse regex expression automatically drops all logs that don’t have a .jpg, .jpeg, .png, or .gif extension. In addition, the where clause filters out your own web pages (be sure to change example.com to your own domain). Visualizing the results as a bar chart quickly tells you which sites are the culprits.

[Figure: Identifying hotlinking referrers]

Depending on the type of web application you’re running, you might want to block these referrers. Or, you can simply disallow image hotlinking altogether by adding the following lines to your .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^https?://(.+\.)?example\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F]

Again, change example.com to your domain so your own web pages aren’t blocked from retrieving image files.
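
Before deploying rules like these, it can help to check the referrer condition against a few sample values. The Python sketch below approximates the two RewriteCond lines with Python's re module, which behaves like Apache's regex engine for a pattern this simple. The pattern here accepts both http and https referrers, which is an assumption about how your site is served, and the helper name is hypothetical.

```python
import re

# Approximates the first RewriteCond pattern; re.IGNORECASE stands in for [NC].
OWN_SITE = re.compile(r"^https?://(.+\.)?example\.com/", re.IGNORECASE)


def is_blocked(referer):
    """True when both conditions hold, so the RewriteRule would return 403."""
    not_own_site = OWN_SITE.search(referer) is None  # first RewriteCond
    not_empty = referer != ""                        # second RewriteCond
    return not_own_site and not_empty


# Hypothetical referrer values.
for referer in [
    "",
    "https://www.example.com/gallery",
    "http://forum.example.net/thread/1",
    "http://example.com.evil.test/",
]:
    print(repr(referer), "blocked" if is_blocked(referer) else "allowed")
```

Requests with an empty referrer stay allowed, so browsers and proxies that strip the header can still load your images.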

Identifying Malicious Users

Finally, you can analyze traffic against client IP to find abusive users, misbehaving bots, and potential denial-of-service attacks.

_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| parse regex "HTTP/1\.1\" \d+ (?<size>\d+)"
| (size/1024) as kbytes 
| sum(kbytes) as kbytes by client_ip
| sort kbytes
| limit 20

Visualizing the results as a pie chart makes it easy to identify IPs that are using up abnormal amounts of bandwidth:

[Figure: Traffic distribution by IP address]

Note that this is only the first step towards finding malicious clients. Deeper analysis is required to determine if these IPs are scraping copious amounts of information from your application, trying to make your site unresponsive, or if they’re simply your best customers.
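
As a starting point for that deeper analysis, the following Python sketch ranks clients by volume and adds each client's share of the total, which is roughly what the pie chart shows. It assumes the client address is the first field of each line, as in the default log formats; the function name and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Client address, identity fields, timestamp, request, status, numeric size.
ENTRY_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+)')


def top_clients(lines, limit=20):
    """Return (ip, kbytes, share_of_total) tuples, largest volume first."""
    kbytes = defaultdict(float)
    for line in lines:
        match = ENTRY_RE.match(line)
        if not match:
            continue
        kbytes[match.group("ip")] += int(match.group("size")) / 1024
    total = sum(kbytes.values())
    ranked = sorted(kbytes.items(), key=lambda item: item[1], reverse=True)
    return [(ip, kb, kb / total) for ip, kb in ranked[:limit]]


# Hypothetical sample lines.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2023:13:51:36 +0000] "GET /a.zip HTTP/1.1" 200 2097152',
    '198.51.100.7 - - [10/Oct/2023:13:52:10 +0000] "GET /b.zip HTTP/1.1" 200 1048576',
    '203.0.113.5 - - [10/Oct/2023:13:53:40 +0000] "GET /c.zip HTTP/1.1" 200 1048576',
]

for ip, kb, share in top_clients(SAMPLE):
    print(ip, f"{kb:.0f} KB", f"{share:.0%}")
```

A single address holding a large share is a prompt to look at what it requested and how often, not proof of abuse on its own.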

The next article in this series explains more sophisticated ways to identify misbehaving robots. For more information about identifying and blocking malicious users, see Analyzing System-Critical Errors.

Summary

This article was about reducing traffic, measured in either hits or volume. We discussed ways to optimize application resources, prevent other websites from hotlinking content, and identify malicious users. One topic we didn’t cover was analyzing response time, which requires a custom Apache access log format.

For any of these activities, the role of an Apache log analyzer is monitoring and root cause analysis—not remediation. A tool like Sumo Logic can provide real-time insights about your web traffic, but it’s still up to your IT and development teams to implement solutions.
