How to Identify Robots with Apache Logs - Sumo Logic
Two kinds of robots crawl your website: good bots and bad bots.

Good bots identify themselves in their user agent string and obey the rules set forth in your robots.txt file. They also provide some kind of value to your company in return for the bandwidth required to serve them. For example, you typically want Googlebot to crawl your site so that it shows up in search engine results.

Bad bots, on the other hand, don’t play by the rules. They not only consume server resources to the detriment of your human users, but often scrape proprietary information for their own use. Their intent can even be more malicious, including denial-of-service attacks and automated security vulnerability checking.

Identifying Robots with Apache Logs
Identifying potential bot traffic in Sumo Logic

In this article, we’ll survey several techniques for identifying both good and bad bots by analyzing Apache log data. You can follow along by signing up for a free Sumo Logic account. Once you’ve identified the bots on your own site, you can optimize good ones by altering robots.txt or block bad ones by IP address in .htaccess.

Finding robots is a more advanced application of analyzing traffic metrics, so this article assumes you’ve already read through the basics in Apache Traffic Analysis.

Identifying Good Bots

Well-behaved bots identify themselves in the user agent portion of the combined log format. This makes it relatively straightforward to isolate log entries created by good bots:

_sourceCategory=Apache/Access ("Googlebot" OR "AskJeeves" OR "Digger" OR "Lycos"
OR "msnbot" OR "Inktomi Slurp" OR "Yahoo" OR "Nutch" OR "bingbot" OR
"BingPreview" OR "Mediapartners-Google" OR "proximic" OR "AhrefsBot" OR
"AdsBot-Google" OR "Ezooms" OR "AddThis.com" OR "facebookexternalhit" OR
"MetaURI" OR "Feedfetcher-Google" OR "PaperLiBot" OR "TweetmemeBot" OR
"Sogou web spider" OR "GoogleProducer" OR "RockmeltEmbedder" OR
"ShareThisFetcher" OR "YandexBot" OR "rogerbot-crawler" OR "ShowyouBot" OR "Baiduspider" OR "Sosospider" OR "Exabot")
| parse regex "\"[A-Z]+\s+\S+\s+HTTP/[\d\.]+\"\s+\S+\s+\S+\s+\S+\s+\"(?<agent>[^\"]+?)\""
| parse regex field=agent "(?<bot_name>facebook)externalhit?\W+" nodrop
| parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\S+" nodrop
| parse regex field=agent "(?<bot_name>PaperLiBot?)/.+" nodrop
| parse regex field=agent "(?<bot_name>TweetmemeBot?)/.+" nodrop
| parse regex field=agent "(?<bot_name>msn?)bot\W" nodrop
| parse regex field=agent "(?<bot_name>Nutch?)-.+" nodrop
| parse regex field=agent "(?<bot_name>Google?)bot\W" nodrop
| parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\W" nodrop
| parse regex field=agent "(?<bot_name>Yahoo?)!\s+Slurp[;/].+" nodrop
| parse regex field=agent "(?<bot_name>bing?)bot\W" nodrop
| parse regex field=agent "(?<bot_name>Bing?)Preview\W" nodrop
| parse regex field=agent "(?<bot_name>Sogou?)\s+web\s" nodrop
| parse regex field=agent "(?<bot_name>Yandex?)Bot\W" nodrop
| parse regex field=agent "(?<bot_name>rogerbot?)\W" nodrop
| parse regex field=agent "(?<bot_name>AddThis\.com?)\s+robot\s+" nodrop
| parse regex field=agent "(?<bot_name>ShareThis?)Fetcher/.+" nodrop
| parse regex field=agent "(?<bot_name>Ahrefs?)Bot/.+" nodrop
| parse regex field=agent "(?<bot_name>MetaURI?)\s+API/.+" nodrop
| parse regex field=agent "(?<bot_name>Showyou?)Bot\s+" nodrop
| parse regex field=agent "(?<bot_name>Google?)Producer;" nodrop
| parse regex field=agent "(?<bot_name>Ezooms?)\W" nodrop
| parse regex field=agent "(?<bot_name>Rockmelt?)Embedder\s+" nodrop 
| parse regex field=agent "(?<bot_name>Sosospider?)\W" nodrop 
| parse regex field=agent "(?<bot_name>Baidu?)spider" nodrop
| parse regex field=agent "(?<bot_name>Exabot?)\W" nodrop
| where bot_name != ""
| if (bot_name="bing","Bing",bot_name) as bot_name
| count as hits by bot_name
| sort by hits 
| limit 20

This is a long query, but don’t be intimidated. All it’s doing is extracting the user agent from every log message and looking for bot-specific strings. Running this query in Sumo Logic will return a list of the top 20 bots crawling your site:

Identifying Good Bots
The top bots crawling your website

Obviously, you’re not going to want to rewrite this query every time you want to analyze your bot traffic. Instead, you can save it to your Sumo Logic library by clicking the Save As button underneath the search bar.
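If you ever need to run the same check outside of Sumo Logic, the idea can be sketched in a few lines of Python: parse the user agent out of each combined-format log line and match it against a list of known bot substrings. The regex and the (abbreviated) bot list below are illustrative, not exhaustive:

```python
import re
from collections import Counter

# Apache combined log format; the final quoted field is the user agent.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# A small subset of the bot substrings from the query above.
BOT_NAMES = ["Googlebot", "bingbot", "YandexBot", "AhrefsBot", "Baiduspider"]

def count_bots(lines):
    """Tally hits per bot name, mirroring the query's logic offline."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that aren't in combined log format
        agent = m.group("agent")
        for bot in BOT_NAMES:
            if bot in agent:
                hits[bot] += 1
                break
    return hits
```

Feeding this a day's access log and printing `hits.most_common(20)` gives roughly the same top-20 table as the query, minus the per-bot name normalization.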

Analyzing Bot Traffic Volume

The above query gives you some idea about who’s crawling your site, but to do anything useful with that information, we need to dig deeper. With a few modifications, we can compare the volume of bot traffic to normal traffic over time:

_sourceCategory=Apache/Access
| parse regex "\"[A-Z]+\s+\S+\s+HTTP/[\d\.]+\"\s+\S+\s+(?<size>\d+)\s+\S+\s+\"(?<agent>[^\"]+?)\""
| parse regex field=agent "(?<bot_name>facebook)externalhit?\W+" nodrop
| parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\S+" nodrop
| parse regex field=agent "(?<bot_name>PaperLiBot?)/.+" nodrop
| parse regex field=agent "(?<bot_name>TweetmemeBot?)/.+" nodrop
| parse regex field=agent "(?<bot_name>msn?)bot\W" nodrop
| parse regex field=agent "(?<bot_name>Nutch?)-.+" nodrop
| parse regex field=agent "(?<bot_name>Google?)bot\W" nodrop
| parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\W" nodrop
| parse regex field=agent "(?<bot_name>Yahoo?)!\s+Slurp[;/].+" nodrop
| parse regex field=agent "(?<bot_name>bing?)bot\W" nodrop
| parse regex field=agent "(?<bot_name>Bing?)Preview\W" nodrop
| parse regex field=agent "(?<bot_name>Sogou?)\s+web\s" nodrop
| parse regex field=agent "(?<bot_name>Yandex?)Bot\W" nodrop
| parse regex field=agent "(?<bot_name>rogerbot?)\W" nodrop
| parse regex field=agent "(?<bot_name>AddThis\.com?)\s+robot\s+" nodrop
| parse regex field=agent "(?<bot_name>ShareThis?)Fetcher/.+" nodrop
| parse regex field=agent "(?<bot_name>Ahrefs?)Bot/.+" nodrop
| parse regex field=agent "(?<bot_name>MetaURI?)\s+API/.+" nodrop
| parse regex field=agent "(?<bot_name>Showyou?)Bot\s+" nodrop
| parse regex field=agent "(?<bot_name>Google?)Producer;" nodrop
| parse regex field=agent "(?<bot_name>Ezooms?)\W" nodrop
| parse regex field=agent "(?<bot_name>Rockmelt?)Embedder\s+" nodrop 
| parse regex field=agent "(?<bot_name>Sosospider?)\W" nodrop 
| parse regex field=agent "(?<bot_name>Baidu?)spider" nodrop
| parse regex field=agent "(?<bot_name>Exabot?)\W" nodrop
| if (bot_name="","Normal Traffic", "Bot") as traffic_type
| timeslice 1m
| (size/1048576) as mbytes
| sum(mbytes) by traffic_type, _timeslice
| transpose row _timeslice column traffic_type

Note that every parse line in this query ends with the nodrop option, and the where clause from the previous query is gone. This keeps non-bot log entries in the results so they can be compared against bot traffic.

Visualizing the results as a stacked column chart makes it easy to see when bots are flooding your site. This is important because it can slow down response times for your human users.

Analyzing Bot Traffic Volume
Comparing normal traffic to bot traffic over time

Since this query only looks for good bots, you should be able to control their crawl frequency and block irrelevant URLs by tweaking your robots.txt file. Not every bot honors the Crawl-delay directive (Googlebot notably ignores it), but it’s still a good start toward optimizing bot traffic.
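For example, a robots.txt along these lines asks compliant crawlers to slow down and to skip URLs that provide no search value. The delay value and the path below are placeholders to adapt to your own site:

```
# Ask compliant crawlers to wait 10 seconds between requests.
# Bingbot and YandexBot honor Crawl-delay; Googlebot ignores it.
User-agent: *
Crawl-delay: 10

# Keep crawlers out of URLs with no search value (placeholder path).
Disallow: /internal/
```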

Identifying Misbehaving Bots

However, there’s only so much optimization you can do for good bots because they are, well, good to begin with. By their very nature, they shouldn’t be flooding your site with requests or visiting pages that you’ve declared as “off-limits.”

As a system administrator, you should really be more concerned with misbehaving bots. Bad bots don’t typically identify themselves in their user agent string, which means the only way to detect them is by analyzing their behavior.

The rest of this article introduces a few ways to detect suspicious behavior from specific IP addresses. Regardless of whether these users are bots or humans, their abnormal browsing behavior is often enough cause to block those IPs from visiting your site.

Request Frequency by IP

Let’s start by getting a high-level view of visitor behavior. The following query returns the number of hits every minute, broken down by IP address.

_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| timeslice 1m
| count as hits by _timeslice, client_ip
| transpose row _timeslice column client_ip

Displaying this information as a line chart makes it easy to see how each user is interacting with your website:

Request Frequency by IP
Line chart showing hits by IP over time

This chart contains a lot of information, so it’s worth taking a moment to understand how to interpret it. Each line is a single IP address, and the y-axis shows the number of requests they made each minute. Both axes provide clues for identifying bots.

Humans generally browse web pages one at a time, which means they should be towards the bottom of the y-axis. They also spend some time reading each page, and they eventually leave your site, so you should also see intervals with no interaction. In other words, humans are represented as an irregular jagged line at the bottom of the chart.

Bot behavior can differ in a few ways. First, they can request several pages in parallel, in which case they’ll have many more requests per time interval than their human counterparts. Second, they make requests over a relatively constant interval. And third, they often crawl a large portion of your site instead of just visiting a few pages. This will be visualized as high y-axis values or relatively constant lines that don’t periodically drop to zero hits.

The brown and purple lines towards the top of the above chart are examples of potential bot behavior.
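If you’d rather score this pattern programmatically than eyeball a chart, here’s a rough Python sketch: flag IPs whose requests arrive both frequently and at near-constant intervals. The thresholds are illustrative starting points, not calibrated values:

```python
from statistics import mean, pstdev

def flag_suspects(hits_by_ip, rate_threshold=60, cv_threshold=0.5):
    """Flag IPs whose requests arrive both frequently and at near-constant
    intervals. hits_by_ip maps IP -> list of request timestamps (seconds).
    The thresholds are illustrative starting points, not calibrated values."""
    suspects = []
    for ip, timestamps in hits_by_ip.items():
        ts = sorted(timestamps)
        if len(ts) < 3:
            continue  # not enough data to judge timing
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        avg_gap = mean(gaps)
        if avg_gap == 0:
            suspects.append(ip)  # many requests in the same instant
            continue
        cv = pstdev(gaps) / avg_gap        # low cv = metronome-like spacing
        per_minute = len(ts) / (ts[-1] - ts[0]) * 60
        if per_minute > rate_threshold and cv < cv_threshold:
            suspects.append(ip)
    return suspects
```

A human clicking through pages produces irregular gaps (high coefficient of variation) at a low rate; a crawler on a timer produces the opposite, which is exactly what the constant lines in the chart represent.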

Tracking an IP Address

The goal of the previous query was to find suspicious IPs to investigate. Once you have those IPs, you can stalk their path through your website with the following query (be sure to change the where clause to use an IP you found in your own log data):

_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| parse regex "[A-Z]+ (?<url>\S+) HTTP/[\d\.]+"
| where client_ip = "166.94.146.84"
| timeslice 1m
| count as hits by url, _timeslice
| transpose row _timeslice column url

This returns every URL the user with IP 166.94.146.84 requested each minute. Visualizing this as a stacked column chart shows you where they’ve been spending their time.

Tracking an IP Address
Tracking a single user’s path through the website

Again, there are clues in both dimensions. Simultaneous requests lie along the y-axis, and request frequency can be found along the x-axis. In addition, you can see precisely which pages and media resources they’ve been visiting.

This last part is a powerful tool for identifying scraping behavior. For example, if your company aggregates real-time price points for a particular industry, you want to know if people are stealing this valuable information. If you see the pricing URL pop up over and over again (as in the above screenshot), you’ll know that this IP address is constantly hitting this page to see if you’ve published new data.
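The same check can be sketched offline in Python: given (ip, url) pairs parsed from the access log, tally the URLs a suspicious client requests most often:

```python
from collections import Counter

def top_urls_for_ip(records, target_ip, n=5):
    """records is an iterable of (ip, url) pairs parsed from the access log.
    Returns the target IP's most-requested URLs: the same URL hit over and
    over (a pricing page, say) is a classic scraping signature."""
    counts = Counter(url for ip, url in records if ip == target_ip)
    return counts.most_common(n)
```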

Inspecting Image-to-HTML Ratio

The previous examples only identify bot-like behavior by hit counts and traffic volume. This is great for identifying potential DoS attacks, performance problems, and scraping activity, but lower-traffic bots can also be a concern. For instance, spam bots that collect email addresses or submit spam through your contact and comment forms won’t show up as high-traffic users, but they can be just as problematic.

Many of these bots avoid downloading image and other media resources, which means we can identify them by comparing the number of media requests to HTML content requests.

_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| parse regex "\"[A-Z]+ (?<url>.+) HTTP" nodrop
| where url != "-"
| parse regex field=url "\.(?<type>jpg|jpeg|png|gif)" nodrop
| if (type == "", 1, 0) as content_resource
| if (type != "", 1, 0) as media_resource
| sum(content_resource) as content_resource,
  sum(media_resource) as media_resource by client_ip
| (media_resource/content_resource) as media_to_content_ratio
| sort by media_to_content_ratio asc
| fields - media_resource, content_resource

This query calculates the image-to-HTML ratio for each IP address. Alone, this number won’t tell you much, but establishing a baseline for human users and comparing that to potential outliers can help identify bots. A simple bar chart makes this much easier:

Inspecting Image-to-HTML Ratio
Identifying an image-to-HTML ratio outlier

The IP at the top of the chart is downloading significantly fewer static media resources than the rest of your users. This could indicate a bot, but it could also be a human using a text-only browser.
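The same ratio can be computed offline in a short Python sketch; the extension list mirrors the query above, and any client whose ratio sits far below your human baseline deserves a closer look:

```python
import re

# Media extensions mirroring the query above (extend as needed).
MEDIA_RE = re.compile(r"\.(jpg|jpeg|png|gif)(\?.*)?$", re.IGNORECASE)

def media_to_content_ratio(urls_by_ip):
    """Compute each client's ratio of media requests to content requests.
    A ratio near zero (far below your human baseline) is worth a closer look."""
    ratios = {}
    for ip, urls in urls_by_ip.items():
        media = sum(1 for u in urls if MEDIA_RE.search(u))
        content = len(urls) - media
        if content:  # skip clients with no content requests at all
            ratios[ip] = media / content
    return ratios
```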

Summary

It’s important to understand that the methods we discussed in this article are only heuristics for identifying misbehaving bots. They don’t define a magic numerical threshold that distinguishes bots from humans.

There are potential consequences for blocking IPs, so it’s important to be very careful while analyzing bot traffic. Needless to say, mistaking your best customers for bots is not going to be good for business. Unfortunately, this is easier to do than you might think.

For example, a human user that opens a few links in separate tabs will have several simultaneous hits, which is also a tell-tale sign of bots. Deeper analysis is almost always required to make an informed decision about whether it’s worth blocking a particular IP.
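When you do decide a block is warranted, the .htaccess approach mentioned at the start of the article might look like this under Apache 2.4+ (the IP is the example address used above; substitute your own findings):

```
# Apache 2.4+ (mod_authz_core): serve everyone except one misbehaving IP
<RequireAll>
    Require all granted
    Require not ip 166.94.146.84
</RequireAll>
```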
