
Forum Moderators: Ocean10000 & incrediBILL & keyplyr

At Home with the Robots: 2017 edition

Critique of the Year's Active User Agents

3:18 am on May 13, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13710
votes: 451


At Home with the Robots

It’s been another two years. Time to see what the robots were up to in April 2017.
2015 edition [webmasterworld.com]
2013 edition [webmasterworld.com]
2012 edition [webmasterworld.com]

In the course of April 2017, robots accounted for something under half of all requests. That means, of course, more than half of all page requests, since most ask only for HTML.

I’ve got a generally permissive attitude to robots. Ask for robots.txt, do what it says--i.e. go away at once if you find yourself denied by name, don’t request material from excluded directories, and crawl at a reasonable speed--and I’ll poke a hole for you. My default robots.txt lists user-agents sequentially:
User-Agent: name1
User-Agent: name2
User-Agent: name3
Disallow: /
If a robot doesn’t respond to this, I try giving it a block to itself:
User-Agent: yourname
Disallow: /
and see if that works any better. On very, very rare occasions a robot doesn’t understand the first form, but honors the second. (“Never attribute to malice that which can be adequately explained by stupidity.”) Far more often, it never intended to obey in the first place.
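For what it’s worth, Python’s bundled robots.txt parser reads the stacked form the same way I intend it: consecutive User-Agent lines all share the Disallow that follows. A minimal check, with placeholder robot names:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: name1",
    "User-Agent: name2",
    "User-Agent: name3",
    "Disallow: /",
])

# All three listed names fall under the single Disallow:
assert not rp.can_fetch("name1", "http://example.com/page.html")
assert not rp.can_fetch("name3", "http://example.com/page.html")
# A robot not listed anywhere is allowed by default:
assert rp.can_fetch("somebot", "http://example.com/page.html")
```

So a robot that honors the separate block but not the stacked one really is just slow on the uptake, not following a different spec.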

If I give a full /32 IP, it means the robot used that IP consistently throughout all its visits. It doesn’t necessarily mean they will use the identical IP on your site. If I don’t give an IP at all, it means it came from an array of different addresses, so it is probably distributed. This being the Good Robots page, I don’t need to consider fakers.

Stop the Presses

For as long as I can remember, bingbot has been the Abou ben Adhem of robots.txt requests. This year it didn’t even make the Top Three. Almost one-quarter of all robots.txt requests came from ... Seznambot. Bet you didn’t see that one coming. Next came BLEXBot--about whom, more later--DotBot, and finally bingbot, with less than 6% of all robots.txt requests. (The Googlebot asked for robots.txt precisely 31 times--including redirects--in the course of the month, putting it in 7th place overall. I guess that’s one a day, plus one to grow on.) On the other hand, the winner in the subcategory of Redirected robots.txt Requests goes to Yandex: 12 of its requests were redirected. Really, Yandex, haven’t you learned my canonical name by now?

Search Engines

The site that I’m looking at is responsive, with no separate /m/ site or UA-based CSS. Search engines may have a different UA distribution if they know that your site serves variable content, whether at the same or different URLs.

Still Number One

Googlebot was, as usual, the single largest robotic visitor--but the margin wasn’t huge.

IP: 66.249.64-79 but primarily from the 66.249.64-67 subsector (idle query: year after year, Google does its crawling from this single /20 range. Why do other search engines require such a vast array of IPs by comparison?)
UA:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
The vanilla Googlebot accounts for about half of its crawls. Most of the rest are:
Googlebot-Image/1.0
Then there’s the mobile version:
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This UA, using Android, showed up on 14 April of 2016; the last appearance of the “iPhone” UA was on 18 April.

The older UAs with “Googlebot-Mobile” (DoCoMo and SAMSUNG in about equal measure) haven’t been around since the end of October 2016. So in April 2017, the Googlebot as such used only three UAs.
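As an aside, the /20 arithmetic above is easy to confirm with Python’s ipaddress module: 66.249.64.0/20 covers exactly 66.249.64.0 through 66.249.79.255, i.e. the 64-79 range quoted.

```python
import ipaddress

google_crawl = ipaddress.ip_network("66.249.64.0/20")

# A /20 holds 2**12 = 4096 addresses: third octet 64 through 79.
assert google_crawl.num_addresses == 4096
assert ipaddress.ip_address("66.249.64.1") in google_crawl
assert ipaddress.ip_address("66.249.79.255") in google_crawl
assert ipaddress.ip_address("66.249.80.0") not in google_crawl
```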

In late-breaking news, all Googlebot requests for supporting files (css, js) this month included a referer (the page the file “belongs” to). It looks as if they started doing this in the middle of March 2017. A handful of Googlebot image requests also had a referer, but I believe these were all associated with a “Fetch as Googlebot” GSC action.

But wait, there’s more. Alongside the true Googlebot, there’s an ever-expanding list of other Googloid functions. (This list will not include the AdSense-related crawlers, though someone else might like to chime in with the relevant information.)

IP: 66.102.6-7 and 66.249.80-95
I don’t know what they do with the rest of 66.102.0-64. I have only once--ever--seen them outside 6-9, and rarely outside 6-7.

In alphabetical order:

Docs:
Mozilla/5.0 (compatible; GoogleDocs; apps-presentations; +http://docs.google.com)
Confession: I have no idea what this does. It only fetches images, and it’s very rare. Their web page leaves me none the wiser.

Favicon:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
This UA has certainly matured over the years. Originally they sent no UA at all; later they called themselves Firefox 6, and since March of 2016 they’ve gone by Chrome/49. Unlike some search engines, Google doesn’t display a favicon next to each result; the favicon does show up whenever you list your sites in a Google property such as GSC or Profile and quite possibly others that I don’t know about.

Image Proxy:
Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)

SearchByImage:
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.7; Google-SearchByImage) Gecko/2009021910 Firefox/3.0.7
Confession: I never knew this UA existed. Thanks to that Firefox/3, they have never seen anything but a 403. The UA, complete with “de” (barring a few ebooks, I have no German-language content), has existed since at least 2015. If they hadn’t come from a Google IP, I’d have assumed they were just another unwanted robot.

Translate
This doesn’t have a UA of its own; it just appends “,gzip(gfe)” (with comma, without leading space) to the human UA string. The referer will be something involving “translate.googleusercontent.com”.
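A log-filtering sketch for spotting that pattern (the sample UA and referer below are invented for illustration):

```python
def via_google_translate(user_agent, referer):
    # The proxy appends ",gzip(gfe)" -- comma, no leading space --
    # to the human UA string, and the referer involves
    # translate.googleusercontent.com.
    return (user_agent.endswith(",gzip(gfe)")
            and "translate.googleusercontent.com" in referer)

assert via_google_translate(
    "Mozilla/5.0 (Windows NT 10.0; rv:53.0) Gecko/20100101 Firefox/53.0,gzip(gfe)",
    "https://translate.googleusercontent.com/translate_p")
assert not via_google_translate(
    "Mozilla/5.0 (Windows NT 10.0; rv:53.0) Gecko/20100101 Firefox/53.0",
    "https://example.com/")
```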

Web Preview:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36
Confession: Once again, color me puzzled. I remember a few years ago the SERPs always had an option for Preview, but I haven’t seen it in yoincks, so I have no idea what this UA currently does.

Formerly Known as Webmaster Tools

Site Verification:
Mozilla/5.0 (compatible; Google-Site-Verification/1.0)
Shows up periodically on any site that has a GSC (the former WMT) account.

Search Console:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/27.0.1453 Safari/537.36
I first saw this UA in May 2016. I don’t know exactly how old it is, because it comes only in response to a specific action on your part: “Fetch and Render” in the Fetch As Googlebot section of GSC. Like most googloid functions it is not subject to robots.txt; casual experimentation shows that if you request a page in a roboted-out directory, it will do the fetch with this UA, but won’t show the “What a Human Sees” render.

If that UA seems familiar, it’s because Preview is identical. They must have a sentimental fondness for Chrome 27.

We Try Harder

Thanks to our friend the BLEXbot, Bing slips to #3 in the overall request count. But it’s pretty close.

IP: 40.77.167, 157.55.39, 207.46.13
Obviously these are not Bing’s full ranges. But all requests throughout the month were evenly divided between these three /24 sectors.

UAs:
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
:: pause for silent snicker at idea of Bing using an iPhone UA ::

Unlike Google, Bing uses the same UA for both pages and images. No request had a referer. The mobile UA accounted for about 15% of requests in all categories. The only exception is that the iPhone bingbot never asks for robots.txt under its own name.

About 10% of bing requests came from The Robot That Will Never Die:
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
In spite of the “media” in the name, requests were exclusively for pages.

And then there’s Bing Preview. In addition to the three bing/msn crawl ranges, it also shows up from
65.55.210, 131.253.25-27, 199.30.24-25
UAs:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b

I’m not clear what this UA actually does. I don’t believe it is a true preview; the requests don’t come in packages (page, supporting files, images) like a human. It may be Bing’s version of a Mobile-Friendliness tester.

Unlike Google’s Site Verification, Bing’s wears plain clothes:

IP: 131.253
UA:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
It never requests anything but /BingSiteAuth.xml.

Meanwhile in the Czech Republic

(I read that they officially changed the name to Czechia, but everyone who lives there hates it, which would seem to be a drawback. The website says Czech Republic.) Seznam has always been fond of my site; not sure why, since human visitors sent by Seznam can be counted on your fingers.

IP: 77.75.76-79
UA:
Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
This UA was rolled out in May 2016. Note the new About page, which is in English. Incidentally, Google Translate says that “seznam” means “list”.

I also found a scattered handful of:
Mozilla/5.0 (compatible; SeznamBot/3.2-test1-1; +http://napoveda.seznam.cz/en/seznambot-intro/)
which is probably exactly what it looks like, something experimental. It even asked for robots.txt :)

Yandex Carries On

This year, Yandex’s distinguishing trait was the sheer range of IPs they used.

IP: 141.8.143.141 (their hands-down favorite, down to the last /32); 5.255.250-253, 77.88.0-63, 100.43.64-95, 141.8.142-143; rarely 93.158.128-191, 199.21.96-99 (Yandex’s ARIN range)

UA:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
The imagebot was busy this month, accounting for about 2/3 of all requests.

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)
This UA was rare; it asked for pages and supporting files (css, js) but no images.

Yahoo! Slurp
IP: 68.180.228-230
UA:
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
In March 2016, Yahoo! Slurp suddenly started requesting stylesheets, always with the appropriate page as referer (same as the Googlebot). On the other hand, they seem to have entirely stopped asking for images as of December 2016.

Mail.RU
IP: 217.69.133
UA:
Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
Linking to a webpage in your UA string is generally considered A Good Thing--but, er, it only works if people can read Russian. Faute de mieux, I’ve always assumed they are a search engine. Rather a low-budget one: they do a biggish crawl every few months, at which point they show up on my Redirects lists requesting old pages that everyone else has already got sorted to their correct URL. Requests are almost exclusively pages. (Exceptions are interesting, but only if you know the site.)

Minor Players

Coccocbot
IP: 123.30.175
UA:
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)
I believe this is a search engine. They’re definitely from Vietnam; their real name has a lot more diacritics. They’re a strong contender in the race for Highest Proportion of robots.txt requests: they generally request just one file at a time, and each is accompanied by robots.txt. Or possibly the category is Slowest Crawl, since they spent the whole month painstakingly collecting all the images that belong to one page. (It would have been two, but the images belonging to the other page that caught their fancy are in a roboted-out directory.)

Daumoa
IP: 203.133.168-171
UA:
Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1
I think they’re a Korean search engine. They first showed up in response to an RSS feed, but since then have started wandering further. They’ve got a few other user-agents, but the “faqId” one is their current favorite, accounting for about 90% of the month’s visits. That includes all robots.txt requests, even when the page request will use a different UA. On two occasions, the second UA has asked for piwik.js, which they’re really not supposed to. Humph.

Exabot
IP: 178.255.215
UA:
Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
Robot of the French search engine Exalead. They must have noticed that my site has no French content, because they don’t come around much. Although they’re a search engine, and they periodically look at the xml sitemap, I don’t think they’ve ever done a full spidering; they come in and ask for specific pages.

DuckDuckGo
IP: 107.21.1.8
UA:
Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)
As I understand it, DuckDuckGo uses other robots’ crawl data and applies their own algorithm. So the only time I see them is when the Favicons-Bot comes by, indicating that I have come up in somebody’s search. I don’t know how often they re-crawl for this purpose; the closest together I’ve seen them is about 18 hours.

Special feature of this robot: All requests have an auto-referer, necessitating various hole-poking. In spite of the name, they request the root first; if the page is blocked they won’t request the favicon. (A bit funny in my case since everyone can see the favicon, barring one Italian referer-spam site because you gotta draw the line somewhere.)

To be continued...

[edited by: lucy24 at 3:38 am (utc) on May 13, 2017]

3:20 am on May 13, 2017 (gmt 0)



Further Afield

Not necessarily search engines, but still legitimate in my book:

Qwantify
IP: 194.187.170-171
UA:
Mozilla/5.0 (compatible; Qwantify/2.3w; +https://www.qwant.com/)/2.3w

Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/)/2.4w

Qwantify/1.0
The third, minimalist UA is only for the favicon. The change from 2.3 to 2.4 happened around the 24th, with no overlap.

MJ12bot
Is it a search engine? Is it something else? You decide. Either way, MJ12 is one of the best-known distributed crawlers, so I won’t bother to list its IPs.
UA:
Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)

Qwantify and MJ12 are among the few law-abiding robots that still use HTTP/1.0.

Cliqzbot
UA:
Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company/cliqzbot)
“Was genau ist Cliqzbot?” Another of those targeted searches, I think. They are either distributed, or they sprawl so widely across 52 that there’s no telling where they really live.

DeuSu
IP: 85.93.91.84
UA:
Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
DeuSu only understands Disallow if they’re given a section to themselves in robots.txt.

Yeti
IP: 125.209
UA:
Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/bot)
Although they only showed up once--robots.txt plus a page--Yeti deserves a look-in because this month marks the first sighting since ... drumroll ... July of 2014. That was at my old site, which they used to visit all the time; they’ve never before set foot on my main site. They used to change IP every year or so, but this one's been the same since mid-2013.

Three Headscratchers

Bing and Yandex don’t have any connection that I know of--but they’re both associated with one near-identical behavior:

Visitor comes in with generally humanoid headers and requests a page, sometimes giving the appropriate search engine as referer. They request scripts, stylesheets, fonts--in short, all supporting files except images and favicon. All three entities pick up piwik.js, though only one of them acts on it by requesting piwik.php. (It may or may not be relevant that piwik lives on a different site, although presumably on the same server.)

Drake Holdings:
IP: 204.79.180 (the range belongs to an outfit called Drake Holdings, hence the name)
UA:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;  Trident/5.0)
That’s two spaces before the second “Trident”. Any given visit uses one UA or the other, but apart from that they’re random. Some html requests have no referer; some have bing search, with a plausible search string for the requested page. Not always right, but always understandable. Requests for supporting files have human-type referers. Unlike the other two members of the trio, it requests piwik.php, meaning that it acts on javascript.
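If you wanted to pick these out of a log, the doubled space is distinctive enough to match on. A sketch:

```python
import re

# Note the two spaces before the second "Trident"; Windows NT is
# either 6.0 or 6.1, per the two observed forms.
drake_ua = re.compile(
    r"^Mozilla/5\.0 \(compatible; MSIE 9\.0; Windows NT 6\.[01]; "
    r"Trident/5\.0;  Trident/5\.0\)$"
)

assert drake_ua.match(
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)")
assert drake_ua.match(
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;  Trident/5.0)")
# A single space doesn't match -- that would be a different faker:
assert not drake_ua.match(
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)")
```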

Plainclothes Bingbot:
IP: 65.55 and 131.253
UA:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;  Trident/5.0)

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)
(i.e. identical to the two Drake Holdings forms)

Both of these have been getting 302 redirected to a custom page that has served different purposes over the years; one of its functions is to intercept humans who accidentally behave like robots.

Yandex Referers
IP and UA: as expected for humans in Russia
Referer: http://yandex.ru/clck/jsredir?from=yandex.ru%3Bsearch%3Bweb%3B%3B&text=&etext= ... et cetera, et cetera, with an enormously long string of garbage that appears to be identical to a genuine Yandex referer. Each request is followed or preceded within 24 hours by an apparently human visit (with images and piwik, without favicon) to the same page. Different IP and UA, but Yandex referers to my site are infrequent enough that I can easily pick them out.

Bingbot’s Evil Twin
Alongside the ordinary plainclothes Bingbot, there’s this further agent that I never noticed until fine-tooth-combing logs:

IP: 40.77.169
UA:
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
I had no idea this Chrome UA existed; they’ve been getting a 403 on the grounds of “non-bingbot from Bing/MSN range”. They make the same set of three requests every few days: first /dir/subdir1/ and then, several hours later, /dir/subdir2 (without slash) immediately redirected to /dir/subdir2/ with slash. (Paradoxically, the malformed URL is what prevents it from being blocked at the outset; it’s a very narrowly constrained RewriteRule.) Turns out this has been going on--always the same set of three--since mid-September.

:: insert “wtf?” emoticon here ::

Other Law-Abiding Robots

In alphabetical order:

AhrefsBot
UA:
Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)
Probably distributed; I counted eight different ranges this month. But all robots.txt requests come from just two IPs, 5.196.87.7 and 151.80.31.151, which may explain why their website says that robots.txt changes can take up to a week to be recognized. In spite of this, they’ve never asked for anything in a roboted-out directory. Requests are mostly pages, with a few seemingly random images mixed in.

Applebot
IP: 17.142
UA:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)
The iPhone version is rare. Has anyone ever figured out what this robot does? Some people may remember “If robots instructions don’t mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.” The Applebot is not the only robot to adhere to this quaint misapprehension.

BLEXBot
IP: 148.251.244.204
UA:
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
“BLEXBot assists internet marketers to get information on the link structure”
I am not absolutely certain it is wise to put the words “BLEXBot” and “link structure” into the same sentence. Surprisingly, only about 1/6 of the month’s requests were 404s caused by appending other people’s URLs to my paths--a pervasive problem that others have noticed too. I would have guessed closer to 95%. It looks as if they’ve had the problem since December 2016, but it has been getting worse.

BUbiNG (open source)
IP: 64.62.252, 104.254.132.10
UA:
BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
BUbiNG is a scalable, fully distributed crawler, currently under development and that supersedes UbiCrawler.
Although the two IP ranges belong to different hosts, there are no major differences in their behavior. (UbiCrawler must have been before my time; I find no record of it.) I put them in the “No skin off my nose” category.

DotBot
IP: 216.244.66.229
UA:
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
Psst! DotBot! It’s tidier when the URL in your UA string doesn’t redirect--especially not to an entirely different domain. (They are not the only ones.)

SemrushBot
IP: mostly 46.229.168.75 (the exceptions may be fakers, but why would you pretend to be the SemrushBot?)
UA:
Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)
It would be very interesting to know what they’re looking for, since several requests were for obscure interior non-English-language pages that, to the best of my knowledge, are not linked from anywhere in the known universe.

SEOkicks
IP: 138.201.30.66
UA:
Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)

SiteExplorer
IP: 208.43.225.84-85
UA:
Mozilla/5.0 (compatible; SiteExplorer/1.1b; +http://siteexplorer.info/Backlink-Checker-Spider/)
SiteExplorer is one of the rare robots that only understand Disallow if they’re given a sector to themselves in robots.txt, so I initially thought they were non-compliant. Later evidence suggests that they’re just very, very slow on the uptake: in the course of the month they picked up robots.txt eleven times, but never screwed up the courage to ask for a page until almost the end of the month.

spbot
IP: 138.197.74.117, 45.55.175.162 (two full crawls)
UA:
Mozilla/5.0 (compatible; spbot/5.0.3; +http://OpenLinkProfiler.org/bot )
I think their crawling happens on the fly: robots.txt, two forms of root--one of which gets a 301--and then all other pages, from top to bottom, with the same referer a human would send. In the rare case that a page is linked from widely separated directories on the same site, the referer is whichever one the robot saw first. Since they don’t come in with a shopping list, there are never any 301s or 410s. This makes it useful for record-keeping purposes: Count the number of requests, subtract two, and that’s how many visible URLs you’ve got ;)

test Crawl
IP: 54.92.230.108 (always)
UA:
test Crawl
I did say I didn’t have very high standards when it comes to authorizing robots. They were very active in the latter months of 2016; in April they only showed up once. They’re only interested in one directory: mostly pages, but the occasional stylesheet, and sometimes the first image on a page--regardless of whether it’s a full-color frontispiece or a little icon from the navigation banner.

UptimeBot
UA:
Mozilla/5.0 (compatible; Uptimebot/1.0; +http://www.uptime.com/uptimebot)
Referer:
http://uptime.com/example.com

Like SiteExplorer, UptimeBot is exceedingly slow w/r/t robots.txt, but if you keep denying them for a month or two they will eventually get the message and stop requesting pages. At that point you may choose to whitelist them. Most of the time they just do a HEAD /, which is hardly a server-intensive request.

CCBot
IP: 54.162.105.83, 54.225.9.188 (they pick a fresh IP for each crawl, and don’t come around very often)
UA:
CCBot/2.0 (http://commoncrawl.org/faq/)
May be following outside links, though they did once look at the sitemap. Their web page says they’re Nutch-based, which may explain their compliance with robots.txt--not something you see every day from the 54 neighborhood.

Targeted Robots: Authorized

Most of these first showed themselves in the latter part of 2015 when I started putting ebooks into a curated directory that includes both an RSS feed and a “New Listings” page. These robots tend to be truthful about where they learned about the page they’re requesting, even if they didn’t literally click on a link to get there, the way a human would.

Flipboard
UA:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.6; +http://flipboard.com/browserproxy)
Flipboard 1.2 is for robots.txt and pages, 1.6 is for images. For each new file, they request the HTML a few times, and the associated images just once. They don’t seem to be interested in stylesheets.

omgili
IP: 82.166.195.65
UA:
omgili/0.5 +http://omgili.com
The name is short for OMG I Love It. Unfortunately, I am not making this up. Check for yourself. This is a recent arrival; I only started seeing it earlier this year. It picks up a page when it first learns about it, and then comes back every week or so for the same page.

Less frequent visitors include:

magpie-crawler: 185.25.32.42
trendictionbot: 144.76.22.175
rogerbot (Nutch-based robot from Mozilla)

Social Links

facebook
IP: 31.13.64-127 (Ireland), 66.220.144-159, 69.63.128-191, 69.171.224-255, 173.252.64-127
UA:
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

facebookexternalhit/1.1
I have not yet figured out the significance of the shorter UA. It made its first appearance in the latter part of 2015, but doesn’t seem to be a direct replacement of any other UA, and doesn’t have a clearly recognizable function. The formerly common visionutils has not been around since April 2016.

Facebook’s behavior changes every couple of years. Currently new links start with a page request, followed by all image files belonging to the page, giving the page as referer. The human user then presumably selects one image, which will get re-requested every time your original visitor’s friends view the page that linked to you. These image-only requests come in with no referer. (Pro tip: Do your pages include a non-visible image file, such as piwik’s noscript dot? I rewrite this for facebook to a little banner giving the site name. So if a page happens not to have any good pictures--some books don’t--there’s still something for the human user to select.)

Facebook, incidentally, seems to be responsible for those /dir/subdir2 without-final-slash hits, now limited to the Bingbot Evil Twin. All it takes is one person to like a page and misspell its URL...

twitter
IP: 199.16.156-159, 199.59.148-151
UA:
Twitterbot/1.0
Uncharacteristically for a social-media-based robot, the Twitterbot asks for and obeys robots.txt. And, like the Googlebot, it never forgets. This month’s requests included a URL that I remember seeing on Redirect lists, meaning that they first learned about it no later than December 2013.

NetVibes
IP: 193.189.143
UA:
Netvibes (http://www.netvibes.com)


Fakers
In the category of Most Unlikely User-Agent:
UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0
The IP varies, but it’s always the identical UA string from beginning to end, so it’s not just an extra bit tacked on to the end of a human browser.

Leftovers

Finally, on the Probably-Good side, anyone who shows pretensions to robotitude by asking for--and honoring--robots.txt and having an identifiable name:

crawler4j
DomainCrawler
GarlikCrawler
MojeekBot
SafeDNSBot
SemanticScholarBot
SurveyBot (DomainTools)
VeriCiteCrawler
Wotbox

The name is crucial; there’s not much point to asking for robots.txt if you’re going around calling yourself Firefox 10, or even Chrome 54.
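One way to compile such a list from an Apache combined-format log. The field layout is the standard one; the sample lines (and the MojeekBot entry in them) are invented for illustration:

```python
import re

# Apache "combined" format: IP, identd, user, [date], "request",
# status, bytes, "referer", "user-agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d+) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def robots_txt_agents(lines):
    """Names that asked for /robots.txt -- the 'pretensions to
    robotitude' list."""
    agents = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") == "/robots.txt":
            agents.add(m.group("ua"))
    return agents

sample = [
    '1.2.3.4 - - [30/Apr/2017:00:00:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "MojeekBot/0.6"',
    '5.6.7.8 - - [30/Apr/2017:00:01:00 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
]
assert robots_txt_agents(sample) == {"MojeekBot/0.6"}
```

Cross-check the resulting names against which ones subsequently stayed out of roboted-out directories, and you have your Probably-Good list.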

To be continued...
3:23 am on May 13, 2017 (gmt 0)



And now the bad news...

Well, not that bad. Overall, I think about 3% of unwanted robots got in--and most of those just picked up the front page and went on their way. I didn’t see any really devastating crawls coming out of nowhere. It isn’t enough to have a thoroughly human UA; they also have to send plausible humanoid headers. Thankfully, most robots are either too stupid or too lazy. About 3% of them sent no UA at all: instant lockout. About one-quarter--including a number of major search engines--did not send an Accept header: instant lockout unless whitelisted. (Other header anomalies are, of course, for me to know and you to find out. Or not.)
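Reduced to pseudo-firewall form, the two lockout rules just mentioned amount to something like this. The whitelist names here are placeholders, and real rules would of course live in the server config, not application code:

```python
def screen_headers(headers, whitelisted_bots=("Googlebot", "bingbot")):
    """Toy version of the UA/Accept screening described above."""
    ua = headers.get("User-Agent", "")
    if not ua:
        return "block"  # no UA at all: instant lockout
    if "Accept" not in headers:
        # several major search engines omit Accept, so they
        # must be whitelisted by name
        if not any(name in ua for name in whitelisted_bots):
            return "block"
    return "pass"

assert screen_headers({}) == "block"
assert screen_headers({"User-Agent": "SomeBot/1.0"}) == "block"
assert screen_headers({"User-Agent": "Mozilla/5.0", "Accept": "*/*"}) == "pass"
assert screen_headers(
    {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"}) == "pass"
```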

Robotic User-Agents

Aside from that small group that can’t be bothered to send a UA header at all, there are always a few brand-new crawlers who go around for months with the equivalent of Insert Name Here. And then there are the computer-science class assignments who never do figure out what to call themselves. (“Am I ‘Test1’? No, wait, I think I was ‘testone’.”) The fake Googlebot seems to have all but disappeared in recent months. Maybe the people who write robot scripts have figured out that “Googlebot UA + non-Google IP = automatic lockout”, so it ends up being worse than useless.

This month, about 85% of all robots were smart enough to start their names with “Mozilla” followed by some more-or-less-plausible humanoid user-agent. Firefox/40.1 seems to be in fashion just now; at least it’s a little more believable than the assorted one-digit Firefoxes. And a visitor of any species, human or robot, calling itself MSIE 6 ... can only inspire pity.

Some of them, though, aren’t even trying:

null
IP: 84.109.137.242
UA:
null

Go-http
UA:
Go-http-client/1.1
The pattern of requests suggests that most of them are the same robot, in spite of coming from all over the map.

Dorado WAP-Browser
UA:
Dorado WAP-Browser/1.0.0
Sometimes humans wear this face too, but more often it’s a robot.

Not Welcome Here

On some sites, Chinese search engines might be perfectly legitimate and even desirable. Me, I want no part of ’em--but try getting this message through.

Baidu
IP: 123.125.71, 180.76
UA:
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2

In April 2017, the Baiduspider by that name never showed its face at all. (It isn’t gone; I’ve seen it in May.) It visited once or twice a day from 180.76, but only to request robots.txt under the FF 6 alias. There were also a slew of blocked root requests throughout the month, from 123.125.67--an IP that Baidu has used in the past--claiming to be Chrome/45.

What I did see was a robot from an ARIN range professing to be
compatible;Baiduspider/2.0; +http://www.baidu.com/search/spider.html
[sic] What’s better than faking your UA? Claiming to be something that would be banned in its own right.

Sogou
IP: 36.110, 106.38, 106.120, 123.126, 220.181.124
UA:
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Currently it doesn’t seem to be interested in much but /ebooks/, which strongly suggests it is picking up links from some outside source. Requests but ignores robots.txt, where it is denied by name.

In both cases--Baidu and Sogou--I don’t know whether the robot is wilfully ignoring robots.txt, or whether it’s too stupid to understand a continuous listing (more than one User-Agent line sharing a single Disallow directive).
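If you want to test the "continuous listing" question offline, Python's standard-library robots.txt parser understands shared User-Agent lines perfectly well, which makes a handy sanity check that the syntax itself is legal:

```python
from urllib.robotparser import RobotFileParser

# A grouped record: two User-Agent lines sharing one Disallow,
# the same form as the robots.txt quoted earlier.
rules = """\
User-Agent: Baiduspider
User-Agent: Sogou web spider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "http://www.example.com/ebooks/"))       # False
print(rp.can_fetch("Sogou web spider", "http://www.example.com/ebooks/"))  # False
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/ebooks/"))      # True
```

In other words, any parser that implements the standard reads the shared group correctly; a robot that only honors the one-name-per-block form is the outlier.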

The Worst of the Worst

When you walk in and demand to see /wp-login.php, it’s all over. There is zero chance that you have any legitimate, law-abiding purpose. (I suppose if you actually have a WP site, you might invite security-testing robots to look around and confirm that nobody can get in where they don’t belong. That is not what my visitors are up to.) Close to 10% of the month’s robotic visits asked for wp-login, wp-admin, Fckeditor, xmlrpc.php and similar. (What they received, instead, was the 403 page--or, at worst, a manual 404.) A surprising number asked for robots.txt, or the meaningless “/blog/robots.txt”. I’m not clear about the purpose of the request: they did not rush over and ask for any and all roboted-out directories, but they did go ahead and ask for the standard wp list. Are they hoping to find something like “Disallow: /wp-admin876” that isn’t worth asking for unless you already know it exists?
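Screening these probes out of a log is a one-liner's worth of regex. A sketch (the path list is just the short version of what I see; extend to taste):

```python
import re

# Abbreviated probe list from this month's logs: WordPress logins,
# admin pages, the XML-RPC endpoint, FCKeditor uploads.
PROBE_RE = re.compile(
    r"/(wp-login\.php|wp-admin|xmlrpc\.php|fckeditor)", re.IGNORECASE)

def is_probe(path):
    """True if the requested path matches a known exploit-scanner target.
    On a site that doesn't run WordPress, nobody asks for these honestly."""
    return bool(PROBE_RE.search(path))
```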

The main feature of malign robots is miscellaneousness. Some IP shows up, requests two or six or a dozen files from a standard list, and never shows its face again. Repeat visits are a feature of named robots from known addresses.

Interestingly, about half of the group--from every possible IP--came in with the exact UA
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
suggesting that they all started with the same script. Unfortunately, the identical UA is still in use by humans. But many others had no UA, making for a convenient Shoot To Kill twofer.
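So the practical rule ends up three-valued, something like this sketch (the labels are mine):

```python
# The exact script UA quoted above.
SCRIPT_UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
             "Gecko/20100101 Firefox/40.1")

def classify(user_agent):
    """'block' for a missing or empty UA header; 'suspect' for the exact
    script string (real humans still send it, so it can't be blocked on
    its own); 'ok' for anything else."""
    if not user_agent:
        return "block"
    if user_agent == SCRIPT_UA:
        return "suspect"
    return "ok"
```

The "suspect" verdict only becomes a block when combined with behavior--like the request patterns below.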

Request Patterns

Most malign robots can be identified by their behavior rather than by name and address. Current patterns:

1+3
GET /some-inner-page (always a different one) giving root as referer although the requested page is not actually linked from the root
GET / (root), giving the previously failed request as referer
GET / (root), with auto-referer
GET / (root), with auto-referer
This month--Hurrah!--every last one of this pattern was blocked.

Contact Botnet
GET /some-inner-page (always a different one) giving root as referer, as above
GET / (root) with auto-referer
GET /boilerplate/contact.html giving root as referer
GET / (root) with auto-referer
These requests are always blocked--but only thanks to headers. IP alone won’t do it; the “Contact botnet” plagued me for years. A further quirk of this pattern is that the first referer has a final / slash, but the other three don’t. And, of course, they must have friends inside, because the first request is for some page that exists but is not visible from the front page.

Multiples:
Only one page is requested--but they ask for it repeatedly, anywhere from 3 to 7 times, most often 4. Sometimes they throw in some referer spam. Almost all of them are blocked.

One is enough:
IP: various
UA and headers: sufficiently humanoid
Request: one page from one directory ... which happens to consist of unusually large HTML files. I would prefer not to send out 300K if I don’t have to, so now they get 302 redirected to a detour page at about 1/200 the weight. (The page also catches the rare human by mistake, but they’re given enough information to proceed normally.) To make these go away I’ve been reduced to blocking a handful of IP ranges--which is exactly what I’d hoped to get away from by changing to header-based access controls. Sometimes there is just no other way, darn it.

“Your root page sent me”
Referer: example.com (with or without final /)
Claiming that you got to some deep inner page directly from the root is a good way to get yourself blocked--especially if you also get the www wrong. Sending an auto-referer for the root itself is similarly effective. Most of them don't even get as far as the RewriteRules, though.
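That test is easy to script if you keep a list of what the front page actually links to. A sketch with a hypothetical site and link list (the wrong-www check is left out for brevity):

```python
ROOT = "http://www.example.com"              # hypothetical site
ROOT_LINKS = {"/about.html", "/new.html"}    # pages the root actually links to

def bogus_root_referer(path, referer):
    """Flag a request that claims the root as referer for a page the
    root doesn't link to, or that sends an auto-referer for the root."""
    if referer is None:
        return False
    if referer.rstrip("/") != ROOT:          # with or without final slash
        return False
    if path == "/":
        return True                          # root claiming to refer itself
    return path not in ROOT_LINKS
```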

The Current Faker

Calling yourself Googlebot seems to be going out of fashion. The one I saw most often didn’t even use the full UA string; it thought it could get away with the magic word alone:
Googlebot (gocrawl v0.4)
Even then, it hedged its bets; the “gocrawl” version alternated with
Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2
which counts as “could be better, could be worse” among humanoid UAs. All blocked, so no skin off my nose in any case.

Targeted Robots: Unauthorized

I talked earlier about robots that read an RSS feed, or a certain site’s New Releases page, and come in asking for a file from that list. The Great Divide is whether they first ask for robots.txt. These don’t:

PaperLiBot
IP: 37.187.165.31
UA:
Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)

NewsBlur
UA:
NewsBlur Content Fetcher - 61 subscribers - http://www.newsblur.com/site/18645/new-online-books (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
Yes, it really says “61 subscribers” right there in the UA string. At least this month.

metadataparser
UA:
metadataparser/1.1.0 (https://github.com/bloglovin/metadataparser)

AppEngine
IP: 107.178.194
UA:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)
In the past they have had other, similar UAs, but for now they seem to have settled on this form. Each time they are presented with a new title, they request it over and over for a week or so and then lose interest.

Thanks but No Thanks

On second thought, it isn’t enough to ask for robots.txt. You also have to do what it says. That means:

MegaIndex
IP: 176.9.50.244 (formerly at 88.198.48.46--they seem to have taken 2016 off, and returned at a new address)
UA:
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
Some robots request robots.txt only after getting the front page; this is one of them. But, since they proceeded to ask for the entire contents of a roboted-out directory, it becomes pretty academic.

James BOT
UA:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html

linkdex
IP: various 54
UA:
Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)

ltx71
IP: 52.2.10.94 and 104.154.58.95 (exactly)
UA:
ltx71 - (http://ltx71.com/)
I first became aware of this robot when it showed up requesting new ebooks, but it turns out it also likes the page linked from my profile here; I just never noticed because it was always blocked for one reason or another.

NetShelter
UA:
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0 (NetShelter ContentScan, contact abuse@inpwrd.com for information)

WBSearchBot
UA:
Mozilla/5.0 (compatible; WBSearchBot/1.1; +http://www.warebay.com/bot.html)


We don’t want your kind around here

If they never even ask for robots.txt, how will they know what it says?

Gluten Free Crawler
IP: 104.131.147.112
UA:
Mozilla/5.0 (compatible; Gluten Free Crawler/1.0; +http://glutenfreepleasure.com/)
Crawls URLs it finds listed on other sites--including but not limited to the one given in my WebmasterWorld profile. As far as I can tell, the name is meant as a joke, not as referer spam. Ancient history: Someone from this exact IP, though using a different name, asked for robots.txt on 24 November 2014. Apparently they’re still assimilating its contents; they’ve never asked for a fresh copy.

Inbound Links
IP: 192.157.254.197 and 199.193.251.133 in alternation
UA:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Request: HEAD /
Referer:
https://www.inboundlinks.win/InboundLinks/www.example.com 
(always with incorrect www) They’re persistent, I’ll give them that; every few days they’re rattling the doorknob again.

Others

djbot/1.1 (+http://www.demandjump.com/company/about)

OrgProbe/0.9.4 (+http://www.blocked.org.uk)

Xenu Link Sleuth/1.3.8
I don’t know and don’t especially care whether this is the actual Xenu; all I know is, I didn’t order it. (I don’t know about Xenu’s ordinary behavior. The w3 link checker requests robots.txt on each site that it visits, and goes away weeping if it doesn’t find authorization.)

This Month’s Winner ...

... in the “Extra Stupid” category is:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
And your point is...? Only that the element “User-Agent:” is part of the UA string. I also met a lone
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
which similarly failed to read the Your New Robot instructions carefully enough.

“That’s All Folks!”

[edited by: lucy24 at 4:16 am (utc) on May 13, 2017]

3:29 am on May 13, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8477
votes: 364


Great comprehensive post Lucy24!

in the “Extra Stupid” category is:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
And your point is...? Only that the element “User-Agent:” is part of the UA string

Many test-drive versions (on GitHub) offer the user the ability to customize the UA field. A couple I've seen label that field as "User-Agent:", and the label remains in the UA string unless the user actually customizes that header field.

So this is actually an easy way to spot a fake browser and block it. Too bad they don't all do it that way.
4:19 am on May 13, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3252
votes: 153


This has got to be the most exhaustive "State of the Robots" resource anywhere. Thank you lucy24 for keeping notes and compiling the data and sharing it here! Not just this year, but the history, habits and evolution of these bots is so useful when some puzzling behavior pops up. Much appreciated! :)
5:40 am on May 13, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7554
votes: 509


Raw logs have the data, but it takes a human to mine the data to come up with actionable results. Kudos! And thanks!
1:16 pm on May 13, 2017 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 25, 2003
posts:1028
votes: 214


Indeed, a great walk through write up! Thanks much.

I love this forum because (1) most posters (waves to lucy and keyplr) are more open than I and share their logic, which is a fabulous check on mine own; (2) there are always edge cases and new hazards to beware that simply are not shared elsewhere.

This forum is probably WebmasterWorld's greatest unappreciated treasure.

Unfortunately, the folks who really could benefit from this.... Ah, well.
3:45 pm on May 15, 2017 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:24187
votes: 528


Thank you lucy24 for posting this comprehensive information.

Without doubt, it's a document to be bookmarked as it's extremely valuable.
3:11 am on May 17, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8477
votes: 364


It's worth noting that this report is specific to the author's experience. Others may see a different set of user agents (robots).

Also, what is unwanted at one website may be valued at another. Each webmaster must determine whether each agent is something beneficial, to be tolerated, or unwanted.
3:41 am on May 17, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7554
votes: 509


True, but most of the time lucy24 nails it as to the unwanted. I'll start there and then make my own decisions on what to keep or disallow. :)
4:10 am on May 17, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8477
votes: 364


Both Baidu and Sogou are well behaved (YMMV) bots from huge Asian Search Engines and as such are a high value resource even if you do not wish traffic from these regions. The problem is, the UAs are faked often and rDNS is tricky because of the parent ISPs.
4:26 am on May 17, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7554
votes: 509


Into each life a little rain must fall, and in the process one weeds out the unwanted and makes welcome those of value. Oddly enough (info site) Asian, EU, and RU are very active... so I spend time dealing with the bad actors as they show up.
4:25 pm on May 17, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13710
votes: 451


this report is specific to the author's experience

Or, if you like,
This report is specific to the author's experience
;)

Botnets come, botnets go. Even some perfectly well-mannered robots, such as minor search engines, do their crawls at widely spaced intervals (for example I'm currently witnessing Mail.RU's semiannual Catchup Crawl) so one month's data may be widely different from a month before or after.
4:42 am on May 18, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8477
votes: 364


Ha ha... well I didn't intend it to sound like a dire warning; just trying to keep it academic for readers with various levels of experience.
3:52 pm on May 18, 2017 (gmt 0)

Moderator

WebmasterWorld Administrator webwork is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 2, 2003
posts:7909
votes: 43


I Love Lucy . . . . . . . . . . . . . . was a funny TV show in the 1950s.
 
