User-Agent Strings

Sun, 2 Feb 2003

Today I noticed another weird user-agent in my access logs:

67.83.186.136 - - [02/Feb/2003:22:19:10 +0000] "GET /index.rdf HTTP/1.1" 200 7937 "-" "Microsoft URL Control - 6.00.8862"
67.83.186.136 - - [02/Feb/2003:22:19:10 +0000] "GET / HTTP/1.1" 200 27162 "-" "Microsoft URL Control - 6.00.8862"

This is weird because there doesn't seem to be much consensus about what it actually is. One blogger thinks it's the RIAA scouring the web for MP3s, while others think it's spammers looking for formmail scripts they can abuse. Another person, Chris Lutz, identifies it as a robot and bans it from his site because it doesn't respect robots.txt.

I think Chris is right: it's the signature of robot software that various people use for whatever they're looking for. I don't know what this one's target is, and I'm rather intrigued that it requested my RSS feed before anything else. Chris uses an ASP script at the top of each of his pages to deny access to certain bots. Although I can't deny having been tempted by this, I find that solution clunky. I like the mod_rewrite solution much better, and I am considering implementing it for the bots that have crawled my sites as well as some that haven't, such as imagelock.com and cyveillance.com, which make money by finding plagiarized material and crawl sites as fast as possible.
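Before banning anything, it helps to see how often an agent actually shows up. Here is a rough Python sketch of how one might pull user-agents out of log lines like the two above. It assumes Apache's combined log format, and the names LOG_LINE, parse_line, suspect_hits, and SUSPECT_AGENTS are made up for illustration, so treat it as a starting point rather than a finished tool.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical watch list; substrings are compared in lowercase.
SUSPECT_AGENTS = ("microsoft url control",)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def suspect_hits(lines):
    """Count requests per (host, agent) for agents on the watch list."""
    hits = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines in some other format
        agent = entry["agent"]
        if any(s in agent.lower() for s in SUSPECT_AGENTS):
            hits[(entry["host"], agent)] += 1
    return hits


# The two log lines quoted at the top of this entry.
sample = [
    '67.83.186.136 - - [02/Feb/2003:22:19:10 +0000] "GET /index.rdf HTTP/1.1" 200 7937 "-" "Microsoft URL Control - 6.00.8862"',
    '67.83.186.136 - - [02/Feb/2003:22:19:10 +0000] "GET / HTTP/1.1" 200 27162 "-" "Microsoft URL Control - 6.00.8862"',
]

for (host, agent), count in suspect_hits(sample).items():
    print(host, agent, count)
```

Matching on a substring rather than the full string means a change in the version number at the end of the agent would still be caught.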

Comments

Marie says:

I've seen that one too. One of my favorite things to do when I run across something new is to run a site search at http://www.webmasterworld.com. Lots of helpful info there on this agent. I think Cyveillance was the first IP I blocked from my sites.

Kevin says:

It's actually just the User Agent string for Microsoft's various web tools. Most people would change the string if they were building proper clients, but in this case it looks like someone is running a little VBA macro to get information or something. The request for - is probably to see what kind of information the server returns, i.e. FreeBSD/Apache/etc., and index.rdf I don't know off the top of my head. :-)

People hunting for FormMail exploits request /cgi-bin/formmail.pl and /cgi-bin/formmail.cgi, and if the UA string says "Microsoft URL Control" it probably means the results are getting dumped into Excel or something. Otherwise they're using some other method.

Laurabelle's Blog says:

Bots baby

At the time that I wrote my last entry about banning nasty user-agents, I tried blocking them with mod_rewrite, but

Chris Lutz says:

Hi

I just wanted to clarify: the ua-string "Microsoft URL Control" comes with the VB Studio/Windows component "Microsoft Internet Transfer Control", which many developers use if they write an app to access the internet. It is not per se a spambot, but in my opinion it is safe to assume that it will ignore robots.txt. - Chris

The Overnamed says:

Yeah, I know this is ancient beyond belief, but the .rdf extension indicates RDF, the Resource Description Framework (I think). It's an XML-based format that is fairly common. Slashdot uses it for their news, so you can put it on other sites. And they can interpret other sites' RDFs and use them.

John Douglas Porter says:

Chris is correct regarding the nature of Microsoft URL Control. It is just a little module for accessing the web, which programmers can use. It is entirely analogous to the LWP module for Perl.

And so it is correct for Microsoft URL Control to not respect robots.txt. That is the job of the script writer. And while it's possible that the programmer is using the control in a "robot", it's also possible that it's being used in some kind of interactive user application, in which the control fetches URLs under the supervision of a human, for the purpose of displaying the fetched pages.

In summary, I don't think blocking any agent with this user agent string is desirable. You'd get too many false negatives AND false positives. Better to block on known robot behavior, regardless of user agent string. Any half-clever spider writer will change the string anyway.
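To make "block on known robot behavior" a bit more concrete, here is a minimal Python sketch. It assumes requests have already been parsed into (host, time, path) tuples, and it flags a host either for requesting one of the FormMail paths mentioned earlier in this thread or for sending a burst of requests. The thresholds MAX_REQUESTS and WINDOW and the function name flag_hosts are hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Paths named earlier in this thread as FormMail probes.
PROBE_PATHS = {"/cgi-bin/formmail.pl", "/cgi-bin/formmail.cgi"}

# Hypothetical thresholds, chosen only for illustration.
MAX_REQUESTS = 30
WINDOW = timedelta(seconds=10)


def flag_hosts(requests):
    """requests: iterable of (host, datetime, path). Returns {host: reason}."""
    flagged = {}
    times = defaultdict(list)
    for host, when, path in requests:
        if path.lower() in PROBE_PATHS:
            flagged[host] = "requested a known probe path"
        times[host].append(when)

    for host, stamps in times.items():
        if host in flagged:
            continue
        stamps.sort()
        start = 0
        for end, stamp in enumerate(stamps):
            # Advance the window start until the window spans at most WINDOW.
            while stamp - stamps[start] > WINDOW:
                start += 1
            if end - start + 1 > MAX_REQUESTS:
                flagged[host] = "too many requests in a short window"
                break
    return flagged


# Made-up hosts and times, purely as a usage example.
t0 = datetime(2003, 2, 2, 22, 19, 10)
example = [
    ("10.0.0.1", t0, "/cgi-bin/formmail.pl"),
    ("10.0.0.2", t0, "/index.rdf"),
]
print(flag_hosts(example))
```

Nothing here looks at the user agent string at all, so a spider that changes its string is treated the same as one that does not.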
