Google and comment spam

Tue, 1 Feb 2005

This weekend I noticed that Googlebot had started indexing the URLs of my newly renamed comment script, even though it wasn't actually crawling the pages. Comment and trackback spammers don't need the content of the page, just the URL, so keeping Googlebot from crawling the pages was not enough for me. I wrote to Google to find out what could be done.

From: Laura Melton
To: Googlebot
Subject: Googlebot is not following my robots.txt
Date: Sat, 29 Jan 2005 21:03:39 -0800

Hi,

My robots.txt file (http://www.niceperson.org/robots.txt) disallows
bots from indexing my /cgi-imps/ directory (what I've renamed
/cgi-bin/).  Yet Google has indexed scripts in my /cgi-imps/ directory.
Why is this?

Thanks,

Laura Melton

My effort was not wholly in vain. Yesterday I received this reply:

From: Googlebot
To: Laura Melton
Subject: Re: Googlebot is not following my robots.txt
Date: Mon, 31 Jan 2005 14:11:42 -0800

Hi Laura,

Thank you for your note. Although your robots.txt file prevents our robots
from crawling your pages, it will not prevent our robots from adding a
link to your page without crawling it. We reviewed the link you mention:
www.niceperson.org/cgi-imps/mt/mt-tb.cgi/45. Our robots found this page
because another page linked to it. Our robots added the link to our index
without actually visiting or crawling the page. This is why the page does
not have a detailed title or description.

Although a robots.txt file usually prevents pages from appearing in our
search results, the only fool-proof ways to keep them out of our index are
to make sure that no sites link to them, password protect them, or remove
the robots.txt file and use a NOINDEX meta tag instead.

In this case, we suggest that you remove the links to this page or use
NOINDEX meta tags. To obtain a list of the links that point to a page,
perform a Google search on the URL. From the search results page, select
the 'Find web pages that contain the term' link, and Google will provide
you with webpages that mention that address. For more information on meta
tags, please visit http://www.google.com/remove.html#exclude_pages

You can also remove this page by submitting your robots.txt file for
immediate review at http://services.google.com:8882/urlconsole/controller.
However, please note that this will only temporarily remove this page from
our search results. If sites continue to link to this page, it may be
included again in our search results.

Regards,
The Google Team

This is not what I wanted to hear, but it is informative. Contrary to my prior belief, the robots.txt file does not completely block Googlebot from directories and files: it stops the crawling, but not the indexing of URLs that other pages link to. Moreover, as long as my robots.txt blocks a page, Googlebot never fetches it and so never sees its meta tags. I'm not getting rid of my robots.txt, because I need it for file types that can't carry meta tags. For example, my robots.txt has been successful in blocking off my images directory. In any case, I can't give my Trackback script a meta tag either. Screw the meta tag approach.
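To make the crawl-versus-index distinction concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt rules are an assumption, since the actual file is not quoted in this post; the trackback URL is the one from Google's reply. The check only answers "may a polite robot fetch this URL?" and says nothing about whether the bare URL ends up in an index.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: the post does not reproduce the real robots.txt.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-imps/
"""


def may_crawl(url, user_agent="Googlebot", robots_txt=ASSUMED_ROBOTS_TXT):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


trackback_url = "http://www.niceperson.org/cgi-imps/mt/mt-tb.cgi/45"
print(may_crawl(trackback_url))                           # False: crawling is blocked
print(may_crawl("http://www.niceperson.org/index.html"))  # True
```

A False here matches what Google described: the robot stays out of the page, but a link found elsewhere can still put the URL itself into the index.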

Removing all links to those scripts is sheer absurdity, of course, as is password protection. As for my rel="nofollow" attempt, Google says only that marked links will get no PageRank credit, not that Google won't index them. It was mere wishful thinking on my part.

This only proves what Phil and I said about Google's proposed solution: Google isn't interested in a solution for anyone but themselves. It's mind-bogglingly easy for spammers to find Movable Type comment or trackback scripts via Google (allinurl: "mt comments cgi"), so if Google really wanted to make the spammers' work harder, they'd make their bots a little less greedy (or give bloggers ways to slap them down).

So what can Movable Type bloggers do? I've submitted my robots.txt to Google's automated removal system, and within a week or so all links to my scripts should be gone, at least temporarily. As a more permanent work-around, I'm also going to rename my comment and trackback scripts (again) so that they don't contain the words "comment" or "trackback". Current ideas are annotate and see-also, but I haven't made up my mind. Whatever I name the scripts, the comment spammers will have to work much harder to find them.
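Why renaming helps can be sketched in a few lines of Python. This is only a rough approximation of an allinurl-style search, not Google's actual matching: it assumes a URL matches when every query word appears among the URL's alphanumeric chunks. The script names annotate.cgi and see-also.cgi are hypothetical (the candidate names above plus an assumed .cgi extension); mt-comments.cgi is Movable Type's default comment script name.

```python
import re


def url_tokens(text):
    """Split text into lowercase alphanumeric chunks."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def matches_allinurl(url, query):
    """Approximation: True if every query word is a chunk of the URL."""
    return url_tokens(query) <= url_tokens(url)


query = "mt comments cgi"
default_name = "http://www.niceperson.org/cgi-imps/mt/mt-comments.cgi"
renamed = "http://www.niceperson.org/cgi-imps/mt/annotate.cgi"  # hypothetical

print(matches_allinurl(default_name, query))  # True
print(matches_allinurl(renamed, query))       # False
```

Under that assumption the renamed script no longer turns up for the obvious query, although a spammer who guesses the new name can of course search for it just as easily.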

Nofollow

Wed, 19 Jan 2005

Somebody at Google had a great idea that I didn't even think about: marking links that shouldn't be considered when calculating PageRank. Actually I shouldn't be surprised that they thought of it; after all, PageRank is their crown jewel.

The magic is simple: add the attribute/value pair rel="nofollow" to any link. This means I can now set nofollow on a link-by-link basis, rather than on a whole page (which has been possible for years). That's why this is new and cool.
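As a small illustration of the link-by-link approach, here is a hedged Python sketch that renders an anchor tag and adds rel="nofollow" only when asked. The function name make_link and the example URL are made up for this sketch; nothing here is taken from Movable Type or any other blogging tool.

```python
from html import escape


def make_link(url, text, nofollow=False):
    """Render an HTML anchor, optionally marked rel="nofollow"."""
    rel = ' rel="nofollow"' if nofollow else ""
    return '<a href="{}"{}>{}</a>'.format(escape(url, quote=True), rel, escape(text))


print(make_link("http://example.com/", "Example", nofollow=True))
# <a href="http://example.com/" rel="nofollow">Example</a>
print(make_link("http://example.com/", "Example"))
# <a href="http://example.com/">Example</a>
```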

Google recommends using the rel="nofollow" attribute on all links that visitors can add themselves (comments, trackbacks, etc.), but as a commenter said on Shelley Powers' post on the subject,

Too bad that legitimate visitors will really be the ones who suffer from this. The spammers will just find other ways. Meanwhile, the folks that I am too happy to promote by allowing a live link to their site won't get the benefit they deserve.

He's right, and if I use this attribute, I'll only do so together with a mechanism for separating spam from ham. A blanket application is unfair to legitimate commenters, and in any case my site gets so few comments that I have so far been able to delete the spam manually when it appears.
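One way such selective application might look, sketched in Python. The filter here is a deliberately simple stand-in, a whitelist of trusted hosts, and not a real spam/ham classifier; TRUSTED_HOSTS, comment_link, and the example hosts are all hypothetical.

```python
from html import escape
from urllib.parse import urlparse

# Hypothetical whitelist standing in for a real spam-vs-ham filter.
TRUSTED_HOSTS = {"example.org"}


def comment_link(url, text, trusted_hosts=TRUSTED_HOSTS):
    """Render a commenter's link; untrusted hosts get rel="nofollow"."""
    host = (urlparse(url).hostname or "").lower()
    rel = "" if host in trusted_hosts else ' rel="nofollow"'
    return '<a href="{}"{}>{}</a>'.format(escape(url, quote=True), rel, escape(text))


print(comment_link("http://example.org/blog", "A regular"))
# <a href="http://example.org/blog">A regular</a>
print(comment_link("http://pills.example.net/", "Cheap pills"))
# <a href="http://pills.example.net/" rel="nofollow">Cheap pills</a>
```

The default is the cautious one: anything not explicitly trusted is marked, so a legitimate commenter keeps the benefit of a live link only once the blog owner has vouched for the site.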

Robert Scoble has a different reason for being excited about rel="nofollow"; he wants to be able to link to sleazeballs in a blog entry without catapulting them straight to the top of search results. I like his idea (my PageRank is not as high as his but still fairly good).

rel="nofollow" is a neat new tool, but it's still just a tool. What matters is the application.