Inevitable

Tue, 12 Apr 2005

Well, I knew it was going to happen eventually. The Trackback spammers have finally found my weblog. They left fifteen little spams over the last three days, and I found it rather ironic that my last entry about fighting spam got more spam-pings than any other.

This minor deluge of spam has at last prodded me into action; as I promised in that very entry, I have renamed my comment and trackback scripts. That should slow down the spammers, if only temporarily.

Google and comment spam

Tue, 1 Feb 2005

This weekend I noticed that Googlebot had started indexing the URLs of my newly renamed comment script, even though it wasn't actually crawling the pages. Since comment and trackback spammers need only the URL, not the content of the page, keeping the crawler out was not enough for me. I wrote to Google to find out what could be done.

From: Laura Melton
To: Googlebot
Subject: Googlebot is not following my robots.txt
Date: Sat, 29 Jan 2005 21:03:39 -0800

Hi,

My robots.txt file (http://www.niceperson.org/robots.txt) disallows
bots from indexing my /cgi-imps/ directory (what I've renamed
/cgi-bin/).  Yet Google has indexed scripts in my /cgi-imps/ directory.
Why is this?

Thanks,

Laura Melton

My effort was not wholly in vain. Yesterday I received this reply:

From: Googlebot
To: Laura Melton
Subject: Re: Googlebot is not following my robots.txt
Date: Mon, 31 Jan 2005 14:11:42 -0800

Hi Laura,

Thank you for your note. Although your robots.txt file prevents our robots
from crawling your pages, it will not prevent our robots from adding a
link to your page without crawling it. We reviewed the link you mention:
www.niceperson.org/cgi-imps/mt/mt-tb.cgi/45. Our robots found this page
because another page linked to it. Our robots added the link to our index
without actually visiting or crawling the page. This is why the page does
not have a detailed title or description.

Although a robots.txt file usually prevents pages from appearing in our
search results, the only fool-proof ways to keep them out of our index are
to make sure that no sites link to them, password protect them, or remove
the robots.txt file and use a NOINDEX meta tag instead.

In this case, we suggest that you remove the links to this page or use
NOINDEX meta tags. To obtain a list of the links that point to a page,
perform a Google search on the URL. From the search results page, select
the 'Find web pages that contain the term' link, and Google will provide
you with webpages that mention that address. For more information on meta
tags, please visit http://www.google.com/remove.html#exclude_pages

You can also remove this page by submitting your robots.txt file for
immediate review at http://services.google.com:8882/urlconsole/controller.
However, please note that this will only temporarily remove this page from
our search results. If sites continue to link to this page, it may be
included again in our search results.

Regards,
The Google Team

This is not what I wanted to hear, but it is informative. Contrary to my prior belief, the robots.txt file does not completely block Googlebot from directories and files: it stops the crawling, but not the listing of bare URLs. Moreover, as long as robots.txt keeps Googlebot away from a page, it never fetches that page and so never sees any meta tag on it. I'm not getting rid of my robots.txt because I need it for file types that don't have meta tags. For example, my robots.txt has been successful in blocking off my images directory. In any case, I can't give my Trackback script a meta tag either. Screw the meta tag approach.
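The crawl-versus-index distinction is easy to check locally. Here is a minimal sketch using Python's standard urllib.robotparser. The two rules in ROBOTS_TXT are an assumption about what a file blocking /cgi-imps/ looks like, not a copy of my real robots.txt, and may_crawl is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents (hypothetical; the real file is not reproduced here).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-imps/
"""


def may_crawl(url, agent="Googlebot"):
    """Return True if the rules above allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# The trackback script is off limits to a well-behaved crawler...
print(may_crawl("http://www.niceperson.org/cgi-imps/mt/mt-tb.cgi/45"))
# ...while an ordinary page is not.
print(may_crawl("http://www.niceperson.org/index.html"))
```

All this answers is whether a polite bot may fetch a URL. It says nothing about whether the bare URL can turn up in an index after some other page links to it, which is exactly the gap Google described.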

Removing all links to those scripts is sheer absurdity, of course, as is password protection. As for my rel="nofollow" attempt, Google says only that marked links will get no PageRank credit, not that Google won't index them. It was mere wishful thinking on my part.

This only proves what Phil and I said about Google's proposed solution: Google isn't interested in a solution for anyone but themselves. It's mind-bogglingly easy for spammers to find Movable Type comment or trackback scripts via Google (allinurl: "mt comments cgi"), so if Google really wanted to make the spammers' work harder, they'd make their bots a little less greedy (or give bloggers ways to slap them down).

So what can Movable Type bloggers do? I've submitted my robots.txt to Google's automated removal system, so in a week or so all links to my scripts should be gone, at least temporarily. As a more permanent workaround, I'm also going to rename my comment and trackback scripts (again) so that their names don't contain the words comment or trackback. Current ideas are annotate and see-also, but I haven't made up my mind. Whatever I name the scripts, the comment spammers will have to work much harder to find them.

Bayesian blues

Tue, 25 Jan 2005

My Bayesian filter seems to be going through a bad spell, so I'm going to try another tactic. The filter will still be running, but now comments won't be hidden completely if the filter thinks they're spam. Instead, links (and other HTML) will merely be stripped. Hopefully this will not be as frustrating to commenters.

At least, that's the theory. I'm going to try it out right now, get myself classified as spam, and see whether links get stripped.

Update: Woohoo! The test was successful. That was supposed to be a link to illegal-art.org.
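For the curious, the strip-instead-of-hide idea can be sketched in a few lines. This is only an illustration in Python using the standard html.parser module, not the code my weblog actually runs; strip_markup and render_comment are hypothetical names, and the looks_like_spam flag stands in for whatever the filter decides.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text of a fragment and silently drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html):
    """Return `html` with all tags (links included) removed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


def render_comment(body, looks_like_spam):
    """Show suspected spam as plain text; leave other comments untouched."""
    return strip_markup(body) if looks_like_spam else body


print(render_comment('See <a href="http://illegal-art.org/">illegal-art.org</a>!', True))
```

A comment that trips the filter still appears, but its link text survives without the link, which is what happened to my test comment.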

Informal poll: comments feed?

Fri, 21 Jan 2005

I'm considering creating a feed for comments but am not sure whether it is needed or desired, so I'm conducting an informal and thoroughly unscientific poll. Drop me an email or a comment if you have an opinion.

  • Would you use a comments feed?
  • Entries+comments together, or comments separately?
  • How about feeds for comments on individual entries?
  • Any other preferences or advice?

Keep in mind that I know little or nothing as yet about syndication formats. Creating a template that someone else hasn't already written will take some education and experimentation.

Update 23.01.2005 0:26: Comment feed created. Knock yourselves out.

Comment spam

Tue, 31 Aug 2004

I've been getting more spam recently (for the drug that begins with a V and the one that begins with a C), so I've closed comments for all posts older than 30 days. I regret that this shuts off legitimate comments too, but comments on old posts are even rarer than comments on new entries. I've actually been doing this for a while, except that the cutoff used to be 90 days. The last comments were on entries in July.
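The age cutoff itself is a one-line rule. A minimal sketch in Python, assuming each post carries a plain date; comments_open and max_age_days are names I made up for illustration, not anything Movable Type provides.

```python
from datetime import date, timedelta


def comments_open(posted, today, max_age_days=30):
    """Return True while a post is no more than `max_age_days` days old."""
    return today - posted <= timedelta(days=max_age_days)


# A post from 1 August is exactly 30 days old on 31 August: still open.
print(comments_open(date(2004, 8, 1), date(2004, 8, 31)))
# A post from 31 July is 31 days old: closed.
print(comments_open(date(2004, 7, 31), date(2004, 8, 31)))
# The same post under the old 90-day cutoff: open.
print(comments_open(date(2004, 7, 31), date(2004, 8, 31), max_age_days=90))
```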

I'm also going to reinstate Bayesian filtering. Actually, the Bayesian filter plugin has been working this whole time; I just haven't been using it to decide whether comments get displayed. Besides, the filter has been much more accurate of late. It always catches the drug names. (Don't anyone use them in a legitimate comment!!)

So, sorry if anyone's comment gets marked as spam; I'll retrain it as needed. Probably my comments will be filtered more than anyone else's.
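For anyone wondering what "Bayesian" and "retrain" mean here, this is a toy naive Bayes classifier in Python. It is a generic textbook sketch with add-one smoothing, not the plugin I use (whose internals I haven't described); TinyBayes and its methods are invented names, and the training sentences are made up.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy two-class naive Bayes over word counts (illustration only)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Add one labelled comment; retraining is just more calls to this."""
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def _log_score(self, toks, label):
        counts = self.words[label]
        total = sum(counts.values())
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        # Class prior, smoothed so an untrained class never gets log(0).
        score = math.log((self.docs[label] + 1) / (sum(self.docs.values()) + 2))
        for tok in toks:
            # Add-one smoothing; the extra +1 reserves mass for unseen words.
            score += math.log((counts[tok] + 1) / (total + vocab + 1))
        return score

    def is_spam(self, text):
        toks = self.tokens(text)
        return self._log_score(toks, "spam") > self._log_score(toks, "ham")


f = TinyBayes()
f.train("cheap pills buy pills now", "spam")
f.train("great post about robots and crawlers", "ham")
f.train("thanks for the comment feed", "ham")
print(f.is_spam("buy cheap pills"))
print(f.is_spam("thanks for the great post"))
```

Retraining after a mistake amounts to calling train again with the comment and its correct label, which shifts the word counts the next decision is based on.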