Google is the theme of the week in my Information Retrieval Systems class, so we've been reading some cool stuff about Google, especially this review of Google and a corresponding page about Google inconsistencies and bugs. It's really interesting to realize the limitations on Google's searching capabilities, especially Boolean operators, truncation, and wildcards, in light of the fact that it works. It's a fascinating proof of the power of indexing and ranking. It doesn't matter how many hits you get, as long as the best ones float to the top.
One of the other students in the class pointed us to an article comparing Google to a couple of other new search engines, especially Grokker. Grokker automatically generates hierarchical subject categories for search results and orders the categories by number of results; for example, see the results for Laurabelle.
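I don't know how Grokker actually derives its categories, but the ordering step is easy to picture. Here's a rough sketch that assumes each hit already carries a category label; the function name `order_categories` and the sample hits are made up for illustration and have nothing to do with Grokker's real internals.

```python
from collections import defaultdict


def order_categories(hits):
    """Group (title, category) pairs and order categories by hit count, largest first."""
    groups = defaultdict(list)
    for title, category in hits:
        groups[category].append(title)
    # Sort by descending count; break ties alphabetically so the order is stable.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Made-up hits and labels, for illustration only.
sample_hits = [
    ("post A", "Blog"),
    ("post B", "Blog"),
    ("blogroll C", "Blogroll"),
    ("post D", "Blog"),
    ("stats E", "Site Statistics"),
    ("blogroll F", "Blogroll"),
]

for category, titles in order_categories(sample_hits):
    print(category, len(titles))
```

The hard part, of course, is coming up with the labels in the first place, which this sketch skips entirely.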
Most of the categories make sense, and that kind of reminds me of the way that Bayesian spam filtering can lead to innovations in products like SpamAssassin (it turns out that the token FF0000, the hexadecimal RGB entity, is a very good predictor of spammishness). Even though the names of some of the categories are strange, they collocate hits that actually make sense together. For instance, Www.Niceperson.Org means statistics about my site on blo.gs and Blogtree, and Lady Crumpet's Armoire. Laurabelle's Blog groups librarianish blogrolls, due to the fact that Lady Crumpet directly precedes me in the alphabet. Now all Grokker needs to know is that Spawn of Dewey means Blogroll.
Unfortunately, one category that Grokker puts me in is Child Pr0n (a sub-category of Blog), and the two hits are for my blog entry about child pornography. (Note that I am being careful not to put those words in the anchor text, because I don't want to boost my relevance/importance rating even more.) I'm also #1 in Google for those keywords, and I've gotten more hits from them than from any other. This is disappointing, given that I'm supposed to be a nice person. On the other hand, I'm next to kuro5hin.org (#2), so maybe it's not so bad. Besides, it looks like people would have to work pretty hard to find real pornography with that query, so maybe they're looking for articles like mine, about pornography. Who else would click on a link whose excerpt mentions the Times Online?
homer jay says:
Information is information, plain and simple. You provide information on a subject, and a search engine leads people to it. Basically, you can't determine the intentions of someone typing "child p**n" into a search engine, and the search engine can't determine your intentions about having the subject matter on your site.
All in all, I think things are working correctly.
Laurabelle says:
I tend to disagree. No, at the moment search engines don't know exactly what you're looking for, but why shouldn't they? Isn't the purpose of information retrieval to deliver all relevant documents and no irrelevant ones? Yes, search engines are working properly for their current functionality, but is that any reason not to aim for more precision?
I think that in this case, Grokker's clustering technique is a fairly good one. Given the ambiguity of what people might be looking for, clustering will allow people to zoom in on what they want and not have to wade through all the rest.
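"All relevant documents and no irrelevant ones" is really two measurements: recall (how much of the relevant material you retrieved) and precision (how much of what you retrieved is relevant). Here's a minimal sketch of both, using made-up document IDs; the function name `precision_recall` is just for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved vs. relevant documents."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    # Guard against empty sets so we never divide by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A perfect search would score 1.0 on both. Clustering doesn't change either number for the full result list, but it lets you pick the cluster where precision for your particular meaning is highest.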
Phil Ringnalda says:
While I was building myself a highly targeted trivia-answering program that did some scraping of search engine results, I was surprised to find that Teoma has become really, really good. The clustering is okay, sometimes, but probably because they aren't being targeted as much by SEOs, they generally seem to have better results, especially for competitive searches, than Google. And for some sites (including what I was after), they can be both deeper and fresher.
It's one of those things that we used to teach and be taught, that's sort of fallen away in the last few years: never look at just one set of search results. If I wasn't so conditioned to searching by typing "google whatever something" in the address bar, I'd switch to Teoma first. In fact, I just switched the google bookmark keyword over to search Teoma, just to see how it works out. If it doesn't work, or I really meant "no, Google!" I can always try "nogoogle whatever".