Jamie's Blog

Lessons from a life of startups, coding, countryside, and kids

Clustering, Filtering and Aggregators: A Migraine Cure?

I’m writing this post while I wait for the ibuprofen to kick in and remove the chisel from the side of my head. I’ve tried lying down and doing nothing, but that only makes the pain worse. So now I’m going to try to occupy myself out of this migraine by writing this post.


Yesterday, I killed my aggregator account so that I’d have fewer distractions from my PhD. Hilary asked how I’d keep up with what’s going on in the world. The short answer is that I won’t. For the last 7 years I’ve made it my job to be the person who knows what’s going on: what libraries are available, what programming languages are being used, what technology is getting hot, what new applications are being created, and so on. That knowledge used to come from reading books, but over the last 5 years my main library has been the Internet, with blogs and feeds as my primary source. The knowledge has served me well (I think), but I’ve also realised that doing a PhD doesn’t require it the way my previous jobs did.

One of the other reasons was that it was taking too much time to flick through all the posts in the morning. And then more would appear throughout the day (and I just couldn’t leave that little red dot alone). The thing is, there isn’t more interesting news than there used to be; it’s just repeated more often. You can read a post on Bruce Schneier’s blog, find it echoed on BoingBoing and a few other miscellaneous blogs, and then, two days later, the same thing turns up on Slashdot. Repeating the same story with a few (value-less?) comments attached is a rather inefficient use of my bandwidth.

One of the last posts I read through Bloglines was Russell Beattie’s post about integrating a Bayesian learning scheme (last seen in your favourite spam filter) into a weblog aggregator to filter and prioritise posts. That might solve some of the problems (e.g. keeping BoingBoing’s posts on freedom and sci-fi, but ignoring the more esoteric ones) but, more often than not, the value of a blog post is not in the text but in the link.

So, in addition to a Bayesian filter, I would propose a graph-based clusterer to identify related posts and group them together. If you’re not interested in a story at all, it becomes much easier to skip the whole lot. A single authoritative source could then be identified within each cluster, either by counting incoming links (typical of these algorithms) or by applying a quality metric such as the user’s own ratings, Google PageRank or Technorati’s ranking.

The large aggregators like Bloglines, Technorati and Google are in a prime position to cluster the global view of weblog posts. On the other hand, clustering the whole universe would be very processing-intensive and would ultimately present the user with far more data than current aggregators do. A better solution is simply to cluster the feeds the user already subscribes to. Oh, and the user interface would need to evolve a little too: ideally a rich interface in which the user can explore posts laid out in a 2D space, visualise the links, and see the clusters overlaid with their centres highlighted.
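To make the clustering half of that concrete, here’s a minimal sketch in Python. Everything in it is hypothetical: the post records and URLs are made up, “cites the same URL, or links to the other post” stands in for a proper graph clusterer, and the authority measure is a bare incoming-link count where a real system might use PageRank, Technorati’s rank or the user’s ratings.

```python
from collections import defaultdict

# Hypothetical post records: the ids, URLs and outbound links are all
# invented here; a real aggregator would extract them from its feeds.
posts = {
    "schneier/1":   {"url": "http://schneier.example/1",
                     "links": {"http://story.example/scoop"}},
    "boingboing/1": {"url": "http://boingboing.example/1",
                     "links": {"http://story.example/scoop",
                               "http://schneier.example/1"}},
    "slashdot/1":   {"url": "http://slashdot.example/1",
                     "links": {"http://schneier.example/1"}},
    "potato/1":     {"url": "http://potato.example/1",
                     "links": {"http://elsewhere.example/other"}},
}

def cluster(posts):
    """Group posts into connected components: two posts share a cluster
    if they cite the same URL, or if one links to the other."""
    parent = {pid: pid for pid in posts}      # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    mentions = defaultdict(list)              # URL -> posts citing it
    for pid, post in posts.items():
        mentions[post["url"]].append(pid)     # a post "cites" itself
        for url in post["links"]:
            mentions[url].append(pid)
    for pids in mentions.values():
        for other in pids[1:]:
            union(pids[0], other)

    groups = defaultdict(list)
    for pid in posts:
        groups[find(pid)].append(pid)
    return list(groups.values())

def authority(members, posts):
    """The crudest possible cluster centre: the member with the most
    incoming links from the other members. Swap in PageRank, a
    Technorati rank or the user's own ratings here."""
    incoming = {pid: 0 for pid in members}
    urls = {posts[pid]["url"]: pid for pid in members}
    for pid in members:
        for url in posts[pid]["links"]:
            if url in urls and urls[url] != pid:
                incoming[urls[url]] += 1
    return max(members, key=incoming.get)

for members in cluster(posts):
    print(authority(members, posts), "<-", sorted(members))
```

On this made-up data the three posts echoing the same story collapse into one cluster with the original source picked out on top, which is exactly the “skip the whole lot” behaviour I’m after.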

Clustering or filtering won’t actually reduce the volume of posts from every source, though. There are a few rare and valuable blogs with wonderful original content (NeeNaw, Random Acts of Reality), a few others which don’t necessarily conform to the general flocking of the weblog crowds (Planet Potato), and a few more which I’d read because of my relationship with the author more than the actual topic (Mike, Hilary).

Sorry if this post is too long and incoherent: now I bet you wish you’d had a Bayesian filter to discard it!

P.S. The chisel is still stuck in my right temple and tickling the back of my eyeball. It seems blogging is not a suitable cure for migraines. Perhaps I should try some coding, or writing a paper. Or just gobble some more ibuprofen.