Tuesday, January 14, 2003

Maciej Ceglowski is working on a semantic search engine for blogs; recently he's been ploughing through a mass of Movable Type archives, and he's come up against the problem of linkrot:

... the search itself works ... but what doesn't work is the links. By far the majority of weblog posts are short one-liners with a link in them. The next category after that is the tossed salad variety format — a paragraph full of loosely connected ideas built around pointers to interesting sites. Of course this is the whole point — we're supposed to be making a reasonable stab at hypertext — but it turns out the links are terribly brittle.

Reading these grizzled posts is like looking through an old scrapbook, where the writing is clear but the pictures have all bleached to white. The further back you go in the past, the fewer working links you can find. "Permalinks" to other bloggers get broken as people change ISPs, domain names, or software. Links to novelty sites and flavors of the month dry up; links to bubble-era dot coms have gone down with the ship. "Permanent" links to news sites get retired to a polite 404 every time the software changes.

The irony here is that most of this content still exists. More things get moved around than disappear, and much of what is really gone still lives on in the Internet Archive.... The sad part is that these old sites and old posts aren't old by any meaningful standard. The oldest blog entry I've looked at dates from 1998, and the blogger who wrote it is still in his twenties....

We're so caught up in keeping track of who is linking to what just at the moment that we've neglected to think about what is going to remain of the "blogosphere" ten years from now. *Two* years from now, for many sites.

(Via Brad DeLong.)
