miscoranda: by Sean B. Palmer

Link in a Soupstack

The problem with getting links from HTML is that the HTML you find lying about on the web is often quite broken, with broken being here defined as "that which Python's sgmllib can't parse". I wrote a little script called getlinks.py that extracts all of the links from an HTML file, but had to rewrite it almost immediately to take care of a page which had a stray comma after an attribute value in a <meta> element. I would've thought that sgmllib could cope with that, but it couldn't, so I had to write a regular expression screen scraper instead. It's a pretty good screen scraper, though: it even properly ignores comments and CDATA sections.
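
To give a rough idea of the regular expression approach, here is a minimal sketch in Python. It is not the actual getlinks.py: the names get_links, COMMENT_OR_CDATA, TAG and LINK_ATTR are made up for illustration, and the choice to collect only href and src attributes is an assumption. It strips comments and CDATA sections first, then looks for link attributes inside whatever tags remain.

```python
import re

# Hypothetical sketch, not the real getlinks.py.
# Comments and CDATA sections are removed before any tags are examined.
COMMENT_OR_CDATA = re.compile(r'<!--.*?-->|<!\[CDATA\[.*?\]\]>', re.DOTALL)

# Anything that looks like a start tag, however sloppy its contents.
TAG = re.compile(r'<[a-zA-Z][^>]*>')

# href/src values: double-quoted, single-quoted, or unquoted.
# The lookbehind stops attributes such as data-href from matching.
LINK_ATTR = re.compile(
    r'''(?<![\w-])(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)


def get_links(html):
    """Return href/src values in document order, skipping comments and CDATA."""
    text = COMMENT_OR_CDATA.sub('', html)
    links = []
    for tag in TAG.finditer(text):
        for match in LINK_ATTR.finditer(tag.group(0)):
            # Exactly one of the three groups is set for each match.
            links.append(next(g for g in match.groups() if g is not None))
    return links
```

Because the tag pattern only cares about angle brackets, a stray comma after an attribute value is simply carried along and ignored. The sketch does still trip over a literal ">" inside an attribute value, which a fuller scraper would need to handle.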

John Cowan's TagSoup software is meant to do something like the above. It processes the HTML input, no matter how bad it is, and provides a series of regular SAX events. Its one problem is that it's written in Java, and JC seems to be at the not-quite-but-almost-soliciting-a-port stage. I'm thinking about writing something similar in Python, using a variant of the approach that getlinks.py takes. My version is in its very early stages at the moment, but it would be nice to compare it with TagSoup; and if it doesn't compare favourably, I may even just port TagSoup as it is, presuming that Terje doesn't beat me to it in Perl.

I wrote getlinks.py some time ago, but haven't published it until now: it's been lurking in my development folder, waiting for me to make it public. Quite a few other files are lurking in that development area, though I've been going through it quite a bit today. One of the problems is that I like to be sure that I'm happy with the URI I'm publishing to, so that it'll be appropriately cool in the TimBLian sense. But I don't like hierarchical filesystems, so one of the things that I've been developing is a meta-database that lets me add arbitrary metadata to all of my published files. One of the properties is "keywords", which means I can sort my files using a kind of virtual folder setup.

The benefits that it's brought about already, such as being able to automatically generate sitemaps etc., are enough to make me think that it's a valuable system, but it's got a way to go yet. I've even written a little shell interface to it, so that I can augment the metadata properties of the file in a fairly transparent manner. All of the actual meta-database content is just RFC 822-style headers in regular files, anyway, so it's all recoverable in the normal way.
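
As an illustration of how little machinery RFC 822-style metadata needs, here is a hedged sketch using the header parser in Python's standard email package. The header names (Title, Keywords), the comma-separated keyword syntax, and the function names are all assumptions; the post does not describe the real layout.

```python
from collections import defaultdict
from email.parser import HeaderParser


def read_metadata(text):
    """Parse RFC 822-style headers into a dict of lowercased name -> value."""
    message = HeaderParser().parsestr(text)
    return {name.lower(): value for name, value in message.items()}


def keywords_of(meta):
    """Split an assumed comma-separated 'keywords' header into a list."""
    return [k.strip() for k in meta.get('keywords', '').split(',') if k.strip()]


def virtual_folders(records):
    """Map each keyword to the sorted list of file paths that carry it.

    `records` maps a file path to the text of its metadata record.
    """
    folders = defaultdict(list)
    for path, text in records.items():
        for keyword in keywords_of(read_metadata(text)):
            folders[keyword].append(path)
    return {keyword: sorted(paths) for keyword, paths in folders.items()}
```

A file tagged with two keywords shows up in two virtual folders at once, which is exactly what a hierarchical filesystem makes awkward.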

I know that I already wrote about this a little in a previous entry, but the benefits have become steadily more obvious and have been building quite a bit since then. For example, I was able to assign <priority> values to my Google Sitemap based partly on which tags I'd assigned to a document. I suppose it's somewhat folksonomical, though folksonomies are quite arbitrary, community-based, and useless, whereas with this system each keyword that I assign has a specific effect around the site: from altering the sitemap priority to appearing in some index to having highlighting added, and so on.
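
The keyword-to-priority idea might look something like the sketch below. The keyword names and weights are invented for illustration, since the post does not give the real ones. The only fixed points are from the Sitemap protocol itself: <priority> runs from 0.0 to 1.0 and defaults to 0.5.

```python
from xml.sax.saxutils import escape

# Hypothetical keyword weights; the real values are not given in the post.
KEYWORD_BOOST = {'featured': 0.3, 'software': 0.2, 'draft': -0.3}
DEFAULT_PRIORITY = 0.5  # the Sitemap protocol's default for <priority>


def priority_for(keywords):
    """Add up the boosts for a document's keywords, clamped to 0.0..1.0."""
    score = DEFAULT_PRIORITY + sum(KEYWORD_BOOST.get(k, 0.0) for k in keywords)
    return round(min(1.0, max(0.0, score)), 1)


def url_entry(loc, keywords):
    """Build one <url> element, escaping the location for XML."""
    return '<url><loc>%s</loc><priority>%.1f</priority></url>' % (
        escape(loc),
        priority_for(keywords),
    )
```

Keywords without an entry in the table leave the priority alone, so only the tags that are meant to have an effect on the sitemap do.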

by Sean B. Palmer, at 2005-06-04 22:22:18.
