I'm not very good at choosing URIs for documents on my domains, which is to say that I'm not very good at choosing URIs which I won't later change my mind about. This is because websites and workflows are fluid and the organisation of a website is an inherent part of the data: I update a document's location just as I would update its content. But because I also believe that Cool URIs Don't Change, this creates the classic URI design problem.
Even though the web has been around for well over ten years, we're still in the very early stages of learning about URI design. Starting in, apparently, 1997 or 1998, the W3C started using one of the now most popular kinds of URI design amongst the thinking population of the web, that of datespaces. The idea is that since you shouldn't move URIs, the only information that you should put in the URIs is stuff which doesn't change. And what doesn't change about a document or thing published on the Web? The date on which it was first published, of course. So paths like /YYYY/shortname are quite common at the W3C.
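As a minimal sketch of the datespace idea, the path can be derived once from the first publication date and never recomputed. The function name and the example shortname are hypothetical, and this is not a description of how the W3C actually mints its URIs.

```python
from datetime import date


def datespace_path(first_published: date, shortname: str) -> str:
    """Build a /YYYY/shortname path from the date of first publication."""
    # Only the year of first publication goes into the path, because that
    # is the one fact about the document that later edits cannot change.
    return f"/{first_published.year:04d}/{shortname}"


# Hypothetical document first published in 1998.
print(datespace_path(date(1998, 5, 17), "cool-uris"))
```

Revising the document in a later year would leave the path alone, since the date of a later edit never enters into it.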
But even the W3C isn't consistent in this practice, and the system has its detractors, especially amongst people who can't remember dates very well. The arguments for and against the system get more complicated, but again the main point to note is that there's no great resolution. It's still more art than science at the moment.
I bought a new domain a couple of years ago and set up a site thinking that after having studied URIs and site design for a few intense years I was ready to make things work without messing up too much. Now, two years later, I'm wondering if there's any system which even comes close to working for URI design. When I first set up the website, I identified several different schemes whereby people coin new URIs:
- a) date-stamped hierarchies, a la the W3C;
- b) a proliferation of randomly named directories, named as appropriately as possible;
- c) a Borgesque Emperor's Animals classification;
- d) a Thesaurus or Library catalogue based hierarchy;
- e) the Unix FHS.
I'd previously been doing some a) and b), but had decided by then that b) with some e) was a better idea. I should've known that establishing URIs based on ideas from the FHS was not a good idea; the FHS is quite horrifically designed and out of date, and the calls for its reformation have gone so far as to actually spur people to create deviant Linux distributions that no longer use it, such as GoboLinux. The reasons why GoboLinux uses its own non-FHS hierarchy have been explained fairly exhaustively by its creator, and make a fair overview of some of the ways in which the FHS doesn't pass muster.
The problem with approach b), that of short and snappy names, is that it's exceedingly difficult to come up with a name that is "good enough", let alone perfect, when it comes to URI design. TimBL's original Cool URIs article linked above mentions quite a few of the problems involved with this, but the problems extend to some highly specialised ones which will be different for each particular site that's being designed. For example, when I wanted to set up a weblog on my new site, I wanted to avoid /weblog/ and /blog/ and any variations thereupon, and instead opt for something very neutral. I couldn't think of any abbreviation of the weblog's name that would work, so eventually I plumped for /notes/, except that it still didn't really describe the weblog properly (the notes changed from notes to full-blown essays), and moreover I had already been using that directory for something else, so I had to make it dual use, which was pretty confusing and effectively quashed the old use I had for it.
So what, you may ask, about redirecting things if you're going to move them? The problem with that is that you're stealing paths from yourself, and you're also having to maintain the redirects which, if you've only got .htaccess server configuration files to play with, can get to be very inefficient if you have a huge number of redirects. Apache will read the .htaccess file for every single request that it gets. Sometimes I get around this problem by writing a CGI that has a bunch of redirects coded into it, but that's only really possible if I've moved an entire directory. Moreover, I expose as much of the internal workings of my site as possible, so the CGI script would be visible, which is fine by me but when I move something I want to retire the old URIs and make sure that people are using the new URIs. Otherwise I wouldn't've moved it in the first place. So I generally get around to robots.txt filtering out the old directory (again, if it's a directory I'm moving), which makes me worry that search engines will screw up those pages' rankings.
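The "bunch of redirects coded into a CGI" approach might look roughly like the sketch below. The table contents are hypothetical, and the function only works out the new location; a real script would still have to emit a 301 status and a Location header, which is left out here.

```python
from typing import Optional

# Hypothetical moves: old directory prefix -> new directory prefix.
MOVED = {
    "/notes/": "/essays/",
    "/shaks/bio/": "/shaksbio/",
}


def redirect_target(path: str) -> Optional[str]:
    """Return the new location for a moved path, or None if it never moved."""
    # Try the longest prefixes first, so that a move of a nested directory
    # takes precedence over a move of one of its parents.
    for old in sorted(MOVED, key=len, reverse=True):
        if path.startswith(old):
            return MOVED[old] + path[len(old):]
    return None


print(redirect_target("/shaks/bio/early"))
print(redirect_target("/about"))
```

Because the table maps whole prefixes, it only helps when an entire directory has moved, which is the limitation described above.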
Another way around this would be to have a completely different server configuration not using the filesystem for its backend, perhaps using a database filesystem, and perhaps even built on top of Apache. But the problem with experimental systems like that is that they are, well, experimental. If you're using a database you have to worry about your data getting corrupt, and migrating to other database systems in future, and so on. It makes your data less easy to access than if it's just floating around in your filesystem. I'm not sure what the perfect experimental filesystem for serving files via HTTP would be, but I've often thought about it; and I think that something that did revision control internally is a must, as well as perhaps identifying each file by just its hash, then enabling you to link that with various paths using a simple (but huge) path to hash to file mapping. On top of this I've imagined many URI systems, such as only allowing /[A-Za-z0-9]+ URIs (no more "/" segments!), and then if there are duplicate filenames, you just provide a disambiguation page instead of the actual file you're looking for. But already you can see the flaw in the system—namely that it's perplexing, hard to manage, and doesn't help people find what they're looking for. It does almost entirely remove the URI design question, but at a high cost.
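To make the hash idea concrete, here is a toy in-memory sketch, assuming SHA-256 as the hash and a plain dictionary as the "simple (but huge)" mapping. The class and method names are invented for illustration, and it leaves out revision control and persistence altogether.

```python
import hashlib
import re

# Flat names only: no "/" segments, as in the /[A-Za-z0-9]+ scheme.
NAME = re.compile(r"[A-Za-z0-9]+")


class HashStore:
    """Toy store: files keyed by hash, flat names mapped onto hashes."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> file contents
        self.names = {}  # flat name -> list of digests filed under it

    def put(self, name, data):
        """File data under name and return its digest."""
        if not NAME.fullmatch(name):
            raise ValueError("names must match [A-Za-z0-9]+")
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data
        digests = self.names.setdefault(name, [])
        if digest not in digests:
            digests.append(digest)
        return digest

    def lookup(self, name):
        """Return every digest filed under name.

        One digest means the file can be served directly; several mean
        that a disambiguation page is needed instead.
        """
        return list(self.names.get(name, []))

    def read(self, digest):
        """Return the contents stored under a digest."""
        return self.blobs[digest]
```

Even in this small form the drawback is visible: a name that collects a second file stops pointing at anything in particular.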
So if URI design is expensive, it's because it has benefits too. Shorter and clearer URIs are easier to memorise, and so your pages become easier to recall in the future, and easier to refer to other people. Moreover, when you use short and clear URIs, people can even start to form impressions about how your site is structured from the URIs alone—that's what I do with the best of sites, at any rate—and many will know that you're a good designer. The ramifications can be very subtle, but cumulatively URI design is a very important thing. First impressions count: recent research (via BBC News) has shown that people evaluate the quality of new web pages in under 50 milliseconds.
I somewhat envy people who are able to design their websites according to some scheme and then stick to it; especially people who have very large base URIs for their sites, such as people who are using academic accounts or have homepages on other people's servers. One of the first sites to grab my attention in this way was that of Sampo Syreeni. Other people to have achieved zen-like levels of site design quality include Ian Hickson, and, of course, Dan Connolly. Some of my friends are also rather uncannily good at it: Aaron Swartz especially seems to invest not all that much time in URI design but due to enormous amounts of experience is really good at it and rarely if ever changes a URI. But most of all, Morbus Iff manages to excel himself when it comes to categorisation.
If nothing else, Morbus is a heavily repressed and frustrated librarian who, having never actually been a librarian as far as I know, works off his needs on his huge collections of movies, books, files, games, comics, magazines, and so on. He's a collector, and so he has to categorise by necessity; and he's the kind of guy who has to get things done right without being a perfectionist about it. He's a kind of pragmatic perfectionist. The best place to observe his tendencies is probably his lists directory. Note for a start how each of the entries in even the directory has its own neat little label. Note also how the right hand side of each of the labels currently (2006-01) makes a kind of wave motion. With any other person I would say that this is probably random chance; with Morbus it is almost certainly deliberate, and even if it's not it's very indicative of the level of attention to detail that he invests into such things.
His directory consists of albums, bookmarks, ebooks, videos, and quests. There are thousands of entries in each category, sometimes managed by hand, and sometimes partially automated. If you look around the site, you'll find scripts for doing many of the kinds of tasks that are needed to produce these kinds of lists and keep things categorised correctly.
But even Morbus doesn't seem entirely sure about URI design: for example, I would bet that instead of www.disobey.com he would prefer to use just disobey.com now. Instead of putting his weblog at /dnn/ where it's been for several years, he's now moved it to the / page, the front page. He previously eschewed datestamped directories (like I have done on and off), but now he's using them, though not all that often. The list goes on and on: and each of these things, I can be sure, has a very distinct set of reasons behind them because this is Morbus and Morbus Thinks About These Things; but all the same, URI design is what it is. It's impossible to get right all of the time.
I chose Morbus as an example because he's as avid about categorisation and URI design as me, but one criticism that may be levelled is that it's not worth fussing over, and that off-the-cuff ideas are often the best. Whilst it's true that off-the-cuff designing can often be the best approach, I think it's unfair to say that URI design isn't valuable, and furthermore I think it's unfair to say that it isn't interesting in its own right. It's an art and a science, and it's only slowly becoming more science than art, but all the same it's a distinct and flourishing hobby-out-of-necessity in some circles. It's when we think of it as just a thing that we have to do that error can creep in, and that's when links break. It might not be a particularly interesting hobby, but there are plenty of hobbies that I don't find interesting and yet still recognise them as hobbies. It's time to start recognising URI design as a hobby, indeed as a discipline, in and of itself.
In a sense, it already has been recognised. For to design a URI is to decide upon a classification scheme for a published resource, and the history and the art and the science of classification is long and involved. For as long as there have been books, people have been wondering how to order them on their shelves. By size? By title? By date? By colour? By topic? By author? The system that's most popular in libraries is, of course, large-by-topic and small-by-author. The Dewey Decimal system, developed by Melvil Dewey in 1876, is one of the best known of the by-topic classification systems, but you don't have to look far to find other obvious ones. Dewey classified works; Roget classified words. John Wilkins even proposed an analytical language, a language whose words were ordered according to a grand classification scheme, later essayed upon so lucidly and humorously by Jorge Luis Borges. Wilkins was a bit of a dreamer, or, as Borges put it, he was one who "abounded in happy curiosities: theology, cryptography, music, the fabrication of transparent beehives, the course of an invisible planet, the possibility of a trip to the moon, the possibility and principles of a world language". It's not surprising that we should find the construction of a world language at the tail of the list since it has been proven again and again (and this is the whole point of Borges's essay) that there is no such thing as a universal classification scheme. You can't even come close. All you can do is to create local classification schemes and hope they'll be suitable enough for some particular use that they've been put to. One of the reasons that I admire the sites of Sampo, Ian, and Dan, for example, is that they've successfully created such a scheme and employed it and stuck to it.
Wilkins was constructing a language; we need merely to construct a website. But for both languages and websites, there is one thing in common: they are generally creative endeavours. The things that words describe are fixed, but words themselves can take any forms. The files on a website generally have already been written before the URIs are chosen, but any URI can be chosen for them. Moreover, the relationships between words, and the relationships between files, are flexible, and variant across time and context and many other things.
It's the flexibility of the associations that is one of the biggest pains of URI design. For example, I like to arrange my works so that they're clustered. In other words, I like to make sure that I don't have directories with thousands and thousands of files in them; it makes things harder to find. Nor do I like directories that only have one or two files in them; it isolates them and makes the URIs unnecessarily long. I mainly prefer large numbers of files to small numbers because at least the URIs are shorter, but that's a point for me to elucidate in a moment. My problem with wanting to cluster files is that I don't know how much I'm going to write about a particular subject in future. So I might start writing about Shakespeare, and I make a file called /notes/shaks in which I write about him. Then I decide that the page is becoming too long, and it's worth having a short URI for all the things, so I make a /shaks/ directory and start putting files in there. Then I decide that I'm interested in his life and time, so I create a /shaks/bio/ directory to hold more files. Then I decide that I'm only interested in his early history, so I create /shaks/bio/early/ and have to move early.html into separate files in that directory. Then I decide that all of that could do with a shorter URI and move it all to /shaksbio/. A half-real and half-contrived example, but you can see how the process just goes on and on.
So why cluster at all? If it's easier to have one big directory containing several thousands of files, why not do that? After all, the URIs would be short and there would be no specific disadvantage, right? In Ye Olde Daies, some filesystems couldn't even handle over a few thousand small files in a single directory, but hopefully things have moved on a bit from there. The biggest problem now is that some things just naturally need to be grouped. Sometimes it's an absolute requirement, such as when I'm distributing some code and I need a directory to make the project.tar.gz file from. Sure I could make a manifest file, i.e. a list of all the files that will go in the distribution tarball, but that's a pain, and it's difficult to maintain.
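For what it's worth, building the tarball from a manifest rather than from a directory could be sketched like this. The manifest format is an assumption (one relative filename per line, with blank lines and "#" comments skipped), as is the function name.

```python
import tarfile
from pathlib import Path
from typing import List


def build_tarball(manifest: Path, output: Path) -> List[str]:
    """Pack the files listed in a manifest into a gzipped tarball."""
    names = []
    for line in manifest.read_text().splitlines():
        entry = line.strip()
        # Assumed format: one filename per line, "#" starts a comment.
        if entry and not entry.startswith("#"):
            names.append(entry)
    with tarfile.open(output, "w:gz") as tar:
        for name in names:
            # Filenames are taken relative to the manifest's own directory.
            tar.add(manifest.parent / name, arcname=name)
    return names
```

The code is short enough; the maintenance burden is in keeping the manifest itself in step with the files, which is the pain mentioned above.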
To dip back into the realm of experimentation and optimal solutions, perhaps if it were possible to mount a manifest file as a virtual directory, the one-big-directory approach would not be so bad. But then why not go the other way? Viz, having lots of directories on the filesystem but making Apache recursively check through all these directories when a single filename is requested? (My problem with the latter has mainly been that it's then difficult to spot duplicates; and structure is still important.)
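The duplicate-spotting problem with the recursive approach can at least be checked offline. The following is a rough sketch in plain Python rather than anything Apache does itself; the function names are hypothetical.

```python
import os
from collections import defaultdict


def index_by_filename(root):
    """Map each bare filename under root to every path that carries it."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            index[filename].append(os.path.join(dirpath, filename))
    return index


def duplicates(index):
    """Return only the filenames that appear in more than one directory."""
    return {name: paths for name, paths in index.items() if len(paths) > 1}
```

Any filename that `duplicates` reports would be ambiguous if requested by name alone, so it would need renaming or a disambiguation step before the flat lookup could work.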
Note that when the flexible and ever-changing associations between things aren't present, it's a lot easier to come up with a stable hierarchy. This is obvious. For example, taxonomies in biology: once the scientific classification of organisms was discovered, it wasn't long before the system was reasonably concrete. But even the classification of organisms has had and continues to have problems. When Linnaeus started his classification of organisms it was so as to better identify; it wasn't until Darwin that we realised that the hierarchies were founded on the principle of common descent. And the hierarchies can still be really complex: for example, it's not known exactly how many species of citrus fruits there are. It's not even known roughly. It seems that Walter T. Swingle, a lumper, says there may be as few as 16 species; Tyozaburo Tanaka, a splitter, says possibly as many as 145. That's a pretty major discrepancy!
Recently, I've found that the biggest thing that can help in URI design is not rushing the process. This means that a lot of my URI design is conducted well in advance of publication, and I test out the URI design for a long time before, by using a temporary directory prefix in front of the projected path that I want to use. So if I come up with a path such as /hello/ that I want to use for a project, I'll put it under /temp/hello/ (say) until such a time as I feel it's ready to be moved to /hello/ itself. During the time that it's in /temp/, though, I won't be able to publicly publish the URI and this is quite a significant drawback. And even this system is far from foolproof; it just ensures that I don't make silly quick mistakes. It also makes me worry a lot about URI design, and fret over the paths to choose; I have many directories that are waiting to be published where the only thing that now needs deciding is where they should be published.
I've even been keeping a text file about each particular URI design issue, and there are several sections in it. It's interesting to see the extent to which the design of the URIs is really the design of the site, so on that front it's excellent to document it in that one place. Designing a site by its URIs is like designing a bookstore by the titles of the books that it's going to sell: kinda fun! And, actually, useful. Quite a few of the issues that I've put in this URI design file have gone on to be resolved because I've carefully documented them therein and been able to refer to all of my thoughts on the subject over time and integrate them together and find the best solution. But one big irony is that I haven't published this URI design file yet because—surprise, surprise—I can't come up with a decent URI for it yet.
Note that even the URIs for this weblog, miscoranda, don't fulfill all of my requirements for a good URI. In brief the requirements are:
- Applicability. Don't go choosing a name that has absolutely nothing to do with the resource in question, even if it fulfills all of the other criteria wonderfully. This limits the amount of choice that one has for a URI to a very high degree, and is perhaps the most important requirement.
- Memorability and uniqueness. It's a fact that people see URIs very often, and they need to be able to transcribe them and sometimes recall them from memory alone. Using a memorable and possibly unique name will aid this process.
- Persistence. You don't want to have to move it, ever, so you have to leave information that's going to change out. But not so much that it's no longer memorable, of course.
- Brevity. A short URI looks better, is easier to type, is easier to tell to other people, fits better in emails, and exudes good design sense.
- Aestheticness and palpability. In other words, a URI shouldn't be silly unless the document it's referring to is silly. It shouldn't look confusing and have lots of odd characters in it or jar the eye. It should have substance to it though; extremely short names are sometimes almost as bad as extremely long ones.
These requirements often go against one another: it must be brief but palpable, memorable but persistent, applicable but aesthetically pleasing. There just aren't enough synonyms in the English language sometimes to be able to find a word for your document that you haven't already used and that looks good and is reasonably unique and so on... English, even though it's basically two or three languages smushed together (and more), doesn't have enough capacity to allow good URI design. And this is not to mention the fact that if you're using a more limited language you have even more of a problem.
As I was saying, a good case study is miscoranda, which is using URIs that are just integer based, such that each post has a number and each time I make a new post the number increases by one. This means that the URIs are very short—the path for this post will be /159—but what does /159 mean to anyone? Even I generally have no idea which posts were at which number, and I certainly don't expect my readers to. On another one of my weblogs, I have been using shortnames instead, i.e. brief and normally unique keywords contrived on the spot based upon the post's title, and though I thought I'd have a problem generating them and that I'd run afoul of my usual URI design issues, I've actually been fairly happy with them. I haven't had to move a single one so far. But that's a weblog, and a weblog is a relatively controlled environment of sequential posts; whereas a website can encompass any number of things and projects.
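The two schemes can be compared side by side. Integer paths need nothing but a counter; shortnames need a keyword and a uniqueness check. The sketch below is one automated approximation of the latter, with an assumed stopword list and a numeric suffix for collisions. Shortnames contrived by hand, as described above, will usually be better chosen than anything this produces.

```python
import re

# Assumed list of filler words to drop from titles.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "to", "for"}


def shortname(title, taken, max_words=2):
    """Derive a brief keyword from a post title, unique within taken."""
    words = [w for w in re.findall(r"[a-z0-9]+", title.lower())
             if w not in STOPWORDS]
    base = "-".join(words[:max_words]) or "post"
    candidate, n = base, 2
    # Append a counter only when the plain keyword is already in use.
    while candidate in taken:
        candidate = f"{base}-{n}"
        n += 1
    taken.add(candidate)
    return candidate


taken = set()
print(shortname("On the Design of URIs", taken))  # design-uris
print(shortname("On the Design of URIs", taken))  # design-uris-2
```

The `taken` set stands in for whatever record the weblog keeps of paths already published.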
So, to conclude, URI design deserves a lot more credit as a discipline, even as a hobby, than it currently gets, and it's building on top of centuries of research with hierarchies and taxonomies and other classification schemes, but also has many new problems of its own. It's also a very unique thing, meaning that there is not just URI design in general but there are also many URI designs, a bit like snow and snowflakes. Each time you do something, you have to design a URI anew with a new set of principles, and though there are some requirements (as listed) that hold true for pretty much all URIs, each new project brings its own particular constraints and opportunities. This means that only experience can really help, and so time is an important part of the equation. Experimental solutions might gradually come into effect more and more in the future as one fundamental controller of URI design, that of the shape of our filesystems, changes over the years; but this is a slow process and in the meantime we need to make URIs that we'll be happy with both now and when any future advances occur.
This isn't talked about anywhere near enough, so if anybody has specific ideas about URI design they should feel free to reply to this entry, or to talk about it on the www-talk mailing list, or wherever it's most appropriate.