URL | sgmlguru.org

Found the below in my Drafts folder, unearthed after I imported my old blog to the WordPress instance on my own server. While it was written six years ago, I thought it was still worth publishing after I read it. I hope you think so too.

Two years after writing this (and having long since forgotten that I did), I presented the concepts behind URNs and the need for uniqueness in document management at XML Finland. The system was finished and done, and I was proud of it. It wasn’t perfect but it was battle-tested and we knew about its weaknesses. I really wanted to talk about it with other markup people, colleagues who knew about angled brackets, and I was sure they’d understand. In fact, I feared some might say they implemented it all years ago, only better. Yet, what is described here also happened at XML Finland; the importance of uniqueness and the advantages of semantic naming using URNs went right past them, judging by the Q&A afterwards.

Or maybe it’s just that I’m wrong.

Anyway, here goes…

===

I’ve been busy finalising an authoring system that is supposed to identify every resource ever stored in it with URNs. What follows is just a rant, but I do think about it and would like to know the whys and the hows. I would like to know why the concept of uniqueness is so difficult to understand.

A URN, of course, is the unique name of a document, as opposed to its location, the URL. Compare with a book in a library. Sometimes books get reorganised in a library, meaning that they will be put on another shelf (another address), but the name will remain the same. The name is unique while the address is not. When identifying content to be reused, this is the principle you need to honour.

Anyway…

It’s been my primary concern all along to ensure that everything is identified with a URN. Everything. If you create a document and link to another, meaning to insert that other document in the one you’re editing, the link should take the form URN#id once the document has been checked into the database, where the hash separates the name of the target document from the node being pointed to within it. When the document is checked out and open in the XML editor, however, the form should be URL#id, since URLs are what most authoring systems can handle; we need the URL to style the document in the editor, to publish it, and to process it in various ways.

Keeping the URN in a checked-out document is possible, of course, but it needs to be replaced with a URL one way or another before processing, so the decision was to use a URL while a resource is checked out and replace it with a URN when it is checked in.
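To make the check-out and check-in swap concrete, here is a minimal sketch in Python. It is not the code from the actual system. The lookup table, the `urn:x-example:...` name, the `file:///checkout/...` address, the `include` element and the `href` attribute are all assumptions made up for illustration; in a real system the lookup would live in the database.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry (assumption): one made-up URN and the made-up
# address it has while checked out.
URN_TO_URL = {
    "urn:x-example:doc:0001:en:1": "file:///checkout/urn_x-example_doc_0001_en_1.xml",
}
URL_TO_URN = {url: urn for urn, url in URN_TO_URL.items()}


def swap_target(link: str, table: dict[str, str]) -> str:
    """Replace the part before '#' via the table; keep the fragment id as-is."""
    name, sep, fragment = link.partition("#")
    # Links that are not in the table are left unchanged.
    return table.get(name, name) + sep + fragment


def rewrite_links(xml_text: str, table: dict[str, str], attr: str = "href") -> str:
    """Run every `attr` attribute in the document through swap_target."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if attr in elem.attrib:
            elem.set(attr, swap_target(elem.attrib[attr], table))
    return ET.tostring(root, encoding="unicode")


def check_out(xml_text: str) -> str:
    """URN#id -> URL#id, for the editor and for processing."""
    return rewrite_links(xml_text, URN_TO_URL)


def check_in(xml_text: str) -> str:
    """URL#id -> URN#id, for storage in the database."""
    return rewrite_links(xml_text, URL_TO_URN)


stored = '<doc><include href="urn:x-example:doc:0001:en:1#sec2"/></doc>'
editing = check_out(stored)
restored = check_in(editing)
```

The fragment after the hash is never touched, so the same id works whether the target is referred to by name or by address. Only the part before the hash changes between the two states.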

Early on, we did make a demo application that opened a document containing URNs pointing to other documents, replaced them with the corresponding URLs, normalised the resulting document, and published it using XSL and FOP. It worked like a charm.

Today, I found that the check-in does not replace the URLs with URNs. The file name is a pseudo-URN (with colons replaced by underscores) so I know my URN scheme is being used, but that’s as far as it goes. The URN-like file names remain.

Talking to a developer, I realised that he hadn’t even thought about it. He was using URNs to identify the resources in the database (the URN being an attribute on the object) but in spite of all our planning, all of our tests, the URLs were left in the links when the document containing them had been checked in. The object IDs in the database are unique, he said, but yes (he admitted), the file names are being used in the database so we can’t store two identically named files in the same folder in the database.

This is not a major problem since we already have the code to do all the work, but what surprises me is that nobody made the connection. Me, I assumed everyone had understood but did not check. I simply assumed that following the test, following the discussions, following the months of development, no-one could fail to understand the true meaning of the URNs.

Wrong.

What is it that makes the concept of URNs so difficult?

I found a link to an article by Taylor Cowan about persistent URLs on the web. It was mostly about what happens to metadata assertions (such as RDF statements) when links break, but there was a little something on persistent links and URNs, too. A comparison with Amazon.com and how books are referenced these days was made. A way to map the ISBN number as a URN was described (URN:ISBN:0-395-36341-1 was mapped to a location by the PURL service, in this case at http://purl.org/urn/isbn/0-395-36341-1), which is quite cool and, in my opinion, both manageable and practical.
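The mapping from name to address in that example is mechanical enough to sketch in a few lines of Python. The URL pattern below is inferred from the single example quoted above and is an assumption, not the PURL service's documented interface; the function name is made up, and the sketch does not check whether the resulting address actually resolves.

```python
def isbn_urn_to_purl(urn: str, base: str = "http://purl.org/urn/isbn/") -> str:
    """Map an ISBN URN to an address by appending the ISBN to a base URL.

    The base URL follows the one example in the article (assumption).
    """
    # A URN has the shape urn:<namespace>:<namespace-specific string>.
    # Fewer than three parts makes the unpacking raise ValueError.
    scheme, namespace, isbn = urn.split(":", 2)
    if scheme.lower() != "urn" or namespace.lower() != "isbn":
        raise ValueError(f"not an ISBN URN: {urn!r}")
    return base + isbn


address = isbn_urn_to_purl("URN:ISBN:0-395-36341-1")
```

The name stays fixed while `base` can change: if the resolver moves, only the second argument is different, and every reference that stored the URN is still valid.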

The author thought otherwise, however: “But on the practical web, we don’t use PURLs or URNs for books, we use the Amazon.com url. I think in practical terms things are going to be represented on the web by the domain that has the best collection with the best open content.”

Now, what’s wrong with this? At first, it may seem reasonable that Amazon.com, indeed the domain with the (probably) largest collection of book titles, authors, and so on, should be used. Books are their business and they depend on offering as many titles as possible. In the everyday world, if you want to find a book, you look it up at Amazon.com. I do it and you do it, and the author does it. So what’s wrong with it?

Well, Amazon.com does not provide persistent content per se; they provide a commercial service funded by whatever books they sell. At any time, they may decide to change the availability of a title, relocate its page, offer a later version of the same title, or even some other title altogether. The latter is unlikely, of course, but the point is that we are talking about URLs, addresses, rather than URNs, names. Talking about the URL when discussing what is essentially a name is about as relevant as talking about the worn bookshelf in my study when discussing the Chicago Manual of Style.

Yes, I realise that my example is a bit extreme, and I realise that it’s easy enough to make the necessary assertions in RDF to properly reference something described by the address rather than the address itself, but to me, this highlights several key issues:

An address, by its very nature, is not persistent. Therefore, a “permanent URL” is to me a bit of an oxymoron. It’s a contradiction in terms.
Even if we accept a “permanent URL approach”, should we accept that the addresses are provided and controlled by a commercial entity? One of the reasons why some of us advocate XML so vigorously is that it is open and owned by no-one. Yes, I know perfectly well that we always rely on commercial vendors for everything from editors to databases, but my point here is that we still own our data; the commercial vendors don’t. I can take my data elsewhere.
Now, of course, in the world of metadata it’s sensible to give a “see-also” link (indeed that is what Mr Cowan suggests), but the problem is that the “see-also” link is another URL with the same implicit problems as the primary URL.
URLs have a hard time addressing (yes, the pun is mostly intentional) the problem with versioning a document. How many times have you looked up a book at Amazon.com and found either the wrong version or a list of several versions, some of which even list the wrong book?

Of course, I’m as guilty as anyone because I do that, too. I point to exciting new books using a link to Amazon.com (actually I order my books from The Book Depository, mostly) because it’s convenient. But if we discuss the principle rather than what we all do, it’s (in my opinion) wrong to suggest that the practice is the best way to solve a problem that stems from addressing rather than naming. It’s not a solution, it merely highlights the problem.


The Uniqueness of Things

Permanent URLs, Addresses and Names