Category Archives: XML

The Uniqueness of Things

Found the below in my Drafts folder, unearthed after I imported my old blog to the WordPress instance on my own server. While it was written six years ago, I thought it was still worth publishing after I read it. I hope you think so too.

Two years after writing this (and having long since forgotten that I did), I presented the concepts behind URNs and the need for uniqueness in document management at XML Finland. The system was finished and done, and I was proud of it. It wasn’t perfect but it was battle-tested and we knew about its weaknesses. I really wanted to talk about it with other markup people, colleagues who knew about angle brackets, and I was sure they’d understand. In fact, I feared some might say they had implemented it all years ago, only better. Yet, what is described here also happened at XML Finland; the importance of uniqueness and the advantages of semantic naming using URNs went right past them, judging by the Q&A afterwards.

Or maybe it’s just that I’m wrong.

Anyway, here goes…

===

I’ve been busy finalising an authoring system that is supposed to identify every resource ever stored in it with URNs. What follows is just a rant, but I do think about it and would like to know the why’s and the how’s. I would like to know why the concept of uniqueness is so difficult to understand.

A URN, of course, is the unique name of a document, as opposed to its location, the URL. Compare with a book in a library. Sometimes books get reorganised in a library, meaning that they will be put on another shelf (another address), but the name will remain the same. The name is unique while the address is not. When identifying content to be reused, this is the principle you need to honour.

Anyway…

It’s been my primary concern all along to ensure that everything is identified with a URN. Everything. If you create a document and link to another, meaning to insert that other document in the one you’re editing, the link should take the form URN#id when the document is checked into the database, where the hash separates the name of the target document from the ID of a node within it. When the document is checked out and open in the XML editor, however, the form should be URL#id, since URLs are what most authoring systems can handle; we need the URL to style the document in the editor, to publish it, and to process it in various ways.

Keeping the URN in a checked-out document is possible, of course, but it needs to be replaced with a URL when processing, one way or another, so the decision was to use a URL when a resource has been checked out and replace it with a URN when checked in.
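To make the swap concrete, here is a minimal sketch in Python. It is not the code from our system; the lookup table, the urn:x-example names, the file URL and the assumption that links live in an href attribute are all made up for illustration, and a real implementation would use an XML parser rather than a regular expression.

```python
import re

# Hypothetical lookup table; a real system would ask the database.
URN_TO_URL = {
    "urn:x-example:doc:0001": "file:///checkout/urn_x-example_doc_0001.xml",
}
URL_TO_URN = {url: urn for urn, url in URN_TO_URL.items()}

# Matches href="target" or href="target#id" (the attribute name is an assumption).
HREF = re.compile(r'href="([^"#]*)(#[^"]*)?"')


def swap_links(xml_text, table):
    """Replace the part of each link before the hash, keeping the #id as it is."""
    def repl(match):
        target = match.group(1)
        fragment = match.group(2) or ""
        # Unknown targets are left untouched.
        return 'href="%s%s"' % (table.get(target, target), fragment)
    return HREF.sub(repl, xml_text)


def check_out(xml_text):
    """URN#id becomes URL#id for the editor and the publishing tools."""
    return swap_links(xml_text, URN_TO_URL)


def check_in(xml_text):
    """URL#id becomes URN#id again before the document is stored."""
    return swap_links(xml_text, URL_TO_URN)
```

The point of the sketch is only that the two operations are mirror images of each other: whatever check_out does to a link, check_in has to undo.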

Early on, we did make a demo application that opened a document containing URNs pointing to other documents, replaced them with the corresponding URLs, normalised the resulting document, and published it using XSL and FOP. It worked like a charm.

Today, I found that the check-in does not replace the URLs with URNs. The file name is a pseudo-URN (with colons replaced by underscores) so I know my URN scheme is being used, but that’s as far as it goes. The URN-like file names remain.
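For illustration, the pseudo-URN file naming could look something like the Python sketch below. The function names and the urn:x-example namespace are hypothetical. Note that the reverse mapping is only a guess, since an underscore in the file name may or may not have been a colon in the URN, which is one more reason not to let file names stand in for the real names.

```python
def urn_to_filename(urn, extension=".xml"):
    """Turn a URN into a file-system-safe pseudo-URN file name."""
    return urn.replace(":", "_") + extension


def filename_to_urn_guess(filename, extension=".xml"):
    """Best-effort reverse mapping; wrong if the URN itself contained underscores."""
    stem = filename[: -len(extension)] if filename.endswith(extension) else filename
    return stem.replace("_", ":")
```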

Talking to a developer, I realised that he hadn’t even thought about it. He was using URNs to identify the resources in the database (the URN being an attribute on the object) but in spite of all our planning, all of our tests, the URLs were left in the links when the document containing them had been checked in. The object IDs in the database are unique, he said, but yes (he admitted), the file names are being used in the database so we can’t store two identically named files in the same folder in the database.

This is not a major problem since we already have the code to do all the work, but what surprises me is that nobody made the connection. Me, I assumed everyone had understood but did not check. I simply assumed that following the test, following the discussions, following the months of development, no-one could fail to understand their true meaning.

Wrong.

What is it that makes the concept of URNs so difficult?

Peer Reviews

I’ve been peer reviewing for an XML conference, lately, and I just have to say that this markup thing doesn’t seem to be a passing fad.

Seriously, after 15+ years in the field, it still amazes me how useful it can be. Markup practitioners are a creative bunch, and more often than not, peer reviewing is a very humbling experience. There’s so much I want to (need to) learn more about, so many technologies to try, and so little time.

I should probably post this and go back to experimenting with XQuery.

Me and XML in Stockholm

I’ll be talking about XML in Stockholm on June 16th. The event is a one-day tutorial for technical writers, managers and other interested parties, organised by Dokumentinfo. They organise tutorials on various subjects related to document management and archiving, and a yearly conference where I was invited to speak last year.

So far I have few details but I’m pretty sure I’ll manage to include XLink, somehow.

An Even-Simpler Markup Language?

In his blog, Norman Walsh writes about an even-simpler-than-Micro-XML markup language, inspired in part by John Cowan’s XML Prague poster and by James Clark’s Micro XML ideas. His ideas are well worth serious consideration–Norm’s ideas are always worth considering–but the purist in me cringes at the idea of allowing more than one root element. I have to say that I find the idea attractive but I’m not really big on change so maybe that is why I hesitate.

The pragmatist in me, on the other hand, also cringes at Norm’s not doing away with namespaces when he has the chance. In my experience they always create more problems than they solve, but on the other hand, my experience tends to be more about strictly controlled environments where the issues one usually wishes to solve using namespaces can be dealt with using other means.

XProc

I’m going to spend the next week or two doing a test implementation of XProc for our document management system, Cassis TI. XProc, as some of you will know, is a pipeline processing language for XML processing, in the same vein as pipe processing in the *nix world. It’s intended to standardise and ease XML processing by treating the processing as a black box consisting of smaller black boxes; in other words, what is inside is less interesting than how the in- and outputs are defined and used.
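The black-box idea can be sketched in a few lines of Python. This is not XProc syntax and has nothing to do with Cassis TI; the step names are invented stand-ins, and the only thing the sketch shows is that each step is defined by what goes in and what comes out.

```python
import re
from functools import reduce


def pipeline(*steps):
    """Chain steps so that the output of one step is the input of the next."""
    return lambda document: reduce(lambda data, step: step(data), steps, document)


# Hypothetical stand-in steps; each one takes XML text in and hands XML text on.
def strip_comments(xml_text):
    return re.sub(r"<!--.*?-->", "", xml_text, flags=re.DOTALL)


def normalise_space(xml_text):
    return " ".join(xml_text.split())


# The pipeline itself is just another black box with one input and one output.
publish = pipeline(strip_comments, normalise_space)
```

Swapping one step for another does not affect the rest, as long as the inputs and outputs stay the same.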

The test is about producing PDF output so it’s nothing fancy or new, but it’s important because I believe we can replace our current backend with an XProc-based processor, making things easier, faster and better for programmers and users alike.

List Modelling

I’ve been reading up on DITA. I’ve looked at the specs and the DTD before, obviously, but more from the perspective of an innocent bystander. The DTDs I implement in authoring systems and elsewhere are usually my own, and whenever I need to deliver content in some other format, I simply convert to it. This time things are a bit different, however, as we are considering doing a “DITA Edition” of the content management system I’m responsible for at work, and I need to know how DITA can fit into our stuff.

DITA’s got lots of things that I like, such as the combining of topic IDs with target IDs in references to avoid ID collisions. The DITA way is a very elegant solution and probably a better one than what I would usually do, which is to (in various ways in the DTD and in the authoring environment) make sure that authors can never end up in situations like that to begin with. There’s other stuff, too, but that is best left to another blog entry at some point.
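A small Python sketch of why the combination helps: a DITA reference to an element inside a topic takes the form file#topicid/elementid, so the element ID only has to be unique within its topic. The function name, file name and IDs below are made up for the example.

```python
def split_dita_href(href):
    """Split 'file.dita#topicid/elementid' into (path, topic ID, element ID)."""
    path, _, fragment = href.partition("#")
    topic_id, _, element_id = fragment.partition("/")
    # Missing parts are reported as None rather than as empty strings.
    return path, topic_id or None, element_id or None


# The same element ID in two different topics gives two distinct targets.
targets = {
    split_dita_href("groceries.dita#fruit/intro"),
    split_dita_href("groceries.dita#vegetables/intro"),
}
```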

Here, I want to talk about list modelling and specifically something that not only DITA but so many other DTDs and schemas seem to ignore, and that, in my mind, results in bad markup. Let’s start by discussing list semantics:

A list is, well, a list of things. There are several types of lists, of which unordered and ordered are the most common, and the semantics are probably clear enough: the former lists stuff without a specific order (say, grocery lists) and the latter items whose order is significant (for example, David Letterman’s top ten lists). There’s also the definition list (which, in my mind, is not a list at all but a special case of a table, namely a two-column one), and probably some other types as well. In DITA, you can find something called “simple list”, which claims to limit what’s listed to one line per item, tops, without bullets or numbers, but to me that’s less about semantics and more about presentation.

So here’s a typical DITA list (HTML, DocBook and quite a few others look exactly like it, too):

<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>

There’s more to list semantics, though, at least in my mind. If you wanted to find a complete list in a document, you’d probably want to include its qualifying introduction (“Here’s the groceries you need to buy:”), and any and all information that goes between list items without being part of them but still belonging to the list as a whole. If your spouse is kind enough to subcategorise the grocery list into vegetables, fruit, dairy products and so on (I know I need the help), we’d have a multi-part list where the participating lists are part of a larger whole.

The introductory paragraph is where it gets tricky in DITA and similar structures. There are a LOT of block-level elements to choose from, but you cannot easily do a list that meets these requirements. This one, the preferred DITA way (at least if we choose to believe the examples in the spec), lacks a wrapper that identifies the list as one unit instead of a loose paragraph that happens to be followed by a list:

<p>The fruit we need for tonight:</p>
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>
<p>And the vegetables for tomorrow:</p>
<ul>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>

Of course, one could argue that our grocery list is really a section, but I would argue that the introductory paragraph is actually part of the list, but not necessarily a part of the whole section. What if I wanted to include images or perhaps a note to that section? Semantically, I can think of dozens of ways to reasonably expand the structure of such a surrounding section and still keep it on topic (that is, limiting it to subject matters concerning that central grocery list).

Keeping with DITA’s topic-based approach, we could certainly use a number of such sections and wrap the whole thing in a topic, but me, I think that’s overkill. All I want to do is include an introductory paragraph.

This, of course, is where some will argue that the introductory paragraph is really a heading. Definition lists in DITA and some other DTDs actually do have a heading for this very purpose, which to me hints that somebody did touch the subject at hand at some point, but then why do the “ordinary” lists lack that heading? And of course, me, I think that introduction is not a heading at all, only a qualifier for the list.

Another option in DITA and others is to use the <p> element as a wrapper:

<p>The fruit we need for tonight:
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>
And the vegetables for tomorrow:
<ul>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>
</p>

This is perfectly valid, of course, but it ruins the intent of the <p> element and creates a very odd (and ugly) mixed content that would be difficult to process properly.
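To show what the processing difficulty looks like in practice, here is a sketch using Python’s standard xml.etree.ElementTree on the example above. The two introductions end up in two entirely different places: the first is the text of the paragraph, while the second is the tail of the first list.

```python
import xml.etree.ElementTree as ET

MIXED = """<p>The fruit we need for tonight:
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>
And the vegetables for tomorrow:
<ul>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>
</p>"""

p = ET.fromstring(MIXED)
first_list, second_list = p.findall("ul")

# The first introduction is the text of <p> before its first child element...
first_intro = p.text.strip()
# ...but the second one hangs off the end of the first <ul> as its tail.
second_intro = first_list.tail.strip()
```

Nothing in the markup itself says that a piece of loose text belongs to the list that follows it; the processing code simply has to know.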

What I would like to see is more along the lines of this:

<ul>
<p>The fruit we need for tonight:</p>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
<p>And the vegetables for tomorrow:</p>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>

Now we have a single list (our grocery list) that includes the necessary introduction(s). Of course, it’s still somewhat ugly; I, for one, dislike the relative lack of list item structure–I’d much rather see an item modelled more properly, perhaps divided into paragraphs and other block-level content, where the concepts block level and inline remain properly separated.
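With the single wrapper, picking the list apart becomes straightforward. A sketch, again with Python’s standard xml.etree.ElementTree; the function name and the choice to represent each part as an (introduction, items) pair are mine, and the structure is the one I wish for above, not anything DITA allows.

```python
import xml.etree.ElementTree as ET

PROPOSED = """<ul>
<p>The fruit we need for tonight:</p>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
<p>And the vegetables for tomorrow:</p>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>"""


def group_list(ul):
    """Return (introduction, items) pairs for one <ul> in the proposed model."""
    groups = []
    for child in ul:
        if child.tag == "p":
            # Each introduction starts a new part of the list.
            groups.append((child.text, []))
        elif child.tag == "li":
            if not groups:
                # Items before any introduction go into an unnamed part.
                groups.append((None, []))
            groups[-1][1].append(child.text)
    return groups


groups = group_list(ET.fromstring(PROPOSED))
```

Everything that belongs to the grocery list is inside the one element, so there is no guessing about which paragraph goes with which list.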

Developing SGML DTDs: From Text To Model To Markup

Quite by accident, I discovered that Eve Maler and Jeanne El Andaloussi’s Developing SGML DTDs: From Text To Model To Markup is available online. I’m one of the people lucky enough to own a hard copy, but if you aren’t as fortunate, read it at http://www.xmlgrrl.com/publications/DSDTD/. It’s one of the best books ever written about information analysis, that (far too) little used skill required to write a good DTD. In my ever-so humble opinion, the book should be mandatory for anyone involved in a markup-related project of any kind, that’s how good it is.

(Yes, I know it was written before XML came out, 12 years ago, but XML is SGML, really, and the book remains as useful today as it was when it came out in 1995.)

elementNames and attributeNames

I keep getting annoyed by the (Java-inspired) naming of elements and attributes in some people’s XML, where the names contain capital letters to help keep the names clear. I’m sure you’ve seen how it works: elementName, attributeName, myNewAndExcitingElement, ohLookICanCreateReallyLongQNamesForNoApparentReason, ad nauseam.

Why do they do this? I know there is some kind of rationalisation for it in the world of programming languages, but in XML? XML is not a programming language and I still think it should be understandable and usable by humans (I know; SGML was supposed to be human-readable but XML doesn’t have that requirement). If you find yourself writing XML in a text editor (still happens to me), not only are these names enough to drive anyone nuts but they also make the XML more error-prone because you’re bound to spl something wrong. And if you write your XML in an XML editor, the element names filling the start and end tag symbols take up a lot of space that should be left to the actual content. (And no, I don’t believe in the minimal tag symbols that some editors provide; I want to actually see the tag names and I want to see the attribute names. They help me structure my document; in fact, they are there for that purpose!)

I ask again: why? If you are writing a schema and need to name an ordinary paragraph element, surely you don’t need to name it ordinaryParagraph or even paragraph? In my schemas, p is more than enough.

SoPleaseUseShorterNamesWithoutResortingToSillyConventionsBorrowedFromElsewhere.