Category Archives: XML

List Modelling

I’ve been reading up on DITA. I’ve looked at the specs and the DTD before, obviously, but more from the perspective of an innocent bystander. The DTDs I implement in authoring systems and elsewhere are usually my own, and whenever I need to deliver content in some other format, I simply convert to it. This time things are a bit different, however, as we are considering doing a “DITA Edition” of the content management system I’m responsible for at work, and I need to know how DITA can fit into our stuff.

DITA’s got lots of things that I like, such as the combining of topic IDs with target IDs in references to avoid ID collisions. The DITA way is a very elegant solution and probably a better one than what I would usually do, which is to make sure (in various ways in the DTD and in the authoring environment) that authors can never end up in such situations to begin with. There’s other stuff, too, but that is best left to another blog entry at some point.
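To make the ID combination concrete: a DITA reference to an element inside a topic takes the form file#topicid/elementid, so the element ID only has to be unique within its topic. The following is a minimal Python sketch of splitting such a reference; the function name, the DitaTarget type and the file name in the usage note are made up for illustration and are not part of any DITA toolkit.

```python
from typing import NamedTuple, Optional


class DitaTarget(NamedTuple):
    """The three parts of a DITA-style reference (hypothetical helper type)."""
    document: str
    topic_id: Optional[str]
    element_id: Optional[str]


def parse_dita_href(href: str) -> DitaTarget:
    """Split 'file#topicid/elementid' into its parts.

    Missing parts come back as None, so 'file' and 'file#topicid'
    are accepted as well.
    """
    document, _, fragment = href.partition("#")
    topic_id, _, element_id = fragment.partition("/")
    return DitaTarget(document, topic_id or None, element_id or None)
```

With this, parse_dita_href("groceries.dita#shopping/fruit") yields the document, the topic ID "shopping" and the element ID "fruit" as separate values, which is what lets two topics reuse the same element ID without colliding.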

Here, I want to talk about list modelling and specifically something that not only DITA but so many other DTDs and schemas seem to ignore, and that, in my mind, results in bad markup. Let’s start by discussing list semantics first:

A list is, well, a list of things. There are several types of lists, of which unordered and ordered are the most common, and the semantics are probably clear enough: the former lists stuff without a specific order (say, grocery lists) and the latter items whose order is significant (for example, David Letterman’s top ten lists). There’s also the definition list (which, in my mind, is not a list at all but a special case of a table, namely a two-column one), and probably some other types as well. In DITA, you can find something called “simple list”, which claims to limit what’s listed to one line per item, tops, without bullets or numbers, but to me that’s less about semantics and more about presentation.

So here’s a typical DITA list (HTML, DocBook and quite a few others look exactly like it, too):

<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>

There’s more to list semantics, though, at least in my mind. If you wanted to find a complete list in a document, you’d probably want to include its qualifying introduction (“Here’s the groceries you need to buy:”), and any and all information that goes between list items without being part of them but still belonging to the list as a whole. If your spouse is kind enough to subcategorise the grocery list into vegetables, fruit, dairy products and so on (I know I need the help), we’d have a multi-part list where the participating lists are part of a larger whole.

The introductory paragraph is where it gets tricky in DITA and similar structures. There are a LOT of block-level elements to choose from, but you cannot easily do a list that meets these requirements. This one, the preferred DITA way (at least if we choose to believe the examples in the spec), lacks a wrapper that identifies the list as one unit instead of a loose paragraph that happens to be followed by a list:

The fruit we need for tonight:
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>
And the vegetables for tomorrow:
<ul>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>

Of course, one could argue that our grocery list is really a section, but I would argue that the introductory paragraph is actually part of the list, but not necessarily a part of the whole section. What if I wanted to include images or perhaps a note to that section? Semantically, I can think of dozens of ways to reasonably expand the structure of such a surrounding section and still keep it on topic (that is, limiting it to subject matters concerning that central grocery list).

Keeping with DITA’s topic-based approach, we could certainly use a number of such sections and wrap the whole thing in a topic, but me, I think that’s overkill. All I want to do is include an introductory paragraph.

This, of course, is where some will argue that the introductory paragraph is really a heading. Definition lists in DITA and some other DTDs actually do have a heading for this very purpose, which to me hints that somebody did touch the subject at hand at some point, but then why do the “ordinary” lists lack that heading? And of course, me, I think that introduction is not a heading at all, only a qualifier for the list.

Another option in DITA and others is to use the <p> element as a wrapper:

<p>The fruit we need for tonight:
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
</ul>
And the vegetables for tomorrow:
<ul>
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>
</p>

This is perfectly valid, of course, but it ruins the intent of the <p> element and creates a very odd (and ugly) mixed content that would be difficult to process properly.

What I would like to see is more along the lines of this:

<ul>
The fruit we need for tonight:
<li>Apples</li>
<li>Oranges</li>
<li>Bananas</li>
And the vegetables for tomorrow:
<li>Cucumbers</li>
<li>Tomatoes</li>
</ul>

Now we have a single list (our grocery list) that includes the necessary introduction(s). Of course, it’s still somewhat ugly; I, for one, dislike the relative lack of list item structure–I’d much rather see an item modelled more properly, perhaps divided into paragraphs and other block-level content, where the concepts block level and inline remain properly separated.
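As a rough illustration of what such a more structured list could buy us in processing, here is a small Python sketch. The element names (list, intro, item, p) are purely hypothetical and do not come from DITA or any other DTD; the point is only that once the introductions live inside the list, a processor can treat the whole thing as one unit and still tell the parts apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one list, with introductions as children of the list
# and block-level content (p) inside every intro and item.
GROCERIES = """
<list type="unordered">
  <intro><p>The fruit we need for tonight:</p></intro>
  <item><p>Apples</p></item>
  <item><p>Oranges</p></item>
  <item><p>Bananas</p></item>
  <intro><p>And the vegetables for tomorrow:</p></intro>
  <item><p>Cucumbers</p></item>
  <item><p>Tomatoes</p></item>
</list>
"""


def list_parts(list_element):
    """Group the items of one list under the introduction that precedes them."""
    parts = []
    for child in list_element:
        text = "".join(child.itertext()).strip()
        if child.tag == "intro":
            parts.append((text, []))
        elif child.tag == "item":
            if not parts:
                # Items that come before any introduction get an empty one.
                parts.append((None, []))
            parts[-1][1].append(text)
    return parts


parts = list_parts(ET.fromstring(GROCERIES))
```

Here parts holds two entries, one per introduction, each with its own items, and all of it was found by looking at a single list element.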

Developing SGML DTDs: From Text To Model To Markup

Quite by accident, I discovered that Eve Maler and Jeanne El Andaloussi’s Developing SGML DTDs: From Text To Model To Markup is available online. I’m one of the people lucky enough to own a hard copy, but if you aren’t as fortunate, read it at http://www.xmlgrrl.com/publications/DSDTD/. It’s one of the best books ever written about information analysis, that (far too) little used skill required to write a good DTD. In my ever-so humble opinion, the book should be mandatory for anyone involved in a markup-related project of any kind, that’s how good it is.

(Yes, I know it was written before XML came out, 12 years ago, but XML is SGML, really, and the book remains as useful today as it was when it came out in 1995.)

elementNames and attributeNames

I keep getting annoyed by the (Java-inspired) naming of elements and attributes in some people’s XML, where the names contain capital letters to help keep the names clear. I’m sure you’ve seen how it works: elementName, attributeName, myNewAndExcitingElement, ohLookICanCreateReallyLongQNamesForNoApparentReason, ad nauseam.

Why do they do this? I know there is some kind of rationalisation for it in the world of programming languages, but in XML? XML is not a programming language and I still think it should be understandable and usable by humans (I know; SGML was supposed to be human-readable but XML doesn’t have that requirement). If you find yourself writing XML in a text editor (still happens to me), not only are these names enough to drive anyone nuts but they also make the XML more error-prone because you’re bound to spl something wrong. And if you write your XML in an XML editor, the element names filling the start and end tag symbols take up a lot of space that should be left to the actual content. (And no, I don’t believe in the minimal tag symbols that some editors provide; I want to actually see the tag names and I want to see the attribute names. They help me structure my document; in fact, they are there for that purpose!)

I ask again: why? If you are writing a schema and need to name an ordinary paragraph element, surely you don’t need to name it ordinaryParagraph or even paragraph? In my schemas, p is more than enough.

SoPleaseUseShorterNamesWithoutResortingToSillyConventionsBorrowedFromElsewhere.

Words in Boxes

This is the day for reading other people’s blogs. Dave Pawson’s XProc tutorial indirectly pointed me to James Sulak’s blog, Words in Boxes. A lot of it is about XML-related stuff but I also found gems such as his rant on grammar.

An XProc Tutorial

Dave Pawson has written an XProc Tutorial, with contributions from James Fuller, James Sulak and Norman Walsh. If you need to do step-by-step XML processing in your application and haven’t yet heard of XProc, follow that link, now.

XLink FTW, Part 2

Reading through my yesterday’s blog entry on XLink, I feel there are things I need to clarify. In no particular order, here goes…

I think that a schema of some kind (speaking in the general sense and thus including anything from DTDs to XSDs) is always necessary for XML to work well. I know, that sets me apart from quite a few of the young whippersnappers in XML today, but I never considered well-formedness an advantage, even when it was marketed as such. But then, I’m a dochead, to borrow Ken Holman’s terminology.
Namespaces frequently make life difficult, especially for those of us who feel that DTDs are superior to XSDs. Those pesky namespace attributes often keep popping up when processing XML, resulting in bug reports from desperate technical writers. And they all know my mobile number, it seems. However, namespaces really are a must these days, regardless of your refusal to import foreign namespaces to your XML, because most of XML’s really useful sister recommendations depend on them.
I do consider data typing in XML to be largely unnecessary, if we stick to using XML for documentation and publishing. Only rarely have I felt the need to include data typing in a schema, and in most of those instances I have been proven wrong by more sensible colleagues (or come to my senses on my own, resulting in the quick removal of unnecessary data types in my XSD).
It is a pain to implement XLink in an XSD, but largely for reasons that have nothing to do with XLink as such and everything to do with problems with namespaces (such as the problem hinted at above). Plus, of course, the fact that different XML editors still seem to implement different parts of the XML Schema recommendation in differing ways (or not at all).
DTDs, on the other hand, work like a charm with XLink attributes added, provided that your tools follow the XML spec. I have experienced problems with MSXML and its derivatives, which proves my point.

Thank you for reading.

Was XLink A Mistake?

This morning, I read Robin Berjon’s little something on XML Bad Practices, originally a whitepaper he presented at XML Prague 2009. I was there, presenting right after he did, and I remember that I nervously listened to his presentation while preparing my own (not my finest hour but that’s a story for another blog entry), wanting to address some of his points. While a lot of what he said made good sense, some didn’t then and certainly don’t now.

In Reusing the Useless, Robin discusses XLink, a recommendation that remains my personal favourite among W3C’s plethora of recommendations. Apparently it’s no-one else’s, at least if Robin is to be believed. “Core XML specification produced by the W3C such as XSLT or XML Schema don’t use it even though they have linking elements,” he says, adding that very few have implemented anything but the rudimentary parts of it. But I get ahead of myself; let’s see what Robin says. He starts out with this:

That feeling (and a general sense that reuse is good) leads people to want to reuse as many parts of the XML stack as possible when creating a new language. That is a good feeling, and certainly one that should be listened to carefully – there are indeed many good and useful technologies to reuse.

This, of course, makes a lot of sense. We are in the standardisation business so we don’t want to reinvent the wheel every time. Me, I’ve done so time and again, and the one W3C recommendation I have used again and again is… XLink. It provides me with a neat way of defining link semantics without enforcing a processing model, from very simple point-to-point relations to multi-ended link abstractions. Yes, I have used both; Simple XLinks are present in most of my DTDs requiring cross-referencing, images or indeed any point-to-point semantics, and Extended XLinks were a useful and necessary addition to the aftermarket document structures of a major car manufacturer, among other things.

But again, I get ahead of myself. Here’s what raised my eyebrows for the first time, this morning:

But that only works if everyone plays, and furthermore the cost of using XLink has to be taken into account. First, a whole new namespace is needed.

This is interesting, to say the least. I thought this was one of the main points of introducing namespaces in the first place, to avoid name collisions.

The basic idea behind namespaces is extremely simple: you use one DTD (well, maybe it’s a schema since DTDs aren’t namespace-aware; there’s a lot I would like to say on that topic, too, so either this is going to be a very long post or I need to start writing down my ideas for blog posts) but in your instances you need to include content created using other schemas. One solution is to only use unique names, but this is a pipe dream and in reality, there’s only so many names you can give, say, a paragraph (p, ptxt, para, …) or a cross-reference (ref, href, link, …), without resorting to silliness. Inevitably, your elements and attributes will have the same names as someone else’s, and that can be a huge pain. Namespaces are a neat way of getting around this problem, and as an added bonus you’ll eventually always get that question, “what does the namespace URL stand for?” from your audience when presenting your work.
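A small Python sketch of the collision case, using the standard library’s ElementTree. Both namespace URIs and the document itself are invented for the example; the only thing it is meant to show is that two elements with the same local name, p, stay distinguishable once each belongs to its own namespace.

```python
import xml.etree.ElementTree as ET

# Two different "p" elements in one instance; the URIs are made up.
DOC = """
<doc xmlns="http://example.com/ns/manual" xmlns:n="http://example.com/ns/notes">
  <p>A paragraph from my own schema.</p>
  <n:p>A paragraph imported from somebody else's schema.</n:p>
</doc>
"""

root = ET.fromstring(DOC)

# ElementTree spells a qualified name as "{namespace-uri}localname",
# so the two p elements never get mixed up.
own = root.findall("{http://example.com/ns/manual}p")
foreign = root.findall("{http://example.com/ns/notes}p")
```

Each of the two searches finds exactly one paragraph, even though both elements are called p in the markup.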

My point, and the simple question I would like to ask here, is why is it suddenly a bad thing to introduce a namespace for XLink when practically every recommendation, suggestion, and badly written XML configuration file seems to use one these days? Yes, they all come at a cost, among them that if you actually want to validate that included content from that other namespace, you need to implement something doing the work, somehow. You need to validate it against the right schema and so you need all kinds of lookup mechanisms and stuff. But if you can implement one namespace, shouldn’t you be able to implement several, especially if your imported namespace provides you with a useful mechanism, say, a standardised linking mechanism?

Namespaces aren’t my favourite W3C recommendation but they are what we have. In his blog and whitepaper, Robin points out several bad practices when implementing namespaces and I fully agree with them (perhaps excepting some of the discussion on a “default” namespace for attributes without a prefix), but they are mostly outside the topic at hand because I fail to see why they’d make XLink an undesired recommendation while still encouraging various others.

Robin continues:

Second, the distinction between href and src requires a second attribute.

To be perfectly honest, I’m not sure what this means. First of all, what, exactly, is the distinction between href and src? According to the XLink recommendation, href “supplies the data that allows an XLink application to find a remote resource,” adding that when used, it must be a URI. In simple XLinks, hrefs are all you need; the source and a reference to it are (or rather, can be) the same thing. (Yes, there is some verbosity since you’ll need that namespace declaration and the XLink type, that sort of thing, but if you use XML Schema, you’ll be far more verbose than this anyway.)
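For reference, this is roughly what simple XLinks amount to in practice, sketched in Python. The XLink namespace URI and the type and href attributes are the ones defined by the recommendation; the element names (section, xref, image) and the target URIs are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# A cross-reference and an image, both expressed as simple XLinks.
DOC = """
<section xmlns:xlink="http://www.w3.org/1999/xlink">
  <p>See <xref xlink:type="simple" xlink:href="groceries.xml#fruit">the fruit list</xref>.</p>
  <image xlink:type="simple" xlink:href="images/apple.png"/>
</section>
"""


def simple_links(root):
    """Return (element name, href) for every simple XLink, in document order."""
    links = []
    for element in root.iter():
        if element.get(f"{{{XLINK}}}type") == "simple":
            links.append((element.tag, element.get(f"{{{XLINK}}}href")))
    return links


links = simple_links(ET.fromstring(DOC))
```

The same few lines pick up both the cross-reference and the image, because the link semantics sit in the XLink attributes rather than in the element names. What to do with each link afterwards is left entirely to the application.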

When discussing extended XLinks, though, yes, there is a difference between a “source” and a “reference” to that source (provided I understand the objection correctly). It’s one of the really neat things with extended XLink because it allows us to leave out the linking information from the document instances. We can create complicated, multi-ended, linking structures between resources without the resources ever being aware of them being part of a link. The links can instead be described out-of-line, outside the resources, centrally in a linkbase.

To do this well, there needs to be a clear distinction between pointing out link ends and creating link arcs between them. Certainly, it requires more than one attribute, and in the XLink recommendation, it could easily require three (the pointer to the source, the source’s label, and the actual link arc).
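A minimal sketch of how those three pieces fit together, again in Python. The attributes (type, href, label, from, to) are the XLink ones; the element names (links, loc, go) and the resources they point to are invented, and the resolver is a simplification of my own, not a complete XLink processor.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"


def x(name):
    """Build the ElementTree name of an XLink attribute."""
    return f"{{{XLINK}}}{name}"


# An out-of-line extended link: the resources know nothing about it.
LINKBASE = """
<links xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
  <loc xlink:type="locator" xlink:href="manual.xml#brakes" xlink:label="procedure"/>
  <loc xlink:type="locator" xlink:href="parts.xml#pad-01" xlink:label="part"/>
  <loc xlink:type="locator" xlink:href="parts.xml#disc-07" xlink:label="part"/>
  <go xlink:type="arc" xlink:from="procedure" xlink:to="part"/>
</links>
"""


def resolve_arcs(extended_link):
    """Expand every arc into (start href, end href) pairs via the labels."""
    by_label = {}
    for child in extended_link:
        if child.get(x("type")) == "locator":
            by_label.setdefault(child.get(x("label")), []).append(child.get(x("href")))
    pairs = []
    for child in extended_link:
        if child.get(x("type")) == "arc":
            for start in by_label.get(child.get(x("from")), []):
                for end in by_label.get(child.get(x("to")), []):
                    pairs.append((start, end))
    return pairs


pairs = resolve_arcs(ET.fromstring(LINKBASE))
```

One arc between two labels expands into two traversals here, since two locators share the label "part". Note that the sketch ignores arcs that leave out from or to, which the recommendation treats as standing for all labels.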

Is this the only way to do multi-ended links? No, certainly not, but it does provide us with a standardised way, one that a group of people put considerable thought into. It is possible to redo the work and maybe even do it better, but unless you have a lot of time on your hands, why should you? It’s a perfectly serviceable recommendation, with far fewer side effects than, say, namespaces on older XML specs, and it does most of the things you’ll ever need with links.

(Granted, XLink, just as any post-namespaces spec, will cause havoc for any system that includes badly implemented XML parsers wanting to interpret everything before and including the colon in an element or attribute name as throwaway strings, but that’s not an XLink problem; it’s a namespaces problem and above all an implementation problem. XML allows colons in QNames; don’t use a parser that tries to redefine what was meant, once upon a time.)

Not everyone agreed with the XLink principles and so left them out in specs that followed, but I have a feeling that what happened was at least partly political (the linking in XHTML comes to mind, with the, um, discussions that ensued), plus that the timing could have been better. At the time, implementing XLink could be something of a pain.

An aside: around the time the XLink recommendation came out, I was heavily involved in implementing large-scale extended XLinks in a CMS for a well-known car manufacturer. Extended XLink solved many of our key problems; being able to define multiple relationships between multiple resources in multiple contexts using a central linkbase made, for the first time, actual single-source publishing possible for the company, and they had been using SGML for years.

The system almost wasn’t, however, for a very simple reason. The XML editor of choice (not my choice, by the way; I was presented with it as a fact of life) and its accompanying publishing solution could not handle the processing of inline link ends or indeed any kind of inline link elements beyond ID/IDREF pairs for page references. The editor and the publishing solution chosen would simply not allow us to access and process them, no matter what we did. This was before XSL-FO was finished or in widespread use, mind, and before most editors (including this one) would offer complete APIs for processing the XML.

I won’t go into details but the solution was ugly and almost voided the use of extended XLinks. No alternative linking solution would have fared any better, however; the problem was that we were slightly ahead of what was then practical to implement and several of the tools available then just didn’t cut it.

Getting back to Robin’s blog entry, he also says:

And then there are issues with parts of XLink being useless for (or detrimental to) one’s needs, which entails specifying that parts of it should be used but not others, or that on such and such element when one XLink attribute isn’t present it defaults to something specific not in the XLink specification, etc.

It’s hard to address the specifics here since there are none. I don’t have a clue of what parts of XLink are useless or detrimental to Robin’s work and can only address his more general complaints.

Most “standards” are like this. There is a basic spec that you need to adapt to, with the bare essentials, and there are additions that you can leave out if you don’t need them. XLink makes it easy to implement a minimal linking mechanism while offering a standardised way to expand that mechanism to suit future needs. It also deliberately leaves out the processing model, allowing, for example, for a far more flexible way to define “include” links than XInclude, a linking mechanism that in my mind is inferior in almost every respect to the relevant parts of the XLink spec.

Central here is that with XLink, I can use one linking mechanism for all my linking needs, from cross-references to images to include links, and still be able to define a single processing model for all of them, one that fits my needs. I suspect it would have been very difficult to define anything sufficiently consistent (yet flexible) in the spec itself, so why force one into it?

To me, this is akin to the early criticism DTDs received for lacking data typing. XML Schema added this capacity, resulting in a huge specification with a data typing part that either remained unused or was used for all the wrong reasons. In a document-centric world, data typing is mostly unnecessary, which is a good reason why it wasn’t included in DTDs. (In the few cases where data typing was useful, it was easy enough to add an attribute for the element(s) in question, containing either a regular expression or some other suitable content definition, and add the necessary processing for the applications as needed. There was no need to write a novel for the data types no-one needed, anyway.)
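The attribute approach can be sketched in a few lines of Python. The attribute name (pattern), the element names and the part number format are all assumptions for the sake of the example; in a DTD the attribute would typically be declared once, with a fixed value, for the element that needs it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content: only partno carries a content definition.
DOC = """
<spec>
  <partno pattern="[A-Z]{2}-[0-9]{4}">AB-1234</partno>
  <partno pattern="[A-Z]{2}-[0-9]{4}">1234-AB</partno>
  <p>Ordinary text needs no data type at all.</p>
</spec>
"""


def check_patterns(root):
    """Return (element name, text) for content that fails its own pattern."""
    failures = []
    for element in root.iter():
        pattern = element.get("pattern")
        if pattern is not None and not re.fullmatch(pattern, element.text or ""):
            failures.append((element.tag, element.text))
    return failures


failures = check_patterns(ET.fromstring(DOC))
```

Only the second part number is reported. Elements without the attribute are never looked at, which is the point: the checking exists where it is needed and nowhere else.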

As you might guess, my point is that not including the processing model in the spec is a strength, not a weakness, because a sufficiently complete, general-purpose, processing model for a complete linking mechanism is most likely too complex to do well. It would only serve to create conflicting needs and make the spec less useful. Why not leave it to implementation?

Which brings me to Robin’s next point:

Core XML specification produced by the W3C such as XSLT or XML Schema don’t use it even though they have linking elements.

I don’t pretend to know why this is; I have an idea of why XHTML didn’t, and in my mind it had very little to do with any technical merits or lack of same, and a lot to do with politics and differing factions in the W3C. Could it be the same with XML Schema and XSLT? It might; I know that XLink could have addressed the linking needs of both specs. Certainly, XML Schema is “costly” enough to not be bothered by an extra namespace among those already included. Maybe someone close to the working groups would like to share, but what’s the point now?

In Robin’s blog, the above statement leads to:

I don’t believe that anyone implements much in the way of generic link processing.

I’ve implemented a lot in this respect, starting from about the time XML became an official spec. XLink has proved to be very useful, allowing me to benefit from my earlier work while still being flexible enough to encourage some very differing link implementations.

Granted, most of my work has been document-centric, with my clients ranging from companies very small to the armed forces of my native country, but in all of these, XLink has proven to be sufficiently useful and flexible. A friend of mine, Henrik Mårtensson, now a business management guru, wrote a basic XLink implementation more than a decade ago (yes, long before XLink was a finished spec; we were both involved in implementing XLink in various places back then), with everything that was required to create useful links, be they cross-references, pointers to images, or something else. This implementation is still in use today, and while I and others have changed a lot of stuff surrounding it, the core and the basic model remain unchanged. My presentation at XML Prague 2009, right after Robin’s, touched on some of this work, and had my computer been healthier, he would have witnessed at least one XLink implementation.

Which (sort of) leads to Robin’s last point:

Reuse of other languages should be done where needed, and when the cost does not exceed that of reinvention.

I agree with the basic notion, obviously, but not with his conclusions. XLink, to me, is exactly the kind of semantics that is far easier to reuse than to reinvent. Yes, it is possible to simply write “href CDATA #IMPLIED” (or the schema equivalent) and be done with it, but anything more complex than that will benefit from standardisation, especially if you ever envision having to do it again. XLink is a terrific option when it comes to anything having to do with linking.

Inline Tagging

Here’s a trivial little piece of inline tagging that is nagging me:

It’s a classic chicken-or-egg problem, really. The tagging is commonplace enough; it’s trivial, crude, even, and represents an emphasised and superscripted number, but should the number be emphasised first and then superscripted, or should it be the other way around, like this:

I know, I really shouldn’t bother, but it is precisely this kind of nested inline tagging that can completely stop me in my tracks. In a wider context, the question is: is the order of nesting important? That is, semantically speaking, is there a difference? Am I saying that an emphasised (in other words, important) number happens to be superscripted, or that a superscripted number happens to be emphasised (important)?
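To see what is at stake, here is a small Python sketch of the two nesting orders. The element names (p, emph, sup) and the content are hypothetical stand-ins, not necessarily the markup that started this; the sketch only shows that the two variants are different trees carrying identical text.

```python
import xml.etree.ElementTree as ET

# The same emphasised, superscripted number, nested both ways.
emph_outside = ET.fromstring("<p>x<emph><sup>2</sup></emph></p>")
sup_outside = ET.fromstring("<p>x<sup><emph>2</emph></sup></p>")


def nesting(element):
    """List the element names from the outside in, in document order."""
    return [e.tag for e in element.iter()]


# The character content is identical in both variants...
same_text = "".join(emph_outside.itertext()) == "".join(sup_outside.itertext())

# ...but the structure is not, so anything that matches on parent and child
# (a stylesheet template, a query) will treat the two differently.
same_structure = nesting(emph_outside) == nesting(sup_outside)
```

So a processor does see a difference, even if a reader of the formatted output never will.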

More often than not, this type of inline tagging is about formatting, not semantics, so it probably doesn’t matter. Also, emphasis as an inline tag is dodgy at best because it says that the highlighted text is important but fails to mention why, and the “why” is what is important if we want semantics, if we need intelligence. It’s the same with superscript and subscript elements, and quite a few other common inline elements that are about how things should be presented rather than structured.

But then, of course, formatting is useful, too, because it can visualise abstract concepts.

Chicken or egg, folks? Me, I don’t know. I only wrote this because I needed a break from designing an export format from a product database, a format where I need to visualise data.

Put XSD 1.1 On Hold

In his latest blog entry at O’Reilly, Rick Jelliffe asks W3C to please put XSD 1.1 on hold and address the deeper underlying issues that make schemas practically useless.

I’d like to go one step further and encourage the schema working group to consider Relax NG, compact syntax, instead, as a more sensible and compact alternative to XSDs. It does everything we need from a schema language, without being impenetrable or impossibly verbose. If W3C actively endorsed Relax NG, maybe we’d get the software manufacturers to support Relax NG on a wider scale. Yes, I know, Oxygen already supports it, but there are plenty of manufacturers out there that need to follow suit.

Please.

Out of Print

I like O’Reilly’s books. They’re well-written, well-researched, and a lot of fun to read. They are also very cool, because O’Reilly, probably better than anyone else in the IT publishing business, know how to market their books (think The Camel Book if you don’t believe me, and resign yourself to “I need to find another blog to read” if you don’t get this particular geek reference). I would love to write for them some day.

In the meantime, I frequently surf over to their site, reading the blogs, browsing the catalogue, and planning my next buy. And sometimes I just read stuff here and there. Today, I browsed the list of out-of-print books and found this:

I LOL’d, as they say. Yes, a book from January 1900 is probably out of print by now, but I had no idea that XML was that old.

sgmlguru.org

on markup, film projection and more!
