Category Archives: semantics

Semantic Profiles

Following my earlier post on semantic documents, I’ve given the subject some thought. In fact, I wrote a paper on a related subject and submitted it to XML Prague for next year’s conference. The paper wasn’t accepted (in all fairness, the paper was off-topic for the themes for the event), but I think the concept is both important and useful.

Briefly, the paper is about profiling XML content. The basics are well known and very frequently used: you profile a node by placing a condition on it. That condition, expressed using an attribute, is then compared to a publishing context defined using a similar condition on the root. If met, the node is included; if not, the node is discarded.

The matching is done with a simple string comparison but the mechanism can be made a lot more advance by, say, imposing Boolean logic on the condition. You need to match something like A AND B AND NOT(C), or the node is discarded. Etc.

The problem is that in the real world, the conditions, the string values, usually represent actual product names or variants, or perhaps an intended reader category. They can be used not only for string matching but for including content inline by using the condition attribute contents as variable text: a product variant, expressed as a string in an attribute in an EMPTY element, can easily be expanded in the resulting publication to provide specific content to personalise the document.

Which is fine and well, until the product variant label or the product itself is changed and the documents need to be updated to reflect this. All kinds of annoyances result, from having to convert values in legacy documents to not being able to do so (because the change is not compatible with the existing documents). Think about it:

If you have a condition “A” and a number of legacy documents using that condition, and need to update the name of the product variant to “B”, you need to update those existing documents accordingly, changing “A” to “B” everywhere. Problem is, someone owning the old product variant “A” now needs to accept documentation for a renamed product “B”. It’s done all the time but still causes confusion.

Or worse, if the change to “B” affects functionality and not just the name itself, you’ll have to add “B” to the list of conditions instead of renaming “A”, which in turn means that even if most of the existing documentation could be reused for both “A” and “B”, it can’t because there is no way to know. You’ll have to add “B” whenever you need to include a node, old or new.

This, in my considered opinion, happens because of the following:

  • The name, the condition, is used directly, both as a condition and as a value.
  • Conditions are not version handled. If “B” is a new version of “A”, then say so.

My solution? Use an abstraction layer. Define a semantic profile, a basic meaning for the condition, and version handle that profile, updating it when there is a change to the condition. The change could be a simple name change for the corresponding product but it could just as well be a change to the product’s functionality. Doesn’t really matter. A significant change will always requires a new version. Then, represent that semantic profile with a value used when publishing.

Since I like URNs, I think URNs are a terrific way to go. It’s easy to define a suitable URN schema that includes versioning and use the URN string as the condition when filtering, but the URN’s corresponding value as expanded content. In the paper, I suggest some simple ways to do this, including an out-of-line profiling mechanism that is pretty much what the XLink spec included years ago.

Using abstraction layers in profiling is hardly a new approach, then, but it’s not being used, not to my knowledge, and I think it should. I fully intend to.

Semantic Documents

I’m back from XML Finland, where I held a presentation on how to use the concept of semantic documents in content management systems. Not everyone was convinced, but I wasn’t thrown out, either.

A semantic document is the core information carrier, before a language or other means of presentation to an audience, is added. It’s an abstraction; obviously, there can be no such thing in the real world but as a concept, the semantic document is useful.

For example, a translation of a document can using the concept be defined as a rendition of the original, just as a JPG image can be rendered in, say, PNG without the contents of the image changing. It is very strictly a matter of definition–the rendition is not necessarily identical in all details of content to the original, it’s simply defined to be a matching rendition for a target audience.

Of course, for a semantic document and its rendition in a given language to be meaningful in a CMS, none of those varying details can be significant to the semantics of the basic information carrier, only to make a necessary clarification of the core information to the target audience. In other words, a translation may differ from the original for, say, cultural reasons (if the original language’s details in question are bound to the original language and readership), but the basic meaning cannot be allowed to change.

To the concept I also added version handling, that is, a formal description of the evolution of the contents of the basic information over time. When a new version is required is, of course, also a matter of definition; I’d go with “a significant and (in some way) completed change”. What’s important is that a two matching or equivalent renditions of the semantic document must always use matching versions.

Expressed using a pseudo-URN schema, if the core semantic document in some well-defined version (say “1”) is defined as URN:1, the Swedish and Finnish versions would be defined as URN:1:sv and URN:1:fi, respectively. They would be defined to be different renditions of each other but identical in basic information. It follows that if a URN:2:sv was made, a new Finnish translation would have to be created, because the old translation would differ in some way, according to the definition

This, of course, is largely a philosophical question. In practice, all kinds of questions arise. I had several objections from the floor, of which most seemed to have to do with the evolution of the translation independently from the original. In my basic definition, of course, this is not a problem since the whole schema is a matter of definition, but in the real world, an independent evolution of a translation is often a very real problem.

It could well be that a translation is worked on rather than the original, for example, in a multi-national environment where different teams manage different parts of the content. While theoretically perfectly manageable simply by bumping the versions of that particular translation, a system keeping track of, say, 40+ active target languages becomes a practical problem.

I don’t think the problem is unsolvable if there is a system in place to keep track of all those different URNs, but only if the basic principles are strictly adhered to. For example, you can never be allowed to develop the content in different languages independently from each other at the same time, because the situation that would arise would have to deal with what in the software development world is known as “forking”, that is, developing differing content from the same basic version. While also solvable, the benefits of such an approach in documentation are doubtful.

Far easier and probably better is to define a “master language” as the only language allowed to drive content change. In the above pseudo-URNs, Swedish could be defined as a master language, meaning that any new content would have to be added to it first and then translated to the other languages.

This is the basic principle behind the CMS, Cassis, that we develop at Condesign. It works, in that the information remains consistent and traceable, regardless of language, and allows for freely modularising documents for maximum reuse.

I would be interested in hearing opposing views. Some I addressed during my talk in Finland, but I’m sure there is more. Is there a reason you can think of that would break the principle of the semantic document?

elementNames and attributeNames

I keep getting annoyed by the (Java-inspired) naming of elements and attributes in some people’s XML, where the names contain capital letters to help keep the names clear. I’m sure you’ve seen how it works: elementName, attributeName, myNewAndExcitingElement, ohLookICanCreateReallyLongQNamesForNoApparentReason, ad nauseam.

Why do they do this? I know there is some kind of rationalisation for it in the world of programming languages, but in XML? XML is not a programming language and I still think it should be understandable and usable by humans (I know; SGML was supposed to be human-readable but XML doesn’t have that requirement). If you find yourself writing XML in a text editor (still happens to me), not only are these names enough to drive anyone nuts but they also make the XML more error-prone because you’re bound to spl something wrong. And if you write your XML in an XML editor, the element names filling the start and end tag symbols take up a lot of space that should be left to the actual content. (And no, I don’t believe in the minimal tag symbols that some editors provide; I want to actually see the tag names and I want to see the attribute names. They help me structure my document; in fact, they are there for that purpose!)
I ask again: why? If you are writing a schema and need to name an ordinary paragraph element, surely you don’t need to name it ordinaryParagraph or even paragraph? In my schemas, p is more than enough.
SoPleaseUseShorterNamesWithoutResortingToSillyConventionsBorrowedFromElsewhere.

Was XLink A Mistake?

This morning, I read Robin Berjon’s little something on XML Bad Practices, originally a whitepaper he presented at XML Prague 2009. I was there, presenting right after he did, and I remember that I nervously listened to his presentation while preparing my own (not my finest hour but that’s a story for another blog entry), wanting to address some of his points. While a lot of what he said made good sense, some didn’t then and certainly don’t now.

In Reusing the Useless, Robin discusses XLink, a recommendation that remains my personal favourite among W3C’s plethora of recommendations. Apparently it’s no-one else’s, at least if Robin is to be believed. “Core XML specification produced by the W3C such as XSLT or XML Schema don’t use it even though they have linking elements,” he says, adding that very few have implemented anything but the rudimentary parts of it. But I get ahead of myself; let’s see what Robin says. He starts out with this:

That feeling (and a general sense that reuse is good) leads people to want to reuse as many parts of the XML stack as possible when creating a new language. That is a good feeling, and certainly one that should be listened to carefully — there are indeed many good and useful technologies to reuse.

This, of course, makes a lot of sense. We are in the standardisation business so we don’t want to reinvent the wheel every time. Me, I’ve done so time and again, and the one W3C recommendation I have used again and again is… XLink. It provides me with a neat way of defining link semantics without enforcing a processing model, from very simple point-to-point relations to multi-ended link abstractions. Yes, I have used both; Simple XLinks are present in most of my DTDs requiring cross-referencing, images or indeed any point-to-point semantics, and Extended XLinks were a useful and necessary addition to the aftermarket document structures of a major car manufacturer, among other things.

But again, I get ahead of myself. Here’s what raised my eyebrows for the first time, this morning:

But that only works if everyone plays, and furthermore the cost of using XLink has to be taken into account. First, a whole new namespace is needed.

This is interesting, to say the least. I thought this was one of the main points of introducing namespaces in the first place, to avoid name collisions.

The basic idea behind namespaces is extremely simple: you use one DTD (well, maybe it’s a schema since DTDs aren’t namespace-aware; there’s a lot I would like to say on that topic, too, so either this is going to be a very long post or I need to start writing down my ideas for blog posts) but in your instances you need to include content created using other schemas. One solution is to only use unique names, but this is a pipe dream and in reality, there’s only so many names you can give, say, a paragraph (p, ptxt, para, …) or a cross-reference (ref, href, link, …), without resorting to silliness. Inevitably, your elements and attributes will have the same names as someone else’s, and that can be a huge pain. Namespaces are a neat way of getting around this problem, and as an added bonus you’ll eventually always get that question, “what does the namespace URL stand for?” from your audience when presenting your work.

My point, and the simple question I would like to ask here, is why is it suddenly a bad thing to introduce a namespace for XLink when practically every recommendation, suggestion, and badly written XML configuration file seems to use one these days? Yes, they all come at a cost, among them that if you actually want to validate that included content from that other namespace, you need to implement something doing the work, somehow. You need to validate it against the right schema and so you need all kinds of lookup mechanisms and stuff. But if you can implement one namespace, shouldn’t you be able to implement several, especially if your imported namespace provides you with a useful mechanism, say, a standardised linking mechanism?

Namespaces aren’t my favourite W3C recommendation but it is what we have. In his blog and whitepaper, Robin points out several bad practices when implementing namespaces and I fully agree with them (perhaps excepting some of the discussion on a “default” namespace for attributes without a prefix), but they are mostly outside the topic at hand because I fail to see why they’d make XLink an undesired recommendation while still encouraging various others.

Robin continues:

Second, the distinction between href and src requires a second attribute.

To be perfectly honest, I’m not sure what this means. First of all, what, exactly, is, the distinction between href and src? According to the XLink recommendation, href “supplies the data that allows an XLink application to find a remote resource,” adding that when used, it must be a URI. In simple XLinks, href‘s are all you need; the source and a reference to it are (or rather, can be) the same thing. (Yes, there is some verbosity since you’ll need that namespace declaration and the XLink type, that sort of thing, but if you use XML Schema, you’ll be far more verbose than this anyway.)

When discussing extended XLinks, though, yes, there is a difference between a “source” and a “reference” to that source (provided I understand the objection correctly). It’s one of the really neat things with extended XLink because it allows us to leave out the linking information from the document instances. We can create complicated, multi-ended, linking structures between resources without the resources ever being aware of them being part of a link. The links can instead be described out-of-line, outside the resources, centrally in a linkbase.

To do this well, there needs to be a clear distinction between pointing out link ends and creating link arcs between them. Certainly, it requires more than one attribute, and in the XLink recommendation, it could easily require three (the pointer to the source, the source’s label, and the actual link arc).

Is this the only way to do multi-ended links? No, certainly not, but it does provide us with a standardised way, one that a group of people put considerable thought into. It is possible to redo the work and maybe even do it better, but unless you have a lot of time on your hands, why should you? It’s a perfectly serviceable recommendation, with far fewer side effects than, say, namespaces on older XML specs, and it does most of the things you’ll ever need with links.

(Granted, XLink, just as any post-namespaces spec, will cause havoc for any system that includes badly implemented XML parsers wanting to interpret everything before and including the colon in an element or attribute name as throwaway strings, but that’s not an XLink problem; it’s a namespaces problem and above all an implementation problem. XML allows colons in QNames; don’t use a parser that tries to redefine what was meant, once upon a time.)

Not everyone agreed with the XLink principles and so left them out in specs that followed, but I have a feeling that what happened was at least partly political (the linking in XHTML comes to mind, with the, um, discussions that ensued), plus that the timing could have been better. At the time, implementing XLink could be something of a pain.

An aside: around the time the XLink recommendation came out, I was heavily involved in implementing large-scale extended XLinks in a CMS for a well-known car manufacturer. Extended XLink solved many of our key problems; being able to define multiple relationships between multiple resources in multiple contexts using a central linkbase made, for the first time, actual single-source publishing possible for the company, and they had been using SGML for years.

The system almost wasn’t, however, for a very simple reason. The XML editor of choice (not my choice, by the way; I was presented with it as a fact of life) and its accompanying publishing solution could not handle the processing of inline link ends or indeed any kind of inline link elements beyond ID/IDREF pairs for page references. The editor and the publishing solution chosen would simply not allow us to access and process them, no matter what we did. This was before XSL-FO was finished or in widespread use, mind, and before most editors (including this one) would offer complete APIs for processing the XML.

I won’t go into details but the solution was ugly and almost voided the use of extended XLinks. No alternative linking solution would have fared any better, however; the problem was that we were slightly ahead of what was then practical to implement and several of the tools available then just didn’t cut it.

Getting back to Robin’s blog entry, he also says:

And then there are issues with parts of XLink being useless for (or detrimental to) one’s needs, which entails specifying that parts of it should be used but not others, or that on such and such element when one XLink attribute isn’t present it defaults to something specific not in the XLink specification, etc.

It’s hard to address the specifics here since there are none. I don’t have a clue of what parts of XLink are useless or detrimental to Robin’s work and can only address his more general complaints.

Most “standards” are like this. There is a basic spec that you need to adapt to, with the bare essentials, and there are additions that you can leave out if you don’t need them. XLink makes it easy to implement a minimal linking mechanism while offering a standardised way to expand that mechanism to suit future needs. It also deliberately leaves out the processing model, allowing, for example, for a far more flexible way to define “include” links than XInclude, a linking mechanism that in my mind is inferior in almost every respect to the relevant parts of the XLink spec.

Central here is that with XLink, I can use one linking mechanism for all my linking needs, from cross-references to images to include links, and still be able to define a single processing model for all of them, one that fits my needs. I suspect it would have been very difficult to define anything sufficiently consistent (yet flexible) in the spec itself, so why force one into it?

To me, this is akin to the early criticism DTDs received for lacking data typing. XML Schema added this capacity, resulting in a huge specification with a data typing part that either remained unused or was used for all the wrong reasons. In a document-centric world, data typing is mostly unnecessary which is a good reason to why it wasn’t included in DTDs. (In the few cases where data typing was useful, it was easy enough to add an attribute for the element(s) in question, containing either a regular expression or some other suitable content definition, and add the necessary processing for the applications as needed. There was no need to write a novel for the data types no-one needed, anyway.)

As you might guess, my point is that not including the processing model in the spec is a strength, not a weakness, because a sufficiently complete, general-purpose, processing model for a complete linking mechanism is most likely too complex to do well. It would only serve to create conflicting needs and make the spec less useful. Why not leave it to implementation?

Which brings me to Robin’s next point:

Core XML specification produced by the W3C such as XSLT or XML Schema don’t use it even though they have linking elements.

I don’t pretend to know why this is; I have an idea of why XHTML didn’t, and in my mind it had very little to do with any technical merits or lack of same, and a lot to do with politics and differing fractions in the W3C. Could it be the same with XML Schema and XSLT? It might; I know that XLink could have addressed the linking needs of both specs. Certainly, XML Schema is “costly” enough to not be bothered by an extra namespace among those already included. Maybe someone close to the working groups would like to share, but what’s the point now?

In Robin’s blog, the above statement leads to:

I don’t believe that anyone implements much in the way of generic link processing.

I’ve implemented a lot in this respect, starting from about the time XML became an official spec. XLink has proved to be very useful, allowing me to benefit from my earlier work while still being flexible enough to encourage some very differing link implementations.

Granted, most of my work has been document-centric, with my clients ranging from companies very small to the armed forces of my native country, but in all of these, XLink has proven to be sufficiently useful and flexible. A friend of mine, Henrik MÃ¥rtensson, now a business management guru, wrote a basic XLink implementation more than a decade ago (yes, long before XLink was a finished spec; we were both involved in implementing XLink in various places back then), with everything that was required to create useful links, be they cross-references, pointers to images, or something else. This implementation is still in use today, and while I and others have changed a lot of stuff surrounding it, the core and the basic model remain unchanged. My presentation at XML Prague 2009, right after Robin’s, touched on some of this work, and had my computer been healthier, he would have witnessed at least one XLink implementation.

Which (sort of) leads to Robin’s last point:

Reuse of other languages should be done where needed, and when the cost does not exceed that of reinvention.

I agree with the basic notion, obviously, but not with his conclusions. XLink, to me, is exactly the kind of semantics that is far easier to reuse than to reinvent. Yes, it is possible to simply write “href CDATA #IMPLIED” (or the schema equivalent) and be done with it, but anything more complex than that will benefit from standardisation, especially if you ever envision having to do it again. XLink is a terrific option when it comes to anything having to do with linking.

Inline Tagging

Here’s a trivial little piece of inline tagging that is nagging me:

<emph><super>2</super></emph>

It’s a classic chicken-or-egg problem, really. The tagging is commonplace enough; it’s trivial, crude, even, and represents an emphasised and superscripted number, but should the number be emphasised first and then superscripted, or should it be the other way around, like this:

<super><emph>2</emph></super>

I know, I really shouldn’t bother, but it is precisely this kind of nested inline tagging that can completely stop me in my tracks. In a wider context, the question is: is the order of nesting important? That is, semantically speaking, is there a difference? Am I saying that an emphasised (in other words, important) number happens to be superscripted, or that a superscripted number happens to be emphasised (important)?

More often than not, this type of inline tagging is about formatting, not semantics, so it probably doesn’t matter. Also, emphasis as an inline tag is dodgy at best because while it says that the highlighted text is important but fails to mention why, and the “why” is what is important if we want semantics, if we need intelligence. It’s the same with superscript and subscript elements, and quite a few other common inline elements that are about how things should be presented rather than structured.

But then, of course, formatting is useful, too, because it can visualise abstract concepts.

Chicken or egg, folks? Me, I don’t know. I only wrote this because I needed a break form designing an export format from a product database, a format where I need to visualise data.