Category Archives: semantic documents

Semantic Profiles

Following my earlier post on semantic documents, I’ve given the subject some thought. In fact, I wrote a paper on a related subject and submitted it to XML Prague for next year’s conference. The paper wasn’t accepted (in all fairness, the paper was off-topic for the themes for the event), but I think the concept is both important and useful.

Briefly, the paper is about profiling XML content. The basics are well known and very frequently used: you profile a node by placing a condition on it. That condition, expressed using an attribute, is then compared to a publishing context defined using a similar condition on the root. If met, the node is included; if not, the node is discarded.

The matching is done with a simple string comparison but the mechanism can be made a lot more advance by, say, imposing Boolean logic on the condition. You need to match something like A AND B AND NOT(C), or the node is discarded. Etc.

The problem is that in the real world, the conditions, the string values, usually represent actual product names or variants, or perhaps an intended reader category. They can be used not only for string matching but for including content inline by using the condition attribute contents as variable text: a product variant, expressed as a string in an attribute in an EMPTY element, can easily be expanded in the resulting publication to provide specific content to personalise the document.

Which is fine and well, until the product variant label or the product itself is changed and the documents need to be updated to reflect this. All kinds of annoyances result, from having to convert values in legacy documents to not being able to do so (because the change is not compatible with the existing documents). Think about it:

If you have a condition “A” and a number of legacy documents using that condition, and need to update the name of the product variant to “B”, you need to update those existing documents accordingly, changing “A” to “B” everywhere. Problem is, someone owning the old product variant “A” now needs to accept documentation for a renamed product “B”. It’s done all the time but still causes confusion.

Or worse, if the change to “B” affects functionality and not just the name itself, you’ll have to add “B” to the list of conditions instead of renaming “A”, which in turn means that even if most of the existing documentation could be reused for both “A” and “B”, it can’t because there is no way to know. You’ll have to add “B” whenever you need to include a node, old or new.

This, in my considered opinion, happens because of the following:

  • The name, the condition, is used directly, both as a condition and as a value.
  • Conditions are not version handled. If “B” is a new version of “A”, then say so.

My solution? Use an abstraction layer. Define a semantic profile, a basic meaning for the condition, and version handle that profile, updating it when there is a change to the condition. The change could be a simple name change for the corresponding product but it could just as well be a change to the product’s functionality. Doesn’t really matter. A significant change will always requires a new version. Then, represent that semantic profile with a value used when publishing.

Since I like URNs, I think URNs are a terrific way to go. It’s easy to define a suitable URN schema that includes versioning and use the URN string as the condition when filtering, but the URN’s corresponding value as expanded content. In the paper, I suggest some simple ways to do this, including an out-of-line profiling mechanism that is pretty much what the XLink spec included years ago.

Using abstraction layers in profiling is hardly a new approach, then, but it’s not being used, not to my knowledge, and I think it should. I fully intend to.

Semantic Documents

I’m back from XML Finland, where I held a presentation on how to use the concept of semantic documents in content management systems. Not everyone was convinced, but I wasn’t thrown out, either.

A semantic document is the core information carrier, before a language or other means of presentation to an audience, is added. It’s an abstraction; obviously, there can be no such thing in the real world but as a concept, the semantic document is useful.

For example, a translation of a document can using the concept be defined as a rendition of the original, just as a JPG image can be rendered in, say, PNG without the contents of the image changing. It is very strictly a matter of definition–the rendition is not necessarily identical in all details of content to the original, it’s simply defined to be a matching rendition for a target audience.

Of course, for a semantic document and its rendition in a given language to be meaningful in a CMS, none of those varying details can be significant to the semantics of the basic information carrier, only to make a necessary clarification of the core information to the target audience. In other words, a translation may differ from the original for, say, cultural reasons (if the original language’s details in question are bound to the original language and readership), but the basic meaning cannot be allowed to change.

To the concept I also added version handling, that is, a formal description of the evolution of the contents of the basic information over time. When a new version is required is, of course, also a matter of definition; I’d go with “a significant and (in some way) completed change”. What’s important is that a two matching or equivalent renditions of the semantic document must always use matching versions.

Expressed using a pseudo-URN schema, if the core semantic document in some well-defined version (say “1”) is defined as URN:1, the Swedish and Finnish versions would be defined as URN:1:sv and URN:1:fi, respectively. They would be defined to be different renditions of each other but identical in basic information. It follows that if a URN:2:sv was made, a new Finnish translation would have to be created, because the old translation would differ in some way, according to the definition

This, of course, is largely a philosophical question. In practice, all kinds of questions arise. I had several objections from the floor, of which most seemed to have to do with the evolution of the translation independently from the original. In my basic definition, of course, this is not a problem since the whole schema is a matter of definition, but in the real world, an independent evolution of a translation is often a very real problem.

It could well be that a translation is worked on rather than the original, for example, in a multi-national environment where different teams manage different parts of the content. While theoretically perfectly manageable simply by bumping the versions of that particular translation, a system keeping track of, say, 40+ active target languages becomes a practical problem.

I don’t think the problem is unsolvable if there is a system in place to keep track of all those different URNs, but only if the basic principles are strictly adhered to. For example, you can never be allowed to develop the content in different languages independently from each other at the same time, because the situation that would arise would have to deal with what in the software development world is known as “forking”, that is, developing differing content from the same basic version. While also solvable, the benefits of such an approach in documentation are doubtful.

Far easier and probably better is to define a “master language” as the only language allowed to drive content change. In the above pseudo-URNs, Swedish could be defined as a master language, meaning that any new content would have to be added to it first and then translated to the other languages.

This is the basic principle behind the CMS, Cassis, that we develop at Condesign. It works, in that the information remains consistent and traceable, regardless of language, and allows for freely modularising documents for maximum reuse.

I would be interested in hearing opposing views. Some I addressed during my talk in Finland, but I’m sure there is more. Is there a reason you can think of that would break the principle of the semantic document?