[sc34wg3] Canonical XTM: implementation report

Lars Marius Garshol sc34wg3@isotopicmaps.org
21 Jan 2004 11:44:04 +0100


I've now written an implementation of the entire 2003-12-30 CXTM draft
excluding only the representation of the [reified] property, and am
happy to say that it took only a few hours. The biggest obstacle was
handling type-instance associations, but even that wasn't really hard,
though some minor trickery was required.

The resulting code is 583 lines and takes just a fraction of a second
to canonicalize all of opera.xtm, so everything seems fine. The whole
thing was really quite straightforward, though it is of course still
possible that I've gotten something wrong somewhere.

My conclusion is that this is what we want. It really is implementable
and it seems to me to work well and give a clear picture of model
instances (at least once pretty-printed :). If we could get a draft
that fixes all the known issues I think we should be ready to start
creating test cases to verify that this really does work as it should.

Using Canonical XML as the basis for this specification appears to be
fine. That document is hard to read, but this may not be much of a
problem once we have some examples to work against. (Actually, the
examples in the Canonical XML document should suffice.)


If anyone wants actual canonicalized documents with corresponding
input I'll be happy to provide examples.


--- General comments

 - The RNC schema was very helpful in verifying that I'd gotten things
   right. I strongly recommend that we include it in an annex so that
   it becomes an official part of the standard.

 - Specifying all the empty XML Infoset properties throughout the
   document drowns the useful material in noise and makes the
   document hard to read. I think it would be much easier to review
   and implement this standard if we cut them out, since then the
   substance would be visible rather than hidden.

 - The resulting documents are all on one line. For opera.xtm this
   means 1.1MB of text on a single line, and it really gets awkward to
   read. I think we should add some whitespace to the canonical
   representation to make it more readable.

   The original "Canonical XTM" technical report had this in it:

   "The output document must be a canonical XML document. In addition,
   a line feed (U+000A) must be inserted after every end tag and
   likewise after every start tag of elements that have element
   content or are empty. (This means <baseNameString>, <resourceData>,
   <topicRef>, <instanceOf>, <resourceRef>, <subjectIndicatorRef>.)"

   Maybe we should do something like it?

 - My earlier comments about relativization of locators still apply,
   of course.

 - Is this a committee draft? Will it appear in the SC34 document
   registry? Has it already? (Couldn't find it.)
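
To make the whitespace rule quoted above concrete, here is a minimal
sketch of how it could be applied as a post-processing step on an
already canonicalized string. This is not taken from my implementation
or from either draft. The function name add_line_feeds and the sample
input are made up for illustration, and the sketch assumes the output
contains no comments, processing instructions or mixed content, and
that attribute values are always double-quoted (as Canonical XML
produces them).

```python
import re

# A tag (attribute values may contain '>' in Canonical XML, so quoted
# strings are skipped explicitly), or a run of character data.
TOKEN = re.compile(r'<(?:[^>"]|"[^"]*")*>|[^<]+')


def add_line_feeds(canonical_xml: str) -> str:
    """Insert U+000A after every end tag, and after every start tag
    that is directly followed by another tag (element content or an
    empty element). Hypothetical helper, for illustration only."""
    tokens = TOKEN.findall(canonical_xml)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if not tok.startswith("<"):
            continue  # character data: nothing is inserted after it
        if tok.startswith("</"):
            out.append("\n")  # every end tag gets a line feed
        else:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if nxt.startswith("<"):
                # start tag followed by a tag: element content or empty
                out.append("\n")
    return "".join(out)


# Made-up sample input, not real CXTM output.
sample = (
    '<topic><baseName><baseNameString>Tosca</baseNameString></baseName>'
    '<instanceOf><topicRef href="#opera"></topicRef></instanceOf></topic>'
)
pretty = add_line_feeds(sample)
```

Elements with text content stay on one line, so the string values
remain untouched while the structure becomes readable in a plain
text editor or a line-based diff.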


--- Comments on specific sections

 - 3.3: TMDM already requires strings to be in NFC, so there's no need
   to repeat it here. (It could go in as a note if someone feels it's
   a useful clarification.)

 - 3.7: There's no need to compare on [variants] here. If the first 4
   properties are equal the topic name items will have merged anyway.

 - 3.11: Association roles don't have scope, and the [parent] property
   provides what is necessary for comparisons outside the context of
   their parent association.

 - 4.4: There is no [subject address] property any more; use [subject
   locator]. (Hey, you were the one who pointed out that this should
   be changed! :)

 - 4.5: [scope] is never null, so this should say "the empty set"
   instead.

 - 4.7: The RNC schema contradicts the order given here. The schema
   has scope first, then type, while the text has it the other way
   around.

 - 4.10: Here we need to give more guidance on how to serialize
   locators. I think we should stress that they should be
   externalized, meaning that in URIs difficult characters should be
   escaped etc. Referencing some relevant W3C document specifying this
   would be good. (RFC 2396 is less clear than it could be on
   precisely which characters *must* be escaped.)

 - 4.10: Locators don't have [address]; it's [reference].

 - 4.12: This one is a lot of work to implement. You have to remember
   the position of every object in the TM in case it could have been
   reified. (Or test for it and remember it if it was reified, which
   is even more work.) 

   Given that this property is in any case redundant (there will be a
   pointer from [reifier] anyway) I think we should cut this. The only
   consideration is what to do in query results, where the reified
   item might not be included. I think that the rules as given don't
   really cover that case, and that we shouldn't try to.

   (Unfortunately, the same problem applies to [reifier]. I'm tempted
   to say that we should not try to handle it before we see that there
   is a need for it for TMQL.)

   I recommend that we leave this out. After all, we don't do
   topic.[roles]...

 - 4.13: Same two points as for 4.10. 

 - 4.13: This is the one place where the canonicalization process
   didn't feel entirely clean. We might want to make <resource> wrap
   <locator> or even lose <resource> entirely and just have <locator>.

 - 4.15: Should make it clear that the element is left out if [type]
   is null. (I thought it was supposed to be empty before I saw that
   <type> was consistently optional in the schema.)
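
As an illustration of the locator point under 4.10 and 4.13, the
following is a rough sketch of what "externalizing" a locator could
look like. It is only one possible reading: the draft does not say
which characters must be escaped, so the set of characters left
untouched here (SAFE) is my assumption, and externalize_locator is a
hypothetical name. Existing %-escapes are left alone by treating "%"
as safe, which also means a stray literal "%" is not repaired.

```python
from urllib.parse import quote

# Assumption: URI delimiters and punctuation that are passed through
# unchanged. Letters, digits and "_.-~" are never escaped by quote().
SAFE = "/:?#[]@!$&'()*+,;=%"


def externalize_locator(reference: str) -> str:
    """Percent-escape characters that are awkward in a URI, encoding
    non-ASCII characters as UTF-8 first. Illustration only."""
    return quote(reference, safe=SAFE, encoding="utf-8")


escaped = externalize_locator("http://example.org/my file.xtm#t\u00f8pic")
```

Whatever rule we end up with, the important property is that it is
deterministic, so that two processors serialize the same [reference]
value to the same string.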


-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50                  <URL: http://www.garshol.priv.no >