[sc34wg3] Canonical XTM: implementation report
Lars Marius Garshol
21 Jan 2004 11:44:04 +0100
I've now written an implementation of the entire 2003-12-30 CXTM draft
excluding only the representation of the [reified] property, and am
happy to say that it took only a few hours. The biggest obstacle was
handling type-instance associations, but even that wasn't really hard,
though some minor trickery was required.
The resulting code is 583 lines and takes just a fraction of a second
to canonicalize all of opera.xtm, so everything seems fine. The whole
thing was really quite straightforward, though it is of course still
possible that I've gotten something wrong somewhere.
My conclusion is that this is what we want. It really is implementable
and it seems to me to work well and give a clear picture of model
instances (at least once pretty-printed :). If we could get a draft
that fixes all the known issues I think we should be ready to start
creating test cases to verify that this really does work as it should.
Using Canonical XML as the basis for this specification appears to be
fine, and although that document is hard to read that may not be much
of a problem once we have some examples to work against. (Actually,
the examples in the Canonical XML document should suffice.)
If anyone wants actual canonicalized documents with corresponding
input I'll be happy to provide examples.
--- General comments
- The RNC schema was very helpful in verifying that I'd gotten things
right. I strongly recommend that we include it in an annex so that
it becomes an official part of the standard.
- All the empty XML Infoset properties being specified throughout the
document makes the useful stuff drown in the dross and really makes
the document hard to read. I think it would be much easier to
review and implement this standard if we cut that out, since then
the substance would be visible rather than hidden.
- The resulting documents are all on one line. For opera.xtm this
means 1.1MB of text on a single line, and it really gets awkward to
read. I think we should add some whitespace to the canonical
representation to make it more readable.
The original "Canonical XTM" technical report had this in it:
"The output document must be a canonical XML document. In addition,
a line feed (U+000A) must be inserted after every end tag and
likewise after every start tag of elements that have element
content or are empty. (This means <baseNameString>, <resourceData>,
<topicRef>, <instanceOf>, <resourceRef>, <subjectIndicatorRef>.)"
Maybe we should do something like it?
- My earlier comments about relativization of locators still apply.
- Is this a committee draft? Will it appear in the SC34 document
registry? Has it already? (Couldn't find it.)
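For the line-feed rule quoted above from the original technical report, here is a minimal sketch of what it could look like in code. It reads the rule literally: a line feed after every end tag, and after every start tag whose element has child elements or no content at all. It does not try to interpret the parenthetical list of element names, it ignores mixed content, and it is not a full Canonical XML serializer. The function names are made up for illustration and are not taken from the draft or from my implementation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

LF = "\n"  # U+000A


def write_element(elem, out):
    """Append elem to out, adding line feeds per the quoted rule."""
    # Attributes in sorted order; this is a simplification of the
    # Canonical XML attribute ordering and escaping rules.
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in sorted(elem.attrib.items())
    )
    out.append(f"<{elem.tag}{attrs}>")

    children = list(elem)
    has_text = bool(elem.text and elem.text.strip())

    if children or not has_text:
        # Element content or empty element: line feed after the start tag.
        out.append(LF)
    else:
        # Text-only element: keep the text on the same line.
        out.append(escape(elem.text))

    for child in children:
        write_element(child, out)

    # Line feed after every end tag.
    out.append(f"</{elem.tag}>")
    out.append(LF)


def serialize_with_line_feeds(xml_text):
    """Parse xml_text and return it with the line feeds inserted."""
    root = ET.fromstring(xml_text)
    out = []
    write_element(root, out)
    return "".join(out)


if __name__ == "__main__":
    print(serialize_with_line_feeds('<a x="1"><b>hi</b><c></c></a>'), end="")
```

The output stays deterministic, since the line feeds depend only on the element structure, so it should still be usable for byte-for-byte comparison.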
--- Comments on specific sections
- 3.3: TMDM already requires strings to be in NFC, so there's no need
to repeat it here. (It could go in as a note if someone feels it's
a useful clarification.)
- 3.7: There's no need to compare on [variants] here. If the first 4
properties are equal the topic name items will have merged anyway.
- 3.11: Association roles don't have scope, and the [parent] property
provides what is necessary for comparisons outside the context of
their parent association.
- 4.4: There is no [subject address] property any more, use [subject
locator]. (Hey, you were the one who pointed out that this should
be changed! :)
- 4.5: [scope] is never null, so this should say "the empty set"
- 4.7: The RNC schema contradicts the order given here. The schema
has scope first, then type, while the text has it the other way around.
- 4.10: Here we need to give more guidance on how to serialize
locators. I think we should stress that they should be
externalized, meaning that in URIs difficult characters should be
escaped etc. Referencing some relevant W3C document specifying this
would be good. (RFC 2396 is less clear than it could be on
precisely which characters *must* be escaped.)
- 4.10: Locators don't have [address]; it's [reference].
- 4.12: This one is a lot of work to implement. You have to remember
the position of every object in the TM in case it could have been
reified. (Or test for it and remember it if it was reified, which
is even more work.)
Given that this property is in any case redundant (there will be a
pointer from [reifier] anyway) I think we should cut this. The only
consideration is what to do in query results, where the reified item
might not be included. I think that the rules as given don't really
cover that case, and that we shouldn't try to.
(Unfortunately, the same problem applies to [reifier]. I'm tempted
to say that we should not try to handle it before we see that there
is a need for it for TMQL.)
I recommend that we leave this out. After all, we don't do
- 4.13: Same two points as for 4.10.
- 4.13: This is the one place where the canonicalization process
didn't feel entirely clean. We might want to make <resource> wrap
<locator> or even lose <resource> entirely and just have <locator>.
- 4.15: Should make it clear that the element is left out if [type]
is null. (I thought it was supposed to be empty before I saw that
<type> was consistently optional in the schema.)
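To make the 4.10 comment about externalizing locators a bit more concrete, here is a rough sketch of one possible escaping step. The set of characters left unescaped is an assumption on my part (the reserved and unreserved characters of RFC 2396, plus "%" so that existing escapes are not escaped twice, plus "#" for fragment identifiers); as noted above, which characters *must* be escaped is exactly the part that needs a proper reference. The function name externalize_locator is hypothetical.

```python
from urllib.parse import quote

# Assumption: characters that are left as they are.
# RFC 2396 reserved characters, the unreserved marks, "%" to keep
# existing escapes intact, and "#" for fragment identifiers.
# Letters and digits are never escaped by quote().
SAFE = ";/?:@&=+$," + "-_.!~*'()" + "%" + "#"


def externalize_locator(reference):
    """Percent-escape difficult characters in a locator reference.

    Non-ASCII characters are encoded as UTF-8 and then escaped,
    which is the default behaviour of urllib.parse.quote.
    """
    return quote(reference, safe=SAFE)


if __name__ == "__main__":
    print(externalize_locator("http://example.org/a b/é"))
```

Note that this does not normalize case in existing escapes or resolve relative references, so it does not settle the relativization question.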
Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50 <URL: http://www.garshol.priv.no >