[sc34wg3] A few comments on the CTM draft (Was: New CTM draft for Leipzig)

Andreas Sewe sewe at rbg.informatik.tu-darmstadt.de
Fri Sep 29 10:31:17 EDT 2006


Steve Pepper wrote:
> The Editors of CTM have published a new draft of the Compact Topic Maps 
> Syntax in preparation for the Leipzig meeting. It is available at 
> http://www.jtc1sc34.org/repository/0789.pdf.

I have been reading through the new CTM draft and would like to offer
some comments:

First of all I really like the new ASSERTION-BLOCK syntax; the resulting
topic maps are now very readable. :-)

That being said, I have grouped most of the following comments according
to the section of the CTM draft (version 0.5) they comment on. Here and
there, comments on more general issues (like character encodings) have
been included as well.


* Section 2:

> The following referenced documents are indispensable for the application of this document. For dated
> references, only the edition cited applies. For undated references, the latest edition of the referenced
> document (including any amendments) applies.

Here it is not clear to me whether "Extensible Markup Language (XML)
1.0" and "Namespaces in XML 1.0" count as dated references or not; both
references 'cite' a specific edition ("Fourth Edition" and "Second
Edition", respectively) but only include the date as part of their URI:

> [XML]
> Extensible Markup Language (XML) 1.0 (Fourth Edition)
> http://www.w3.org/TR/2004/REC-xml-20060816
                                     ^^^^^^^^
> [XMLN]
> Namespaces in XML 1.0 (Second Edition)
> http://www.w3.org/TR/2006/REC-xml-names-20060816
                                           ^^^^^^^^
Maybe this could be clarified by either using URIs like
<http://www.w3.org/TR/xml> and <http://www.w3.org/TR/xml-names> which
always refer to the latest edition of the respective specification, or,
in case those two references were really meant to be dated, by adding
explicit dates to the references in question:

   Extensible Markup Language (XML) 1.0 (Fourth Edition), 16 August 2006


* Section 3:

> compact IRI
> an IRI expressed as a tuple consisting of a prefix and a local part which are concatenated on
> deserialization

First of all it is not the prefix that is concatenated with the local
part but the corresponding string. w!Opera is not supposed to be
deserialized as "wOpera" but as "http://en.wikipedia.org/wiki/Opera".
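
To illustrate the difference, here is a minimal Python sketch of the
expansion. The prefix table and the function name are hypothetical and
not taken from the draft; in CTM the table would be filled by %prefix
declarations.

```python
# Hypothetical prefix table; in CTM it would be populated by %prefix declarations.
PREFIXES = {"w": "http://en.wikipedia.org/wiki/"}


def expand_compact_iri(compact, prefixes=PREFIXES, separator="!"):
    """Expand 'prefix!local' into a full IRI.

    The string bound to the prefix (not the prefix itself) is
    concatenated with the local part.
    """
    prefix, sep, local = compact.partition(separator)
    if not sep:
        raise ValueError(f"not a compact IRI: {compact!r}")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand_compact_iri("w!Opera"))
```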

That being said, adding a note that the term "Compact IRI" is not a
synonym for "QName" might be worthwhile. (A QName is a tuple is a tuple
is a tuple and never concatenated. A Compact IRI, however, is a shorthand.)


* Also on the issue of Compact IRIs:

Using a '!' to separate the prefix from the local part is rather ugly. I
can see why a colon would be problematic, though.

CSS, however, uses a '|' for precisely the same purpose. (See the "CSS
Module: Namespaces" Draft, <http://www.w3.org/TR/css3-namespace>.) Since
the current version of CTM already looks similar to CSS you might want
to use '|' as well.

Another option which, IMHO, looks better is a simple '.' as separator.
This follows the tradition of HTML's meta tags, which use, e.g., names
like "DC.identifier".


* Sections 4.3.1, "Whitespace", and 4.3.2, "Comments":

Since the EBNF does not include explicit handling of whitespace,
problems arise when it comes to the correct placement of comments. Is,
e.g., the following allowed?

   # Some comment immediately at the start of the file
   %encoding "UTF-16"

This makes the automagic detection of the charset used difficult; but
some magic will be needed to at least be able to read the %encoding
declaration.
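
The kind of magic needed might look like the following Python sketch. It
assumes that leading comments and blank lines are allowed before the
declaration (the very point in question) and that the declaration sits
on a single line; the function name and the 1024-byte window are
arbitrary choices, not anything the draft prescribes.

```python
import codecs
import re

# Byte-order marks are checked first. UTF-32 LE must come before
# UTF-16 LE because its BOM starts with the same two bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

_DECL = re.compile(r'%encoding\s+"([^"]+)"')


def sniff_encoding(data, default="utf-8"):
    """Guess the encoding of a CTM byte stream from a BOM or %encoding."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:1024]
    # Exactly one NUL in the first two bytes suggests an ASCII character
    # stored in 16 bits; otherwise fall back to a byte-per-character read.
    if head[:2].count(b"\x00") == 1:
        provisional = "utf-16-be" if head[0:1] == b"\x00" else "utf-16-le"
    else:
        provisional = "latin-1"
    text = head.decode(provisional, errors="replace")
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and one-line comments
        match = _DECL.match(stripped)
        return match.group(1) if match else default
    return default
```

Note that the result is only the declared name; whether a file declared
as "UTF-16" is big- or little-endian still has to be settled separately.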

Furthermore, since the EBNF is silent on whitespace, the following might
even be legal:

   %encoding
   "UTF-8"

Is this intended?


* Section 4.3.2, "Comments":

I am still not happy with the two comment styles allowed: IMHO you
should either use one-line comments only (marked by '#') or allow both
one- and multi-line comments (marked by '//' and '/* */', respectively).
Everything else works against author expectations.


* Section 4.3.4, "DATATYPED-VALUE":

> CTM supports the datatypes of [XSD-2].

What about support for other sets of datatypes, e.g., the ones definable
with DTLL? Maybe a sentence along the lines of the following could be added:

   Support for other datatypes is implementation-defined.

Furthermore I wonder whether the prefix xsd is automatically declared; so
far no example declares it.


* Section 4.3.4 again:

> IRI
> IRIs are normally delimited by "<" and ">"; however, IRIs
> belonging to the HTTP scheme may be written without delimiters, or as compact IRIs, [...]

Does the latter part of the sentence ("or as compact IRIs") always apply,
or only in the case of HTTP IRIs? The latter reading frankly does not
make a lot of sense. The sentence should be reworded.


* Example 4:

Is the whitespace between a DATATYPED-VALUE's STRING and its DATATYPE
optional? The EBNF is silent on this, but I guess it is indeed optional.
If so, there ought to be an example using this form, too:

   "12-22"^^xsd:gMonthDay


* Section 4.5.4, "PREFIX-DECL":

The grammar for PREFIX-DECL uses the *-IRI rules:

> PREFIX-DECL ::= '%prefix' LOCAL-ID (DELIMITED-IRI | HTTP-IRI)

But namespace prefixes are not necessarily complete, legal IRIs; they
might simply be cut-off prefixes. In essence, namespace prefixes are
just strings, not IRIs (and as such not subject to any IRI
normalization, for example). (CSS had a similar design decision to make:
<http://www.w3.org/TR/css3-namespace/#syntax>.)


* Section 4.5.6, "MERGEMAP":

I somehow miss the definition of media types for both XTM and CTM, e.g.,
"application/topicmaps+xml" and "application/topicmaps-compact". That
would allow for the use of content negotiation (if resources are
accessed via HTTP). In such a scenario the NOTATION hint would not be
needed.

So please consider registering media types for XTM and CTM with the IANA!
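
As a rough sketch of how a client could then pick the notation without
a NOTATION hint: the two media types below are only the ones proposed
above and are treated as hypothetical, as are the function names.

```python
# Media types as proposed in this message; treated here as hypothetical.
NOTATION_BY_MEDIA_TYPE = {
    "application/topicmaps+xml": "xtm",
    "application/topicmaps-compact": "ctm",
}


def accept_header(preferred):
    """Build an HTTP Accept value, most preferred media type first."""
    parts = []
    for rank, media_type in enumerate(preferred):
        q = max(1.0 - 0.1 * rank, 0.1)
        parts.append(media_type if rank == 0 else f"{media_type};q={q:.1f}")
    return ", ".join(parts)


def notation_from_content_type(content_type):
    """Map a Content-Type header value to a notation, or None if unknown."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return NOTATION_BY_MEDIA_TYPE.get(media_type)
```

The client would send the Accept value with its request and choose the
parser from the Content-Type of the response.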


* Section 4.5.7, "INCLUDE":

Do you really need a FILEREF here, given that a URI scheme (file:)
exists that fits the bill just fine?


* Section 4.6.2.6, "COMPACT-ASSOCIATIONS":

Here the NULL-RULE looks just odd; so far I have always associated (no
pun intended) the '%' with a declaration of some sort. But %null is not
a declaration but a special constant. Maybe you can find a better name
for it. So what about '...' or '_'? (The latter resembles Prolog's
anonymous variable.)

Both suggestions ('...' and '_') would lend themselves well to making
anonymous subjects explicit, too.


* Section 4.6.3, "Scope":

What about using '+' as a list-separator for SCOPE? (All other list-like
constructs in CTM are separated by commas, semicolons, etc. Only SCOPE
is not.)

And '+' seems like an obvious choice for a separator since it is already
used that way in prose:

   # three description occurrences, in the scopes 'wordnet',
   # 'wikipedia'+'en', and 'wikipedia'+'it', respectively


* Section 4.6.5 "ASSOCIATION":

Do you really need yet another syntax for associations? Granted, it is
nice to have a syntax for associations without requiring an %assoc
template declaration, but that could be achieved differently, too:

   puccini -> place { born-in: lucca -> place; }

This is still a one-liner and also allows, AFAIK, for unambiguous
parsing since the '->' token tells the parser that this cannot be an
occurrence. Or am I mistaken here? (LL(1) parsers might be in trouble,
though.)

And for the freestanding associations of Example 23, which exploit an
existing %assoc template, there is really no need for a shorter syntax:

   puccini { pupil-of: ponchielli }

is just as short as

   pupil-of(puccini, ponchielli)


* Section 4.6.6, "ASSOCIATION-SET":

Shouldn't that be a ('-')+ in the EBNF for CHILD-NODE (which has one or
more leading '-'s)?

>   CHILD-NODE ::= ('-'|'*') ( TOPIC-ID | CHILD-NODE )


* Section 4.6.6 again:

That being said I find the usefulness of the '*'-syntax questionable:

> composed [
>   puccini * la-boheme * tosca * butterfly * turandot
> ]

is just a different syntax for

   puccini {
     composed: la-boheme, tosca, butterfly, turandot
   }

It is not even more concise! IMHO, both sections 4.6.5 and 4.6.6 with
their association-centric view (as opposed to subject-centric) are
highly redundant.

The hierarchical syntax of Example 24, while slightly more useful than
the '*'-syntax of Example 25, could be emulated as follows:

   person {
     supertype-subtype:
       musician {
         supertype-subtype: composer, conductor;
       },
       writer {
         supertype-subtype: librettist, playwright;
       };
   }

All that is needed is allowing an ASSERTION-BLOCK wherever a complete
COMPACT-ASSOCIATION allows a ROLEPLAYER. In this case it might be
worthwhile, however, to have an explicit name for the anonymous topic,
since an empty list of SUBJECTS might look confusing. (See the
discussion about %null above.)

At any rate, I find the above tree as readable as the one given in
Example 24. And since it uses curly braces as delimiters it even remains
readable if insignificant whitespace is stripped. Example 24,
however, would look like this:

   supertype-subtype [ person - musician - - composer - - conductor -
   writer - - librettist - - playwright ]
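
A parser can still recover the tree from this flattened form, as the
following Python sketch suggests. It assumes that every '-' is a
separate whitespace-delimited token and that the number of '-' tokens
before a topic gives its depth; that is one reading of the syntax, not
something the draft states, and the function name is made up.

```python
def parse_hierarchy(flat):
    """Return (depth, topic) pairs from a whitespace-flattened hierarchy.

    Depth is the number of consecutive '-' tokens preceding the topic.
    """
    pairs = []
    depth = 0
    for token in flat.split():
        if token == "-":
            depth += 1
        else:
            pairs.append((depth, token))
            depth = 0
    return pairs


flat = ("person - musician - - composer - - conductor - "
        "writer - - librettist - - playwright")
for depth, topic in parse_hierarchy(flat):
    print("  " * depth + topic)
```

The machine has no trouble here; the difficulty is entirely on the side
of the human reader.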

Furthermore, other information like composer's subject name and subject
identifiers might be included directly in the tree as well. This makes
the above suggestion much more versatile than the highly specialized
hierarchical syntax, which is really good for a single use case only.


* On the issue of character encodings:

What character set does CTM use internally? I guess it is always Unicode
but that is never made explicit.

Now if a CTM document is encoded in a different character set that cannot
represent all Unicode code points, there is currently no mechanism to
enter the missing characters. Maybe an escape mechanism could be included.

This would allow for the use of problematic characters in identifiers,
too. If such a character is required as part of an identifier it can be
used in its escaped form -- which never has special meaning as far as
CTM is concerned: \000025null and %null simply refer to different things.
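
A minimal Python sketch of that behaviour follows. The escape form (a
backslash followed by exactly six hexadecimal digits) is merely inferred
from the example above, and both function names are hypothetical.

```python
import re

# Assumed escape form: backslash plus exactly six hex digits.
_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{6})")


def unescape(raw):
    """Replace each escape by the character it names.

    Values above 10FFFF are not code points; chr() raises ValueError.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)


def classify(raw_token):
    """Classify on the raw text, so escaped forms never become keywords."""
    if raw_token == "%null":
        return ("keyword", "%null")
    return ("identifier", unescape(raw_token))


print(classify("%null"))
print(classify("\\000025null"))
```

Because the keyword test runs before unescaping, the second token ends
up as an identifier even though it spells "%null" once decoded.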

Furthermore escapes are useful when newlines have to be used inside a
string; including the newline verbatim is ugly and ruins any indentation
scheme:

   book {
     dc!title: """A Book With
   Newlines In Its Title"""
   }


I hope the above comments have been helpful to you. Keep up the good work!

With kind regards,

Andreas Sewe


