[sc34wg3] Whitespace agnosticism in CTM and descendants

Xuân Baldauf xuan--2007.04--sc34wg3--isotopicmaps.org at baldauf.org
Mon Apr 2 12:18:33 EDT 2007


Hello,

I'd like to raise the issue of whitespace agnosticism. A friend of mine
wrote an LTM parser and stumbled over the problem that the colon ':' is
overloaded, being used both for namespaces and for roles. LTM resolves
this by distinguishing between a colon without adjacent whitespace on
the one hand and a colon with adjacent whitespace on the other hand.
The difference between these usages of the colon is easy to miss and
hard to see, and that is exactly the problem. I fear that CTM and
related upcoming standards suffer from the same problem.

An LTM example:
   a(b:c : d)      has a different meaning to
   a(b : c:d)

Whitespace is meaningful here, but on other occasions it is not
meaningful.
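
To make the distinction concrete, here is a minimal tokenizer sketch in
Python. It is not taken from the LTM specification: the token pattern,
the group names (role_sep, ns_sep) and the rule "a colon touching
whitespace on at least one side is a role separator" are my assumptions,
chosen only to mirror the behaviour described above.

```python
import re

# Hypothetical token pattern (an assumption, not the real LTM grammar):
# a colon next to whitespace is a role separator, a colon glued to both
# of its neighbours is a namespace separator.
TOKEN = re.compile(r"""
    (?P<role_sep>\s+:\s*|:\s+)    # colon with whitespace on at least one side
  | (?P<ns_sep>:)                 # colon with no adjacent whitespace
  | (?P<ident>[A-Za-z_][\w-]*)    # identifier
  | (?P<punct>[()/,])             # other punctuation
  | (?P<ws>\s+)                   # whitespace that carries no meaning
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, value) pairs; insignificant whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "ws":
            continue
        # Normalise both colon flavours to ":" so only the kind differs.
        value = ":" if kind in ("role_sep", "ns_sep") else m.group()
        tokens.append((kind, value))
    return tokens
```

With this sketch, "a(b:c : d)" and "a(b : c:d)" produce different token
streams, while "a(b:c : d) / e" and "a(b:c : d)/e" produce the same one.
The lexer therefore has to treat whitespace as significant around one
symbol and insignificant everywhere else.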

An LTM example:
   a(b:c : d) / e   has the same meaning as
   a(b:c : d)/e

As it is easy to make a "whitespace spelling" error (and as it is no
longer straightforward to generate parsers for languages with mixed
whitespace meaningfulness using common parser generators), I would like
to recommend that whitespace is not meaningful in CTM, except in string
literals. "Not meaningful" means that removing none, some or all of the
(non-string-literal) whitespace of a CTM fragment or document does not
change its meaning.
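
This property can be checked mechanically. The following Python sketch
removes all whitespace outside string literals and compares parse
results. The function names are made up for illustration, and I assume
string literals are delimited by double quotes without escape sequences,
which the draft may define differently. It only tests the "remove all"
case, so it is a necessary check, not a complete one.

```python
def strip_insignificant_whitespace(text, quote='"'):
    """Remove every whitespace character that is outside a string literal.

    Assumption: literals are delimited by `quote` and contain no escapes.
    """
    out = []
    in_string = False
    for ch in text:
        if ch == quote:
            in_string = not in_string
            out.append(ch)
        elif in_string or not ch.isspace():
            out.append(ch)
    return "".join(out)


def is_whitespace_agnostic(parse, samples):
    """True if `parse` gives the same result with and without whitespace.

    `parse` is any callable mapping source text to a comparable value.
    """
    return all(
        parse(sample) == parse(strip_insignificant_whitespace(sample))
        for sample in samples
    )
```

A parser for a whitespace-agnostic language should pass this check for
every valid sample document, so it could serve as one entry in a test
suite.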

Unfortunately, according to the current CTM specification draft (
http://www.jtc1sc34.org/repository/0820.htm#creating-qnames ), [3]
"topic-ref" may resolve to "identifier" as well as to [5]
"subject-identifier", which in turn may resolve to [0] "qname", which in
turn resolves to "prefix : local", both of which resolve to
"identifier". Thus, effectively,

  topic-ref -> identifier | identifier : identifier | other-topic-refs

However, [16] "role" may resolve effectively to "topic-ref : topic-ref".
Thus, effectively,

  role -> identifier : identifier
        | identifier : identifier : identifier
        | identifier : identifier : identifier : identifier
        | other-roles

At this stage, at least the second alternative ("identifier : identifier
: identifier") can no longer be parsed unambiguously into higher-level
expressions: both the "qname : identifier" and the "identifier : qname"
interpretations are possible.
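
The ambiguity can be enumerated directly. The Python sketch below uses
only the simplified productions derived above (topic-ref is either one
identifier or a two-identifier qname, role is "topic-ref : topic-ref")
and ignores "other-topic-refs" and "other-roles". It assumes whitespace
carries no meaning, so every colon looks the same to the parser. The
function names are mine.

```python
def topic_ref_readings(ids):
    """Ways to read a run of colon-separated identifiers as one topic-ref."""
    if len(ids) == 1:
        return [ids[0]]                         # plain identifier
    if len(ids) == 2:
        return [f"qname({ids[0]}:{ids[1]})"]    # prefix : local
    return []                                   # no reading in this simplified grammar


def role_readings(ids):
    """All ways to split the identifiers into topic-ref : topic-ref."""
    readings = []
    for split in range(1, len(ids)):
        for left in topic_ref_readings(ids[:split]):
            for right in topic_ref_readings(ids[split:]):
                readings.append((left, right))
    return readings
```

For two or four identifiers, role_readings returns exactly one reading.
For three identifiers it returns two, one with the qname on the right
and one with the qname on the left, which is the ambiguity described
above.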

So what can be done about it?

   1. De-overload the colon ":".
         1. Use another symbol, like "=", "->", ":=", ",", whatever, for
            separation of "type" expressions and "player" expressions in
            "role" expressions.
         2. Use another symbol, like '#', '+', '&', '%', whatever, for
            separation of "prefix" expressions and "local" expressions
            in "qname" expressions.
   2. Forbid the second alternative, always expecting either "identifier
      : identifier" or "identifier : identifier : identifier :
      identifier", which is not ambiguous. (Yet, this would bloat the
      grammar, as there would be "long-role" and "short-role" and
      "long-topic-ref" and "short-topic-ref".)
   3. Forbid the second and the first alternative, always expecting
      "identifier : identifier : identifier : identifier". Simple to
      parse, but clumsy to write.
   4. Forbid the second and the third alternative, always expecting
      "identifier : identifier". Simple to parse, but inflexible to write.
   5. Overload the colon and make the life of CTM authors and CTM parser
      authors unnecessarily harder, more troublesome and error-prone,
      forever.

Also interesting to note: the only occurrence of the word "whitespace" in
the current draft is in the sentence "Comments are allowed where
whitespace characters are allowed.". Yet what whitespace characters are,
and where they are allowed, is neither defined nor referenced in the
current document. Thus, as Lars Heuer already comments in Section
3.19, "The EBNF is not valid". As TMCL and TMQL are supposed to be based
on CTM syntax, they are currently unparseable as well (at least in
theory; if we add uncodified writing conventions, they are awkward to
parse in practice at best).

If we do not fix this bug soon, everyone will get used to it (maybe
some of you already have), and the bug will become permanent,
plaguing every CTM, TMCL and TMQL author forever. So, please, let's fix
this issue, get rid of hoary relics, and finally de-overload the colon
':' by replacing it in one of its usages.

Does anybody recommend an alternative symbol for one usage of the colon?

Xuân.

P.S.: To avoid these types of problems, one should really write a
clean-room reference implementation of a CTM parser (maybe even along
with a test suite) before freezing the specs.
P.P.S.: Ironically, one could even use the "empty symbol" (effectively a
whitespace) to separate "type" expressions and "player" expressions in
"role" expressions. This would still be parseable (although removing
this whitespace would render the CTM document invalid).
P.P.P.S.: One could use the comma ',' as a replacement for the colon ':'
in "role" expressions. The comma separating roles (in "roles"
expressions) may then be replaced by the semicolon, letting associations
look like "type(pr:type0,pr:value0; pr:type1,pr:value1;
pr:type2,pr:value2)". This fits nicely into the current English usage of
semicolons as "super commas", quoting Wikipedia: "A common example of
[the use of semicolons] is to separate the items of a list when some of
the items themselves contain commas.".
P.P.P.P.S.: Here are some examples of how some candidate solutions might
look:

   1. family:parentship(family:mother=rf:diana,
      family:father=rf:charles, family:child=rf:william)
   2. family:parentship(family:mother,rf:diana;
      family:father,rf:charles; family:child,rf:william)
   3. family:parentship(family:mother->rf:diana,
      family:father->rf:charles, family:child->rf:william)
   4. family:parentship(family:mother:=rf:diana,
      family:father:=rf:charles, family:child:=rf:william)
   5. family#parentship(family#mother=rf#diana,
      family#father=rf#charles, family#child=rf#william)
   6. family#parentship(family#mother:rf#diana,
      family#father:rf#charles, family#child:rf#william)
   7. family+parentship(family+mother:rf+diana,
      family+father:rf+charles, family+child:rf+william)
   8. family:parentship(family:mother:rf:diana,
      family:father:rf:charles, family:child:rf:william)


