[sc34wg3] Feedback on the CTM draft

Lars Marius Garshol larsga at garshol.priv.no
Mon Aug 7 14:43:25 EDT 2006


Some new comments before I reply to your replies:

  - Why use the terms "assertion" and "assertion blocks" when TMDM uses
    "statement" for (maybe) the same thing? Could we just call them
    topic blocks instead?

  - encoding-directive is missing from the EBNF in Annex A.

* Steve Pepper
>
> [tutorialism]
>
> This is exactly true. But for a first draft we felt most people would
> need something more like a tutorial, and that it would be a waste of
> time to be too much like a specification until the general design has
> stabilized. It's the latter we really need feedback on at this stage.

I think that's fair enough, but it would probably help to have a note  
to this effect in the draft.

> [not true that draft defines mapping to TMDM]
>
> This sentence is borrowed from the XTM spec and signifies our intent:
> it will be true once Annex B has been written. That, in turn, has to
> wait for the grammar to stabilize a little more.

You mean, the mapping to TMDM will be in annex B? Hmmmmm. It seems  
very strange for the real content of the specification to be in an  
annex, if you ask me.

> [which BNF to use]
>
> We would like more input on this. In general, ISO standards are
> obliged to reuse other ISO standards when appropriate ones exist. The
> question is: are there acceptable reasons for NOT using ISO 14977?

The main problem is that it's a highly idiosyncratic EBNF syntax. It  
doesn't follow the normal +*? convention, nor is it the same as the  
equally idiosyncratic (but rather better known) IETF EBNF. It uses

   foo

to mean non-terminal foo once, but

   {foo}

means the same as normal foo*. So using 14977 means everyone will
have to learn another gratuitously different EBNF syntax in order to
read the CTM specification. I think ISO 14977 is a misshapen creature
that deserves a quiet death in obscurity.
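
As a rough illustration of how far the two notations are apart, here is a
minimal Python sketch that rewrites a simple ISO 14977 rule into the more
common +*? style. It is an assumption-laden toy: it only handles un-nested
{...} (repetition) and [...] (optional), treats every comma as the ISO
concatenation operator, and the rule and non-terminal names in the usage
line are made up for the example, not taken from the CTM draft.

```python
import re

def iso14977_to_conventional(rule: str) -> str:
    """Rewrite a simple ISO 14977 rule in the common +*? notation.

    Assumptions (toy sketch only): no nested brackets, and no commas
    or brackets inside quoted terminals.
    """
    # {x} in ISO 14977 means zero or more repetitions: x*
    rule = re.sub(r"\{\s*([^{}]*?)\s*\}", r"(\1)*", rule)
    # [x] in ISO 14977 means optional: x?
    rule = re.sub(r"\[\s*([^\[\]]*?)\s*\]", r"(\1)?", rule)
    # ISO 14977 uses "," for concatenation; the common style uses juxtaposition
    rule = re.sub(r"\s*,\s*", " ", rule)
    # ISO 14977 terminates each rule with ";"
    return rule.rstrip(" ;")

# Hypothetical rule, for illustration only
print(iso14977_to_conventional("topic = id, {assertion}, [terminator];"))
```

Even this tiny converter has to know four separate conventions (braces,
brackets, commas, semicolons), which is the learning cost at issue.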

I think you are probably right that there is a guideline saying we  
should use ISO standards where we can, but in this case I think we  
should just quietly pretend ignorance of ISO 14977 for as long as we  
can.

> [normative references to XTM]
>
> There aren't, but there might be in the future.

I can't imagine why you would need this, but, well, you are the editors.

> [why have both single- and triple-quoted strings]
>
> The argument was that triple-quoted strings permit strings that
> include unescaped quote marks and that this is familiar to many users
> through Python. The question is whether this advantage is big enough
> to warrant the additional syntax. What do others think?

But triple-quotes are longer than just using the normal escape syntax:

   """it's a "feature", they say"""

instead of

   "it's a \"feature\", they say"

I think this just means extra syntax to no real gain for the user.
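
The length argument can be checked with a small Python sketch. It assumes
the two forms are exactly as written in the examples above: triple quotes
add six delimiter characters, while the single-quoted form adds two
delimiters plus one backslash per embedded quote mark. The function name
is made up for the illustration.

```python
def literal_lengths(content: str) -> tuple[int, int]:
    """Return (triple_quoted_length, escaped_length) for a string value.

    Assumes the only character needing an escape is the double quote.
    """
    triple_src = '"""' + content + '"""'
    escaped_src = '"' + content.replace('"', '\\"') + '"'
    return len(triple_src), len(escaped_src)

triple, escaped = literal_lengths('it\'s a "feature", they say')
print(triple, escaped)  # 32 30
```

Under these assumptions the triple-quoted form only becomes shorter once
a string contains more than four quote marks.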

> [datatype in occurrence templates]
>
> Yes. The main reason is to enable greater compactness, since the
> datatype will not have to be specified on every individual instance of
> an occurrence type whose values always have a datatype that is not
> "autodetected".

I think that's good.

> An additional advantage of allowing datatypes in a template MIGHT be
> to enable more datatypes to be autodetected (e.g. "2006" could be
> recognized as an xsd:gYear rather than an xsd:Integer).

Didn't get that. Surely "2006" is a string, and not an integer? Also,  
why would we autodetect a type that's hard-wired in the template? If  
you meant 2006 without the quotes that would make sense, but I think  
it would be simpler to say that all specially typed values must be  
written as strings.

> [why both . and EOL EOL as block terminators]
>
> We have gone back and forth on both of these options. It seems not to
> be possible to get rid of delimiters inside assertion blocks without
> either reducing expressiveness (in the case of comma), or requiring
> some other additional syntax (in the case of semicolon).

Why not just use line breaks for semicolon and leave the comma as it  
is? That way you could use two line breaks for the terminator, and  
ditch the period.

> Regarding the termination of an assertion block, there seemed to be
> strong arguments in favour of both the period (consistency with comma
> and semicolon syntax; conservation of vertical space), and the empty
> line (likely to be used for readability anyway when editing lengthy
> topic maps). So we ended up giving the user the choice.

I guess this is a matter of taste but I would prefer to see just one  
of these. If the linebreak has no significance anywhere else I don't  
think it should have one here.

> One point regarding TMQL (and TMCL): CTM obviously has to be aligned
> with these standards, but the fact that one or the other has made a
> particular design choice in its current draft is not necessarily an
> argument to do things that way. We need to find solutions that fit the
> requirements of all three standards, and that may involve some
> modifications to the current drafts of TMQL and TMCL.

I guess what you are saying is that we may decide to change TMQL
instead of making CTM do what TMQL does just because TMQL does it some
particular way. I agree.

> [remove clause 6 from standard]
>
> Why? Because you think CTM should support all of the TMDM, or for some
> other reason?

This sort of thing does not belong in a standard (it's not
normative). If you really want it, I guess it could go in as a
non-normative annex. The rationale definitely does not belong in a standard.

* Lars Marius Garshol
>
> Comments should not be included in the grammar, since they are removed
> in the lexing stage.

* Steve Pepper
>
> And yet the XML spec, which you suggest using as a model in other
> respects, *does* include comments.

In most formal languages comments are allowed anywhere where  
whitespace is, which makes it a horrible pain to have to specify it  
explicitly in the grammar, because it winds up having to go  
everywhere, and it almost certainly will be forgotten somewhere where  
it was intended to be allowed.

In addition, if you use a parser generator this means you have to  
include the comment production in your code everywhere, which again  
is a real pain, but it's necessary to ensure that comments don't  
occur somewhere where they are not allowed. A much easier solution  
(used in most cases) is to have the lexer recognize and discard  
comments so that when you are matching the token stream against the  
grammar you don't see the comments at all.
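
To make the lexer approach concrete, here is a minimal Python sketch of
a tokenizer that recognizes comments and throws them away together with
whitespace. It is not the CTM lexer: the token classes are deliberately
crude, and it assumes (as in the examples in this mail) that a comment
starts with # and runs to the end of the line.

```python
import re

# Crude token classes for illustration only; not taken from the CTM draft.
TOKEN = re.compile(
    r'(?P<COMMENT>#[^\n]*)'          # "#" to end of line
    r'|(?P<WS>\s+)'                  # any whitespace, including newlines
    r'|(?P<STRING>"(?:\\.|[^"\\])*")'  # double-quoted string with \ escapes
    r'|(?P<WORD>[^\s"#]+)'           # everything else, up to a delimiter
)

def tokenize(text):
    """Return the tokens in text, with comments and whitespace discarded."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        # The parser never sees COMMENT or WS tokens at all.
        if m.lastgroup not in ("COMMENT", "WS"):
            tokens.append(m.group())
        pos = m.end()
    return tokens

print(tokenize('%version ctm 1.0 # a comment\n%encoding "us-ascii"'))
```

Because the comment is dropped before parsing, the grammar does not need
a comment production anywhere, and a comment is automatically legal
wherever whitespace is.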

Neither of these two points applies to XML. XML parsers are not
implemented using parser generators (there were some exceptions
initially, but they were horribly slow), and XML only allows comments
in a couple of places in the grammar.

The way your grammar is currently written, the following would for  
example not be allowed

   %version ctm 1.0 # I stick to 1.0 because 1.1 sucks
   %encoding "us-ascii"

which I doubt you intended. Similarly, you don't allow

   puccini  # FIXME: don't have all the data yet
     "Giacomo Puccini".

which I don't think was intentional, either.

To avoid all of this it's better to just allow comments everywhere  
whitespace is allowed, and to state this just once. It does mean that  
people can write things like

   %version ctm # why do we have to say it's CTM, anyway?
   1.0
   %name ...

but if they want to, why not?

> Some of the editors felt it would be wrong to prevent this; others
> felt we should encourage the best practice of keeping all directives
> and templates in the header. More opinions on this are solicited.

You can count me in the "only allow them at the top" camp. One reason  
for this is that if you write

   puccini
      sort-name "puccini, giacomo" .

   %name sort-name

   someone
      sort-name "else, someone" .

then one of these will be an occurrence, and the other a name. In  
other words: you can fuck up your data by simply having one statement  
too high up. If it gave a parsing error it wouldn't be a problem, but  
this is. (Warnings are no good.)
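
The problem can be sketched in a few lines of Python. This is a model of
the behaviour, not of the draft's actual processing rules: it assumes a
single top-to-bottom pass in which %name takes effect only for the
statements that follow it, and the item tuples and the strict flag are
invented for the illustration.

```python
def interpret(items, strict=False):
    """Classify statements as names or occurrences in one pass.

    items: list of ("directive", type_id) for a %name directive, or
           ("statement", topic, type_id) for a statement in a topic block.
    strict: if True, reject directives that follow any statement.
    """
    name_types = set()
    seen_statement = False
    result = []
    for item in items:
        if item[0] == "directive":
            if strict and seen_statement:
                raise SyntaxError(
                    f"%name {item[1]} must precede all topic blocks")
            name_types.add(item[1])
        else:
            _, topic, type_id = item
            seen_statement = True
            kind = "name" if type_id in name_types else "occurrence"
            result.append((topic, type_id, kind))
    return result

items = [
    ("statement", "puccini", "sort-name"),
    ("directive", "sort-name"),
    ("statement", "someone", "sort-name"),
]
print(interpret(items))
```

In the permissive mode the same sort-name type silently ends up with two
different meanings; with strict=True the misplaced directive is a hard
error instead.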

--
Lars Marius Garshol, Ontopian               http://www.ontopia.net
+47 98 21 55 50                             http://www.garshol.priv.no



