[sc34wg3] Almost arbitrary markup in resourceData

Lars Marius Garshol sc34wg3@isotopicmaps.org
18 Nov 2003 00:09:56 +0100

I think this was a really useful posting, and it takes us way into
territory that we've never properly discussed, but which was
definitely there and waiting for us to enter it. So, here goes. :-)

* Aad Kamsteeg
| Just allow for a proprietary extension to the schema, but only in a
| formalized way. In terms of RelaxNG: add an empty define
| (notAllowed) for the three XTM elements so users (rather
| maintainers) of a Topic Map can define their own additional markup
| by extending the schema in those parts only.

We have done something similar to this in the as yet unpublished
draft: <resourceData> and <baseNameString> both allow arbitrary markup
so long as it is not in the XTM namespace.
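
As a rough illustration of that rule, here is a minimal Python sketch
that lists any XTM-namespace elements found inside the two open
elements. The namespace URI is the XTM 1.0 one and is an assumption
here (the draft may use another), and the function name is made up for
this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the XTM 1.0 namespace URI; the unpublished draft may differ.
XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
# The two elements said above to allow arbitrary non-XTM markup.
OPEN_ELEMENTS = ("resourceData", "baseNameString")


def xtm_elements_inside_open_content(xtm_source):
    """Return the tags of XTM-namespace elements nested in open elements."""
    root = ET.fromstring(xtm_source)
    prefix = "{%s}" % XTM_NS
    offenders = []
    for name in OPEN_ELEMENTS:
        for container in root.iter(prefix + name):
            for descendant in container.iter():
                if descendant is container:
                    continue  # iter() yields the container itself first
                if descendant.tag.startswith(prefix):
                    offenders.append(descendant.tag)
    return offenders
```

An empty result only means this one rule holds; it says nothing about
whether the document is valid XTM otherwise.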

I think I should repeat here what I see as the role of the schema that
is published as part of the standard: to define the allowed structure
of the XML documents passed to an XTM processor. Basically we have a
combination of RELAX-NG and prose that makes it clear which documents
are allowed in XTM and which ones are not.

HOWEVER! This does not mean that the RELAX-NG schema is meant to have
any use whatsoever beyond that. It really truly is just meant to
shorten the standard so that we don't have to keep saying stuff like
"the <topic> element may only occur within the <topicMap> element,
where it may occur in any position and any number of times".

If someone wants to use it to validate an XTM file they should be

  a) free to do so,

  b) aware that even if the file validates according to the schema it
     need not be valid XTM, and

  c) free to use any other means they want to validate the file.

I realize that this is not the way, say, DocBook or XHTML work, but
then XTM is very different from those markup languages, because XTM is
just a serialization of the real thing, which is the data model. In
DocBook and XHTML the XML *is* the real thing.

This is actually a response to what Aad wrote, even though by this
point it may no longer seem that way. I think what Aad writes makes
good sense, but that there are three practical problems with it, all
of which stem from the issues I went through above.

The first problem is that I think we will quite often see rules like
these in applications:

  a) occurrences of type "chemical formula" have the following
     RELAX-NG grammar: ...,

  b) occurrences of type "mathematical formula" have the following
     RELAX-NG grammar: <insert MathML spec quote or whatever>,

  c) occurrences of type "address" have the following...

Well, you see where this is going. I think weaving the schema for the
application XML together with the schema for the XTM is going to be
quite awkward and also not going to be quite as powerful as one would
want, because the XML schema languages cannot take the topic map
semantics into account. Topic maps truly are not XML, and this is one
place where we see it clearly.

The second problem is similar. If I get XTM following the custom XTM
schema of organization X, and then more XTM according to the schema of
organization Y, and merge it all together, what do I do with the
resulting topic map? The rules for how to validate the application XML
are now in two separate schemas that may turn out to be impossible to
merge together.

The third problem follows on from this. What if, having merged the two
topic maps, I now want to validate the application XML? If I don't
want to serialize back to XTM, and in many cases this may be
near-impossible because the TM is just too big, I'll have to pull out
the rules for the application XML from the custom XTM schemas and then
apply them.

I think we want to allow people to attach schemas to their embedded
XML fragments, but for the above reasons I don't think hooking those
schemas into the XTM RELAX-NG schema is the way to do it. Which leaves
us with the question of how then to do it. (More about that below.)

| When a Topic Maps owner decides to do so, the consequences are
| entirely his. The standard should give some rules in order to at
| least state a proper warning to (ignorant) adopters of that specific
| TM.

This I certainly agree with, and this really has to be our policy
here. If you include application XML in your topic map and then
interchange it, it's your business to make sure that you are doing
something that makes sense, because you have no guarantee that the
people receiving this will be able to make use of it.

I believe we should warn against the problems with embedding XML, but
then users have to make their own choices.

| It must be made clear for any other party who is allowed (or
| granted) to use the TM in question. As a provision for that purpose
| an idea could be to add an optional attribute for the root-element
| that states that this TM has a proprietary extension (so all are
| warned).

We could do that, but on the other hand, once you've read the XTM
document you will see that in any case. And what would we do in cases
where there is no warning, but there turns out to be application XML
in there anyway?

| The standard should state clearly (normative) that when an extension
| is used this attribute is mandatory.

That would mean that we would have to check the flag against the
actual XTM document to see if the flag is set correctly. My feeling is
that we might as well skip the flag.
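
To make the cost of such a check concrete, here is a minimal Python
sketch. The attribute name "extended", its value "yes" and the
namespace URI are all assumptions for illustration only; no name was
actually proposed for the flag.

```python
import xml.etree.ElementTree as ET

# Assumption: the XTM 1.0 namespace URI.
XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
# Hypothetical attribute name and value; nothing concrete was proposed.
FLAG_ATTRIBUTE = "extended"
FLAG_VALUE = "yes"


def contains_foreign_markup(root):
    """True if any element in the document is outside the XTM namespace."""
    prefix = "{%s}" % XTM_NS
    return any(not element.tag.startswith(prefix) for element in root.iter())


def flag_matches_content(xtm_source):
    """Compare the declared flag with what the document really contains."""
    root = ET.fromstring(xtm_source)
    declared = root.get(FLAG_ATTRIBUTE) == FLAG_VALUE
    return declared == contains_foreign_markup(root)
```

Note that the check has to walk the whole document anyway, at which
point the answer is known without the flag.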

However, I think the idea that embedded XML is something people should
be on the lookout for is valuable. I'm just not sure how to get that
into the standard, and whether that kind of guidance is even
appropriate in the standard.

Opinions on this, anyone?

| Some guiding rules could be added in addition to this:
| - The owner of an extended TM is required to publish the extension in
|   cases where this TM is made public or is to be shared with others.
| - The owner of an extended TM is required to publish an instruction
|   what the preferred way of resolving this additional mark-up is in a
|   situation where the extension can not be applied, default rules
|   could be either:
| -- remove the proprietary markup and its content (for things like SVG
|    and Math-ML the most likely solution)
| -- remove the proprietary markup and keep the textual bits (most
|    likely for added mark-up like <b> or <em>).
| - Furthermore the standard could urge the owner of an extended TM to
|   supply a sufficient ruleset (could be in the form of an XSLT
|   stylesheet (??)) how to handle the proprietary mark-up if others want
|   to keep (and use) the added value that is achieved with this
|   extension.

Here I think you are right: there needs to be a way for people to do
this. I'm not sure we want to require this (we don't require people to
create schemas for their topic maps, so why require it for embedded
XML?), but I certainly do think we want to enable people to do things
like these.
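
For what it's worth, the two default rules Aad lists are easy to state
in code. This is only a sketch in Python of what each rule would do to
a single <resourceData> element; the function names are made up for
this example, and it says nothing about how a topic map would declare
which rule applies.

```python
import xml.etree.ElementTree as ET


def drop_markup_and_content(resource_data):
    """First rule: discard embedded elements and everything inside them.

    Only the text that sits directly in <resourceData> survives.
    """
    text = resource_data.text or ""
    for child in resource_data:
        # A child's tail is the text following it, so it belongs to the parent.
        text += child.tail or ""
    return text


def drop_markup_keep_text(resource_data):
    """Second rule: discard the tags but keep all the textual bits."""
    return "".join(resource_data.itertext())
```

With "Water is <b>very</b> wet" as the content, the first rule leaves
the text around the <b> element and the second leaves the whole
sentence.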

The question is: how? Any takers?

| This way the responsibility for extended TM's is entirely for the
| party that created the TM, not for the standards organisation. The
| standard does provide sufficient rules to handle these exceptional
| (?)  cases as decent as possible.

Yep. This has to be the aim, and we should make sure we get this
right. I think we do, but we should make sure.

| PS. I agree with using RelaxNG as the normative schema language. I
| have quite some experience in using Relax because, as consultants /
| designers of schema we use Relax in all situations. We have some
| additional rules in order to enable reliable conversion towards a
| DTD. If interested I don't mind sharing this (and the XSLT
| stylesheets that do the job) with you.

To be honest I'm not yet sure what to do with the DTD (and the XSDL
schema). I'd like to automate the generation of them, but so far I'm
clueless about how to do it, and how I'd want the generated thing to
look. Ideas are welcome.

Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50                  <URL: http://www.garshol.priv.no >