[sc34wg3] Almost arbitrary markup in resourceData

Mon, 17 Nov 2003 14:03:07 -0500

I've stayed out of the debate for a few days, but I believe that Bernard has
captured something important: any external markup that happens inside the
elements in question simply isn't the responsibility of the TM application
per se. It's not a matter of breaking XTM; it's rather something on which
XTM should be silent. The TM application, that is, the TM engine, for lack
of a better name, simply passes this data on to something else to process.
In that sense, it's like notation data that Martin mentioned. We've always
had tools like notations in SGML to allow us to encapsulate foreign data; we
might as well follow that pattern here.

As for interchange of arbitrary data, that's never been a goal of the SGML
community. The ODA folks may have claimed to be seeking purely arbitrary
interchange, but that was another of their will-o-the-wisps. I was saying 20
years ago that we in the SGML world were interested in constrained,
prearranged interchange. That's one of the reasons we developed DTDs and
their heirs, to communicate something about what data structures we intended
to interchange. DTDs, schemas, and their ilk still have known semantic
limitations, but they're some hedge against the receipt of absolutely
unforeseen data dropping in out of the blue. We should be able to use the
tools provided us by the couple of decades of experience we've had with
SGML/XML markup instead of throwing away those strengths just because we're
doing TMs.

If I want to send a TM to an unknown recipient, I'll play it safe and avoid
anything not in XTM. But when I know my recipient and we have negotiated our
environment, I see no reason why we shouldn't consider additional markup as
part of the environment; I don't want ISO 13250 to tell me I can't structure
my application environment. I don't see why resourceData and baseNameString
shouldn't be treated as inline occurrence data. If I want to do some
programming (or get some vendor to do some programming) so that I can style
(or do whatever with) the negotiated data in my TM browser, then that means
I can sell the TM application more easily to my potential audience. If my
browser happens to be M$IE, for example, why shouldn't it be allowed to
process <xhtml:sup> or some other such element, at the same time it's
processing the XHTML pages generated by my TM engine? The browser is never
seeing XTM data; that's handled by the TM engine in the background. I expect
a TM engine to use a standard Web browser as its user agent; the TM engine
should pass through data it doesn't understand.
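
To make the pass-through idea concrete, here is a minimal sketch in Python of
an engine that generates an XHTML page and copies occurrence data into it
untouched. Everything in it is an assumption made for illustration: the
function name render_topic_page, the page layout, and the choice to declare
the xhtml: prefix on the root element are not taken from XTM, ISO 13250, or
any real TM engine.

```python
from html import escape
from typing import List


def render_topic_page(topic_name: str, occurrence_data: List[str]) -> str:
    """Build an XHTML page for one topic (hypothetical engine output).

    The topic name is text owned by the engine, so it is escaped.
    Each occurrence_data string is negotiated markup the engine does not
    understand, so it is copied into the page verbatim for the browser.
    """
    items = "\n".join("      <li>" + data + "</li>" for data in occurrence_data)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml"\n'
        '      xmlns:xhtml="http://www.w3.org/1999/xhtml">\n'
        "  <head><title>" + escape(topic_name) + "</title></head>\n"
        "  <body>\n"
        "    <h1>" + escape(topic_name) + "</h1>\n"
        "    <ul>\n" + items + "\n    </ul>\n"
        "  </body>\n"
        "</html>"
    )


# Hypothetical occurrence content that uses negotiated inline markup.
DATA = "E = mc<xhtml:sup>2</xhtml:sup>"
PAGE = render_topic_page("Mass & energy", [DATA])
```

The browser, not the engine, decides what to do with <xhtml:sup>; the engine
only has to avoid damaging it on the way through.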

Another take on this issue: what is within the scope of SC34 to do.

I get the impression from Martin that his goal (at the least) was not to
constrain what sort of data appeared in a TM. ISO 13250 was written in terms
of architectural forms, and those could be applied to all sorts of
SGML/HyTime applications. There was no DTD that said what all the elements
of any application could be, much less their content models, just a
mechanism for the AF overlay that interpreted certain hyperlink-related
elements. The specification of PCDATA was not part of the original design;
it's rather an artifact of a specific DTD developed in the XTM process. We
accepted that DTD into 13250 ex post facto, knowing that it had certain
issues associated with it. But those issues were not intrinsic to the 13250
design.

I believe what I hear Martin saying is that we ought to be looking at ways
to make XTM capture more of the original spirit of TMs. I think the argument
about corrupting the purity of XTM is missing the goals we should have in
the standardization process: we should be seeking as much interchange
capability as we see the feasibility of implementing, not becoming rigid
about a single DTD which was, after all, only a 1.0 design. 

I'd like to hear more from Martin and the other two original editors on what
they thought their goals were for TM interchange.

Jim Mason

-----Original Message-----
From: Bernard Vatant [mailto:bernard.vatant@mondeca.com] 
Sent: Monday, November 17, 2003 1:15 PM
To: sc34wg3@isotopicmaps.org
Subject: RE: [sc34wg3] Almost arbitrary markup in resourceData

Hello all

I've been considering for quite a while before jumping into that can of worms.
Rather than follow up any of the ongoing argument threads that have turned
out to be quite hot, I will step back to the original Lars Marius question,
to which, it seems to me, a pragmatic, non-theological answer can be given
without conflicting with the fundamental principles of the Topic Maps
paradigm, on which I am optimistic enough to believe everyone in this forum
(Murray included) agrees.

*Lars Marius

> do we want to allow XTM elements to appear inside these elements? That 
> is, is
>
>   <resourceData>XTM is an <topicRef xlink:href="#XML"/>-based markup
>   language.</resourceData>
>
> OK? If so, what does it mean?

The latter is the fundamental question. My first-cut answer would be, along
the lines of Graham's one ("no big deal"), that it just *can't mean
anything* from the viewpoint of a TM application.

From a TM application viewpoint <resourceData> content should be a black
box. The default behaviour of a TM application should be to store and pass
that box to its environment "as is" without opening it, processing it, or
trying by any means to figure what is inside and interpret it. And, agreeing
again with Graham on that, there is IMO absolutely no difference, other than
syntactic, from what happens with <resourceRef>. The specification does not
put any limits or constraints on what you can get when you dereference a
<resourceRef>, and that's good, so why should it put limitations or
constraints on what you can get when you "open", so to speak, the
<resourceData> box? I've always understood <resourceData> as just a shortcut
for <resourceRef id="foo" xlink:href="#foo">. Not sure this is valid syntax,
but see what I mean : the referenced resource is *here* in the file.
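
A rough sketch of that equivalence, in Python and under stated assumptions:
the Occurrence class, the resource_of function and the fake_fetch stand-in
are hypothetical names invented for this illustration, not part of XTM or of
any engine's API. The point is only that the engine hands back the same kind
of opaque thing in both cases and never looks inside it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical engine-side view of an XTM occurrence."""
    inline_data: Optional[str] = None  # raw content of <resourceData>
    href: Optional[str] = None         # xlink:href of <resourceRef>


def resource_of(occ: Occurrence, fetch: Callable[[str], str]) -> str:
    """Return the resource as an opaque string, without interpreting it."""
    if occ.inline_data is not None:
        return occ.inline_data         # the resource is *here* in the file
    if occ.href is not None:
        return fetch(occ.href)         # the resource lives somewhere else
    raise ValueError("occurrence has neither resourceData nor resourceRef")


def fake_fetch(uri: str) -> str:
    """Stand-in for real dereferencing; returns a canned payload."""
    return "<p>fetched from " + uri + "</p>"


# The inline case reuses the Lars Marius example as its payload.
INLINE = Occurrence(
    inline_data='XTM is an <topicRef xlink:href="#XML"/>-based markup language.'
)
# The address below is a made-up placeholder.
REMOTE = Occurrence(href="http://example.org/about-xtm")
```

Whatever receives the result of resource_of() cannot tell, and need not
care, which of the two syntaxes the map used.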

So, even if it allows extra markup in the <resourceData> box, the XTM
specification IMO should not say anything about any allowed, recommended or
forbidden syntax, and even less about the semantics of any of it, and the
conformance of a TM application should not include any capacity to handle
it.

Now if a specific application wants to develop specific features based on the
markup embedded in <resourceData> (and I believe Jim and Eric and Martin
have excellent, different reasons to want that), the architecture of the
application should carefully distinguish what belongs to TM processing
(handling <resourceData> as black boxes) from what is "ad hoc" processing
able to open the boxes and deal with their content ... And any
implementation of that kind should be well aware that this ad hoc processing
has nothing to do with the Topic Maps or XTM specifications.

Where I agree with Murray is that any kind of mix-up of the XTM namespace with
extra namespaces, defined by the XTM specification itself, would be opening a
can of worms and is not a good idea at all. Now I won't argue about what
validation is or is not, or whether the DTD or the RELAX-NG schema should be
normative. I'm quite agnostic about that, as long as the schema makes clear
the above boundary between the outside and inside of "resource boxes".

So I would see some recommendation in the spec along the lines of : "You can
do that, but think twice about what will be interoperable with whom."

And also
"You can do that, but think twice if you could not make it another, more
interoperable way."

I don't think the latter has been considered that much in the current
debate. Allowing embedded markup could open the door to lazy modeling,
meaning by that it might often be the case (and well, if you adopt the
Reference Model philosophy, it certainly *is* always the case) that
semantics captured in the embedded markup could have been expressed as
proper TM information at a finer level of granularity. And the
specification prose should recommend doing so whenever possible.

Example of a "lazy occurrence" of type "PostalAddress" for topic "John
Smith".

<resourceData>
	<street>Main Street</street>
	<number>23</number>
	<city>Nothing Gulch</city>
</resourceData>

It's clear that the lazy TM author could (should?) have defined
"PostalAddress" as a topic class, then "street", "number" and "city" as
occurrence types, and linked "John Smith" to "John Smith's address" using a
"PersonalAddress" association.
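
As a rough illustration of that reduction, here is a Python sketch that
unfolds the lazy occurrence above into finer-grained pieces. The function
name unfold_address, the dictionary layout, and the "-address" id suffix are
assumptions made for this example; they are not XTM syntax and not how any
particular engine stores topics.

```python
import xml.etree.ElementTree as ET

# The "lazy occurrence" from the example above.
LAZY = """<resourceData>
	<street>Main Street</street>
	<number>23</number>
	<city>Nothing Gulch</city>
</resourceData>"""


def unfold_address(owner_id: str, lazy_xml: str) -> dict:
    """Turn embedded address markup into a topic plus an association.

    Each child element becomes one typed occurrence on a new address
    topic; the element name is used as the occurrence type.
    """
    root = ET.fromstring(lazy_xml)
    address_id = owner_id + "-address"  # hypothetical naming scheme
    address_topic = {
        "id": address_id,
        "instance_of": "PostalAddress",
        "occurrences": [
            {"type": child.tag, "data": (child.text or "").strip()}
            for child in root
        ],
    }
    association = {
        "type": "PersonalAddress",
        "members": [owner_id, address_id],
    }
    return {"topic": address_topic, "association": association}


RESULT = unfold_address("john-smith", LAZY)
```

This is of course the ad hoc kind of processing: it only works because the
author of the code knows in advance what is inside the box.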

I'm not pretending that every embedded markup case can practically and easily
be boiled down to that kind of reduction, but my experience on the ground so
far, in Mondeca's real-world implementations, is that even in cases where
representation of fine-grained information embedded in existing resources
has been needed, a workaround to embedded markup has been found.

Bernard

Bernard Vatant
Senior Consultant
Knowledge Engineering
Mondeca - www.mondeca.com
bernard.vatant@mondeca.com
