[sc34wg3] Almost arbitrary markup in resourceData

Tue, 11 Nov 2003 16:30:25 -0500

My purpose in contributing to this particular discussion was twofold: 1) to
state a use case which drove the discussion about arbitrary markup in some
areas of the XTM syntax; and 2) I've been too busy doing real work to
participate to the level I would have liked and had a little bit of spare
time to pop in and see what was happening.  In fact, one of the hacks that
Lars Marius mentioned in his last contribution to this discussion was what
we had to do to get around this limitation/feature in the standard.

I am speaking from the front lines of the user community, not the tool
vendor community, not the academic community.  I'm claiming my stake as part
of the target market - the people who want to make money using the tools and
standard as opposed to those implementing or studying.  I'm trying to
convince people here to spend a great deal of money to move to a standard
that I believe could be very useful in a large number of applications.
Currently I spend a lot of time trying to convince people that topic maps
are better as a general architectural component than RDF/RDFS/OWL.  The
problem is that there are a few idiosyncrasies and issues with topic maps
that make it an uphill battle.  The closed model (arbitrary markup) is but
one of them.  The lack of TMCL and TMQL is among the other issues.  If people
are interested in the other issues, I will reply as time allows in a
separate thread.

I will be interested to see how this one plays out in the final update.

Anyway, a few responses and then I need to get back to work:

[...]

> In your last message, you said "something useful." This time, "something
> interesting." But if you don't know beforehand what kind of markup you're
> going to receive, how are you going to handle it?

I know exactly what kind of data I'm going to receive.  And it is both
interesting and useful ;-).  It is also not marked up according to the XTM
syntax and will likely never be.

[...]

> The idea that XML provides an ability to "shield off" markup that matters
> is where the problem arises -- the non-XTM stuff is left hanging in some
> nether world of unknown semantics, unknown processing, unknown application
> handling.

That's fine.  Leave it hanging; as long as it is left there, I don't care.

[...]

> One might. I don't. If you require a standard interchange syntax to be
> able to allow *everyone* to include *everything*, how can that possibly
> still be considered an interchange syntax?

How?  By clearly and succinctly defining what is supposed to happen to those
items defined in the interchange syntax and clearly stating that anything
else is left unmolested.  (See the SVG spec discussion below.)

> You're misinterpreting what I said, and I have difficulty believing
> you think that's what I meant. The "semantics" (gad how I hate that
> word) of text (e.g., PCDATA) is completely unlimited. We have hundreds
> of thousands of books with plenty of "semantics" expressed in text. The
> Topic Map paradigm does not rely on your ability to handle Lexus Nexus
> in some custom application to still work. You can create custom Topic
> Map applications all you like. The whole point of the last few years'
> work has been to develop a means of establishing what a Topic Map
> Application is. I'm arguing for purity of the interchange syntax, not
> trying to stop you from creating your own TMAs. Global knowledge
> interchange doesn't happen when you allow proprietary markup that
> obscures the ability to interchange, it creates islands of proprietary
> functionality. It runs completely counter to Newcomb & Co.'s vision.

Yes, the semantics of PCDATA are unlimited, but we're inserting markup so
that the semantic meaning of some of the text is more narrowly defined (e.g.
identifying the given name and surname of people who are the subjects of
topics without having to double store the data using variants).  Yes, we
could define custom apps all we want and we have for 30+ years.  But one of
the reasons for adopting XML and standards based on it is that we'd much
rather use COTS tools as much as possible.  Since XML is the lingua franca
for our pipeline, XTM seems a logical choice for topic map-ish data, but its
closed model (purity of XML interchange syntax) has caused some challenges
that we brought forth to the topic map community for consideration.  (See
the part about staking my claim above.)
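
To make the use case concrete, here is a minimal sketch in Python of a name
carrying extra markup.  The `ln` namespace URI and the `given`/`surname`
element names are hypothetical, made up for illustration, and the fragment is
deliberately NOT valid XTM 1.0, which is exactly the limitation under
discussion.

```python
import xml.etree.ElementTree as ET

XTM = "http://www.topicmaps.org/xtm/1.0/"
LN = "http://example.com/ns/ln"  # hypothetical namespace, for illustration only

# Not valid XTM 1.0: baseNameString is only allowed to hold character data.
doc = f"""
<topic xmlns="{XTM}" xmlns:ln="{LN}" id="jane-doe">
  <baseName>
    <baseNameString><ln:given>Jane</ln:given> <ln:surname>Doe</ln:surname></baseNameString>
  </baseName>
</topic>
""".strip()


def read_name(topic):
    """Return (display string, given name, surname) for a topic's base name."""
    name = topic.find(f"{{{XTM}}}baseName/{{{XTM}}}baseNameString")
    # A consumer that knows nothing about the ln markup still gets plain text.
    display = "".join(name.itertext())
    # A consumer that does know it gets the parts without storing them twice.
    given = name.findtext(f"{{{LN}}}given")
    surname = name.findtext(f"{{{LN}}}surname")
    return display, given, surname


topic = ET.fromstring(doc)
display, given, surname = read_name(topic)
print(display, "|", given, "|", surname)
```

The name is stored once; the plain string and the structured parts are both
derived from it, so no variants are needed.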

This issue brings the community to an interesting point.  Perceived purity
vs. perceived usability: which is more important?

[...]

> By "understand" it I mean that a Topic Map application can receive *any*
> XTM document and be guaranteed of the ability to correctly process the
> content, unambiguously, without having to throw away, "store", hide,
> spindle, mutilate, or otherwise incorrectly handle it. If that means that
> XTM documents can't directly embed Lexus-Nexus content, that's absolutely
> fine. XTM is meant as a standard, interchange format. Your application is
> one of thousands. We can't accommodate everyone's pet project in a
> standard (it was tried with HTML).

OK, if that's the community decision.  We would then need to examine whether
our hack is still the best path forward or if there are perhaps other XML
based tools that could be used to accomplish similar goals.  Also, I have a
hard time believing that we are the only people that have bumped our heads
on this issue.  Perhaps someone with a broader range of knowledge on this
could contribute here.

> > Why can't we standardize the topic map-ish stuff and freely admit that
> > we're not going to touch anything outside of that domain?
>
> What does that mean? "not touch"? If I send you an XTM document today,
> it will function essentially the same within any XTM-compliant
> application. I can open Steve's Opera Topic Map with Ceryle. If suddenly
> the freedom to embed any proprietary markup within XTM exists, my
> application now has to deal with whatever the hell happens to be there.

Not touch = leave in the tree, do not expect to process with any topic map
semantic attached.  (See the SVG spec discussion below.)

> > Interoperability occurs when predictable things happen to the elements
> > within the XTM namespace.
> 
> Untrue. Interoperability occurs when predictable things happen to
> the elements within XTM *documents*. Allowing non-XTM content means
> that application A differs from application B differs from application
> C in handling any given document containing markup that A, B, and C
> can correctly process. If application B differs from A and C in
> being able to handle MathML, users of A and C don't have the same
> experience as users of B. That's PRECISELY the problem with Microsoft's
> approach to software. Hell, they managed to chair a Unicode committee
> and added character-level codes for bold and italic and positioning
> into Unicode, so that only Microsoft applications (or vendors who
> were willing to do what Microsoft did) would "correctly" process those
> weird-ass codes.

And an XTM *document* is what you find between the start and end XTM tags,
correct?  Did we ever define what happens when there are multiple XTM
*documents* within a single XML document?  What would an XTM processor do
with that?  Oops sorry, that was another of those other items I alluded to
before - different thread.

Unfortunately we have to deal with different levels of user experience every
day.  And you know what?  In our world, it's OK.  Graceful degradation can
sometimes be the best, if not only, solution.  If a topic map engine is
driving a web site and extra (arbitrary) tags make it through the
stylesheets, what would happen when they hit the browser?  The text would be
shown and the tags wouldn't.  Graceful degradation at work.  However, this
might not even be needed.  (See the SVG spec discussion below.)
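
As a rough stand-in for that browser behaviour, the sketch below uses
Python's standard `html.parser` to keep the character data and drop every
tag it is not asked to handle.  The `given` and `surname` tags are
hypothetical leftovers from a stylesheet, and a real browser does far more
than this; the point is only that the text survives.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and silently skip every tag, known or not."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Return the text a reader would still see if all tags were ignored."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# <given> and <surname> are hypothetical tags that slipped through.
page = "<p>Author: <given>Jane</given> <surname>Doe</surname></p>"
print(visible_text(page))
```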

> > I would expect that *within* LexisNexis, interoperability would occur
> > based on the other markup as well because something predictable will
> > happen to it as well.  I wouldn't expect Joe Blow off the street to know
> > what to do with the LN markup.  As a topic map owner, I could translate
> > the LN markup into XHTML or something more general or even strip it
> > before the topic map went out for public interchange.  But, as I said,
> > the markup is VERY useful in internal applications and that's where the
> > requirement comes from.
> 
> So use it in internal applications. But don't expect a standardized
> interchange syntax to allow it. Where's the logic in that?

The logic is as follows: 
- We want to use COTS tools as much as possible in our pipeline.
- The COTS tools all use the XML interchange syntax as an input mechanism.
- I can't include the arbitrary data in the data going into the tools, nor
can I get it back out of the tools, which significantly reduces the
possibility of using the COTS tools.
- I don't have COTS tools that support this internally, therefore I now have
to develop in-house tools.
- In-house tools are expensive to develop and maintain, especially on a
not-very-widely-used standard, since knowledgeable developers are few and
far between.
- In-house tools are most likely too expensive, therefore I run a higher
likelihood that the standard itself won't be used.
- Best way to increase odds of adoption of the standard: expand what is
allowed in resourceData and names.

> > > Murray said:
> > > So I'm guessing the XHTML+XTM DTD wouldn't do it?
> > Eric said:
> > Not for my use case.  We plan to use real XML with semantic tag names and
> > everything.  But I could see its use for those whose application is only
> > presentation.
> Murray said:
> I find it only humourous that you somehow think "real XML" has "semantic"
> tag names (whatever that means), and by inference that XTM is perhaps
> not "real XML". Eric, you're too much of a markup expert to seriously
> mean that. Get real. SVG, MathML, all function because they are distinct
> markup languages. What I hear now is that some people don't want markup
> languages, they want arbitrary XML markup, i.e., no restrictions. This
> sounds like Dave Raggett talking, not you. Few of the markup experts I've
> talked to in the last five years (including about half of the original
> XML WG) think XML Namespaces is anything but a colossal failure. If
> that's "real XML" I prefer the unreal.

semantic tag names != p, li, h1, h2, h3, h4, h5, h6, a, table, td, tr, div,
ul, ol, etc.

My statement applied to the XHTML portion of your proposal.  I know the XTM
standard and it is full of semantic markup, so I'm not sure where you got
your implication.  

It's interesting that you mention SVG: From
http://www.w3.org/TR/SVG11/extend.html section 23.1 "SVG allows inclusion of
elements from foreign namespaces anywhere with the SVG content. In general,
the SVG user agent will include the unknown elements in the DOM but will
otherwise ignore unknown elements."  It goes on to say "Additionally, SVG
allows inclusion of attributes from foreign namespaces on any SVG element.
The SVG user agent will include unknown attributes in the DOM but will
otherwise ignore unknown attributes."

Looks like SVG allows arbitrary markup to coexist within it, doesn't it?  I
assume that we could safely replace the phrase "SVG user agent" with "topic
map application" and it would be a starting point for the same capability
within XTM.  I'm not saying that XTM needs to implement the DOM.  However,
it seems that the internal model for XTM could just as easily store these
new elements, just like SVG/DOM.  Look at the SVG 1.1 DTD.  There are
entities all over the DTD that allow new markup within a SVG document
(%SVG.svg.extra.content;, %SVG.g.extra.content;, %SVG.defs.extra.content;,
%SVG.desc.extra.content;, etc.).  Why can't XTM provide a similar
mechanism?  Look at section 23 of the SVG spec.  It's all about adding
foreign namespaces and private data.
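
A minimal sketch of what that SVG-style rule might look like for a topic map
application, using Python's `xml.etree.ElementTree` as a stand-in for
whatever internal model a real engine uses.  The `ln` namespace, the `cite`
element, and the function name are all hypothetical, and nothing here is
taken from an existing XTM processor.

```python
import xml.etree.ElementTree as ET

XTM = "http://www.topicmaps.org/xtm/1.0/"
LN = "http://example.com/ns/ln"  # hypothetical foreign namespace

# Keep readable prefixes when the tree is written back out.
ET.register_namespace("", XTM)
ET.register_namespace("ln", LN)

doc = f"""
<topicMap xmlns="{XTM}" xmlns:ln="{LN}">
  <topic id="case-123">
    <occurrence>
      <resourceData>See <ln:cite>Example v. Example</ln:cite> for details.</resourceData>
    </occurrence>
  </topic>
</topicMap>
""".strip()


def split_by_namespace(root):
    """Sort element tags into those with XTM semantics and those without."""
    known, foreign = [], []
    for elem in root.iter():
        if elem.tag.startswith(f"{{{XTM}}}"):
            known.append(elem.tag)    # processed with topic map semantics
        else:
            foreign.append(elem.tag)  # left in the tree, otherwise ignored
    return known, foreign


root = ET.fromstring(doc)
known, foreign = split_by_namespace(root)

# "Not touch": the foreign element is still there after a round trip.
round_tripped = ET.tostring(root, encoding="unicode")
print(len(known), "XTM elements;", len(foreign), "foreign element kept")
```

The only contract is the one the SVG text states: elements in the XTM
namespace get predictable treatment, and everything else stays in the tree
untouched.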

I agree - namespaces may not be the best, or even correct, solution.
Colossal failure might be a bit strong, though.  They are what is available
in the current XML world and smart people are making them work with some
degree of success.  Evil or not, they are the best thing we have right now
and they're part of my everyday life.