[sc34wg3] Almost arbitrary markup in resourceData

Sun, 16 Nov 2003 17:20:52 -0000

Having been away looking at the next generation of applications with the
European Commission I have not been able to add my voice to the debate this
week, otherwise I would have had time to get very hot under the collar by
now :-)

I'm going to take just one of the messages and expand on that as I think it
is the most sensible one of the 50 I've just had the misfortune to plough
through.

Dmitry wrote

> I would like to concentrate more on specific suggestions:
>
> TMDM:
> Define additional property for occurrence item: encoding type.
> Encoding type is enumeration:
> . Base64,
> . XML,
> . PCDATA

I will give my reasons for seconding this, and suggest an extension to it
that would make it even more useful, though it should not only be applicable
to occurrence items but to anywhere resourceData is applicable.

Murray seems to be under the misapprehension that only Eric and Jim have
applications that need embedded markup in the three PCDATA elements of XTM.
This is far from the case. Any topic map that is based on advanced
scientific data needs something more than XHTML to mark up the contents of
resourceData. Expecting all scientific data to be stored externally to the
topic maps will just make them unusable in emergency situations. A key
factor of use in many applications is whether or not they can be applied
when access to external resources is not available. The safest way to
ensure this is to make them free standing, with all the data in a single
file. Requiring an application to ensure it has access to all relevant files
before it can start is a surefire way of making it too slow to be relied on
in an emergency, yet emergencies are just where topic maps can be most
useful. Embedded markup is the only solution to this problem, which is the
first reason I support Dmitry's proposal.

The second is historical, and will probably be rejected by all concerned,
but is stated here for the record. One of the features that was in the
original ISO 13250 standard design was the ability to use notations to
control the contents of display names. The reason for this was that we
foresaw situations where you would want to use symbols to identify
topics. For example, I might want to do a topic map about international road
signs. We anticipated that such signs might be available in a number of
notations, e.g. GIF, PNG or JPEG. We also recognized that new formats, such
as SVG, would be forthcoming. Therefore the standard was defined using
Notation as a clean way of supporting this well defined user requirement. It
was envisaged that you might want to provide the same symbol in a number of
different formats, or multiple times in the same format (e.g. for a black
and white printable image of a coloured road sign). Again, having to force
all this information into external files makes the interchangeability,
reliability and use of topic maps problematical.

There are, however, arguments against allowing "arbitrary XML" as the
contents of any element. Obviously it must be well-formed (most of the
stupid examples given were so obviously not XML because of this - please
stick to valid XML markup, not pseudo-markup, in examples). It must also be
valid. This can only be so if the DTD/Schema/RelaxNG grammar is identified.
Relax NG uses a namespace attribute for this purpose. I would suggest that
we have a similar mechanism, but with a clear restriction of a single schema
per TM element. I would suggest that there should be two attributes:

a) type
b) source

The type attribute should be restricted to:
  (PCDATA|XML-DTD|XML-Schema|RELAX-NG|Notation)
with the default being PCDATA.
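
To make the two attributes concrete, here is a minimal Python sketch of how a
system might read them. It is an illustration only: the un-namespaced
resourceData element, the attribute spellings and the function name
read_resource_data are assumptions, not part of XTM or of any agreed syntax.

```python
import xml.etree.ElementTree as ET

# The five values proposed above; PCDATA is the default.
ALLOWED_TYPES = ("PCDATA", "XML-DTD", "XML-Schema", "RELAX-NG", "Notation")
DEFAULT_TYPE = "PCDATA"


def read_resource_data(xml_text):
    """Return (type, source) for a hypothetical <resourceData> element."""
    element = ET.fromstring(xml_text)
    # A missing type attribute falls back to the proposed default.
    data_type = element.get("type", DEFAULT_TYPE)
    if data_type not in ALLOWED_TYPES:
        raise ValueError("unsupported type: %r" % data_type)
    # source is returned as written; None if the attribute is absent.
    return data_type, element.get("source")
```

An element with no type attribute is treated as PCDATA, and any value outside
the five listed is rejected.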

NB: I've split the XML into three types simply to allow systems to precheck
whether or not they have a processor capable of validating the source, and
so that they can pass it off to relevant processors as required.
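
The precheck and hand-off described in this note could be sketched roughly as
follows in Python. The registry of processors and the function names are
hypothetical, and the stand-in processors do no real validation; a real
system would register DTD, XML Schema or RELAX NG validators here.

```python
def unsupported_types(types_used, processors):
    """Return, sorted, the types used in a topic map that have no processor."""
    return sorted(set(types_used) - set(processors))


def hand_off(data_type, content, processors):
    """Pass content to the processor registered for its type."""
    if data_type not in processors:
        raise LookupError("no processor available for %s" % data_type)
    return processors[data_type](content)


# Stand-in processors keyed by the proposed type values (illustrative only).
processors = {
    "PCDATA": lambda text: text,
    "XML-DTD": lambda text: "checked against DTD: " + text,
}

# Precheck before loading: which of the types in use cannot be handled here?
print(unsupported_types(["PCDATA", "RELAX-NG", "XML-DTD"], processors))
```

Keeping the three XML kinds distinct is what lets the precheck report exactly
which validator is missing, rather than a generic "cannot process XML".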

For XML-coded resources the source attribute should be a URL that identifies
the namespace of the XML document. In the case of RELAX-NG this would become
the value of the namespace attribute. In the case of Schemas it would be the
URL that the namespace component of the local element names resolves to. In
the case of DTDs it should be a URL identifying a publicly available
source for the DTD. In all three cases it should be an error if the relevant
definition is not accessible from the URL. In the case of Notation, source
would be a URL identifying open source software capable of processing the
data and returning a Web displayable image of it. (This might be
controversial, but I can't at present think of any logical way of making it
distributable otherwise.) For PCDATA the source must be a URL that is a PSI
that identifies the character set employed, where we publish in the TM
standard PSIs for ASCII, ISO 10646 (Unicode) and each of the normally
referenced subsets of ISO 10646.
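
As a rough summary of the paragraph above, the following Python sketch records
what the source URL identifies for each type and checks only that the value
looks like an absolute URL. The function name and the wording of the table
are assumptions; the sketch deliberately does not test whether the definition
can actually be retrieved from the URL, since that needs network access.

```python
from urllib.parse import urlsplit

# What the source URL identifies for each type, paraphrasing the text above.
SOURCE_MEANING = {
    "PCDATA": "PSI for the character set employed",
    "XML-DTD": "publicly available source for the DTD",
    "XML-Schema": "namespace URL of the schema",
    "RELAX-NG": "value of the namespace attribute",
    "Notation": "software able to render the data as a displayable image",
}


def describe_source(data_type, source):
    """Check that source looks like an absolute URL and say what it identifies."""
    if data_type not in SOURCE_MEANING:
        raise ValueError("unknown type: %r" % data_type)
    parts = urlsplit(source)
    # Syntactic check only: a scheme and a host must both be present.
    if not (parts.scheme and parts.netloc):
        raise ValueError("source is not an absolute URL: %r" % source)
    return "%s identifies the %s" % (source, SOURCE_MEANING[data_type])
```

For instance, describe_source("PCDATA", "http://example.org/psi/ascii") would
accept the value, where the example.org address is a placeholder and not a
published PSI.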

(And before anyone says I'm inconsistent in arguing that the DTDs required
for validation be accessible from the URL and yet not wanting the resourceData
to be accessible via the web, there are times when you don't need, or don't
have time, to do validation, and only need to say that the validation was done
using this mechanism. A more important factor for applications is access to
the relevant XSLT data for the local application.)

Martin Bryan
IS-Thought: Thinkers for the Information Society
29 Oldbury Orchard, Churchdown, Glos. GL3 2PU, UK
Phone: +44 1452 714029 Fax: +44 1452 859991
E-mail: martin@is-thought.co.uk