Guide to the topic map standards

Please don't link to this page. It will go into the ISO document registry any day now.

This document describes what is happening with topic maps standardization right now. It describes the current activities, the problems they are intended to solve, and how those problems came to be. (In the opposite order, for ease of understanding.)

The past

The topic maps work started out within the International Organization for Standardization (ISO), in a part of it today known as SC 34 (where SC is short for subcommittee). This subcommittee works with SGML, DSSSL, HyTime, font standards, topic maps, and the new XML schema language framework called DSDL. Within SC34 the topic maps work is done by WG3.

The first substantial result of the topic maps effort was ISO 13250:2000, an ISO standard that defined a syntax for topic maps. This syntax was an SGML DTD, which used the ISO 10744 HyTime standard for linking and addressing, and so the syntax is known as HyTM (short for HyTime Topic Maps).

HyTM is not an XML syntax, is not a fixed DTD, and does not use URIs to refer to information resources. Because the DTD was not fixed, each topic map software developer made its own HyTM version, derived from the standard HyTM DTD. These things were seen as problems at the time, and in order to adapt topic maps to the web the TopicMaps.Org organization was set up to create a new topic map syntax based on XML and URIs. The syntax TopicMaps.Org created is known as XTM (XML Topic Maps), and it solves the problems with HyTM. Today, the HyTM syntax is rarely used, as most people use XTM.

In October 2001 the XTM DTD was accepted into ISO 13250, and so ISO 13250 now contains two syntaxes: HyTM and XTM.

The present

Some problems remain, however. The current ISO 13250 defines two interchange syntaxes (XTM and HyTM), but does not explain how they relate to one another. As the two syntaxes are subtly different, this is a problem, since implementors are likely to map between the syntaxes in different ways, which means that the same topic map may not be treated the same way by different software.

Another problem is that both syntax specifications in the current ISO 13250 are quite informal. For the most part this is not a problem, but in a number of more subtle situations developers have interpreted the specification text differently, and this causes interoperability problems. If different implementations interpret the same topic map differently, topic map applications may only work with a single implementation, which defeats the purpose of having a standard in the first place.

ISO SC34 has also resolved to create two new topic map standards:

ISO 18048: Topic Maps Query Language (TMQL), a query language for topic maps. This language is intended to be a kind of SQL (or XML Query) for topic maps, and will greatly simplify topic map application development by making it much easier to extract information from topic maps. A requirements specification has been created.
ISO 19756: Topic Maps Constraint Language (TMCL), a schema or constraint language for topic maps. Using TMCL one can write schemas for topic maps that constrain what may be said in the topic map, such as "a person must be born in a place," "a person must have at least one name," and so on. A requirements draft has been created.
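
As an illustration of the kind of rule TMCL is meant to express, the "a person must have at least one name" constraint can be sketched in plain Python. This is not TMCL, which has no defined syntax yet; the dictionary layout and the function name check_min_names are assumptions made up for this example.

```python
def check_min_names(topics, topic_type, minimum=1):
    """Return the ids of topics of `topic_type` with fewer than `minimum` names.

    `topics` is a toy structure (an assumption, not part of any standard):
    {topic_id: {"type": str, "names": [str, ...]}}
    """
    return [
        topic_id
        for topic_id, topic in topics.items()
        if topic["type"] == topic_type and len(topic["names"]) < minimum
    ]


topics = {
    "ada": {"type": "person", "names": ["Ada Lovelace"]},
    "anon": {"type": "person", "names": []},   # violates the constraint
    "oslo": {"type": "place", "names": []},    # not a person, so not checked
}

print(check_min_names(topics, "person"))  # prints ['anon']
```

A real constraint language would state such rules declaratively in a schema rather than as hand-written functions, so that any conforming processor could validate a topic map against them.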

Both of these standards need to explain how the constructs in them are evaluated, but the existing ISO 13250 does not provide a suitable basis for such definitions. For example, when TMQL defines the "find all base names of topic X in scope Y"-operator it needs to explain carefully and formally what that operator does. This could be done in terms of the XTM syntax, but it would then be difficult to see how to apply it to the HyTM syntax. The explanation would also become very involved, as XTM provides many different ways to express the same thing, and merging of topics within the topic map must be performed before queries can be done.

So while the community is generally satisfied with the two syntaxes, their specifications are in need of improvement on three counts:

Not all developers interpret them the same way.
They do not clearly relate the two syntaxes to one another.
They do not provide suitable foundations for the TMQL and TMCL standards.

ISO SC34's solution to this is the topic map data model work that was started in May 2001, and is now beginning to produce tangible results, in the form of N0298R1 and N0299. (See also the SAM home page.)

The future

ISO SC34's current plan is to revise ISO 13250 into a multi-part standard. A key part of this new edition of the standard will be what is known as the Standard Application Model (SAM), a formal data model for topic maps. The SAM will be based on the same formalism as the XML Information Set, and will define the allowed structure of topic maps, as well as how to perform key operations such as merging and duplicate removal. It will allow SC34 to solve the problems with the interpretations of the specifications, relate HyTM and XTM to one another, and create a foundation for TMQL and TMCL.

The problem with the interpretation of the ISO 13250:2000 and XTM 1.0 specifications will be solved by writing new specifications for the HyTM and XTM syntaxes based on the SAM. The new versions of the syntax specifications will describe how to build an instance of the SAM model from a document in a given syntax. This will be done very formally, in a way that leaves much less room for interpretation. (The syntaxes themselves will stay the same. The only thing that will change is that their interpretation will become much clearer. DTDs will still be used to define the syntaxes, as none of the extra features in XML Schema are really needed for these syntaxes. The SAM will only be used to define their interpretation.)

Rewriting the syntax specifications in the way described above will also solve the problem of how to relate the XTM syntax to HyTM, and vice versa. The SAM will now serve as a common point of reference for the two syntaxes, and comparison of parts of the syntaxes can be done by comparing the SAM models they create. This solution will continue to work even if new topic map syntaxes are introduced, and it provides a way to relate non-standard topic map syntaxes (such as LTM and AsTMa) to the standard ones. It also provides a way to make mappings from syntaxes that do not directly represent topic maps, but closely related information, such as NewsML and XFML.

The SAM provides a much more suitable basis for TMQL and TMCL, since it unites the different syntaxes and provides a much more convenient basis for operator definitions. Defined using the SAM, the "find all base names of topic X in scope Y"-operator would become something like "traverse the [base names] property of topic item X and return all base name items whose [scope] property contains topic item Y". (In practice the definition is likely to be somewhat different, but this is the basic idea.) TMQL and TMCL will then also be applicable to any topic map syntax that has a mapping to the SAM model.
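
The operator definition above can be made concrete with a small Python sketch. The classes TopicItem and BaseNameItem and the function base_names_in_scope are hypothetical stand-ins invented for this example; they only mimic the [base names] and [scope] properties mentioned above and are not taken from the SAM drafts.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality keeps topic items hashable
class TopicItem:
    label: str
    base_names: list = field(default_factory=list)  # models [base names]


@dataclass(eq=False)
class BaseNameItem:
    value: str
    scope: set = field(default_factory=set)  # models [scope], a set of TopicItem


def base_names_in_scope(topic, theme):
    """Traverse the base names of `topic`; keep those whose scope contains `theme`."""
    return [name for name in topic.base_names if theme in name.scope]


# Toy data: one topic with two base names in different scopes.
english = TopicItem("english")
historical = TopicItem("historical")
oslo = TopicItem("oslo")
oslo.base_names.append(BaseNameItem("Oslo", {english}))
oslo.base_names.append(BaseNameItem("Christiania", {historical}))

print([n.value for n in base_names_in_scope(oslo, english)])  # prints ['Oslo']
```

Because the function works on items in the model rather than on XTM or HyTM markup, the same definition applies no matter which syntax the topic map was loaded from.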

Canonicalization

Although the new specifications will be clearer than the previous versions, it will still be necessary to verify that implementations actually do conform to the specifications. This is best done by creating a conformance test suite, much like those already created for XML and XSLT. It is easy to create a set of topic map documents in the XTM and HyTM syntaxes, but harder to define what their correct interpretation is.

One way to do it is to create a so-called canonical syntax. In this syntax, every logically equivalent topic map would be represented as exactly the same sequence of bytes. This means that in order to see how a topic map engine interprets an XTM file, one could import that file into the engine, and then export it using the canonical syntax. The test suite could then consist of a set of XTM and HyTM documents with their corresponding canonical representations, and conformance testing could be automated.
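
The idea behind a canonical syntax can be shown with a toy Python sketch. The serialization format below (sorted, tab-separated lines) and the function canonicalize are assumptions invented for this example; they bear no relation to the actual canonical syntax proposal.

```python
def canonicalize(topic_map):
    """Serialize a toy topic map to bytes, independent of construction order.

    `topic_map` maps a topic identifier to an iterable of name strings.
    Topics and names are sorted and duplicate names are dropped, so
    logically equivalent maps yield identical bytes.
    """
    lines = []
    for topic_id in sorted(topic_map):
        names = sorted(set(topic_map[topic_id]))
        lines.append("\t".join([topic_id] + names))
    return "\n".join(lines).encode("utf-8")


# Two logically equivalent maps: different order, one duplicate name.
a = {"oslo": ["Oslo", "Christiania"], "bergen": ["Bergen"]}
b = {"bergen": ["Bergen"], "oslo": ["Christiania", "Oslo", "Oslo"]}

assert canonicalize(a) == canonicalize(b)
```

A conformance test then reduces to a byte comparison between the engine's canonical export and the expected canonical file stored in the test suite.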

The new ISO 13250 standard is going to contain just such a Canonical Topic Map syntax. It is expected that a conformance test suite will be developed, either within OASIS or within ISO, once the necessary infrastructure is in place. There also exists an early proposal for such a canonical syntax.

The Reference Model

The new ISO 13250 will also include a model known as the Reference Model, which is a more abstract graph model of topic maps. In this model, names and occurrence resources turn into nodes on the same level as topics, and they are related to their topics using an association-like structure of nodes and arcs. The result is a model that uses fewer constructs than the SAM, and which can be extended without changing the metamodel.
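
A rough Python sketch of this shift in perspective follows. It turns a toy topic-with-names structure into a set of nodes and a set of arcs, where each name becomes a node of its own and a separate linking node ties it to its topic. The node and arc labels and the function to_graph are invented for illustration and do not reflect the actual Reference Model vocabulary.

```python
def to_graph(topics):
    """Turn {topic_id: [name, ...]} into (nodes, arcs).

    Each name becomes a node on the same level as the topic node, and an
    association-like linking node connects the two through labelled arcs.
    """
    nodes = set()
    arcs = set()
    for topic_id, names in topics.items():
        topic_node = ("topic", topic_id)
        nodes.add(topic_node)
        for name in names:
            name_node = ("name", name)
            link_node = ("topic-name", topic_id, name)  # association-like node
            nodes.update({name_node, link_node})
            arcs.add((link_node, "topic", topic_node))
            arcs.add((link_node, "name", name_node))
    return nodes, arcs


nodes, arcs = to_graph({"oslo": ["Oslo", "Christiania"]})
print(len(nodes), len(arcs))  # prints 5 4
```

Only two constructs, nodes and arcs, are needed here, which is why new kinds of relationships can be added without changing the metamodel.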

The Reference Model provides a mechanism for explaining the relationships between different knowledge representations, such as topic maps, RDF, and KIF. This will make it easier for topic maps to interoperate with these other knowledge representations.

It is planned that the SAM part of the standard will include a normative mapping of the SAM to the Reference Model. The TMQL and TMCL standards will thus relate to the Reference Model through the SAM. Obviously, it is very important that the SAM and the Reference Model are consistent, and much work will go into ensuring that this is the case.

Overview

Below is shown a conceptual diagram of the relationships between the different parts of the new ISO 13250, as well as TMQL and TMCL:

The parts of the new ISO 13250 standard will be:

Part 0: A guide to the structure of the standard (Lars Marius Garshol)
Part X: The Standard Application Model (Lars Marius Garshol and Graham Moore)
Part X: The Reference Model (Steven R. Newcomb and Michel Biezunski)
Part X: The XML Topic Maps syntax (XTM) (Lars Marius Garshol and Graham Moore)
Part X: The HyTime Topic Maps syntax (HyTM) (Lars Marius Garshol and Graham Moore)
Part X: Canonicalization of topic maps (Currently unknown)

There is currently no clear timeframe for the finalization of these specifications.

Meanwhile, at OASIS...

In order for topic maps created by different parties to merge correctly, it is crucial that these parties use the same identifiers for their topics. This is unlikely to happen by itself, however, and therefore three Technical Committees (TCs) have been formed within OASIS, in order to work on something called published subjects. These are URIs and descriptions for concepts considered important by some publisher.
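
Why shared identifiers matter for merging can be seen in a minimal Python sketch. The dictionary layout, the function merge_topic_maps, and all URIs below are made-up placeholders for this example, not real published subject identifiers.

```python
def merge_topic_maps(*topic_maps):
    """Merge toy topic maps, combining topics that share a subject identifier.

    Each topic map is a dict from a subject identifier (a URI string)
    to a set of names for that subject.
    """
    merged = {}
    for topic_map in topic_maps:
        for subject_uri, names in topic_map.items():
            merged.setdefault(subject_uri, set()).update(names)
    return merged


# Placeholder URIs, assumed for illustration only.
NORWAY = "http://example.org/psi/country/NO"

map_a = {NORWAY: {"Norway"}}
map_b = {NORWAY: {"Norge"}}
map_c = {"http://example.com/my-own-id/norway": {"Noreg"}}  # private identifier

merged = merge_topic_maps(map_a, map_b, map_c)
print(sorted(merged[NORWAY]))  # prints ['Norge', 'Norway']
print(len(merged))             # prints 2
```

The topics from map_a and map_b merge because both authors used the same identifier, while the topic in map_c stays separate even though it describes the same country.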

The three OASIS TCs are:

Published subjects TC: Creates guidelines and recommendations for how to create, publish, and maintain published subject sets.
XML Vocabulary TC: Creates a vocabulary (or ontology) consisting of published subjects for the domain of core XML standards and technologies.
Geography and languages TC: Creates published subject sets for geographical and linguistic concepts. These published subject sets will be based on existing code sets such as ISO 639 and ISO 3166.

The published subjects activity within OASIS will layer on top of specifications produced by ISO SC34, and will not in any way interfere with what SC34 is doing.