[sc34wg3] Essence of the TMRM

Thu, 29 Jul 2004 16:06:59 -0400

Greetings!

While I appreciate the recent postings by Barta, Dmitry, and Garshol for
the reference model workshop, I feel that I must apologize for the
continued failure of the "TMRM people" to adequately communicate the
essence of the TMRM.

Rather than replying directly to the partial models offered by Barta,
Dmitry, and Garshol, let me make yet another attempt to clarify one of
the basic concepts of the TMRM, that of the subject identity property (SIP).

From ISO 13250, Topic Link Architectural Form, note that the subject
identity attribute of a topic is defined as follows:

 >The optional subject identity (identity) attribute refers to one or
 >more indications ('subject descriptors') of the identity of the
 >subject (the organizing principle) of the topic link. All of the other
 >topic characteristics specified by the topic link are regarded as
 >elaborating, and in no way contradicting, the subject described by the
 >subject descriptor(s), if any. There are no restrictions on the kinds
 >of information that may be referenced by an identity attribute.

Note that: "There are no restrictions on the kinds of information that 
may be referenced by an identity attribute."

Now, if we look at the above description of the subject identity
attribute without looking at the rest of the standard, we might feel
justified in thinking that all topic maps should regard the values of
subject identity attributes (in other words, of information addressing
expressions) as the only basis for merging topics.

The standard does distinguish a specific type of subject descriptor,
the so-called "public subject descriptor", as being designed in such a
way as to permit the comparison of their addresses as a way of
determining whether two such addresses, when used as values of subject
identity attributes, effectively indicate the same subject.  (The
"public" [published] nature of "public subject descriptor" means that
its address can be standardized.  If the addresses of all public
subject descriptors are standardized, then the addresses of any two
public subject descriptors can be directly compared for sameness.  The
"public" subject descriptor is specialized precisely in order to allow
this optimization.)

But the standard is careful to leave completely unconstrained the
bases on which non-public subject descriptors will be compared.

Here's 13250's definition of "public subject descriptor" (3.15):

 > A subject descriptor (see the definition of `subject descriptor')
 > which is used (or, especially, which is designed to be used) as a
 > common referent of the identity attributes of many topic links in
 > many topic maps.  The subject described by the subject descriptor
 > is thus easily recognized as the common binding point of all the
 > topic links that reference it, so that they will be merged.

Now let's look at the *general* definition of "subject descriptor"
(3.19):

 > Information which is intended to provide a positive, unambiguous
 > indication of the identity of a subject, and which is the referent
 > of an identity attribute of a topic link.  (See also the
 > definition of `public subject descriptor'.)

Note that a subject descriptor is "information", and that it is not
the "address of some information".  In the *general* case of subject
descriptors, the referenced information is the indication of the
subject.  The address of a subject descriptor merely tells how
the information can be accessed, and any addressing expression that
resolves to the same subject descriptor would work equally well.
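As a rough illustration of the distinction between an address and the
information it resolves to, here is a minimal Python sketch. Everything
in it is an assumption made for the example: the RESOURCES table, the
addresses, the catalog text, and the function names are hypothetical and
come from neither 13250 nor HyTime. The default comparison is plain
equality, but any comparison could be passed in.

```python
# Hypothetical store mapping addressing expressions to the information
# they resolve to. All addresses and contents are invented.
RESOURCES = {
    "catalog.sgm#entry-42": "Acquisition no. 1897.12: bronze astrolabe",
    "mirror/catalog.sgm#e42": "Acquisition no. 1897.12: bronze astrolabe",
    "catalog.sgm#entry-43": "Acquisition no. 1897.13: brass sextant",
}


def resolve(address):
    """Return the information (the subject descriptor) found at an address."""
    return RESOURCES[address]


def same_subject_descriptor(addr_a, addr_b, compare=lambda x, y: x == y):
    """Compare the resolved information, not the address strings.

    `compare` is left pluggable because the basis of comparison is an
    application decision in this sketch.
    """
    return compare(resolve(addr_a), resolve(addr_b))


# Two different addresses, one subject descriptor.
print(same_subject_descriptor("catalog.sgm#entry-42", "mirror/catalog.sgm#e42"))
# Two different addresses, two different subject descriptors.
print(same_subject_descriptor("catalog.sgm#entry-42", "catalog.sgm#entry-43"))
```

The two addresses in the first call differ as strings, yet either one
serves equally well because both resolve to the same information.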

This may seem strange to those who are accustomed to "webby" thinking.
In Web-land, a given URI doesn't necessarily always resolve to exactly
the same information, on account of content negotiation.  More
abstractly, according to the REST doctrine, what's *really* being
addressed by a URI is an abstract notion called a "resource" (the term
"resource" has a special definition in Web-land).  In webby thinking,
URIs are the only constants; the information that any particular URI
addresses is allowed (and is often expected) to change.

It's important to understand 13250 in the context in which it
explicitly demands that it be understood, which is HyTime (ISO 10744),
and *not* the Web.  The design goals of HyTime are orthogonal to the
design goals of the URI and HTTP paradigms.

(Brief digression: The HyTime paradigm and the Web paradigm do not
invalidate each other, and are not mutually exclusive.  On the
contrary, marvelous opportunities await those who use them in each
other's contexts.)

The design goal of HyTime -- the one that's relevant to this
discussion, anyway -- is to provide protection for information
addressing expressions from loss of value due to changes in the
technical environment.  HyTime allows addressing expressions to be
complex and even arbitrarily algorithmic, but it requires them to be
expressed in terms of the structure of the information being
addressed.  (Neither the notation nor structure of addressed
information is constrained by HyTime.  What HyTime constrains is the
disclosure of the structure of information, so that the components of
information resources can have permanent, implementation-independent
addresses.  Such disclosures are called "property sets" in HyTime.)

The goal of HyTime appeals more directly to librarians and archivists
than to webheads.  It appeals to people whose mission is to preserve
the value of knowledge, and access to knowledge, for its own sake, and
far beyond its original context, purpose, or storage/access
technology.  In HyTime, the addressed information is assumed to be
where all the value is, while the addresses of information are simply
tools whose ability to be used to gain predictable, precise, and
permanent access to information is what must be preserved.  This idea
contrasts sharply with webby thinking, where, as already noted, the
information to which an addressing expression refers may change from
time to time.

In Web-land, there is little or no distinction between providing a URI
for a "resource", and providing that resource with a name.  In HyTime,
naming is only one class of addressing convention, and hierarchy is
only one kind of storage structure.

Now, with our HyTime hats firmly on our heads, we are in a good
position to understand the following words from 13250 (these are found
in Note 5, which immediately follows the definition of "subject
descriptor" (3.19)):

 > There is no requirement that a subject descriptor be text,
 > although it can be the text of a definition of the subject.  It
 > can also, for example, be a listing in a catalog of subjects,
 > such as an acquisition number of an asset in a museum
 > collection, a catalog number in a sales catalog, or a subject
 > heading in a catalog of library subject headings...

It's important to understand that none of the above examples (listing,
acquisition number, etc.) is intended to be understood as being the
value of a subject identity attribute.  By definition, a subject
descriptor is what the value of a subject identity attribute *resolves
to*.  And there is no notion here, at least not in the general case,
that the *addresses* of these pieces of information are intended to be
compared in order to determine whether or not the subjects that they
indicate are the same.  It is the pieces of addressed information --
the subject descriptors -- that must be compared.  Moreover, the
nature of that comparison is not constrained, other than by the nature
of the information of which the subject descriptor consists.  And, as
we've just seen, it's explicit in 13250 that the nature of subject
descriptor information is not constrained.

If there is any doubt about the latter point, it is laid to rest by
Note 6, which immediately follows the above Note 5:

 > Subject descriptors may be offline resources.

"Offline resources" here means information whose address cannot be
resolved by a computer, including information found in books, analog
information, and information we may learn by asking human beings.
(HyTime explicitly defines a class of such computer-unresolvable
addressing expressions.)

Thus, the conclusion that 13250 leaves wide open the nature of subject
descriptor information, and the methods used to compare subject
descriptors, is inescapable.

Now, let's compare that to the model offered by the TMDM:

Explanation of syntax: "|" represents "or", "*" means one or more, and
a SIP is structured as follows:

Namespace:SIP((Name_of_Property::Value)*)

Subject Proxy (in TMDM terms, a topic information item)

TMDM:SIP(subject_identifiers::Value|source_locators::Value|subject_locator::Value|reified::Value)

Comparison Rule: If any of these values are equivalent between two SIPs,
or if the source_locators::Value component of one SIP matches the
subject_identifiers::Value component of another SIP, then the two SIPs
are deemed equivalent.
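One possible reading of this comparison rule, sketched in Python. This
is not a TMDM API: the TmdmSip class and tmdm_equivalent function are
hypothetical names, each component is modelled as a set of strings,
"equivalent" is taken to mean string equality between same-named
components, and the locators are invented.

```python
from dataclasses import dataclass

COMPONENTS = ("subject_identifiers", "source_locators",
              "subject_locator", "reified")


@dataclass(frozen=True)
class TmdmSip:
    """Illustrative stand-in for the TMDM:SIP production above."""
    subject_identifiers: frozenset = frozenset()
    source_locators: frozenset = frozenset()
    subject_locator: frozenset = frozenset()
    reified: frozenset = frozenset()


def tmdm_equivalent(a, b):
    """Apply the comparison rule as read in this sketch."""
    # First clause: same-named components share at least one value.
    for name in COMPONENTS:
        if getattr(a, name) & getattr(b, name):
            return True
    # Second clause: a source locator of one SIP matches a subject
    # identifier of the other (checked in both directions).
    return bool(a.source_locators & b.subject_identifiers
                or b.source_locators & a.subject_identifiers)


a = TmdmSip(subject_identifiers=frozenset({"http://psi.example.org/lathe"}))
b = TmdmSip(source_locators=frozenset({"http://psi.example.org/lathe"}))
c = TmdmSip(subject_identifiers=frozenset({"http://psi.example.org/mill"}))
print(tmdm_equivalent(a, b))  # second clause applies
print(tmdm_equivalent(a, c))  # no shared values
```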

Along with that identity attribute (in ISO 13250 terms), a subject
proxy (topic information item) also has the following (OPs in TMRM terms):

TMDM:OPs(topic_names::Value,occurrences::Value,roles_played::Value,parent::Value)

From the above grammatical productions, which I hope fairly reflect
certain structural features of the TMDM, it should be readily apparent 
that the proposed TMDM imposes constraints which contrast sharply with 
13250's radically unconstrained definition of "subject descriptor".

This is perfectly OK; in fact, it is necessary to make choices about
what information will be found in a SIP in order to construct a topic in
any actual instance of a topic map.

What is NOT OK is to redefine 13250 in such a way that all subject
proxies in all topic maps must use only certain standardized kinds
of information as subject descriptors, and only certain ways of
comparing such restricted kinds of information.  In other words, it is 
not OK for TMDM to re-state 13250 in such a way that, where 13250
had left things wide open for Application designers to decide, 13250
will now be a single, pre-designed Application.

Topic map designers should be as free as they were in ISO 13250 to
declare any basis they wish to use to determine when two subject proxies 
represent the same subject.

Consider the following example that uses the same syntax:

Y12_Bomb_Works

Subject Proxy for Anti-Matter Lathe

Y12:SIP(Equipment_Type::value, Material::value, 
Stock_size(max-min)::value, Operator::value, Project::value)

Comparison rule: If all the components of the Y12:SIP match another
Y12:SIP, then the SIPs of the two subject proxies are equivalent.
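A minimal Python sketch of this all-components rule, assuming a SIP can
be modelled as a dictionary from property name to value and that "match"
means equality. The function name and every sample value (operators,
project name, stock sizes) are invented for illustration.

```python
# Property names taken from the Y12:SIP production above.
Y12_COMPONENTS = ("Equipment_Type", "Material", "Stock_size",
                  "Operator", "Project")


def y12_equivalent(sip_a, sip_b):
    """True only if every Y12 component is present in both and equal."""
    return all(
        name in sip_a and name in sip_b and sip_a[name] == sip_b[name]
        for name in Y12_COMPONENTS
    )


lathe_1 = {"Equipment_Type": "lathe", "Material": "anti-matter",
           "Stock_size": "(10-2)", "Operator": "Smith", "Project": "X"}
lathe_2 = dict(lathe_1)                      # identical components
lathe_3 = dict(lathe_1, Operator="Jones")    # one component differs

print(y12_equivalent(lathe_1, lathe_2))
print(y12_equivalent(lathe_1, lathe_3))
```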

Note that even though the SIP of the subject proxy for equipment
contains Equipment_Type, Material, Stock_size, Operator, and Project,
other subject proxies are not required to match on all of those values.
They must, however, declare/disclose which values must be matched and
on what basis that comparison is made.

For example, there could be another subject proxy, say for
anti-matter, that matches only on the basis of:

Y12:SIP(Material::anti-matter)

that causes all the SIPs that contain the component
Y12:SIP(Material::anti-matter) to be considered as equivalent.
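Such a disclosed, partial rule might be sketched in Python as follows.
The representation is an assumption: SIPs are dictionaries, the rule is
a dictionary naming the components and values that must be present, and
the function names and sample values are hypothetical.

```python
def matches_rule(sip, required):
    """True if the SIP contains every component/value pair the rule names."""
    return all(sip.get(name) == value for name, value in required.items())


def equivalent_under_rule(sip_a, sip_b, required):
    """Two SIPs are treated as equivalent if both satisfy the disclosed rule."""
    return matches_rule(sip_a, required) and matches_rule(sip_b, required)


# The disclosed rule from the text: match on Material alone.
ANTI_MATTER_RULE = {"Material": "anti-matter"}

sip_1 = {"Equipment_Type": "lathe", "Material": "anti-matter",
         "Operator": "Smith"}
sip_2 = {"Equipment_Type": "mill", "Material": "anti-matter",
         "Operator": "Jones"}
sip_3 = {"Equipment_Type": "lathe", "Material": "steel",
         "Operator": "Smith"}

print(equivalent_under_rule(sip_1, sip_2, ANTI_MATTER_RULE))
print(equivalent_under_rule(sip_1, sip_3, ANTI_MATTER_RULE))
```

Passing the rule as data is one way to keep the basis of comparison
declared rather than buried in the comparison code.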

The important point is that the basis for determining the identity
represented by a subject proxy must be left open, so that topic map
designers can be responsive to local requirements.

The TMRM captures the unconstrained notion of the identity attribute in 
ISO 13250.

The TMRM is an effort to provide the groundwork for disclosing the
components chosen for the identity attribute as it suits the needs of
particular users of topic maps.

Or as memorialized in US advertising culture, "Have it your way!"

Hope everyone is having a great day!

Patrick

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
Patrick.Durusau@sbl-site.org
Chair, V1 - Text Processing: Office and Publishing Systems Interface
Co-Editor, ISO 13250, Topic Maps -- Reference Model

Topic Maps: Human, not artificial, intelligence at work!