[sc34wg3] new working draft of 13250-5 (Reference Model)

Tue, 9 Nov 2004 12:45:39 +0100

Steve and all

I've read the new TMRM draft with much interest. Some "thinking aloud" about it.

You are certainly aware of a current thread of thought on "identification as dynamic
process" vs "identity as static set of properties" (see various posts about it on
universimmedia blog below), and my reading of the new draft has been along those lines.
Please note that all of this is quite new thinking for me too, and I am not yet sure where
it leads.

Anyway I would like here to make the case for a shift from the static notion of "Subject
Identity Properties" (SIP) to the dynamic one of "Subject Identification Rules" (SIR, yes
Sir). I am happy to see that the notion of "rules" now appears explicitly in many parts of
the draft. I will focus on the requirement that a TMA disclose, among other things, "the
rules for determining when multiple proxies are surrogates for the same subject".

I definitely like this way of putting things, which opens the door to any kind of rules of
identification. So I wonder why those rules should be restricted to the disclosure of SIP
classes. Agreed, a SIP can be as complex as can be, but what happens in most "natural"
identification processes is that various rules are applied to properties which are not
strictly SIPs. I will take two examples.

Example 1: Identifying books
How do you make sure that a book X you find on Amazon is the same as the book Y you are
looking for?
I will use "X :: Y" to indicate that you decide that they are actually the same (whatever
this sameness means).

I guess you could apply a heuristic like the following (a succession of rules):

	if ISBN(X) = ISBN(Y)
		then X :: Y

	else if ISBN(X) or ISBN(Y) is not specified
		and AuthorName(X) = AuthorName(Y)
		and Title(X) = Title(Y)
		and PublicationDate(X) = PublicationDate(Y)
		and EditorName(X) = EditorName(Y)
		then X :: Y

In the TMA disclosure, ISBN would be defined as a SIP, whereas AuthorName, Title,
PublicationDate and EditorName are OPs, although together they "act as" a SIP. Of course you
can create a complex SIP from this combination, but it looks more natural to present it as
an identification rule than as a "property" in the object-oriented sense of the term.
Otherwise it becomes a "complex property", a notion more difficult to explain and grasp
than the notion of identification rule.
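
The heuristic of Example 1 can be written as a short runnable sketch. This is only an
illustration under stated assumptions: the Book class, its field names and the same_book
function are hypothetical and come from neither the draft nor any TMA. It also assumes that
two unspecified ISBNs never match under the first rule, and that two specified but
different ISBNs mean no rule applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Book:
    """Hypothetical proxy for a book; field names are illustrative only."""
    author_name: str
    title: str
    publication_date: str
    editor_name: str
    isbn: Optional[str] = None  # None stands for "not specified"


def same_book(x: Book, y: Book) -> bool:
    """Return True when the succession of rules decides X :: Y."""
    # Rule 1: both ISBNs are specified, so the ISBN alone decides (the SIP case).
    if x.isbn is not None and y.isbn is not None:
        return x.isbn == y.isbn
    # Rule 2: at least one ISBN is missing, so fall back on the combination
    # of other properties that together "act as" a SIP.
    return (
        x.author_name == y.author_name
        and x.title == y.title
        and x.publication_date == y.publication_date
        and x.editor_name == y.editor_name
    )
```

Written this way, the ordering of the rules is explicit, which a single "complex property"
would tend to hide.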

Example 2: Identifying news
How do you make sure that a news item X from Reuters today is the same as a news item Y
from AFP yesterday?
E.g. I already know from X that "George W. Bush was re-elected", so I don't need to be told
by Y that "Bush is the new President of the USA".

This is much trickier. Assuming you have defined "news" as a class of documents, maybe in
this case your TMA includes a text mining engine, applying complex, context-sensitive,
linguistic analysis rules to infer that X and Y "have the same subject" and therefore
should be identified as the same. Should the TMA disclosure include all the rules applied
by the linguistic tool? What would be the classes of SIPs? Or would the TMA simply disclose
that it uses such and such a Text Mining Application, or the Google News algorithm, to
compare news?
Actually this is not academic. At Mondeca we have a Text Mining partner providing
technology plugged into ITM through an API, and the first application of this coupling was
the successful mining of Reuters Financial News, with efficient extraction, storing and
merging of subjects like companies and their announced relationships (buy, merge,
partnership, participation, ...).

So what I am questioning is whether "Subject Sameness Detection Rules" (which I would more
simply call Subject Identification Rules) must always be linked to a class of SIPs.
That is just the simplest case, like ISBN in Example 1.
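
To make the point concrete, here is a minimal sketch in which an identification rule is
any predicate over two proxies, and a rule tied to a single SIP is only one special case.
Every name in it (Proxy, Rule, sip_rule, external_rule, same_subject, word_overlap) is
hypothetical and taken from neither the draft nor any product. In particular, word_overlap
is a toy stand-in for an external comparator, not a real text mining engine.

```python
from typing import Callable, Dict, List

Proxy = Dict[str, str]                 # hypothetical: property name -> value
Rule = Callable[[Proxy, Proxy], bool]  # an identification rule is any predicate


def sip_rule(prop: str) -> Rule:
    """Simplest case: both proxies carry the same value for one SIP."""
    def rule(x: Proxy, y: Proxy) -> bool:
        return prop in x and prop in y and x[prop] == y[prop]
    return rule


def external_rule(compare: Callable[[str, str], float],
                  prop: str, threshold: float) -> Rule:
    """Delegate to an external comparator, treated as a black box."""
    def rule(x: Proxy, y: Proxy) -> bool:
        return (prop in x and prop in y
                and compare(x[prop], y[prop]) >= threshold)
    return rule


def same_subject(x: Proxy, y: Proxy, rules: List[Rule]) -> bool:
    """X :: Y as soon as any disclosed rule says so."""
    return any(rule(x, y) for rule in rules)


def word_overlap(a: str, b: str) -> float:
    """Toy comparator (shared words / all words); NOT a text mining engine."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0


# A disclosure could then list the rules, naming the external tool
# instead of enumerating its internal linguistic rules.
rules = [sip_rule("isbn"), external_rule(word_overlap, "text", 0.5)]
```

In this reading, the disclosure names each rule; for the external one it names the tool and
the threshold, without any class of SIPs being involved.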

Bottom line: the word "rule" occurs so often in the document that it probably deserves a
definition in the Glossary.

My 0.02 Euros - currently a little more than 0.02 $ :))

Bernard

**********************************************************************************

Bernard Vatant
Senior Consultant
Knowledge Engineering
bernard.vatant@mondeca.com

"Making Sense of Content" :  http://www.mondeca.com
"Everything is a Subject" :  http://universimmedia.blogspot.com

**********************************************************************************

> -----Original Message-----
> From: sc34wg3-admin@isotopicmaps.org
> [mailto:sc34wg3-admin@isotopicmaps.org] On Behalf Of Steven R. Newcomb
> Sent: Monday, 8 November 2004 16:30
> To: sc34wg3@isotopicmaps.org
> Subject: [sc34wg3] new working draft of 13250-5 (Reference Model)
>
>
> All -
>
> A new working draft of 13250-5, "Topic Maps - Reference Model",
> is now available at http://www.jtc1sc34.org/repository/0554.htm
>
> It's significantly shorter, and we hope and believe it's easier to
> understand, too.
>
> -- Steve
>
> Steven R. Newcomb, Consultant
> Coolheads Consulting
>
> Co-editor, Topic Maps International Standard (ISO 13250)
> Co-drafter, Topic Maps Reference Model
>
> srn@coolheads.com
> http://www.coolheads.com
>
> direct: +1 540 951 9773
> main:   +1 540 951 9774
> fax:    +1 540 951 9775
>
> 208 Highview Drive
> Blacksburg, Virginia 24060 USA
>
> _______________________________________________
> sc34wg3 mailing list
> sc34wg3@isotopicmaps.org
> http://www.isotopicmaps.org/mailman/listinfo/sc34wg3
>