[sc34wg3] CTM: IRI pattern contest

Lars Heuer heuer at semagia.com
Fri Nov 7 07:20:27 EST 2008


Hi all,

In Leipzig we 'decided' (based on FCD comments) that we want an
explicit pattern for automatically detectable IRIs. These IRIs can be
used without the need to embed them into '<' / '>' brackets. For example,
http://www.semagia.com would be detected as an IRI.

These IRIs are a bit special since they cannot end with a dot ('.') or
a semicolon (';') (cf. Oslo meeting notes [1]).

I wonder if it wouldn't make sense to restrict the IRIs further:
- They shouldn't end with a "(" or a ")" because associations use them
  (and templates)
- They shouldn't end with a "," or ":" because roles use them
  (and templates)

I propose that these characters should not occur at the very end of an
IRI: ".", ";", ":", "(", ")", ","

Note: It is still possible to make them part of the IRI if the IRI is
embedded into <>.

Anyway, we need a good pattern for it, so any suggestion would be
helpful. Currently I use the following pattern (assuming that we
restrict the IRIs further as I've proposed):

    schema-name ::= [a-zA-Z]+[a-zA-Z0-9\+\-\.]*
    autodetectable-iri ::= schema-name '://'
        ((;|\.|\(|\)|,|:)*[^\s;\.\(\)\,:])+

The listed characters may occur within the IRI but not at its very end.
If one of them appears at the end, it is not treated as part of the IRI.
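
To make the "not treated as part of the IRI" point concrete, here is a
minimal sketch of how a parser might split off the IRI and keep the
trailing character for later. The helper name split_iri is hypothetical,
and the handling of the '<' / '>' form is a simplifying assumption (it
just reads up to the first '>'), not something the draft prescribes.

```python
import re

# Pattern from this mail; the helper below is a hypothetical illustration.
AUTO_IRI = re.compile(
    r'[a-zA-Z]+[a-zA-Z0-9\+\-\.]*://((;|\.|\(|\)|,|:)*[^\s;\.\(\)\,:])+')


def split_iri(s):
    """Return (iri, rest) for an IRI at the start of s, or (None, s)."""
    if s.startswith('<'):
        # Delimited form (assumption): everything up to the first '>'
        # belongs to the IRI, including trailing punctuation.
        end = s.find('>')
        if end == -1:
            return None, s
        return s[1:end], s[end + 1:]
    m = AUTO_IRI.match(s)
    if m is None:
        return None, s
    # A trailing '.', ';', ':', '(', ')' or ',' stays in the rest.
    return m.group(0), s[m.end():]


print(split_iri('http://www.semagia.com;'))    # ('http://www.semagia.com', ';')
print(split_iri('<http://www.semagia.com;>'))  # ('http://www.semagia.com;', '')
```

With the bare form the ';' is left over for the parser; with the
delimited form it becomes part of the IRI.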

I think TMQL needs something similar, so any simplification or
improvement would be helpful.

Here is a little Python program that shows the results of the proposed pattern:

>>> import re
>>> pattern = re.compile(r'[a-zA-Z]+[a-zA-Z0-9\+\-\.]*://((;|\.|\(|\)|,|:)*[^\s;\.\(\)\,:])+')
>>> def matchiri(s):
...     # Match only at the start of s and print the detected IRI, if any.
...     m = pattern.match(s)
...     if m is None:
...         print('No match for <%s>' % s)
...     else:
...         print('Match: <%s>' % m.group(0))
...
>>> matchiri('http://www.semagia.com;')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com(foo')
Match: <http://www.semagia.com(foo>
>>> matchiri('http://www.semagia.com,')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com)')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com:')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com:80')
Match: <http://www.semagia.com:80>
>>> matchiri('http://www.semagia.com.')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com ')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com(foo) ')
Match: <http://www.semagia.com(foo>
>>> matchiri('http://www.semagia.com)')
Match: <http://www.semagia.com>
>>> matchiri('http://www.semagia.com)foo')
Match: <http://www.semagia.com)foo>
>>>
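
As a starting point for simplifications, the alternations can probably be
written as character classes. The sketch below is only that: the names
ORIGINAL, SIMPLER, SAMPLES and matched are hypothetical, and it merely
checks that both spellings agree on the inputs used in this mail.

```python
import re

# Pattern as proposed in this mail.
ORIGINAL = re.compile(
    r'[a-zA-Z]+[a-zA-Z0-9\+\-\.]*://((;|\.|\(|\)|,|:)*[^\s;\.\(\)\,:])+')
# Hypothetical simplification: character classes, non-capturing group.
SIMPLER = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*://(?:[;.(),:]*[^\s;.(),:])+')

SAMPLES = [
    'http://www.semagia.com;',
    'http://www.semagia.com(foo',
    'http://www.semagia.com,',
    'http://www.semagia.com)',
    'http://www.semagia.com:',
    'http://www.semagia.com:80',
    'http://www.semagia.com.',
    'http://www.semagia.com ',
    'http://www.semagia.com(foo) ',
    'http://www.semagia.com)foo',
    'no iri here',
]


def matched(pattern, s):
    """Return the text matched at the start of s, or None."""
    m = pattern.match(s)
    return m.group(0) if m else None


for s in SAMPLES:
    # Both spellings should detect exactly the same IRI.
    assert matched(ORIGINAL, s) == matched(SIMPLER, s), s
print('same result on all %d samples' % len(SAMPLES))
```

Agreement on these samples is not a proof of equivalence, of course.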

So, Perl lovers and RegEx experts, your turn! :)

[1] <http://www.itscj.ipsj.or.jp/sc34/open/1023c.htm>

Best regards,
Lars
-- 
Semagia
<http://www.semagia.com/>

