[sc34wg3] TMQL: counting within groups of tuples

Mon Sep 14 19:27:33 EDT 2009

Hello,

this problem occurred to me in practice and is surprisingly hard to
solve in SQL:

Given a table (= tuple sequence), partition the rows of the table into
sets of rows (= tuples) which have a common criterion (like a common
value in a particular column). For each of these partitions, sort the
rows according to another criterion (e.g. another particular column) and
then assign each row a consecutive integer index (e.g. the first row
within the partition gets the value 0, the next row within the partition
gets the value 1, the next row within the partition gets the value 2,
...). This index may be called "rank".

Take this example:

id | fkey | data
================
1  | 5    | foo
2  | 5    | bar
3  | 8    | bla
4  | 5    | baz
5  | 8    | blub

The partitioning criterion is "fkey", and the sort-order is along "id".
This table should be translated to:

rank | id | fkey | data
=======================
0    | 1  | 5    | foo
1    | 2  | 5    | bar
0    | 3  | 8    | bla
2    | 4  | 5    | baz
1    | 5  | 8    | blub

Note that the output ordering of individual rows does not matter, but
the rank numbers must be correct. However, as we may be talking about
sets of rows, the result

rank | id | fkey | data
=======================
0    | 1  | 5    | foo
1    | 2  | 5    | bar
2    | 4  | 5    | baz
0    | 3  | 8    | bla
1    | 5  | 8    | blub

would be equally ok.

I think that TMQL should be powerful enough to formulate such data
pŕocessing (or at least be as extensible as possible that, if such power
is not part of the standard, it can be easily and uniformly integrated).

Thus, I'd like that this example should be considered for being added to
the TMQL requirements.

It is an interesting challenge for parallel TMQL engines: While each row
that is subject to such an operation is not completely independent from
other rows (because the existence of other rows affects the own rank
number), each row is not fully dependent on other rows either (for a
given row, other rows with a different fkey are totally ignorable).
Thus, the processing of this operator can be "somehow partially"
parallelized across independent machines if each partition of rows
happens to exist in at most one machine.

ciao,
Xuân.