2011-05-29

A well neglected data structure

I've been reading up on data structures and algorithms for a long time, as a database and data representation nerd. Especially with those having to do with relational databases and external memory overall. Every now and then one of them strikes me as being brilliant. Some of them even seem useful and elegant.

One structure which fills all of the prerequisites is the compact 0-complete tree. In principle, it is a self-balancing, variable arity search tree, much akin to B+ and splay trees. It's also compactly coded, which is nice. But the true revelation behind it is even more profound: in order to do a tree search, you do not need to store entire key values in the tree, but only the minimal binary decision points that are relevant. You can also construct a dynamically binning, self-balanced tree algorithm on top of that.

This means that the tree can depend pretty much on the total cardinality of the search space, and be asymptotically independent of key length. While remaining efficient even for external memory indexing, such as that we do within databases. Plus it implies the possibility of efficiently compressing arbitrary length keys into a fixed space, as long as the latter can enumerate the entire key space at any one time. (Here we do require the tree of decision points as side data, but with proper encoding that grows in an o(n log n) fashion with a very low multiplicative coefficient, by comparison with more well known trees.)

I find it truly exceptional that this reasoning hasn't been actively utilized in commercial databases, because it leads to amazingly space efficient, high fanout search trees. Also, it's almost insane that the compressive implication hasn't been exploited e.g. in routing hardware: such compression would preserve both strict lexical order and binary prefix-routability. This, despite the original paper having been published in a prestigious Proceedings, and it being from '88.

There are pretty much just two other papers in data structures which I'd rank this high, that I know of. One is Ailamaki et al's analysis on the practical multidimensionality of modern hard drives. And the other, one paper I saw on how to do multicolumn array search in equal complexity over all of the columns (I haven't been able to find it afterwards). Of those only Ailamaki's group actually came up with something clearly applicable.

How on earth could it be that these sorts of brilliant algorithms could be overlooked for over two decades, despite being publicly available, well documented, and clearly applicable even as such? To my eye it seems like more than one Second Coming suddenly went thoroughly unnoticed.

1 comment:

  1. Could this patent have anything to do with this algorithms absence in practice?

    http://www.wikipatents.com/US-Patent-7096235/computer-implemented-compact-0-complete-tree-dynamic-storage-structure/Page-6

    ReplyDelete