2019-03-23

A relational data format (ARDF), part 1: Introduction

For as long as I've known computers, I've been interested in data formats, representations and encoding. Once I got into them, especially via collecting different file format specifications, the question has been how to unify the lot into one universal data (meta) format.

I'm not the first one to have such lofty goals. But after a couple of decades of delving into the field, especially given how I've gone from haphazard binary dumps to markup languages to relational databasing, the semantic web, object denotations and back, and read my share of information, coding and compression theory, the pieces finally seem to be falling into place. It's time to jot down what I've learned in a more or less systematic fashion.

The relational model, and what it lacks

I fiercely believe the relational model is the starting place for all data modelling. Codd's vision for it was and remains the only truly systematic attempt at unifying data management. It was born of punched cards to be sure, and purposely aimed at solving the most common problem of business-like, tabular data. However, it has been studied well beyond its original limits, it can be tweaked to formalize far more complex representational problems, and when imbued with extra semantics as in the semantic database literature of the nineties, it is capable of subsuming most of the work that followed.

The strong points of relational literature as I see them are:

  • a regular, easily implementable, and yet rather general data model
  • a neat, standard mathematical formalism for all of data representation, querying, and schema design
  • a well developed body of theory of how to execute and optimize all of the relevant operations in practice
  • great practical support all across the board from IBM to the smallest and newest of FLOSS outfits
  • at least theoretical separation between the query interface and how the data are actually stored
  • a widely implemented and used interoperability language in SQL, part of which has also been formally standardised and adopted
  • tremendous economic and historical backing

On the other hand, current relational practice goes amiss in, or lacks, at least the following:

  • a query language/formalism (the relational algebra) purposely lacking in power; the absence of recursion and a high level of logical schema dependence are two notable shortcomings
  • typical implementations which are rather simplistic even with regard to the original relational vision of physical data model independence
  • inability to address the needs of logic databases
  • the inadequate integration of temporal aspects in the data model, especially with regard to schema evolution
  • SQL's status as a de facto interoperability standard, despite its haphazard construction and current lack of machine/human-interface distinction
  • lack of any universal binary interoperability standard between relational database management systems, especially in the current networked environment
  • the widely held belief that relational thought is only about practical tables, pointers/surrogates, and whatnot, instead of the overarching vision outlined by Codd and fleshed out in the research literature of some decades now
  • far too few specialised implementations which could show the full strength and breadth of the model
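
The recursion point in the list above can be made concrete with a small sketch. It assumes relations are modelled as plain Python sets of tuples; the names compose, transitive_closure and the parent relation are made up for illustration and come from no particular system. A single join-and-project gives grandparents, two give great-grandparents, but no fixed number of joins gives all ancestors for arbitrarily deep data. That takes iteration to a fixpoint, which sits outside the classical algebra.

```python
def compose(r, s):
    """Join r(a, b) with s(b, c) on b, then project onto (a, c)."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def transitive_closure(r):
    """Repeat join-and-union until nothing new appears (a fixpoint).

    The loop is the part a fixed relational algebra expression cannot
    express: the number of joins needed depends on the data.
    """
    closure = set(r)
    while True:
        step = closure | compose(closure, r)
        if step == closure:
            return closure
        closure = step


# Hypothetical example relation: (parent, child) pairs.
parent = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}

grandparent = compose(parent, parent)   # one join is enough here
ancestor = transitive_closure(parent)   # needs the fixpoint loop
```

Here grandparent holds exactly the two-step pairs, while ancestor also picks up the three-step pair ("ann", "dan"), which no single composition of parent with itself would produce.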

Below, I will aim as best I can at rectifying each problem in turn. Since what I have in mind is gathered from a long line of thought and this is the first attempt at making sense of it all, the exposition will necessarily be a bit fragmented at first. I'll also do what I always do, which is to move stuff around and rewrite what I've said; as a result, none of the posts which follow will be static. I hope you'll bear with me, and share the ideal of leaving behind As Complete a Solution as Can be Arrived At.