2010-05-06

Optimizing multi-tier storage on the margin

It seems like all of the studies in multi-tier disk storage (secondary, tertiary, and ostensibly beyond/in-the-middle) make hard choices as to the categorization of data into one class or another.

Why so? In fact, when we go from purely technical optimization within a set of data to a multi-tiered architecture, we inevitably deal with an economic problem: we segregated the data for a reason, that reason derives from human concerns, and human concerns always have to do with economic cost. That much is also known: we do analyze our choices in terms of total cost of ownership, energy costs, the fixed costs of hardware, and so on. We even try to make those different kinds of costs commensurable when we discount future costs and the like.

But when do we ever apply the rest of economic theory? In particular, when do we ever translate any of that back into the physical level, so that the automated framework better reflects our intuitive reasoning, or what it should, rationally speaking, tell us? Almost never. A particular example is that we almost never directly apply automatic optimization over the current cost frontier, using marginals. Not the supply one that we're trying so hard to quantify anyway, and certainly not the demand one, which a large corporation such as mine could ostensibly measure/try out.

My launching point for this rant is how to use disks right with large data. The rational way to go in today's environment is that most of the disks should contain rarely accessed data, and mostly be powered down. (If you're familiar with the idea, that's called "massive arrays of idle disks"/MAID, in contrast with the better-known "redundant arrays of inexpensive/independent disks"/RAID. There's a whole literature dealing with such concepts; just search for "PARAID" to get a hint of what can be done.) Yet people rarely try to optimize that sort of thing using the expected extra benefit from the topmost items. That is, by deduction on the margin/efficiency frontier.
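To make "on the margin" concrete, here is a minimal sketch of the kind of rule I mean. Everything in it is an assumption made up for illustration: the `Item` record, the per-access spin-up penalty, the per-gigabyte cost of keeping data on powered disks, and the idea that both are linear and expressed in the same unit. Under those assumptions an item stays on the powered tier exactly when its marginal benefit exceeds its marginal cost.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    size_gb: float
    accesses_per_day: float


def split_on_margin(items, spinup_penalty_per_access, powered_cost_per_gb_day):
    """Split items into (hot, cold) by comparing marginal benefit to marginal cost.

    Both parameters are hypothetical and must be in the same unit (say, money
    per day). Costs are assumed linear, so each item can be judged alone.
    """
    hot, cold = [], []
    # Sorting by access density only makes the output read top-down along the
    # frontier; with linear costs it does not change the decisions.
    ranked = sorted(items, key=lambda i: i.accesses_per_day / i.size_gb, reverse=True)
    for item in ranked:
        # Benefit of keeping the item spun up: spin-up penalties avoided per day.
        marginal_benefit = item.accesses_per_day * spinup_penalty_per_access
        # Cost of keeping it spun up: its share of always-on capacity per day.
        marginal_cost = item.size_gb * powered_cost_per_gb_day
        (hot if marginal_benefit > marginal_cost else cold).append(item)
    return hot, cold
```

With linear costs this collapses into a single threshold on accesses per gigabyte, which is the whole point: the cutoff falls out of the declared costs instead of being hand-picked.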

Even the nicest article I've seen, which was on "disk cooling", doesn't do that. Quite certainly it doesn't do what it does do in a generalized way. There is no real cost function to be seen even there. Plus the only really cost-aware things I know of -- the more developed relational database optimizers -- don't really let you input your own values into the optimization process either. Much.

I'd really like to see multi-tier storage which incorporates an optimizer that can be tuned declaratively, by declaring cost weights and/or entire cost functions. Over multiple variables, like expected average throughput, interconnect contention, maximum latency at a certain n-tile, expected total operating cost per time, and so on, over (roughly) comparable units. Systems which then utilize an intuitively sensible cost function over those parameters, and piecewise, slowly optimize over the efficiency frontier to yield adaptation towards a guaranteed-efficient equilibrium.
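As a rough sketch of what such declarative tuning could look like: the weights, tier characteristics, and metric names below are all invented for illustration, not taken from any real system, and the model ignores capacity limits and contention entirely. The operator declares the weights; the system only ever makes the single move with the largest cost reduction, one step at a time.

```python
# Hypothetical declared weights, converting each metric to a common cost unit.
WEIGHTS = {"latency_ms": 0.02, "energy_kwh": 0.15, "hardware_usd": 1.0}

# Hypothetical per-tier characteristics (illustrative numbers only).
TIERS = {
    "fast": {"latency_ms": 1.0, "energy_kwh_per_gb": 0.010, "hardware_usd_per_gb": 0.050},
    "slow": {"latency_ms": 50.0, "energy_kwh_per_gb": 0.001, "hardware_usd_per_gb": 0.005},
}


def metrics(placement, items):
    """Total each metric for a placement {name: tier}; items is {name: (size_gb, accesses)}."""
    totals = {"latency_ms": 0.0, "energy_kwh": 0.0, "hardware_usd": 0.0}
    for name, tier in placement.items():
        size_gb, accesses = items[name]
        t = TIERS[tier]
        totals["latency_ms"] += accesses * t["latency_ms"]
        totals["energy_kwh"] += size_gb * t["energy_kwh_per_gb"]
        totals["hardware_usd"] += size_gb * t["hardware_usd_per_gb"]
    return totals


def cost(placement, items, weights=WEIGHTS):
    """Weighted sum of the metrics: the declared cost function."""
    m = metrics(placement, items)
    return sum(weights[k] * m[k] for k in weights)


def improve_once(placement, items, weights=WEIGHTS):
    """Apply the one single-item move with the largest cost reduction, if any.

    Returns True if a move was made, False once no move lowers the cost.
    """
    base = cost(placement, items, weights)
    best_gain, best_move = 0.0, None
    for name, tier in placement.items():
        for other in TIERS:
            if other == tier:
                continue
            trial = dict(placement)
            trial[name] = other
            gain = base - cost(trial, items, weights)
            if gain > best_gain:
                best_gain, best_move = gain, (name, other)
    if best_move is None:
        return False
    placement[best_move[0]] = best_move[1]
    return True
```

Calling `improve_once` repeatedly until it returns `False` walks the placement downhill one marginal move at a time. In this toy model the cost is separable per item, so the stopping point is also the global optimum; with shared capacity or contention terms that would no longer be guaranteed.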

Markets do that, and they don't have nearly the computational capacity or the rationality/consistency of the individual chooser that automated systems do. So why not leverage the market mechanics? Especially since, in a closed system, you can make the actors fully altruistic if need be. There is absolutely no reason not to employ on-the-margin, incremental adaptation, except for algorithmic discontinuities in the cost landscape (which then should be avoided; cf. the parametric optimization literature within RDBMSs), equilibrium selection issues (which can mostly be automatically randomized away via purposely introduced mixed equilibria and optimization over the mixtures as a whole) and joint concavity of the optimands (mostly an issue with people, with limited time-to-personal-extinction).

As a simple and practical application, somebody already got this: "disk cooling" is all about moving individual pieces of data between "hot" and "cold" areas of the disk. At least when most efficiently implemented. Some hysteresis might be necessary to reduce the overall churn, true, but even then that's about economically quantifiable transaction costs. The approach just works, and has already led to very well founded models. Why not do it in the open, on the cloud, then, for real and with all of the additions over time?!?
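The hysteresis point can be put in the same terms. In the sketch below, which is my own illustration and not how any disk cooling paper does it, the migration cost, the planning horizon, and the per-day benefit and cost figures are all hypothetical parameters in one common unit. A piece of data moves only when the expected net gain over the horizon exceeds the one-off cost of moving it.

```python
def next_tier(current, benefit_hot_per_day, cost_hot_per_day, migration_cost, horizon_days):
    """Return the tier ("hot" or "cold") the data should occupy next.

    The data moves only if the expected net gain over the horizon exceeds the
    one-off migration cost; otherwise it stays put. All parameters are
    hypothetical and assumed to be in the same cost unit.
    """
    # Net value of being hot rather than cold over the planning horizon.
    net_hot = (benefit_hot_per_day - cost_hot_per_day) * horizon_days
    if current == "cold" and net_hot > migration_cost:
        return "hot"
    if current == "hot" and -net_hot > migration_cost:
        return "cold"
    return current
```

The dead band where nothing moves is then not a tuning knob of its own: its width follows from the migration cost divided by the horizon, on either side of break-even.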
