Okay, so you have your basic look-ahead peak limiter. Everybody knows how to do the compression part. But how do you track that peak over a fixed window? In particular, how do you track it efficiently in the ideal case, where you have to output the peak value over the window after each and every update to the window?
After a couple of comments on the music-dsp list, one of them pointing to "something like mipmapping", I came up with this one.
Take a recirculating delay buffer 2^n samples long. As usual, round up to the next power of two from the maximum length of the delay/window, and then tune the effective length down from that via the lag between the read and write pointers. Now feed that buffer with whatever sort of maximum you'd like to track. Begin with all zeroes (or identities), and consistently reset every value falling out of your window to that as well. Incoming values just go into the buffer via the write pointer.
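A minimal sketch of that buffer discipline (the names and the sizes are made up for illustration):

```python
N = 8            # buffer length, rounded up to a power of two (hypothetical)
window = 5       # effective window length, set via the read/write pointer lag
buf = [0.0] * N  # begin with all zeroes, the identity for max
write = 0

def push(x):
    """Feed one value: clear the slot leaving the window, set the new one."""
    global write
    buf[(write - window) % N] = 0.0  # value falling out -> reset to identity
    buf[write] = x                   # incoming value via the write pointer
    write = (write + 1) % N
```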
After that you'll have a buffer which you can aggregate as a whole at each point in time, no matter the lag between the read and write pointers: the data between them won't affect the final result, because it has been reset to zero and/or the identity.
Then suppose you want a maximum aggregate over your real window. That's now the same as the aggregate over the whole buffer, regardless of whether a slot holds a real value or just the space between the read and write pointers. Updating something like that is a whole lot easier: as a first pass, build a full, balanced binary tree over the whole buffer, doing every aggregation pairwise between adjacent values, then pairwise between the results, then...recurse right up to the top. That always works because max(.,.) is associative, so it doesn't matter in which order you do it, as long as you don't outright commute the values.
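The first-pass build can be sketched like this, with the pyramid kept as a flat array whose root sits at index 1 and whose leaves (the buffer itself) sit at indices N..2N-1 (sizes hypothetical):

```python
N = 8                    # buffer length, a power of two (hypothetical)
tree = [0.0] * (2 * N)   # leaves at tree[N:2N], pairwise maxima stacked above

def build(values):
    """Aggregate pairwise, then pairwise between the results, up to the root."""
    tree[N:2 * N] = values
    for i in range(N - 1, 0, -1):  # bottom-up over the internal nodes
        tree[i] = max(tree[2 * i], tree[2 * i + 1])
```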
Then take that as an invariant: the whole pyramid on top of the buffer needs to reflect a symmetrically parenthesized, binary order of calculating maxima from the base data. How do you update that sort of thing incrementally?
Well, you only have two changes per cycle: one at the read pointer and one at the write pointer. Each of them is independent, so you do the same thing for both. And the thing is to recurse upwards towards the root of the tree: compare the new value to its sibling under the shared parent, and update the parent if the new maximum differs from what's already there. If it doesn't, stop early. Do that for both pointers on each sample, and always return whatever is at the top of the pyramid afterwards.
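A sketch of that per-change update, in the same flat layout (the parent of node i is i >> 1 and its sibling is i ^ 1):

```python
N = 8                   # buffer length, a power of two (hypothetical)
tree = [0.0] * (2 * N)  # leaves at tree[N:2N], all starting at the identity

def update(pos, value):
    """Write one leaf, then recurse towards the root, stopping early."""
    i = N + pos
    tree[i] = value
    while i > 1:
        m = max(tree[i], tree[i ^ 1])  # compare against the sibling
        i >>= 1                        # move up to the shared parent
        if tree[i] == m:
            break                      # parent unchanged: stop early
        tree[i] = m
    # the windowed maximum is now tree[1]
```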
What you then have is a worst-case guaranteed O(log n) algorithm in n = window length. And if you then array all of the intermediate maxima in the pyramid after the primary buffer, you get a data structure with no explicit bookkeeping or varying datatypes: just values and other values derived from them. A homogeneous, densely packed array, in an order you access linearly per cycle, and whose index arithmetic boils down to a couple of shifts and adds. Totally neat, both for code and cache efficiency.
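Folding the whole scheme into one densely packed array might look like this (a sketch; the class name and the float leaves are my own choices):

```python
class WindowMax:
    """Sliding-window maximum, O(log n) worst case per sample.

    Flat layout: tree[1] is the root, the leaves (the recirculating delay
    buffer) sit at tree[N:2N]; parent of node i is i >> 1, sibling is i ^ 1.
    """

    def __init__(self, window):
        n = 1
        while n < window:            # round up to the next power of two
            n <<= 1
        self.N = n
        self.window = window         # true length set via the pointer lag
        self.tree = [0.0] * (2 * n)  # zeroes: the identity for max
        self.write = 0

    def _update(self, pos, value):
        i = self.N + pos
        self.tree[i] = value
        while i > 1:
            m = max(self.tree[i], self.tree[i ^ 1])  # compare to the sibling
            i >>= 1                                  # shared parent
            if self.tree[i] == m:
                break                                # early out
            self.tree[i] = m

    def process(self, x):
        mask = self.N - 1
        self._update((self.write - self.window) & mask, 0.0)  # leaving value
        self._update(self.write, x)                           # entering value
        self.write = (self.write + 1) & mask
        return self.tree[1]                                   # current max
```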
If you then want to use SSE-style vector operations, that's easy too: the data is already in a format where any 2^m-wide vector op works at all levels except the topmost ones. The ops proceed in linearly ascending order as well, so you can apply them even when you don't strictly have to; that cuts down on needless conditionals and the resulting pipeline stalls. Unrolling is easy as well -- although your superscalar processor probably does that for you with this kind of code.
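To illustrate the shape of that, here numpy's elementwise maximum stands in for the SSE ops: each pyramid level is produced by a single strided vector operation over the level below (a sketch, not SIMD intrinsics):

```python
import numpy as np

def build_levels(values):
    """Build the pyramid one whole level at a time, each as one vector op."""
    level = np.asarray(values, dtype=np.float64)
    levels = [level]
    while level.size > 1:
        # pairwise maxima of adjacent elements, done as a single vectorized op
        level = np.maximum(level[0::2], level[1::2])
        levels.append(level)
    return levels  # levels[-1][0] is the overall maximum
```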
Extending the data structure is a matter of copying and 2^n interleaving. Not cheap as such, but very regular, and as such linear in the size of the existing structure -- which is asymptotically as good as extensible arrays get.
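One way to sketch the extension; I've glossed over the level-by-level interleaving in favour of a plain copy of the leaf row plus a linear rebuild, which has the same asymptotic cost (the pointer/wrap fix-up is elided):

```python
def grow(tree, N):
    """Double the capacity: linear in the size of the new structure."""
    new_N = 2 * N
    new = [0.0] * (2 * new_N)             # identity everywhere to start
    new[new_N:new_N + N] = tree[N:2 * N]  # old leaves -> left half of new row
    for i in range(new_N - 1, 0, -1):     # rebuild the pyramid bottom-up
        new[i] = max(new[2 * i], new[2 * i + 1])
    return new, new_N
```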
And finally, the biggest thing... In all of the construction we never relied on anything beyond associativity, really. Not essentially, at least. Everything is kept in order, so commutativity only becomes a problem when we wrap around the end of the buffer. Thus, utilizing associativity, we just need to special-case the wrap, and everything else works the same. We're also utilizing the existence of an identity value between the read and write pointers -- but we can dispense with that via another if-clause as well, one which connects those two values to each other and jointly propagates the result upwards. It's still fully O(log n).
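To make the associativity-only claim concrete, here's a sketch of an order-preserving range aggregate over the same flat tree, parameterized on an arbitrary associative op. With a non-commutative op (string concatenation in the test), the wrap special case is then just two queries, [read, N) followed by [0, write):

```python
def query(tree, N, lo, hi, op, ident):
    """Aggregate leaves [lo, hi) strictly left to right: only associativity
    is assumed, so the op need not commute."""
    resl, resr = ident, ident  # left-edge and right-edge accumulators
    lo += N
    hi += N
    while lo < hi:
        if lo & 1:             # lo is a right child: take it, step past it
            resl = op(resl, tree[lo])
            lo += 1
        if hi & 1:             # hi is a right child: take hi-1, step back
            hi -= 1
            resr = op(tree[hi], resr)
        lo >>= 1
        hi >>= 1
    return op(resl, resr)
```

A wrapped window [read, write) then aggregates as op(query(tree, N, read, N, op, ident), query(tree, N, 0, write, op, ident)).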
In fact, in the Starburst and Exodus sort of database research, they already implemented even the requisite operations to insert stuff in the middle and to remove it, willy-nilly. There the semantics they used were the ones that attach to concatenation. But aren't those pretty much the same ones we're using here? Insertion of subsequences, deletion of them, and the conceptual manipulation of null strings/epsilons. They did it at the higher level of aggregate indices, and our manipulations would follow theirs fully (within the constraints of a cyclical buffer), because the higher-level indices in both cases abstract back from similar basic invariants/axioms: associativity in the presence of an identity (possibly forcibly introduced, but still).