I'm reasonably sure programming language compilers need a "secure section" directive, in addition to whatever they now have: one which basically guarantees serial-equivalent execution at the processor level, so as to deter timing analysis and the like.
The reason I say it has to be a compiler directive is to decouple the semantics from the implementation. On a serially executing core, such a new kind of "critical section" would simply be a no-op. On a simple pipelined processor it could be implemented via a full pipeline stall. On a superscalar it could be implemented by stopping register renaming for the duration of this (new, semantically different) fence, and by calling for temporally deterministic cache access, one read or write per time unit.
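To make the idea concrete, here is a minimal sketch of what such a directive might look like at the source level. The `secure_section` pragma is hypothetical (no compiler implements it today, so it appears only in comments here); the body is a constant-time comparison, the classic workload that needs data-independent timing:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directive: within this region the compiler and CPU would
 * be required to guarantee serial-equivalent, data-independent timing.
 *
 * #pragma secure_section begin   -- hypothetical, not a real pragma
 */
int ct_memcmp(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];  /* no early exit: time is independent of data */
    return diff != 0;         /* 0 if equal, 1 if any byte differs */
}
/* #pragma secure_section end */
```

Today the programmer must write such code carefully by hand and hope the optimizer doesn't undo it; the directive would make that guarantee part of the language contract instead.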
Or it could be implemented by whatever the transactional memory folks think of tomorrow. Maybe it could even be implemented by the compiler fully anticipating whatever the processor might do, and working against it: going over the Alpha, MIPS and similar processors' whole internal state graph, and proving over their combined superscalar behaviour that the emitted code always satisfies the directive, without emitting any extra code whatsoever.
The point being that this sort of thing can be formalized, and it can be optimized as well; rather simply, too, if we forgo the very best performance. It could take as simple a form as a superscalar pipelined µP squashing at retirement everything but the first, linearly ordered execution path it has seen so far. It could then speculatively execute anything else besides, leading to very little loss in efficiency for this kind of "security fence". Especially since this calls for little more than one bit in the speculation machinery, at least minimally speaking, and combines well with the branch prediction machinery. Much of this can also be done within the compiler, without invoking any machine-level instructions; compilers already inject null instructions for this precise purpose. And whatever instructions are needed can in fact be made into more general-purpose fences/barriers within the lower-level ISA as well; there is even current research and application heading in that direction right now.
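As a sketch of the compiler-only end of that spectrum: GCC and Clang already offer a pure compiler barrier that emits no machine instruction at all, and an ISA-level serializing fence (such as x86's LFENCE) can be layered on where the hardware demands it. The function below is a made-up illustration, not from the original text:

```c
#include <stdint.h>

/* A compiler-level barrier (a GCC/Clang extension): the compiler may not
 * reorder memory accesses across this point, yet it emits no machine
 * instruction itself. A dispatch-serializing hardware fence, e.g.
 * __asm__ volatile("lfence" ::: "memory") on x86, would be the
 * ISA-specific strengthening of the same idea. */
#define COMPILER_FENCE() __asm__ volatile("" ::: "memory")

uint32_t read_secret_then_public(volatile uint32_t *secret,
                                 volatile uint32_t *pub)
{
    uint32_t s = *secret;
    COMPILER_FENCE();  /* the compiler may not move accesses across here */
    return s ^ *pub;
}
```

A general "secure section" directive would essentially promote this kind of ad hoc barrier into a named, portable semantic guarantee.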
As one implementation idea for making this efficient, why not schedule a fully serial microtask, within the hyperthreading model, onto the other barrel-scheduled logical processor? It can do that already for many tasks. It can execute quite a number of fences, of more than one kind; and in simpler processor models, it already does this kind of blocking.
Why not just let it be a bit more stupid when the main thread tells it to be? Let it go into fully serial execution, and amortize the latency that goes with that sort of thing. Maybe even let there be a transient, third, "stupid, linear" execution state, because that could actually be rather interesting and efficient for other execution states and fences/barriers as well.
Those stupider, combined states over all execution histories are 1) rather easy to implement at the hardware level alongside general pipelined superscalar execution, 2) easier to optimize, and 3) easier to interface between compiler and hardware API/ISA. The second thread would still do rather useful work in this mode, even if never as much as a full co-worker to the main thread. And if we then optimized that second thread to be asymmetric in execution (here and otherwise), we might in fact be able to disable or omit an execution unit or two from the architecture as a whole, and save on power. All while keeping up average HyperThreading performance, bringing in more asymmetric performance, implementing provable security measures, and lowering total cost in both silicon real estate and in average and minimum power. Just see SIMD, for instance, and how Fortran compilers turn stacked loops into efficient fused MAC on DSPs.
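That SIMD/DSP point can be seen in miniature: a plain multiply-accumulate loop like the one below is exactly what compilers map onto fused-MAC and vector units, precisely because the loop's contract is narrow and well specified. The dot product here is only an illustrative example of the pattern, not anything from the original text:

```c
#include <stddef.h>

/* A multiply-accumulate loop of the kind compilers readily lower to
 * fused-MAC instructions or SIMD lanes: a narrow, well-specified
 * contract lets the compiler target special-purpose hardware, much as
 * a "secure section" contract would let it target serial execution. */
float dot(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];  /* candidate for FMA / vectorization */
    return acc;
}
```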
I'm rather certain such a crystallised, one-sided design might be a thorough win, applied to just one side of a processor's hardware pipeline.