Continuous Program Optimization: Design and Evaluation, T. Kistler and M. Franz. IEEE Transactions on Computers, 50(6), June 2001, pp. 549-566.
Abstract
This paper presents a system in which the already executing user code
is continually and automatically reoptimized in the background, using
dynamically collected execution profiles as a guide. Whenever a new
code image has been constructed in the background in this manner, it
is hot-swapped in place of the previously executing one. Control is
then transferred to the new code and construction of yet another code
image is initiated in the background. Two new runtime optimization
techniques have been implemented in the context of this system: object
layout adaptation and dynamic trace scheduling. The former technique
constantly improves the storage layout of dynamically allocated data
structures to improve data cache locality. The latter increases the
instruction-level parallelism by continually adapting the instruction
schedule to predominantly executed program paths. The empirical
results presented in this paper make a case in favor of continuous
optimization, but also indicate some of the pitfalls and current
shortcomings of continuous optimization. If not applied judiciously,
the costs of dynamic optimizations outweigh their benefit in many
situations so that no break-even point is ever reached. In favorable
circumstances, however, speed-ups of over 96 percent have been
observed. It appears as if the main beneficiaries of continuous
optimization are shared libraries in specific application domains
which, at different times, can be optimized in the context of the
currently dominant client application.
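To make the first of these techniques concrete before the commentary: the sketch below is my own illustration (in Java, where the JVM actually controls layout, whereas the paper's compiler manipulates layout directly) of hot/cold field splitting, a close relative of the profile-guided layout changes the abstract describes. The premise is that profiling shows a hot traversal touching only a couple of fields, so the rarely used fields get moved out of the way to improve cache density.

    // Hypothetical illustration of profile-guided object layout adaptation.
    // Suppose the profiler reports that a hot traversal reads only 'key' and
    // 'next'; the rarely touched fields are split into a separate object so
    // that more hot nodes fit into each cache line.

    // Before adaptation: hot and cold fields interleaved in one object.
    class FatNode {
        int key;            // hot: read on every traversal step
        FatNode next;       // hot
        String description; // cold: almost never read
        long createdAt;     // cold
    }

    // After adaptation: the traversal streams through small, dense nodes and
    // pays one extra indirection only on the rare cold-field access.
    class HotNode {
        int key;
        HotNode next;
        ColdData cold;
    }

    class ColdData {
        String description;
        long createdAt;
    }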
Novelty
The main novelty of the system presented in this paper is the concept
of dynamic adaptability -- rather than a particular procedure being
optimized (perhaps incrementally over time) until it reaches some
final, optimal state, the procedure is continually reoptimized with
regard to the current usage characteristics as recorded by a dynamic
profiler. The difference is that in addition to using the profiler to
determine *which* procedures should be optimized, it is also used to
determine *how* to optimize them.
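As a sketch of how the pieces fit together (the names and interfaces below are my own invention, not the paper's):

    import java.util.Map;

    // Minimal sketch of the continuous-optimization loop described in the
    // paper: profile the running code, rebuild an optimized image in the
    // background, hot-swap it in, and start over. Profiler and CodeImage
    // are invented stand-ins, not the paper's actual interfaces.
    final class ContinuousOptimizer implements Runnable {
        private volatile CodeImage current;  // image the app is running
        private final Profiler profiler;     // collects execution profiles

        ContinuousOptimizer(CodeImage initial, Profiler profiler) {
            this.current = initial;
            this.profiler = profiler;
        }

        @Override
        public void run() {
            while (true) {
                // 1. Take a snapshot of the dynamically collected profile.
                Map<String, Long> profile = profiler.snapshot();

                // 2. The profile determines *which* procedures to reoptimize
                //    and *how* (e.g., which trace to schedule for, which
                //    object layout to pick).
                CodeImage candidate = current.reoptimize(profile);

                // 3. Hot-swap: transfer control to the new image, then
                //    start building yet another one.
                candidate.install();
                current = candidate;
            }
        }
    }

    interface Profiler { Map<String, Long> snapshot(); }

    interface CodeImage {
        CodeImage reoptimize(Map<String, Long> profile);
        void install();
    }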
The paper also discusses two new optimizations that take advantage of
this dynamic adaptability; these optimizations were previously
published by the authors, so technically they aren't novel to this
paper.
Flaws
The authors could have made a much stronger case for the need for
dynamic adaptability. They try to show its benefits using the two new
optimizations, but in my opinion all they showed was that those
optimizations have very limited usefulness; this doesn't say anything
one way or the other about the usefulness of dynamic adaptability
itself. The authors should have done a study showing quantitatively
that there exists a large class of
applications whose behavior at a procedural level changes over
time. This would have shown that the concept of dynamic adaptability
has merit -- instead the authors have some hand-wavy examples that
aren't very convincing.
An important question (that would be answered by the above suggested
study) is over what time-scale a particular application's behavior
changes. If the time taken to optimize a procedure (which is done in
the background and hence is fairly slow) is not much smaller than this
time-scale, then the system will always be optimizing the code for
behavior that is no longer relevant -- it will always be playing
catch-up.
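To put rough numbers on the catch-up concern (mine, purely illustrative):

    // Back-of-the-envelope check with invented numbers: the system only
    // wins when reoptimization is much faster than the rate at which the
    // program's behavior changes.
    public class CatchUp {
        public static void main(String[] args) {
            double reoptSeconds = 30.0; // time to rebuild + swap an image (assumed)
            double phaseSeconds = 20.0; // how long one behavior phase lasts (assumed)
            if (reoptSeconds >= phaseSeconds) {
                System.out.println("Every image is stale on arrival: always playing catch-up.");
            } else {
                double staleFraction = reoptSeconds / phaseSeconds;
                System.out.printf("Each phase runs unoptimized for %.0f%% of its lifetime.%n",
                                  100 * staleFraction);
            }
        }
    }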
The authors suggest that shared libraries could be a main beneficiary
of continuous optimization, making the assumption that users usually
use the libraries for one thing at a time; this ignores multi-user
systems in which, even though any one user may not use the library for
multiple things at once, different users are simultaneously using it
for different purposes.
The authors basically punt on the problem of identifying idle cycles
that can be used for the optimizations without impacting the
performance of the application. They recognize this as a problem, but
don't offer a real solution; the system they implemented simply
assumes that putting the optimizer on a low-priority thread
approximates this closely enough, but I am not convinced -- there is
no guarantee at all that the low-priority thread isn't still stealing
usable cycles from the app.
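Their approach boils down to something like this (a Java rendering of the idea, not their actual system):

    public class BackgroundOptimizer {
        public static void main(String[] args) {
            Thread optimizer = new Thread(() -> {
                // ... continuously reoptimize code images here ...
            });
            // MIN_PRIORITY is merely a hint to the scheduler; nothing
            // guarantees that this thread consumes only cycles the
            // application couldn't have used itself.
            optimizer.setPriority(Thread.MIN_PRIORITY);
            optimizer.setDaemon(true);
            optimizer.start();
        }
    }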
When presenting the benchmark numbers, the authors didn't say how the
numbers for the optimized apps were obtained -- since the programs are
being continuously optimized, how did the authors decide when to take
the performance numbers? If they just waited for some time and then
measured, how long did they wait? What did the curve from the original
performance to the optimized performance look like (i.e., was it steep,
shallow, etc.)?
When calculating the break-even points for the benchmarks, the authors
claim that only the first optimization done to the app needs to pay
the price of identifying hot-spots and inserting profiling code in the
correct places; however, if the program behavior changes over time
(which is after all the whole premise of the paper) then perhaps it is
necessary for future optimizations to redo this work, making this
assumption invalid.
The break-even points for the benchmarks were computed using a
back-of-the-envelope calculation -- more convincing numbers would come
from simulations or actual experimentation.
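The calculation itself is simple to state (my formulation, with invented numbers): if reoptimization costs C seconds of background work and saves s seconds per run, the break-even point is C / s runs.

    // Break-even sketch: how many runs until a one-time optimization cost
    // pays for itself. All numbers are invented for illustration.
    public class BreakEven {
        public static void main(String[] args) {
            double optimizationCost = 12.0; // seconds of reoptimization work (assumed)
            double timeBefore = 2.0;        // seconds per run, unoptimized (assumed)
            double timeAfter = 1.7;         // seconds per run, optimized (assumed)
            double savingsPerRun = timeBefore - timeAfter;
            // 12.0 / 0.3 -> break-even after 40 runs; if the app's behavior
            // shifts before then, the cost is never recouped.
            System.out.printf("Break-even after %.0f runs%n",
                              Math.ceil(optimizationCost / savingsPerRun));
        }
    }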
The benchmark performances were calculated without taking into account
the overheads of optimizing the code and hot-swapping it back into the
app; probably because adding those in kills any performance gain from
using the system. Besides the obvious overhead of performing the
optimizations themselves (which, unless idle cycles can be correctly
identified, will impact the app's performance), there is a significant
amount of work involved in swapping the optimized code back into the
app -- e.g., keeping track of threads that are in the affected region
of code, updating dependencies, and possibly undoing previous
optimizations in other regions of code that impinge on this one (a
sketch of just the thread-tracking part follows).
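The sketch below is hypothetical (not the paper's mechanism): every entry to and exit from the affected region is counted, and a swap waits for the count to drain. Even this toy version has a race between the drain check and the swap, which hints at how much harder doing it correctly would be.

    import java.util.concurrent.atomic.AtomicInteger;

    final class SwappableRegion {
        private final AtomicInteger active = new AtomicInteger();
        private volatile Runnable code = () -> { /* old version */ };

        void invoke() {
            active.incrementAndGet();     // entry bookkeeping, paid on every call
            try {
                code.run();
            } finally {
                active.decrementAndGet(); // exit bookkeeping
            }
        }

        void swap(Runnable newVersion) throws InterruptedException {
            while (active.get() != 0) {   // wait for in-flight threads to drain
                Thread.sleep(1);
            }
            code = newVersion;            // a late arrival may still run old code
        }
    }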
Broader Picture
An obvious point of future work is to do the studies suggested in the
first two points of the Flaws section. The authors also recognize some
future work left to do (e.g., calculating whether the benefit of a
particular optimization is worth the cost, figuring out the most useful
metrics to collect). Another interesting question is what other
traditional optimizations (other than instruction scheduling) could be
adapted to benefit from continuous optimization. Some quick ideas --
some optimizations, like Partial Redundancy Elimination, have the
potential to make performance worse rather than better; dynamic
profiling can detect when this happens and remove the offending
optimization. Perhaps the metrics reported by the profiling could help
in calculating more accurate spill costs for register allocation; if
the app's behavior changes, the spill costs may also change.
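For the register-allocation idea: Chaitin-style allocators already weight each use and def of a variable by an estimated execution frequency, so continuous profiling could substitute measured block frequencies for the static estimate. A sketch with invented names:

    import java.util.List;
    import java.util.Map;

    final class ProfiledSpillCost {
        // Spill cost of a variable: sum of measured execution counts of the
        // blocks that use or define it, instead of a static loop-depth guess.
        static double spillCost(List<String> blocksTouchingVar,
                                Map<String, Long> measuredBlockFrequency) {
            return blocksTouchingVar.stream()
                    .mapToLong(b -> measuredBlockFrequency.getOrDefault(b, 0L))
                    .sum();
        }

        public static void main(String[] args) {
            // A variable used in a hot loop header is far costlier to spill
            // than one used only on a cold error path -- and if the app's
            // behavior shifts, these numbers shift with it.
            System.out.println(spillCost(
                    List.of("loopHeader", "coldErrorPath"),
                    Map.of("loopHeader", 1_000_000L, "coldErrorPath", 3L)));
        }
    }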
A big problem revealed by the work is the cost of dynamically
profiling an application, which involves a large amount of overhead. It
is a problem we may also have seen in the Crusoe paper -- they
didn't give performance numbers at all, much less break them down into
causes, but my impression is that performance was a problem, and some
percentage of that came from the profiling they did. Of course, Crusoe
did its profiling in HW, which presumably is much less expensive than
SW profiling -- it would be very interesting to see some sort of
breakdown of Crusoe performance and compare the profiling cost there
with the profiling cost in the system described in this paper.