Continuous Program Optimization: Design and Evaluation, T. Kistler and M. Franz. IEEE Transactions on Computers, 50(6), June 2001, pp. 549-566.
Abstract
This paper presents a system in which the already executing user code
is continually and automatically reoptimized in the background, using
dynamically collected execution profiles as a guide. Whenever a new
code image has been constructed in the background in this manner, it
is hot-swapped in place of the previously executing one. Control is
then transferred to the new code and construction of yet another code
image is initiated in the background. Two new runtime optimization
techniques have been implemented in the context of this system: object
layout adaptation and dynamic trace scheduling. The former technique
constantly improves the storage layout of dynamically allocated data
structures to improve data cache locality. The latter increases the
instruction-level parallelism by continually adapting the instruction
schedule to predominantly executed program paths. The empirical
results presented in this paper make a case in favor of continuous
optimization, but also indicate some of the pitfalls and current
shortcomings of continuous optimization. If not applied judiciously,
the costs of dynamic optimizations outweigh their benefit in many
situations so that no break-even point is ever reached. In favorable
circumstances, however, speed-ups of over 96 percent have been
observed. It appears as if the main beneficiaries of continuous
optimization are shared libraries in specific application domains
which, at different times, can be optimized in the context of the
currently dominant client application.
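To make the first of these techniques concrete before the commentary: the sketch below is my own illustration (in Java, where the JVM actually controls layout, whereas the paper's compiler manipulates layout directly) of hot/cold field splitting, a close relative of the profile-guided layout changes the abstract describes. The premise is that profiling shows a hot traversal touching only a couple of fields, so the rarely used fields get moved out of the way to improve cache density.

    // Hypothetical illustration of profile-guided object layout adaptation.
    // Suppose the profiler reports that a hot traversal reads only 'key' and
    // 'next'; the rarely touched fields are split into a separate object so
    // that more hot nodes fit into each cache line.

    // Before adaptation: hot and cold fields interleaved in one object.
    class FatNode {
        int key;            // hot: read on every traversal step
        FatNode next;       // hot
        String description; // cold: almost never read
        long createdAt;     // cold
    }

    // After adaptation: the traversal streams through small, dense nodes and
    // pays one extra indirection only on the rare cold-field access.
    class HotNode {
        int key;
        HotNode next;
        ColdData cold;
    }

    class ColdData {
        String description;
        long createdAt;
    }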
Novelty
The main novelty of the system presented in this paper is the concept
of dynamic adaptability -- rather than a particular procedure being
optimized (perhaps incrementally over time) until it reaches some
final, optimal state, the procedure is continually reoptimized with
regard to the current usage characteristics as recorded by a dynamic
profiler. The difference is that in addition to using the profiler to
determine *which* procedures should be optimized, it is also used to
determine *how* to optimize them.
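As a sketch of how the pieces fit together (the names and interfaces below are my own invention, not the paper's):

    import java.util.Map;

    // Minimal sketch of the continuous-optimization loop described in the
    // paper: profile the running code, rebuild an optimized image in the
    // background, hot-swap it in, and start over. Profiler and CodeImage
    // are invented stand-ins, not the paper's actual interfaces.
    final class ContinuousOptimizer implements Runnable {
        private volatile CodeImage current;  // image the app is running
        private final Profiler profiler;     // collects execution profiles

        ContinuousOptimizer(CodeImage initial, Profiler profiler) {
            this.current = initial;
            this.profiler = profiler;
        }

        @Override
        public void run() {
            while (true) {
                // 1. Take a snapshot of the dynamically collected profile.
                Map<String, Long> profile = profiler.snapshot();

                // 2. The profile determines *which* procedures to reoptimize
                //    and *how* (e.g., which trace to schedule for, which
                //    object layout to pick).
                CodeImage candidate = current.reoptimize(profile);

                // 3. Hot-swap: transfer control to the new image, then
                //    start building yet another one.
                candidate.install();
                current = candidate;
            }
        }
    }

    interface Profiler { Map<String, Long> snapshot(); }

    interface CodeImage {
        CodeImage reoptimize(Map<String, Long> profile);
        void install();
    }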
The paper also discusses two new optimizations that take advantage of
this dynamic adaptability; these optimizations were previously
published by the authors, so technically they aren't novel to this
paper.
Flaws
The authors could have made a much stronger case for the need for
dynamic adaptability. They try to show its benefits using the two new
optimizations, but in my opinion all they showed was that those
optimizations have very limited usefulness; this doesn't say anything
one way or the other about the usefulness of dynamic adaptability
itself. The authors should have done a study showing quantitatively
that there exists a large class of
applications whose behavior at a procedural level changes over
time. This would have shown that the concept of dynamic adaptability
has merit -- instead the authors have some hand-wavy examples that
aren't very convincing.
An important question (that would be answered by the above suggested
study) is over what time-scale a particular application's behavior
changes. If the time taken to optimize a procedure (which is done in
the background and hence is fairly slow) is not much smaller than this
time-scale, then the system will always be optimizing the code for
behavior that is no longer relevant -- it will always be playing
catch-up.
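To put rough numbers on the catch-up concern (mine, purely illustrative):

    // Back-of-the-envelope check with invented numbers: the system only
    // wins when reoptimization is much faster than the rate at which the
    // program's behavior changes.
    public class CatchUp {
        public static void main(String[] args) {
            double reoptSeconds = 30.0; // time to rebuild + swap an image (assumed)
            double phaseSeconds = 20.0; // how long one behavior phase lasts (assumed)
            if (reoptSeconds >= phaseSeconds) {
                System.out.println("Every image is stale on arrival: always playing catch-up.");
            } else {
                double staleFraction = reoptSeconds / phaseSeconds;
                System.out.printf("Each phase runs unoptimized for %.0f%% of its lifetime.%n",
                                  100 * staleFraction);
            }
        }
    }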
The authors suggest that shared libraries could be a main beneficiary
of continuous optimization, making the assumption that users usually
use the libraries for one thing at a time; this ignores multi-user
systems in which, even though any one user may not use the library for
multiple things at once, different users are simultaneously using it
for different purposes.
The authors basically punt on the problem of identifying idle cycles
that can be used for the optimizations without impacting the
performance of the application. They recognize this as a problem, but
don't offer a real solution; the system they implemented simply
assumes that putting the optimizer on a low-priority thread
approximates this closely enough, but I am not convinced -- there is
no guarantee at all that the low-priority thread isn't still stealing
usable cycles from the app.
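Their approach boils down to something like this (a Java rendering of the idea, not their actual system):

    public class BackgroundOptimizer {
        public static void main(String[] args) {
            Thread optimizer = new Thread(() -> {
                // ... continuously reoptimize code images here ...
            });
            // MIN_PRIORITY is merely a hint to the scheduler; nothing
            // guarantees that this thread consumes only cycles the
            // application couldn't have used itself.
            optimizer.setPriority(Thread.MIN_PRIORITY);
            optimizer.setDaemon(true);
            optimizer.start();
        }
    }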
When presenting the benchmark numbers, the authors didn't say how the
numbers for the optimized apps were obtained -- since the programs are
being continuously optimized, how did the authors decide when to take
the performance numbers? If they just waited for some time and then
measured, how long did they wait? What did the curve from the original
performance to the optimized performance look like (i.e., was it steep,
shallow, etc.)?
When calculating the break-even points for the benchmarks, the authors
claim that only the first optimization done to the app needs to pay
the price of identifying hot-spots and inserting profiling code in the
correct places; however, if the program behavior changes over time
(which is after all the whole premise of the paper) then perhaps it is
necessary for future optimizations to redo this work, making this
assumption invalid.
The break-even points for the benchmarks were computed using a
back-of-the-envelope calculation -- more convincing numbers would come
from simulations or actual experimentation.
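The calculation itself is simple to state (my formulation, with invented numbers): if reoptimization costs C seconds of background work and saves s seconds per run, the break-even point is C / s runs.

    // Break-even sketch: how many runs until a one-time optimization cost
    // pays for itself. All numbers are invented for illustration.
    public class BreakEven {
        public static void main(String[] args) {
            double optimizationCost = 12.0; // seconds of reoptimization work (assumed)
            double timeBefore = 2.0;        // seconds per run, unoptimized (assumed)
            double timeAfter = 1.7;         // seconds per run, optimized (assumed)
            double savingsPerRun = timeBefore - timeAfter;
            // 12.0 / 0.3 -> break-even after 40 runs; if the app's behavior
            // shifts before then, the cost is never recouped.
            System.out.printf("Break-even after %.0f runs%n",
                              Math.ceil(optimizationCost / savingsPerRun));
        }
    }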
The benchmark performances were calculated without taking into account
the overheads of optimizing the code and hot-swapping it back into the
app; probably because adding those in kills any performance gain from
using the system. Besides the obvious overhead of performing the
optimizations themselves (which, unless idle cycles can be correctly
identified, will impact the app's performance), there is a significant
amount of work involved in swapping the optimized code back into the
app -- e.g., keeping track of threads that are in the affected region
of code, updating dependencies, and possibly undoing previous
optimizations in other regions of code that impinge on this one (a
sketch of just the thread-tracking part follows).
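The sketch below is hypothetical (not the paper's mechanism): every entry to and exit from the affected region is counted, and a swap waits for the count to drain. Even this toy version has a race between the drain check and the swap, which hints at how much harder doing it correctly would be.

    import java.util.concurrent.atomic.AtomicInteger;

    final class SwappableRegion {
        private final AtomicInteger active = new AtomicInteger();
        private volatile Runnable code = () -> { /* old version */ };

        void invoke() {
            active.incrementAndGet();     // entry bookkeeping, paid on every call
            try {
                code.run();
            } finally {
                active.decrementAndGet(); // exit bookkeeping
            }
        }

        void swap(Runnable newVersion) throws InterruptedException {
            while (active.get() != 0) {   // wait for in-flight threads to drain
                Thread.sleep(1);
            }
            code = newVersion;            // a late arrival may still run old code
        }
    }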
Broader Picture
An obvious point of future work is to do the studies suggested in the
first two points of the Flaws section. The authors also recognize some
future work left to do (e.g., calculating whether the benefit of a
particular optimization is worth the cost, figuring out the most useful
metrics to collect). Another interesting question is what other
traditional optimizations (other than instruction scheduling) could be
adapted to benefit from continuous optimization. Some quick ideas --
some optimizations, like Partial Redundancy Elimination, have the
potential to make performance worse rather than better; dynamic
profiling can detect when this happens and remove the offending
optimization. Perhaps the metrics reported by the profiling could help
in calculating more accurate spill costs for register allocation; if
the app's behavior changes, the spill costs may also change.
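For the register-allocation idea: Chaitin-style allocators already weight each use and def of a variable by an estimated execution frequency, so continuous profiling could substitute measured block frequencies for the static estimate. A sketch with invented names:

    import java.util.List;
    import java.util.Map;

    final class ProfiledSpillCost {
        // Spill cost of a variable: sum of measured execution counts of the
        // blocks that use or define it, instead of a static loop-depth guess.
        static double spillCost(List<String> blocksTouchingVar,
                                Map<String, Long> measuredBlockFrequency) {
            return blocksTouchingVar.stream()
                    .mapToLong(b -> measuredBlockFrequency.getOrDefault(b, 0L))
                    .sum();
        }

        public static void main(String[] args) {
            // A variable used in a hot loop header is far costlier to spill
            // than one used only on a cold error path -- and if the app's
            // behavior shifts, these numbers shift with it.
            System.out.println(spillCost(
                    List.of("loopHeader", "coldErrorPath"),
                    Map.of("loopHeader", 1_000_000L, "coldErrorPath", 3L)));
        }
    }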
A big problem revealed by the work is the cost of dynamically
profiling an application, which involves a large amount of overhead. It
is a problem we may also have seen in the Crusoe paper -- they
didn't give performance numbers at all, much less break them down into
causes, but my impression is that performance was a problem, and some
percentage of that came from the profiling they did. Of course, Crusoe
did its profiling in HW, which presumably is much less expensive than
SW profiling -- it would be very interesting to see some sort of
breakdown of Crusoe performance and compare the profiling cost there
with the profiling cost in the system described in this paper.