Thursday, December 10, 2009

Loop unrolling optimization

Loop unrolling:

Each loop iteration costs two instructions in addition to the body of the loop: a subtract to decrement the loop count and a conditional branch.

We call these instructions the loop overhead. On ARM7 or ARM9 processors the
subtract takes one cycle and the branch three cycles, giving an overhead of four cycles
per loop.

You can save some of these cycles by unrolling a loop—repeating the loop body several
times, and reducing the number of loop iterations by the same proportion. For example,
let’s unroll our packet checksum example four times.
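The packet checksum code itself is not reproduced in this post, so the following is only a minimal sketch of what a four-times unrolled version might look like. The function names `checksum_simple` and `checksum_unrolled4`, the `int` word type, and the tail loop for leftover words are all assumptions made for illustration.

```c
/* Baseline (hypothetical name): one word summed per iteration, so the
   subtract and conditional branch are paid for every word. */
int checksum_simple(const int *data, unsigned int n)
{
    int sum = 0;

    while (n != 0) {
        sum += *data++;
        n--;
    }
    return sum;
}

/* Unrolled four times (hypothetical name): the loop overhead is paid
   once per four words instead of once per word. */
int checksum_unrolled4(const int *data, unsigned int n)
{
    int sum = 0;
    unsigned int blocks = n / 4;   /* full groups of four words */
    unsigned int rest = n % 4;     /* leftover words, 0 to 3 */

    while (blocks != 0) {
        sum += data[0];
        sum += data[1];
        sum += data[2];
        sum += data[3];
        data += 4;
        blocks--;
    }
    while (rest != 0) {            /* tail loop: an assumption, not from the text */
        sum += *data++;
        rest--;
    }
    return sum;
}
```

If the word count is known to be a multiple of four, the tail loop can be dropped, which keeps the unrolled routine smaller.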

Only unroll loops that are important for the overall performance of the application.
Otherwise unrolling will increase the code size with little performance benefit.
Unrolling may even reduce performance by evicting more important code from the cache.


Suppose the loop is important, accounting for, say, 30% of the application's execution time. Suppose you
unroll the loop until it is 0.5 KB in code size (128 instructions). Then the loop overhead
is at most 4 cycles compared to a loop body of around 128 cycles. The loop overhead cost
is 4/128, roughly 3%. Recalling that the loop is 30% of the entire application, overall the
loop overhead is only 1%. Unrolling the code further gains little extra performance, but has
a significant impact on the cache contents. It is usually not worth unrolling further when
the gain is less than 1%.
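The arithmetic above can be checked with a small helper. This is only a sketch: the function names and the 1% threshold parameter are assumptions for illustration, and it uses the same approximation as the text (overhead divided by body cycles, not by body plus overhead).

```c
/* Approximate share of the whole program's cycles spent on loop overhead.
   overhead_cycles: subtract + branch per iteration (4 on ARM7/ARM9 per the text)
   body_cycles:     cycles for one pass through the (unrolled) loop body
   loop_share:      fraction of program time spent in this loop, e.g. 0.30 */
double loop_overhead_share(double overhead_cycles, double body_cycles,
                           double loop_share)
{
    return (overhead_cycles / body_cycles) * loop_share;
}

/* Returns 1 if the remaining overhead is still at or above the threshold
   (0.01 for the 1% rule of thumb), 0 otherwise. Hypothetical helper. */
int worth_unrolling_further(double overhead_cycles, double body_cycles,
                            double loop_share, double threshold)
{
    return loop_overhead_share(overhead_cycles, body_cycles, loop_share)
           >= threshold;
}
```

With the numbers from the example, `loop_overhead_share(4.0, 128.0, 0.30)` gives 4/128 * 0.30, a little under 1%, so `worth_unrolling_further(4.0, 128.0, 0.30, 0.01)` returns 0.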

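Let N be the number of execution cycles for one pass through the loop body, and let the loop overhead be at most 4 cycles per iteration (the subtract takes one cycle and the branch three). If the loop runs k iterations and the whole program takes M cycles, the loop overhead accounts for roughly (4/N) * (kN/M) = 4k/M of the program's execution cycles. Once that share falls below 1%, further loop unrolling is not worth it.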
