Thursday, December 10, 2009

A Sample of Writing Loops Efficiently on ARM

/*

How to compile and disassemble:

/usr/local/arm/4.2.2-eabi/usr/bin/arm-unknown-linux-gnueabi-gcc -O -fomit-frame-pointer -c test-loop.c

/usr/local/arm/4.2.2-eabi/usr/bin/arm-unknown-linux-gnueabi-objdump -S ./test-loop.o

Full optimizations are turned off by default for the GNU compiler. The -fomit-frame-pointer
switch prevents the GNU compiler from maintaining a frame pointer register.
Frame pointers assist the debug view by pointing to the local variables stored on the stack
frame. However, they are inefficient to maintain and shouldn’t be used in code critical to
performance.


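Compared with checksum_v2, checksum_v1 executes fewer instructions (better code density) and needs fewer execution cycles, because the for loop in checksum_v2 must test "N != 0" before its first iteration, while the do-while loop in checksum_v1 does not.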


Loop unrolling:

Each loop iteration costs two instructions in addition to the body of the loop: a subtract to decrement the loop count and a conditional branch.

We call these instructions the loop overhead. On ARM7 or ARM9 processors the
subtract takes one cycle and the branch three cycles, giving an overhead of four cycles
per loop.

You can save some of these cycles by unrolling a loop—repeating the loop body several
times, and reducing the number of loop iterations by the same proportion. For example,
let’s unroll our packet checksum example four times (see checksum_v3 below).

Only unroll loops that are important for the overall performance of the application.
Otherwise unrolling will increase the code size with little performance benefit.
Unrolling may even reduce performance by evicting more important code from the cache.


Suppose the loop is important, for example, 30% of the entire application. Suppose you
unroll the loop until it is 0.5 KB in code size (128 instructions). Then the loop overhead
is at most 4 cycles compared to a loop body of around 128 cycles. The loop overhead cost
is 4/128, roughly 3%. Recalling that the loop is 30% of the entire application, overall the
loop overhead is only 1%. Unrolling the code further gains little extra performance, but has
a significant impact on the cache contents. It is usually not worth unrolling further when
the gain is less than 1%.

checksum_v3 is optimized with loop unrolling.

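Suppose the instructions in the loop body take N execution cycles, the loop overhead is at most 4 cycles (the subtract takes one cycle and the branch three cycles), and the whole program takes M execution cycles. The loop overhead as a share of the whole program is then roughly (4/N)*(N/M) = 4/M. Once that share is below 1%, further loop unrolling is not worthwhile.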


Summary: Writing Loops Efficiently

■ Use loops that count down to zero. Then the compiler does not need to allocate
a register to hold the termination value, and the comparison with zero is free.
■ Use unsigned loop counters by default and the continuation condition i!=0 rather than
i>0. This will ensure that the loop overhead is only two instructions.
■ Use do-while loops rather than for loops when you know the loop will iterate at least
once. This saves the compiler checking to see if the loop count is zero.
■ Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop
overhead is small as a proportion of the total, then unrolling will increase code size and
hurt the performance of the cache.
■ Try to arrange that the number of elements in arrays are multiples of four or eight. You
can then unroll loops easily by two, four, or eight times without worrying about the
leftover array elements.


*/
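
The overhead estimate in the comment above can be reproduced with a small calculation. This is only a sketch of the text's approximation (overhead cycles divided by body cycles, scaled by the loop's share of the application). The function name loop_overhead_share and its parameters are hypothetical and are not part of the original sample.

```c
#include <assert.h>

/* Hypothetical helper, not from the original sample.
 * overhead_cycles: cycles per iteration spent on the subtract and branch
 *                  (4 on ARM7/ARM9 according to the text above)
 * body_cycles:     cycles per iteration spent in the loop body
 * loop_share:      fraction of the application's run time spent in the loop
 * Returns the approximate fraction of total run time that is loop overhead. */
double loop_overhead_share(double overhead_cycles, double body_cycles,
                           double loop_share)
{
    assert(body_cycles > 0.0);
    assert(loop_share >= 0.0 && loop_share <= 1.0);
    return (overhead_cycles / body_cycles) * loop_share;
}
```

With the numbers from the text, loop_overhead_share(4.0, 128.0, 0.30) comes out just under 0.01, which is the "only 1%" figure at which further unrolling stops paying off.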


/*

00000000 <checksum_v1>:
0: e1a02000 mov r2, r0
4: e3a00000 mov r0, #0 ; 0x0
8: e4d23001 ldrb r3, [r2], #1
c: e0800003 add r0, r0, r3
10: e2511001 subs r1, r1, #1 ; 0x1
14: 1afffffb bne 8
18: e12fff1e bx lr


*/


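/* Precondition: N >= 1. The do-while body runs before N is tested,
   so N == 0 would read past the end of the buffer. */
unsigned int checksum_v1(unsigned char *data, int N)
{
    unsigned int sum = 0;

    do {
        sum += *(data++);
    } while (--N != 0);

    return sum;
}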



/*

0000001c <checksum_v2>:
1c: e3510000 cmp r1, #0 ; 0x0
20: 01a02001 moveq r2, r1
24: 0a000004 beq 3c
28: e3a02000 mov r2, #0 ; 0x0
2c: e4d03001 ldrb r3, [r0], #1
30: e0822003 add r2, r2, r3
34: e2511001 subs r1, r1, #1 ; 0x1
38: 1afffffb bne 2c
3c: e1a00002 mov r0, r2
40: e12fff1e bx lr



*/


/* Handles N == 0: the for loop tests N before the first iteration. */
unsigned int checksum_v2(unsigned char *data, int N)
{
    unsigned int sum = 0;

    for (; N != 0; N--)
        sum += *(data++);

    return sum;
}
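
The summary's advice to count down to zero with an unsigned counter and an i != 0 test can be sketched next to a conventional count-up loop. Both function names below are hypothetical and do not appear in the original sample. Whether the count-down form really saves a register and a compare depends on the compiler and optimization level, so it is worth checking the objdump output as done for the functions in this file.

```c
#include <assert.h>
#include <stddef.h>

/* Count-up form: the loop must keep both i and the limit n available
 * and compare them on every iteration. */
unsigned int checksum_count_up(const unsigned char *data, unsigned int n)
{
    unsigned int sum = 0;
    unsigned int i;

    assert(data != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Count-down form: the only loop state is the counter itself, and the
 * termination test is a comparison with zero. */
unsigned int checksum_count_down(const unsigned char *data, unsigned int n)
{
    unsigned int sum = 0;
    unsigned int i;

    assert(data != NULL || n == 0);
    for (i = n; i != 0; i--)
        sum += *(data++);
    return sum;
}
```

Both functions return the same sum for the same input; only the shape of the loop control differs.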


/*
00000044 <checksum_v3>:
44: e3a0c000 mov ip, #0 ; 0x0
48: e5d03001 ldrb r3, [r0, #1]
4c: e5d02000 ldrb r2, [r0]
50: e0833002 add r3, r3, r2
54: e5d02002 ldrb r2, [r0, #2]
58: e0833002 add r3, r3, r2
5c: e5d02003 ldrb r2, [r0, #3]
60: e0833002 add r3, r3, r2
64: e08cc003 add ip, ip, r3
68: e2800004 add r0, r0, #4 ; 0x4
6c: e2511004 subs r1, r1, #4 ; 0x4
70: 1afffff4 bne 48
74: e1a0000c mov r0, ip
78: e12fff1e bx lr


*/

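/* Precondition: N is a positive multiple of 4. Otherwise N never equals
   zero at the loop test and the loop reads past the end of the buffer. */
unsigned int checksum_v3(unsigned char *data, int N)
{
    unsigned int sum = 0;

    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);

    return sum;
}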
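
checksum_v3 only works when N is a positive multiple of four, which is why the summary recommends sizing arrays in multiples of four or eight. When that cannot be guaranteed, one common approach is to run the unrolled loop over the full groups of four and finish with a short tail loop. The function below is a minimal sketch of that idea. The name checksum_v3_any and its unsigned count parameter are assumptions for illustration and are not part of the original sample.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of checksum_v3 (not from the original sample):
 * sums n bytes for any n, including 0 and counts that are not
 * multiples of four. */
unsigned int checksum_v3_any(const unsigned char *data, unsigned int n)
{
    unsigned int sum = 0;
    unsigned int blocks = n >> 2;  /* number of full 4-byte groups */
    unsigned int rest = n & 3;     /* 0 to 3 leftover bytes */

    assert(data != NULL || n == 0);

    /* Unrolled main loop: counts down to zero, so the loop overhead is
     * paid once per four bytes. */
    for (; blocks != 0; blocks--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }

    /* Tail loop for the bytes that do not fill a group of four. */
    for (; rest != 0; rest--)
        sum += *(data++);

    return sum;
}
```

The tail loop costs a little extra code and up to three extra iterations, so padding the data to a multiple of four, as the summary suggests, remains the cheaper option when the data layout is under your control.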
