    for (i=0; i<10; i++)
        sum += 1;
    return sum;
}
Simulating an ARM7-based NXP microcontroller [9] running at 12 MHz (the target used in this article unless otherwise specified), and compiling with a proprietary ARM C compiler at the highest optimization level (-O3, which will be discussed in more detail later), yields the following ARM assembly code:
        MOV   r0, #0x00000000
        MOV   r1, r0
loop    ADD   r1, r1, #0x00000001
        CMP   r1, #0x0000000A
        ADD   r0, r0, #0x00000001
        BCC   loop
The compiler uses r0 as the sum and r1 as the counter i. Each time around the loop it increments both the counter and the sum and compares the counter to ten. The reasons these particular registers are used can be found in ARM’s procedure call standard (called the AAPCS [1]). The CMP instruction sets the condition code flags based on r1 minus ten, and the BCC instruction branches back to the loop label if the carry flag is clear in the condition code register (that is, while r1 is still below ten as an unsigned value).
Profiling the code in the simulator shows that countUp takes 1.050 microseconds to execute. However, the compiler can do better than this by using conditional execution and flags. As a general rule, loops running on ARM processors should always be written so that the counter decrements down to zero:
int countDown() {
    unsigned int i;
    int sum = 0;
    for (i=10; i!=0; i--)
        sum += 1;
    return sum;
}
With the same target and compiler settings, the compiler produced the following output:
        MOV   r0, #0x00000000
        MOV   r1, #0x0000000A
loop    SUBS  r1, r1, #0x00000001
        ADD   r0, r0, #0x00000001
        BNE   loop
The compiler initializes the loop counter to ten and decrements the counter down to zero, setting the condition code flags with each subtract (indicated by using SUBS versus SUB). The code falls out of the loop once the Z flag in the condition code register is set (meaning a zero result was produced by the SUBS instruction). This is one example of how the ARM processor can significantly reduce code size and execution time via intelligent conditional execution and flag usage. Profiling the code in the simulator shows that countDown takes 0.883 microseconds to execute. The code is smaller and faster because there is no need to make a comparison (using the CMP instruction) on each iteration of the loop. Although targeted at the ARM architecture, mostly because of its ability to conditionally execute instructions, this is an instance of a source-level hand optimization that could work for other architectures.
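The same count-down idiom carries over to loops that do real work. The following is a minimal sketch, not taken from the article's measurements: the function name sum_array_down and its parameters are hypothetical, and whether a given compiler actually emits a SUBS/BNE pair for it should be confirmed in the generated assembly.

```c
/* Hypothetical helper (not from the article): sums n array elements
 * using a loop counter that decrements to zero, so the loop test can
 * reuse the flags set by the decrement instead of a separate compare. */
int sum_array_down(const int *data, unsigned int n)
{
    int sum = 0;
    unsigned int i;

    /* The counter runs n, n-1, ..., 1; the pointer walks forward. */
    for (i = n; i != 0; i--)
        sum += *data++;

    return sum;
}
```

The counter is kept separate from the array index so that the elements are still visited in ascending address order while the termination test remains a comparison against zero.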
Another source-level optimization that is even more specific to the ARM architecture deals with the way in which parameters are passed. Part of the procedure call standard used by the compiler states that no more than four parameters can be passed to a function by way of registers, which is a faster method than passing through stack memory. Passing more than four parameters to a function will result in the compiler spilling the fifth, sixth, seventh, etc., parameters onto stack memory. Therefore, programmers should try to pass no more than four parameters to functions if possible. If that is not possible, the most frequently used parameters should be passed before the others so that they are the ones that arrive in registers.
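One common way to stay within the four-register limit is to group related arguments into a structure and pass a single pointer to it. The sketch below is illustrative only: the SixParams type, its field names, and both functions are hypothetical. It also assumes the fields would have lived in memory anyway; if the caller has to build the structure just for the call, the stores may cost as much as the stack spill they replace, so the result should be measured.

```c
/* Hypothetical parameter block; the field names are illustrative only. */
typedef struct {
    int a, b, c, d, e, f;
} SixParams;

/* Six separate arguments: under the AAPCS the first four arrive in
 * r0-r3, while e and f are passed through stack memory. */
int weighted_sum_args(int a, int b, int c, int d, int e, int f)
{
    return a + 2 * b + 3 * c + 4 * d + 5 * e + 6 * f;
}

/* One pointer argument: it fits in a single register, and the callee
 * loads only the fields it needs. */
int weighted_sum_struct(const SixParams *p)
{
    return p->a + 2 * p->b + 3 * p->c + 4 * p->d + 5 * p->e + 6 * p->f;
}
```

Both functions compute the same value; only the way the operands reach the callee differs.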
There are other architectures that have similar standards, e.g., MIPS, but this type of optimization is more architecture-specific than, for example, a compiler’s ability to inline function calls [8]. Inlining lets a compiler expand a function call into the actual function body, typically improving the execution time of the application by eliminating function call overhead. However, this usually comes at the expense of code size. Inline expansion can also enable further optimizations, because the expanded code is no longer bound by the procedure call restrictions. Although inlining is a very common compiler optimization (even on other types of systems such as supercomputers), most developers do not write high-level code with it in mind. It is usually left completely up to the compiler, which performs it when the optimization is activated.
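For developers who do want to give the compiler a hint, a small helper can be marked as an inline candidate at the source level. The sketch below assumes a C99-capable compiler; the names clamp_to_byte and brighten are hypothetical, and the inline keyword is only a request that the compiler is free to ignore.

```c
/* Hypothetical helper: limits a value to the range 0..255.
 * "static inline" (C99) marks it as a candidate for inline expansion. */
static inline int clamp_to_byte(int value)
{
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}

/* If the call is expanded inline, no branch-and-link to clamp_to_byte
 * is needed and the argument need not follow the call standard. */
int brighten(int pixel, int delta)
{
    return clamp_to_byte(pixel + delta);
}
```

Short, frequently called helpers such as this are the typical candidates, since the call overhead is large relative to the work done and the code-size cost of duplicating the body is small.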
Most compilers can use a set of general optimizations that usually depend on the stage of the application’s development as well as the programmer’s debugging needs. In general, assembler output that has been highly optimized is more difficult to understand and, consequently, to debug. The levels of general optimizations performed by the ARM C compiler are summarized in Table 1.
Most of the optimizations detailed in this article would be performed based on the level of generalized optimization selected. There are, however, some specific compiler switches that can override options for a particular type of optimization. In the case of the ARM C compiler, there must always be a level of optimization used, whether it be the default or specified by the developer.
Option   Description
-O0      The lowest level of optimization. Simple optimizations are performed so as not to impair the debug view. This switch gives the best possible debug view.
-O1      A restricted level of optimization giving a satisfactory debug view and good code density.
-O2      A high level of optimization which might give a less satisfactory debug view. This is the default option.
-O3      The highest, most aggressive level of optimization. The optimizations performed depend on whether the -Ospace or -Otime options are enabled. This typically leaves a poor debug view.
Table 1: General levels of optimization options for ARM C compiler.