The Fundamental Compiler Optimization Choice:
Code Size vs. Execution Speed

Many compilers allow developers to specify whether they want their code to be optimized more toward performance or toward code size. Speed used to be the most crucial consideration, but that has started to change as embedded systems get smaller and have more limited resources. Many modern embedded systems do not boot from a hard disk but instead use limited amounts of non-volatile memory such as Flash and ROM. Also, many of these systems do not use virtual memory, which limits the amount of RAM available.

Most compilers provide separate switches for optimizing toward size or toward speed, depending on the goals of the application. It is typically possible to apply different switches to different portions of the code. The ARM C compiler uses the following:

-Ospace

This switch, enabled by default, tells the compiler to apply optimizations for reducing image size, possibly at the expense of speed.

-Otime

This switch tells the compiler to apply optimizations for execution speed, possibly at the expense of code size.

As a trivial example, suppose we have a function that calculates a 16-bit checksum of a data packet containing at least ten 16-bit values and then adds five to it. A main function defines the packet and calls the checksum function:

short checksum(short *data, int n) {
  unsigned int i;
  int sum = 0;
  /* (i+n) counts down from 10+n to zero, so the loop body runs 10+n times */
  for (i = 10; (i+n) != 0; i--)
    sum += *(data++);
  return (sum+5);
}

int main() {
  short a[15] =
    {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  checksum(a,5);
  return 0;
}

Compiling with the -Otime option produces the following ARM assembly for the checksum function:

checksum
        ADDS     r12, r1, #0x0000000A
        MOV      r3, #0x00000000
        MOV      r2, #0x0000000A
        BEQ      done
loop
        LDRSH    r12, [r0], #0x02
        SUB      r2, r2, #0x00000001
        ADD      r3, r3, r12
        ADDS     r12, r2, r1
        BNE      loop
done
        ADD      r0, r3, #0x00000005
        MOV      r0, r0, LSL #16
        MOV      r0, r0, ASR #16
        BX       r14

It should be noted that this output was produced by also applying the -O3 and no_inline options. The no_inline option prevents the compiler from inlining any functions and is applied here only to show the differences between the -Otime and -Ospace options concisely. Otherwise, the checksum function would have been expanded inside main.

Compare this with the output produced by the -Ospace switch, again together with the -O3 option:

checksum
        MOV      r3, #0x00000000
        MOV      r2, #0x0000000A
loop
        ADDS     r12, r2, r1
        LDRNESH  r12, [r0], #0x02
        SUBNE    r2, r2, #0x00000001
        ADDNE    r3, r3, r12
        BNE      loop
        ADD      r0, r3, #0x00000005
        MOV      r0, r0, LSL #16
        MOV      r0, r0, ASR #16
        BX       r14

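In the -Otime output, the compiler duplicates the loop condition test once outside the loop, before the loop begins, so that inside the loop the test sits at the bottom and the body uses only unconditional instructions. In the -Ospace output the test sits at the top of the loop, and the final pass, in which the test fails, still steps through the conditional LDRNESH, SUBNE, and ADDNE instructions without executing them. The -Otime form avoids that extra pass through the loop, which speeds up execution of the algorithm but requires more instructions (13 versus 11 in the two listings). See Table 2 for the code size and execution speed comparisons based on simulations with the tools. Because the simulator is an instruction-level simulator of a microcontroller running at a fixed frequency, the execution times do not vary from run to run with host PC factors such as memory usage and background processes or threads.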

The LDRNESH instruction performs a signed halfword (16-bit) load from memory, using post-increment addressing, only if the Z flag is clear in the condition code register (meaning that the ADDS instruction produced a non-zero result). The BX instruction is a specific type of branch instruction that the compiler uses to return from functions, using the return address that was previously stored in the link register (r14). The last thing to note is the pair of shifts, LSL and ASR (Logical Shift Left and Arithmetic Shift Right), at the end of both functions. Because the function returns a short and all ARM registers are 32 bits wide, the compiler must ensure that the return value is a valid 16-bit quantity: shifting left by 16 bits and then arithmetically right by 16 bits keeps only the bottom 16 bits and sign-extends them across the register. For this reason, 32-bit integer data types should be used for local variables when possible. This eliminates the shifting and masking instructions that ensure data occupies only the appropriate bits of a register.

In the case of the ARM C compiler, -Otime or -Ospace will always be applied, whether it be the default or specified by the developer.

 

                                 Compiled with        Compiled with
                                 -Otime, -O3          -Ospace, -O3

Execution Time (microseconds)    2.383                2.467

Program Size* (bytes)            150                  134

*Program size reflects object file that includes main and all associated data, but not startup code.

Table 2: Checksum simulation comparisons.

References:

http://www.acm.org/crossroads