Architectural- and Processor-Specific Optimizations

As processor architectures advance, new instructions and enhancements are introduced. In many instances the compiler must be told the target processor and architecture in order to take advantage of the new features. Examples include having the compiler use new DSP instructions to speed up some mathematical library routines, and allowing unaligned memory accesses that would cause faults on a different processor.
Another reason to specify this information is to let the compiler schedule instructions to suit a particular processor’s pipeline [10]. Instruction scheduling can be used, for example, to avoid interlocks or to take advantage of a dual-issue pipeline. A dual-issue pipeline can pass two instructions from a decode stage to an execute stage in the same cycle [7, 12].
Consider the following C code:
void schedule(int *p, int *q, int z) {
    int x = *p;
    int y = *q;
    x = x * x;
    z = z + 1;
}

Unless told otherwise, the ARM C compiler schedules instructions for an ARM9 pipeline by default (which has no effect on code that will run on earlier cores), resulting in the following assembly output:
schedule
    LDR  r0, [r0]
    LDR  r1, [r1]
    MUL  r0, r0, r0
    ADD  r2, r2, #1
However, this code would not take advantage of dual-issue capability, which both the ARM Cortex-R4 core (which implements ARM architecture version 7-R) and the Cortex-A8 core (which implements ARM architecture version 7-A) have. Applying the --cpu=7-R or --cpu=7-A compiler switch to specify an architecture that can perform dual-issuing [12] produces the following output:
schedule
    LDR  r0, [r0]
    ADD  r2, r2, #1
    LDR  r1, [r1]
    MUL  r0, r0, r0
Since memory access and data processing instructions have separate back-end pipelines on these cores (see Figure 1), the initial LDR and ADD instructions can be executed simultaneously, allowing for increased performance.
Figure 1: The Cortex-R4’s Dual-issue Pipeline.
Multiply operations also have their own back-end pipeline. The Pre-Decode stage allows for branch prediction and some necessary instruction formatting.
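To make the effect of the reordering concrete, the following C sketch counts issue cycles for the two instruction orderings shown above under a deliberately simplified model. The pairing rule, the names Unit, Insn, can_pair and issue_cycles, and the register encoding are all assumptions made for illustration; they are not the Cortex-R4's actual issue rules, and the model ignores load-use latency and interlocks entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Back-end pipeline an instruction uses (toy classification). */
typedef enum { UNIT_MEM, UNIT_ALU, UNIT_MUL } Unit;

/* One instruction: its pipeline, destination register, and up to two
   source registers (-1 means "unused"). */
typedef struct {
    Unit unit;
    int dst;
    int src1;
    int src2;
} Insn;

/* Assumed pairing rule (hypothetical, not the real core's rule): two
   adjacent instructions issue together only if they use different
   pipelines and the second neither reads nor overwrites the register
   written by the first. */
static bool can_pair(const Insn *a, const Insn *b)
{
    if (a->unit == b->unit) return false;
    if (b->src1 == a->dst || b->src2 == a->dst) return false;
    if (b->dst == a->dst) return false;
    return true;
}

/* Count issue cycles for an in-order sequence, pairing greedily. */
int issue_cycles(const Insn *code, size_t n)
{
    int cycles = 0;
    size_t i = 0;
    assert(code != NULL || n == 0);
    while (i < n) {
        if (i + 1 < n && can_pair(&code[i], &code[i + 1]))
            i += 2;   /* both instructions issue in this cycle */
        else
            i += 1;   /* single issue */
        cycles++;
    }
    return cycles;
}

/* Ordering produced with the default (ARM9) scheduling. */
const Insn default_order[] = {
    { UNIT_MEM, 0, 0, -1 },  /* LDR r0, [r0]   */
    { UNIT_MEM, 1, 1, -1 },  /* LDR r1, [r1]   */
    { UNIT_MUL, 0, 0,  0 },  /* MUL r0, r0, r0 */
    { UNIT_ALU, 2, 2, -1 },  /* ADD r2, r2, #1 */
};

/* Ordering produced with --cpu=7-R or --cpu=7-A. */
const Insn dual_issue_order[] = {
    { UNIT_MEM, 0, 0, -1 },  /* LDR r0, [r0]   */
    { UNIT_ALU, 2, 2, -1 },  /* ADD r2, r2, #1 */
    { UNIT_MEM, 1, 1, -1 },  /* LDR r1, [r1]   */
    { UNIT_MUL, 0, 0,  0 },  /* MUL r0, r0, r0 */
};
```

Under this assumed rule, issue_cycles(default_order, 4) returns 3 and issue_cycles(dual_issue_order, 4) returns 2, because the back-to-back loads in the first ordering compete for the same pipeline. The absolute numbers say nothing about real hardware timing; the point is only that separating instructions that share a pipeline creates more pairing opportunities.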
Most ARM cores have the option to execute a compressed version of the standard instruction set. The 16-bit Thumb [5] instruction set is a compressed subset of the standard 32-bit ARM instruction set and can provide not only better code density but also better performance on 16-bit memory systems. These additional instruction sets are relevant here because they are usually generated by the compiler through a switch rather than hand-coded (which can prove difficult). Also, it may take multiple 16-bit Thumb instructions to accomplish what one 32-bit ARM instruction can do, which is yet another reason to leave the choice of instruction set to the compiler.
Let us revisit the checksum example. If we recompile this example with the -O3, -Ospace, and now the --thumb option, we get the following Thumb assembly output:
checksum
    PUSH   {r4}
    MOV    r2, #0x0A
    B      start
loop
    MOV    r4, #0x00
    LDRSH  r4, [r0, r4]
    ADD    r0, r0, #2
    SUB    r2, r2, #1
    ADD    r3, r4, r3
start
    ADD    r4, r2, r1
    BNE    loop
    ADD    r3, r3, #5
    POP    {r4}
    LSL    r0, r3, #16
    ASR    r0, r0, #16
    BX     LR
At half the width of standard ARM instructions, Thumb instructions are more limited in a number of ways, some of which are evident here. Compared to the ARM code, one of the most noticeable differences is that Thumb does not allow conditional execution of instructions other than branches, and data-processing instructions generally always set the condition code flags (there is no S-bit in Thumb instructions). Another difference is register usage. Most Thumb instructions can only access registers r0 to r7, and only a few can access registers r8 to r12, whereas ARM instructions can access all of the general-purpose registers. However, the code size will still typically be smaller due to the smaller instruction size. Also notice the PUSH and POP stack instructions, which are not part of the standard ARM instruction set. Table 3 compares the code size and execution speed when using Thumb assembly versus ARM assembly for the checksum example.
Although the difference in program size is very noticeable, execution time takes a relatively large hit because the version compiled with --thumb executes more instructions than the ARM version compiled without it. Moreover, the types of instructions available in Thumb might not have been as efficient.
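The size side of this trade-off is simple arithmetic, sketched below in C. The helper names code_bytes and thumb_is_smaller are hypothetical, and the sketch assumes every Thumb instruction is 16 bits wide and every ARM instruction 32 bits wide, which holds for the listings in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ARM_INSN_BYTES = 4, THUMB_INSN_BYTES = 2 };

/* Size in bytes of a sequence of fixed-width instructions. */
size_t code_bytes(size_t insn_count, size_t bytes_per_insn)
{
    /* Only the two fixed widths assumed above are meaningful here. */
    assert(bytes_per_insn == ARM_INSN_BYTES ||
           bytes_per_insn == THUMB_INSN_BYTES);
    return insn_count * bytes_per_insn;
}

/* True if the Thumb version occupies fewer bytes than the ARM version. */
bool thumb_is_smaller(size_t thumb_insns, size_t arm_insns)
{
    return code_bytes(thumb_insns, THUMB_INSN_BYTES) <
           code_bytes(arm_insns, ARM_INSN_BYTES);
}
```

For example, the 15 instructions in the Thumb checksum listing occupy code_bytes(15, THUMB_INSN_BYTES), that is, 30 bytes. More generally, the Thumb version stays smaller as long as it needs fewer than twice as many instructions as the ARM version, even though each extra instruction still costs execution time.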
ARM systems often employ several different memory types and sizes. In these systems it is common for the software to be a mix of 32-bit ARM and 16-bit Thumb segments (again, all specified through the compilation tools).
References: