Thursday, December 11, 2008

An Introduction to ARM Assembly Language (Jason Fuller)





Who is this document for?


This document is intended for anyone who occasionally needs to debug compiled ARM code at the assembly language level.


Why would I want to do that?


Because retail builds have compiler optimizations turned on, and compiler optimizations confuse the source-level debugger. For example, the values it displays for your local variables are often wrong because the real values are kept in registers, not on the stack where the debugger knows how to find them.


Registers


The ARM CPU has 16 registers, r0 through r15:


r0 through r3 are used as general purpose registers, but they are also used to pass the first four parameters into a function. Their values are not guaranteed to be preserved across function calls.


r0 is also used to hold the return value from a function.


r4 through r11 are general purpose registers. It is the responsibility of a called function to preserve these values, i.e., to ensure that the values of these registers are the same when the function exits as when it was entered.


r12 is a scratch register. Like r0 through r3, its value is not guaranteed to be preserved across function calls.


sp (r13) is the Stack Pointer. A called function must also leave it with the same value on exit as it had on entry.

lr (r14) is the Link Register, which stores the return address when a function is called. But note that lr can also be used as a general-purpose register when it is not being used as the link register, so be careful.

pc (r15) is the Program Counter, i.e., the instruction pointer.



Condition Flags


There are four condition flags that are set as the result of executing instructions:


N   Negative   Set if result is negative

Z   Zero       Set if result is zero

C   Carry      Set if a carry occurs, or a bit is shifted off the end by a shift instruction

V   Overflow   Set if signed overflow occurs, i.e., the signed result does not fit in 32 bits



Instructions


A full list of ARM instructions can be found at http://www.arm.com/pdfs/QRC0001H_rvct_v2.1_arm.pdf, but it's not particularly readable or educational, so I'll go over the most common instructions here.


Throughout this document, I'll use C code in the right column to explain what the assembly instruction in the left column does. I use C syntax simply because everyone is familiar with it.


Instruction              C language equivalent
-----------              ---------------------
mov rx, ry               rx = ry               // Move register ry into rx
mov rx, ry, lsl #5       rx = ry << 5          // Logical Shift Left
mov rx, ry, lsr #6       rx = ry >> 6          // Logical Shift Right
mov rx, #0x12            rx = 0x12             // Move 0x12 into rx
mov rx, #0x21, 28        rx = 0x21 rotated right 28 bits (see below)
str rx, [ry, #0x12]      *(DWORD *)((BYTE *)ry + 0x12) = rx
                         // Store rx into memory at ry + 0x12
str rx, [ry, rz]         *(DWORD *)((BYTE *)ry + rz) = rx
                         // Store rx into memory at ry + rz
ldr rx, [ry], -rz        rx = *ry; ry -= rz    // Load first, then update ry
cmp rx, ry               Compare rx to ry, and set the condition flags
                         accordingly, for example, Z = (rx == ry)
cmp rx, #0x12            Compare rx to 0x12
add rx, ry, #0x12        rx = ry + 0x12
sub rx, ry, #0x12        rx = ry - 0x12
mul rx, ry, rz           rx = ry * rz          // mul only takes registers
orr rx, ry, #2           rx = ry | 2
and rx, ry, #2           rx = ry & 2
bic rx, ry, #5           rx = ry & ~5          // Bit Clear
bx rx                    rx();                 // Jump to the address in rx



A little explanation is in order about instructions that use "shifter operands" such as mov rx, #0x21, 28. It may seem like an odd instruction but is actually quite common, because it allows the compiler to stuff many 32-bit constant values into just 12 bits of an instruction: 8 bits for the constant (0x21) and 4 bits for the rotation amount (28), which can be any even number between 0 and 30. Without this trick of stuffing the constant into the 32-bit instruction, the compiler would have to load the constant from memory, which is much more expensive.


By the way, the example we've been using:

mov rx, #0x21, 28        rx = 0x21 rotated right 28 bits

is essentially the same as (since none of the bits of 0x21 wrap around):

rx = 0x21 << (32 - 28)

or

rx = 0x21 << 4
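The rotation rule can be checked with a few lines of C. This is only a sketch: the function names (rotate_right, decode_immediate, encode_immediate) are made up for this illustration, and it models just the 8-bit-constant-plus-even-rotation scheme described above.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is reduced to 0..31). */
static uint32_t rotate_right(uint32_t value, unsigned n)
{
    n &= 31u;
    if (n == 0)
        return value;
    return (value >> n) | (value << (32u - n));
}

/* Value loaded by "mov rx, #imm8, rot": imm8 rotated right by rot bits. */
uint32_t decode_immediate(uint32_t imm8, unsigned rot)
{
    assert(imm8 <= 0xFFu);               /* only 8 bits of constant */
    assert(rot < 32u && rot % 2u == 0);  /* rotation must be even */
    return rotate_right(imm8, rot);
}

/* Returns 1 and fills *imm8 and *rot if constant can be written as an
   8-bit value rotated right by an even amount; returns 0 otherwise,
   in which case the constant would have to be loaded from memory. */
int encode_immediate(uint32_t constant, uint32_t *imm8, unsigned *rot)
{
    for (unsigned r = 0; r < 32u; r += 2u) {
        /* Rotating right by (32 - r) undoes a rotate right by r. */
        uint32_t candidate = rotate_right(constant, (32u - r) & 31u);
        if (candidate <= 0xFFu) {
            *imm8 = candidate;
            *rot = r;
            return 1;
        }
    }
    return 0;
}
```

With these helpers, decode_immediate(0x21, 28) gives 0x210, matching the shift shown above, and encode_immediate reports that a constant such as 0x12345678 does not fit.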



PC-relative addressing


Even though the use of shifter operands can sometimes allow the compiler to fit a constant into an instruction, sometimes the constant simply won't fit and must be loaded from memory. Where does the compiler put these constants? Right in the instruction stream, between where one function ends and the next one starts. This allows the compiler to load a constant using "pc-relative addressing", that is, using the program counter as if it were a pointer to data. For example:


ldr r1, [pc, #0x1C] DWORD *pc; r1 = pc[0x1c / 4];


The one catch is that the value of pc used in the pointer arithmetic is not the address of the instruction itself. It is the address of the instruction plus 8. (This is just an artifact of the way the chip works. By the time the instruction actually executes, the pc has already been incremented.) So, for example:


Address    Instruction   Disassembly
-------    -----------   -----------
01F05640   e59f101c      ldr r1, [pc, #0x1C]
01F05664   12345678      ???    // The data is at address 01F05640 + 8 + 1C


By the way, this is why when you're looking at a disassembly, some instructions will look weird, or show as ???. It's because they're not really instructions, they're data.
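The address arithmetic can be sketched in C. The function name literal_address is hypothetical, and the sketch assumes a 32-bit ARM-mode "ldr rX, [pc, #offset]" instruction, where the low 12 bits of the instruction word hold the offset and bit 23 says whether the offset is added or subtracted.

```c
#include <assert.h>
#include <stdint.h>

/* Address read by an ARM-mode "ldr rX, [pc, #offset]" instruction.
   instruction_address is where the instruction lives; instruction is
   its 32-bit encoding as shown in the disassembly window. */
uint32_t literal_address(uint32_t instruction_address, uint32_t instruction)
{
    /* Sanity check: word load, immediate offset, base register pc (r15). */
    assert((instruction & 0x0F7F0000u) == 0x051F0000u);

    uint32_t offset = instruction & 0xFFFu;      /* bits 11..0 */
    int add_offset = (instruction >> 23) & 1u;   /* bit 23: add or subtract */
    uint32_t pc_value = instruction_address + 8; /* pc reads 8 bytes ahead */

    return add_offset ? pc_value + offset : pc_value - offset;
}
```

For the example above, literal_address(0x01F05640, 0xE59F101C) works out to 0x01F05664, which is where the 12345678 data word sits.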



Suffixes


B and H

By default, instructions operate on 32-bit words. (Note that this definition of a word is different from the Win32 concept of a WORD, which is 16 bits.) However, an instruction that has the H suffix operates on halfwords (16 bits), and an instruction with the B suffix operates on bytes. For example:


strb r1, [r8, #0x28]     ((BYTE*)r8)[0x28] = (BYTE)r1



S

Some instructions take an optional S suffix, which means "update the condition flags based on the result of this instruction".



Conditional suffixes


Almost all ARM instructions are conditional, i.e., they can be modified by a suffix indicating under what conditions the instruction should be executed. Here are the most common condition suffixes:


Suffix   Meaning                       Condition under which instruction is to execute
------   ---------------------------   -----------------------------------------------
eq       equal                         Z == 1
ne       not equal                     Z == 0
hi       unsigned higher               C == 1 && Z == 0
ls       unsigned lower or same        C == 0 || Z == 1
ge       signed greater or equal       N == V
lt       signed less than              N != V
gt       signed greater than           Z == 0 && N == V
le       signed less than or equal     Z == 1 || N != V


Don't worry too much about the third column. The suffixes work the way you would expect them to. For example:


cmp r1, #0x5

addeq r2, r1, r3         if (r1 == 5) r2 = r1 + r3;

movne r3, #0x18          if (r1 != 5) r3 = 0x18;

movgt r3, #0x18          if (r1 > 5) r3 = 0x18;

bleq Function            if (r1 == 5) Function();



If the condition is not true, then the instruction does nothing.
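If you do want to see how the flag expressions in the table work out, here is a small C model. It is a sketch only: the Flags and Condition types and both function names are invented for this example, and it models just the flags produced by a cmp instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } Flags;

typedef enum { COND_EQ, COND_NE, COND_HI, COND_LS,
               COND_GE, COND_LT, COND_GT, COND_LE } Condition;

/* Flags produced by "cmp rx, ry", which computes rx - ry. */
Flags compare(uint32_t rx, uint32_t ry)
{
    uint32_t result = rx - ry;
    Flags f;
    f.n = (result >> 31) != 0;
    f.z = (result == 0);
    f.c = (rx >= ry);  /* for a subtraction, C set means "no borrow" */
    f.v = (((rx ^ ry) & (rx ^ result)) >> 31) != 0;  /* signed overflow */
    return f;
}

/* Would an instruction carrying this condition suffix execute? */
bool condition_passes(Condition cond, Flags f)
{
    switch (cond) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_HI: return f.c && !f.z;
    case COND_LS: return !f.c || f.z;
    case COND_GE: return f.n == f.v;
    case COND_LT: return f.n != f.v;
    case COND_GT: return !f.z && f.n == f.v;
    case COND_LE: return f.z || f.n != f.v;
    }
    assert(!"unknown condition");
    return false;
}
```

Comparing 0xFFFFFFFF with 5 shows why there are separate signed and unsigned suffixes: the lt condition passes (as signed numbers, -1 < 5) and so does hi (as unsigned numbers, 0xFFFFFFFF > 5).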


Conditional instructions allow the compiler to compile simple "if then else" statements without using any "jump" instructions. For example:


if (r1 == 5)

r2 = 6;

else

r3 = 7;


would become:


cmp r1, #5

moveq r2, #6

movne r3, #7

whereas the x86 compiler (generally) would have to generate two "jump" instructions: one to jump over the "then" clause if the condition was not met, and one to jump over the "else" clause if it was.




Data alignment


The ARM CPU can only access DWORDs in memory that are aligned on addresses that are divisible by 4. Likewise, it can only access 16-bit values on addresses divisible by 2. An unaligned access will result in a Datatype Misalignment exception.
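When you suspect a misalignment fault, the check is easy to do by hand on the address in the base register. As a sketch (both function names are made up for this example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if address is a multiple of size. size must be a power of two:
   4 for DWORDs, 2 for 16-bit values, 1 for bytes. */
bool is_aligned(uint32_t address, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (address & (size - 1)) == 0;
}

/* Under the rule above, would an access of access_size bytes at this
   address raise a Datatype Misalignment exception? */
bool access_would_fault(uint32_t address, uint32_t access_size)
{
    return !is_aligned(address, access_size);
}
```

For instance, a DWORD access at 0x1002 would fault, while a 16-bit access at the same address would not.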




Function Calls


One of the most important mechanisms to understand is how a function call happens.


Step 1 - Parameters:


The caller sets up the parameters. r0 through r3 are used to transfer the first four parameters of a function. If there are more than four, the rest are pushed on the stack. (The rightmost parameter is pushed first).


Step 2 - Call:


The function is called by executing the Branch and Link instruction:


bl MyFunction            lr = address of the instruction after the bl; MyFunction();


Note how this instruction sets lr to be the address to return to when the function is done.


For C++ virtual method calls, you'll see the bx instruction instead of bl. bx jumps to the address in a register. For example, if r0 is your C++ "this" pointer:


ldr r2, [r0]             r2 = &vtable

ldr r3, [r2, #0xC]       r3 = vtable[3] // == fourth method

mov lr, pc               Manually set up lr since we're not using bl. (pc reads as this instruction's address + 8, which is the instruction after the bx.)

bx r3                    Jump to r3, i.e., call the fourth method
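In C terms, the sequence above is an indirect call through a table of function pointers. The sketch below uses made-up names (Object, Method, example_vtable, call_fourth_method) and a hand-built table; a real C++ compiler lays out its vtables itself, so treat this as a model of the idea only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Object Object;
typedef int (*Method)(Object *self);

struct Object {
    const Method *vtable;  /* first word of the object: ldr r2, [r0] */
    int value;
};

static int method0(Object *self) { return self->value; }
static int method1(Object *self) { return self->value + 1; }
static int method2(Object *self) { return self->value + 2; }
static int method3(Object *self) { return self->value + 3; }

/* Hypothetical vtable with four entries. */
const Method example_vtable[4] = { method0, method1, method2, method3 };

int call_fourth_method(Object *self)
{
    assert(self != NULL && self->vtable != NULL);
    const Method *vtable = self->vtable;  /* ldr r2, [r0]                */
    Method target = vtable[3];            /* ldr r3, [r2, #0xC]          */
    return target(self);                  /* mov lr, pc  then  bx r3     */
}
```

The offset #0xC in the assembly corresponds to index 3 because each table entry is a 4-byte pointer on 32-bit ARM.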


Step 3 - Preserve registers:


The first instruction in a function usually looks something like this:


stmdb sp!, {r4 - r6, lr}     push(lr); push(r6); push(r5); push(r4);


This is the "Store Multiple Decrement Before" instruction, which is my favorite CPU instruction of all time. It pushes an entire specified set of registers onto the stack in one instruction. This serves two purposes. First, it preserves the values of whichever of r4 through r11 the function is going to modify (remember it is the responsibility of the called function to preserve these). Second, it safely stores away the return address, lr.


Note that the order in which it pushes the registers may be the opposite of what you expect. The nice thing about this, though, is that if you are looking at a memory dump of the stack, the registers will be in the same order in memory as they are listed in the code.
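To make the ordering concrete, here is a toy C model of the instruction. The Cpu struct, the tiny stack array, and the function name stmdb_sp are all invented for this sketch; word i of the array stands for the memory at address 4 * i.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

typedef struct {
    uint32_t r[16];               /* r[13] = sp, r[14] = lr, r[15] = pc */
    uint32_t stack[STACK_WORDS];  /* toy memory: word i is at address 4*i */
} Cpu;

/* Model of "stmdb sp!, {...}". Bit n of register_mask selects rn. */
void stmdb_sp(Cpu *cpu, uint16_t register_mask)
{
    /* Walk from r15 down to r0, so the lowest-numbered register
       ends up at the lowest address. */
    for (int reg = 15; reg >= 0; reg--) {
        if (register_mask & (1u << reg)) {
            assert(cpu->r[13] >= 4 && cpu->r[13] % 4 == 0);
            assert(cpu->r[13] <= 4u * STACK_WORDS);
            cpu->r[13] -= 4;                            /* decrement before */
            cpu->stack[cpu->r[13] / 4] = cpu->r[reg];   /* then store */
        }
    }
}
```

After modelling stmdb sp!, {r4 - r6, lr}, the four words starting at the new sp hold r4, r5, r6, and lr in that order, which is the layout you would see in a memory dump.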


Step 4 - Locals:


Next, the callee decrements the stack pointer in order to reserve space on the stack for its local variables:


sub sp, sp, #0xC sp -= 12;


Note that the size it reserves may not be what you expect from looking at the local variables in the C code. This is because the optimizing compiler may not need to store a local on the stack at all; it may be able to get away with using registers.


Step 5 - Body:


Next, the body of the function is executed. Somewhere along the way (the optimizing compiler is free to decide where), r0 is set to the return value of the function.


Step 6 - Return:


If space was allocated on the stack for locals, it is released:

add sp, sp, #0xC


Then the registers that we saved away at the beginning of the function are restored using the "Load Multiple Increment After" instruction:


ldmia sp!, {r4 - r6, lr}     pop(r4); pop(r5); pop(r6); pop(lr);


And finally we jump to the return address:


bx lr



Note, if you are debugging an app compiled for a version of Windows Mobile prior to v5.0, a different code sequence will be used:


ldmia sp!, {r4 - r6, pc}     pop(r4); pop(r5); pop(r6); pop(pc);


Note that this is the same list of registers as used in the stmdb instruction at the beginning of the function, except that now lr has been replaced by pc. So the value of lr when the function was entered, i.e., the return address, is now loaded into pc, the program counter. This has the effect of jumping to the return address. In other words, returning from the function.
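The pop-into-pc return can be modelled the same way as a toy C function. As before, this is a sketch under assumptions: the Cpu struct, the small stack array (word i stands for address 4 * i), and the name ldmia_sp are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

typedef struct {
    uint32_t r[16];               /* r[13] = sp, r[14] = lr, r[15] = pc */
    uint32_t stack[STACK_WORDS];  /* toy memory: word i is at address 4*i */
} Cpu;

/* Model of "ldmia sp!, {...}". Bit n of register_mask selects rn. */
void ldmia_sp(Cpu *cpu, uint16_t register_mask)
{
    /* Walk from r0 up to r15: the lowest-numbered register is loaded
       from the lowest address, mirroring the order stmdb stored them. */
    for (int reg = 0; reg <= 15; reg++) {
        if (register_mask & (1u << reg)) {
            assert(cpu->r[13] % 4 == 0 && cpu->r[13] / 4 < STACK_WORDS);
            cpu->r[reg] = cpu->stack[cpu->r[13] / 4];  /* load */
            cpu->r[13] += 4;                           /* increment after */
        }
    }
}
```

If the word that the prolog stored from lr is loaded into r[15], the model's pc ends up holding the return address and sp is back where it was on entry, which is all a return needs.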




The Frame Pointer


The Frame Pointer is not a register; it's just a concept. The Frame Pointer is the value of the stack pointer while the body of a function is executing. In other words, it's a pointer to a stack frame. What's in a stack frame? Well, from steps 3 and 4 above, a stack frame contains local variables, followed by the registers that the function needed to preserve.


Every function call in a call stack has a frame pointer. Platform Builder will show you the frame pointers if you right-click in the Call Stack window and check "Frame Pointer". In Visual Studio 2005, you need to double-click on the frame in the Call Stack window, then look at the value of "sp" in the Registers window. (If the Registers window says "No Data Available", right-click the Registers window and check "Device Registers".)




Finding local variables


Now that you understand the basics of ARM assembly, you can use them to do some cool debugging tricks. For example, the debugger often can't give you the value of local variables in a retail build, because the debugger assumes locals are on the stack, and the optimizing compiler often keeps them in registers instead. But, by looking at the disassembly window, and figuring out what the disassembly is doing and how it relates to the C source code, you can find where the compiler is storing your local variables.


Understanding the optimized code the compiler generates can be tricky, and there's no cookbook to finding your locals, but here's one tip:


Before a function is called, the parameters have to be loaded into r0, r1, r2, and r3. So it's fairly easy to look at what the compiler is loading into these registers and to match them up to the C code. For example:


RECT rc;

GetWindowRect(hWnd, &rc);


becomes:


add r1, sp, #0x20 // This tells us that rc is

// located at sp + 0x20


mov r0, r4 // This tells us that hWnd was in r4

// before this snippet of code

bl GetWindowRect



Finding local variables in other stack frames


Now let's tackle a harder problem. Suppose you want to know the value of a local that lives in a function that is not at the top of the call stack? For example, suppose your call stack looks like this:


Generic.exe!MyRegisterClass

Generic.exe!InitInstance

Generic.exe!WinMain

Generic.exe!WinMainCRTStartup


And you want to know what the value of hInstance was in WinMain. The first step is to do what we did before: read the disassembly and figure out where the value was before InitInstance was called:


int WINAPI WinMain(HINSTANCE hInstance,

HINSTANCE hPrevInstance,

LPTSTR lpCmdLine,

int nCmdShow)

{

00011960 stmdb sp!, {r4, lr}

00011964 sub sp, sp, #0x1C

00011968 mov r4, r0

MSG msg;


// Perform application initialization:

if (!InitInstance(hInstance, nCmdShow))

0001196C mov r1, r3

00011970 bl |InitInstance ( 117f8h )|



Since hInstance is the first parameter to WinMain, hInstance must have been in r0 when WinMain was entered. The mov r4,r0 instruction tells you that hInstance was in r4 when InitInstance was called. Of course, it's not still in r4 by the time we got to MyRegisterClass. So how do you figure out what r4 used to be?


Remember that it is the responsibility of a called function to preserve the values of r4 through r11. So, one solution is to just step in the debugger back out to WinMain and then look at r4. But there are a number of reasons why this might not be possible: you may be looking at a post-mortem Watson dump; the debugger might be misbehaving; you might need to do more investigation before you step back out and lose your current state, etc. So then what do you do?


Since it is the responsibility of a called function to preserve r4 – r11, that means that one of the functions in the call stack (above the function you care about) must have preserved your r4. So, starting at WinMain (since that's where your local lives) walk up the stack one frame at a time looking for a function that preserved r4 on the stack.


So we start at InitInstance. Find the beginning of the function:


BOOL InitInstance(HINSTANCE hInstance, int nCmdShow)

{

000117F8 stmdb sp!, {r4 - r7, lr}

000117FC sub sp, sp, #0x75, 30


And there it is. InitInstance preserves r4. But where exactly did it put r4? Remember what the stack looks like:


MyRegisterClass locals                 <-- MyRegisterClass frame pointer

MyRegisterClass preserved registers

InitInstance locals                    <-- InitInstance frame pointer

InitInstance preserved registers

WinMain locals                         <-- WinMain frame pointer

WinMain preserved registers


So first you need to find the frame pointer for InitInstance (the function that saved r4), which Platform Builder's call stack window will give you. Or, if you are using Visual Studio 2005, double-click InitInstance's frame in the Call Stack window, then look at the value of sp in the Registers window. (If for some reason you don't have a debugger, you can even figure out the frame pointer yourself by starting with the current value of the stack pointer and looking at the prolog of each function in the stack to see how big each frame is.)


So now we know InitInstance's frame pointer, which points to its locals. The "sub sp,sp,#0x75,30" instruction in InitInstance's prolog tells us that InitInstance has 0x1D4 bytes of locals (0x1D4 == 0x75 << (32-30)). So InitInstance's preserved registers start at the frame pointer + 0x1d4. And since r4 is the first of the preserved registers, the r4 that WinMain was using lives at the frame pointer + 0x1d4. And there you have it: you found the value of WinMain's local hInstance variable.
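The offset arithmetic generalizes to any preserved register. The C sketch below uses made-up function names and assumes the simple prolog shape shown in this document: one stmdb of the preserved registers followed by one sub that reserves the locals, with sp unchanged for the rest of the function body.

```c
#include <assert.h>
#include <stdint.h>

/* Count the registers selected in mask whose number is below limit.
   Bit n of mask stands for rn, as in the stmdb register list. */
static unsigned registers_below(uint16_t mask, unsigned limit)
{
    unsigned count = 0;
    for (unsigned reg = 0; reg < limit; reg++)
        if (mask & (1u << reg))
            count++;
    return count;
}

/* Total bytes a function's frame occupies: locals plus saved registers.
   Adding this to a frame pointer gives the caller's frame pointer. */
uint32_t frame_size(uint16_t saved_mask, uint32_t locals_size)
{
    return locals_size + 4u * registers_below(saved_mask, 16);
}

/* Byte offset from the frame pointer to the saved copy of register reg.
   Saved registers sit just above the locals, lowest-numbered first. */
uint32_t saved_register_offset(uint16_t saved_mask, uint32_t locals_size,
                               unsigned reg)
{
    assert(reg < 16 && (saved_mask & (1u << reg)) != 0);
    return locals_size + 4u * registers_below(saved_mask, reg);
}
```

For a prolog of stmdb sp!, {r4 - r7, lr} (mask 0x40F0) with 0x1D4 bytes of locals, saved_register_offset gives 0x1D4 for r4 and 0x1D8 for r5. frame_size is the piece you would use when walking the stack by hand without a debugger.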


By the way, if you reach the very top of the call stack, and none of the functions preserved r4, that means that none of them trash r4, and so the current value of r4 is what you are looking for.