ARM 64 Assembly Series — Load and Store

Previous posts: Basic definitions and registers, lab setup, offset and addressing modes

Jul 14, 2022

As we discussed in the previous post:

The AArch64 architecture supports a single instruction set called A64 which consists of fixed-length 32 bit instructions that can be used to: Load and store data, change the address of the next instruction to be executed, perform arithmetic or logical operations, perform a special operation
AArch64 is a load-store architecture, which means that only load and store instructions can access the memory.
The load register ldr and store register str instructions are used to transfer: bytes (8 bits), half-words (16 bits), words (32 bits) and double words (64 bits) from a memory address to registers or from registers to a memory address.

In this post we are going to cover the load and store instructions and, most importantly, see how they are formed so that they carry information about the size of the data they operate on. This, in conjunction with the offset and addressing syntax, might seem a little confusing in the beginning, but hopefully by the end of this article you will fully understand these concepts.

Loading and Storing Data

The load and store instructions can transfer one register (ldr, str) or a pair of registers (ldp, stp) at a time. Let’s see the corresponding syntax in each case:

Single register

As the title implies, in this case a single register is used as the source or destination of a data transfer to or from memory. The basic syntax is as follows:

op<sz> Rn, <address>

The op refers to the instruction mnemonic, which can be ldr or str (capitalisation is optional)
The <sz> refers to the size of the data to be transferred (see below)
The Rn refers to the source or destination register
The <address> refers to the memory address to or from which the data will be transferred

When the <sz> parameter is omitted, the size of the data to be moved is determined by the register name (remember that x implies a 64-bit size and w a 32-bit size).

Let’s see an example to clarify this case:

ldr x1, <address>       //load 64 bits from <address> to x1
str x1, <address>       //store 64 bits from x1 to <address>

ldr w1, <address>       //load 32 bits from <address> to w1
str w1, <address>       //store 32 bits from w1 to <address>

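The <sz> suffix can be used to force a size other than the default. It can be b or h, which select an unsigned byte or half-word respectively (for a word, simply use a w register with plain ldr or str). For loads, putting an s in front (sb, sh, sw) makes the CPU treat the data as signed and sign-extend it into the destination register; sw loads a signed 32-bit word into a 64-bit register. Stores have no signed variants, since a store only writes the low bytes of the register.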

Let’s see some examples:

ldrb w1, [x2]        //load one byte from *x2 to w1 and zero-extend it
strh w1, [x2], #3    //store the low half word (2 bytes) of w1 to *x2 and set x2 = x2 + 3
ldrsh w0, [x3]       //load a half word (2 bytes) from *x3 to w0 and sign-extend it

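By sign-extend we mean that the upper bits of the destination register are filled with copies of the sign bit (the most significant bit) of the loaded value, so a negative number stays negative at the wider size: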

https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARMv8_InstructionSetOverview.pdf

In the first case (see figure above), the byte 0x8A is loaded into the 32-bit w4 register and the remaining 3 bytes are filled with copies of its sign bit, giving 0xFFFFFF8A. Exactly the same happens in the second case, the only difference being that x4 refers to 64 bits, so 7 bytes are filled. Omitting the s (last case) pads the remaining destination bytes with 0.
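The three cases can also be tried out directly. The following is a minimal sketch for GNU as on AArch64 Linux; the value label and the choice of x5 as the address register are assumptions made for illustration, and the register values in the comments follow from the rules described above.

```asm
.global _start

.data
value:  .byte 0x8A          // hypothetical sample byte; bit 7 is set, so it is negative when signed

.text
_start:
    ldr   x5, =value        // x5 = address of the sample byte
    ldrsb w4, [x5]          // w4 = 0xFFFFFF8A (sign-extended to 32 bits)
    ldrsb x4, [x5]          // x4 = 0xFFFFFFFFFFFFFF8A (sign-extended to 64 bits)
    ldrb  w4, [x5]          // w4 = 0x0000008A (zero-extended)

    mov   x0, #0            // exit status
    mov   w8, #93           // exit syscall number on AArch64 Linux
    svc   #0
```

Stepping through it in gdb and printing x4 after each load should show the upper bytes changing between 0xFF and 0x00.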

Pair of registers

The ldp and stp instructions can move twice as much data as ldr and str, since they use a pair of registers each time. The general syntax is as follows:

<op><sz> Rn,Rm, <address>

This operation can be broken down into the following steps:

Load Rn from, or store Rn to, <address>
Advance the address by the size of Rn (4 bytes for a 32-bit transfer or 8 for a 64-bit transfer)
Load or store the second register, Rm, at the advanced address

Other than that, the remaining parts of the instruction have the same meaning as in the previous case, so let’s go straight to the examples:

Example 1: the word at *x2 will be loaded into w0 and the word at *(x2 + 4) will be loaded into w1

ldp w0, w1, [x2]
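To see that a pair load behaves like two single loads at consecutive addresses, here is a small sketch for GNU as on AArch64 Linux. The pair label, the sample values and the use of w3 and w4 for comparison are assumptions made for illustration.

```asm
.global _start

.data
pair:   .word 0x11111111, 0x22222222   // hypothetical sample data: two adjacent 32-bit words

.text
_start:
    ldr  x2, =pair          // x2 = address of the first word
    ldp  w0, w1, [x2]       // w0 = word at x2, w1 = word at x2 + 4
    ldr  w3, [x2]           // same value as w0
    ldr  w4, [x2, #4]       // same value as w1

    mov  x0, #0             // exit status
    mov  w8, #93            // exit syscall number on AArch64 Linux
    svc  #0
```

Note that x2 itself is unchanged afterwards, because this form uses no pre- or post-indexing.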

Example 2: sp (the stack pointer) will be set to sp - 16 bytes, then x29 will be stored at the address held in sp and x30 will be stored at sp + 8 bytes

stp x29,x30, [sp, #-16]!

Example 3: the value at the address sp points to will be loaded into x29, the value at sp + 8 bytes into x30, and finally sp will be set to sp + 16 bytes

ldp x29,x30, [sp], #16

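If you have ever used a disassembler, the last two examples may look familiar: they typically appear at the start and end of a function, where they allocate space on the stack to save the frame pointer (x29) and link register (x30) and then restore them before returning.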
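A minimal sketch of that pattern, assuming a hypothetical function named helper that is called once from _start (GNU as, AArch64 Linux):

```asm
.global _start

.text
helper:                             // hypothetical function name
    stp  x29, x30, [sp, #-16]!      // prologue: make room on the stack, save frame pointer and link register
    mov  x29, sp                    // point the frame pointer at the new frame
    // ... function body would go here ...
    ldp  x29, x30, [sp], #16        // epilogue: restore both registers and release the stack space
    ret                             // return to the address held in x30

_start:
    bl   helper                     // call helper; bl stores the return address in x30
    mov  x0, #0                     // exit status
    mov  w8, #93                    // exit syscall number on AArch64 Linux
    svc  #0
```

Saving x30 matters as soon as the function body makes a call of its own, since bl overwrites the return address.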

Example

Let us now write a program that demonstrates the instructions we have discussed so far. If you have set up your lab, use the following one-liner to start the VM:

qemu-system-aarch64 -m 1024 -M raspi3b -kernel kernel8.img -dtb bcm2710-rpi-3-b-plus.dtb -sd 2022-01-28-raspios-bullseye-arm64.img -append "console=ttyAMA0 root=/dev/mmcblk0p2 rw rootwait rootfstype=ext4" -nographic -device usb-net,netdev=net0 -netdev user,id=net0,hostfwd=tcp::5555-:22

If you haven’t set up your lab yet, you can use this link to do your experiments (unfortunately it doesn’t support ARMv8 yet, but it can be very helpful for simple examples). Next, copy the following code:
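The embedded listing is not reproduced in this copy. What follows is a sketch reconstructed from the walkthrough further down, so the exact layout and line numbers of the original may differ; the mov x0, #0 line is an assumption added so that the program exits with status 0.

```asm
.global _start

_start:
    mov x29, #10                // first test value
    mov x30, #20                // second test value
    stp x29, x30, [sp, #-16]!   // sp = sp - 16, then store x29 at sp and x30 at sp + 8
    mov x29, #16                // overwrite both registers
    mov x30, #11
    ldp x29, x30, [sp], #16     // reload 10 and 20 from the stack, then sp = sp + 16
    b exit                      // branch to the exit label

exit:
    mov x0, #0                  // exit status (assumption, not described in the walkthrough)
    mov w8, #93                 // exit syscall number on AArch64 Linux
    svc #0                      // ask the kernel to perform the syscall
```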

And compile it with:

$ as filename.s -o filename.o && ld filename.o -o filename

In line 2, you see what is called a label, which is something like a function name in higher-level languages. The _start label defines the entry point of the program, while the .global directive exports a symbol so that the linker can see it. The instructions at lines 9 and 10 form a system call (or syscall for short):

syscall() is a small library function that invokes the system
call whose assembly language interface has the specified number
with the specified arguments. Employing syscall() is useful, for
example, when invoking a system call that has no wrapper function
in the C library.

Simply put, a system call is a request for the kernel to perform a task. These tasks are indexed by an integer, which is passed in a dedicated register and followed by a supervisor call instruction (svc #0 in the case of AArch64) that traps into the kernel.

syscall conventions depending on the architecture

In our example above, the exit system call for AArch64 is number 93, so we first mov this value to w8 (the special register we were talking about) and then use svc #0 to perform the call. Let’s load the program into gdb, set a breakpoint at the beginning of _start and hit run:

The two mov instructions store the values 10 and 20 in the registers x29 and x30 respectively, so after they execute you will see the following:

Also, notice that sp is pointing to 0x7ffffffb30, which brings us right to the next instruction: stp first subtracts 16 from sp and then stores the values 10 and 20 on the stack:

0x7ffffffb20: 0x000000000000000a, 0x7ffffffb28: 0x0000000000000014

The next two instructions will store the values 16 and 11 to x29 and x30:

Next is ldp which, as we said, restores the previous values of x29 and x30 from the stack and sets sp back to 0x7ffffffb30:

Finally, the b exit will branch the execution to the exit function and finish our program:

Food for thought

To make these posts more interactive, here is a challenge until the next post:

Assume the following C statements:

int x[] = {1,2,3,4,5};
x[0] = 6;
x[1] = x[2];
x[3] = x[0];

Write the ARM version of it using only ldr, str and mov.

.global _start
_start:
     ldr r0, =x     @ write your program here
.data
x: .word 1,2,3,4,5