Although the typical Arduino programmer is probably not interested in writing assembly code, in some situations assembly programming is essential. Let’s have a look at these situations and see what one can do.
Why assembly coding?
The typical use case for assembly programming is when tight timing constraints have to be met. One example is a fast (and lightweight) bit-banging I2C library that also runs on MCUs without I2C hardware. Peter Fleury wrote such a library for general AVR MCUs. I took his code, extended it somewhat, and turned it into an Arduino library by using GCC inline assembly coding. I also implemented a pure C++ version of this library. It turned out that the C++ version takes twice as much flash memory and that its maximum communication speed is roughly 6 times lower.
Assembly programming is not only good when you want to do things fast, but also if you want to have precise control over the timing. Because you know the exact number of cycles each machine instruction uses, you can implement, e.g., a software UART, which is as good as hardware UARTs.
Another example is the logic analyzer implemented on an Arduino UNO, as described in my previous post. The core of it is an acquisition loop that reads the port inputs at the given sample rate. Algorithmically, this is trivial. The challenging part is to time it precisely and to do it at high sample rates. I wrote an inline assembly code part that uses Timer 1 for timing the sampling, which works up to 1 MHz sampling frequency. For 2 MHz, I had to use a stripped-down version, which we will have a look at later.
A final example is a more time and space-efficient implementation of the Arduino core, which makes heavy use of inline assembly programming.
So, I am not claiming that assembly coding is better than using C/C++ in general (although you might save some bytes in flash memory, but this is usually not crucial). On the contrary, I try to restrict assembly coding to a minimum! It is tiresome and error-prone. And assembly code is hard to debug as well.
My strategy when dealing with tight timing constraints is to code the time-critical part in C++ first. After that, I have a look at what the compiler generates and see whether the timing constraints can be met. Only then do I start to think about how to do it better, i.e., with fewer cycles, if that is necessary at all.
So, how do you get hold of what the compiler generates? There are many ways. The most straightforward one is probably to change or create the `platform.local.txt` file with the lines I showed you in the post about how to make the Arduino IDE ready for debugging. With this modification, you will get an ELF file when you select `Export compiled Binary` in the `Sketch` menu. Using the program `avr-objdump`, which you can locate in your Arduino package or have downloaded as part of the AVR-GCC toolchain, you can now generate a listing that contains the generated machine code:
> avr-objdump --disassemble --source --line-numbers --demangle --section=.text ELF-file > LST-file
When you want to do some assembly coding in an Arduino context, you need an idea of what kind of machine instructions you can use and what operands they work on. This is what I will cover next. Furthermore, you need to learn how to embed assembly code in your C++ program, which we will look at afterward. In the end, we will have a look at the high-speed logic analyzer acquisition loop.
RISC architecture
The AVR MCUs use a RISC instruction set. This means all instructions are highly specialized (dealing only with logic operations or only with memory transfer) and optimized so that most instructions need just one or two clock cycles. The AVR Instruction Set Manual provides a good reference and presents all the details one needs to know. Instead of going through each instruction, I will only present the big picture and introduce the concepts you need to know in order to understand how to code.
Memory types
The AVR MCUs use a modified Harvard architecture, which means that program memory and data memory are in two different address spaces. In addition, there exists EEPROM data memory, which is only indirectly accessible using I/O functions.
Data memory consists of 32 general-purpose 8-bit registers named `r0`–`r31`, 64 I/O registers for accessing peripheral functions, such as control registers and other I/O functions, up to 160 extended I/O registers (depending on MCU type), followed by internal SRAM, which in the case of the ATmega328P comprises 2048 bytes.
All data memory cells can be accessed by load and store instructions using their data memory addresses: 0x0000–0x001F for the general-purpose registers, 0x0020–0x005F for the I/O registers, 0x0060–0x00FF for the extended I/O registers, and 0x0100 upward for SRAM. The 64 I/O registers can also be accessed over the I/O bus with `in` and `out` instructions using an I/O address that is lower by 0x20 than the regular memory address. Such I/O instructions use just one cycle instead of the two cycles that are needed for general load and store instructions. And in the first 32 of the I/O registers, one can set and clear individual bits using just two cycles. If you try to do that in SRAM, you need three instructions (load, modify, store) and at least five cycles.
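To make the relation between the two address spaces concrete, here is a minimal host-side sketch in plain C++ that needs no AVR headers. The helper names are hypothetical and only serve as illustration; on a real target, the avr-libc macro `_SFR_IO_ADDR` performs this conversion. The example addresses are those of the ATmega328P.

```cpp
#include <stdint.h>
#include <assert.h>

// Offset between the I/O address used by in/out/sbi/cbi and the
// data memory address of the same register used by lds/sts/ld/st.
constexpr uint16_t kIoOffset = 0x20;

// True if the register at this data memory address is one of the 64
// I/O registers reachable with in/out (data addresses 0x20..0x5F).
constexpr bool reachableByInOut(uint16_t dataAddr) {
  return dataAddr >= 0x20 && dataAddr <= 0x5F;
}

// True if single bits can be set/cleared with sbi/cbi
// (first 32 I/O registers, data addresses 0x20..0x3F).
constexpr bool reachableBySbiCbi(uint16_t dataAddr) {
  return dataAddr >= 0x20 && dataAddr <= 0x3F;
}

// I/O address to put into an in/out instruction (hypothetical helper).
inline uint8_t ioAddress(uint16_t dataAddr) {
  assert(reachableByInOut(dataAddr));  // extended I/O and SRAM need lds/sts
  return static_cast<uint8_t>(dataAddr - kIoOffset);
}
```

For instance, `PINB` sits at data address 0x23 on the ATmega328P, so `ioAddress(0x23)` yields the I/O address 0x03. The register `TCNT1L` at data address 0x84 lies in the extended I/O area, so `reachableByInOut(0x84)` is false and it can only be accessed with load and store instructions.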
While all registers are 8-bit registers, the 6 upper registers form three pairs of 16-bit registers that can be used for indirect addressing (see below). `r26/r27` is also called the `X`-register, `r28/r29` the `Y`-register, and `r30/r31` the `Z`-register. AVR MCUs are little-endian systems, which means that the byte with the lower address contains the least significant byte. So, if we want to store the number `0x0F12` in the `Z`-register, `r30` would contain `0x12` and `r31` would contain `0x0F`.
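The byte order of a register pair can be sketched in a few lines of portable C++. The type `RegisterPair` and the two helper functions are made up for this illustration and model only how a 16-bit value is distributed over the two 8-bit registers.

```cpp
#include <stdint.h>
#include <assert.h>

// Models a 16-bit register pair such as Z = r31:r30 (hypothetical type).
struct RegisterPair {
  uint8_t low;   // register with the lower number, e.g. r30 for Z
  uint8_t high;  // register with the higher number, e.g. r31 for Z
};

// Distribute a 16-bit value over the pair, least significant byte first.
inline RegisterPair splitWord(uint16_t value) {
  RegisterPair pair;
  pair.low = static_cast<uint8_t>(value & 0xFF);
  pair.high = static_cast<uint8_t>(value >> 8);
  return pair;
}

// Reassemble the 16-bit value from the two registers.
inline uint16_t joinWord(RegisterPair pair) {
  return static_cast<uint16_t>((static_cast<uint16_t>(pair.high) << 8) | pair.low);
}

// Usage: the example from the text, 0x0F12 in the Z-register.
inline void demoZRegister() {
  RegisterPair z = splitWord(0x0F12);
  assert(z.low == 0x12);   // r30
  assert(z.high == 0x0F);  // r31
  assert(joinWord(z) == 0x0F12);
}
```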
Addressing modes
Instructions can address only particular parts of the address space and may do that in different ways. These different ways are called addressing modes. We will go through all of these modes and give examples of how they are used.
- Direct register addressing, single register: The operand is contained in one register, and the result of applying the instruction is stored in the same register. Example: `neg r0`, which replaces the contents of the register `r0` with its two’s complement.
- Direct register addressing, two registers: The operands are contained in two registers. If the instruction has a result, then it is stored in the first register. Example: `sub r1, r2`, which subtracts the contents of `r2` from `r1` and places the result in `r1`. Another example is `cp r1, r2`, which performs the same operation, but no result is stored. Only the status register is updated, which contains a number of different flags, such as the Zero-flag and the Carry-flag.
- Immediate value: One operand is an immediate value that is part of the instruction. The other one is in a register (or register pair). Depending on the instruction, different ranges of values are permitted. Examples: `adiw r30, 5`, which adds the value 5 (0-63 is allowed) to the register pair `r30/r31`, and `ori r16, 0xFE`, which executes a logical or on register `r16` and `0xFE`, storing the result in `r16`. Note that instructions with an 8-bit immediate operand, such as `ori`, only work on the upper registers `r16`–`r31`.
- I/O direct addressing: A byte is read from an I/O register into a register or written from a register to an I/O register (I/O register addresses range from 0x00 to 0x3F). Examples: `in r0, 0x05`, which loads one byte from I/O register `0x05` to register `r0`, and `out 0x06, r1`, which outputs the contents of register `r1` to I/O register `0x06`.
- Direct data addressing: One operand is a byte in data memory addressed by a 16-bit address given in the instruction. Examples are `lds r2, 0x05FF` and `sts 0x0020, r0`, which load the byte at address `0x05FF` into `r2` and store the contents of `r0` to address `0x0020`, respectively.
- Indirect data addressing: One operand is a byte in data memory addressed by the X-, Y- or Z-register. Example: `st X, r5`, which stores the contents of `r5` to the memory addressed by the `X`-register.
- Indirect data addressing with displacement: One operand is a byte in data memory addressed by the Y- or Z-register and a constant displacement of 0 up to 63. Example: `ldd r0, Y+5`, which loads from the address pointed at by the `Y` register plus 5 into register `r0`.
- Indirect data addressing with pre-decrement or post-increment: One operand is a byte in data memory addressed by the `X`-, `Y`-, or `Z`-register, and the address register is either decremented before access or incremented after access. Examples: `ld r0, X+`, which loads `r0` with the byte pointed at by the `X` register, which is incremented afterward, and `st -Y, r1`, which stores the byte contained in register `r1` to the location which is addressed after the `Y` register is decremented.
- Program memory constant addressing: This mode is only usable with the program memory instructions `lpm`, `elpm`, and `spm`. These instructions load bytes from program memory or write words to program memory. In both cases, the address is given by the Z-register.
- Program memory addressing with post-increment: Similar to the previous mode, but with post-increment of the Z-register.
- Direct program memory addressing: Program execution continues at the address given as an immediate value in the instruction. Example: `call 0x0555`, which means that the subroutine at program address `0x0555` (which is not a byte address, but a word address!) is called. Note that as a programmer one provides a symbolic label for the subroutine address, so one does not have to care about the specific numerical value.
- Indirect program memory addressing: Program execution continues at the address the Z-register points to (again, a word address, not a byte address). Example: `ijmp`.
- Relative program memory addressing: Program execution continues at the address relative to the current program counter given by a displacement (again a word address). And again, the programmer just puts in a symbolic value. Example: `rjmp back`, which means that program execution is continued at the program location labeled with `back`.
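The difference between word and byte addresses in the program memory modes is easy to get wrong when reading a disassembly, because `avr-objdump` prints byte addresses and byte offsets while the program counter counts 16-bit words. The following host-side C++ sketch (all function names are hypothetical) shows the arithmetic for a relative jump, assuming the rule PC ← PC + k + 1 from the instruction set manual.

```cpp
#include <stdint.h>
#include <assert.h>

// Program memory is addressed in 16-bit words, listings show bytes.
constexpr uint16_t wordAddress(uint16_t byteAddr) { return byteAddr / 2; }
constexpr uint16_t byteAddress(uint16_t wordAddr) { return wordAddr * 2; }

// avr-objdump prints relative targets as byte offsets (e.g. "rjmp .-22");
// the opcode itself stores the displacement k in words.
constexpr int16_t wordDisplacement(int16_t byteOffset) { return byteOffset / 2; }

// Byte address of the target of a relative jump or branch that is located
// at byte address instrByteAddr and has the word displacement k.
constexpr uint16_t relativeTarget(uint16_t instrByteAddr, int16_t k) {
  return byteAddress(static_cast<uint16_t>(wordAddress(instrByteAddr) + k + 1));
}

// Usage: "rjmp .-22" located at byte address 0x14c jumps back to 0x138.
inline void demoRelativeJump() {
  assert(wordDisplacement(-22) == -11);
  assert(relativeTarget(0x14C, wordDisplacement(-22)) == 0x138);
  // A word address such as 0x0555 corresponds to byte address 0x0AAA.
  assert(byteAddress(0x0555) == 0x0AAA);
}
```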
Instruction types
Instructions can be of different types based on what they accomplish. Some of them just move data around, others perform arithmetic. The AVR instructions can be classified into the following types of instructions:
- Arithmetic and logic instructions: These are instructions that perform logic and arithmetic operations. Almost all of them work on two registers, and some of them take an immediate value. They all affect the status register, e.g., zero- and carry flags are set. The latter is particularly important when performing multibyte arithmetic operations. For example, if we want to add the two-byte integer in registers r2/r3 to the two-byte integer in r4/r5 (remember, LSB in the registers with the lower address), this could be done as follows:

      add r4, r2 ; add low byte in r2 to r4 without carry
      adc r5, r3 ; add high byte in r3 to r5 with carry-flag

- Branch instructions: Branch instructions are all instructions that change the linear flow of execution. For example, the above-mentioned `call` instruction belongs to this set and also the `ret` instruction, which returns from a subroutine call. These two are unconditional. There is also a large set of conditional branch instructions, such as `breq` (branch if equal), which only jumps to the specified program address if the Zero-flag is set.
- Data transfer instructions: Data transfer instructions move data around, e.g. from data memory to registers (`ld`) and vice versa (`st`). There are also two instructions that move data between registers (`mov` and `movw`), and instructions that manipulate the stack (`push` and `pop`). And there are, of course, the instructions mentioned above that read from I/O registers (`in`) and that write to them (`out`). None of them changes any flags in the status register.
- Bit manipulation instructions: These are instructions that manipulate bits in one of the general registers, in the I/O registers, or in the status register. An example of the first kind of instructions is `rol`, which rotates the contents of a register to the left, shifting the carry flag into the least significant bit and moving the most significant bit into the carry flag. An example of the second kind of instruction is `sbi`, which sets a particular bit in an I/O register to one. Finally, an example of the third type is `clc`, which clears the carry flag.
- MCU control instructions: The only instruction from this set you probably will ever use is `nop`, which does nothing and takes one clock cycle to do so. This instruction is often used for delaying execution in order to time a loop.
- Word instructions: Orthogonally to the above classification, one can classify instructions along the dimension of how many bytes are operated on. There are actually only three instructions that operate on words instead of on bytes. These are `movw`, to move a word from one register pair to another one, `adiw`, to add an immediate value (0-63) to a word in one of the four upper register pairs, and `sbiw`, which subtracts an immediate value.
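To see why the carry flag matters for the multibyte addition mentioned in the list above, one can model the `add`/`adc` pair in portable C++. This is only a sketch of the arithmetic, not of the real ALU: the struct `Alu` and the function `add16` are invented for illustration, and all status flags other than carry are ignored.

```cpp
#include <stdint.h>
#include <assert.h>

// Minimal model of the ALU that tracks only the carry flag.
struct Alu {
  bool carry = false;

  // add Rd, Rr: Rd <- Rd + Rr, carry is set on overflow out of bit 7
  uint8_t add(uint8_t rd, uint8_t rr) {
    uint16_t sum = static_cast<uint16_t>(rd + rr);
    carry = sum > 0xFF;
    return static_cast<uint8_t>(sum);
  }

  // adc Rd, Rr: Rd <- Rd + Rr + C
  uint8_t adc(uint8_t rd, uint8_t rr) {
    uint16_t sum = static_cast<uint16_t>(rd + rr + (carry ? 1 : 0));
    carry = sum > 0xFF;
    return static_cast<uint8_t>(sum);
  }
};

// Mirrors "add r4, r2" followed by "adc r5, r3":
// a is held in r5:r4, b is held in r3:r2, the result ends up in r5:r4.
inline uint16_t add16(uint16_t a, uint16_t b) {
  Alu alu;
  uint8_t r4 = static_cast<uint8_t>(a & 0xFF);
  uint8_t r5 = static_cast<uint8_t>(a >> 8);
  uint8_t r2 = static_cast<uint8_t>(b & 0xFF);
  uint8_t r3 = static_cast<uint8_t>(b >> 8);
  r4 = alu.add(r4, r2);  // low bytes, produces the carry
  r5 = alu.adc(r5, r3);  // high bytes, consumes the carry
  return static_cast<uint16_t>((static_cast<uint16_t>(r5) << 8) | r4);
}

// Usage: the low bytes overflow, and the carry propagates into the high byte.
inline void demoAdd16() {
  assert(add16(0x00FF, 0x0001) == 0x0100);
}
```

If the second instruction were `add` instead of `adc`, the carry out of the low byte would be lost and `0x00FF + 0x0001` would come out as `0x0000`.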
Inline assembly coding
Once you have understood how to use assembly instructions, you may want to write an assembler program. This can be done by writing such a program into a file with an S extension and then handing it over to avr-gcc. Although the Arduino IDE has supported that for some time by now, it would mean that you also need to learn about the syntax of assembly language files. Instead, there is the possibility to insert inline assembly code into C/C++ code, as in the following basic example, where we enable interrupts:
asm volatile("sei");
Note that we use the `asm` statement with the `volatile` qualifier. This is something one should do in order to avoid the compiler optimizing the assembly code away or moving it out of a loop (strictly speaking, this is only necessary when there is an output operand that is not used afterward in the C++ code, but it never hurts anyway).
Let us now try to read the input port `PINA`, which should then be stored in the C-variable `invalue` of type `byte`. One might be tempted to use an `asm` statement like the following
asm volatile("in invalue, PINA");
which will not work for several reasons. First of all, the first operand of an `in` instruction needs to be a register, not a C-variable. So, we need to establish a connection between a register that the compiler will choose and the output variable `invalue`. Second, we cannot use a compile-time constant such as `PINA` inside a string because the pre-processor does not change anything inside a string. In order to deal with both problems, the `asm` statement usually contains more than simply the assembly code. The general form of the `asm` statement is:
asm volatile(<code> : <output operands> : <input operands> : <clobbers>)
The `<code>` section is simply a string containing the assembly code, where the instructions have to be separated by a newline symbol. In order to make the listing generated by the compiler nicer, one should also add a tabulator. Finally, since in C and C++ string constants that follow each other and are only separated by white space are considered as one string constant, one can write each instruction on its own line, e.g.,

    "nop        ; this is a comment \n\t"
    "rjmp LABEL ; jump to LABEL \n\t"

The `<output operands>` section is a comma-separated list of descriptions of how outputs should be returned to the C/C++ context. Such a description consists of a constraint and a C-expression in parentheses. In the case of an output operand, this must be an lvalue, i.e., something that could be on the left-hand side of an assignment operator.
Similarly, the `<input operands>` section specifies how input is fed into the `asm` statement. Finally, the `<clobbers>` section is a comma-separated list of registers that the inline code modifies, so that the compiler knows it has to save them before the `asm` statement and restore them afterward if they hold live values. Our example from above could be written as
asm volatile("in %0, %1" : "=r" (invalue) : "I" (_SFR_IO_ADDR(PINA)));
The notations `%0` and `%1` refer to the first and second operand specifications, respectively. The first operand specification in the output section is `"=r" (invalue)`, which means that inside the `asm` statement `%0` stands for a register, and after the `asm` statement is finished, the register contents should be stored in the C-variable `invalue`. The specification `"I" (_SFR_IO_ADDR(PINA))` means that `%1` should be substituted by a constant and this constant is the I/O address of `PINA` (i.e., 0x20 is subtracted from the memory address of the input port A).
For an output specification, the following constraints can be used:
- `"=r"` meaning that a register should be used, and the value is write-only,
- `"+r"` meaning that a register should be used, and the value should be loaded at the beginning and stored in the end,
- `"=&r"` meaning that a register should be used, and this register is exclusively reserved, i.e., it is not shared with any input operand.
For input specifications, many more constraints are possible. We already saw `"I"` meaning a 6-bit positive integer constant. There is also, for instance, `"M"` meaning an 8-bit positive integer constant, `"r"` meaning a register, `"e"` meaning one of the three address registers, etc. Here is a table with all possible constraints.
Constraint | Used for | Range of allowed values | Possible registers to be allocated by compiler |
---|---|---|---|
a | simple upper register | | r16-r23 |
b | base pointer register pairs | | Y, Z |
d | upper register | | r16-r31 |
e | pointer register pair | | X, Y, Z |
l | lower register | | r0-r15 |
q | stack pointer | | SPL:SPH |
r | any register | | r0-r31 |
t | temporary register | | `__tmp_reg__` |
w | special upper register pair | | r24, r26, r28, r30 |
x | pointer register X | | X |
y | pointer register Y | | Y |
z | pointer register Z | | Z |
G | floating point constant | 0.0 | |
I | 6-bit positive integer | 0 to 63 | |
J | 6-bit negative integer | -63 to 0 | |
K | integer constant | 2 | |
L | integer constant | 0 | |
M | 8-bit integer constant | 0 to 255 | |
n | 16-bit integer constant | 0 to 65535 | |
N | integer constant | -1 | |
O | integer constant | 8, 16, 24 | |
P | integer constant | 1 | |
Q | memory address based on Y or Z pointer with displacement | | |
R | integer constant | -6 to 5 | |
One interesting feature of the inline assembler is that you can handle not only one-byte values and variables but also two- and four-byte values and variables. If, for instance, you want to swap the bytes of a two-byte variable `val`, this can be done as follows:

    asm volatile("mov r5, %A0\n\t"
                 "mov %A0, %B0\n\t"
                 "mov %B0, r5\n\t"
                 : "+r" (val) : : "r5");

Here `%An` stands for the first (least significant) byte register of the operand `%n` and `%Bn` stands for the second byte. In the case of four-byte values, one would also use `%Cn` and `%Dn`. The variable `val` is here input and output, which is signified by `"+r"`. Further, `r5` is mentioned as a clobbered register, i.e., the compiler must make sure that values previously stored in `r5` get restored. Actually, instead of using `r5`, we could have used the temporary register `r0`, which is always available. Instead of writing `r0` directly, however, one should use `__tmp_reg__` in order to make the code independent of changes in register allocations that different compiler versions may introduce.
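Putting these pieces together, a byte swap that uses the temporary register instead of a clobbered `r5` could be wrapped into a function as sketched here. The function name `swapBytes` and the non-AVR fallback branch are assumptions added for illustration so that the example can also be compiled and tried on a PC; only the `__AVR__` branch is the inline assembly variant discussed in the text.

```cpp
#include <stdint.h>
#include <assert.h>

// Swap the two bytes of a 16-bit value (hypothetical helper).
inline uint16_t swapBytes(uint16_t val) {
#if defined(__AVR__)
  asm volatile(
      "mov __tmp_reg__, %A0 \n\t"  // save the low byte in the temporary register
      "mov %A0, %B0         \n\t"  // low byte  <- high byte
      "mov %B0, __tmp_reg__ \n\t"  // high byte <- saved low byte
      : "+r" (val));               // val is input and output, no clobber needed
#else
  // Assumption: plain C++ stand-in for trying the example off-target.
  val = static_cast<uint16_t>((val << 8) | (val >> 8));
#endif
  return val;
}

// Usage: the two bytes of 0x0F12 change places.
inline void demoSwapBytes() {
  assert(swapBytes(0x0F12) == 0x120F);
}
```

Since `__tmp_reg__` may be used freely inside an `asm` statement, the clobber list can stay empty in this variant.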
The AVR-GCC Inline Assembler Cookbook gives the whole picture of how to write inline assembly code but is dense at some points. If you want more of a tutorial-style introduction, then the Arduino Inline Assembly Tutorial is probably the right thing for you. It also gives a gentle introduction to AVR assembly programming in general. Meanwhile, the author has also published an e-book based on his blog. Let me finally point you to a tutorial write-up by wek at AVRFREAKS, which addresses some possible misunderstandings and pitfalls.
Example
In order to demonstrate the power of using inline assembly coding, let us look at the example of the 2MHz acquisition loop for the Arduino UNO logic analyzer (simplified):

    byte trigger, trigger_values;
    byte logicdata[1024];

    // This function provides sampling for 2 MHz with no pre-trigger data
    void acquire2MHz() {
      byte inp;
      cli(); // disable interrupts
      do {
        inp = PINB; // read sample
      } while ((trigger_values ^ inp) & trigger); // as long as no trigger
      logicdata[0] = inp; // keep first trigger sample
      // keep sampling for 1023 samples after trigger
      for (unsigned int i = 1; i < 1024; i++) {
        logicdata[i] = PINB;
      }
      sei(); // enable interrupts again
    }

If we now look at what the compiler makes out of it, it looks quite efficient:

    void acquire2MHz() {
      byte inp;
      cli();
     11e:  f8 94        cli
      do {
        inp = PINB;
     120:  93 b1        in   r25, 0x03    ; 3
      } while ((trigger_values ^ inp ) & trigger);
     122:  80 91 05 05  lds  r24, 0x0505  ; 0x800505 <trigger_values>
     126:  89 27        eor  r24, r25
     128:  20 91 04 05  lds  r18, 0x0504  ; 0x800504 <trigger>
     12c:  82 23        and  r24, r18
     12e:  c1 f7        brne .-16         ; 0x120
      logicdata[0] = inp;
     130:  90 93 04 01  sts  0x0104, r25  ; 0x800104 <logicdata>
      for (unsigned int i = 1 ; i < 1024; i++) {
     134:  81 e0        ldi  r24, 0x01    ; 1
     136:  90 e0        ldi  r25, 0x00    ; 0
     138:  81 15        cp   r24, r1      ; [1]
     13a:  24 e0        ldi  r18, 0x04    ; 4 [1]
     13c:  92 07        cpc  r25, r18     ; [1]
     13e:  38 f4        brcc .+14         ; 0x14e [1 if false]
        logicdata[i] = PINB;
     140:  23 b1        in   r18, 0x03    ; 3 [1]
     142:  fc 01        movw r30, r24     ; [1]
     144:  ec 5f        subi r30, 0xFC    ; 252 [1]
     146:  fe 4f        sbci r31, 0xFE    ; 254 [1]
     148:  20 83        st   Z, r18       ; [2]
      for (unsigned int i = 1 ; i < 1024; i++) {
     14a:  01 96        adiw r24, 0x01    ; 1 [2]
     14c:  f5 cf        rjmp .-22         ; 0x138 [2]
      }
      sei();
     14e:  78 94        sei
    }
     150:  08 95        ret

The interesting part is the `for`-loop between 138 and 14c. I have annotated the listing with the clock cycles in brackets. When you add them up, you get a result of 14 clock cycles. However, a sample rate of 2 Ms/sec on an MCU running at 16 MHz leaves only 8 clock cycles for each loop iteration, so this does not fit. So, here we have a case where inline assembler is definitely a must. The final acquisition procedure looks as follows:

    void acquire2MHz() {
      byte * ptr = &logicdata[0];
      int delayCount = 1023;
      cli();
      asm volatile(
        "TRIGLOOP: in __tmp_reg__, %[CHANPINaddr] ; read input [1]\n\t"
        "st %a[LOGICDATAaddr], __tmp_reg__ ; store input [2]\n\t"
        "eor __tmp_reg__, %[TRIGVAL] ; inp = inp XOR trig_val [1]\n\t"
        "and __tmp_reg__, %[TRIGMASK] ; inp = inp AND trigger [1]\n\t"
        "brne TRIGLOOP ; wait for trigger [1 if false]\n\t"
        "adiw %A[LOGICDATAaddr], 1 ; increment pointer [2]\n\t"
        ";This makes it 8 cycles! Now the sampling loop:\n\t"
        "SAMPLOOP: in __tmp_reg__, %[CHANPINaddr] ; read input data [1]\n\t"
        "nop ; cycle padding [1]\n\t"
        "st %a[LOGICDATAaddr]+, __tmp_reg__ ; store input & post incr. [2]\n\t"
        "sbiw %A[DELAYCOUNT], 1 ; decrement delayCount [2]\n\t"
        "brne SAMPLOOP ; continue until end [2]\n\t"
        ";sum: 8 cycles\n\t"
        // the pointer and the counter are modified, so they are read-write operands
        : [LOGICDATAaddr] "+e" (ptr),
          [DELAYCOUNT] "+w" (delayCount)
        : [CHANPINaddr] "I" (_SFR_IO_ADDR(PINB)),
          [TRIGMASK] "r" (trigger),
          [TRIGVAL] "r" (trigger_values)
        : "memory"); // logicdata[] is written behind the compiler's back
      sei();
    }

Now only 8 clock cycles are used per sample, and in addition, the triggering sample is also taken exactly 8 cycles before the general acquisition starts. There are a number of additional features used with respect to what we have introduced so far, though. First, I used symbolic references instead of positional references for output and input operands. Second, I used the notation `%a[ref]` in the instruction `st %a[LOGICDATAaddr]+, __tmp_reg__`. This notation is replaced by the name of the pointer register that holds the operand, i.e., X, Y, or Z.
All in all, this example shows that with inline assembly, you can push the boundaries of timing on an AVR MCU quite a lot. But it is also clear from the code that it would be impossible to achieve a sample rate of 4 or 5 Ms/second with an AVR MCU running at 16 MHz using a loop. This is only achievable when one uses loop unrolling.
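The cycle budget behind these numbers can be checked with a little compile-time arithmetic. The sketch below is plain C++ with hypothetical constant names. It assumes a 16 MHz clock and the ATmega328P cycle counts used in the listing above (`in` and `nop` 1 cycle, `st` and `sbiw` 2 cycles, a taken `brne` 2 cycles).

```cpp
#include <stdint.h>
#include <assert.h>

constexpr uint32_t kCpuHz = 16000000UL;  // clock of the Arduino UNO

// Whole CPU cycles available per sample at a given sample rate.
constexpr uint32_t cyclesPerSample(uint32_t sampleRateHz) {
  return kCpuHz / sampleRateHz;
}

// Cycle counts of the instructions used in the sampling loop.
constexpr uint32_t kIn = 1, kNop = 1, kStore = 2, kSbiw = 2, kBrneTaken = 2;

// SAMPLOOP as written: in, nop, st, sbiw, brne
constexpr uint32_t kLoopCycles = kIn + kNop + kStore + kSbiw + kBrneTaken;
// Shortest possible loop without padding: in, st, sbiw, brne
constexpr uint32_t kMinLoopCycles = kIn + kStore + kSbiw + kBrneTaken;
// Unrolled code needs only in and st per sample
constexpr uint32_t kUnrolledCycles = kIn + kStore;

constexpr bool fitsBudget(uint32_t cyclesNeeded, uint32_t sampleRateHz) {
  return cyclesNeeded <= cyclesPerSample(sampleRateHz);
}

static_assert(kLoopCycles == cyclesPerSample(2000000UL),
              "the padded loop hits the 2 MHz budget exactly");

// Usage: a loop is too slow for 4 MHz, unrolled code is not.
inline void demoCycleBudget() {
  assert(!fitsBudget(kMinLoopCycles, 4000000UL));  // 7 cycles needed, 4 available
  assert(fitsBudget(kUnrolledCycles, 4000000UL));  // 3 cycles needed, 4 available
}
```

With 3 cycles per sample, unrolled code would top out at 16 MHz / 3, i.e., roughly 5.3 million samples per second, at the price of one `in`/`st` pair in flash for every sample.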
[Edit Nov-2023: An attentive reader noted that it is indeed possible to link .S files into an Arduino project. I changed that part.]
May 30, 2023 — 21:36
I think you are correct. For many tasks the assembler is better.
And is very fun understand how the microcontroller works.
Regards