Real Programmers Write Assembly Code

Although the typical Arduino programmer is probably not interested in writing assembly code, in some situations assembly programming is essential. Let’s have a look at these situations and see what one can do.

Why assembly coding?

The typical use case for assembly programming is when tight timing constraints have to be met. One example is a fast (and lightweight) bit-banging I2C library that also runs on MCUs without I2C hardware. Peter Fleury wrote such a library for general AVR MCUs. I took his code, extended it somewhat, and turned it into an Arduino library by using GCC inline assembly coding. I also implemented a pure C++ version of this library. It turned out that the C++ version takes twice as much flash memory and that its maximum communication speed is roughly 6 times lower.

Assembly programming is not only good when you want to do things fast, but also if you want to have precise control over the timing. Because you know the exact number of cycles each machine instruction uses, you can implement, e.g., a software UART, which is as good as hardware UARTs.

Another example is the logic analyzer implemented on an Arduino UNO, as described in my previous post. The core of it is an acquisition loop that reads the port inputs at the given sample rate. Algorithmically, this is trivial. The challenging part is to time it precisely and to do it at high sample rates. I wrote an inline assembly code part that uses Timer 1 for timing the sampling, which works up to 1 MHz sampling frequency. For 2 MHz, I had to use a stripped-down version, which we will have a look at later.

A final example is a more time- and space-efficient implementation of the Arduino core, which makes heavy use of inline assembly programming.

So, I am not claiming that assembly coding is better than using C/C++ in general (although you might save some bytes in flash memory, but this is usually not crucial). On the contrary, I try to restrict assembly coding to a minimum! It is tiresome and error-prone. And assembly code is hard to debug as well.

My strategy when dealing with tight timing constraints is to code the time-critical part in C++ first. After that, I have a look at what the compiler generates and see whether the timing constraints can be met. Only then do I start to think about how to do it better, i.e., with fewer cycles, if that is necessary at all.

So, how do you get hold of what the compiler generates? There are many ways. The most straightforward one is probably to change/introduce the platform.local.txt file with the lines I showed you in the post about how to make the Arduino IDE ready for debugging. With this modification, you will get an ELF file when you select Export compiled Binary in the Sketch menu. Using the program avr-objdump, which you can locate in your Arduino package or have downloaded as part of the AVR-GCC toolchain, you can now generate a listing that contains the generated machine code:

> avr-objdump --disassemble --source --line-numbers --demangle --section=.text ELF-file > LST-file

When you want to do some assembly coding in an Arduino context, you have to have an idea of what kind of machine instructions you can use and what operands they work on. This is what I will cover next. Furthermore, you need to learn how to embed assembly code in your C++ program, which we will look at afterward. In the end, we will have a look at the high-speed logic analyzer acquisition loop.

RISC architecture

The AVR MCUs use a RISC instruction set. This means all instructions are highly specialized (dealing only with logic operations or only with memory transfer) and optimized so that most instructions need just one or two clock cycles. The AVR Instruction Set Manual provides a good reference and presents all the details one needs to know. Instead of going through each instruction, I will only present the big picture and introduce the concepts you need to know in order to understand how to code.

Memory types

The AVR MCUs use a modified Harvard architecture, which means that program memory and data memory are in two different address spaces. In addition, there exists EEPROM data memory, which is only indirectly accessible using I/O functions.

Data memory consists of 32 general-purpose 8-bit registers named r0–r31, 64 I/O registers for accessing peripheral functions (control registers and other I/O functions), up to 160 extended I/O registers (depending on the MCU type), and finally the internal SRAM, which comprises 2048 bytes on the ATmega328P.

Data memory (**Microchip Developer Help**)

All data memory cells can be accessed by using load and store instructions using the addresses shown on the right. The 64 I/O registers can be accessed over the I/O bus with in and out instructions using an I/O address that is lower by 0x20 than the regular memory address. Such I/O instructions use just one cycle instead of the two cycles that are needed for general load and store instructions. And in the first 32 of the I/O registers, one can set and clear individual bits using just two cycles. If you try to do that in SRAM, you need three instructions and at least five cycles.
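
The relation between the two kinds of addresses can be sketched in a few lines of plain C++ (it compiles on a PC as well). The helper names ioToMem, memToIo, and bitAddressable are made up for this illustration; in real code, the avr-libc macro _SFR_IO_ADDR, which shows up later in this post, does the subtraction for you.

```cpp
#include <stdint.h>

// Offset between the I/O address of a register (used by in, out, sbi, cbi)
// and its data-memory address (used by lds, sts, ld, st).
constexpr uint16_t kIoOffset = 0x20;

// Hypothetical helper: I/O address -> data-memory address.
constexpr uint16_t ioToMem(uint8_t ioAddr) {
  return static_cast<uint16_t>(ioAddr + kIoOffset);
}

// Hypothetical helper: data-memory address -> I/O address.
// Only meaningful for the 64 I/O registers at 0x20..0x5F.
constexpr uint8_t memToIo(uint16_t memAddr) {
  return static_cast<uint8_t>(memAddr - kIoOffset);
}

// sbi and cbi only reach the first 32 I/O registers (I/O addresses 0x00..0x1F).
constexpr bool bitAddressable(uint8_t ioAddr) {
  return ioAddr < 0x20;
}

static_assert(ioToMem(0x03) == 0x23, "I/O address 0x03 is memory address 0x23");
static_assert(memToIo(0x5F) == 0x3F, "the last I/O register has I/O address 0x3F");
static_assert(bitAddressable(0x03), "0x03 is within reach of sbi/cbi");
```

The value 0x03 is the I/O address of PINB that appears in the compiler listing near the end of this post.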

While all registers are 8-bit registers, the 6 upper registers form three register pairs that can be used as 16-bit registers for indirect addressing (see below). r26/r27 is also called X-register, r28/r29 Y-register, and r30/r31 Z-register. AVR MCUs are little-endian systems, which means that the byte with the lower address contains the least significant byte. So, if we want to store the number 0x0F12 in the Z-register, r30 would contain 0x12, r31 0x0F.
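
As a small illustration of the little-endian layout, here is a sketch in plain C++ that splits a 16-bit value into the two bytes that would end up in a register pair. The function names are invented for this example (the Arduino core offers the macros lowByte() and highByte() for the same purpose).

```cpp
#include <stdint.h>

// Least significant byte: goes into the lower-numbered register (e.g., r30).
constexpr uint8_t lowOf(uint16_t v) {
  return static_cast<uint8_t>(v & 0xFF);
}

// Most significant byte: goes into the higher-numbered register (e.g., r31).
constexpr uint8_t highOf(uint16_t v) {
  return static_cast<uint8_t>(v >> 8);
}

// Reassemble the 16-bit value from the two register contents.
constexpr uint16_t joinBytes(uint8_t low, uint8_t high) {
  return static_cast<uint16_t>((high << 8) | low);
}

// The example from the text: 0x0F12 in the Z-register.
static_assert(lowOf(0x0F12) == 0x12, "r30 holds 0x12");
static_assert(highOf(0x0F12) == 0x0F, "r31 holds 0x0F");
static_assert(joinBytes(0x12, 0x0F) == 0x0F12, "both bytes form 0x0F12 again");
```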

Addressing modes

Instructions can address only particular parts of the address space and may do that in different ways. These different ways are called addressing modes. We will go through all of these modes and give examples of how they are used.

Direct register addressing, single register: The operand is contained in one register, and the result of applying the instruction is stored in the same register. Example: neg r0, which replaces the contents of the register r0 with its two’s complement.
Direct register addressing, two registers: The operands are contained in two registers. If the instruction has a result, then it is stored in the first register. Example: sub r1, r2, which subtracts the contents of r2 from r1 and places the result in r1. Another example is cp r1, r2, which performs the same operation, but no result is stored. Only the status register is updated, which contains a number of different flags, such as the Zero-flag and the Carry-flag.
Immediate value: One operand is an immediate value that is part of the instruction. The other one is in a register (or register pair). Depending on the instruction, different ranges of values and different registers are permitted. Examples: adiw r30, 5, which adds the value 5 (0-63 is allowed) to the register pair r30/r31, and ori r16, 0xFE, which executes a logical or on register r16 and 0xFE, storing the result in r16 (instructions with an 8-bit immediate operand work only on the upper registers r16–r31).
I/O direct addressing: A byte is read from an I/O register into a general-purpose register or written from a register to an I/O register (I/O register addresses range from 0x00 to 0x3F). Examples: in r0, 0x05, which loads one byte from I/O register 0x05 into register r0, and out 0x06, r1, which outputs the contents of register r1 to I/O register 0x06.
Direct data addressing: One operand is a byte in data memory addressed by a 16-bit address given in the instruction. Examples: lds r2, 0x05FF, which loads the byte at address 0x05FF into r2, and sts 0x0020, r0, which stores the contents of r0 to address 0x0020.
Indirect data addressing: One operand is a byte in data memory addressed by the X-, Y- or Z-register. Example: st X, r5, which stores the contents of r5 to the memory addressed by the X-register.
Indirect data addressing with displacement: One operand is a byte in data memory addressed by the Y- or Z-register and a constant displacement of 0 up to 63. Example: ldd r0, Y+5, which loads from the address pointed at by the Y register plus 5 into register r0.
Indirect data addressing with pre-decrement or post-increment: One operand is a byte in data memory addressed by the X-, Y- or Z-register, and the address register is either decremented before access or incremented after access. Examples: ld r0, X+, which loads r0 with the byte pointed at by the X register, which is incremented afterward, and st -Y, r1, which stores the byte contained in register r1 to the location that is addressed after the Y register is decremented.
Program memory constant addressing: This mode is only usable with the program memory instructions lpm, elpm, and spm. These instructions load bytes from program memory or write words to program memory. In both cases, the address is given by the Z-register.
Program memory addressing with post-increment: Similar to the previous mode, but with post-increment of the Z-register.
Direct program memory addressing: Program execution continues at the address given as an immediate value in the instruction. Example: call 0x0555, which means that the subroutine at program address 0x0555 (which is not a byte address, but a word address!) is called. Note that as a programmer, one provides a symbolic label for the subroutine address, so one does not have to care about the specific numerical value.
Indirect program memory addressing: Program execution continues at the address the Z-register points to (again, a word address, not a byte address). Example: ijmp.
Relative program memory addressing: Program execution continues at the address relative to the current program counter given by a displacement (again a word address). And again, the programmer just puts in a symbolic value. Example: rjmp back, which means that program execution is continued at the program location labeled with back.
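
Since word and byte addresses are easy to mix up, here is a small sketch in plain C++ of the two conversions one needs when reading a listing. The function names are made up for this illustration. The relative-jump rule (target = address of the instruction + 2 + displacement, all in bytes) is read off the avr-objdump listing further below, where rjmp .-22 at address 0x14c leads to 0x138.

```cpp
#include <stdint.h>

// Program memory is organized in 16-bit words, so a word address
// (as used by call or ijmp) is half of the corresponding byte address.
constexpr uint32_t wordToByteAddr(uint16_t wordAddr) {
  return static_cast<uint32_t>(wordAddr) * 2u;
}

// Target of a relative jump or branch as printed by avr-objdump:
// the displacement is given in bytes and counts from the instruction
// following the jump, which is 2 bytes further on.
constexpr uint16_t relativeTarget(uint16_t instrByteAddr, int dispBytes) {
  return static_cast<uint16_t>(instrByteAddr + 2 + dispBytes);
}

// call 0x0555 (word address) refers to byte address 0x0AAA.
static_assert(wordToByteAddr(0x0555) == 0x0AAA, "word to byte address");
// Values taken from the listing later in this post.
static_assert(relativeTarget(0x14c, -22) == 0x138, "rjmp .-22");
static_assert(relativeTarget(0x12e, -16) == 0x120, "brne .-16");
```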

Instruction types

Instructions can be of different types based on what they accomplish. Some of them just move data around, others perform arithmetic. The AVR instructions can be classified into the following types of instructions:

Arithmetic and logic instructions: These are instructions that perform logic and arithmetic operations. Almost all of them work on two registers, and some of them take an immediate value. They all affect the status register, e.g., zero- and carry flags are set. The latter is particularly important when performing multibyte arithmetic operations. For example, if we want to add the two-byte integer in registers r2/r3 to the two-byte integer in r4/r5 (remember, LSB in the registers with the lower address), this could be done as follows:

  add r4, r2 ; add low byte in r2 to r4 without carry
  adc r5, r3 ; add high byte in r3 to r5 with carry-flag

Branch instructions: Branch instructions are all instructions that change the linear flow of execution. For example, the above-mentioned call instruction belongs to this set and also the ret instruction, which returns from a subroutine call. These two are unconditional. There is also a large set of conditional branch instructions, such as breq (branch if equal), which only jumps to the specified program address if the Zero-flag is set.
Data transfer instructions: Data transfer instructions move data around, e.g., from data memory to registers (ld) and vice versa (st). There are also two instructions that move data between registers (mov and movw), and instructions that manipulate the stack (push and pop). And there are, of course, the instructions mentioned above that read from I/O registers (in) and that write to them (out). None of them changes any flags in the status register.
Bit manipulation instructions: These are instructions that manipulate bits in one of the general registers, in the I/O registers, or in the status register. An example of the first kind of instruction is rol, which rotates the contents of a register to the left, shifting the carry flag into the least significant bit and moving the most significant bit into the carry flag. An example of the second kind of instruction is sbi, which sets a particular bit in an I/O register to one. Finally, an example of the third kind is clc, which clears the carry flag.
MCU control instructions: The only instruction from this set you probably will ever use is nop, which does nothing and takes one clock cycle to do so. This instruction is often used for delaying execution in order to time a loop.
Word instructions: Orthogonally to the above classification, one can classify instructions along the dimension of how many bytes are operated on. There are actually only three instructions that operate on words instead of on bytes. These are movw, to move a word from one register pair to another one, adiw, to add an immediate value (0-63) to a word in one of the four upper register pairs, and sbiw, which subtracts an immediate value.
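
To see why the carry flag matters for the two-byte addition with add and adc shown above, here is a small model of the two instructions in plain C++. It is only a sketch of the arithmetic, not of the real ALU (other flags are ignored), and all names are invented for this example.

```cpp
#include <stdint.h>

// Result of an 8-bit addition: the sum byte and the carry flag.
struct AddResult {
  uint8_t value;
  bool carry;
};

constexpr uint8_t lowOf(uint16_t v) { return static_cast<uint8_t>(v & 0xFF); }
constexpr uint8_t highOf(uint16_t v) { return static_cast<uint8_t>(v >> 8); }

// Full 9-bit sum of two bytes and an incoming carry.
constexpr unsigned rawSum(uint8_t a, uint8_t b, bool carryIn) {
  return static_cast<unsigned>(a) + b + (carryIn ? 1u : 0u);
}

// Models add (carryIn = false) and adc (carryIn = carry flag of the previous step).
constexpr AddResult add8(uint8_t a, uint8_t b, bool carryIn) {
  return AddResult{static_cast<uint8_t>(rawSum(a, b, carryIn) & 0xFFu),
                   rawSum(a, b, carryIn) > 0xFFu};
}

// 16-bit addition built from the two 8-bit steps:
//   add low bytes without carry, then adc high bytes with the carry of step one.
constexpr uint16_t add16(uint16_t x, uint16_t y) {
  return static_cast<uint16_t>(
      add8(lowOf(x), lowOf(y), false).value |
      (add8(highOf(x), highOf(y),
            add8(lowOf(x), lowOf(y), false).carry).value << 8));
}

// 0x01FF + 0x0001: the low bytes overflow, the carry bumps the high byte.
static_assert(add8(0xFF, 0x01, false).carry, "low byte addition sets the carry");
static_assert(add16(0x01FF, 0x0001) == 0x0200, "carry propagates into the high byte");
```

If the second step used add instead of adc, the carry would be lost, and 0x01FF + 0x0001 would come out as 0x0100.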

Inline assembly coding

Once you have understood how to use assembly instructions, you may want to write an assembler program. This can be done by writing such a program into a file with a .S extension and then handing it over to avr-gcc. Although the Arduino IDE has supported this for some time now, it would mean that you also need to learn the syntax of assembly language files. Instead, you can insert inline assembly code into C/C++ code, as in the following basic example, where we enable interrupts:

asm volatile("sei");

Note that we use the asm statement with the volatile qualifier. This is something one should do in order to avoid the compiler optimizing the assembly code away or moving it out of a loop (strictly speaking, this is only necessary when there is an output operand that is not used afterward in the C++ code, but it never hurts anyway).

Let us now try to read the input port PINA, which should then be stored in the C-variable invalue of type byte. One might be tempted to use an asm statement like the following

asm volatile("in invalue, PINA");

which will not work for several reasons. First of all, the first operand of an in instruction needs to be a register, not a C-variable. So, we need to establish a connection between a register that the compiler will choose and the output variable invalue. Second, we cannot use a compile-time constant such as PINA inside a string because the pre-processor does not change anything inside a string. In order to deal with both problems, the asm statement usually contains more than simply the assembly code. The general form of the asm statement is:

asm volatile(<code> : <output operands> : <input operands> : <clobbers>)

The <code> section is simply a string containing the assembly code, where the instructions have to be separated by a newline symbol. In order to make the listing generated by the compiler nicer, one should also add a tabulator. Finally, since in C and C++ string constants that follow each other and are only separated by white space are considered as one string constant, one can write each instruction in one line, e.g.,

"nop    ; this is a comment \n\t"
"rjmp LABEL ; jump to LABEL \n\t"

The <output operands> section is a comma-separated list of descriptions of how outputs should be returned to the C/C++ context. Such a description consists of a constraint and a C-expression in parentheses. In the case of an output operand, this must be an lvalue, i.e., something that could be on the left-hand side of an assignment operator.

Similarly, the <input operands> section specifies how input is fed into the asm statement. Finally, the <clobbers> section is a comma-separated list of registers that the inline code modifies in addition to its operands. The compiler then makes sure that no value it still needs is kept in these registers across the asm statement. Our example from above could be written as

asm volatile("in %0, %1" : "=r" (invalue) : "I" (_SFR_IO_ADDR(PINA)));

The notations %0 and %1 refer to the first and second operand specifications, respectively. The first operand specification in the output section is "=r" (invalue), which means that inside the asm statement %0 stands for a register, and after the asm statement is finished, the register contents should be stored in the C-variable invalue. The specification "I" (_SFR_IO_ADDR(PINA)) means that %1 should be substituted by a constant and this constant is the I/O address of PINA (i.e., 0x20 is subtracted from the memory address of the input port A).

For an output specification, the following constraints can be used:

"=r" meaning that a register should be used, and the value is write-only,
"+r" meaning that a register should be used, and the value should be loaded at the beginning and stored in the end,
"=&r" meaning that a register should be used, and this register is reserved exclusively for the output, i.e., it is never shared with an input operand.

For input specifications, many more constraints are possible. We already saw "I" meaning a 6-bit positive integer. There is also, for instance, "M" meaning an 8-bit positive integer constant, "r" meaning a register, "e" meaning one of the three address registers, etc. Here is a table with all possible constraints.

Constraint	Used for	Range of allowed values	Possible registers to be allocated by compiler
a	simple upper register		r16-r23
b	base pointer register pairs		Y, Z
d	upper register		r16-r31
e	pointer register pair		X, Y, Z
l	lower register		r0-r15
q	stack pointer		SPL:SPH
r	any register		r0-r31
t	temporary register		__tmp_reg__
w	special upper register pair		r24, r26, r28, r30
x	pointer register X		X
y	pointer register Y		Y
z	pointer register Z		Z
G	floating point constant	0.0
I	6-bit positive integer	0 to 63
J	6-bit negative integer	-63 to 0
K	integer constant	2
L	integer constant	0
M	8-bit integer constant	0 to 255
n	16-bit integer constant	0 to 65535
N	integer constant	-1
O	integer constant	8, 16, 24
P	integer constant	1
Q	memory address based on Y or Z pointer with displacement
R	integer constant	-6 to 5
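
The ranges in the table decide whether a constant operand is accepted at all. The following sketch in plain C++ mirrors three of the range checks; the function names are invented for this illustration, and the compiler performs the corresponding checks itself when it processes an asm statement.

```cpp
#include <stdint.h>

// "I": 0 to 63, e.g., an I/O address for in/out or the constant of adiw/sbiw.
constexpr bool fitsI(long v) { return v >= 0 && v <= 63; }

// "J": -63 to 0.
constexpr bool fitsJ(long v) { return v >= -63 && v <= 0; }

// "M": 0 to 255, e.g., the constant of ldi or ori.
constexpr bool fitsM(long v) { return v >= 0 && v <= 255; }

// I/O address 0x03 is fine for in/out.
static_assert(fitsI(0x03), "a regular I/O address fits constraint I");

// The extended I/O registers start at memory address 0x60. Subtracting 0x20
// gives 0x40 = 64, which is out of range, so in/out cannot be used for them
// and one has to fall back to lds/sts.
static_assert(!fitsI(0x60 - 0x20), "extended I/O registers do not fit constraint I");
```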

One interesting feature of the inline assembler is that you can handle not only one-byte values and variables but also two- and four-byte values and variables. If, for instance, you want to swap the bytes of a two-byte variable val, this can be done as follows:

asm volatile("mov r5, %A0\n\t"
             "mov %A0, %B0\n\t"
             "mov %B0, r5\n\t"
             : "+r" (val)
             : 
             : "r5");

Here %An stands for the first (least significant) byte register of the operand %n and %Bn stands for the second byte. In the case of four-byte values, one would also use %Cn and %Dn. The variable val is here both input and output, which is signified by "+r". Furthermore, r5 is mentioned as a clobbered register, i.e., the compiler must make sure that values previously stored in r5 get restored. Actually, instead of using r5, we could have used the temporary register r0, which is always available. Instead of writing r0, however, one should use __tmp_reg__ in order to make the code independent of changes in register allocation that different compiler versions may introduce.
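
A variant of the byte swap that uses __tmp_reg__ instead of r5 could look like the following sketch. The function names swapBytesRef and swapBytesAsm are made up for this example. The inline assembly part is only compiled for AVR targets; the portable reference function states what the three mov instructions are supposed to compute, so that the result can also be checked on a PC.

```cpp
#include <stdint.h>

// Portable reference: exchange the low and the high byte of a 16-bit value.
constexpr uint16_t swapBytesRef(uint16_t val) {
  return static_cast<uint16_t>((val << 8) | (val >> 8));
}

#if defined(__AVR__)
// AVR-only variant. Because __tmp_reg__ may be used freely inside an
// asm statement, no clobber list is needed here.
static inline uint16_t swapBytesAsm(uint16_t val) {
  asm volatile("mov __tmp_reg__, %A0\n\t"   // save low byte
               "mov %A0, %B0\n\t"           // high byte -> low byte
               "mov %B0, __tmp_reg__\n\t"   // saved low byte -> high byte
               : "+r" (val));
  return val;
}
#endif

static_assert(swapBytesRef(0x0F12) == 0x120F, "low and high byte are exchanged");
```

On an AVR target, swapBytesAsm(0x0F12) should return the same value as swapBytesRef(0x0F12).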

The AVR-GCC Inline Assembler Cookbook gives the whole picture of how to write inline assembly code but is dense at some points. If you want more of a tutorial-style introduction, then the Arduino Inline Assembly Tutorial is probably the right thing for you. It also gives a gentle introduction to AVR assembly programming in general. Meanwhile, the author has also published an e-book based on his blog. Let me finally point you to a tutorial write-up by wek at AVRFREAKS, which addresses some possible misunderstandings and pitfalls.

Example

In order to demonstrate the power of using inline assembly coding, let us look at the example of the 2MHz acquisition loop for the Arduino UNO logic analyzer (simplified):

byte trigger, trigger_values;
byte logicdata[1024];

// This function provides sampling for 2 MHz with no pre-trigger data
void acquire2MHz() {
  byte inp;
  unsigned int index = 0;
  
  cli(); // disable interrupts
  do {
    inp = PINB; // read sample
  } while ((trigger_values ^ inp ) & trigger); // as long as no trigger
  logicdata[0] = inp; // keep first trigger sample 
  
  // keep sampling for 1023 samples after trigger
  for (unsigned int i = 1 ; i < 1024; i++) {
    logicdata[i] = PINB;
  }
  sei(); // enable interrupts again
}

If we now look at what the compiler makes out of it, it looks quite efficient:

void acquire2MHz() {
  byte inp;
  
  cli();
 11e:	f8 94       	cli

  do {
    inp = PINB;
 120:	93 b1       	in	r25, 0x03	; 3

  } while ((trigger_values ^ inp ) & trigger); 
 122:	80 91 05 05 	lds	r24, 0x0505	; 0x800505 <trigger_values>
 126:	89 27       	eor	r24, r25
 128:	20 91 04 05 	lds	r18, 0x0504	; 0x800504 <trigger>
 12c:	82 23       	and	r24, r18
 12e:	c1 f7       	brne	.-16     	; 0x120
  logicdata[0] = inp;
 130:	90 93 04 01 	sts	0x0104, r25	; 0x800104 <logicdata>

  for (unsigned int i = 1 ; i < 1024; i++) {
 134:	81 e0       	ldi	r24, 0x01	; 1
 136:	90 e0       	ldi	r25, 0x00	; 0
 138:	81 15       	cp	r24, r1         ; [1]
 13a:	24 e0       	ldi	r18, 0x04	; 4 [1]
 13c:	92 07       	cpc	r25, r18        ; [1] 
 13e:	38 f4       	brcc	.+14     	; 0x14e [1 if false]

    logicdata[i] = PINB;
 140:	23 b1       	in	r18, 0x03	; 3 [1]
 142:	fc 01       	movw	r30, r24        ; [1]
 144:	ec 5f       	subi	r30, 0xFC	; 252 [1]
 146:	fe 4f       	sbci	r31, 0xFE	; 254 [1]
 148:	20 83       	st	Z, r18          ; [2]

  for (unsigned int i = 1 ; i < 1024; i++) {
 14a:	01 96       	adiw	r24, 0x01	; 1 [2]
 14c:	f5 cf       	rjmp	.-22     	; 0x138 [2]
  }

  sei();
 14e:	78 94       	sei

}
 150:	08 95       	ret

The interesting part is the for-loop between 138 and 14c. I have annotated the listing with the clock cycles in brackets (st and adiw take two cycles each on the ATmega328P). When you add them up, you get a result of 14 clock cycles per iteration. For a sample rate of 2 MS/s on an MCU running at 16 MHz, however, each loop iteration may take only 8 clock cycles, so this does not fit. So, here we have a case where inline assembly is definitely a must. The final acquisition procedure looks as follows:

void acquire2MHz() {
  byte * ptr = &logicdata[0];
  int delayCount = 1023;

  cli();
  asm volatile(
    "TRIGLOOP: in __tmp_reg__, %[CHANPINaddr]; read input [1]\n\t"
    "st %a[LOGICDATAaddr], __tmp_reg__  ; store input [2]\n\t"
    "eor __tmp_reg__, %[TRIGVAL]        ; inp = inp XOR trig_val [1]\n\t"
    "and __tmp_reg__, %[TRIGMASK]       ; inp = inp AND trigger [1]\n\t"
    "brne TRIGLOOP                      ; wait for trigger [1 if false]\n\t"
    "adiw %A[LOGICDATAaddr], 1          ; increment pointer [2]\n\t"
    ";This makes it 8 cycles! Now the sampling loop:\n\t"
    "SAMPLOOP: in __tmp_reg__, %[CHANPINaddr]; read input data [1]\n\t"
    "nop                                 ; cycle padding [1]\n\t"
    "st %a[LOGICDATAaddr]+, __tmp_reg__  ; store input & post incr. [2]\n\t"
    "sbiw %A[DELAYCOUNT], 1              ; decrement delayCount [2]\n\t"
    "brne SAMPLOOP                       ; continue until end [2]\n\t"
    ";sum: 8 cycles\n\t"
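    : [LOGICDATAaddr] "+e" (ptr),         // modified by adiw and st X+, so read-write
      [DELAYCOUNT] "+w" (delayCount)      // modified by sbiw, which needs r24, X, Y, or Z
    : [CHANPINaddr] "I" (_SFR_IO_ADDR(PINB)),
      [TRIGMASK] "r" (trigger),
      [TRIGVAL] "r" (trigger_values)
    : "memory");                          // the loop writes to logicdata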
  sei();
}

Now only 8 clock cycles are used per sample, and in addition, the triggering sample is taken exactly 8 cycles before the general acquisition starts. There are a number of additional features used with respect to what we have introduced so far, though. First, I used symbolic references instead of positional references for output and input operands. Second, I used the notation %a[ref], as in the instruction st %a[LOGICDATAaddr]+, __tmp_reg__. This means that the respective pointer register should be used, i.e., X, Y, or Z.

All in all, this example shows that with inline assembly, you can push the boundaries of timing on an AVR MCU quite a lot. But it is also clear from the code that it would be impossible to achieve a sample rate of 4 or 5 Ms/second with an AVR MCU running at 16 MHz using a loop. This is only achievable when one uses loop unrolling.
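
The cycle arithmetic behind these statements can be checked with a few lines of plain C++. This is only a back-of-the-envelope sketch with invented names. It assumes that an unrolled acquisition consists of nothing but an in and a st with post-increment per sample (plus a nop for padding where needed), and it uses the fact that each of these instructions occupies one 16-bit word of flash.

```cpp
#include <stdint.h>

constexpr uint32_t kCpuHz = 16000000UL;  // clock frequency of the Arduino UNO

// Clock cycles available per sample at a given sample rate.
constexpr uint32_t cycleBudget(uint32_t cpuHz, uint32_t sampleHz) {
  return cpuHz / sampleHz;
}

// Sampling loop from the listing above: in + nop + st X+ + sbiw + brne (taken).
constexpr uint32_t kSampLoopCycles = 1 + 1 + 2 + 2 + 2;

// Unrolled sampling (assumption): only in + st X+ per sample, no loop overhead.
constexpr uint32_t kUnrolledCycles = 1 + 2;

// Flash needed for unrolled code: one 16-bit word (2 bytes) per instruction.
constexpr uint32_t unrolledFlashBytes(uint32_t samples, uint32_t wordsPerSample) {
  return samples * wordsPerSample * 2;
}

static_assert(cycleBudget(kCpuHz, 2000000UL) == 8, "2 MHz leaves 8 cycles per sample");
static_assert(kSampLoopCycles == 8, "the sampling loop uses exactly this budget");
static_assert(cycleBudget(kCpuHz, 4000000UL) == 4, "4 MHz leaves only 4 cycles");
static_assert(kUnrolledCycles + 1 == 4, "in + st X+ + nop matches 4 MHz");
static_assert(kCpuHz / kUnrolledCycles == 5333333UL, "without the nop: about 5.33 MHz");
static_assert(unrolledFlashBytes(1024, 3) == 6144, "flash cost of 1024 samples at 4 MHz");
```

In other words, under these assumptions, unrolling trades flash for speed: 1024 samples at 4 MHz would cost about 6 KB of the 32 KB flash of an ATmega328P.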

[Edit Nov-2023: An attentive reader noted that it is indeed possible to link .S files into an Arduino project. I changed that part.]
