The featured picture of this blog post is by rawpixel.com on Freepik.

AVR MCUs sometimes appear to restart without you having pressed the RESET button or any other obvious reason. Is that a sign of resilience or of looming danger? How can you find the root cause?

When your AVR MCU appears to restart spontaneously, there can be a number of reasons for it. It is a good idea to identify the root cause because the MCU’s behavior after such a restart may be flaky, and the same cause can lead to a crash, i.e., the MCU becomes completely unresponsive.

Reasons for Restarts

There are real and apparent restarts. When I write apparent restart, I mean the behavior after the AVR MCU jumps to address 0x0000 without resetting the hardware. This differs from a real restart, where the MCU registers and control flags are reset to their initial default values before execution starts at 0x0000. This makes an apparent restart a very dangerous imposter of a real restart. The program very likely makes assumptions about the initial value of the registers and control flags, which are not satisfied after an apparent restart.

So, what can cause a real restart? One main reason is that the RESET pin is tied to GND for some time, i.e., when the RESET button is pressed. Further, applying power to the MCU leads to a power-on reset. In addition, the watchdog timer, when activated, can provoke a reset. Finally, when the supply voltage goes below a specified brown-out threshold, a reset is initiated. In all these cases, all registers are reset to their default values, and program execution starts at 0x0000.

The way to find out whether a real restart happened is to inspect the MCUSR register, in which the reason for a reset is stored.

[Figure: MCUSR register, from the ATmega328 megaAVR® Data Sheet]

WDRF is set after a watchdog reset, BORF is set after a brown-out reset, EXTRF is set after an external reset, and PORF is set after a power-on reset. If you are running your MCU with a bootloader, you will never see the contents of this register because it will be cleared by the bootloader. However, if you are not using a bootloader, you can find out the reset reason as follows.

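#include <avr/wdt.h>  // this is optional!

// Kept in .noinit so that the startup code does not overwrite it.
byte mcusr_mirror __attribute__ ((section (".noinit")));

// Placed in .init3, so it runs before main().
void mcusr_init(void) __attribute__((naked)) __attribute__((section(".init3"))) __attribute__((used));
void mcusr_init(void)
{
  mcusr_mirror = MCUSR;  // save the reset flags for later inspection
  MCUSR = 0;             // clear them so the next reset reports only its own cause
  wdt_disable();         // this is optional!
}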

The routine mcusr_init will be invoked before everything else happens and will set the variable mcusr_mirror, which you can inspect later in your program. If it is zero, then you know that an apparent restart occurred.

The inclusion of the header file avr/wdt.h and the call of wdt_disable() are optional. You need this call only when you use the watchdog timer in your sketch because, after a watchdog reset, the watchdog timer stays enabled with the shortest possible watchdog interval.

So, if your MCU executes a restart sequence, although the four reasons above can be excluded, what could be the cause for this apparent restart?

Bad Interrupt

One reason for apparent restarts is a bad interrupt, i.e., an interrupt for which no interrupt service routine (ISR) has been registered. By default, such an interrupt ends up in a handler that simply jumps to 0x0000. So, if an interrupt is enabled, but no interrupt routine is registered for it, the MCU will continue its execution at 0x0000 as soon as the interrupt is raised.

One expects this to never happen since a reasonable programmer always provides an ISR before enabling the corresponding interrupt. However, if you mistype an interrupt vector name, the ISR is not registered. This only leads to a compiler warning, and warnings are, by default, disabled in the Arduino IDE. So, one way to find out whether a bad interrupt could be the reason is to switch on the warnings in the Preferences dialog of the IDE or to use the compiler option -Wall. If you then see a message such as

/Users/.../file.ino:55:5: warning: 'TIMER_COMPA_vect' appears to be a misspelled 'signal' handler, missing '__vector' prefix [-Wmisspelled-isr],

then you know that you should try to find out the correct spelling for this ISR.

Of course, instead of a misspelled ISR name, a programmer could have enabled the wrong interrupt. If you suspect something like that, you can catch all bad interrupts by registering a catch-all ISR.

#include <avr/interrupt.h>
ISR(BADISR_vect)
{
    // user code here
}

Bad Indirect Function Call

It is possible to call a function indirectly by following a pointer. The following example illustrates this.

typedef void (*func_t)(void);

void sub(){
    Serial.println("Goodbye");
}

void setup(void)
{
    Serial.begin(19200);
    Serial.println("Hello world");
    func_t f_sub = &sub;
    f_sub();
}

void loop(void) { }

When you now assign the wrong value to f_sub, e.g., by not initializing the variable, the MCU may jump to some arbitrary place in flash memory. All of the flash memory above the highest programmed flash memory cell reads as 0xFF, which the MCU executes much like a NOP. This means that the MCU, once having jumped to such a location, will walk through the rest of the address space until the program counter overflows and starts with 0x0000 again. Voilà! Alternatively, the MCU might jump to an arbitrary memory cell and act completely strangely, or it might do nothing.
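If you do use a function pointer, one modest safeguard is to give it a defined initial value and to check it before the call. The following is a minimal sketch; the names handler and callHandler are made up for illustration. Note that it only catches a pointer that was never set. It cannot detect a pointer that has been overwritten with garbage.

```cpp
typedef void (*func_t)(void);

// Start with a defined value instead of leaving the pointer uninitialized.
func_t handler = nullptr;

// Call the handler only if one has been assigned.
// Returns false when there was nothing to call.
bool callHandler(void)
{
  if (handler == nullptr) return false;
  handler();
  return true;
}
```

The check matters on an AVR in particular: a null function pointer has the value 0, so calling it unchecked would send the MCU straight to 0x0000.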

Now, it is very unlikely that you used an indirect function call. It is probably the first time you have heard about indirect function calls. And I would not recommend using them in embedded systems. However, a jump to an arbitrary address in the flash address space can also have other causes.

Stack Overflow

The most likely reason for an apparent restart is a stack overflow. The stack is a data area that is used for storing local variables and return addresses. On AVRs, it starts at the top of the data area and grows downwards towards the beginning of the data area.

When making a function call or invoking an interrupt service routine, the return address is pushed onto the stack. At the start of each function, registers are saved, and an area for local variables is allocated. This is all freed when returning from the function. As a last action in each function, the instruction RET is executed, which pops the return address from the stack and loads it into the program counter. And this is precisely the moment when our MCU could be sent on a wild goose chase. When the stack has become too large, the return address may have been overwritten or not stored at all, and we have the same situation as with a bad indirect function call.

How can we find out whether such a thing happened? This is really difficult. You could define a function freeRam that measures how much of the data area is still free and which is called in every function, as in the following sketch, which also demonstrates the stack overrun phenomenon.

void setup(void)
{
  Serial.begin(19200);
  fun();
}

void loop(void) { }

void fun(void)
{
  Serial.println(freeRam());
  delay(50);
  fun();
  Serial.println(F("Returning"));
}

int freeRam(void)
{
  extern unsigned int __heap_start;
  extern void *__brkval;
  int free_memory;
  int stack_here;

  if (__brkval == 0) free_memory = (int) &stack_here - (int) &__heap_start;
  else free_memory = (int) &stack_here - (int) __brkval; 
  return free_memory;
}

The function fun called in the setup routine does nothing more than print the remaining amount of stack space and then calls itself again. After a while, the stack will overflow, and then the MCU crashes, i.e., becomes unresponsive or restarts. So, you could use freeRam to monitor the remaining amount of free memory and report if there is not enough left.
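Such monitoring could look like the following sketch, which keeps track of the smallest value ever measured. The names MIN_FREE_RAM, lowestFreeRam, and checkFreeRam are made up for illustration, and the margin of 64 bytes is an arbitrary assumption that you would have to tune for your program.

```cpp
// Hypothetical safety margin in bytes; tune it for your sketch.
const int MIN_FREE_RAM = 64;

// Smallest value measured so far. It starts at the largest value
// a 16-bit int (as used on AVRs) can hold.
static int lowestFreeRam = 32767;

// Record one measurement, e.g., the result of freeRam().
// Returns false once any measurement has fallen below the margin
// and keeps returning false from then on.
bool checkFreeRam(int freeNow)
{
  if (freeNow < lowestFreeRam) lowestFreeRam = freeNow;
  return lowestFreeRam >= MIN_FREE_RAM;
}
```

You would call checkFreeRam(freeRam()) at the places where you expect the stack to be deepest and print a message or switch on an LED when it returns false.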

The best way to avoid a stack overflow altogether is to declare all large data structures globally. By large, I mean everything larger than, say, 20 bytes. Then, the variables taking up a lot of memory are allocated statically, and at the end of the compilation, you get a report of how much space is used. If a few hundred bytes of memory are left free, and if you are sure that function calls do not use too much space, you are on the safe side. And recursive functions, as in the example above, are, of course, off limits in an embedded programming context!

Heap Overflow?

The above description is actually only one side of the coin since the stack boundary is not static. There is another data area, called the heap, that grows from the bottom to the top of the data area (on AVRs). The heap contains all dynamically allocated data structures. These are structures that you explicitly allocate with new or malloc or that are dynamically allocated under the hood, as happens, e.g., with the Arduino String class.

If the heap grows too big, then at some point, the allocation of memory will not work anymore, i.e., instead of the address of a freshly allocated piece of memory, you get NULL as the result of new or malloc. Further, there is the problem of a looming stack overflow because the heap eats up the space the stack needs. So, while the heap will not overflow into the stack by itself, i.e., there is no danger of a heap overflow, subsequent function calls may lead to a stack overflow much earlier because the heap takes up the free space.
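If some piece of code does allocate memory dynamically, it should at least check for the NULL result before using the memory. The following is a minimal sketch of that check; the helper name copyString is made up for illustration.

```cpp
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: copy a C string into freshly allocated memory.
// Returns NULL when malloc fails instead of writing through a null pointer.
char *copyString(const char *src)
{
  size_t n = strlen(src) + 1;       // include the terminating '\0'
  char *dst = (char *) malloc(n);
  if (dst == NULL) return NULL;     // allocation failed: tell the caller
  memcpy(dst, src, n);
  return dst;
}
```

The caller has to test the result as well and has to release the memory with free() when it is no longer needed.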

My advice is to avoid dynamic memory allocation by all means in an embedded programming context. It is simply too risky, and one does not have any control over how much memory is allocated. Since there is usually only a very small amount of data memory, e.g., 256 bytes up to a few kilobytes, one needs to plan how to use this memory carefully and cannot rely on the fact that there will be enough memory, as you can on a desktop computer.

Buffer Overrun

While stack overflows happen because there is not enough memory, buffer overruns are explicit program errors caused by writing beyond the limits of an array, as in the following example.

void setup(void)
{
  Serial.begin(19200);
  Serial.println("Hello world");
  delay(100);
  fun();
  Serial.println("Goodbye");
}

void loop(void) { }

void fun(void)
{
  char buf[10];

  for (byte i=0; i < 19; i++) buf[i] = '\0';
  Serial.print(buf);
}

Here, we write to buf[18], although the array has only 10 array cells. What happens is that other data on the stack is overwritten. In particular, the return address is overwritten, which leads the MCU to jump to 0x0000, and we never see the “Goodbye”! By the way, when playing around with the upper bound in the iteration, different things happen. Such a buffer overrun can, of course, also occur with global variables.

The moral here is to make sure not to write beyond the boundary of an array. So, never assume that some input from the outside world will respect the buffer size assumption you made. Always check explicitly that the limits are not violated. Better raise some error if the limits are violated than be forced to hunt down some obscure buffer overrun.
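An explicit check of this kind could look like the following sketch, which refuses to store a character when the buffer is full. The names BUF_SIZE, used, and appendChar are made up for illustration; the buffer size is the one from the example above.

```cpp
#include <stddef.h>

const size_t BUF_SIZE = 10;   // same size as buf in the example above
char buf[BUF_SIZE];
size_t used = 0;              // number of characters stored so far

// Append one character and keep the string terminated.
// Returns false (and stores nothing) when the buffer is full;
// one cell is always reserved for the terminating '\0'.
bool appendChar(char c)
{
  if (used + 1 >= BUF_SIZE) return false;
  buf[used++] = c;
  buf[used] = '\0';
  return true;
}
```

With this check, trying to store 19 characters simply fails after the ninth one, and the caller can report an error instead of silently overwriting the return address.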

Or is it the Hardware?

Stack overflows and buffer overruns are hard enough to locate. So, before one tries to hunt down such an error, it is a good idea first to check that the hardware is not the reason for the restarts. In particular, missing decoupling capacitors of 100 nF close to the MCU can lead to unstable behavior. Similarly, one should ensure the supply voltage is high enough and stable.

Summary

Spontaneous apparent restarts as well as occasional crashes of an MCU point to serious underlying problems. The most probable root causes on the software side are:

  • bad interrupts, which are usually easy to diagnose, because the compiler issues a warning if an ISR name is misspelled;
  • bad indirect function calls, which fortunately are not very frequent because they are not often used (for good reasons) in embedded programming;
  • stack overflows, which are hard to diagnose; avoid them by allocating all big data structures globally, by not using dynamic allocation of data structures, and by making sure that there is enough space left on the stack;
  • buffer overruns, i.e., assigning values to array cells that are beyond the boundary of the array.

So, when restarts or crashes happen, look for these kinds of root causes first (after ensuring the hardware is working).

EDIT: Changed wording from “reboot” to “restart”, because MCUs do not boot.
