MCU Reboots: Why Do They Happen? – Arduino Craft Corner

The featured picture of this blog post is by rawpixel.com on Freepik.

AVR MCUs sometimes appear to reboot without you having pressed the RESET button. Is that a sign of resilience or of looming danger? And how do you find the root cause?

When your AVR MCU seems to suddenly reboot, there can be a number of reasons for it. And it is a good idea to identify the root cause, because the MCU behavior after an apparent reboot may be flaky, and because the same cause can lead to a crash of the MCU, i.e., the MCU becomes completely unresponsive.

Reasons for Reboots

There are real and apparent reboots. By apparent reboot, I mean that the AVR MCU starts executing at address 0x0000 again without a hardware reset having taken place. This is different from a real reboot, where the MCU registers are reset to their initial default values. After an apparent reboot, the registers keep whatever values they had before, which makes an apparent reboot a very dangerous impostor of a real reboot.

So, what can cause a real reboot? One main reason is that the RESET pin is tied to GND for some time, i.e., when the RESET button is pressed. Further, applying power to the MCU leads to a power-on reset. In addition, the watchdog timer, when activated, can issue a reset. Finally, when the supply voltage goes below a specified brown-out threshold, a reset is initiated. In all these cases, all registers are reset to their default values and program execution starts at 0x0000.

The way to find out whether a real reboot happened is to inspect the MCUSR register, in which the reason for a reset is stored.

WDRF is set after a watchdog reset, BORF is set after a brown-out reset, EXTRF is set after an external reset, and PORF is set after a power-on reset. If you are running your MCU with a bootloader, you will never see the contents of this register because it will be cleared by the bootloader. However, if you are not using a bootloader, you can find out the reset reason as follows.

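#include <avr/wdt.h>  // this is optional!

// Kept in .noinit so that the start-up code does not overwrite it.
byte mcusr_mirror __attribute__ ((section (".noinit")));

// Placed in .init3 so that it runs before main(). It is naked because the
// code is part of the start-up sequence and must fall through, not return.
void mcusr_init(void) __attribute__((naked)) __attribute__((section(".init3"))) __attribute__((used));
void mcusr_init(void)
{
  mcusr_mirror = MCUSR;  // save the reset flags
  MCUSR = 0;             // clear them so that the next reset can be identified
  wdt_disable(); // this is optional!
}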

The routine mcusr_init will be invoked before everything else happens and will set the variable mcusr_mirror, which you can inspect later on in your program. If it is zero, then you know that an apparent reboot happened.

The inclusion of the header file avr/wdt.h and the call of wdt_disable() are optional. You need to call this function only when you use the watchdog timer in your sketch, because after a watchdog reset, the watchdog timer remains active with the shortest possible watchdog interval.

So, if your MCU executes a restart sequence, although the four reasons above can be excluded, what could be the cause for a restart?

Bad Interrupt

One reason for apparent reboots is bad interrupts, i.e., interrupts for which no interrupt service routine (ISR) has been registered. By default, every unused interrupt vector points to a handler that simply jumps to 0x0000. So, if an interrupt is enabled, but the interrupt routine for it is not registered, the MCU will continue its execution at 0x0000.

One expects that this would never happen, since a reasonable programmer always provides an ISR before enabling the corresponding interrupt. However, if an interrupt vector name is mistyped, the ISR is not registered for the intended interrupt. This only leads to a compiler warning, and warnings are disabled by default in the Arduino IDE. So, one way to find out whether a bad interrupt could be the reason is to switch on the warnings in the Preferences dialog of the IDE or to use the compiler option -Wall. If you then see a message such as

/Users/.../file.ino:55:5: warning: 'TIMER_COMPA_vect' appears to be a misspelled 'signal' handler, missing '__vector' prefix [-Wmisspelled-isr],

then you know that you should try to find out the right spelling for this ISR.

Of course, instead of a misspelled ISR name, a programmer could have enabled the wrong interrupt. If you suspect something like that, you can catch all bad interrupts by registering a catch-all ISR.

#include <avr/interrupt.h>
ISR(BADISR_vect)
{
    // user code here
}
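What goes into the catch-all ISR depends on your setup. As one possibility, the handler could record that it was triggered, switch on the built-in LED, and then stay in an endless loop, so that the fault shows up instead of being masked by a restart. The sketch below is only an illustration: recordBadInterrupt, bad_isr_marker, and the marker value are made-up names, and the use of LED_BUILTIN assumes an Arduino board with an on-board LED.

```cpp
#include <stdint.h>

// Hypothetical marker value; any nonzero constant would do.
static const uint8_t BAD_ISR_MARKER = 0xB1;

// Set when an interrupt without a registered ISR fires.
// It can be inspected with a debugger while the MCU is halted.
volatile uint8_t bad_isr_marker = 0;

static inline void recordBadInterrupt(void)
{
  bad_isr_marker = BAD_ISR_MARKER;
}

#if defined(ARDUINO) && defined(__AVR__)
#include <Arduino.h>
#include <avr/interrupt.h>

ISR(BADISR_vect)
{
  recordBadInterrupt();
  pinMode(LED_BUILTIN, OUTPUT);     // assumption: board has a built-in LED
  digitalWrite(LED_BUILTIN, HIGH);  // signal the fault
  for (;;) { }                      // halt instead of restarting at 0x0000
}
#endif
```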

Bad Indirect Function Call

It is possible to call a function in an indirect way, following a pointer. This is illustrated in the following example.

typedef void (*func_t)(void);

void sub(){
    Serial.println("Goodbye");
}

void setup(void)
{
    Serial.begin(19200);
    Serial.println("Hello world");
    func_t f_sub = &sub;
    f_sub();
}

void loop(void) { }

When you now assign the wrong value to f_sub, e.g., by not initializing the variable, the MCU may jump to some arbitrary place in flash memory. All of the flash memory above the highest programmed flash cell reads as 0xFF, which the MCU effectively treats as a NOP. This means that the MCU, once having jumped to such a location, will walk through the rest of the address space until the program counter overflows and starts with 0x0000 again. Voilà! Alternatively, the MCU might jump to an arbitrary memory cell and act completely strangely, or it might do nothing.
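If you do use function pointers, one cheap safeguard is to initialize every pointer to NULL and to check it before the call. On an AVR, a NULL function pointer is address 0x0000, so calling it unchecked leads directly to an apparent reboot. The following is a minimal sketch of such a check; callIfSet is a hypothetical helper, and the counter stands in for the Serial output of the example above. Note that the check cannot detect a pointer that holds a garbage value.

```cpp
#include <stddef.h>

typedef void (*func_t)(void);

// Hypothetical stand-in for the Serial.println call in sub().
static int goodbye_calls = 0;

void sub(void)
{
  goodbye_calls++;
}

// Calls f only if it has been set. Returns false if f is NULL.
// This helps only if every function pointer is initialized to NULL.
bool callIfSet(func_t f)
{
  if (f == NULL) return false;
  f();
  return true;
}
```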

Now, it is very unlikely that you used an indirect function call. This is probably the first time you have heard about indirect function calls at all. And I would not recommend using them in embedded systems. However, a jump to an arbitrary address in the flash address space can also have other causes.

Stack Overflow

The most likely reason for an apparent reboot is a stack overflow. The stack is a data area that is used for storing local variables and return addresses. On AVRs, it starts at the top of the data area and grows downwards towards the beginning of the data area.

When making a function call or when invoking an interrupt service routine, the return address is pushed onto the stack. At the start of each function, registers are saved and an area for local variables is allocated. This is all freed when returning from the function. As a last action in each function, the instruction RET is executed, which pops the return address from the stack and loads it into the program counter. And this is precisely the moment when our MCU could be sent on a wild goose chase. When the stack has become too large, then the return address may have been overwritten or not stored at all, and we have the same situation as with a bad indirect function call.

How can you find out whether such a thing has happened? This is really difficult. You could define a function freeRam that measures how much of the data area is still free and call it in every function, as in the following sketch, which also demonstrates the stack overrun phenomenon.

void setup(void)
{
  Serial.begin(19200);
  fun();
}

void loop(void) { }

void fun(void)
{
  Serial.println(freeRam());
  delay(50);
  fun();
  Serial.println(F("Returning"));
}

int freeRam(void)
{
  extern unsigned int __heap_start;
  extern void *__brkval;
  int free_memory;
  int stack_here;

  if (__brkval == 0) free_memory = (int) &stack_here - (int) &__heap_start;
  else free_memory = (int) &stack_here - (int) __brkval; 
  return free_memory;
}

The function fun called in the setup routine does nothing more than print the remaining amount of stack space and then calls itself again. After a while, the stack will overflow and then the MCU crashes, i.e., becomes unresponsive, or it restarts. So, you could use freeRam to monitor the remaining amount of free memory and report if not enough is left.
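One way to act on these measurements is to remember the lowest value seen so far and to compare each reading against a safety margin. The following is a minimal sketch of that idea; the RamMonitor type is made up for this example, and a sensible threshold depends on your sketch.

```cpp
// Hypothetical helper that tracks the low-water mark of free RAM.
struct RamMonitor {
  int threshold;  // minimum number of free bytes considered safe (assumption: tune per sketch)
  int lowest;     // smallest reading seen so far

  // Records a reading. Returns true as long as at least
  // `threshold` bytes are free.
  bool update(int freeBytes)
  {
    if (freeBytes < lowest) lowest = freeBytes;
    return freeBytes >= threshold;
  }
};
```

In a sketch, one could declare RamMonitor monitor = { 100, 32767 }; globally, call monitor.update(freeRam()) at the start of deeply nested functions, and print a warning when it returns false. The value of monitor.lowest then tells you how close the sketch has come to running out of memory.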

The best way to avoid a stack overflow altogether is to declare all large data structures globally. By large, I mean everything larger than, say, 20 bytes. Then, the variables taking up a lot of memory are allocated statically, and at the end of the compilation, you get a report of how much space is used up. If a few hundred bytes of memory are still free, and if you are sure that function calls do not use too much space, you are on the safe side. And recursive functions as in the example above are, of course, off limits in an embedded programming context!

Heap Overflow?

The above description is actually only one side of the coin, since the stack boundary is not static. There is another data area, called the heap, that grows from the bottom to the top of the data area (on AVRs). The heap contains all dynamically allocated data structures. These are structures that you explicitly allocate with the new operator or the malloc function, or that are dynamically allocated under the hood, as happens, e.g., with the Arduino String class.

If the heap grows too big, then at some point, the allocation of memory will not work anymore, that is, instead of an address to a freshly allocated piece of memory you get NULL as the result of calling the function new or malloc. Further, we have the problem of a looming stack overflow because the entire space for the stack is eaten up by the heap. So, while the heap will not overflow into the stack by itself, i.e., there is no danger of a heap overflow, subsequent function calls may still lead to a stack overflow much earlier because the free space is taken up by the heap.
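If dynamic allocation cannot be avoided, the least one should do is check the result of every allocation before using it. The following minimal sketch shows the pattern for malloc; duplicateText is a hypothetical helper, not a library function.

```cpp
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: returns a heap-allocated copy of text,
// or NULL if no memory is left. The caller must free() the result.
char *duplicateText(const char *text)
{
  size_t size = strlen(text) + 1;      // include the terminating '\0'
  char *copy = (char *) malloc(size);
  if (copy == NULL) return NULL;       // allocation failed: report it, do not write
  memcpy(copy, text, size);
  return copy;
}
```

A caller has to treat a NULL result as an error, e.g., by skipping the operation and reporting the problem, instead of writing through the pointer.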

My advice is to avoid dynamic memory allocation at all costs in an embedded programming context. It is simply too risky, and one does not have any control over how much memory is allocated. Since there is usually only a very small amount of data memory, e.g., 256 bytes up to a few kilobytes, one needs to plan how to use this memory carefully and cannot rely on the fact that there will be enough memory, as you can on a desktop computer.

Buffer Overrun

While stack overflows happen because there is not enough memory, buffer overruns are explicit program errors caused by writing beyond the limits of an array, as in the next example.

void setup(void)
{
  Serial.begin(19200);
  Serial.println("Hello world");
  delay(100);
  fun();
  Serial.println("Goodbye");
}

void loop(void) { }

void fun(void)
{
  char buf[10];

  for (byte i=0; i < 19; i++) buf[i] = '\0';
  Serial.print(buf);
}

Here, we write to buf[18], although the array has only 10 array cells. What happens is that other data on the stack is overwritten. In particular, the return address is overwritten with zeros, which leads the MCU to jump to 0x0000, and we never see the “Goodbye”! By the way, when playing around with the upper bound in the iteration, different things happen. Such a buffer overrun can, of course, also happen with global variables.

The moral here is to make sure not to write beyond the boundary of an array. So, never ever make an assumption that some input coming from the outside world will always respect the buffer size assumption you made. Always check explicitly that the limits are not violated. Better raise some error if the limits are violated than be forced to hunt down some obscure buffer overrun.
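Such an explicit check can be as simple as comparing the length of the incoming data with the size of the buffer before copying anything. The following is a minimal sketch; copyBounded is a made-up name for this example, not a library function.

```cpp
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copies len bytes from src into dst and appends '\0'.
// Returns false and leaves dst untouched if the data plus the
// terminator would not fit into dstSize bytes.
bool copyBounded(char *dst, size_t dstSize, const char *src, size_t len)
{
  if (dstSize == 0 || len >= dstSize) return false;  // would overrun the buffer
  memcpy(dst, src, len);
  dst[len] = '\0';
  return true;
}
```

With a buffer declared as char buf[10], a call such as copyBounded(buf, sizeof buf, input, inputLength) accepts at most 9 characters and rejects anything longer, so the caller can raise an error instead of corrupting the stack.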

Or is it the Hardware?

Stack overflows and buffer overruns are hard enough to locate. So, before one tries to hunt down such an error, it is a good idea to first check that the hardware is not the reason for the restarts. In particular, missing 100 nF decoupling capacitors close to the MCU have been seen to lead to unstable behavior. Similarly, one should make sure that the supply voltage is high enough and stable.

Summary

Spontaneous apparent reboots as well as occasional crashes of an MCU point to serious underlying problems. The most probable root causes on the software side are:

bad interrupts, which are usually easy to diagnose, because the compiler issues a warning if an ISR name is misspelled;
bad indirect function calls, which fortunately are not very frequent because they are not often used (for good reasons) in embedded programming;
stack overflows, which are hard to diagnose; avoid them by allocating all big data structures globally, by not using dynamic allocation of data structures, and by making sure that there is enough space left on the stack;
buffer overruns, i.e., assigning values to array cells that are beyond the boundary of the array.

So, when reboots or crashes happen, look for these kinds of root causes first (after having made sure that the hardware is working).