What is the purpose of the C++ qualifier volatile, what does it have to do with race conditions, and what are Heisenbugs?
Motivation
Heisenbugs are software bugs that seem to disappear when you try to locate them with a debugger (or other means of observation). The name is a pun on Werner Heisenberg, the physicist whose uncertainty principle is popularly associated with the observer effect of quantum mechanics: observing a system changes its behavior.
The variable qualifier volatile informs the compiler that it should treat a variable as something that can be changed by forces outside its view, for example by an interrupt routine or by hardware. If you omit this keyword where it is needed, the compiler may apply optimizations that rest on overly optimistic assumptions, leading to wrong behavior. Because the volatile qualifier influences how the compiler optimizes the code, a missing one is an ideal candidate for provoking a Heisenbug.
In any case, even after fixing your program by adding the volatile qualifier at the right place, you are not necessarily safe. Code that accesses volatile variables can still suffer from race conditions. These are situations in which two code paths execute asynchronously to each other and the actual execution order determines the final outcome, and not all of these outcomes may be intended. Bugs caused by race conditions are tricky to catch. However, you can prevent race conditions by using the programming concept of atomic blocks.
When to use the volatile qualifier
Let us look at an example that will clarify the notions of Heisenbug and volatility. The example is similar to the famous blink sketch but contains a part in the setup function where it waits for a button press before proceeding. To make things more complicated, we implement that using an external interrupt.
bool ready;

void setup() {
  pinMode(2, INPUT_PULLUP); // button
  pinMode(4, OUTPUT); // button GND
  pinMode(LED_BUILTIN, OUTPUT);
  attachInterrupt(digitalPinToInterrupt(2), button, LOW);
  ready = false; // start condition
  while (!ready) { // wait for button pressed
    delay(10);
  }
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  delay(1000);
  digitalWrite(LED_BUILTIN, LOW);
  delay(1000);
}

void button(void) {
  detachInterrupt(digitalPinToInterrupt(2));
  ready = true;
}
Lines 4 to 6 initialize the GPIOs, line 7 associates an interrupt with pressing the button on pin 2, and line 8 states that we are not ready yet. In lines 9-10, we wait for ready to become true, which can happen only when the button is pressed and the interrupt on pin 2 is raised. Inside the interrupt routine button (starting at line 21), the ready variable is set to true (line 23).
Compiling and uploading the sketch to an Arduino UNO leads to the observation that pressing the button does not have any effect. OK, time for debugging. Using the debug-enabled MiniCore and connecting the UNO board to the dw-link hardware debugger, we start the Arduino IDE 2 debugger.

After enabling Optimize for Debugging in the Sketch menu, recompiling the sketch, and starting the Arduino IDE 2 debugger, we break at line 4, the first line of setup. This is a preset of the Arduino debugger.
Now, we set a watch on the variable ready in the Watch pane and place breakpoints at line 9 (checking the ready variable), at line 12, the last line of setup, and at line 24, the last line of the interrupt routine.
After starting execution, we end up at line 9. When we now press the button and click continue in the debug control pane, we break at line 24, at which point the variable ready is shown to have the value true. Continuing twice leads us to line 12. So, everything worked correctly. In other words, by debugging the problem, we made the bug disappear. This is definitely a Heisenbug!
For this reason, let us try to restore the original conditions under which the bug appeared and disable Optimize for Debugging. Then, we recompile the sketch and start debugging again (with breakpoints still at lines 9, 12, and 24). This time, we initially break in the file main.cpp at line 43, where the function setup is called. The reason is that the compiler did some function inlining. This is a bit confusing, but it will not stop us from hunting down the bug.
Pressing the button and clicking continue leads to a break in the button function at line 23. After a single step, the value of the variable ready is shown to be true. When we now click continue, execution stops at line 10 again, and the value of the variable ready is still displayed as true. This does not change if we click continue several times. In other words, something is seriously wrong here. The test in the while loop is obviously not happening at all.
As a matter of fact, when looking at the assembly code listing (generated by using Export Compiled Binary in the Sketch menu), one notices that the while loop does not contain any test!
/Users/.../volatile.ino:8
attachInterrupt(digitalPinToInterrupt(2), button, LOW);
ready = false; // start condition
468: 10 92 04 01 sts 0x0104, r1 ; 0x800104 <__data_end>
/Users/.../volatile.ino:10
while (!ready) { // wait for button pressed
delay(10);
46c: 6a e0 ldi r22, 0x0A ; 10
46e: 70 e0 ldi r23, 0x00 ; 0
470: 80 e0 ldi r24, 0x00 ; 0
472: 90 e0 ldi r25, 0x00 ; 0
474: 0e 94 1e 01 call 0x23c ; 0x23c <delay>
478: f9 cf rjmp .-14 ; 0x46c <__LOCK_REGION_LENGTH__+0x6c>
When you look at the source code, it is clear why the compiler decided to optimize the test away. The variable ready is set to false, and then we run a loop waiting for ready to become true. Taking only the local context into account, it is, of course, superfluous to check the value of ready inside the loop: it is set to false right before the while loop, and nothing inside the loop changes it.
So, this is where the qualifier volatile comes into play. By prefixing the definition of ready with this keyword, we tell the compiler that it must never optimize away assignments to and tests of this variable, or keep its value in a register only. Each assignment to the variable is implemented by storing the new value to random access memory, and for each test, the value is loaded from random access memory. And indeed, now everything works as expected.
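For the sketch above, the fix amounts to changing the first line to volatile bool ready;. The following is a minimal host-side model of that pattern, not an Arduino sketch, so that it can be compiled and run on a PC. The helper waitForButton and its parameters maxPolls and pressAt are hypothetical names introduced only for illustration, and the interrupt is imitated by an ordinary function call.

```cpp
// Host-side model (assumption: this is not AVR code; the "interrupt" is an
// ordinary function call made at a chosen poll count).

// Shared between the main code and the interrupt routine, hence volatile:
// every test of `ready` below is a real load from memory.
volatile bool ready = false;

// Stands in for the interrupt routine of the sketch.
void button() { ready = true; }

// Hypothetical helper: polls `ready` at most maxPolls times and imitates a
// button press just before poll number pressAt. Returns the index of the
// poll that saw the flag, or -1 if the flag was never seen.
int waitForButton(int maxPolls, int pressAt) {
  ready = false;                      // start condition
  for (int poll = 0; poll < maxPolls; ++poll) {
    if (poll == pressAt) button();    // imitated interrupt
    if (ready) return poll;           // the test is performed on every pass
  }
  return -1;
}
```

Calling waitForButton(10, 3) returns 3, because the press is seen on the very poll before which it happened. On a PC the compiler can see the call to button, so the model only shows the structure of the polling loop; the qualifier becomes essential when the write happens in a real interrupt routine that the compiler cannot see from inside the loop.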
Race conditions
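Let us consider a slight variation of the above example, where we have a variable wait that is counted up. With that, we leave the setup function either when the button is pressed or after roughly 25 seconds (the byte counter wraps around to zero after 255 increments, each followed by a delay of 100 ms).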
volatile byte wait;

void setup() {
  pinMode(2, INPUT_PULLUP); // button
  pinMode(4, OUTPUT); // button GND
  pinMode(LED_BUILTIN, OUTPUT);
  attachInterrupt(digitalPinToInterrupt(2), button, LOW);
  wait = 1; // start condition
  while (wait) { // wait for button pressed
    wait++;
    delay(100);
  }
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);
  delay(1000);
  digitalWrite(LED_BUILTIN, LOW);
  delay(1000);
}

void button(void) {
  detachInterrupt(digitalPinToInterrupt(2));
  wait = 0;
}
Since we used the qualifier volatile, the compiler will be cautious about optimizing things away. And indeed, the sketch seems to work as advertised. Only occasionally does it do the wrong thing: it does not react to the button. But this happens only about once in a thousand times.
Such non-deterministic, time-dependent, rare errors are really annoying and challenging to catch. They are the typical kind of Heisenbugs because even slight variations in the timing might mask them. Nevertheless, we will try to catch this bug using the debugger.
The plan is to systematically try out all places in the while loop and see what happens if the interrupt occurs just before the line is executed. That is, we run several debug sessions, breaking in each run at a different one of the lines 9, 10, 11, and 12. In each case, we press the button and then click continue. Doing so reveals that we get the intended behavior (leaving the while loop) at all lines except line 10.
A closer look shows that this could have been obvious from the beginning. When the button is pressed just before wait++ in line 10 is executed, wait is set to zero by the interrupt routine. However, when the interrupt routine finishes and the program continues, wait is incremented to 1, so the test in line 9 never sees the zero and the while loop is not left. The button press is lost, and the loop only ends once the counter wraps around.
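The experiment with the four breakpoints can be replayed deterministically in a small host-side model. This is only a sketch under stated assumptions: it is not AVR code, the interrupt is imitated by a function call at a chosen point of the first loop iteration, the delay is left out, and the names waitCounter (standing in for wait), PressPoint, and runWaitLoop are hypothetical.

```cpp
#include <cstdint>

// Where the imitated button interrupt fires in the first loop iteration.
enum class PressPoint { Never, BeforeTest, BeforeIncrement, BeforeDelay };

volatile std::uint8_t waitCounter;   // plays the role of `wait`

void button() { waitCounter = 0; }   // imitated interrupt routine

// Runs the wait loop of the sketch and returns the number of completed
// loop iterations, that is, the number of delays that would have elapsed.
int runWaitLoop(PressPoint press) {
  bool pressed = false;
  int iterations = 0;
  waitCounter = 1;                                   // start condition
  while (true) {
    if (!pressed && press == PressPoint::BeforeTest) { button(); pressed = true; }
    if (waitCounter == 0) break;                     // line 9: test
    if (!pressed && press == PressPoint::BeforeIncrement) { button(); pressed = true; }
    waitCounter = waitCounter + 1;                   // line 10: wait++
    if (!pressed && press == PressPoint::BeforeDelay) { button(); pressed = true; }
    ++iterations;                                    // line 11: delay(100) would run here
  }
  return iterations;
}
```

In this model, a press before the test ends the loop after 0 iterations and a press before the delay ends it after 1 iteration. A press just before the increment is overwritten: the loop runs for 256 iterations, which is one more than the 255 iterations it takes when the button is never pressed at all.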
We were actually lucky to find the bug since, in general, the critical place may lie inside a program line (for example, when multi-byte arithmetic is performed). For this reason, it is much more advisable to avoid such potential race conditions by using the appropriate programming construct. In our case, this means putting the test and the write together into a so-called atomic block that cannot be interrupted. The following listing adds an include file and replaces lines 9-12.
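#include <util/atomic.h>
...
  bool done = false;
  while (!done) {
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
      // ATOMIC_BLOCK is itself implemented as a for loop, so a break here
      // would only leave the atomic block, not the enclosing while loop.
      if (wait == 0) done = true;
      else wait++;
    }
    delay(100); // outside the atomic block, so interrupts stay enabled while waiting
  }
...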
With that, the interrupt routine runs either before or after the atomic block. This means that no interrupt routine can change the variable between the point where it is read and the point where it is modified based on the value read.
The macro parameter ATOMIC_RESTORESTATE causes the status register (including the global interrupt enable bit) to be saved before the block starts and restored afterwards. Alternatively, one can use ATOMIC_FORCEON, which is slightly more efficient: it disables interrupts at the beginning of the block and unconditionally enables them at the end. It is therefore only appropriate when interrupts are known to be enabled before the block is entered.
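The difference between the two macro parameters can be made concrete with a small host-side model. This is a sketch under stated assumptions and not how avr-libc implements the macros: a single bool imitates the global interrupt enable bit, and RestoreStateGuard, ForceOnGuard, and the two helper functions are hypothetical names introduced for illustration.

```cpp
// Host-side model (assumption): one bool imitates the global interrupt
// enable bit of the AVR status register.
bool interruptsEnabled = true;

// Models ATOMIC_BLOCK(ATOMIC_RESTORESTATE): remember the bit, disable
// interrupts, and put the remembered value back when the block is left.
struct RestoreStateGuard {
  bool saved;
  RestoreStateGuard() : saved(interruptsEnabled) { interruptsEnabled = false; }
  ~RestoreStateGuard() { interruptsEnabled = saved; }
};

// Models ATOMIC_BLOCK(ATOMIC_FORCEON): disable interrupts and enable them
// unconditionally when the block is left.
struct ForceOnGuard {
  ForceOnGuard() { interruptsEnabled = false; }
  ~ForceOnGuard() { interruptsEnabled = true; }
};

// Each helper enters a block with the interrupt bit in state `before` and
// returns the state of the bit after the block has been left.
bool stateAfterRestoreBlock(bool before) {
  interruptsEnabled = before;
  { RestoreStateGuard guard; /* critical section would go here */ }
  return interruptsEnabled;
}

bool stateAfterForceOnBlock(bool before) {
  interruptsEnabled = before;
  { ForceOnGuard guard; /* critical section would go here */ }
  return interruptsEnabled;
}
```

When interrupts are enabled on entry, both variants leave them enabled. When they are already disabled on entry, for example inside another atomic block, the restoring variant keeps them disabled, whereas the forcing variant switches them on again, which is why it should only be used where the entry state is known.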
Thread-safe programming on classic AVRs
Thread safety means that data can be accessed concurrently by multiple threads without causing unexpected behavior, race conditions, or data corruption. In the AVR context, we do not have many occasions where multiple threads are active. Only interrupt routines can run concurrently with the user sketch. For this reason, it is easy to ensure that one's code is thread-safe:
- If a variable is used in the user sketch and an interrupt routine, mark it as volatile.
- If a variable is used in the user sketch and an interrupt routine, and it is read and then, based on its contents, changed in the user sketch, include these two operations in one atomic block.
- If a multi-byte variable is used in the user sketch and an interrupt routine, then include assignments and tests of this variable happening in the user sketch in atomic blocks.
Not following this advice can easily lead to sporadic, hard-to-locate errors, which might even disappear when one tries to debug them. Note that it is unnecessary to use atomic blocks in the interrupt routine because these are not interruptible (provided the interrupt enable bit is not changed in the interrupt routine). All in all, it should not be too difficult to follow this advice to avoid such problems altogether.
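The rule about multi-byte variables can be illustrated with a host-side model of a torn read. This is a sketch under stated assumptions: a 16-bit counter is stored as two separate bytes, as it would be accessed on an 8-bit AVR, the interrupt is imitated by a function call, and the names ticksLow, ticksHigh, tickInterrupt, and readTicks are hypothetical.

```cpp
#include <cstdint>

// Host-side model (assumption): the two bytes of a 16-bit counter that an
// 8-bit CPU has to read one after the other.
volatile std::uint8_t ticksLow = 0xFF;
volatile std::uint8_t ticksHigh = 0x00;

// Imitated interrupt routine: increments the 16-bit counter.
void tickInterrupt() {
  std::uint16_t value = static_cast<std::uint16_t>((ticksHigh << 8) | ticksLow);
  ++value;
  ticksLow = static_cast<std::uint8_t>(value & 0xFF);
  ticksHigh = static_cast<std::uint8_t>(value >> 8);
}

// Reads the counter byte by byte, starting from the value 0x00FF. If
// interruptBetweenBytes is true, the interrupt fires after the low byte
// has been read, which is what can happen without an atomic block.
std::uint16_t readTicks(bool interruptBetweenBytes) {
  ticksLow = 0xFF;
  ticksHigh = 0x00;
  std::uint8_t low = ticksLow;
  if (interruptBetweenBytes) tickInterrupt();   // counter becomes 0x0100
  std::uint8_t high = ticksHigh;
  return static_cast<std::uint16_t>((high << 8) | low);
}
```

Without the interrupt, readTicks(false) returns 0x00FF. With the interrupt between the two byte reads, readTicks(true) returns 0x01FF, which is neither the old value 0x00FF nor the new value 0x0100. Wrapping the read in an atomic block rules out this interleaving on the real hardware.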