Featured image: Courtesy of the Naval Surface Warfare Center, Dahlgren, VA., 1988. – U.S. Naval Historical Center Online Library Photograph NH 96566-KN
Be honest: When did you last time designed a system that worked right from the start and did not need any corrections? Right, that was probably a long time ago – or never. People usually spend a lot of time in identifying and correcting errors, colloquially called bugs. In this blog post, I will give you an overview of different forms of bugs and what you can do about them.
Failures, faults, and mistakes
The informal term bug refers to something that went wrong. However, let us be a bit more specific and use IEEE standard terminology to distinguish between conceptually different error types. First, there is the observation that your system does not behave as you expected it to do. It gives a wrong result, it does not respond to an input, it acts catastrophically, or something similar. This is called a failure of the system. Failures are caused by one or more faults in the program (or the hardware), i.e., some program code that should have been written differently. For instance, at one point, one should have used subtraction instead of addition. The fault in turn is caused by a human mistake, which might be that the programmer mistyped something or that the ideas of the programmer about the program logic are wrong. The terms bug and error can refer to any of these three phenomena.
How do bugs materialize?
First, let us try to categorize programming errors along the dimension of how they materialize, i.e., what kind of failures you could expect:
- Compilation (or syntax) errors, i.e., errors that are detected and flagged by the compiler (or the linker).
- Runtime errors, i.e., errors that show up only at runtime and that usually result in the termination or unresponsiveness of the program.
- Semantic (or logic) errors, which materialize as unintended behavior of the program (but do not result in a termination, crash, or hang).
The easiest kind of error is the first one, where the compiler tells you what is wrong. You just have to correct a typo, declare a variable, or change the type of a variable. This kind of error is often also called syntax error but can involve much more than mere syntax, in particular when it comes to typing. Errors detected by the linker I would also consider to fall in this class. Here we notice a typo or a forgotten declaration not on the level of a source code file, but only when trying to link everything together.
The second kind of error is much more nasty, in particular when it comes to embedded systems. They can be caused by, e.g., an arithmetic overflow, an array index out of bound, an uninitialized variable, a dangling pointer, a stack overflow, a heap overflow, or a wrong condition in a while loop. On desktop or laptop computers, such errors usually lead to an error message and the termination of the program. Alternatively, the program may “hang”, i.e., does not respond to anything, either because it is in an infinite loop or it waits for an input that never comes. In contrast, on an embedded system, e.g., an Arduino, the program never stops or gives you a runtime error message. Instead, it will do strange things such as restarting, it may become completely unresponsive, or might do completely unpredictable things. The reason is that the small MCUs do not monitor memory access and do not check whether a program or data address is “legal.” For instance, if a stack overflow happens, then when returning from a subroutine, the MCU might retrieve a wrong return address leading to an address outside the address space of the program memory, which will lead, in the case of an AVR chip, to running through the entire address space until address 0 is hit, which contains a jump to the reset address, and the system probably restarts. This all means that on embedded systems it may be hard to detect such an error and even harder to diagnose it.
The third kind of error, the semantic one, is the worst one, because it is hard to catch, in particular with embedded systems. Such failures can be caused by programmer mistakes where two symbols were mixed up or by very subtle misunderstandings of the meaning of programming constructs or the intended semantics of functions, perhaps written by somebody else.
Assume you have a good idea of what your (sub-)system is supposed to do (you have a specification of its required behavior). And you have written a program that compiles and never crashes or hangs. In fact, it does all that your specification says. It may nevertheless be incorrect because your specification is already wrong, e.g., it makes wrong assumptions about the physical environment or sub-systems it has to communicate with.
One striking example of such an error is the crash of the Mars Climate Orbiter in 1999. The sensor system delivered its measurements in imperial units, while the overall system expected the measurements to be in metric units. The orbiter crashed and more than $300,000,000 were lost. So, here everything worked perfectly, there was only the misunderstanding about what a measurement operation really meant, i.e., what the semantics of such an operation were.
The difference between the second and third kinds of error is that you have to look for different causes. In case of a runtime error, most often you have forgotten to initialize a variable, or you specified the wrong operation. That means you have to identify the place in the program where the code does something violating your idea of what the program is supposed to do. In case of a semantic error, it could be the same. However, it could also be that you first have to understand why your idea of what the program is supposed to do is wrong. Hunting for the bug using debugging tools (which will be described in an upcoming blog post) will probably be similar, the correction might be quite different though.
What is the root cause?
The root cause for failures is not necessarily an error in the program logic. In embedded systems, there could be many different causes. In order to keep things simple, we again will just consider four very broad classes of errors:
- Timing-independent errors happen regardless of how fast or slow the program is executed.
- Timing-dependent errors happen only under particular timing conditions.
- Event-ordering-related errors happen only when events are ordered in a particular way.
- Hardware-related errors, i.e., errors that are (mainly) caused by (faulty?) hardware design.
If you can reproduce an erroneous behavior regardless of how slow your MCU runs, I would consider this a timing-independent bug. Most probably, there is a lower limit to how slow you can go, because your MCU never exists in isolation, and it has to react to external signals somehow. However, you most probably have an error that can be detected without taking timing considerations into account, which makes it easy to pinpoint the place in the program that should be blamed. Using classical debugging methods (which I will describe in an upcoming blog article), you should be able to identify the error.
Most often in the development of embedded systems, however, timing is very important and the manifestation of erroneous behavior depends on how fast your MCU runs (or how many cycles you put between important operations), i.e. you have a timing-dependent bug. In order to avoid such errors, one usually tries to identify time-critical code beforehand and develops and tests it separately. So, in particular, communication routines and interrupt services are developed and tested in isolation. Only, if you are sure that they work, you use them in your system. In the Arduino universe, all the official libraries promise implicitly to be usable in the overall system without causing trouble. But you never know! And you never had a formal specification in the beginning. For this reason, you should always consider timing-dependent errors caused by an imported library as a possibility.
A classical source of problems is that additional interrupt service routines (ISRs) can take away essential MCU cycles. For instance, if you have a bit-banging communication routine, you might lose synchronization if an interrupt handler needs too many compute cycles. Sometimes, with a slower communication rate, such issues could be avoided. Another potential way to solve the problem is to disallow interrupts while the bit-banging routine is active. In any case, debugging such time-critical code is usually not done using a classic debugger, but you have to use a logic analyzer or an oscilloscope.
However, time-dependent bugs might also show up on the general level, where you do not expect any timing dependencies. When I recently disabled the printing of debug information after everything seemed to work, the program without printing the debug information did not work at all. In other words, this is a good example of what people call a Heisenbug, a bug that disappears when you try to observe it.
Related to the time-dependent bugs are those that depend on the order of events, and event-ordering-related errors. These are errors that manifest themselves only if external and internal events are ordered in a particular way. One example could be that the main program reads a particular global integer variable and an interrupt service routine changes this variable. In this case, the variable has to be declared volatile, i.e., accessible by an ISR and the main program at any point in time. It can happen that in the main program, the first byte of the variable is loaded into the MCU, then an interrupt occurs that changes the values, and after that, the second byte is loaded into the MCU. In this case, the loaded value is garbage, since the first and the second byte do not have anything to do with each other. This error happens only if the interrupt happens in the middle of loading the value. The solution is to use the ATOMIC_BLOCK macro. However, catching such an error in the first place is very difficult because it will not happen very often. So one better try to avoid them at all costs! There are, of course, other possible such errors, e.g., when interrupts happen close to each other and one interrupt is served too late.
Finally, there are hardware-related bugs, which we will have a look at in the next blog post of this series.
Leave a Reply