The importance of runtime observability for multi-threaded software verification
By Dr. Johan Kraft, CTO, Percepio AB

Most embedded systems today are powered by multi-threaded software, e.g., running on Linux or a real-time operating system (RTOS). While multi-threading has many advantages over single-threaded designs, it can make the software more complex and the verification more challenging.
Why is multi-threaded software different?
Traditional verification methods like code review, static analysis and functional testing are necessary but not sufficient to fully verify multi-threaded embedded software. This is because the threads may interact in ways that are not apparent in the source code. For example, operations on IPC objects like semaphores and mutexes may suddenly block execution, depending on the actions of other threads. Threads may also compete for processor time and exhibit timing-dependent bugs like race conditions.
In this way, multi-threaded systems form an intricate web of dependencies between threads, caused by both explicit and implicit thread interactions. Moreover, such dependencies can be affected by variations in software execution time, which are not explicit in the code but rather an emergent behaviour at runtime. This is especially problematic for multi-core systems, where bus conflicts between cores may affect execution times in unpredictable ways. Timing effects on thread interactions are often an uncontrolled factor in system testing of multi-threaded systems.
In some cases, vast amounts of testing might only skim the surface of an ocean of potential execution scenarios. Various latent defects may then remain undetected, only to surface in deployment. They can be next to impossible to replicate in the lab.
NASA’s first Mars rover, Pathfinder, is a good example. I bet this system was tested far more extensively than most embedded software, yet several failures occurred during the mission due to a priority inversion problem that caused watchdog resets. You can find the details in a paper by Glenn E. Reeves called “What Really Happened on Mars?” It is a good read, both for the explanation of the issue and for how they finally found the problem – using software tracing.
Improving testability of multi-threaded software
To improve the quality of multi-threaded software, just increasing the amount of testing doesn’t help much. First you need to ensure that the software is testable, i.e., that the same input always produces the same result at system level, regardless of the software timing. The best way to accomplish that is to follow best practices for multi-threaded software design. This is a broad topic, but some examples off the top of my head are:
• For periodic threads, use a sound method for assigning RTOS scheduling priorities, such as rate-monotonic scheduling where the most frequent task is assigned the highest scheduling priority
• Keep interrupt handlers as short and deterministic as possible. Instead of running all the event-handling code within the interrupt handler, delegate jobs to low-priority threads
• Code with a long execution time, or large variations in execution time, should run at as low a scheduling priority as possible, ideally the lowest
• Shared resources must be accessed in a safe way, without risk of race conditions, priority inversion or deadlock. It is typically recommended to protect critical sections using mutex objects that support priority inheritance
By ensuring a solid software design with respect to multi-threading, you can achieve a more stable, deterministic system behaviour, which improves the test coverage at system level without increasing the test effort. This minimises the risk of elusive bugs remaining in your production code.
Verifying real-time requirements
Embedded software is often real-time software, with more or less explicit requirements on the software timing. For example, a control system might be required to output control signals to a motor controller every five milliseconds, and any additional delay is considered a failure.
Such requirements are affected not only by the execution time of the specific thread, but also by its dependencies on other threads. For example, a higher-priority thread or an interrupt may delay the execution more than expected. If the thread uses shared resources (e.g., a mutex), the outcome of the access may depend on the relative timing with other threads, which in turn may depend on yet other factors.
Thus, verifying real-time requirements is not only about measuring timing metrics, but also about identifying potential risks from thread interactions that may affect the timing requirements.
Analysing multi-threaded software
But how do you verify a software system in this way? How do you know if the design is good from a multi-threading perspective? How do you measure software timing and verify real-time requirements?
The de facto solution is runtime observability through software tracing. This is provided by Percepio Tracealyzer, which offers a large set of visual analysis features. While often used for system-level debugging of RTOS and Linux systems, it also offers a lot of functionality for software design analysis and for verification of real-time requirements.
By adopting system tracing with Tracealyzer, developers can:
• Collect detailed runtime data on thread execution, thread interactions and timing over long test runs, without needing any specialised hardware for this purpose
• Find anomalies in the real-time behaviour using various high-level overviews, such as the CPU load graph or the statistics report, then simply click an abnormal data point to see the details in the execution trace view
• Analyse software timing variations, for example using the Actor Instance Graph, which plots various timing metrics for each thread over time
• Overview the thread dependencies, for example using the Communication Flow graph, which provides a visual overview of thread interactions through IPC objects
Tracealyzer does not require any particular hardware support but relies on efficient software instrumentation in the target software. This leverages existing trace points in the kernel, so you don’t need to add any instrumentation to your application code. You may however extend the tracing by adding explicit tracing calls in your application code.
The trace data can be transferred to the host computer in various ways, for example by real-time streaming over an Ethernet connection or via a supported debug probe.