
Solving Real-Time Problems With Multicore Microcontrollers

14th May 2013
ES Admin
This article from ES Design magazine explores solutions for the real-time problems that can occur with multicore microcontrollers. By Ali Dixon, Director, Product Marketing and co-Founder, XMOS.
Concurrency and multicore processing are familiar concepts in most aspects of digital electronics today. In microprocessors in particular, the use of multicore (or at least ‘manycore’) devices has become increasingly prevalent. This shift dates back to the turn of the century, when it became evident that the techniques which had, until then, driven rapid increases in processor performance were becoming ineffective.

At that time, practical clock speeds topped out in the low gigahertz range and power consumption became a real issue. Meanwhile the use of superscalar technology — which in effect parallelises tasks at the instruction level by breaking down large instructions and executing them on different units within the execution core — also ran out of steam.

Many of the advanced techniques employed in microprocessors represented at best a mixed blessing when transferred into the world of embedded design. Superscalar instruction-level parallelism, for example, spawned devices that attempted to keep all of their execution units busy at all times, by executing instructions out-of-order, and attempting to predict the results of branch instructions.

While on one level this approach to multicore and parallelism may seem resource-efficient, the penalty is that the timing of code execution becomes increasingly unpredictable. In fact, for systems that run control loops, and for tasks like signal processing in which execution is highly data-dependent, such techniques are often worse than useless.

As a result, embedded designers — and particularly designers of systems with real-time requirements — turned to other forms of multicore processing to fulfil their needs. In many cases this has meant the use of hardware, in the form of an ASIC or FPGA. These may not at first sight seem like ‘multicore’ devices, but like the superscalar processor they effectively integrate many task engines on a single chip — albeit in the form of hardware. FPGAs, for example, are frequently used to implement multiple concurrent state machines.

But these approaches have drawbacks of their own: a lack of flexibility (an ASIC, in particular, is not programmable in the traditional sense); high non-recurring engineering (NRE) costs, which translate to poor cost-effectiveness at low to medium volumes; and, in the case of FPGAs, high unit cost and power consumption.

In practice many embedded designers resort to an even more fundamental multicore strategy: they partition their system between several chips. It is not uncommon to see a design that includes a microcontroller (MCU), a DSP and an FPGA. Taking this approach means that the processes on each device are separated, and cannot interfere with each other unless they need to interact. This increases predictability. The penalty is not only an increase in system complexity (including communication between the various system blocks) and therefore bill of materials cost: such a design also requires programming the MCU in C, the DSP in a separate design environment (probably using assembly code) and the FPGA in RTL.

However, whenever an embedded system needs to respond predictably, or in real time, the need for parallelism is not far away. A common solution is to employ a single processor resource, with a real-time operating system (RTOS) to divide and conquer the task. The RTOS schedules and manages critical processes into the processor resource, but makes it ‘look’ to each individual process as if it has exclusive use of the processor.

Again there are drawbacks: embedded designs are typically memory-constrained, and the RTOS itself consumes valuable memory, while its scheduling overhead drains processing power from the task at hand.

Another common alternative is to implement an interrupt-based system to manage task partitioning. The first issue with this approach is that interrupts are asynchronous, so the state of the processor at the time of the interrupt is unknown; the programmer must take care to restore the system to its previous state once the interrupt has been handled. Second, an incoming interrupt inherently delays whatever task it is interrupting, which in a real-time system may be unacceptable. The most common solution is to over-engineer the system to ensure that timings are met.

The XMOS xCORE architecture seeks to combine the advantages of an MCU equipped with an RTOS with the benefits of multicore concurrency. Each device can be viewed as a network of 32-bit logical microcontroller cores interconnected by a built-in communication network. Almost all instructions execute in a single cycle (the exceptions are the divide instructions and those that block to input data or synchronise with other tasks), so program execution is itself timing-deterministic. On-chip hardware connects I/O ports directly to logical cores, dramatically reducing the latency of response to external events and eliminating the need for interrupts — the single largest source of unpredictability in embedded design. Devices include timing and scheduling hardware, implementing the functions of an RTOS on-chip.


Figure 1: xCORE devices are constructed from tiles, subdivided into a set of logical cores. This illustration shows the recently-announced 10-core XS1-L10-128

Although the architecture itself is very different, the development model will be very familiar to any programmer. An xCORE device is programmed in a version of C with extensions to handle the multicore aspects of the design. DSP instructions are included, eliminating the need for multiple design environments. And the low-latency I/O and single-cycle execution mean that xCORE programs behave so predictably that they can actually be simulated like a hardware solution.

Time Slicing

An xCORE device is constructed from one or more tiles. The tile includes processor resource, memory, interconnect, and hardware response ports connecting the processor resource with the outside world. There is also a scheduling and timing block, which manages the subdivision of the tile resources into four, five, six or eight logical cores. This is effectively achieved by apportioning each logical core a guaranteed portion of the processing resource, time-sliced in sequence.

The logical cores on a tile share a single memory, with unconstrained read and write access. Because there are no caches, execution is highly predictable; the worst-case execution time of a straight section of code is virtually the same as the best-case execution time. Data-dependency may introduce uncertainty, but that is implicit in the algorithm: the overall effect is that designs can be engineered efficiently and economically, without large margins.

This innovative approach to microcontroller design often allows developers to combine the functions of several devices into a single xCORE chip. Meridian Audio, for example, used xCORE to replace a microcontroller, DSP and an FPGA, as well as two external devices, with a substantial bill of materials saving.

It is not only the multicore processing approach itself that brings benefits. The on-chip Hardware-Response ports allow xCORE devices to respond with extremely low latency (Figure 2). While this approach is inherently more responsive, it also scales better than traditional MCU architectures with increasing numbers of inputs.

The combination of responsiveness and predictable code execution means that xCORE devices can implement, in software, functions that require hardware in other device architectures. This is particularly useful when building communication interfaces; unlike a traditional MCU, an xCORE device can be software-configured with the exact combination of interfaces required for a particular application.


Figure 2: xCORE devices respond 100x faster than traditional MCUs and with a consistent latency independent of the number of inputs

A good illustration of this advantage is the strength of the architecture in real-time digital communications and infotainment. For example, xCORE is already a dominant force in USB audio, where data needs to be delivered predictably to enable the highest audio quality; this has brought design wins at customers including Sennheiser. The architecture also scales well to multi-channel systems: for the emerging Ethernet AVB standards, XMOS offers a complete xCORE-powered multi-channel evaluation kit that allows multiple audio talkers and listeners to be connected together quickly and easily.

xCORE systems are programmed in C with extensions for concurrency, communications, and real-time I/O. XMOS provides an easy-to-use toolchain called xTIMEcomposer based around the LLVM compiler that translates C into xCORE machine code. Both the instruction set and language extensions have been designed to facilitate highly efficient code generation.

The design environment is backed by a range of off-the-shelf xSOFTip soft peripherals that make use of the ‘software-as-hardware’ capabilities of the architecture. As well as high speed USB 2.0 and Ethernet interfaces, these include S/PDIF, I2S, SPI and CAN. In addition, the developer can code any arbitrary interface they require. This makes the architecture particularly suitable for protocol bridging applications, for example XMOS demonstrated CAN connectivity over AVB at the Embedded World exhibition in February 2013.

The range of devices available has already expanded rapidly beyond the original 8-core and 16-core options. As well as system-in-package solutions with integrated USB, XMOS has recently added 6-core, 10-core and 12-core versions. Most recently, the company announced the XS1-L4-64 device, a 4-core variant believed to be the industry’s lowest-cost multicore microcontroller, priced below $2 in volume.

xCORE is already proving the value of multicore processing in embedded applications. The combination of deterministic processing, parallelism and predictable I/O means that the architecture can address many applications that are simply beyond traditional microcontrollers. This, along with the ability to reduce BOM costs by integrating DSP, control and interfacing functions in a single device, means that an increasing number of designs in industrial, consumer and automotive applications will make use of this innovative approach to embedded systems design.
