Adaptive acceleration to bring AI from the cloud to the edge

12th November 2018

Xilinx

Alex Lynn

Emerging applications for AI will depend on System-on-Chip devices with configurable acceleration to satisfy increasingly tough performance and efficiency demands.

Author: Dale Hitt, Director Strategic Marketing Development at Xilinx

As applications such as smart security, robotics, and autonomous driving rely increasingly on embedded Artificial Intelligence (AI) to improve performance and deliver new user experiences, inference engines hosted on traditional compute platforms can struggle to meet real-world demands within tightening constraints on power, latency, and physical size. These platforms have rigidly defined inferencing precision, bus widths, and memory, which cannot easily be adapted to optimise speed, efficiency, and silicon area. An adaptable compute platform is needed to meet the demands placed on embedded AI running state-of-the-art convolutional neural networks (CNNs).

Looking further ahead, the flexibility to adapt to more advanced neural networks is a prime concern. CNNs that are popular today are being superseded by new state-of-the-art architectures at an accelerating pace. A traditional SoC, however, must be designed around the neural network architectures known when development starts, yet is typically deployed about three years later. New types of neural networks such as RNNs or Capsule Networks are likely to render traditional SoCs inefficient and incapable of delivering the performance required to remain competitive.

If embedded AI is to satisfy end-user expectations and, perhaps more importantly, keep pace as demands continue to evolve in the foreseeable future, a more flexible and adaptive compute platform is needed. This could be achieved by taking advantage of user-configurable multiprocessor System-on-Chip (MPSoC) devices that integrate the main application processor with a scalable programmable logic fabric containing a configurable memory architecture and signal processing suitable for variable-precision inferencing.

Inferencing precision

In conventional SoCs, performance-defining features such as the memory structure and compute precision are fixed. The minimum precision is often eight bits, defined by the core CPU, although the optimum precision for any given algorithm may be lower. An MPSoC allows the programmable logic to be optimised right down to the level of individual logic cells, giving freedom to reduce the inferencing precision to as little as 1 bit if necessary. These devices also contain many thousands of configurable DSP slices to handle multiply-accumulate (MAC) computations efficiently.
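As an illustration of variable-precision inferencing, the following Python sketch quantises the same weights and activations to 8, 4, and 1 bits and runs a multiply-accumulate over each version. It is a software model only and does not represent how DSP slices or the logic fabric are programmed. The function names `quantize` and `mac`, the sample values, and the choice of symmetric scaling (with sign binarisation at 1 bit) are assumptions made for this example.

```python
def quantize(values, bits, scale):
    """Map real values to signed integers that fit in `bits` bits.

    Assumption of this sketch: at 1 bit the values are binarised to -1 or +1
    and `scale` is ignored; otherwise values are divided by `scale`, rounded,
    and clamped to the signed range of the chosen width.
    """
    if bits == 1:
        return [1 if v >= 0 else -1 for v in values]
    q_min = -(1 << (bits - 1))
    q_max = (1 << (bits - 1)) - 1
    return [max(q_min, min(q_max, round(v / scale))) for v in values]


def mac(weights, activations):
    """Multiply-accumulate: the sum of element-wise products."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a
    return acc


# Hypothetical sample data, all within [-1, 1].
weights = [0.5, -0.25, 0.75, -1.0]
activations = [1.0, 0.5, -0.5, 0.25]

for bits in (8, 4, 1):
    # Symmetric scale so that 1.0 maps to the largest positive integer.
    scale = 1.0 if bits == 1 else 1.0 / ((1 << (bits - 1)) - 1)
    q_w = quantize(weights, bits, scale)
    q_a = quantize(activations, bits, scale)
    print(bits, q_w, q_a, mac(q_w, q_a))
```

Lower bit widths shrink the integer range each operand can take, which is what allows narrower arithmetic; whether the resulting accuracy is acceptable depends on the network and has to be checked per layer.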

The freedom to optimise the inferencing precision so exactly yields compute efficiency in accordance with a square law: the logic required scales with the square of the bit width, so a single-bit operation executed in a 1-bit core requires only (1/8)² = 1/64 of the logic needed to complete the same operation in an 8-bit core. Moreover, the MPSoC allows the inferencing precision to be optimised differently for each layer of the neural network, delivering the required performance with the maximum possible efficiency.
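The square-law relationship can be worked through with a few lines of Python. This is a back-of-envelope sketch, not a resource estimate for any real device: the function names, the per-layer bit widths, and the simplification that every layer performs the same number of operations are all hypothetical.

```python
from fractions import Fraction


def relative_logic(bits, reference_bits=8):
    """Logic for one operation relative to the reference width (square law)."""
    return Fraction(bits, reference_bits) ** 2


def average_relative_logic(layer_bits, reference_bits=8):
    """Mean relative logic across layers, assuming equal work per layer."""
    costs = [relative_logic(b, reference_bits) for b in layer_bits.values()]
    return sum(costs) / len(costs)


# Hypothetical per-layer precision choices for a four-layer network.
layer_bits = {"conv1": 8, "conv2": 4, "conv3": 2, "fc": 1}

for name, bits in layer_bits.items():
    print(name, bits, relative_logic(bits))

print("1-bit vs 8-bit:", relative_logic(1))          # 1/64, as in the text
print("network average:", average_relative_logic(layer_bits))
```

Using `Fraction` keeps the ratios exact, so the 1/64 figure appears directly rather than as a rounded decimal.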

Memory architecture
As well as improving compute efficiency by varying inferencing precision, configuring both the bandwidth and structure of programmable on-chip memories can further enhance the performance and efficiency of embedded AI. A customised MPSoC can have more than four times the on-chip memory, and six times the memory-interface bandwidth, of a conventional compute platform running the same inference engine. The configurability of the memory allows users to reduce bottlenecks and optimise utilisation of the chip’s resources. In addition, a conventional platform typically has only limited cache integrated on-chip and must interact frequently with off-chip storage, which adds to latency and power consumption. In an MPSoC, most memory exchanges can occur on-chip, which is not only faster but also saves over 99% of the power consumed by off-chip memory interactions.
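To see how the on-chip saving affects overall memory energy, the sketch below blends on-chip and off-chip accesses in a simple linear model. The only figure taken from the text is the saving of over 99%, modelled here as an on-chip access costing at most 1% of an off-chip one; the function name, the model itself, and the example access mixes are assumptions for illustration.

```python
def relative_memory_energy(on_chip_fraction, on_chip_cost=0.01):
    """Memory energy relative to serving every access off-chip (= 1.0).

    on_chip_fraction: share of accesses served from on-chip memory (0 to 1).
    on_chip_cost: energy of an on-chip access relative to an off-chip one;
                  0.01 reflects the "over 99%" saving as an upper bound.
    """
    if not 0.0 <= on_chip_fraction <= 1.0:
        raise ValueError("on_chip_fraction must be between 0 and 1")
    off_chip_fraction = 1.0 - on_chip_fraction
    return on_chip_fraction * on_chip_cost + off_chip_fraction * 1.0


# Hypothetical access mixes.
for fraction in (0.0, 0.5, 0.9, 1.0):
    print(fraction, relative_memory_energy(fraction))
```

Under this model the remaining off-chip traffic quickly dominates the total, which is why keeping most exchanges on-chip matters more than making the on-chip accesses themselves cheaper.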

Silicon area
Solution size is also becoming an increasingly important consideration, especially for mobile AI on board drones, robots, or autonomous vehicles. The inference engine implemented in the FPGA fabric of an MPSoC can occupy as little as one-eighth of the silicon area of a conventional SoC, allowing developers to build more powerful engines within smaller devices.

Moreover, MPSoC device families offer designers a range of devices, so the inference engine can be implemented in the most power-, cost-, and size-efficient part capable of meeting system performance requirements. There are also automotive-qualified parts with hardware functional-safety features certified according to industry-standard ISO 26262 ASIL-C safety specifications, which is essential for autonomous-driving applications. An example is Xilinx’s Automotive XA Zynq UltraScale+ family, which contains a 64-bit quad-core ARM Cortex-A53 and dual-core ARM Cortex-R5 based processing system alongside the scalable programmable logic fabric, giving the opportunity to consolidate control processing, machine-learning algorithms, and safety circuits with fault tolerance in a single chip.

Today, an embedded inference engine can be implemented in a single MPSoC device and consume as little as 2 W, which is a suitable power budget for applications such as mobile robotics or autonomous driving. Conventional compute platforms cannot run real-time CNN applications at these power levels even now, and are unlikely to be able to satisfy the increasingly stringent demands for faster response and more sophisticated functionality within more challenging power constraints in the future. Platforms based on programmable MPSoCs can also provide greater compute performance, increased efficiency, and size/weight advantages at power levels above 15 W.

The advantages of such a configurable, multi-parallel compute architecture would be of academic interest only, were developers unable to apply them easily in their own projects. Success depends on suitable tools to help developers optimise the implementation of their target inference engine. To meet this need, Xilinx continues to extend its ecosystem of development tools and machine-learning software stacks, and to work with specialist partners to simplify and accelerate implementation of applications such as computer vision and video surveillance.

Flexibility for the future
Leveraging the MPSoC’s configurability to create an optimal platform for the application at hand also gives AI developers flexibility to keep pace with the rapid evolution of neural network architectures. The potential for the industry to migrate to new types of neural networks represents a significant risk for platform developers. The reconfigurable MPSoC lets developers respond to changes in the way neural networks are architected, by reconfiguring the device to build the most efficient processing engine using any contemporary state-of-the-art strategy.

More and more, AI is being embedded in equipment such as industrial controls, medical devices, security systems, robotics, and autonomous vehicles. Adaptive acceleration that leverages the programmable logic fabric of MPSoC devices holds the key to delivering the responsive and advanced functionality required to remain competitive.