CNN hardware accelerator brings improved performance
Deep learning, and Deep Neural Networks (DNNs) in particular, is currently one of the most intensively studied and widely used predictive modelling approaches in the field of machine learning. DNNs are not a new concept, but after recent breakthrough applications in speech recognition and image recognition, they have returned to the focus of both the academic and industrial communities.
Author: Dr Rastislav Struharik, Associate Professor at UNS, CTO of Kortiq
Today, different types of DNNs are being employed in a wide range of applications, ranging from autonomous driving, medical and industrial applications, to playing complex games. In many of these application domains, DNNs are now able to exceed human performance.
This exceptional performance of DNNs, and in particular Convolutional Neural Networks (CNNs), predominantly arises from their ability to automatically extract high-level features from raw sensory data during the training phase, using large amounts of data, in order to obtain an effective representation of the input space. This approach differs from earlier machine learning attempts that used manually crafted features or rules designed by experts. As a result, CNNs currently offer better recognition quality than alternative object recognition and image classification algorithms.
“We recognise a strong and increasing demand for object recognition and image classification applications,” said Michaël Uyttersprot, Technical Marketing Manager for Avnet Silica. “CNNs are widely implemented in many embedded vision applications for different markets, including the industrial, medical, IoT and automotive market.”
A convolutional neural network is a type of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the neural connectivity found in the animal visual cortex. Individual neurons in the visual cortex respond only to stimuli from a restricted region of space, known as their receptive field. The receptive fields of neighbouring neurons partially overlap, spanning the entire visual field. It has been shown that the response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a 3D convolution operation, which is used extensively in CNNs.
The CNN architecture is formed by stacking together many layers of differentiable functions, which transform the input instance into an appropriate output response (e.g. holding the class scores). A number of different layer types are commonly used when building a CNN: convolutional, pooling, non-linear, adding, concatenation and fully connected layers.
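To make this stacking concrete, here is a deliberately simplified sketch in pure Python. It is not how a real accelerator or library computes these layers (production systems operate on multi-channel 3D tensors with optimised kernels); the toy image and kernel sizes are illustrative assumptions.

```python
# Toy sketch of three common CNN layer types, stacked in sequence.
# Illustrative only: real CNNs work on 3D tensors with optimised libraries.

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as conventionally used in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

def relu(fmap):
    """Non-linear layer: element-wise rectification."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Pooling layer: 2x2 max pooling with stride 2."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# Stack the layers: input image -> convolution -> non-linearity -> pooling.
image = [[float((i + j) % 4) for j in range(6)] for i in range(6)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
features = max_pool2x2(relu(conv2d(image, kernel)))
```

Each stage shrinks or transforms the feature map: the 6x6 input becomes a 5x5 map after the 2x2 convolution, and a 2x2 map after pooling.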
However, the superior accuracy of CNNs comes at the cost of high computational complexity. All CNN networks are extremely computationally demanding, requiring billions of computations to process a single input instance. The largest CNNs (such as the VGG models) require more than 30 billion computations to process one input image. This is a huge number, significantly reducing the applicability of CNNs in embedded/edge devices. For example, using the VGG-16 CNN to process a video stream at 25 frames per second requires the target processing system to perform more than 750 GOp/s in order to achieve the target frame rate.
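The 750 GOp/s figure follows directly from the per-image operation count quoted above; a quick back-of-the-envelope check:

```python
# Throughput requirement for real-time VGG-scale inference
# (figures taken from the text above).
ops_per_image = 30e9           # ~30 billion operations per image
frames_per_second = 25         # target video frame rate
required_throughput = ops_per_image * frames_per_second
print(required_throughput / 1e9, "GOp/s")   # 750.0 GOp/s
```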
All CNN networks are also extremely memory-demanding, requiring hundreds of megabytes of memory for storing network parameters. For example, the VGG-16 CNN has more than 138 million network parameters. With a 16-bit fixed-point representation, more than 276 Mbytes of memory must be allocated just for storing the network parameters.
Furthermore, the sizes of the intermediate feature maps generated as a CNN processes an input image can also be very large, requiring significant memory resources. For example, the largest input feature map in the VGG-16 CNN is around 6 Mbytes. Although significantly smaller than the network-parameter memory, even this amount can be unacceptably high for many embedded applications.
Even more importantly, since the VGG-16 CNN is composed of a total of 21 layers, the total feature map data movement required to process one input image exceeds 60 MB. Moving that amount of data between main memory and the processing system can have a significant impact on power consumption.
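The parameter-memory figure can be reproduced with the same kind of estimate (VGG-16 parameter count from the text, 16-bit fixed point assumed):

```python
# Weight storage for VGG-16 with 16-bit fixed-point parameters.
num_params = 138_000_000       # VGG-16 parameter count (from the text)
bytes_per_param = 2            # 16-bit fixed-point representation
param_memory_mb = num_params * bytes_per_param / 1e6
print(param_memory_mb, "MB")   # 276.0 MB just for the weights
```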
Future CNNs are likely to be larger and deeper, to process even larger input instances, and to perform more intricate classification tasks at faster speeds, increasingly in real time and within low-power environments. While general-purpose compute engines, especially GPUs, have been the mainstay of CNN processing to date, there is growing interest in more specialised acceleration of CNN computations.
From the above observations it can be concluded that, in order to deploy CNNs in embedded/edge devices, specialised hardware architectures for processing CNNs must be developed, since only these can provide sufficient processing power within a strict power envelope.
Kortiq, a Munich-based company, has recently developed a novel CNN hardware accelerator called AIScale. Distributed as an IP core intended for implementation in FPGA technology, AIScale offers improved processing speed and power consumption compared to existing solutions when accelerating any CNN network architecture.
When designing a CNN hardware accelerator, especially one intended for embedded/edge applications, its underlying architecture should be optimised for the following:
- Processing speed – Provide as much compute power as possible using as few computing resources as possible. This enables reaching the required performance goals (target frames-per-second values) and reduces power consumption, while minimising silicon area and therefore cost.
- Reduced memory requirements – This enables deployment of CNN-based systems even in the most restrictive embedded scenarios, reducing the total cost of the system. It also reduces power consumption, since less data has to be moved between the memory and the CNN accelerator.
- Universality – The architecture should support different CNN architectures out of the box, making the accelerator easy for software developers to use.
- Scalability – The architecture should provide easy and seamless scaling of compute power, enabling straightforward deployment in applications with different compute power demands.
Kortiq’s AIScale accelerator meets all of these needs in the following way:
- It is designed to process pruned/compressed CNNs and compressed feature maps. This increases processing speed by skipping all unnecessary computations and reduces the memory needed to store both CNN parameters and feature maps. These features also help reduce power consumption.
- It is designed to support all layer types found in today’s CNN architectures (convolution, pooling, adding, concatenation, fully connected). This yields a highly flexible and universal system that can support different CNN architectures without modifying the underlying hardware.
- It is designed to be highly scalable: simply provide more or fewer Compute Cores (CCs), the core processing blocks of the AIScale architecture. By using an appropriate number of CCs, different processing power requirements can easily be met.
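The benefit of processing pruned networks can be illustrated with a toy zero-skipping dot product. This is a sketch only; AIScale's actual compressed weight and feature-map formats are proprietary and not described in the text.

```python
def sparse_dot(weights, activations):
    """Multiply-accumulate that skips pruned (zero) weights.

    Returns the dot product and the number of multiplications
    actually performed, illustrating the compute saved by pruning.
    """
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0.0:            # pruned weights contribute nothing: skip them
            acc += w * a
            macs += 1
    return acc, macs

# A 75%-pruned weight vector: only 2 of 8 multiplications are needed.
weights = [0.0, 0.5, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
result, macs_done = sparse_dot(weights, activations)
```

The same principle, applied across the millions of multiply-accumulates in a convolutional layer, is what lets a sparsity-aware accelerator trade pruned weights directly for speed and energy savings.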
AIScale is designed as a standard soft IP core, implementable on any Xilinx Zynq SoC or MPSoC (Multi-Processor System-on-Chip), and provides all the processing features required to accelerate any CNN architecture. As a result, if the user wants to change the CNN architecture being accelerated, no change to the underlying hardware accelerator is needed.
AIScale uses a proprietary binary description format to specify the topology of the target CNN. This description takes the form of a linked list, in which each node specifies the characteristics of one layer of the CNN. AIScale processes this linked list sequentially: it automatically reconfigures itself for optimal processing of the upcoming CNN layer, performs all the operations defined by that layer, and moves on to the subsequent layer. This is repeated until all CNN layers have been processed, marking the end of processing for one input image. The complete process can then start again from the first CNN layer with a different input image.
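As an illustration of the linked-list idea (the actual binary format is proprietary, and the field names below are hypothetical, not Kortiq's), such a description and its sequential traversal might look like:

```python
# Hypothetical sketch of a linked-list CNN description.
# Field names are illustrative, not Kortiq's actual binary format.

class LayerNode:
    def __init__(self, layer_type, params, next_node=None):
        self.layer_type = layer_type    # e.g. "conv", "pool", "fc"
        self.params = params            # per-layer configuration
        self.next = next_node           # link to the following layer

def process_network(head, process_layer):
    """Walk the list one node at a time, as the accelerator does:
    configure for the current layer, execute it, then follow the link."""
    node = head
    while node is not None:
        process_layer(node)             # reconfigure and run this layer
        node = node.next                # move to the subsequent layer

# Build a three-layer description: conv -> pool -> fully connected.
net = LayerNode("conv", {"kernel": 3, "channels": 64},
      LayerNode("pool", {"window": 2},
      LayerNode("fc",   {"outputs": 1000})))

visited = []
process_network(net, lambda n: visited.append(n.layer_type))
```

Because the network is just data, swapping in a different CNN means pointing the traversal at a different list head, with no change to the processing logic itself.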
By using this approach, it is extremely easy to change CNN networks that need to be accelerated. The user simply has to point the accelerator to the desired CNN description file that it should process. Description files for many different CNN networks can be held in memory at the same time, waiting to be selected for processing. Switching between different CNN networks is extremely fast and happens almost instantly. This enables dynamic switching between different CNNs at run time, something which most other solutions cannot support.
The AIScale architecture uses four AXI interfaces to communicate with surrounding on-chip modules, which greatly simplifies integration within ARM-based SoCs. Working with the AIScale IP core in the Vivado IP Integrator is as easy as with any other standard IP core from the broad Xilinx IP repository. The user simply places an instance of the AIScale IP core into an IP Integrator block diagram of the complete system and runs the connection automation wizard to connect it to the rest of the system. After that, the system is ready to be implemented on the selected FPGA device.
Uyttersprot concluded: “Kortiq developed a very small and efficient machine learning accelerator optimised for Xilinx Zynq SoCs and MPSoCs. Xilinx Zynq devices are highly flexible and reconfigurable, and the AIScale IP cores can be dynamically reconfigured to adapt the accelerator on the fly for the required task or for any deep learning topology. This unique functionality can target many machine learning applications.”