How engineers are optimising Edge AI with NPUs and model compression

As AI models move from design to production, engineers face a two-fold challenge: delivering real-time performance on embedded devices while working within those devices’ limited computational power and memory.

Neural Processing Units (NPUs) offer a powerful hardware solution, as they excel at handling AI models’ intensive computational requirements. However, the large size of AI models can make deployment on NPUs challenging, highlighting the value of model compression techniques. Effective real-time Edge AI requires a closer look at how NPUs and model compression techniques – like quantisation and projection – can work together.

How NPUs power real-time performance on embedded devices

A key challenge in deploying AI models on embedded devices is minimising inference time – how long a model takes to generate a prediction – to ensure real-time responsiveness. For example, in real-time motor control applications, inference times often need to be below 10 milliseconds to maintain system stability and responsiveness, and to prevent mechanical stress or component damage. Engineers must optimise for speed while managing memory, power, and prediction quality.
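
As a rough illustration of how such a latency budget might be checked during development, the Python sketch below times repeated inference calls against a 10 millisecond budget. The predict function and its random weights are placeholders for a real model, and on target hardware the measurement would normally come from hardware timers or profiling tools rather than a host-side script.

import time
import numpy as np

# Placeholder model: a single dense layer with ReLU, standing in for a trained network.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32)).astype(np.float32)

def predict(x):
    return np.maximum(x @ W, 0.0)

BUDGET_MS = 10.0   # example real-time budget, e.g. for a motor-control loop
x = rng.standard_normal((1, 64)).astype(np.float32)

predict(x)         # warm-up call before timing
latencies_ms = []
for _ in range(100):
    t0 = time.perf_counter()
    predict(x)
    latencies_ms.append((time.perf_counter() - t0) * 1e3)

worst = max(latencies_ms)
status = "within" if worst <= BUDGET_MS else "over"
print(f"Worst-case latency: {worst:.3f} ms ({status} the {BUDGET_MS} ms budget)")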

NPUs are specifically designed for AI inference and neural network computations, making them ideal for embedded systems where processing power is limited and energy efficiency is critical. Unlike Central Processing Units (CPUs), which are general-purpose processors, or Graphics Processing Units (GPUs), which are powerful but energy-intensive, NPUs are optimised to efficiently compute the matrix operations most commonly used in neural networks. Although GPUs can also perform AI inference, NPUs are more cost-effective and consume significantly less power.

From a cost perspective, NPUs offer a more economical way to handle AI tasks than relying on a microcontroller (MCU), GPU, or FPGA alone. Although chips with integrated NPUs often cost more upfront than basic microcontrollers, their value lies in superior energy efficiency and dedicated AI processing: they deliver significantly higher inference performance at much lower power consumption than CPUs. This efficiency can reduce operating costs and extend battery life on embedded devices, making NPUs more cost-effective over time. They also enable real-time AI processing without resorting to more expensive, power-hungry alternatives such as GPUs or FPGAs, further strengthening their economic appeal.

While NPUs are designed for efficient AI inference, they typically have limited memory and power, especially in embedded systems. Model compression is essential to reduce model size and complexity, allowing NPUs to deliver real-time performance without exceeding system constraints.

Compressing AI models using both projection and quantisation

Model compression techniques help deploy large AI models to the edge by reducing their size and complexity, improving inference speed, and reducing power consumption. However, excessive compression can degrade prediction quality, prompting engineers to carefully evaluate how much accuracy they can sacrifice to meet hardware constraints.

Two complementary compression techniques – projection and quantisation – can be combined to optimise AI models for deployment on NPUs. Projection reduces model size by removing redundant learnable parameters, while quantisation further compresses the model by converting the remaining parameters to lower-precision (typically integer) datatypes. Used together, they can compress both the structure and datatypes of models, improving their efficiency without significant reductions in accuracy.

It is recommended to apply projection first to structurally compress the model, reducing its size and complexity, and then apply quantisation to further minimise memory usage and computational cost.

Projection

Neural network projection is a structural compression technique available in MATLAB Deep Learning Toolbox that can be used to reduce the number of learnable parameters in a model by projecting the weight matrices of layers onto lower-dimensional subspaces. Based on principal component analysis (PCA), the method identifies the directions of greatest variance in neural activations and removes redundant parameters by approximating high-dimensional weight matrices with smaller, more efficient representations. This reduces memory and computation requirements while preserving much of the model’s accuracy and expressivity.
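
The Python/NumPy sketch below illustrates the underlying idea on a single dense layer; it is not the Deep Learning Toolbox implementation, and the layer sizes, the number of retained components k, and the synthetic calibration activations are illustrative assumptions. Principal components of the activations entering the layer are used to factor its weight matrix into two much smaller matrices.

import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, k = 256, 128, 32        # illustrative layer sizes; k = retained components

# Original dense layer: y = x @ W.
W = rng.standard_normal((in_dim, out_dim)).astype(np.float32)

# Synthetic calibration activations with low intrinsic dimensionality plus a little
# noise (in practice these would be collected by running representative data
# through the trained network).
mix = rng.standard_normal((k, in_dim))
X = (rng.standard_normal((1000, k)) @ mix + 0.05 * rng.standard_normal((1000, in_dim))).astype(np.float32)

# PCA on the activations: directions of greatest variance.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (len(Xc) - 1)
_, eigvecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
Q = eigvecs[:, -k:].astype(np.float32)           # top-k principal directions, (in_dim, k)

# Project the layer: x @ W  ≈  (x @ W_down) @ W_up.
W_down = Q                                       # (in_dim, k) learnable parameters
W_up = Q.T @ W                                   # (k, out_dim) learnable parameters

original_params = W.size
projected_params = W_down.size + W_up.size
print(f"parameters: {original_params} -> {projected_params} "
      f"({projected_params / original_params:.0%} of original)")

# Check the approximation on held-out activations from the same distribution.
X_test = (rng.standard_normal((100, k)) @ mix + 0.05 * rng.standard_normal((100, in_dim))).astype(np.float32)
err = np.linalg.norm(X_test @ W - (X_test @ W_down) @ W_up) / np.linalg.norm(X_test @ W)
print(f"relative output error: {err:.4f}")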

Quantisation

Quantisation is a datatype compression technique that reduces AI models’ memory footprint and computational complexity by converting their learnable parameters (weights and biases) from high-precision, floating-point values to lower-precision, fixed-point integer types. This can reduce a model’s memory usage and inference time, making it especially effective in preparing models for deployment on NPUs. While quantisation introduces some loss of numerical precision, calibrating the model with input data representative of real-world operation can typically preserve accuracy within acceptable limits for real-time applications.
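
A minimal Python/NumPy sketch of symmetric 8-bit quantisation is shown below; it is a conceptual illustration rather than any particular tool’s quantisation scheme, and the layer sizes and calibration data are illustrative assumptions. The weight scale is derived from the weights themselves, the activation scale is calibrated from representative input data, and integer inference is emulated with int8 operands and int32 accumulation.

import numpy as np

rng = np.random.default_rng(0)

# Float32 weights of a trained dense layer (illustrative values).
W = rng.standard_normal((256, 128)).astype(np.float32)

def int8_scale(values):
    # Symmetric scheme: map the largest observed magnitude to 127.
    return float(np.max(np.abs(values))) / 127.0

def quantise(values, scale):
    return np.clip(np.round(values / scale), -128, 127).astype(np.int8)

# Weight scale from the weights; activation scale calibrated from representative inputs.
w_scale = int8_scale(W)
W_q = quantise(W, w_scale)

calibration_inputs = rng.standard_normal((500, 256)).astype(np.float32)
x_scale = int8_scale(calibration_inputs)

# Emulate integer inference: int8 operands, int32 accumulation, rescale to float.
x = rng.standard_normal((1, 256)).astype(np.float32)
x_q = quantise(x, x_scale)
y_int32 = x_q.astype(np.int32) @ W_q.astype(np.int32)
y_quantised = y_int32.astype(np.float32) * (x_scale * w_scale)

y_reference = x @ W
rel_error = np.linalg.norm(y_reference - y_quantised) / np.linalg.norm(y_reference)
print(f"weight memory: {W.nbytes} bytes -> {W_q.nbytes} bytes")
print(f"relative output error after quantisation: {rel_error:.4f}")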

Use case: deploying quantised models on microcontrollers at STMicroelectronics

STMicroelectronics developed a workflow using MATLAB and Simulink to deploy deep learning models on STM32 microcontrollers. The engineers started by designing and training the model, followed by hyperparameter tuning and knowledge distillation to reduce model complexity. Next, they applied projection to structurally compress the model by removing redundant parameters, and quantisation to convert weights and activations to 8-bit integers, reducing memory usage and improving inference speed. This dual-stage compression approach enables the deployment of deep learning models on resource-constrained NPUs and MCUs without sacrificing real-time performance.
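
As a conceptual illustration only – not STMicroelectronics’ actual code or toolchain – the short Python/NumPy sketch below chains the two stages on a single layer, applying projection first and then 8-bit quantisation, and reports the memory footprint after each stage. All sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, k = 256, 128, 32          # illustrative layer sizes and retained rank

# Trained layer weights and calibration activations with low intrinsic dimension.
W = rng.standard_normal((in_dim, out_dim)).astype(np.float32)
X = (rng.standard_normal((1000, k)) @ rng.standard_normal((k, in_dim))).astype(np.float32)

# Stage 1 - projection: factor the layer through the top-k principal directions.
Xc = X - X.mean(axis=0)
_, eigvecs = np.linalg.eigh((Xc.T @ Xc) / (len(Xc) - 1))
Q = eigvecs[:, -k:].astype(np.float32)
W_down, W_up = Q, (Q.T @ W)                # x @ W  ≈  (x @ W_down) @ W_up

# Stage 2 - quantisation: convert the remaining parameters to int8.
def to_int8(values):
    scale = float(np.max(np.abs(values))) / 127.0
    return np.clip(np.round(values / scale), -128, 127).astype(np.int8), scale

(W_down_q, s_down), (W_up_q, s_up) = to_int8(W_down), to_int8(W_up)

print(f"original weights:        {W.nbytes:7d} bytes (float32)")
print(f"after projection:        {W_down.nbytes + W_up.nbytes:7d} bytes (float32)")
print(f"after projection + int8: {W_down_q.nbytes + W_up_q.nbytes:7d} bytes (int8)")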

Best practices for AI model deployment on NPUs

Model compression techniques like projection and quantisation can greatly improve the performance and deployability of AI models on NPUs. However, because compression can affect accuracy, iterative testing – using both simulation and hardware-in-the-loop validation – is critical to ensure models meet functional and resource requirements. Testing early and often allows engineers to catch and address issues before they escalate, reducing the risk of late-stage rework and helping ensure smooth deployment to embedded systems.
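
One way to make that evaluation concrete is an accuracy-versus-compression sweep. The Python/NumPy sketch below reuses the toy projection example to try progressively stronger compression on held-out validation activations and reports which settings stay within an assumed error tolerance; the tolerance, layer sizes, and candidate values of k are illustrative assumptions, and final checks would still rely on processor-in-the-loop or hardware-in-the-loop testing on the target.

import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim, true_rank = 256, 128, 32
W = rng.standard_normal((in_dim, out_dim)).astype(np.float32)
mix = rng.standard_normal((true_rank, in_dim))

def make_activations(n):
    # Activations with low intrinsic dimensionality plus a little noise.
    latent = rng.standard_normal((n, true_rank))
    noise = 0.05 * rng.standard_normal((n, in_dim))
    return (latent @ mix + noise).astype(np.float32)

X_calib = make_activations(2000)   # used to fit the projection
X_val = make_activations(500)      # held out for accuracy checks

Xc = X_calib - X_calib.mean(axis=0)
_, eigvecs = np.linalg.eigh((Xc.T @ Xc) / (len(Xc) - 1))

y_ref = X_val @ W
TOLERANCE = 0.05                   # assumed acceptable relative output error

# Try progressively stronger compression and report which settings stay within budget.
for k in (64, 48, 32, 24, 16):
    Q = eigvecs[:, -k:].astype(np.float32)
    y_compressed = (X_val @ Q) @ (Q.T @ W)
    err = float(np.linalg.norm(y_ref - y_compressed) / np.linalg.norm(y_ref))
    params = in_dim * k + k * out_dim
    status = "pass" if err <= TOLERANCE else "fail"
    print(f"k={k:3d}: parameters={params:6d}, relative error={err:.3f} ({status})")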

A unified ecosystem can also address many challenges associated with AI model deployment, simplifying integration, accelerating development, and supporting comprehensive testing throughout the process. This is especially valuable in today’s fragmented software landscape, where engineers must often integrate disparate codebases into their simulation workflows or larger system environments. Integration and validation become even more complex when AI platforms operate separately from standard development environments, and incorporating NPUs adds further complexity to the toolchain – strengthening the case for a unified ecosystem.

Designing for the Edge: balancing power, precision, and performance

The future of embedded AI is engineered for performance, built to thrive at the edge, and powered by AI models driving complex engineered systems. Engineers’ success depends on understanding model compression trade-offs, testing on hardware early, and building adaptable systems. By uniting smart NPU and AI model design with strategic compression techniques, engineers can turn embedded devices into powerful, real-time decision-makers.

About the authors:

Johanna Pingel, Product Marketing Manager, MathWorks

Reed Axman, Senior Partner Manager, MathWorks

 
