The Speed and Power Advantage of a Purpose-Built Neural Compute Engine
By Jeremy Holleman
Keyword spotting (KWS) in edge devices is a challenging task. The algorithm must reliably respond to target words while avoiding false alarms, and it must do so while consuming only a small fraction of a device’s battery capacity. Until recently, the only devices capable of performing KWS at edge-compatible power levels were low-power microcontrollers (MCUs) and digital signal processors (DSPs). Syntiant has recently introduced the NDP10x series of Neural Decision Processors, highly efficient chips purpose-built for KWS using deep neural networks.
It can be difficult to compare the power consumption of different devices. Datasheets may describe current consumption in µA/MHz, but it is not generally possible to convert that figure into the power required for a given task. Additionally, different network topologies, training procedures, and so on affect both system accuracy and power consumption. The real question for system designers is: “How much power will be added by a solution running a KWS algorithm accurate enough to satisfy my use case?” To answer that question, we measured several of the lowest-power KWS options available and compared their speed, accuracy, and power on a common task: the Google keyword dataset assembled by Pete Warden of Google.
The Contenders
We measured Syntiant’s NDP100 solution using our internal development board. Our network is a four-layer, fully connected network. For comparison points, we first looked to the paper “Hello Edge: Keyword Spotting on Microcontrollers” (a worthwhile read for anyone interested in acoustic NNs on MCU-compatible hardware resources). The Hello Edge team compiled performance and complexity data for small, medium, and large networks across several topologies and published their source code, including pre-trained models for two networks: the small depthwise-separable CNN (DS-CNN) and the small fully connected (FC) network. The documentation specifically mentions the STM32F746 Discovery board, so we started there.
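As a concrete reference point, here is a minimal Keras sketch of a small fully connected KWS classifier in the same style as the networks compared below. The layer widths and the flattened 49 x 10 MFCC input are illustrative assumptions, not the exact Syntiant or Hello Edge configurations.

import tensorflow as tf

NUM_CLASSES = 12  # 10 target keywords plus "silence" and "unknown" in the Google set

# Hypothetical small fully connected network; sizes chosen only for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49 * 10,)),           # flattened MFCC features (assumed shape)
    tf.keras.layers.Dense(144, activation="relu"),
    tf.keras.layers.Dense(144, activation="relu"),
    tf.keras.layers.Dense(144, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()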
We also included the STM32L476G, from ST’s low-power line of microcontrollers, measured on its evaluation board.
We also looked at the Ambiq Apollo 3 Blue, using the SparkFun Edge board, a joint effort by Google, Ambiq, and SparkFun, running the pre-trained TinyConv model provided with the accompanying guide.
Test Description
We modified the code to run a batch of 20 inference cycles on each of the platforms. For the MCU targets, the input audio is taken from a pre-recorded waveform stored in memory to avoid contaminating the measurements with microphone power. The measured power includes computation of the MFCC or filter-bank energy features, but not ADC or microphone power. The NDP measurements include decimation of the PDM output from a digital microphone as well as feature extraction and neural inference. MCUs often have several on- and off-chip peripherals powered from the same supply, so we subtracted a baseline current measurement made with the MCU powered up but idle. The NDP results are direct measurements with nothing subtracted. The MCU experiments were performed with hardware settings as configured in the open-source software referenced above.
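For clarity, here is a rough Python transcription of the structure of that test loop: 20 inference cycles over a waveform held in memory, so no microphone or ADC power enters the measurement. The extract_features() and run_inference() functions are placeholders for the platforms’ MFCC/filter-bank front ends and compiled networks, and the one-second frame length is an assumption.

import numpy as np

SAMPLE_RATE = 16000            # Hz, as in the Google speech commands recordings
FRAME_LEN = SAMPLE_RATE        # one 1-second frame per inference (assumed)
NUM_INFERENCES = 20            # size of the measured batch

def extract_features(frame):
    # Placeholder front end: log magnitude spectrum standing in for the
    # MFCC / filter-bank energy computation done on the real targets.
    return np.log(np.abs(np.fft.rfft(frame)) + 1e-6)

def run_inference(features):
    # Placeholder classifier; on the real hardware this is the neural network.
    return int(np.argmax(features[:12]))

# Pre-recorded audio stored in memory, as on the MCU targets (silence here).
waveform = np.zeros(FRAME_LEN * NUM_INFERENCES, dtype=np.float32)

for i in range(NUM_INFERENCES):
    frame = waveform[i * FRAME_LEN:(i + 1) * FRAME_LEN]
    label = run_inference(extract_features(frame))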
We made our current measurements using a Keysight B2901A SMU, taking average values for current during inference and idle. Here is an example of the measurement made for the STM32F746. We see 121 mA during the twenty inference cycles, which last 990 ms, with the supply current returning to an 82 mA baseline after inference is completed. For this particular measurement we removed a 0 Ω jumper (R21) from the board to access the MCU VDD pin directly. The supply voltage at the MCU pin is 3.3 V, but it is reduced to 1.3 V by an internal regulator. The energy per inference is calculated as (120.8 mA − 82.4 mA) × 1.3 V × 0.99 s / 20 = 2471 µJ/frame.
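The same arithmetic, written out in Python with the numbers from the STM32F746 measurement above:

# Energy per inference from the averaged current measurements quoted above.
I_ACTIVE = 120.8e-3    # A, average supply current during the inference burst
I_IDLE = 82.4e-3       # A, baseline current with the MCU powered up but idle
V_CORE = 1.3           # V, core voltage after the internal regulator
T_BURST = 0.99         # s, duration of the 20-inference burst
N_INFERENCES = 20

energy_per_inference = (I_ACTIVE - I_IDLE) * V_CORE * T_BURST / N_INFERENCES
print(f"{energy_per_inference * 1e6:.0f} uJ per inference")   # prints 2471 uJ per inference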
Results
The table below summarizes the measurements. The trade-offs between model complexity, accuracy, compute time, and energy make direct apples-to-apples comparisons difficult, so we compare across three axes. Energy per inference is the energy required to process one frame of audio samples and produce a label (e.g., “yes,” “no,” “hello”), including feature extraction and running inference. Inference time is the time required to process one frame of audio, and frame rate is its inverse, indicating the maximum number of audio frames per second that the processor can process with a given network. G10 accuracy is the accuracy on the Google-10 keyword task. Estimates for the operation and parameter counts for the Small FC and Small DS-CNN networks are taken from the “Hello Edge” paper. For the TinyConv network used on the Ambiq chip, those figures were estimated from the code in the “Speech Commands” TensorFlow example.
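For example, the frame rate follows directly from the measured inference time; using the STM32F746 burst from the measurement section (20 inferences in 0.99 s):

# Frame rate is the inverse of the per-frame inference time.
t_inference = 0.99 / 20          # s per frame, i.e. 49.5 ms
frame_rate = 1.0 / t_inference   # ~20.2 frames per second
print(f"{t_inference * 1e3:.1f} ms per inference, {frame_rate:.1f} fps")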
We see that the purpose-built compute path in the NDP100 provides dramatic energy savings relative to the stored-program architecture of an MCU solution. The closest result in energy, from the Ambiq Apollo 3 Blue, is 27x higher than the NDP energy, and it was obtained with a much smaller and less accurate network. To achieve similar accuracy, we have to look at the STM32L476G running the Small DS-CNN network, which requires 193x the energy of the Syntiant chip.
The frame rates are also critical. Production KWS systems typically require at least 20, and often 100, fps in order to avoid “stepping over” a target utterance. The parallelism and dedicated data path of the NDP also provide a dramatic speed advantage over low-power MCUs. The Apollo’s energy advantage over the other MCUs comes with a speed penalty, resulting in a maximum frame rate of 3.7 fps, which will typically yield much lower real-world accuracy than the 66% software accuracy (measured in TensorFlow on pre-aligned data) would predict. Of the tested MCUs, only the STM32F746, from ST’s high-performance F7xx line, exceeds 20 fps, and it does so at an energy cost about 700x higher than the NDP, resulting in a power cost of about 50 mW on top of the baseline power required simply to keep the MCU powered up.
Overall, these results confirmed our expectation that a purpose-built compute engine for neural inference can provide dramatic energy and speed advantages over stored-program architectures.
Jeremy Holleman, Ph.D., is the chief technology officer of Syntiant Corp. He is an expert on ultra-low-power integrated circuits and directs the Integrated Silicon Systems Laboratory at the University of North Carolina at Charlotte, where he is an associate professor.