Smart speaker acoustic measurements
Smart speakers are a relatively new class of consumer audio device with unique characteristics that make testing their audio performance difficult. In this application note, Joe Begin, Director of Applications & Technical Support at Audio Precision, provides an overview of smart speaker acoustic measurements with a focus on frequency response – the most important objective measurement of a device’s audio quality.
A smart speaker is an Internet-connected (usually wireless) powered speaker with built-in microphones that enables you to interact with an intelligent virtual assistant (IVA) using voice commands. Using voice only, you can direct it to perform tasks such as play audio content (news, music or podcasts, etc.) from the Internet or a connected device, control home automation devices, or even order items from a connected online shopping service.
Amazon was the first major company to release a smart speaker, called Echo, with an IVA known as Alexa, and it still has the dominant market share. Other significant entrants in this space include Alphabet (Google Assistant on Google Home smart speakers), Apple (Siri on HomePod mini speakers), Microsoft (Cortana on third-party speakers), Samsung (Bixby on Samsung Galaxy Home speakers) and a few others from China, Japan and South Korea.
From humble beginnings when first introduced, smart speakers quickly skyrocketed in popularity. In November 2014, shortly after the release of the first-generation Amazon Echo, a popular technology review site said of the event: “[T]he online retailer took another strange turn in the world of hardware by unveiling a weird wireless speaker with Siri-like ability to recognise speech and answer questions.”
Since then, however, the proliferation of smart speakers has exploded, and strong growth is projected to continue for at least the next five years. Within three years of the 2014 introduction, the USA alone had an installed base of 67 million smart speakers in households, and that number grew by 78% in 2018 to 118 million units.
The global smart speaker market was valued at USD 4.4 billion in 2017 and is projected to reach USD 23.3 billion by 2025. Several companies with IVAs license the technology to other manufacturers. For example, both Bose and SONOS offer Alexa-enabled and Google Assistant-enabled smart speakers. And not only speakers are getting 'smart': IVA technology, with microphones and loudspeakers to support it, is being added to all sorts of devices like refrigerators, microwave ovens and set top boxes, enabling voice control of those devices. Additionally, most smartphones can also play the role of a smart speaker.
Smart speaker IVAs
An interaction with a smart speaker begins with a specific 'wake word' or phrase, for example, 'Alexa' for Amazon, 'Hey Siri' for Apple, and so on, followed by a command. In their normal operating mode, smart speakers are in a semi-dormant state, but are always 'listening' for the wake word, which triggers them to acquire and process a spoken command.
In terms of speech recognition, smart speakers themselves are only capable of recognising the wake word (or phrase). The more computationally intensive speech recognition and subsequent processing is done by the IVA on a connected server.  The IVA converts the user’s speech to text and attempts to interpret the command. To invoke the requested response from the device, the spoken command must contain a sequence of keywords recognisable by the IVA. A successful interaction may result in the requested action being taken by the IVA (e.g., 'Set a timer for 10 minutes'.) or by a connected Web service (e.g., 'Play an Internet radio station').
Smart speakers contain several distinct audio subsystems, including:
- A microphone array
- A powered (active) loudspeaker system
- Front-end signal processing algorithms for tasks such as beamforming, acoustic echo cancellation and noise suppression
An array of microphones is used instead of a single microphone to enable the device to take advantage of beamforming, a signal processing technique that can effectively increase the signal-to-noise ratio of the speech signal sent to the IVA for processing. Based on correlations among signals received at different microphones in the array, a beamforming algorithm can detect the most likely direction of the talker in a room and, in a sense, focus on that direction by combining the various microphone signals in a way that attenuates signals coming from other directions.
This can effectively reduce the level of ambient noise and room reverberation in the speech signal sent to the IVA. Noise suppression may also be used to reduce the level of non-speech-like signals.
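As an illustrative sketch (not taken from the application note), the simplest form of this idea is a delay-and-sum beamformer: each microphone channel is delayed so that signals arriving from the look direction line up in time, then the channels are averaged, reinforcing the talker and attenuating sound from other directions. The function name, array geometry and integer-sample delay approximation below are all assumptions for illustration; real devices use more sophisticated adaptive beamformers.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Illustrative delay-and-sum beamformer.
    signals: (n_mics, n_samples) array of time-aligned mic captures
    mic_positions: (n_mics, 3) positions in metres
    direction: unit vector pointing from the array toward the talker
    fs: sample rate in Hz; c: speed of sound in m/s."""
    n_mics, n_samples = signals.shape
    proj = mic_positions @ direction
    # A plane wave from `direction` reaches mics with larger projection
    # first, so those channels are delayed more to time-align everything.
    delays = (proj - proj.min()) / c            # seconds, all >= 0
    out = np.zeros(n_samples)
    for ch, d in zip(signals, delays):
        shift = int(round(d * fs))              # integer-sample approximation
        out[shift:] += ch[:n_samples - shift]
    return out / n_mics                         # average of aligned channels
```

With the channels aligned, on-axis signals add coherently while off-axis sound and diffuse room noise add incoherently, which is the mechanism behind the signal-to-noise improvement described above.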
Ideally, a smart speaker will be able to respond to spoken commands (by first recognising the wake word) even while playing audio content such as music or speech in a room. Acoustic echo cancellation (AEC) is essential for preventing the loudspeaker output from completely masking the microphone input for this task. The signal being played on the loudspeaker system can be used as a reference signal for the AEC algorithm, enabling it to ignore the content being played and to recognise the wake word. Typically, playback is paused after the wake word is detected in order to help improve command recognition.
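The core of an AEC is an adaptive filter that models the loudspeaker-to-microphone echo path from the playback reference and subtracts the predicted echo from the microphone signal. A minimal sketch using the classic normalised LMS (NLMS) algorithm follows; the function name, filter length and step size are illustrative assumptions, and production AECs add double-talk detection, frequency-domain processing and nonlinear residual suppression.

```python
import numpy as np

def nlms_aec(far_end, mic, filter_len=128, mu=0.5, eps=1e-8):
    """Illustrative NLMS echo canceller.
    far_end: playback (reference) signal sent to the loudspeaker
    mic: microphone capture containing the acoustic echo
    Returns the echo-cancelled (residual) signal."""
    w = np.zeros(filter_len)               # adaptive model of the echo path
    x = np.zeros(filter_len)               # sliding window of reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = far_end[n]
        y = w @ x                          # predicted echo at the microphone
        e = mic[n] - y                     # residual after cancellation
        w += mu * e * x / (x @ x + eps)    # normalised LMS weight update
        out[n] = e
    return out
```

Once the filter has converged, the residual contains mostly near-end speech, which is what allows the wake-word detector to hear a talker over the device's own playback.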
Audio signal paths
The primary audio paths for a smart speaker are between the device and the IVA or a network server using the Internet with a Wi-Fi or wired connection. On the input side, a speech signal containing a spoken command is sensed with the device’s microphone array, digitised and then uploaded to the IVA for signal processing and command interpretation. On the output side, digital audio content is transmitted from a Web server to the device, where it is converted from digital to analog, then finally to an acoustic signal as it is played over the device’s loudspeaker system.
In addition to the two primary paths above, smart speakers may have several other audio paths, such as the following:
- An analogue output jack for connecting to an external powered speaker system
- An analogue input jack for using the smart speaker as a simple powered speaker
- Bluetooth connection for playing audio content on an external Bluetooth speaker, streaming content from a smartphone or tablet as a music source, or in some cases acting as a handsfree device for telephone calls
- Network connections to other smart speakers for multi-room music, stereo pairing or intercom functionality
- Connections to home automation devices, e.g., for two-way intercom connection to a security device, or audible status messages
The audio subsystems of smart speakers have a multitude of components that contribute to overall performance and audio quality, including microphones and microphone arrays, A/D and D/A converters, power amplifiers, loudspeaker drivers, digital signal processors, audio codecs, etc. In addition, several system-level functions such as beamforming, echo cancellation, wake word recognition, etc., contribute to overall quality. At some stage, each of these components and systems must be tested. Testing end-to-end performance of an overall smart speaker system is also desirable.
Different test contexts (consider R&D, validation, production test, quality assurance) have different goals and different levels of access to subsystems and components. For example, during product design, R&D engineers might well be able to isolate the active crossover functionality of a system on a chip (SOC) by physically tapping into chip-level connections (and have the first-hand product knowledge to be able to use the resulting signals). Similarly, for production test, manufacturers have the option of temporarily loading special test-specific firmware into the device to enable functional tests which are not available in off-the-shelf units. For example, noise reduction could be disabled, allowing the microphone input system to be tested with sinusoidal signals instead of speech.
Testing the overall end-to-end performance of a smart speaker’s primary input and output audio paths can be quite challenging for the following reasons:
- Input to and output from a smart speaker are both acoustic, and acoustic test is by its nature more complex than electronic (analog or digital) audio test. Acoustic tests require calibrated microphones, usually an anechoic test chamber, and a quality loudspeaker system to stimulate DUT microphones.
- Smart speakers are inherently open-loop devices. On the input side, a signal (typically speech) is captured, digitised and then transmitted to a server somewhere as a digital audio file. To assess the input path performance, the audio file must be retrieved from the server and analysed in comparison to the signal that was generated in the first place. On the output side, audio content that originates as an audio file on a server is streamed to the device where it is converted to analog and played on the device’s loudspeaker system. To assess the output path performance, the device’s loudspeaker output must be measured with a measurement microphone and compared with the original signal from the server. The original signal is often in the form of an encoded audio signal (e.g., MP3 or AAC [Advanced Audio Coding]), which requires that it be decoded before analysis.
- The analog-to-digital and digital-to-analog converters in the device will invariably run at slightly different sample rates than the audio analyser, requiring some form of compensation during analysis.
Measuring frequency response
The most important aspect of the performance of any audio device is its frequency response. Frequency response is a type of “transfer function” measurement. For a device under test (DUT), it represents the magnitude and phase of the output from the DUT per unit input, as a function of frequency. Devices are often compared in terms of the 'shape' of their frequency response curves, which typically refers to the magnitude response only (not phase), and in addition normalises the magnitude to a reference value. For example, the response magnitude might be normalised to its value at some reference frequency, say 1kHz, such that the normalised curve passes through 0dB at 1kHz.
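The normalisation described above can be sketched in a few lines: divide the spectrum of the measured output by the spectrum of the stimulus to get the transfer function, convert to dB, and subtract the value at the reference frequency. This is an illustrative simplification (the function name and single-FFT approach are assumptions; practical analysers average across frames and apply windowing).

```python
import numpy as np

def normalized_response_db(stimulus, response, fs, ref_freq=1000.0):
    """Magnitude frequency response H(f) = FFT(response) / FFT(stimulus),
    normalised so the curve passes through 0 dB at ref_freq."""
    H = np.fft.rfft(response) / np.fft.rfft(stimulus)
    freqs = np.fft.rfftfreq(len(stimulus), 1.0 / fs)
    mag_db = 20 * np.log10(np.abs(H) + 1e-12)
    ref_bin = np.argmin(np.abs(freqs - ref_freq))
    return freqs, mag_db - mag_db[ref_bin]   # 0 dB at the reference frequency
```

A device with a frequency-independent gain then plots as a flat line at 0 dB, which makes deviations from flatness easy to read off the curve.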
Usually, a flat frequency response (constant response magnitude versus frequency) is desirable in audio systems to ensure that source material is faithfully recorded and reproduced without spectral coloration. Flat frequency response is quite achievable in electronic audio systems, but much more difficult in acoustic devices, especially loudspeakers.
For loudspeakers, frequency response is also the basis from which another important metric, sensitivity, is derived. Sensitivity is a measure of the output from the device per unit input. For loudspeakers, sensitivity is calculated by averaging the frequency response magnitude over a range of frequencies, because they typically do not have flat frequency response.
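Given a measured response curve, this band-averaging step is straightforward: convert the dB magnitudes in the chosen band back to power-like quantities, take the mean, and convert back to dB. The function name and the 300 Hz–3 kHz default band below are assumptions for illustration; actual sensitivity standards specify the band, stimulus level and measurement distance.

```python
import numpy as np

def band_average_db(freqs, mag_db, f_lo=300.0, f_hi=3000.0):
    """Average a response magnitude (in dB) over a frequency band,
    as done when deriving a loudspeaker sensitivity figure."""
    band = (freqs >= f_lo) & (freqs <= f_hi)
    lin_power = 10.0 ** (mag_db[band] / 10.0)   # dB -> power-like quantity
    return 10.0 * np.log10(lin_power.mean())    # power average, back to dB
```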
Because frequency response is such an important audio quality metric, audio analysers have several different ways of measuring it. For example, some of the measurements available in Audio Precision audio analysers that could be considered for smart speaker testing include:
- Stepped sine sweep
- Logarithmically-swept sine (chirp)
- Transfer function
Each measurement technique has certain advantages and disadvantages with respect to measuring the primary input and output paths of a smart speaker.
Stepped sine sweep
Stepped sine testing is the classic means of measuring frequency response that has been in use since the earliest days of audio test, even before audio analysers existed. It involves testing a DUT by 'sweeping' over a series of discrete frequency steps within the frequency range of interest – usually some portion of the audible frequency range, considered to be 20Hz to 20kHz. At each frequency step, the device is stimulated with a sine wave, and its output is analysed to determine metrics such as level and harmonic distortion.
In general, one advantage of stepped sine testing over other techniques is that because of its long history of use, it is often considered to be a type of 'gold standard' against which other techniques may be compared if there is any doubt about measurement integrity. Another advantage is that other audio quality metrics such as total harmonic distortion (THD) and inter-channel phase can be measured at the same time as frequency response.
A general disadvantage of stepped sine testing is test time. Devices often have a transient response when the stimulus frequency is changed abruptly, requiring more time at each step for the device to settle to its steady-state behaviour. Another related disadvantage is poor frequency resolution. Because the frequency steps are discrete, higher resolution requires more steps, which increases test time.
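The settle-then-measure loop at the heart of a stepped sine sweep can be sketched as follows. This is an illustrative simplification: the function name, settling and measurement durations are assumptions, the DUT is modelled as a simple callable, and a real analyser would also extract harmonic levels for THD at each step.

```python
import numpy as np

def stepped_sine_sweep(dut, freqs, fs=48000, settle_s=0.2, measure_s=0.5):
    """Illustrative stepped-sine loop: for each frequency, drive the DUT
    with a sine, discard a settling interval, then measure the
    steady-state RMS level in dB. `dut` is any callable mapping a
    stimulus array to a response array."""
    levels_db = []
    for f in freqs:
        n = int((settle_s + measure_s) * fs)
        t = np.arange(n) / fs
        y = dut(np.sin(2 * np.pi * f * t))        # stimulate at this step
        steady = y[int(settle_s * fs):]           # drop the transient portion
        rms = np.sqrt(np.mean(steady ** 2))
        levels_db.append(20 * np.log10(rms + 1e-12))
    return np.array(levels_db)
```

The structure makes the test-time trade-off explicit: total sweep time is roughly (settling time + measurement time) multiplied by the number of frequency steps, so doubling the resolution doubles the test time.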
For acoustic devices, one of the biggest disadvantages of stepped sine testing is that a test environment free of reflections (usually an anechoic chamber) is required. An ordinary room with reflective surfaces can’t be used because reflected sound waves will combine (either constructively or destructively) with the direct sound waves of interest, causing severe measurement errors.
For testing the input path of a smart speaker, stepped sine testing has two additional disadvantages:
- Smart speakers are designed to capture and process speech signals. It is therefore quite likely that they will have digital signal processing (DSP) designed to attenuate sinusoidal signals. It might be difficult to effectively test this path with a stepped sine signal unless this feature can be disabled for the test.
- When conducting an end-to-end test, there is a limit to the length of signal the device will record when attempting to process speech commands (e.g., approximately 7 seconds, including the wake word, in the case of the Alexa IVA service). In this case, a stepped sine stimulus would have to be less than about 6 seconds.
Stepped sine testing works well for testing the output path of a smart speaker, provided the test environment is anechoic within the frequency range of interest.
For more information from Audio Precision (AP) on the test and measurement and R&D processes in the ever-growing smart speaker market, you can download the full version of the above application note by visiting AP's webpage.