Can You Hear Me Now?
About AudioSmart Far-Field Voice Technology
Oct 07, 2019
By Trausti Thormundsson
A broad variety of companies have developed sophisticated voice service platforms, including Amazon with its Alexa Voice Service (AVS); Google with the Google Voice Assistant (GVA); Samsung with Bixby; Apple with Siri; Naver and NTT Docomo, which are popular in Japan; SK Telecom and Korea Telecom in Korea; and Baidu, Alibaba and Tencent in China. Each of these companies has a very complex voice-centric software platform that is customized for specific geographies and runs on a plethora of devices ranging from small light bulbs to large refrigerators, all of which are activated with a wake word algorithm and far-field voice technology, both of which are specialties of Synaptics.
Now that edge computing is taking root in the smart home, Synaptics has created a dual wake word offering to differentiate cloud-based requests (like weather reports and streaming music) from future edge-based commands like “turn on the lights” (which is handled in the cloud today). In addition, the smart device needs to understand the various accents and languages found across the globe, ranging from an American Southern drawl to British English to Mandarin and Japanese. These are complicated problems to solve in order to support a great user experience.
But I digress; I’ll leave wake word tech for another article. Today I want to focus on far-field voice technology, specifically the ability to extract the user's voice in often challenging environments where it needs to cut through various ambient noises and the audio coming from the device itself. As market adoption and use of helpful voice-triggered smart devices like speakers and sound bars continue to shoot through the roof, many people may wonder, “How do smart devices know my voice from a TV or radio?” Let’s take a closer look at the nuts and bolts of far-field voice and the chips and software that enable it.
Understanding the Far-Field Voice Problem
When speech travels from our mouth to a smart device, it is affected in many ways that alter the signal the voice assistant ends up hearing. Speech is attenuated as it travels from us to the device: in free space, the sound pressure level drops by about 6 dB each time the propagation distance doubles. The speech sound pressure wave also bounces off all the surfaces around us, so the device “sees” not only the speech signal that traveled directly from our mouth to the device but also numerous copies of the signal, delayed and scaled by various amounts. In fact, the voice assistant observes the sum of all these different copies. This effect is called reverberation. Attenuation and reverberation happen even in a perfectly quiet room.
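To put a number on the attenuation rule, here is a minimal Python sketch of the free-field relationship, in which sound pressure level falls by 20·log10 of the distance ratio. The function names and the 60 dB SPL at 1 m starting level are assumptions chosen for illustration; they are not figures from Synaptics.

```python
import math

def spl_drop_db(d_near_m: float, d_far_m: float) -> float:
    """Drop in sound pressure level (dB) between two distances in free space."""
    return 20.0 * math.log10(d_far_m / d_near_m)

def spl_at_distance(spl_ref_db: float, d_ref_m: float, d_m: float) -> float:
    """Level at d_m, given a known level spl_ref_db measured at d_ref_m."""
    return spl_ref_db - spl_drop_db(d_ref_m, d_m)

# Assumed starting point for illustration: 60 dB SPL measured 1 m from the talker.
for distance_m in (1.0, 2.0, 4.0):
    level = spl_at_distance(60.0, 1.0, distance_m)
    print(f"{distance_m:.0f} m: {level:.1f} dB SPL")
# prints 60.0, 54.0 and 48.0 dB SPL: about 6 dB lost per doubling of distance
```

Real rooms deviate from this free-field figure because reflections add energy back, which is the reverberation described above.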
All our environments have background noises, and in some cases these noises can be quite loud, such as noise coming from kitchen appliances, the television, street traffic, or other people talking. In general, all our activities generate some acoustic signals. The voice assistant hears all these background noises as well, so the microphone picks up the sum of our attenuated and reverberated speech and these background noises.
Last but not least, the voice assistant’s microphones will also pick up the reverberated version of the audio that the voice assistant itself is playing. This is called acoustic echo, and in many cases this signal can be two orders of magnitude larger than the actual voice signal that the voice assistant needs to hear, since the assistant's loudspeaker is much closer to its microphones than the user is.
The task of far-field voice processing is to take the microphone signals, which include the acoustic echo, the acoustic noises, and the reverberated speech, and reconstruct a signal that faithfully represents what was spoken.
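The signal model described in this section can be sketched in a few lines of Python: reverberation as a sum of delayed, scaled copies of the source, and the microphone signal as the sum of reverberated speech, acoustic echo, and background noise. Every number below (the path delays and gains, the toy signals) is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def delayed_scaled_sum(signal, paths):
    """Sum copies of `signal`, each delayed by `delay` samples and scaled by `gain`."""
    max_delay = max(delay for delay, _ in paths)
    out = [0.0] * (len(signal) + max_delay)
    for delay, gain in paths:
        for n, x in enumerate(signal):
            out[n + delay] += gain * x
    return out

def mix(*components):
    """Add signals sample by sample; shorter signals are treated as zero-padded."""
    length = max(len(c) for c in components)
    return [sum(c[n] for c in components if n < len(c)) for n in range(length)]

# Hypothetical toy signals: a single click stands in for speech and for playback.
speech = [1.0, 0.0, 0.0, 0.0]
playback = [0.0, 2.0, 0.0, 0.0]
noise = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01, 0.01]

# (delay in samples, gain): a direct path plus reflections. All values assumed.
speech_paths = [(0, 0.5), (2, 0.25), (3, 0.125)]
echo_paths = [(0, 1.0), (1, 0.5)]

reverberant_speech = delayed_scaled_sum(speech, speech_paths)
acoustic_echo = delayed_scaled_sum(playback, echo_paths)
microphone = mix(reverberant_speech, acoustic_echo, noise)
print(microphone)
```

Far-field processing has to work backwards from `microphone` to something close to `speech`, knowing only the playback signal and what the microphones captured.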
How to Extract the User's Voice in Negative Signal to Noise Ratio Situations
The most challenging situations for far-field voice processing are the cases when the acoustic noises are close in level to the user’s voice or even louder. This can happen, for example, when the microphone on a smart device is too close to a television or noisy kitchen appliances. When the noise is louder than the voice, the signal-to-noise ratio (SNR), expressed in decibels, turns negative. But there are a few features that can enhance voice communication and automatic speech recognition performance in real-world, noisy environments. The features are as follows:
Smart Source Locator (SSL): Allows the device to automatically determine the number of acoustic sources around the device, even if they are all active concurrently. In addition, it can determine if a given source is emitting speech only.
Smart Source Pickup (SSP): Uses the signal from two or more microphones to extract a single audio source from all the other audio sources in the microphone signal. To decide which source to focus on, Smart Source Pickup uses information from the Smart Source Locator. This allows the SSP to focus on the user's voice and cancel noise from all directions around the device (omnidirectionally), even if noise sources are in the same direction as the user. This process also performs partial de-reverberation.
Voice Barge In: Enabled by full duplex Acoustic Echo Cancellation (AEC), the Voice DSP can detect the wake word even when the device is playing music or voice prompts loudly.
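To make the negative-SNR condition from the start of this section concrete, here is a small Python sketch that computes SNR in decibels from the RMS levels of a speech block and a noise block. The sample values are hypothetical, and in a real device the two signals are not available separately, which is exactly why the features above are needed.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def snr_db(speech, noise):
    """SNR in dB; the result is negative when the noise is louder than the speech."""
    return 20.0 * math.log10(rms(speech) / rms(noise))

# Hypothetical blocks: the noise has twice the amplitude of the speech.
speech = [0.1, -0.1, 0.1, -0.1]
noise = [0.2, -0.2, 0.2, -0.2]
print(f"{snr_db(speech, noise):.1f} dB")  # prints "-6.0 dB"
```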
What About Acoustic Echo Cancellation (AEC)?
The AEC in devices like Harman’s speakers, Ecobee’s light switch, or Netgear’s mesh Wi-Fi router has a dual sub-band filter structure, which ensures faster convergence and deeper cancellation. Further, the AEC utilizes the capability of the SSP as a post filter, where the SSP treats nonlinear echo residue as noise. The outcome is a deep cancellation of both linear and nonlinear echo, resulting in robust barge-in performance even at very loud playback volume.
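For readers new to echo cancellation, the basic idea can be illustrated with a textbook normalized least-mean-squares (NLMS) adaptive filter. To be clear, this is not the Synaptics design: it is a single full-band, time-domain filter rather than a dual sub-band structure, and it has no post filter and no handling for the user talking over the playback. The function name, tap count, step size, and the toy echo path are all assumptions made for illustration.

```python
import random

def nlms_echo_canceller(reference, microphone, num_taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo from the microphone signal.

    reference  -- samples sent to the loudspeaker
    microphone -- samples captured by the microphone
    Returns (residual, weights); residual is the echo-reduced signal.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # recent reference samples, newest first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        # Normalizing by the reference energy keeps adaptation stable at any volume.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

# Toy scenario: the microphone hears only the echo of random playback,
# shaped by a hypothetical four-tap loudspeaker-to-microphone path.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, 0.3, -0.2, 0.1]
microphone = [
    sum(g * reference[n - k] for k, g in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]

residual, weights = nlms_echo_canceller(reference, microphone)
print([round(w, 3) for w in weights[:4]])  # approaches the echo path
```

Because the device knows exactly what it is playing, the filter can learn the path from loudspeaker to microphone and subtract its prediction. A purely linear filter like this one leaves nonlinear residue from loudspeaker distortion, which is the part the article says the SSP post filter treats as noise.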
Can You Hear Me Now?
Smart homes will get smarter in more ways than one, and voice remains a pivotal human interface with a fast growth trajectory. But rapid adoption of smart devices is wholly reliant on them actually being helpful and meaningful to people. We are not quite there yet, but Synaptics is leading the way with purpose-built and powerful edge computing SoCs with integrated far-field voice and custom wake word technologies that our customers leverage with their evolving voice service platforms.
Watch a video demonstration of Synaptics' far-field voice technology performing in a very noisy environment. https://bcove.video/2Bct94F