You may know them as Siri or Alexa. Also known as personal assistants, these smart devices are attentive listeners. Say a few words and they’ll play a favorite song or show you the way to the nearest gas station. But all that listening comes with a privacy risk. To help people protect themselves from eavesdropping devices, a new system plays soft, calculated sounds that mask conversations and confuse the devices.
The smart devices use automated speech recognition — or ASR — to convert sound waves into text, explains Mia Chiquier. She studies computer science at Columbia University in New York City. The new program fools the ASR by playing sound waves that vary with your speech. Those added waves garble the sound signal, making it hard for the ASR to pick out your words. It “completely confuses this transcription system,” Chiquier says.
She and her colleagues describe their new system as “voice camouflage.”
The volume of the masking sounds is not what matters most. In fact, those sounds are quiet. Chiquier likens them to the hum of a small air conditioner in the background. The trick to making them effective, she says, is to match these so-called “attack” sound waves to what someone is saying. To do that, the system predicts the sounds someone will say next. It then quietly emits sounds chosen to confuse the smart speaker’s interpretation of those words.
Chiquier described it on April 25 during the virtual International Conference on Learning Representations.
Get to know you
Step one in creating great speech camouflage: get to know the speaker.
If you text a lot, your smartphone starts anticipating what the next few letters or words in a message will be. It also gets used to the type of messages you send and the words you use. The new algorithm works in much the same way.
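That kind of word guessing can be pictured with a toy next-word model. This is only a sketch of the analogy, not anything the research team built; all the function names here are made up:

```python
from collections import Counter, defaultdict

def train_bigrams(history):
    """Count which word tends to follow which, the way a phone
    learns from the messages you have already typed."""
    follows = defaultdict(Counter)
    words = history.lower().split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1
    return follows

def suggest_next(follows, word):
    """Suggest the most common follower of `word`, or None if
    the word has never been seen before."""
    options = follows.get(word.lower())
    return options.most_common(1)[0][0] if options else None
```

A model trained on your old messages would then propose the word you most often type next, just as the camouflage system proposes the sounds you are most likely to say next.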
“Our system listens to the last two seconds of your speech,” explains Chiquier. “Based on that speech, it anticipates the sounds you might make in the future.” And not just somewhere in the future, but half a second later. That prediction is based on the characteristics of your voice and your language patterns. This data helps the algorithm learn and calculate what the team calls a predictive attack.
That attack comes down to the sound the system plays in addition to the speaker’s words. And it keeps changing with every sound someone speaks. When the attack plays along with the words predicted by the algorithm, the combined sound waves turn into an acoustic medley that confuses any ASR system within earshot.
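The listen-predict-attack loop described above can be sketched in Python. This is a heavily simplified stand-in with invented names and placeholder math, not the researchers’ actual neural network:

```python
import random

SAMPLE_RATE = 16_000          # audio samples per second (an assumption)
CONTEXT = 2 * SAMPLE_RATE     # the last 2 seconds of speech the model hears
HORIZON = SAMPLE_RATE // 2    # it predicts half a second into the future

def predict_future_speech(context):
    """Stand-in for the trained predictive model: guess the next
    half second of speech from the last two seconds heard.
    (A real system would use a neural network; this placeholder
    just echoes the tail of the context.)"""
    return context[-HORIZON:]

def attack_for(predicted):
    """Stand-in for the attack generator: produce a quiet waveform
    meant to garble the ASR's reading of the predicted sounds."""
    return [0.05 * random.uniform(-1, 1) for _ in predicted]

def camouflage_stream(speech, chunk=HORIZON):
    """Process speech chunk by chunk, emitting an attack sound for
    each predicted half second before it is actually spoken."""
    heard, attacks = [], []
    for start in range(0, len(speech), chunk):
        heard.extend(speech[start:start + chunk])   # listen
        context = heard[-CONTEXT:]                  # keep last 2 s
        predicted = predict_future_speech(context)  # predict ahead
        attacks.append(attack_for(predicted))       # play the attack
    return attacks
```

The key design point the sketch preserves is timing: each attack sound is computed from speech already heard, so it can be playing by the time the predicted words are spoken.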
These predictive attacks are also hard for an ASR system to outwit, Chiquier says. For example, if someone tries to disrupt an ASR by playing a single sound in the background, the device can subtract that sound from the speech it hears. That’s true even if the masking sound changes periodically over time.
The new system instead generates sound waves based on what a speaker has just said. So the attack sounds are constantly changing – and in an unpredictable way. According to Chiquier, that makes it “very difficult for [an ASR device] to defend against.”
Attacks in action
To test their algorithm, the researchers simulated a real-life situation. They played a recording of someone speaking English in a room with a medium level of background noise. An ASR device listened in and transcribed what it heard. The team then repeated this test after adding white noise in the background. Finally, they repeated it with their voice-camouflage system running.
The voice camouflage algorithm prevented ASR from hearing words correctly 80 percent of the time. Common words like “the” and “our” were the hardest to mask. But those words don’t hold much information, the researchers add. Their system was much more effective than white noise. It even performed well against ASR systems designed to cut out background noise.
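That 80-percent figure is a masking rate: the share of words the ASR got wrong. A toy way to score such a test is to compare the transcript against what was really said. This is a simplification for illustration only; real speech benchmarks use an edit-distance word error rate, not position-by-position matching:

```python
def word_error_fraction(reference, transcript):
    """Fraction of reference words the ASR failed to transcribe,
    compared position by position (a simplified stand-in for the
    standard word-error-rate metric)."""
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    wrong = sum(1 for i, word in enumerate(ref)
                if i >= len(hyp) or hyp[i] != word)
    return wrong / len(ref)
```

Under this scoring, a masking system succeeds when the fraction is high: the device heard something, but not the words that were actually spoken.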
The algorithm could one day be embedded in an app for real-world use, Chiquier says. To make sure that an ASR system couldn’t listen in reliably, “you just open the app,” she says. “That’s about it.” The system could be added to any device that transmits sound.
Such an app, however, is still some way off. More testing must come first.
This is “good work,” says Bhiksha Raj. He is an electrical and computer engineer at Carnegie Mellon University in Pittsburgh, Pennsylvania. He was not involved in the new study. But he, too, studies how people can use technology to protect their speech and voice privacy.
Smart devices currently dictate how a user’s voice and conversations are protected, Raj says. But he believes that control should instead belong to the person speaking.
“There are so many aspects to a voice,” explains Raj. Words are one aspect. But a voice can also contain other personal information, such as a person’s accent, gender, health, emotional state or physical size. Companies may be able to abuse these features by targeting users with different content, ads, or prices. They could even sell speech information to others, he says.
When it comes to voice, “figuring out exactly how to cover it up is challenging,” Raj says. “But we need to have at least some of it under control.”