Alexa, how do you train a voice assistant?

Voice technology is changing the way we interact with the world. From instructing Alexa to turn on the lights to asking Siri to find the nearest pizza restaurant, voice-enabled assistants are fast becoming a ubiquitous part of our everyday lives.

And usage is growing fast. Projections show that by the end of 2020, 50% of all searches will be conducted by voice, while 75% of US homes will have at least one smart speaker. The most telling forecast, however, comes from Juniper Research, which predicts there will be a staggering 8 billion digital voice assistants in use by 2023, up from an estimated 2.5 billion at the end of 2018.

Naturally, as consumers adopt voice technology more widely, businesses will evolve to add voice-enabled assistants to their products and services. Businesses are not only optimizing for voice but also creating content experiences designed primarily for voice interactions.

However, there are certain caveats when it comes to consumers’ approval of voice technology.  

A 2018 Consumer Intelligence Series Study conducted by PwC showed that consumers have certain criteria they expect their voice assistants to meet. At a bare minimum, respondents expect their voice assistants to be accurate (73% agree), to understand the accent every time someone speaks (61% agree), to tell the difference between multiple voices (57% agree), to save them time (59% agree), and to make their lives easier (55% agree).  

The bottom line is that consumers expect voice-enabled assistants to be smart, helpful and mostly on target. If they’re not, the experience quickly becomes isolating and frustrating instead of helpful for the user.  

But how are voice-enabled assistants designed to enrich the customer experience and give a competitive edge to the businesses that create them? The answer is high-quality training data. The better and more specific the data, the more accurate the voice assistant.

All You Need Is High-Quality Training Data

“Businesses need to start thinking about voice as part of their strategy. The key ingredient? High-quality training data to enable fluent conversations that exceed expectations.”

Daan Baldewijns, Director of Technical Program Management at DefinedCrowd. 

To provide responses to consumer queries, a voice-enabled assistant needs to be able to listen and understand. However, for it to be accurate, it needs to be trained with data that is specific to its purpose. Let’s take a deeper look. 

Training to Listen 

The process of training begins with the collection of raw data: samples of speech from people who represent the age, gender and linguistic profile of your target group. These voice recordings should relate to specific scenarios (cancelling a credit card, for example) within the domain in which the voice assistant operates (such as banking). Speech data can be scripted, spontaneous, or in dialogue form.

Beware the Accent Gap 

Collected data should represent every voice in your targeted audience. A joint study spearheaded by the Washington Post showed significant inconsistencies in how voice assistants understand people from different parts of the US. The less white, educated and affluent the consumer interacting with the voice assistant, the less accurate the voice assistant became.  

It seems real-world bias seeps into the artificial world during the data collection stage. For voice assistants to be useful to everyone, no matter their education level, accent or gender, the technology needs to be trained with a diverse range of voices, not just a select few.
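One way to catch this kind of gap early is to compare the make-up of the collected speakers against the audience you are targeting. The snippet below is a minimal sketch of that check, not a prescribed method: the speaker record layout, the accent group names, the target shares and the 5% tolerance are all hypothetical values chosen for illustration.

```python
from collections import Counter

def coverage_gaps(speakers, target_shares, key="accent", tolerance=0.05):
    """Return groups whose share of the collected speakers falls short of target.

    speakers: list of dicts, one per recorded speaker (hypothetical schema).
    target_shares: dict mapping group name to the desired fraction of speakers.
    tolerance: shortfall allowed before a group is reported (an assumption).
    """
    counts = Counter(s[key] for s in speakers)
    total = sum(counts.values())
    gaps = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if actual + tolerance < target:
            gaps[group] = {"target": target, "actual": round(actual, 3)}
    return gaps

# Hypothetical collection: 100 speakers tagged with an accent group.
speakers = (
    [{"accent": "midwest"}] * 70
    + [{"accent": "southern"}] * 25
    + [{"accent": "spanish_l1"}] * 5
)
targets = {"midwest": 0.5, "southern": 0.25, "spanish_l1": 0.25}

# Reports only the under-represented group(s).
print(coverage_gaps(speakers, targets))
```

The same check can be repeated for age band, gender or any other attribute recorded about the speakers by changing `key`.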

Collect Background Noise 

During the collection stage, it’s also important to simulate the sounds of the environment in which the voice assistant will function. A voice-enabled car dashboard, for example, will need to operate effectively in an environment with traffic noise, while a home smart speaker will need to hear language over the sounds of a boiling kettle or television. Simulating these background noises during speech collection will ensure the voice assistant has the highest chance of success.
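Recording with the noise present is what is described above. A related technique, often used alongside it, is to mix separately recorded background noise into clean speech afterwards at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that alternative only; the function names, the synthetic sine-wave "speech", the random "noise" and the 10 dB target are assumptions made for the example.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a non-empty list of float samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech RMS over noise RMS equals snr_db."""
    if not speech:
        return []
    if len(noise) < len(speech):
        # Loop the noise clip so it covers the whole utterance.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise clip: nothing to add
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]

# Synthetic stand-ins for real recordings (one second at 16 kHz).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]

noisy = mix_at_snr(speech, noise, snr_db=10)
```

A lower `snr_db` produces a harder, noisier sample, so a range of values can approximate quiet and loud versions of the same environment.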

Once the speech data has been recorded, the audio is transcribed and then validated. This is a crucial stage of data collection. The transcriptions will be used to train the models, so every sound needs to be transcribed: every word, pause, cough, repetition and hesitation. If the transcriptions don’t pass validation, they need to be re-transcribed until they do.
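The validation method is not spelled out here. One plausible approach, shown as a sketch, is to have a second person transcribe the same audio and measure how far the two transcripts disagree using word error rate (word-level edit distance divided by the length of the reference). The bracketed `[pause]` tag convention and the 10% threshold are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    if not ref:
        return 1.0 if hyp else 0.0
    return prev[-1] / len(ref)

def passes_validation(first_pass, second_pass, max_wer=0.10):
    """True when two independent transcripts agree closely enough (assumed threshold)."""
    return word_error_rate(first_pass, second_pass) <= max_wer

# The second transcriber dropped the pause and the hesitation.
first = "i want to [pause] cancel my um credit card"
second = "i want to cancel my credit card"
needs_retranscription = not passes_validation(first, second)
```

Because pauses and hesitations count as tokens, a transcript that silently drops them fails the check, which matches the requirement that every sound be captured.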

Training to Understand 

Now you’ve got accurate transcriptions of samples of speech from people who represent your target group. But how are these used to train models?

This is where natural language processing comes in. Once speech data has been recorded, transcribed and validated, it is annotated to identify domain (the context in which it is used, e.g. banking or healthcare) and intent (cancelling a credit card, finding a prescription or playing a song).

People acting as annotators label specific sentences or text excerpts according to predefined categories of intent (semantic annotation) or domain (named entity tagging). As there is often ambiguity in a speaker’s intent, multiple annotators and quality checks should be used to ensure consistency and higher quality data.  
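A simple way to combine the judgments of several annotators is a majority vote that only accepts a label when enough of them agree, and sends the rest back for review. This is a minimal sketch of that idea rather than a description of any particular annotation platform; the intent names and the two-thirds agreement threshold are hypothetical.

```python
from collections import Counter

def consolidate_labels(labels, min_agreement=2 / 3):
    """Return (label, agreement) when enough annotators agree, else (None, agreement)."""
    if not labels:
        raise ValueError("at least one annotation is required")
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement  # ambiguous: route to an extra annotator or a reviewer

# Three annotators label the same utterance with an intent.
clear = consolidate_labels(["cancel_card", "cancel_card", "report_fraud"])
ambiguous = consolidate_labels(["cancel_card", "report_fraud", "block_card"])
```

Utterances that come back as `None` are often the genuinely ambiguous ones, so reviewing them can also reveal intent categories that overlap and need clearer definitions.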

Training For Emotional Intelligence 

“Thanks a lot, Siri.” These could be the words of a genuinely happy customer or those of an annoyed one. Humans can pick up the difference in meaning based on the context of the preceding conversation and the tone used in the response. However, voice assistants struggle to recognize the emotion behind the words.  

Once again, the problem is solved with high-quality and accurate annotation. People label transcripts according to emotion, mood and sentiment. If a voice assistant detects the irritability, anger or annoyance of a customer, it can quickly placate the customer by transferring the call to a human consultant.  
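Once each customer turn carries an emotion label, the hand-off decision can be as simple as a rule over the most recent turns. The sketch below assumes the labels come from a model trained on the annotated transcripts; the label names and the rule of two consecutive negative turns are illustrative assumptions, not a recommended policy.

```python
# Emotion labels treated as negative (hypothetical label set).
NEGATIVE_EMOTIONS = {"anger", "annoyance", "irritability"}

def should_escalate(turn_emotions, window=2):
    """True when the last `window` customer turns were all labeled negative."""
    recent = turn_emotions[-window:]
    return len(recent) == window and all(e in NEGATIVE_EMOTIONS for e in recent)

# Emotion label per customer turn, oldest first.
conversation = ["neutral", "annoyance", "anger"]
transfer_to_human = should_escalate(conversation)
```

Requiring more than one negative turn avoids transferring a call on a single misclassified utterance, at the cost of reacting one turn later.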

The Key Considerations Of A Voice Assistant 

As seen from the above training methods, the key to creating a successful voice assistant is to fuel it with high-quality data. But to collect relevant and useful data, you need to think carefully about your customer needs.  

You need to know in which environment the voice assistant will operate; how the customer will use the assistant; what voices it will hear; what content it will need to understand; and what levels of interaction it will offer. Once you have defined these parameters, you can collect the data needed to create a truly useful and effective voice assistant.  

And the benefits are well worth it. According to Gartner research, businesses that employ voice assistants report a reduction of up to 70% in inquiries, increased customer satisfaction and a 33% saving per voice engagement.

Voice assistants are changing the way businesses interact with their customers. However, for you to reap the rewards offered by the technology, voice assistants need to be trained correctly. It’s only with high-quality, relevant data that voice assistants can surpass customer expectations, increasing sales and brand loyalty.  

Keen to learn more about voice assistants and how they are trained? Check out our white paper, How To Train A Voice Assistant.