“Siri, call Mom.”
“Alexa, reorder my favorite chocolate bar.”
“Hey Google, play my evening jazz playlist.”
Naturally, as consumers widely adopt voice technology, businesses are evolving to add voice-enabled assistants to their products and services. Not only are businesses optimizing for voice, they are also creating content experiences designed primarily for voice interactions.
However, there are certain caveats when it comes to consumers’ approval of voice technology.
A 2018 Consumer Intelligence Series study conducted by PwC showed that consumers have certain criteria they expect their voice assistants to meet. At a bare minimum, respondents expect their voice assistants to be accurate (73% agree), to understand their accent every time they speak (61% agree), to save them time (59% agree), to tell the difference between multiple voices (57% agree), and to make their lives easier (55% agree).
However, the most telling forecast comes from Juniper Research, which predicts that by 2023 there will be 8 billion digital voice assistants in use, an increase of 2.5 billion over the 2018 estimate!
But how are voice assistants designed to enrich customer experiences and still give businesses that create them a strong and long-lasting competitive edge?
In short, the process starts with high-quality speech training data.
Put another way, on-target voice assistant responses are the result of speech training data that is specific to the situation, rigorously managed to prevent bias, and representative of the target population. In other words, a voice assistant needs to be able to both listen and understand.
Follow along as we unpack this more.
Train Your Speech Recognition Model to Listen
The process of model training begins with collecting speech samples from people who represent the age, gender, and linguistic characteristics of your target group. These voice recordings should relate to specific scenarios pertaining to the context in which the voice assistant operates. For example, a credit card cancellation aligns well with a banking or retail environment.
You also want to consider different types of speech training data:
- Scripted Monologue Speech Training Data is where an individual speaker records specific utterances from a script.
- Spontaneous Dialogue Speech Data comes from multiple speakers conversing, whether guided by rules and scenarios or completely off-the-cuff.
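In practice, each collected recording carries metadata describing who spoke and in what scenario. One way such samples might be organized is sketched below; the field names are illustrative, not a standard format:

```python
from dataclasses import dataclass, field

@dataclass
class SpeechSample:
    """Hypothetical record for one collected speech sample."""
    audio_path: str            # path to the recording
    transcript: str            # verbatim transcription
    speech_type: str           # "scripted_monologue" or "spontaneous_dialogue"
    speaker_ids: list = field(default_factory=list)  # one speaker for scripted, several for dialogue
    scenario: str = ""         # e.g. "credit card cancellation"

sample = SpeechSample(
    audio_path="recordings/0001.wav",
    transcript="I'd like to cancel my credit card.",
    speech_type="scripted_monologue",
    speaker_ids=["spk_042"],
    scenario="credit card cancellation",
)
```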
Include Dialect and Accent in Speech Training Data
Real-world bias, unintended or intentional, can seep into machine learning models during the data collection stage. A study published in the Proceedings of the National Academy of Sciences of the United States of America found that speech recognition systems from several of the world's tech giants make far fewer errors with white speakers than with Black speakers.
For voice assistants to be useful to everyone, regardless of race, education, gender, or accent, speech recognition technology must be trained with a diverse range of voices.
During the collection stage, it's equally important to simulate sounds natural to the environment in which the voice assistant will function. A voice-enabled car dashboard, for example, will need to operate effectively amid traffic noise, while a home smart speaker will need to hear language over the sounds of a TV, a dishwasher, or children playing. Simulating these background noises during speech collection helps ensure the voice assistant has the highest chance of success.
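A common way to simulate such conditions is to mix recorded background noise into clean speech at a target signal-to-noise ratio (SNR). The following is a minimal sketch in pure Python; the function name and the toy signals are illustrative only:

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise signal into a speech signal at a target SNR in dB.
    Signals are equal-length lists of float samples; a simplified
    sketch of the noise-augmentation step."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so speech_power / scaled_noise_power == 10^(snr_db / 10)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]

# Toy example: one second of a 440 Hz tone at 16 kHz, mixed with
# uniform noise at 10 dB SNR.
random.seed(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```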
Train Speech Models to Understand
After you collect accurate speech samples from people who represent your target group, you're ready to train your speech recognition model.
This is where natural language processing comes in. Once speech data has been recorded, transcribed, and validated, it is annotated to identify domain (the context in which it is used, e.g., banking) and intent (e.g., cancelling a credit card).
People acting as annotators label specific sentences or text excerpts according to predefined categories of intent (semantic annotation) or domain (named entity tagging). Because there is often ambiguity in a speaker's intent, multiple annotators and quality checks should be used to ensure consistency and higher-quality data.
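One simple form of that quality check is majority-vote adjudication: keep a label only when enough annotators agree, and route low-agreement items to an expert reviewer. A minimal sketch, with hypothetical intent labels:

```python
from collections import Counter

def adjudicate(labels, min_agreement=2):
    """Resolve multiple annotators' intent labels for one utterance
    by majority vote; return None when agreement is too low, so the
    item can be routed to a senior reviewer. A simplified sketch."""
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label if count >= min_agreement else None

# Three annotators label the same utterance:
print(adjudicate(["cancel_card", "cancel_card", "check_balance"]))  # cancel_card
print(adjudicate(["cancel_card", "check_balance", "report_fraud"]))  # None: send for review
```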
“Thanks a lot, Siri.” These could be the words of a genuinely happy customer or those of an annoyed one. Humans can pick up the difference in meaning based on the context of the preceding conversation and the tone used in the response. However, voice assistants struggle to recognize the emotion behind words.
The problem is solved with high-quality and accurate annotation. People label transcripts according to emotion, mood, and sentiment. If a voice assistant detects irritability, anger, or annoyance in a customer’s voice, it can help transform a negative experience by transferring the call to a human.
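Such an escalation rule might be sketched as follows; the emotion labels and the confidence threshold are assumptions for illustration, not from any particular system:

```python
# Hypothetical set of emotions that should trigger a handoff.
NEGATIVE_EMOTIONS = {"anger", "irritation", "annoyance"}

def should_escalate(detected_emotion, confidence, threshold=0.7):
    """Return True when a confidently detected negative emotion
    suggests handing the conversation over to a human agent."""
    return detected_emotion in NEGATIVE_EMOTIONS and confidence >= threshold

print(should_escalate("anger", 0.9))  # True: transfer to a human
print(should_escalate("joy", 0.95))   # False: the assistant continues
```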
All You Need Is High Quality Training Data
“Businesses need to start thinking about voice as part of their strategy. The key ingredient? High quality training data to enable fluent conversations that exceed expectations.”
Daan Baldewijns, Director of Technical Program Management at DefinedCrowd.
Do What’s Necessary to Make Your Voice Assistant Successful
Voice assistants are changing the way businesses interact with their customers. For example, voice assistants make it possible for your customers to receive support any time, from anywhere they have an internet connection.
From kids to grandparents, we all expect voice assistants to be smart, helpful, and most of all, on target. If they're not, the frustrating customer experience can easily translate into a deflated brand or lost sales.
Bottom line: useful data means thinking carefully about your customers' needs. It is essential that you know:
- In which environment the voice assistant will operate
- How your customers will use the assistant
- What voices it will hear
- What content it will need to understand
- What levels of interaction it will offer
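The checklist above can be captured as a requirements spec before collection begins; a hypothetical example (all keys and values are illustrative):

```python
# Illustrative requirements spec for a speech data collection project.
voice_assistant_spec = {
    "environment": "in-car dashboard with road and traffic noise",
    "usage": "hands-free navigation and media control",
    "voices": ["US English", "UK English", "non-native accents"],
    "content_domains": ["navigation", "music", "messaging"],
    "interaction_levels": ["single command", "multi-turn dialogue"],
}

for requirement, value in voice_assistant_spec.items():
    print(f"{requirement}: {value}")
```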
With these defined parameters, your next step is to collect high-quality speech training data. You can collect training data yourself or source it from a provider you trust. If you don’t have a data solution provider in mind, request a speech training data sample from us.