Defined Crowd
Building a voice assistant model

Building a Voice Assistant Model

An end-to-end data solution for smarter product development

The Challenge

With the rise of voice technology, this leading global provider of audio equipment wanted to develop an automatic speech recognition (ASR) model to test in their products. However, traditional data vendors did not offer the proper tools or a diverse enough crowd to represent their target base. With our global community of 210,000+ people and industry-leading enterprise portal, DefinedCrowd® was ideally equipped to serve their needs.

The client would need high-quality data to train an ASR system on everything from simple audio system commands like “repeat” to fuller assistant requests like “find me a restaurant,” which could be spoken in a quiet home or in a moving vehicle with background noise. The system would need to understand variations of the same request, such as “make it louder” or “turn it up,” as well as accents and other factors that influence people’s speech.

To achieve the right result, the quality of data used to train the ASR system would be paramount.

The Solution

DefinedCrowd’s enterprise portal served as an end-to-end pipeline to collect everything needed to build a voice assistant from the ground up. The project used our purpose-built speech and NLP workflows and drew on the skills of our diverse human-in-the-loop community.

Step 1. Speech collection
  • The client’s requirements were converted into an online task on our Neevo platform that was specific to their project.
  • 230 qualified people from our community were selected to record phrases, in their own words, related to specific scenarios. Example: “Make a request to play some music you’d like to hear”.

This step would not only help the virtual assistant understand customers; it would also train it to speak to users at a later stage of development.

230 people, 9,434 recordings

Step 2. Transcription & validation
  • Additional community members transcribed the speech data we collected from audio format into text.
  • Because quality is highest when data is validated by someone other than the person who produced it, we recruited a separate group to validate the transcriptions.

Data that did not pass the validation process was recollected and validated at no additional charge to the customer.

143 people, 18,415 audio transcriptions (incl. corrections)
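
The rule that a transcription is validated by someone other than its author can be expressed as a simple assignment constraint. The following is a minimal sketch, not DefinedCrowd's actual implementation: the function name, the worker ids and the tuple layout are all hypothetical and chosen only for illustration.

```python
from itertools import cycle


def assign_validators(transcriptions, validators):
    """Pair each transcription with a validator who did not write it.

    `transcriptions` is a list of (item_id, transcriber_id) tuples and
    `validators` is a list of worker ids. Both shapes are hypothetical.
    """
    if not validators:
        raise ValueError("at least one validator is required")

    pool = cycle(validators)  # rotate through validators to spread the load
    assignments = {}
    for item_id, transcriber_id in transcriptions:
        # Try each validator at most once per item, skipping the author.
        for _ in range(len(validators)):
            candidate = next(pool)
            if candidate != transcriber_id:
                assignments[item_id] = candidate
                break
        else:
            raise ValueError(f"no independent validator available for {item_id}")
    return assignments


work = [("a1", "w1"), ("a2", "w2"), ("a3", "w1")]
result = assign_validators(work, ["w1", "w2", "w3"])
```

Items that fail validation would then be routed back to collection, as described above; that routing is left out of the sketch.
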
Step 3. Semantic annotation
  • The validated transcriptions were compiled with a set of further transcriptions from the client.
  • A further group of 75 people annotated the sentences, identifying the speakers’ intent. Examples: adjusting volume, noise cancellation, finding a restaurant.
  • They also categorized phrases related to specific artists, music genres, song titles and other entertainment categories based on domains defined by the client.

This step would ensure that the ASR model could respond appropriately to user requests.

75 people, 22,526 sentences (incl. some of the client’s own transcriptions)

Step 4. Entity tagging

A subset of the data that was semantically annotated in the previous step was then transferred to a final entity tagging job. This reduced cost, because only the rows that contained entities went through this step rather than the entire data set.

  • 90 community members tagged the remaining sentences with the entities mentioned. Example: «Lady Gaga» tagged as «Singer».
  • A pre-defined inter-annotator agreement was used to ensure consistency.

90 people, 16,039 sentences
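
The case study does not say which inter-annotator agreement measure was used or what threshold was set. As an illustration only, the sketch below computes Cohen's kappa, one common agreement measure for two annotators labelling the same items. The function name and the sample labels are assumptions, not project data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's own tag frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


# Hypothetical entity tags from two annotators for four sentences.
annotator_1 = ["Singer", "Singer", "Genre", "Song"]
annotator_2 = ["Singer", "Genre", "Genre", "Song"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A job could then accept a batch only when the score meets the pre-defined level, and send lower-scoring batches back for review.
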
Step 5. Aggregating & delivering

Once the full process was completed, we aggregated, vetted and delivered a unique set of speech data to the client which would enable them to build a voice-enabled product from scratch.

This entire process was completed via DefinedCrowd’s enterprise portal. At every step, our machine learning algorithms monitored our community members’ work, eliminating data that was deemed inaccurate or invalid to ensure the highest possible quality of output.

The Results

With ongoing support, this client received the high-quality data they needed to train, test and tune a model for the development of voice-enabled products. The data measured over 98% accuracy, based on a 1.8% word error rate and an F1 score threshold of 0.80. The data set would enable them to build a baseline ASR model and lead to follow-up work for further customization.
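
Word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a transcript into the reference, divided by the number of reference words. Assuming accuracy is reported as 1 minus the word error rate, a 1.8% rate corresponds to 98.2% accuracy. The sketch below shows that standard calculation; the function name and sample phrases are illustrative and this is not the scoring code used in the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("turn it up please", "turn it up")
accuracy = 1 - wer
```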

It’s an ideal formula to test, learn, design and iterate using an agile approach for a successful go-to-market strategy. Having already worked together for over a year, this client was familiar with DefinedCrowd’s services and saw how using our full end-to-end process to combine diverse data workflows would deliver greater quality, efficiency and ROI. From data collection to validation, aggregation and every step in between, DefinedCrowd has the proprietary tools, expertise and commitment to be your partner in training data, delivering results with:

  • Expertise: We are experts in our field
  • Reliability: We deliver what we promise
  • Innovation: We are working on the world of tomorrow
  • Trust: We are your trusted data partner