Speed Your Time to Market with Off-the-Shelf Data You Can Trust

With DefinedCrowd’s new offering, high-quality data is now just a click away 

When launching AI products, speed to market is a competitive advantage. However, collecting, annotating and validating the data required to train successful AI can take weeks, if not months.

For companies looking to build baseline models, adapt existing models with domain corpora, extend models with extra languages, or test and evaluate current models, off-the-shelf data provides a viable alternative to custom-built datasets.  

To help these companies bring their AI initiatives to market quickly, DefinedCrowd is proud to announce the launch of DefinedData, an online marketplace of AI datasets available for on-demand purchase.

The need for continuous access to highly accurate data 

“Machine learning teams building AI models have always faced one particularly pressing problem, and that is continuous access to highly accurate data. When large enterprise tech firms want to launch their AI initiatives into the market quickly, they simply don’t have the time to collect and validate the data required to do so. DefinedData aims to solve this problem by providing customers with access to an extensive library of speech datasets that will rapidly accelerate their AI programs,” said DefinedCrowd’s VP of engineering, Andrew Michalik.   

DefinedData will provide customers with on-demand access to high-quality data available in multiple languages, domains, recording types and pricing options. The initial offering includes scripted recordings in English, Italian, Portuguese, German, French and Dutch, covering the healthcare, entertainment, hospitality, automotive and generic domains.

By May 2021, the library is expected to grow to include over 25,000 hours of both scripted and spontaneous recordings in the above languages as well as in Spanish, Hindi and Japanese. Additional domains will include banking, insurance, telecom, retail and IVR. 

Customers can instantly request samples or request to purchase one or more datasets for immediate use.

Quality is key to success 

“Although strong algorithms require lots of data, they also require accurate and high-quality data,” said Michalik. “After all, the success of AI models depends on the quality of the data used to fuel them. Firms considering off-the-shelf data can rest assured that quality is at the core of everything we do at DefinedCrowd.”

To ensure the highest levels of accuracy and authenticity, the primary quality control mechanism for DefinedCrowd’s speech datasets is Word Error Rate, which is less than 5% for scripted recordings and less than 10% for spontaneous recordings. For speech collection, quality is ensured by measuring accuracy levels in gender distribution, age distribution, recording environment (noisy vs. silent), nativeness (native vs. non-native, and the level of fluency of non-native speakers), domain (accuracy in staying on topic) and segmentation (for spontaneous collections).
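
For readers unfamiliar with the metric, Word Error Rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn a transcript into its reference, divided by the number of words in the reference. The announcement does not describe how DefinedCrowd computes it, so the following is only a minimal sketch of the standard definition; the function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative only: splits on whitespace and compares words exactly,
    with no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

Under this definition, a scripted recording would fall within the stated 5% figure if its transcript had fewer than one word error for every 20 reference words.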

“Whether you’re building a prototype or minimum viable product, evaluating or benchmarking current models, building synthetic datasets, or simply need quality data fast, our continually updated library of datasets will help you quickly achieve your AI goals,” concluded Michalik.

With DefinedData, accessing high-quality data has never been easier.