Off-The-Shelf Data Available for Immediate Use
Speed your AI products to market with high-quality, off-the-shelf datasets from DefinedCrowd. These pre-collected datasets, annotated and validated by a global crowd, can be used to train baseline models or evaluate and benchmark current models. Browse our robust and dynamic catalog for datasets that suit your specific needs.
Watch the video to see how easy it is to find the speech data you need, or browse the catalog now.
Take Advantage of the Benefits
Time to Market
Quickly build baseline models, improve existing models, or adapt live models for faster expansion.
Purchase pre-annotated and validated data for model training and testing.
Choose between a one-time download and our discounted subscription options, whichever fits your needs.
Choose from multiple languages, domains and recording options.
Data For Every Use
Whether you’re building a prototype or minimum viable product, evaluating or benchmarking current models, building synthetic datasets, or simply need quality data fast, our continually updated library of datasets will help you quickly achieve your AI goals.
Quality is at Our Core
Your AI models require high-quality datasets, which is why quality underpins everything we do. The primary quality metric for our speech datasets is Word Error Rate (WER): below 5% for our scripted recordings and below 10% for our spontaneous recordings.
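For readers who want to check a dataset sample against these thresholds themselves, here is a minimal Python sketch of Word Error Rate using its standard definition: word-level substitutions, deletions and insertions divided by the number of words in the reference transcript. It is an illustration only and not necessarily how DefinedCrowd computes the figure. The function names, the lowercasing and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    # Assumption: simple normalization (lowercase, split on whitespace).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def meets_threshold(wer: float, recording_type: str) -> bool:
    """Compare a WER value with the thresholds quoted in the text above."""
    thresholds = {"scripted": 0.05, "spontaneous": 0.10}
    return wer < thresholds[recording_type]


if __name__ == "__main__":
    wer = word_error_rate(
        "please turn on the kitchen lights",
        "please turn on the kitchen light",
    )
    # One substitution over six reference words.
    print(f"WER: {wer:.3f}")
    print("Meets scripted threshold:", meets_threshold(wer, "scripted"))
```

In practice WER is averaged over many utterances, and text normalization (punctuation, numerals, casing) can change the result noticeably, so any comparison should use the same normalization on both transcripts.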
For speech collections, we ensure quality by measuring accuracy levels in:
\\ Gender distribution
\\ Age distribution
\\ Noisy vs silent
\\ Nativeness (accuracy of native speakers)
\\ Domain (accuracy in staying on topic)
\\ Segmentation (spontaneous collections)