Off-The-Shelf Data Available for Immediate Use

Speed your AI products to market with high-quality, off-the-shelf datasets from DefinedCrowd. These pre-collected datasets, annotated and validated by a global crowd, can be used to train baseline models or evaluate and benchmark current models. Browse our robust and dynamic catalog for datasets that suit your specific needs.

Watch the video to see how easy it is to find the speech data you need. Or skip the wait and browse the catalog now.

Browse Catalog

Take Advantage of the Benefits

Time to Market

Quickly build baseline models, improve existing models, or adapt live models for faster expansion.


Purchase pre-annotated and validated data for model training and testing.


Choose from a one-time download, or our discounted subscription options – whichever fits your needs.


Choose from multiple languages, domains and recording options.

Data For Every Use

Whether you’re building a prototype or minimum viable product, evaluating or benchmarking current models, building synthetic datasets, or simply in need of quality data fast, our continually updated library of datasets will help you achieve your AI goals quickly.

Quality is at Our Core

Your AI models require high-quality datasets, which is why quality underpins everything we do. Our primary quality metric for speech datasets is Word Error Rate (WER), which is below 5% for our scripted recordings and below 10% for our spontaneous recordings.
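
For readers who want to check the metric themselves, here is a minimal sketch of the standard Word Error Rate calculation: the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words. The function names and the threshold table are illustrative assumptions built from the 5% and 10% figures above. They are not DefinedCrowd's actual tooling, and the sketch does no text normalization such as lowercasing or punctuation removal.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Assumed thresholds, taken from the figures quoted in the text above.
WER_THRESHOLDS = {"scripted": 0.05, "spontaneous": 0.10}


def meets_threshold(wer: float, recording_type: str) -> bool:
    """Check a WER value against the assumed limit for a recording type."""
    return wer < WER_THRESHOLDS[recording_type]
```

For example, `word_error_rate("a b c d", "a x c d")` counts one substitution against four reference words, which would fall outside either threshold for such a short sample.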

For speech collections, we ensure quality by measuring accuracy levels in:

- Gender distribution
- Age distribution
- Noisy vs. silent
- Nativeness (accuracy of native speakers)
- Domain (accuracy in staying on topic)
- Segmentation (spontaneous collections)