Browse DefinedCrowd's AI Datasets Catalog

Search our dataset catalog, choose the datasets that meet your requirements, and get your data samples today!


AI Models

Training a baseline model, testing and evaluating existing ML models, and benchmarking third-party applications.

Delivery Time



Word Error Rate (WER) less than 5% on most datasets.


Pre-collected, off-the-shelf datasets available in a wide range of languages, accents and domains.

Pricing Model

Subscription or one-time purchase

Output Options

  • Monologue speech training data
  • Dialogue speech gold sets
  • 10 different languages
  • 5 different industry domains
  • Balanced and wide range of demographics represented
  • Specialized grammars

Advantages of DefinedCrowd's AI Datasets

Our AI datasets are not only available for immediate use, they are also built to the same level of quality for which DefinedCrowd has become known. They are versatile enough for a variety of training and testing applications, and are available in the specific languages and domains you need.

Time to Market

Quickly build and improve ML models, or adapt live models for faster expansion.


Purchase pre-collected, pre-annotated and validated datasets for model training and testing.


Choose from a one-time download, or our discounted subscription options – whichever fits your needs.


Speech datasets available in multiple languages, domains and recording options.

Datasets Quality Guaranteed

Our multi-faceted approach to data quality ensures that you reduce only your time to market, not your quality. Here are several of the metrics we use for quality control.


Word Error Rate (WER)

Our primary quality metric. Most datasets have a word error rate below 5%.
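For readers who want to check a transcript against this threshold, here is a minimal sketch of the standard WER definition: the number of word substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, divided by the number of words in the reference. The page does not say how DefinedCrowd computes or normalizes WER, so the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace with no case or punctuation
    normalization. A production pipeline may normalize text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Under this definition, a dataset meets the 5% figure when its transcripts contain fewer than one word error per 20 reference words on average.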


Accuracy of audio to a native speaker

Domain Accuracy

Context is specific to a domain

Gender and Age Distribution

Minimizes bias in the dataset