DefinedData’s Online Marketplace: Promoting Transparent and Bias-aware Datasets

DefinedCrowd’s newest product offering gives everyone greater access to high-quality, transparent, diverse training data through NVIDIA’s NGC container hub.

More than a year into the global pandemic, the world as we know it has fundamentally changed. As global lockdowns and remote work have become the new norm, businesses are accelerating their adoption of artificial intelligence to keep up in a world demanding more agility and efficiency.  

In 2020, an IDC study predicted that the AI market would be worth over $300 billion by 2024. In February 2021, that prediction was revised, and the market is now expected to break the $500 billion mark by 2024.

A McKinsey survey found that responses to the crisis have sped up the adoption of digital technology by several years, with 61% of high-performance companies increasing investment in AI. Businesses are realizing that slow progress is not an option, and that to connect with customers in this new world, they need to accelerate their adoption of technologies like artificial intelligence.

Demand for Data Skyrockets 

The growing adoption of artificial intelligence has dramatically increased the demand for high-quality, relevant training datasets. However, not all data is the same, and if the industry wants to move forward in a responsible and ethical way, it’s essential to train AI models with bias-aware, top-quality training data.

“As data becomes the new code, transparency is vital,” says Dr. Braga. “Training AI models with high-quality, bias-aware training data is the only way we can ensure a future where machines and humans interact seamlessly, without biases skewing these interactions. It’s crucial that all engineers have constant access to ethically sourced, bias-free data to make the most out of their technology, and, in this way, keep innovation flowing.”

Countering bias with transparent, diverse data 

To meet the increasing demand for high-quality, transparent data, the DefinedData catalog has been expanded to provide machine learning engineers with unprecedented levels of metadata.  

The DefinedData catalog lets you access audio samples and basic information about each dataset, including the number of speakers, locale, language, country, gender, and accents. From there, we break it down even further, giving you visibility into the phonetic, age, gender, and accent distributions through visual breakdowns and detailed statistics.

“We know the risks low-quality datasets pose not only to customer experience, but also to humanity as a whole,” says Martin Andreas Stein, GM of DefinedData. “For models to be free from bias they need to be trained on representative data, which is why data transparency is essential. We are exceptionally proud to offer machine learning teams this level of metadata.”

An Open Marketplace to Sell Your Data 

To expand the offering even further, DefinedData has opened its marketplace to third-party vendors – helping them monetize their datasets by reaching a broad network of data scientists, academics, and other AI professionals.  
The marketplace will be open to speech, NLP, image, video, and translation datasets to fuel a range of models. Find more information about becoming a vendor here.

Subscriptions for easy access to data 

In the name of democratizing data, our subscription service, launching in April, will give those who need large amounts of data an exciting new way to get it. Subscribers will be able to access fresh, up-to-date, vetted datasets to meet the demands of their models, whether speech or NLP. The addition of computer vision datasets later in the year will add to the variety, ensuring that users can build, test, and train multimodal models.

“Companies constantly need to engage a long tail of data in order to grow in new sectors, and data scientists need the raw material in order to address these issues as data science becomes more democratic each day,” says Dr. Christopher Shulby, Director of Machine Learning at DefinedCrowd. “This offering will allow data scientists to keep their models relevant in a continually evolving world.”

Subscriptions will come in a variety of plans; find out more about the various options here.

Democratizing Data Access via NVIDIA NGC 

As a key step in democratizing access to data, DefinedCrowd will provide dataset samples through the NVIDIA NGC catalog, a GPU-optimized hub for AI and HPC containers, pre-trained models, and SDKs that simplifies and accelerates end-to-end workflows. The datasets can be used to train models with libraries within the NVIDIA Jarvis application framework; with the NVIDIA Transfer Learning Toolkit, which enables developers to build production-quality models faster with no coding required; and with NVIDIA NeMo, a Python toolkit for building, training, and fine-tuning GPU-accelerated conversational AI models. This collaboration allows researchers and developers to build high-quality, state-of-the-art conversational AI models.

“By working with DefinedCrowd, we’re providing NVIDIA Jarvis and NeMo users with sample datasets to build and accelerate their models, all within the NGC environment,” said Richard Kerris, head of developer relations at NVIDIA.  

A new way forward in sourcing transparent, high-quality training data 

The expansion of DefinedData is just the start. At DefinedCrowd, we continue to recognize the role we can play in ensuring fair, ethical access to data for everyone, not just those who can pay big money for it. The democratization of data is upon us, and unprecedented transparency is the way forward.

Browse our catalog of high-quality, transparent datasets here.