DefinedCrowd

Democratizing Data Access via NVIDIA NGC

As a key step in democratizing access to data, DefinedCrowd will provide dataset samples through the NVIDIA NGC catalog, a GPU-optimized hub for AI and HPC containers, pre-trained models, and SDKs that simplifies and accelerates end-to-end workflows. Datasets can be used to train models with libraries from the NVIDIA Jarvis application framework; with the NVIDIA Transfer Learning Toolkit, which enables developers to build production-quality models faster with no coding required; and with NVIDIA NeMo, a Python toolkit for building, training, and fine-tuning GPU-accelerated conversational AI models. This collaboration allows researchers and developers to build high-quality, state-of-the-art conversational AI models.

“By working with DefinedCrowd, we’re providing NVIDIA Jarvis and NeMo users with sample datasets to build and accelerate their models, all within the NGC environment,” said Richard Kerris, head of developer relations at NVIDIA.

Build World-Class AI Models with NVIDIA's NeMo & DefinedCrowd

Speech is the most natural form of human communication. So, it’s not surprising that we’ve always wanted to interact with and command machines by voice.

However, for conversational AI to provide a seamless, natural, and human-like experience, it needs to be trained on large amounts of data representative of the problem the model is trying to solve.

The difficulty for machine learning teams is the scarcity of this high-quality, domain-specific data.

Companies are trying to solve this problem and accelerate the widespread adoption of conversational AI with innovative solutions that guarantee the scalability and internationality of models.

NVIDIA and DefinedCrowd are two such companies. By providing machine learning engineers with a model-building toolkit and high-quality training data, respectively, NVIDIA and DefinedCrowd make it simple, easy, and quick to create world-class AI.

Introducing DefinedCrowd, a One-Stop-Shop for AI Training Data

Our core business is providing high-quality AI training data to companies building world-class AI solutions.

Our customers can access this data through DefinedData, an online marketplace of off-the-shelf AI training data available in multiple languages, domains, and recording types.

If you can’t find what you’re looking for in DefinedData, our workflows can serve as standalone or end-to-end data services to build any speech- or text-enabled AI architecture from scratch, to improve solutions already developed, or to evaluate models in production, all with the DefinedCrowd Quality Guarantee.

The NeMo Toolkit Offering: Create Conversational AI Applications the Easy Way

NVIDIA NeMo is a toolkit built by NVIDIA for creating conversational AI applications. This toolkit includes collections of pre-trained modules for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS), enabling researchers and data scientists to easily compose complex neural network architectures and focus on designing their applications.
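To give a concrete feel for the toolkit, here is a short sketch of our own (assuming nemo_toolkit[all] is already installed, as covered in Step 1 below) that lists the pre-trained ASR checkpoints NeMo can download and instantiates one of them:

# List the pre-trained ASR checkpoints NeMo knows how to download (our own example)
import nemo.collections.asr as nemo_asr

for model_info in nemo_asr.models.EncDecCTCModel.list_available_models():
    print(model_info.pretrained_model_name)

# Instantiate one of them (the checkpoint is downloaded from NGC on first use)
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")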

The NeMo and DefinedCrowd Integration: How It Is Done

In this tutorial, we will demonstrate how to connect DefinedCrowd Speech Workflows to train and improve an ASR model using NVIDIA NeMo. The code can also be accessed via this Google Colab link.

Step 1: Install NeMo Toolkit and Dependencies

# First, let's install NeMo Toolkit and dependencies to run this notebook
!apt-get install -y libsndfile1 ffmpeg
!pip install Cython

## Install NeMo dependencies in the correct versions
!pip install torchtext==0.8.0 torch==1.7.1 pytorch-lightning==1.2.2

## Install NeMo
!python -m pip install nemo_toolkit[all]==1.0.0b3
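As an optional sanity check (our own addition, not part of the original notebook), you can confirm that the pinned versions were installed correctly:

# Optional: confirm the expected versions are importable
import torch, pytorch_lightning, nemo
print(torch.__version__, pytorch_lightning.__version__, nemo.__version__)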

 

Step 2: Obtaining Data Using the DefinedCrowd API

In this section, we will demonstrate how to connect to the DefinedCrowd API to retrieve collected speech data.

For more information, visit https://developers.definedcrowd.com/

# For the demo, we will use a sandbox environment
auth_url = "https://sandbox-auth.definedcrowd.com"
api_url = "https://sandbox-api.definedcrowd.com"

# These variables should be obtained at the DefinedCrowd Enterprise Portal for your account.
client_id = "<INSERT YOUR CLIENT ID HERE>"
client_secret = "<INSERT YOUR CLIENT SECRET HERE>"
project_id = "<INSERT YOUR PROJECT ID HERE>"

 

Authentication

import requests, json

payload = {
    "client_id": client_id,
    "client_secret": client_secret,
    "grant_type": "client_credentials",
    "scope": "PublicAPIv2",
}
files = []
headers = {}

# Request the OAuth 2.0 access token
response = requests.request(
    "POST", f"{auth_url}/connect/token", headers=headers, data=payload, files=files
)
if response.status_code == 200:
    print("Authentication Success!")
    access_token = response.json()["access_token"]
else:
    print("Authentication Failed")

Authentication Success!

Get List of Deliverables

# GET /projects/{project-id}/deliverables
headers = {"Authorization": "Bearer " + access_token}
response = requests.request(
    "GET", f"{api_url}/projects/{project_id}/deliverables", headers=headers
)

if response.status_code == 200:
    # Pretty print the response
    print(json.dumps(response.json(), indent=4))

    # Get the first deliverable id
    deliverable_id = response.json()[0]["id"]

[
    {
        "projectId": "eb324e45-c4f9-41e7-b5cf-655aa693ae75",
        "id": "258f9e15-2937-4846-b9c3-3ae1164b7364",
        "type": "Flat",
        "fileName": "data_Flat_eb324e45-c4f9-41e7-b5cf-655aa693ae75_258f9e15-2937-4846-b9c3-3ae1164b7364_2021-03-22-14-34-37.zip",
        "createdTimestamp": "2021-03-22T14:34:37.8037259",
        "isPartial": false,
        "downloadCount": 2,
        "status": "Downloaded"
    }
]

Download the Final Deliverable for a Speech Data Collection

# The name to give to the downloaded deliverable file
filename = "scripted_monologue_en_GB.zip"

# GET /projects/{project-id}/deliverables/{deliverable-id}/download
headers = {"Authorization": "Bearer " + access_token}
response = requests.request(
    "GET",
    f"{api_url}/projects/{project_id}/deliverables/{deliverable_id}/download/",
    headers=headers,
)

if response.status_code == 200:
    # save the deliverable file
    with open(filename, "wb") as fp:
        fp.write(response.content)
    print("Deliverable file saved with success!")

 

Deliverable file saved successfully!

!unzip  scripted_monologue_en_GB.zip &> /dev/null
!rm -f en-gb_single-scripted_Dataset.zip

 

Step 3: Analyze the Speech Dataset

In this section, we will analyze the data received from DefinedCrowd. The dataset consists of scripted speech recordings collected through the DefinedCrowd Neevo platform from several speakers in the UK (DefinedCrowd crowd members).

Each row of the dataset contains information about the speech prompt, the crowd member, the device used, and the recording itself. The fields included in this delivery are:

Recording:

  • RecordingId
  • PromptId
  • Prompt

Audio File:

  • RelativeFileName
  • Duration
  • SampleRate
  • BitDepth
  • AudioCommunicationBand
  • RecordingEnvironment

Crowd Member:

  • SpeakerId
  • Gender
  • Age
  • Accent
  • LivingCountry

Recording Device:

  • Manufacturer
  • DeviceType
  • Domain

This data can be used for multiple purposes, but in this tutorial we will use it to improve an existing ASR model for British English speakers.

import pandas as pd

# let's look into the metadata file
dataset = pd.read_csv("metadata.tsv", sep="\t", index_col=[0])
# Let's check the data for the first row
dataset.iloc[0]

# How many rows do I have?
len(dataset)

50000
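Before listening to individual samples, it can help to profile the delivery as a whole. The following sketch is our own addition and assumes the metadata columns listed above (Duration in seconds, SpeakerId, Gender) are present under those exact names:

# Rough dataset profile (assumes the Duration column is in seconds)
print(f"Total audio: {dataset['Duration'].sum() / 3600:.1f} hours")
print(f"Unique speakers: {dataset['SpeakerId'].nunique()}")
print(dataset["Gender"].value_counts())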

# Let's check some examples from our dataset
import librosa
import IPython.display as ipd

for index, row in dataset.sample(4, random_state=1).iterrows():

    # Use the sampled row directly instead of re-indexing the dataframe
    print(f"Prompt: {row.Prompt}")
    audio_file = row.RelativeFileName

    # Load and listen to the audio file
    audio, sample_rate = librosa.load(audio_file)
    ipd.display(ipd.Audio(audio, rate=sample_rate))

 

For audio samples, please access this Google Colab link.

Step 4: Data Preparation

After downloading the speech data from the DefinedCrowd API, we need to adapt it to the format expected by NeMo for ASR training. To do this, we create manifests for our training and evaluation data, including each audio file’s metadata.

NeMo requires that we adapt our data to a particular manifest format. Each line corresponds to one audio sample, so the line count equals the number of samples represented by the manifest. A line must contain the path to an audio file, the corresponding transcript, and the audio sample duration. For example, here is what one line might look like in a NeMo-compatible manifest:

{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "this is a nemo tutorial"}

For the creation of the manifest, we will also standardize the transcripts.

import os

# Function to build a NeMo-compatible manifest from the metadata dataframe
def build_manifest(dataframe, manifest_path):
    with open(manifest_path, "w") as fout:
        for index, row in dataframe.iterrows():
            transcript = row["Prompt"]
            # Our model will use lowercased data for training/testing
            transcript = transcript.lower()
            # Remove linguistic marks (they are not necessary for this demo)
            transcript = (
                transcript.replace("<s>", "")
                .replace("</s>", "")
                .replace("[b_s/]", "")
                .replace("[uni/]", "")
                .replace("[v_n/]", "")
                .replace("[filler/]", "")
                .replace('"', "")
                .replace("[n_s/]", "")
            )

            audio_path = row["RelativeFileName"]
            # Skip rows whose audio file is missing
            if not os.path.exists(audio_path):
                continue

            # Get the audio duration; skip the sample if the file cannot be read
            try:
                duration = librosa.core.get_duration(filename=audio_path)
            except Exception as e:
                print("An error occurred: ", e)
                continue

            # Write the metadata to the manifest
            metadata = {
                "audio_filepath": audio_path,
                "duration": duration,
                "text": transcript,
            }
            json.dump(metadata, fout)
            fout.write("\n")

 

Step 5: Train and Test Splits

To test the quality of our model, we need to reserve some data for testing. We will evaluate the model’s performance on this held-out data.

import json
from sklearn.model_selection import train_test_split

# Split the dataset: 10% for testing and 90% for training
trainset, testset = train_test_split(dataset, test_size=0.1, random_state=1)

# Build the manifests
build_manifest(trainset, "train_manifest.json")
build_manifest(testset, "test_manifest.json")
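As a quick sanity check (our own addition), we can read the manifests back and confirm how many usable samples ended up in each split:

# Verify the manifests: count samples and total audio duration per split
for manifest in ["train_manifest.json", "test_manifest.json"]:
    with open(manifest) as f:
        entries = [json.loads(line) for line in f]
    total_hours = sum(entry["duration"] for entry in entries) / 3600
    print(f"{manifest}: {len(entries)} samples, {total_hours:.2f} hours of audio")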

 

Step 6: Model Configuration

In this tutorial, we will use the QuartzNet15x5 model as the base model and fine-tune it with our data. We want to improve recognition on our dataset, so we will benchmark performance on the base model first and then on the fine-tuned version.

Some of the following functions were adapted from the NeMo ASR tutorial, which can be found at https://github.com/NVIDIA/NeMo.

# Let's import NeMo and its ASR collection
import torch
import nemo
import nemo.collections.asr as nemo_asr

import logging
from nemo.utils import _Logger

# Set the NeMo log level
logger = _Logger()
logger.set_verbosity(logging.ERROR)

Step 7: Setting Training Parameters

For training, NeMo uses a Python dictionary as the data structure that holds all the parameters. More information about it can be found in the NeMo ASR Config User Guide.

For this tutorial, we will load a pre-existing file with the standard ASR configuration and change only the necessary fields.

## Download the config we'll use in this example
!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/conf/config.yaml &> /dev/null
# --- Config Information ---#
from ruamel.yaml import YAML

config_path = "./configs/config.yaml"

yaml = YAML(typ="safe")
with open(config_path) as f:
    params = yaml.load(f)
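If you want to see what was just loaded, you can inspect the dictionary directly. This is an optional step we added; the exact keys depend on the version of config.yaml that was downloaded:

# Optional: inspect the loaded configuration (keys depend on the config.yaml version)
print(list(params.keys()))            # e.g. name, model, trainer, exp_manager
print(list(params["model"].keys()))   # e.g. train_ds, validation_ds, preprocessor, encoder, decoder, optim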

 

Step 8: The Base Model

For our ASR model, we will use a pre-trained QuartzNet15x5 model from NVIDIA’s NGC cloud. (List of pre-trained models from NeMo)

Description of the pre-trained model: QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other.

# This line will download pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud and instantiate it for you
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En", strict=False)
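Before measuring WER on the whole test set, it can be instructive to transcribe a single recording from the delivery and compare the output against its prompt. This optional check is our own addition; it reuses the metadata columns from Step 3 (Prompt, RelativeFileName) and NeMo’s transcribe helper:

# Optional qualitative check: transcribe one recording with the base model
sample_row = dataset.sample(1, random_state=1).iloc[0]
hypothesis = quartznet.transcribe(paths2audio_files=[sample_row.RelativeFileName])[0]
print("Reference:", sample_row.Prompt)
print("Hypothesis:", hypothesis)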

 

Step 9: Base Model Performance

The Word Error Rate (WER) is a valuable measurement tool for comparing different ASR models and evaluating improvements within one system. To obtain the final results, we will assess how the base model performs on the test set.
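To make the metric concrete, here is a tiny worked example of how WER is computed (illustrative only, with the errors counted by hand rather than by NeMo’s helper):

# WER = (substitutions + deletions + insertions) / number of reference words
reference = "this is a nemo tutorial".split()
hypothesis = "this is nemo tutorials".split()
# 1 deletion ("a") + 1 substitution ("tutorial" -> "tutorials") = 2 errors
wer = 2 / len(reference)
print(f"WER = {wer * 100:.1f}%")  # 40.0%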

# Let's configure our model parameters for testing

# Parameters for training, validation, and testing are specified using the 
# train_ds, validation_ds, and test_ds sections of your configuration file

# Bigger batch-size = bigger throughput
params["model"]["validation_ds"]["batch_size"] = 8

# Setup the test data loader and make sure the model is on GPU
params["model"]["validation_ds"]["manifest_filepath"] = "test_manifest.json"
quartznet.setup_test_data(test_data_config=params["model"]["validation_ds"])

# Comment this line if you don't want to use GPU acceleration
_ = quartznet.cuda()
# We will compute the Word Error Rate (WER) between the model's hypotheses and the reference transcripts.

wer_numerators = []
wer_denominators = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` will give us:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
with torch.no_grad():
    for test_batch in quartznet.test_dataloader():
        input_signal, input_signal_length, targets, targets_lengths = [x.cuda() for x in test_batch]  
        log_probs, encoded_len, greedy_predictions = quartznet(
            input_signal=input_signal, 
            input_signal_length=input_signal_length
        )
        # Notice the model has a helper object to compute WER
        quartznet._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_numerator, wer_denominator = quartznet._wer.compute()
        wer_numerators.append(wer_numerator.detach().cpu().numpy())
        wer_denominators.append(wer_denominator.detach().cpu().numpy())
# We need to sum all numerators and denominators first. Then divide.
print(f"WER = {sum(wer_numerators)/sum(wer_denominators)*100:.2f}%")

WER = 39.70%

Step 10: Model Fine-Tuning

For this tutorial, the base model reached a WER of 39.7%, which is not great. Let’s see whether providing some data from the same domain and dialect can improve our ASR model.

For simplification, we will train for only 1 epoch using DefinedCrowd’s data.

import pytorch_lightning as pl
from omegaconf import DictConfig
import copy
# Before training, we need to provide the train manifest
params["model"]["train_ds"]["manifest_filepath"] = "train_manifest.json"

# Use a smaller learning rate for fine-tuning
new_opt = copy.deepcopy(params["model"]["optim"])
new_opt["lr"] = 0.001
quartznet.setup_optimization(optim_config=DictConfig(new_opt))

# Batch size will depend on the GPU memory available
params["model"]["train_ds"]["batch_size"] = 8

# Point to the data we'll use for fine-tuning as the training set
quartznet.setup_training_data(train_data_config=params["model"]["train_ds"])

# clean torch cache
torch.cuda.empty_cache()

# And now we can create a PyTorch Lightning trainer.
trainer = pl.Trainer(gpus=1, max_epochs=1)

# And the fit function will start the training
trainer.fit(quartznet)

 

Comparing Model Performance

Let’s compare the base model’s performance with that of the fine-tuned model trained on the additional data.

# Let's configure our model parameters for testing
params["model"]["validation_ds"]["batch_size"] = 8

# Setup the test data loader and make sure the model is on GPU
params["model"]["validation_ds"]["manifest_filepath"] = "test_manifest.json"
quartznet.setup_test_data(test_data_config=params["model"]["validation_ds"])
_ = quartznet.cuda()
# We will compute the Word Error Rate (WER) between the model's hypotheses and the reference transcripts.

wer_numerators = []
wer_denominators = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` will give us:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
with torch.no_grad():
    for test_batch in quartznet.test_dataloader():
        input_signal, input_signal_length, targets, targets_lengths = [x.cuda() for x in test_batch]
        log_probs, encoded_len, greedy_predictions = quartznet(
            input_signal=input_signal, 
            input_signal_length=input_signal_length
        )
        # Notice the model has a helper object to compute WER
        quartznet._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_numerator, wer_denominator = quartznet._wer.compute()
        wer_numerators.append(wer_numerator.detach().cpu().numpy())
        wer_denominators.append(wer_denominator.detach().cpu().numpy())

# We need to sum all numerators and denominators first. Then divide.
print(f"WER = {sum(wer_numerators)/sum(wer_denominators)*100:.2f}%")

WER = 24.36%

After fine-tuning the ASR model for only one epoch on DefinedCrowd’s data, we achieved a WER of 24.36%, a clear improvement over the base model’s initial 39.7%. For better results, consider training for more epochs.
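If the fine-tuned model is worth keeping, it can be saved and reloaded later without retraining. This optional step is our own addition and uses NeMo’s standard save_to/restore_from methods; the file name is arbitrary:

# Persist the fine-tuned model for later use
quartznet.save_to("quartznet15x5_finetuned_en_GB.nemo")
# Later, reload it with:
# quartznet = nemo_asr.models.EncDecCTCModel.restore_from("quartznet15x5_finetuned_en_GB.nemo")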

Conclusion

In this tutorial, we demonstrated how to load speech data collected by DefinedCrowd and how to use it to train and measure the performance of an automatic speech recognition (ASR) model. We hope we have shown how easy it is to create world-class AI solutions with NVIDIA and DefinedCrowd. To find out more, please follow this link.