Evaluate and Fine-Tune Your AI Models

Our suite of subjective evaluation tests can further tune your AI models with accurate human-in-the-loop workflows. If you care about the Quality of Experience for your virtual assistants, TTS models, machine translation engines or chatbots, partner with us today.

MOS Testing

Obtain a Mean Opinion Score for a single speech, text or image stimulus.
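As a rough illustration of how a Mean Opinion Score is typically derived, the sketch below averages a set of ratings and reports a 95% confidence half-width. The 1 to 5 rating scale, the normal-approximation interval, and the function name `mean_opinion_score` are assumptions made for this example, not a description of the DefinedCrowd pipeline.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for ratings on an assumed 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the assumed 1-5 scale")

    mos = statistics.fmean(ratings)
    if len(ratings) < 2:
        # A single rating gives no spread estimate.
        return mos, 0.0

    # Normal-approximation 95% interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    mos, half = mean_opinion_score([4, 5, 3, 4, 4])
    print(f"MOS = {mos:.2f} +/- {half:.2f}")
```

With only a handful of ratings the interval is wide, which is why MOS studies usually collect many judgments per stimulus.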

MUSHRA Testing

Evaluate and score multiple speech, text or image stimuli.

ABX Testing

Compare a randomly selected sample (X) against two reference outputs (A and B) to identify detectable differences.
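One common way to decide whether ABX results show a detectable difference is a one-sided binomial test against chance (50% correct). The sketch below assumes each trial is scored simply as correct or incorrect; the function name `abx_p_value` is hypothetical.

```python
from math import comb


def abx_p_value(correct, trials):
    """Probability of getting at least `correct` hits out of `trials` by guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Sum the upper tail of a Binomial(trials, 0.5) distribution.
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


if __name__ == "__main__":
    # 8 correct identifications out of 10 trials.
    print(abx_p_value(8, 10))
```

A small p-value suggests evaluators can tell the two outputs apart; a value near 1 suggests they are guessing.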

Pronunciation Validation

A more objective evaluation designed for text-to-speech output.

Evaluation Quality Guarantee

Given the subjective nature of opinion-based work, the same level of DefinedCrowd quality cannot be guaranteed. However, multiple quality metrics are used to ensure that evaluations, including those of the naturalness, trustworthiness, and likeability of your AI model, are performed and validated.

Spammer Control

IP location-based blocking of contributors.

Inter-Evaluator Checks

Agreement score calculations using Pearson Correlation Coefficient.
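To make the idea concrete, here is a minimal sketch of an agreement score based on the Pearson correlation coefficient. It assumes every evaluator rated the same items in the same order and compares each evaluator with the average of all the others; that comparison scheme and the names `pearson` and `agreement_scores` are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for constant input; report no agreement.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def agreement_scores(ratings_by_evaluator):
    """Correlate each evaluator with the mean rating of all other evaluators."""
    scores = {}
    for name, own in ratings_by_evaluator.items():
        others = [r for n, r in ratings_by_evaluator.items() if n != name]
        consensus = [sum(col) / len(col) for col in zip(*others)]
        scores[name] = pearson(own, consensus)
    return scores


if __name__ == "__main__":
    ratings = {
        "a": [1, 2, 3, 4, 5],
        "b": [1, 2, 3, 4, 5],
        "c": [2, 2, 3, 4, 4],
        "d": [5, 4, 3, 2, 1],  # rates in the opposite direction
    }
    for name, score in agreement_scores(ratings).items():
        print(name, round(score, 3))
```

An evaluator whose score is low or negative relative to the group is a candidate for review rather than automatic rejection, since disagreement can also reflect a legitimate minority opinion.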

Standard Deviation

Answer variety thresholds based on standard deviation and evaluations per contributor.
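A threshold of this kind could look like the sketch below, which flags contributors whose answers barely vary once they have submitted enough evaluations. The cut-off values (`min_evaluations=10`, `min_stdev=0.5`) and the function name `flag_low_variety` are placeholders chosen for illustration, not actual production settings.

```python
import statistics


def flag_low_variety(ratings_by_contributor, min_evaluations=10, min_stdev=0.5):
    """Return contributors whose rating spread falls below `min_stdev`.

    Contributors with fewer than `min_evaluations` ratings are skipped,
    because a small sample says little about answer variety.
    """
    flagged = []
    for contributor, ratings in ratings_by_contributor.items():
        if len(ratings) < min_evaluations:
            continue
        if statistics.pstdev(ratings) < min_stdev:
            flagged.append(contributor)
    return flagged


if __name__ == "__main__":
    data = {
        "steady": [3] * 12,                                # always the same answer
        "varied": [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4],
        "new": [3, 3, 3],                                  # too few evaluations
    }
    print(flag_low_variety(data))
```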

Real-Time Audits

Answer pattern detection for suspicious behavior.
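The page does not say which patterns are checked, so the following is only a sketch of two plausible heuristics: long runs of the same answer and short repeating cycles such as 1, 2, 3, 1, 2, 3. The function names and the default `max_period` and `min_repeats` values are assumptions.

```python
def longest_run(answers):
    """Length of the longest streak of identical consecutive answers."""
    best = run = 0
    prev = object()  # sentinel that never equals a real answer
    for answer in answers:
        run = run + 1 if answer == prev else 1
        prev = answer
        best = max(best, run)
    return best


def has_repeating_cycle(answers, max_period=4, min_repeats=3):
    """True if the whole sequence repeats with a period of 2..max_period.

    A constant sequence also counts, since it repeats with every period.
    """
    n = len(answers)
    for period in range(2, max_period + 1):
        if n < period * min_repeats:
            continue  # not enough data to trust this period
        if all(answers[i] == answers[i - period] for i in range(period, n)):
            return True
    return False


if __name__ == "__main__":
    print(longest_run([3, 3, 3, 4, 3]))
    print(has_repeating_cycle([1, 2, 3, 1, 2, 3, 1, 2, 3]))
```

In a live audit, checks like these would run on a sliding window of each contributor's most recent answers so that suspicious behavior is caught while the job is still in progress.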

Success Stories

Mastercard’s R&D Labs needed unique, multi-lingual text data that covered 20 designated payment scenarios in English and Spanish, and they needed it fast.

Keeping a nation’s lights on means constantly inspecting electricity poles for damage. EDP partnered with DefinedCrowd to improve Asset Performance Management processes.

With the rise of voice technology, this leading global provider of audio equipment wanted to develop an automatic speech recognition (ASR) model.

A global electronics maker came to DefinedCrowd with the goal of building more inclusive facial recognition models, requiring accurately annotated images with highly specific criteria.

Smart companies see the pile of unstructured text floating through the digital realm as a strategic goldmine of consumer insights.

A Fortune 500 Tech company needed comprehensive speech training data in French that accounted for a wide range of dialects, requiring diverse data in terms of age, gender and regional dialects.

A visionary Fortune 500 Tech company leveraged sentiment analysis models to dig beyond surface-level understandings to extract granular-level insights.