3 Challenges to Sourcing Training Data for AI Applications

Marina Lucier

How do you navigate the challenges of sourcing training data?

Every successful AI algorithm is built on a foundation of training data. But sourcing data that fits your needs and meets volume requirements is an immense undertaking.

In fact, 81% of executives said training AI with data is more difficult than expected, according to Alegion.

“Sourcing data on your own is extremely challenging at best, and potentially near impossible,” Kristin Simonini, VP of Product at Applause, said in a recent webinar, Sourcing Training Data for AI Applications. “You might not have access to the number of people you need and even if you do, you still need to ensure you’re getting quality data. If you do get quality data, you need a team to annotate and label the data. And even if you do all that, you need to think about diversity and being able to evolve over time. … It’s a massive challenge of logistics and overhead.”

Applause works with a variety of enterprises to train their algorithms to respond to real-world scenarios. For example, Applause helped train the BBC’s voice assistant, Beeb. Leveraging a global community of participants with a fully managed service, Applause is uniquely positioned to provide training data for AI algorithms and handle projects that companies simply can’t execute on their own.

Here are three key challenges of sourcing training data that Applause helps organizations tackle.

3 Challenges of Sourcing Training Data for AI Algorithms

Quantity of data sources

“Sourcing data at scale is not easy to do,” Simonini said. “It's not something that most organizations are really prepared to manage and to operate. It takes dedicated resources to deliver a project of the scale that we did for BBC.”

Enormous amounts of data are required to develop an effective algorithm. To train BBC’s voice assistant, the algorithm required over 100,000 voice utterances. Applause provided these utterances from 972 unique people throughout the United Kingdom.

As another example, Applause recently helped a company train its AI algorithm to read handwritten documents. The requirements were to deliver thousands of handwriting samples, each one unique and from a different person. For this project, the number of individuals was key: the requirements didn’t allow for 10 testers to contribute 100 handwriting samples each, because the algorithm needed unique samples from a broad spectrum of individuals.

“Since we needed such a high volume of participants, we had to recruit people from all over,” Simonini shared. “Obviously, the size of our crowd was critical to our success. We sourced over 1,000 individuals that were willing to provide these handwritten documents and met the demand for diverse content.”

Most organizations simply don’t have access to this many individuals to contribute data. You can ask employees to get samples from friends and family, but that is likely to be ineffective and a project management nightmare.
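The "one sample per person" requirement from the handwriting project can be checked mechanically once each submission is tagged with its contributor. The following is a minimal sketch, not Applause's actual tooling: the `contributor_id` field, the file names, and the `contributors_over_limit` helper are all hypothetical names chosen for illustration.

```python
from collections import Counter

def contributors_over_limit(samples, max_per_person=1):
    """Return contributor IDs that submitted more than max_per_person samples."""
    counts = Counter(sample["contributor_id"] for sample in samples)
    return {cid: n for cid, n in counts.items() if n > max_per_person}

# Hypothetical submissions; the IDs and file names are made up for illustration.
samples = [
    {"contributor_id": "p001", "file": "a.png"},
    {"contributor_id": "p002", "file": "b.png"},
    {"contributor_id": "p001", "file": "c.png"},
]

print(contributors_over_limit(samples))  # {'p001': 2}
```

A check like this only catches repeat submissions under the same ID. Confirming that two IDs are really two different people is a recruiting and verification problem, not a counting one.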

Quality of data

So what makes quality training data? Let’s look again at the handwriting samples example.

In this case, the artifacts must be legible, easily accessible and meet a host of other requirements based on individual project goals. More specifically, there couldn’t be deformities on the page or even a single folded margin in the middle of the page. When users scanned the documents, they needed good lighting conditions or the ability to use flash in dark settings.

“There were really specific asks like that,” Simonini said. “There were a lot of things that needed to be tracked and monitored very carefully.”

Every individual artifact must be verified for quality to ensure that your algorithm will work as intended. Again, this process consumes considerable resources. While organizations could do this internally, it would create massive overhead and be inefficient.

In the handwriting samples example, if the organization took responsibility for reviewing each document to confirm its quality, and for asking participants for new samples when necessary, the work could take months and create a logistical nightmare.
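To make the per-artifact review concrete, here is a minimal sketch of a checklist that accepts or rejects each scan and records why, so that replacement samples can be requested. It assumes a reviewer has already recorded three flags per scan (legibility, folds, lighting), loosely based on the requirements described above. The `Scan` class, its field names, and both functions are hypothetical and do not describe the actual project workflow.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    sample_id: str
    legible: bool      # assumed to be judged by a human reviewer
    page_folded: bool  # True if the page shows a fold or other deformity
    well_lit: bool     # True if lighting (or flash) was adequate

def failed_checks(scan):
    """List the reasons a scan fails review; an empty list means it passes."""
    reasons = []
    if not scan.legible:
        reasons.append("illegible")
    if scan.page_folded:
        reasons.append("folded page")
    if not scan.well_lit:
        reasons.append("poor lighting")
    return reasons

def review(scans):
    """Split scans into accepted IDs and a map of rejected ID -> reasons."""
    accepted, rejected = [], {}
    for scan in scans:
        reasons = failed_checks(scan)
        if reasons:
            rejected[scan.sample_id] = reasons
        else:
            accepted.append(scan.sample_id)
    return accepted, rejected

# Hypothetical review results for two scans.
scans = [
    Scan("s1", legible=True, page_folded=False, well_lit=True),
    Scan("s2", legible=True, page_folded=True, well_lit=False),
]
accepted, rejected = review(scans)
print(accepted)  # ['s1']
print(rejected)  # {'s2': ['folded page', 'poor lighting']}
```

Keeping the rejection reasons alongside each ID is what makes the follow-up request to the participant specific, for example "rescan without the fold and with better lighting."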

Diversity of data

As if finding mountains of quality data weren’t hard enough, your team also needs a diverse range of artifacts to develop an accurate algorithm. Without diversity in the training data, the algorithm won’t be able to recognize a broad range of possibilities, ultimately making the algorithm ineffective.

If you're building an AI algorithm, you don't want to rely on one single person to provide the artifacts that you're going to use to train the algorithm. To properly train an algorithm, you need different types of data and inputs, including geographical data, demographic information, types of documents, etc.

“It won't lead to strong output and one that will service the needs of your, no doubt, diverse customer base,” Simonini said.

The Applause community provides access to a global pool of participants. When organizations work with Applause, they are able to select hyper-specific demographics, using filters such as gender, race, native language, location, skill set and many others.
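One simple way to spot gaps in diversity is to measure each group's share of the participant pool for a given attribute and flag any group that falls below a chosen threshold. The sketch below is an illustration under assumptions: the attribute names, the sample participants, the 30% threshold, and the `underrepresented` helper are all hypothetical.

```python
from collections import Counter

def underrepresented(participants, attribute, min_share):
    """Return groups whose share of participants is below min_share."""
    counts = Counter(p[attribute] for p in participants)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items() if n / total < min_share}

# Hypothetical participant pool; values are made up for illustration.
participants = [
    {"native_language": "English", "location": "London"},
    {"native_language": "English", "location": "Leeds"},
    {"native_language": "English", "location": "London"},
    {"native_language": "Welsh", "location": "Cardiff"},
]

print(underrepresented(participants, "native_language", min_share=0.3))
# {'Welsh': 0.25}
```

This only flags groups that appear at least once in the pool. A group that is missing entirely will not show up, so the expected groups should also be compared against a target list defined by the project.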

Be prepared to evolve with your project

Remember: No project ends up exactly how it started. Needs shift over time, and you have to shift, change footing, get new data points, and source new testers or resources to input the information as the project evolves.

“If you’re embarking on a project like this,” Simonini said, “considerations about how you’re going to manage that data input, data quality process, that’s certainly something that we at Applause are happy and in position to assist with.”
