Quality AI Datasets and How to Use Them

Jay Selig

Sourcing quality data is one thing, but knowing how to use it is the real challenge.

Artificial intelligence is a data-driven technology, but that does not mean all datasets are created equal. While AI does need a steady diet of data, its effectiveness depends on the quality of the data provided.

In other words, bad data will likely cause your algorithm to make bad, biased or flawed decisions.

What Makes Quality Data?

Bias can creep into an AI model in many ways. It’s a bit like the flu: there is no outright cure, but there are measures you can take to limit the negative effects. If you choose not to take proper precautions, you risk devastating consequences. The best place to start in your battle against bias is with quality training data.

Every design choice, not just training data, can bring unconscious bias.
Francesca Rossi, A.I. Ethics Chief, IBM Research

Quality data has a variety of characteristics. Below are a few of the most important ones:

Diversity of data sources

When data isn’t diverse, AI models tend to show bias. One powerful example of this comes from Google image search. Years ago, Google made headlines when its search results for “hands” showed only images of white hands. Searching for “black hands” yielded problematic results, such as hands that were handcuffed. The dataset in this case wasn’t diverse enough, and Google’s algorithm-based search results reflected that lack of diversity.

The lesson here is that using data from a variety of sources can limit bias and provide a more holistic (and wholesome) experience.
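One rough way to spot a lack of diversity before training is to count how much of the dataset each source or group contributes. The sketch below is only an illustration: the `source` field, the sample records and the 20% threshold are all assumptions made for the example, not part of any real dataset or standard.

```python
from collections import Counter

def group_shares(records, key):
    """Return each group's share of the dataset, keyed by the value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def underrepresented(shares, min_share):
    """List groups whose share falls below `min_share`, sorted by name."""
    return sorted(group for group, share in shares.items() if share < min_share)

# Hypothetical image metadata; `source` is an illustrative field name.
records = (
    [{"source": "stock_photos"}] * 8
    + [{"source": "user_uploads"}] * 1
    + [{"source": "news_archive"}] * 1
)

shares = group_shares(records, "source")
flagged = underrepresented(shares, min_share=0.2)  # assumed threshold
print(shares)
print(flagged)
```

A check like this only counts what is already in the data. It cannot tell you which groups are missing altogether, so it complements a human review of your sources rather than replacing it.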

Clean data

Irrelevant data, data with missing values, or data with typos won’t help anyone or anything — especially an AI model that’s trying to learn. Your training data is foundational to the model you build. It’s the first information your AI will see, so it should be clean and clear, and you should remove any corrupted information.
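As a minimal sketch of what basic cleaning can look like, the function below strips stray whitespace, drops rows with missing required values and removes exact duplicates. The field names and sample rows are made up for illustration, and it does not attempt to catch typos or irrelevant rows, which usually need domain-specific rules or human review.

```python
def clean_records(records, required_fields):
    """Drop rows with missing required values and exact duplicates.

    Whitespace around string values is stripped first, so "cat " and "cat"
    are treated as the same value.
    """
    seen = set()
    cleaned = []
    for record in records:
        normalized = {
            field: value.strip() if isinstance(value, str) else value
            for field, value in record.items()
        }
        # Treat None and empty strings as missing values.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # Keys are unique, so sorting the items gives a stable fingerprint.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned

# Hypothetical raw rows; `text` and `label` are illustrative field names.
raw = [
    {"text": " a photo of a cat ", "label": "cat"},
    {"text": "a photo of a cat", "label": "cat"},  # duplicate after stripping
    {"text": "a photo of a dog", "label": ""},     # missing label
    {"text": "a photo of a dog", "label": "dog"},
]
cleaned = clean_records(raw, required_fields=["text", "label"])
print(cleaned)
```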

Clearly annotated inputs

Labeling can be tedious work, and is often done best by human beings, but it’s necessary. Without clear, relevant labels, your AI is unlikely to learn how to make the correlations you ask for. Good labels give your model the information it needs to make correlations in the real world, where inputs aren’t labeled.

Training and Testing Your Model

When training and testing your AI model, you should break down your data into three separate and distinct datasets: training data, validation data, and testing data.

Training data

The training dataset is the sample of data used to fit the model. In other words, the training data is what the model learns from. To make that possible, training data is often presented in pairs: an input and the output the model is expected to produce for it.

Training data is the first set of data your model is exposed to. On each pass over the training data, the model adjusts its own parameters, or weights, so that its outputs move closer to the expected ones.
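To make the idea of learning from input and output pairs concrete, here is a deliberately tiny sketch: a model with a single weight, fitted by gradient descent. The data, the learning rate and the number of passes are assumptions chosen so the example is easy to follow; real models have many more weights and are usually trained with a library.

```python
def train(pairs, epochs=20, learning_rate=0.05):
    """Fit y = weight * x to (input, output) pairs with gradient descent."""
    weight = 0.0
    history = []
    for _ in range(epochs):
        # Mean squared error over the training pairs with the current weight.
        loss = sum((weight * x - y) ** 2 for x, y in pairs) / len(pairs)
        history.append(loss)
        # Gradient of that loss with respect to the weight.
        gradient = sum(2 * (weight * x - y) * x for x, y in pairs) / len(pairs)
        weight -= learning_rate * gradient
    return weight, history

# Made-up training pairs where the expected output is twice the input.
pairs = [(1, 2), (2, 4), (3, 6)]
weight, history = train(pairs)
print(round(weight, 3))
print(history[0], history[-1])
```

Each pass nudges the weight toward 2, and the recorded loss shrinks as it does, which is the "learning" the training data drives.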

Because the training dataset does the heavy lifting of teaching your algorithm, it’s also the biggest dataset, making up between 60% and 80% of your total data.

Validation data

The validation dataset is a sample of data held back from training your model. This dataset provides an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. In more basic terms, validation data is a portion of your labeled data that is set aside and not used for training, and it helps determine whether the model is accurate as you adjust it.

Validation data lets you see if your model can use what it learned from the training data to identify relevant new examples — and if there is noise impacting the model’s decisions. Also called the development dataset, the validation data helps tune an AI model while checking for overfitting.

Overfitting happens when the AI is too closely fitted to the training data, producing results tied to the specifics of that first dataset rather than patterns that hold for new data. After validation, the team will often return to the training data and train the model again, adjusting hyperparameters and other settings to improve it.
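The sketch below shows one way this can play out, using a simple k-nearest-neighbors classifier as a stand-in model (an illustrative choice, not a recommendation). The data is made up and includes one deliberately noisy label. With `k=1` the model memorizes the training data, noise included, so its training accuracy is perfect while its validation accuracy lags; a larger `k` scores lower on training data but better on validation data.

```python
from collections import Counter

def predict(train_pairs, x, k):
    """Label x by majority vote among its k nearest training inputs."""
    nearest = sorted(train_pairs, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train_pairs, eval_pairs, k):
    """Share of eval_pairs whose predicted label matches the known label."""
    correct = sum(predict(train_pairs, x, k) == label for x, label in eval_pairs)
    return correct / len(eval_pairs)

# Made-up data: inputs below 5 should be "low", but (4, "high") is a noisy label.
train_pairs = [(1, "low"), (2, "low"), (3, "low"), (4, "high"),
               (6, "high"), (7, "high"), (8, "high"), (9, "high")]
validation_pairs = [(2.5, "low"), (3.6, "low"), (5.5, "high"), (8.5, "high")]

# For each candidate hyperparameter k: (training accuracy, validation accuracy).
scores = {
    k: (accuracy(train_pairs, train_pairs, k),
        accuracy(train_pairs, validation_pairs, k))
    for k in (1, 3)
}
# Choose the setting that does best on data the model was not fitted to.
best_k = max(scores, key=lambda k: scores[k][1])
print(scores)
print(best_k)
```

A large gap between training and validation accuracy is the warning sign to look for. The setting is chosen on validation accuracy alone, because training accuracy rewards memorization.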

Team reviewing data for their AI model.

Test data

The test dataset is a sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. Put more simply, test data is a set of inputs the model has never seen, used to check whether the model produces the correct outputs on new data. The test set’s known labels are kept from the model and used only to score its answers.

The key difference between a validation dataset and a test dataset is that the validation dataset is used during model configuration, while the test dataset is reserved to evaluate the final model.

Test data is typically about 20% of your total data and should be completely separate from your training and validation data. Your model should know the training data very well by this point, so only unseen data gives an honest result.
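Putting the three datasets together, a split might be sketched as follows. The 60/20/20 proportions are one choice within the ranges mentioned above, and the function name, default shares and fixed seed are assumptions for the example. Many teams use a library helper for this instead.

```python
import random

def split_dataset(examples, train_share=0.6, validation_share=0.2, seed=0):
    """Shuffle examples and split them into train, validation and test lists."""
    if not 0 < train_share + validation_share < 1:
        raise ValueError("train and validation shares must leave room for test data")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train_share)
    validation_end = train_end + round(len(shuffled) * validation_share)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )

# 100 placeholder examples, identified here only by number.
train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before slicing matters when the original data is ordered, for example by date or by label, because otherwise one of the three sets could end up unrepresentative. Each example lands in exactly one set, which keeps the test data separate from everything the model saw earlier.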

The Importance of Quality Data

When adopting AI, you need to clearly define everything about your project for it to be successful. You need to know why you’re adopting AI, and what business problem you’re solving. You must clearly define, clean and label all the data you use to train and test your model so that your AI algorithm will learn quickly and easily.

AI is only as accurate and helpful as the data you feed it. When you start with good data, you’re far more likely to have a healthy and effective model.
