Using Questionable Datasets to Train AI Could Come With High Costs

While scrolling through social media posts recently, I saw that a number of friends had posted AI-generated portraits of themselves. Some were better than others. Some were just bizarre. Most were not to my taste. I didn’t think much about it.

Until the artists and designers in my network started sounding off.

Their complaint was not about the quality of the images but about copyright infringement: many of the works in the dataset used to train the portrait-generating app were included without the artists’ permission. The open-source LAION dataset was compiled by scraping images from the web, including copyrighted artwork, images from medical records, photographs of war and violence from various news sites and other content whose subjects and creators clearly did not intend it to train AI.

LAION’s makers argue that the organization meets Creative Commons licensing requirements, as it’s attributing the source of each image and not changing or even storing the images. Many artists whose works have been used disagree, and are seeking ways to have their art removed from the dataset. The path to do so isn’t a clear one.

The most powerful language model created to date, Generative Pre-trained Transformer 4 (GPT-4), was released in March 2023 without any disclosure of the datasets used to train it, with its developer citing competitive reasons. That secrecy raises serious ethical questions about data bias and privacy.

Legislation is beginning to catch up to the technology

Cases where AI has been developed using data governed by privacy laws, rather than by copyright or intellectual property regulations, seem to be more straightforward. In addition to issuing fines, the Federal Trade Commission has ordered several companies to delete data and destroy any algorithms developed from that data. The FTC has called for algorithmic destruction as a penalty in instances where organizations violated privacy, gathered information illegally or used consumers’ data to train algorithms without their consent.

Algorithmic destruction imposes heavy costs on an organization on multiple fronts. The company loses not only the data collected to train the algorithm but also its competitive position in the market. Development may be set back months or even years before the algorithm can be recreated from fresh, ethically collected inputs. Software development time isn’t cheap, and AI and machine learning engineers typically occupy the higher end of the developer pay scale. Paying for rework just to recapture market share isn’t an ideal investment. Add in the costs of customer churn and possible damage to the company’s reputation, and the losses keep growing.

The proposed European AI Act, which carries steep fines (up to 30 million euros or 6% of a company’s annual revenue, whichever is greater), is still in negotiation but will likely come into effect within the next few years. Much like GDPR, the law will apply to all businesses that operate in the EU and is expected to influence similar legislation around the world.
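
As a rough illustration of the "whichever is greater" rule, the sketch below computes the fine ceiling for a given annual revenue. The function name and the example revenue figures are hypothetical, and the two constants simply restate the proposal as described above, not a final law.

```python
FIXED_CAP_EUR = 30_000_000  # flat ceiling cited in the proposal
REVENUE_SHARE = 0.06        # 6% of annual revenue


def max_fine_eur(annual_revenue_eur: float) -> float:
    """Upper bound on a fine: the greater of the flat cap or 6% of revenue."""
    return max(FIXED_CAP_EUR, REVENUE_SHARE * annual_revenue_eur)


# Hypothetical companies. Below 500 million euros in revenue, the flat cap dominates.
small = max_fine_eur(100_000_000)    # 6% is 6 million, so the 30 million cap applies
large = max_fine_eur(2_000_000_000)  # 6% is 120 million, which exceeds the flat cap
```

Under these assumptions, the revenue-based figure only overtakes the flat cap once annual revenue passes 500 million euros.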

Consumers are concerned about AI’s use

A February 2023 Applause survey found that people overwhelmingly believe AI technology and its use should be regulated: Of 4,398 respondents, only 6% said they did not think AI should be regulated at all. More than half (53%) said AI should be regulated depending on its use and 35% said it should always be regulated.

Bias, which occurs when an algorithm is trained on poor or insufficient data, is another cause for concern. When questioned about bias in generative AI technology like ChatGPT, 86% of survey respondents indicated some level of concern:

Very concerned (19%)
Somewhat concerned, depends on the situation (41%)
Slightly concerned (26%)

In addition, customers aren’t always happy with the results AI offers. Applause’s survey found that among the 4,637 respondents who had used chatbots, 30% said they were somewhat or extremely dissatisfied with the experience, and 32% agreed with the statement “I would use chatbots more if they responded more accurately to the way I phrase things.” Natural language processing failures can reflect gaps in training data, including limited data from various regional, generational and ethnic groups.

As consumer and regulatory scrutiny intensifies, companies developing AI would be wise to ensure they’re collecting training data legally and ethically.

How to ensure training data is collected ethically

Make sure your organization’s terms and conditions/privacy policy cover AI training use cases. If you’re planning to use customer data to train AI, make sure customers know that, and that they understand how the data will be used and how it will benefit them, such as through improved product and service offerings.

Ask if participants have opted in. Informed consent is key. When participants explicitly agree to provide data that may be used to train AI algorithms, companies are on solid ground. While data warehouses may be able to provide artifacts at scale, buyers must confirm that the data may expressly be used to train AI algorithms without risk or repercussions. Ask whether contributors have knowingly granted permission to have their biometrics used to train body or facial recognition technology, voice applications or other AI.
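
One way to put the opt-in check into practice is to filter records before they ever reach a training pipeline. The sketch below is a minimal illustration; the `Record` fields and the `usable_for_training` helper are hypothetical names, and a real consent model would depend on your own policy and legal advice.

```python
from dataclasses import dataclass


@dataclass
class Record:
    contributor_id: str
    consented_to_ai_training: bool    # hypothetical general opt-in flag
    contains_biometrics: bool
    consented_to_biometric_use: bool  # hypothetical separate biometric opt-in


def usable_for_training(record: Record) -> bool:
    """Keep a record only if the contributor opted in to every relevant use."""
    if not record.consented_to_ai_training:
        return False
    if record.contains_biometrics and not record.consented_to_biometric_use:
        return False
    return True


records = [
    Record("a", True, False, False),  # opted in, no biometrics involved
    Record("b", True, True, False),   # biometrics present without biometric consent
    Record("c", True, True, True),    # opted in to both uses
    Record("d", False, False, False), # never opted in
]
approved = [r for r in records if usable_for_training(r)]
```

Treating biometric consent as a separate flag mirrors the distinction drawn above between general AI training use and body, face or voice recognition.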

Actively work to eliminate bias. Look at the data and make sure it accurately reflects the diversity of your customer base and target audience, at a minimum. Ensure your dataset includes samples from people with disabilities, as well as those of different ages, genders, races, and other key demographics. If necessary, work with a partner to source training data that matches the exact criteria you need. One international fitness company, for example, partnered with Applause to source AI training data from users with a variety of body types and fitness levels.
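
A simple first pass at checking representation is to compare each group’s share of the dataset against a target share. The sketch below assumes one demographic label per sample; the function name, the age bands, the target shares and the 5-point tolerance are all hypothetical choices made for illustration.

```python
from collections import Counter


def representation_gaps(samples, target_shares, tolerance=0.05):
    """Return groups whose dataset share falls short of the target by more than tolerance."""
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, target in target_shares.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if target - actual > tolerance:
            gaps[group] = round(target - actual, 3)
    return gaps


# Hypothetical dataset labeled by age band, compared with a hypothetical customer mix.
samples = ["18-34"] * 70 + ["35-54"] * 25 + ["55+"] * 5
target = {"18-34": 0.40, "35-54": 0.35, "55+": 0.25}
gaps = representation_gaps(samples, target)
```

Here the two older age bands come back as underrepresented, which signals where additional data would need to be sourced. A real audit would look at several attributes and their intersections, not a single label.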

Consider creating synthetic balanced data based on patterns and abstractions. Synthetic data offers another ethical way to build a training set: methods like the Synthetic Minority Oversampling Technique (SMOTE) can help create more balanced datasets and reduce bias.
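
To give a sense of the core idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between an existing minority point and one of its nearest minority neighbors. It is a simplified, standard-library-only illustration, not the reference algorithm or any particular library’s implementation; the function name, parameters and sample points are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical two-feature minority class with four real samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies. Synthetic samples can only rebalance what is present, so they complement rather than replace collecting data from underrepresented groups.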

