4 Key Challenges of AI Artifact Collection
Product owners have a lot to consider when training AI algorithms. Foremost among the concerns is sourcing massive amounts of high-quality data to help the algorithm perform its task effectively. Trading quality for quantity or vice versa can ultimately result in poor accuracy or even biased conclusions that harm part of your customer base.
AI artifact collection, done right, is a massive and ongoing undertaking. Every day at Applause, we work with customers who navigate these challenges in their attempts to build valuable AI algorithms that do everything from predicting future retail purchases to assisting in medical diagnoses.
In our work at Applause, we experience four primary challenges with AI artifact collection:
volume
quality
diversity
data protection
Let’s dig into each challenge of AI artifact collection to explore the difficulties of training AI algorithms at scale.
Volume
AI algorithms interpret data to deliver valuable predictions. It stands to reason that the greater the data input, the more effective the output will be, especially in a world of nuances that a machine can’t inherently understand.
Consider if you planned to train an AI model to identify a dog. You might source hundreds of photos of dogs to train the model, but there are hundreds of officially recognized dog breeds, plus many other experimental breeds. Those dogs might have very different coats and colorings, or unique ailments or injuries that change their appearance. Maybe some photos show dogs wearing a Halloween costume or a holiday sweater, which might confuse the algorithm. The volume of artifacts needed to train this AI model is massive, but the more data you feed it, the better it will become at performing the task.
It’s not enough to source only clear, high-res photos for the task. Ideally, the training set also includes photos of dogs in both high and low light, low-res photos, and even photos where dogs are partially obscured, so the model learns from real-world examples of what a dog looks like.
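One common way to get those low-light, bright and low-res variants without sourcing every condition separately is data augmentation. Here is a minimal pure-Python sketch of the idea, treating an image as a grid of grayscale pixel values; the helper names are illustrative, and a real pipeline would use a library such as torchvision or Albumentations:

```python
import random

def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0-255 range."""
    return [[min(255, max(0, int(px * factor))) for px in row] for row in image]

def downscale(image, step=2):
    """Crude downscaling: keep every `step`-th pixel in each dimension."""
    return [row[::step] for row in image[::step]]

def augment(image, rng=random):
    """Produce low-light, bright, and low-res variants of one image."""
    return [
        adjust_brightness(image, rng.uniform(0.3, 0.6)),   # low light
        adjust_brightness(image, rng.uniform(1.4, 1.8)),   # bright light
        downscale(image),                                  # low resolution
    ]

# A tiny 4x4 stand-in for a photo
photo = [[100, 120, 140, 160] for _ in range(4)]
variants = augment(photo)
```

Each source photo yields several degraded variants, multiplying the effective volume of training examples.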
Now think about how difficult this task might be if you require data that can only be sourced from people in the real world, perhaps even data that is difficult to collect without robust data protection processes. What if you need to train a model on facial recognition, identification of medical records, or models that leverage sensitive financial or other personal information? One or two of these artifacts won’t suffice. Sourcing this type of information, and ensuring that its collection and processing comply with applicable law and regulation, is a very challenging task, even for those with sophisticated AI artifact collection operations. But it’s better to encounter the difficulty of sourcing that data in the beginning than to debug a low-accuracy algorithm and rush to source the proper amount of data later.
Remember, too, that as a brand expands into a new market, the challenge of collecting a high volume of artifacts for an AI model grows exponentially, along with the task of later validating the results in market.
Quality
As established, the quantity of data is important, but it must also reach a certain quality threshold. Foremost, the data must be relevant to the problem, which means it must be current and clear. This goes back to the point about having a large volume of data: inconsistently lit or obscured images, for example, can be helpful as edge cases but less useful as the primary examples on which to train the model. Organizations must go through the painstaking process of validating these enormous data sets to ensure accuracy, but they might lack the time and resources to do so.
A large volume of data also presents a challenge with labeling and annotating. These are time-consuming and error-prone tasks that can negatively affect the use of the data itself. Likewise, preprocessing and integrating disparate data sources presents challenges around the consistency and reliability of the artifacts.
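One common guard against labeling and annotation errors is to collect multiple annotations per artifact and accept only those where annotators agree, routing the rest to manual review. A minimal sketch of that consolidation step, with illustrative function and field names rather than any specific tool’s API:

```python
from collections import Counter

def consolidate_labels(annotations, min_agreement=0.66):
    """Majority-vote each artifact's labels; flag artifacts whose
    agreement falls below the threshold for manual review."""
    accepted, needs_review = {}, []
    for artifact_id, labels in annotations.items():
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[artifact_id] = top_label
        else:
            needs_review.append(artifact_id)
    return accepted, needs_review

labels = {
    "img_001": ["dog", "dog", "dog"],   # unanimous
    "img_002": ["dog", "cat", "wolf"],  # no consensus: send to review
    "img_003": ["dog", "dog", "cat"],   # 2/3 agreement
}
accepted, review = consolidate_labels(labels)
```

Tracking how many artifacts land in the review queue also gives an early signal that annotation instructions need clarification.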
There are also instances where user-sourced data can be inconsistent or inaccurate. When dealing with user-sourced artifacts, it’s easy for someone to miss a step or fail to follow directions, which can also undermine the data training efforts. These unreliable or bad actors require a sophisticated vetting approach from an experienced professional — though sometimes AI models can lend an assist, as they can be trained to detect common fraud attempts in other AI models.
Prioritize the acceptance rate of the artifacts. Data providers who provide readily available, low-cost data sets sometimes fail to deliver useful returns. In fact, after factoring in re-work and potential damage to a brand’s reputation from poor quality data sets that cause inconsistent or inaccurate AI outputs, there’s a strong argument to be made that you save more by spending more on a premium data provider. After all, high-accuracy models inspire high confidence — and the opposite is true too.
All of these artifact quality challenges require thorough monitoring, attention to detail, and a mindful approach to ethical and legally compliant data collection and management. Simply put, many AI artifact collection and management operations exceed the capabilities of in-house teams, as training, retraining and testing AI algorithms should be an ongoing task.
Diversity
The data helps inform the algorithm. But what if the data neglects to capture the full picture? Worse, what if it degrades or even creates a negative experience for some of your customers?
Look no further than Kodak’s Shirley cards for a lesson in how product developers can inadvertently introduce bias into systems, even if they aren’t digital, as Meredith Broussard discussed in a recent episode of the Ready, Test, Go podcast. Shirley cards, named after the light-skinned model pictured in the photo, originated in the 1950s as a way to tune printing systems. Because the focus was on lighter skin, the film failed to capture the full range of colors, including darker skin tones, with clarity. While the company eventually corrected course with Shirley cards that featured models with multiple skin tones, the brand took a reputation hit and likely limited its revenue.
The same risks for bias exist today; in fact, they are likely exacerbated. AI models often train disproportionately on people with lighter skin tones, which means those models will struggle to identify individuals with darker skin tones. And this bias problem goes beyond image recognition, as historical data is littered with institutional bias as well. The result is that facial recognition, loan approval and fraud detection systems trained on historical data may carry bias that poses real harm to individuals.
Organizations must ultimately source a diverse set of data that closely represents both the problem the algorithm is trying to solve and the customer base. Capturing diverse data from customers may mean the collection of a broad spectrum of data about individuals, including different:
ages
ethnicities
languages
dialects or accents
socio-economic backgrounds
gender identities
education levels
technical proficiencies
physical, mental or cognitive abilities and disabilities
body types
income classes
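Representation gaps across attributes like those listed above can be caught early with a simple distribution audit of the collected artifacts. A minimal sketch that compares a data set’s observed mix against target proportions; the attribute names, targets and tolerance here are illustrative assumptions, not prescribed values:

```python
from collections import Counter

def audit_balance(records, attribute, targets, tolerance=0.05):
    """Compare the observed share of each attribute value against its
    target share; return the values that drift beyond the tolerance,
    mapped to their (observed - target) gap."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    gaps = {}
    for value, target_share in targets.items():
        observed = counts.get(value, 0) / total
        if abs(observed - target_share) > tolerance:
            gaps[value] = round(observed - target_share, 3)
    return gaps

# Illustrative data set skewed toward younger contributors
dataset = ([{"age_group": "18-34"}] * 70
           + [{"age_group": "35-54"}] * 25
           + [{"age_group": "55+"}] * 5)
targets = {"18-34": 0.40, "35-54": 0.35, "55+": 0.25}
gaps = audit_balance(dataset, "age_group", targets)
# Positive gap: over-represented; negative gap: under-represented
```

Running an audit like this per attribute, before training begins, is far cheaper than discovering the skew through a biased model in production.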
Customize the data diversification efforts to the needs of the system. For example, a fitness tracker would need to source people of varied ages, fitness levels and athletic interests, while a legal document scanner would need to recognize different types of inputs, handwriting, etc. This level of artifact collection must be able to scale to meet the challenge.
These individuals not only have unique needs, viewpoints and values, but they also represent various relevant customer personas for the business. Make sure these valuable customers are accurately and fairly represented in the data, for the sake of accuracy and ethics — consider what the data sourcer or generator might be missing, and the inherent bias within its system. It’s easy for AI artifact collection efforts to be too narrow, even despite explicit attempts to the contrary.
Data protection
Hundreds of data breaches occur every year, and each can be a costly nightmare for businesses. Businesses must stay ahead of the curve in ensuring the AI artifacts they collect are secure and protected from misuse, exfiltration and other data security threats. Data security controls such as encryption in motion and at rest, along with access management, help achieve this objective. Work with a data sourcer who can not only provide the artifacts you need, but also protect them accordingly.
AI models and infrastructure must also be secure to prevent unauthorized access and misuse. Encryption, API security, and patches go a long way toward protecting models and the data contained within, while cloud services providers may offer a variety of options to secure the infrastructure.
Personal data used as part of an AI training set carries its own set of risks, and more and more countries are putting forth AI-specific privacy legislation to counteract the risks of AI consumption of individuals’ personal data. Businesses should ensure that personal data is redacted, anonymized or pseudonymized prior to use where possible.
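Pseudonymization can be as simple as replacing direct identifiers with stable keyed hashes before the data ever reaches a training pipeline. Here is a minimal sketch using Python’s standard library; the field names and hard-coded secret are illustrative only, and a production system would use proper key management:

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-secret"  # illustrative; never hard-code

def pseudonymize(record, sensitive_fields):
    """Replace direct identifiers with keyed hashes so records remain
    linkable across the data set without exposing identities."""
    cleaned = dict(record)
    for field in sensitive_fields:
        if field in cleaned:
            digest = hmac.new(SECRET_KEY, str(cleaned[field]).encode(),
                              hashlib.sha256).hexdigest()
            cleaned[field] = digest[:16]  # truncated pseudonym token
    return cleaned

patient = {"name": "Jane Doe", "email": "jane@example.com", "age": 42}
safe = pseudonymize(patient, ["name", "email"])
```

Because the same input always maps to the same token under the same key, the pseudonymized records can still be joined and deduplicated, which plain redaction would prevent.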
Personal data use cases for AI models should be transparent, providing appropriate notices to the individuals from whom the training artifacts are gathered and securing the right consents for the use of those artifacts. It is important to remember that personal data is subject to a variety of data protection laws around the globe, and companies must balance the use case for that data against the rights and freedoms those laws guarantee. Fair information practices and strong compliance controls help ensure that AI models are trained using the right data for the right reasons.
In all, the training and use of AI requires a collaborative effort between data security and legal professionals.
Take away the guesswork
How much time and resources do you spend on AI training and testing today? How much more will it take to reach your company’s goals?
Applause leverages a scalable global community of more than one million digital experts who can provide the volume and quality of artifacts you need — text, images, video, speech, handwriting, documents, and more — for any AI training purpose. Our AI artifact and data collection solution delivers the training data sets you need, curated and validated by domain experts. We source artifacts from our global community, which offers individuals from all backgrounds and cultures, matching your current and future target customers. We offer advanced techniques to train algorithms and large language models for applications, helping household-name companies deliver high-quality, highly secure AI experiences and capabilities.
But Applause doesn’t stop there. We offer unmatched AI testing capabilities, sourcing the same million-strong digital community to provide rapid, iterative feedback on your AI experiences. Thus, Applause can work alongside — and even within — your organization as a true digital quality partner, delivering reliable, optimized AI models.
Talk with us today about your AI goals, and let’s create a plan to help you achieve them.