5 Best Practices for Testing AI Applications

Ben Anderson

Testing AI applications in the era of AI regulations

In light of the April 2021 announcement of the world’s first legislative framework for regulating Artificial Intelligence (AI), the European Artificial Intelligence Act (EU AIA), now is an opportune time for developers to revisit their strategies for testing AI applications.

Incoming regulations mean that the group of stakeholders who care about your testing results just got bigger and more involved. The stakes are high, not least because companies that violate the terms of the legislation could face fines higher than those levied under the General Data Protection Regulation (GDPR). In the interest of transparency, certain types of AI also have to make their accuracy metrics available to users, which adds to the pressure to get functional testing right.

Following on from Applause’s step-by-step guide to training and testing your AI algorithm, this article summarizes how developers should be testing AI applications in anticipation of the new era of AI regulations. Before jumping to the five best practices, though, it is important to understand the ways in which the EU AIA is set to impact the work of AI developers.

What the draft EU legislation says

Not all AI systems are subject to the same rules under the proposed EU AIA. The draft regulations intend to split AI systems into four categories and legislate them according to the risk posed to society by each group:

  • Unacceptable-risk systems (such as dark-pattern AI, real-time remote biometric identification systems and social-scoring mechanisms) are banned entirely

  • High-risk systems (such as those used for law enforcement, employee management, critical infrastructure operation, biometric identification in nonpublic spaces, and border control) are heavily regulated

  • Limited-risk systems (such as deepfakes, chatbots and emotion recognition systems) require certain disclosure obligations

  • Minimal-risk systems (such as AI-enabled video games and spam filters) are not subject to requirements but developers are encouraged to draw up codes of conduct

This means that it is predominantly high-risk AI systems that could face extensive requirements under the EU AIA. If the draft legislation is enacted, high-risk AI applications developed anywhere in the world will need to attain the Conformité Européenne (CE) mark — an EU marking awarded to products that meet its health and safety requirements — if they are to be traded on the European market. To do so, high-risk systems must conform to regulations regarding human oversight, transparency, cybersecurity, risk management, data quality, monitoring, and reporting obligations.

What AI developers need to look out for

If readers take one key lesson from this article, it should be that the EU AIA will not only potentially affect the development of AI in Europe, but across the globe. This is because the draft regulations apply not only to every AI system put on the European market, but to those systems for which the output is used in the EU. Under a phenomenon known as the Brussels Effect, many non-EU tech developers will thereby find themselves subject to the EU AIA if they want access to the European market.

Even if a developer has no contact with Europe, the EU AIA may still impact them indirectly. EU regulations often set the global precedent, as was the case with the GDPR, which has influenced similar data privacy laws in countries including the US, Chile and India. Margrethe Vestager, the European Commission's Executive Vice-President for A Europe Fit for the Digital Age, believes the EU AIA is no exception, stating in a press release: “with these landmark rules, the EU is spearheading the development of new global norms.”

Companies unsure of whether their AI’s output could be used in the EU might want to err on the side of caution, as failure to comply with the rules can lead to fines of up to 30 million euros or 6% of a business’ annual revenue. As history has shown, the EU does not take regulatory infringements lightly. In 2019, Google was fined €50 million ($56.6 million) for breaching the terms of the GDPR, and a further 220 fines were handed out for GDPR violations within the first 10 months of 2020 alone. Many companies underestimated the preparation needed to change their processes in line with the regulations, such that 20% of US, UK and even EU companies are still not fully GDPR-compliant, according to research by TrustArc.

The EU AIA is expected to come into effect in around five years. Whether AI companies are already doing business in Europe today or plan on entering the EU market, they need to act now if they are to leave enough time for testing AI applications. Research from McKinsey shows that only 48% of technology companies recognised regulatory-compliance risks in 2020, while just 38% reported actively working to address them. Despite this, a separate report by Accenture found that 72% of US executives believe AI will dramatically change their industry, and one-quarter say it will completely transform their business within the next three years.

5 best practices

Each company is solely responsible for ensuring its own compliance with existing legislation and, if and when it is enacted, with this new framework. However, Applause’s framework for training and testing AI applications anticipates many of the areas laid out in the EU AIA. After all, many of the draft legislation’s requirements — such as ensuring output accuracy, identifying bias and practising good data management — are also simply best practices that can dramatically improve the quality of AI-driven experiences. Here are some of the key areas where we can support you:

1. Start off on the right foot

While testing AI applications is crucial, it is only one piece of the puzzle. Algorithms are only as smart as the data that goes into them, so if your AI isn’t trained on high-quality data, testing will only get you so far. The EU AIA also recognises the importance of training data for producing accurate and unbiased outcomes, stipulating that it should be relevant, representative, free of errors and complete.

An Alegion survey shows that 81% of executives say that collecting training data for AI models is more difficult than expected. Many companies that try to source training data themselves end up underestimating the investment and organisational skill sets needed to recruit, source and prepare the data at scale, resulting in expensive overhead due to false starts and product delays. Others turn to data providers like Amazon Mechanical Turk, which supply high data volumes at low cost, but which are simply not bespoke enough to each company’s needs.

Through our worldwide community of testers, Applause sources any training data set rapidly and at scale, including text, images, speech, handwriting, biometrics and more. We cast the broadest and largest possible net when recruiting data samples, collecting first-hand, authentic datasets from across different countries, cultures, communities, backgrounds, ages, races and genders. Only by thinking on a global scale can developers ensure the diversity of data needed to avoid bias and produce accurate, representative outcomes.
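As a simple illustration, some of these data-quality criteria can be checked automatically before training even begins. The Python sketch below flags records with missing values and warns when one group dwarfs another in the dataset. The field names (`country`, `label`) and the five-to-one imbalance threshold are our own illustrative assumptions, not part of the regulation or of any particular workflow.

```python
from collections import Counter

def audit_training_data(records, group_key="country"):
    """Flag basic data-quality issues: incomplete records and group imbalance.

    `records` is a list of dicts; `group_key` is a hypothetical field
    name -- adapt both to your own schema.
    """
    issues = []

    # "Free of errors and complete": flag records with empty fields.
    incomplete = [r for r in records if None in r.values() or "" in r.values()]
    if incomplete:
        issues.append(f"{len(incomplete)} records have missing values")

    # "Representative": warn when one group vastly outnumbers another.
    counts = Counter(r[group_key] for r in records if r.get(group_key))
    if counts:
        smallest, largest = min(counts.values()), max(counts.values())
        # A >5x gap between the best- and worst-represented groups is a
        # rough, illustrative warning threshold, not a legal standard.
        if largest > 5 * smallest:
            issues.append(f"imbalanced groups: {dict(counts)}")

    return issues
```

A check like this is cheap to run on every data delivery, so representation problems surface before they are baked into a model.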

2. Think beyond mobile and web

Testing AI applications was simpler in the days when consumers spent most of their time on mobile and web. The draft EU AIA has coincided with the proliferation of technologies like voice, facial recognition and IoT, which in turn have given rise to omnichannel and multi-medium experiences. Developers today need to test AI applications on every device where those experiences might appear, such as wearables, smart home devices, in-car systems and in-store shopping experiences.

The EU AIA recognises a darker side of many novel technologies, especially where consumers might be tricked into using them without their knowledge. Article 52, for example, states that EU residents must be made aware when a video is a deepfake, when a conversation partner is a voice assistant or when they are subject to biometric categorisation. Testing whether real users notice when this is happening is one way companies can measure whether they are successfully meeting this requirement.

3. Test outside of the lab

Humanity stands at the heart of the EU AIA. The legislation was created because the European Commission recognised that AI will never meet its full economic potential if humans don’t trust it. Companies hold a similar view, with 58% of executives saying AI-enabled growth will come from increased customer satisfaction and engagement, according to a report by Accenture.

A key issue with AI systems today is that they don’t provide the same level of service as a human. Research from Capgemini shows 64% of consumers want AI to be more humanlike if they are to engage more with the technology. According to Pega, a further 70% still prefer to speak to a human when dealing with customer service. If tech developers are to assuage consumers’ concerns and build more human-like AI, they need to ensure that their AI experiences are not only theoretically sound, but truly useful to the people who use them.

However, whether AI experiences are meeting customer expectations is not something that can be tested for in a lab environment. While the lab might be able to gauge whether an AI correctly captures information or responds accurately, only humans can gauge performance indicators like:

  • Did it understand me?

  • Did I see or hear what I expected?

  • Was it easy to use?

  • Did it give me everything I needed?

  • Would I use it again?

Only by involving real users when testing AI applications can companies produce truly useful AI experiences.
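Once real testers have answered questions like the ones above, their responses still need to be rolled up into something actionable. A minimal Python sketch, assuming each tester scores the five indicators on a 1-to-5 scale (the indicator names and the 3.5 "weak area" cut-off are illustrative choices, not a standard):

```python
from statistics import mean

# Hypothetical post-session survey fields, one per indicator above.
INDICATORS = ["understood_me", "expected_output", "ease_of_use",
              "completeness", "would_use_again"]

def summarise_feedback(responses):
    """Average each indicator across testers and flag weak areas.

    `responses` is a list of dicts mapping each indicator to a 1-5
    rating; anything averaging below 3.5 is flagged for follow-up.
    """
    summary = {k: mean(r[k] for r in responses) for k in INDICATORS}
    weak = [k for k, score in summary.items() if score < 3.5]
    return summary, weak
```

Aggregating per-indicator rather than into a single satisfaction score keeps the output diagnostic: it tells you *which* part of the experience fell short, not just that something did.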

4. Actively identify and rectify biases

Because AI systems are trained on data gathered and created by humans, AI often takes on some of our unconscious biases — unconscious in the sense that individuals can’t always identify their own biases. This means that, if AI experiences are tested within too narrow a group of people, biases may go unnoticed that could lead to the marginalisation of certain groups or the exacerbation of prejudices. It also means that an AI system might discriminate against some of your users by working better for some groups than others. For this reason, paying attention to biases is also a requirement for certain systems under the EU AIA.

As discussed earlier, much of the work involved in eliminating bias is ensuring that the data used to train AI represents as diverse a pool of people as possible. That said, while representative training data can mitigate bias, testing AI applications is the only way you can know for sure whether your algorithm has still picked up bias. The only way developers can identify a biased algorithm is to have a large, diverse pool of testers trial the AI experience and then analyse the test results. Here Applause’s uTest community, the largest community of trained testers in the world, can provide an unparalleled service.
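One common way to begin that analysis is to compare the model’s accuracy across demographic groups. The sketch below is a hypothetical illustration rather than any specific methodology: it assumes test results arrive as `(group, predicted, actual)` records and reports the gap between the best- and worst-served groups.

```python
from collections import defaultdict

def accuracy_by_group(results):
    """Compute per-group accuracy from labelled test results.

    `results` is a list of (group, predicted, actual) tuples -- a
    simplified stand-in for real test output.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for group, predicted, actual in results:
        totals[group] += 1
        correct[group] += (predicted == actual)
    return {g: correct[g] / totals[g] for g in totals}

def disparity(per_group):
    """Gap between the best- and worst-served groups; a large gap
    suggests the model works better for some users than others."""
    return max(per_group.values()) - min(per_group.values())
```

Accuracy gaps are only one lens on bias; depending on the application, false-positive or false-negative rates per group may matter more, but the mechanics of slicing test results by group are the same.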

5. Build in a feedback loop

Just as algorithms can learn, they can unlearn and relearn. The step that comes after testing your AI applications for inaccuracies and bias is to build a feedback loop into the development process that corrects errors on a rolling basis. Testing AI applications is a circular process, as the output data can be used to reinform the input data up until the output is correct. Given that the EU AIA is likely to lead to similar legislation worldwide, AI systems also need to be able to adapt to changing requirements.
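In code, such a loop might look like the following sketch, where `train`, `evaluate` and `collect_corrections` are placeholders for whatever pipeline your team actually uses, and the accuracy target and round limit are illustrative values.

```python
def feedback_loop(train, evaluate, collect_corrections, dataset,
                  target_accuracy=0.95, max_rounds=10):
    """Sketch of a rolling correct-and-retrain cycle.

    `train(dataset)` builds a model, `evaluate(model)` returns
    (accuracy, failures) from real-user testing, and
    `collect_corrections(failures)` turns failures into new labelled
    examples -- all hypothetical hooks into your own pipeline.
    """
    model = None
    for round_no in range(1, max_rounds + 1):
        model = train(dataset)
        accuracy, failures = evaluate(model)
        if accuracy >= target_accuracy:
            return model, round_no
        # Feed the corrected failure cases back into the training set,
        # so the next round's input reflects the last round's output.
        dataset = dataset + collect_corrections(failures)
    return model, max_rounds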

Applause works with the world’s leading tech companies to build global AI programs. Learn more about how we can help at https://www.applause.com/ai-training-testing.

