The Last Mile of Generative AI Risk and Benefits Comes Down to Humans

Generative AI introduces new variables to digital interactions because the technology enables variable inputs and outputs. Traditional systems operate within the realm of fixed or tightly bounded inputs and predefined outputs. Conversational systems such as voice assistants and chatbots provided the option for input variability but primarily relied on predefined outputs. Generative AI introduces a new level of complexity that most organizations recognize as important but do not have a clear playbook for accommodating the change.

Generative AI systems not only accommodate input variability but have added output variability that can stray well beyond intentional boundaries. Many organizations know about these issues of output “hallucinations” and recognize that user safety and security can be compromised both at the input and output stages. This new dynamic complicates the testing process significantly, rendering traditional automated testing methods insufficient. It requires rethinking how to apply human testing in the process.

Generative AI’s strength in automating processes that previously resisted automation is so attractive that companies are accepting the new risks introduced by these dynamic solutions. The challenge they face is how to mitigate these risks while enhancing user adoption and value. Thankfully, there are some tools and techniques that are helping address these risks. Updated models that employ human testing represent one important development.

Limitations of Automated Testing

Automated testing is the backbone of traditional software quality assurance, but it struggles when faced with the more fluid nature of generative AI. This inadequacy stems from automated tests’ reliance on predefined expectations—expectations that generative AI, with its potential for variability, often defies.

For instance, traditional customer service might only need to validate a finite set of user inputs against expected outputs, rejecting anything outside that input. However, generative AI chatbots developed for the same purpose can produce an array of responses based on the nuances of an interaction. This variability introduces new types of risks, such as generating inappropriate or irrelevant content. A rule-based automated system may catch these, though degrees of freedom in the acceptance criteria can inadvertently allow risks to infiltrate seemingly innocuous interactions.

Generative AI solutions can also generate fully accurate outputs that include textual variability that an automated system may incorrectly reject. Automated scripts can be helpful but are ill-equipped to catch both false and true outputs in different circumstances due to their reliance on predefined conditions and outcomes.

Moreover, the risks associated with automated testing are not merely technical but also ethical and reputational. Inadequately tested AI systems may inadvertently produce biased, offensive, or nonsensical content, eroding user trust and potentially causing harm. Robust testing mechanisms that go beyond automated scripts are required. This is one area where human testing fills an important gap.

The Critical Importance of User Feedback

User feedback emerges as a cornerstone in refining generative AI applications. It’s not just about identifying bugs, technical glitches, or user experience flaws; it’s about understanding the appropriateness and relevance of the AI’s outputs from the end-user’s perspective.

For example, OpenAI’s approach to improving GPT models involves extensive user testing phases, where feedback on the model’s responses is used to fine-tune its performance. This iterative process ensures that the AI’s outputs align more closely with human expectations and accurate outputs, enhancing the overall user experience. Leveraging user feedback effectively requires a multifaceted approach, such as live testing environments where real-world interactions can be observed and analyzed or feedback loops where users can directly annotate or critique AI outputs.

As an example, consider a generative AI educational tool for producing content personalized to each student. By engaging with a diverse group of educators and students in the tool’s testing phase, developers can gather valuable feedback on the AI’s ability to adjust to different learning styles and needs, different cultural backgrounds, and other variables. This feedback can guide the fine-tuning of later generative AI models to ensure they create accurate content that is tailored to engage with the different ways students learn. Some of the techniques that AI foundation model developers rely on are equally important for applications built upon the technology.

Red Teaming in Generative AI

Red teaming, a concept borrowed from cybersecurity, involves adopting an adversarial approach to test software systems. Generative AI red teaming challenges AI models with variable prompts to ensure their output matches the intent of the developers regardless of circumstances. Red teaming has been useful for spotting vulnerabilities in traditional AI systems, but its significance is magnified with generative AI. By simulating attacks or probing for weaknesses in a generative AI model’s responses, developers fix potential issues before they affect end users.

Red teaming checks not just for safety but also for how well the AI responds on different topics. Good red teaming employs both subject experts and people from a variety of backgrounds as generalists. The quality of the red teaming is tied closely to the quality of the team and the variability introduced. This enables software developers to increase the likelihood of identifying outlier behavior that introduces risk.

Model Fine-Tuning as Testing

Fine-tuning a generative AI model involves adjusting the model’s parameters based on specific sets of prompts and responses. This is both a technical change and a way of measuring how well the model responds in a form that matches expectations. That might mean making the model more of an expert on a specific topic or simply better at avoiding problematic answers. Since generative AI models evolve through user input and added data, continuous testing and monitoring are key to the process.

At its best, fine-tuning is both an art and a science. It requires people with a deep understanding of the model and its application domain as well as the creativity to guide the AI so that its outputs fit any relevant use cases. Human testing is a key mechanism for both discovering unwanted behavior and improving model performance.

Conclusion

Bringing a generative AI application to market is a multifaceted challenge that goes beyond traditional software development hurdles. A robust human testing approach helps mitigate risks and improve performance in ways that automated scripts can’t fulfill on their own. Human insight, adversarial testing, and continuous fine-tuning are all needed if the goal is an application that is resilient enough to cope with actual customers and fully delivers on the promise of the technology’s benefits.

It’s a dynamic relationship that requires ongoing adjustments, but generative AI’s value is making its widespread adoption inevitable. Many key innovations today are focused on how to deliver positive outcomes. Some people are surprised that this last mile of AI effectiveness often comes back to human contributions.

Webinar

Testing Generative AI Applications

Explore critical elements to include in an effective testing strategy to lower risk and increase user adoption for Gen AI apps.

Watch 'Testing Generative AI Applications' Now

Want to see more like this?

AI Training & Testing

Chris Sheehan

EVP, High Tech & AI

Published On: February 20, 2024

Reading Time: 6 min

AI Training & Testing

New Software Testing Methods in the AI Age

Learn how software development and testing are changing in the age of AI.

AI Training & Testing

Preparing for PSD3 and the IPR: Payment Testing Guidance

With new fintech regulations on the horizon, early preparation pays off.

AI Training & Testing

Anatomy of an Accessibility Pilot

Learn what a short accessibility pilot engagement looked like for a small tech firm that needed help fast.

AI Training & Testing

Should Software Testers Learn How To Code? Pros and Cons

Coding offers opportunity for career advancement, if you’re up to the task.

AI Training & Testing

What is a11y? Advocacy for Accessibility

Discover what a11y stands for and why it’s a critical component of digital quality for brands.

AI Training & Testing

Scaling Globally with Automation and Localization

Learn how to scale localization efforts by combining automation with real-world human expertise.

No results found.

The Last Mile of Generative AI Risk and Benefits Comes Down to Humans

Limitations of Automated Testing

The Critical Importance of User Feedback

Red Teaming in Generative AI

Model Fine-Tuning as Testing

Conclusion

Testing Generative AI Applications

New Software Testing Methods in the AI Age

Preparing for PSD3 and the IPR: Payment Testing Guidance

Anatomy of an Accessibility Pilot

Should Software Testers Learn How To Code? Pros and Cons

What is a11y? Advocacy for Accessibility

Scaling Globally with Automation and Localization

General

Company

Resources

Legal