Select Page
Man coding on a laptop computer. The camera is looking over his left shoulder. He has a beard and dark hair.

The Last Mile of Generative AI Risk and Benefits Comes Down to Humans

Generative AI introduces new variables to digital interactions because the technology enables variable inputs and outputs. Traditional systems operate within the realm of fixed or tightly bounded inputs and predefined outputs. Conversational systems such as voice assistants and chatbots provided the option for input variability but primarily relied on predefined outputs. Generative AI introduces a new level of complexity that most organizations recognize as important but do not have a clear playbook for accommodating the change.

Generative AI systems not only accommodate input variability but have added output variability that can stray well beyond intentional boundaries. Many organizations know about these issues of output “hallucinations” and recognize that user safety and security can be compromised both at the input and output stages. This new dynamic complicates the testing process significantly, rendering traditional automated testing methods insufficient. It requires rethinking how to apply human testing in the process.

Generative AI’s strength in automating processes that previously resisted automation is so attractive that companies are accepting the new risks introduced by these dynamic solutions. The challenge they face is how to mitigate these risks while enhancing user adoption and value. Thankfully, there are some tools and techniques that are helping address these risks. Updated models that employ human testing represent one important development.

Limitations of Automated Testing

Automated testing is the backbone of traditional software quality assurance, but it struggles when faced with the more fluid nature of generative AI. This inadequacy stems from automated tests’ reliance on predefined expectations—expectations that generative AI, with its potential for variability, often defies.

For instance, traditional customer service might only need to validate a finite set of user inputs against expected outputs, rejecting anything outside that input. However, generative AI chatbots developed for the same purpose can produce an array of responses based on the nuances of an interaction. This variability introduces new types of risks, such as generating inappropriate or irrelevant content. A rule-based automated system may catch these, though degrees of freedom in the acceptance criteria can inadvertently allow risks to infiltrate seemingly innocuous interactions.

Generative AI solutions can also generate fully accurate outputs that include textual variability that an automated system may incorrectly reject. Automated scripts can be helpful but are ill-equipped to catch both false and true outputs in different circumstances due to their reliance on predefined conditions and outcomes.

Moreover, the risks associated with automated testing are not merely technical but also ethical and reputational. Inadequately tested AI systems may inadvertently produce biased, offensive, or nonsensical content, eroding user trust and potentially causing harm. Robust testing mechanisms that go beyond automated scripts are required. This is one area where human testing fills an important gap.

The Critical Importance of User Feedback

User feedback emerges as a cornerstone in refining generative AI applications. It’s not just about identifying bugs, technical glitches, or user experience flaws; it’s about understanding the appropriateness and relevance of the AI’s outputs from the end-user’s perspective.

For example, OpenAI’s approach to improving GPT models involves extensive user testing phases, where feedback on the model’s responses is used to fine-tune its performance. This iterative process ensures that the AI’s outputs align more closely with human expectations and accurate outputs, enhancing the overall user experience. Leveraging user feedback effectively requires a multifaceted approach, such as live testing environments where real-world interactions can be observed and analyzed or feedback loops where users can directly annotate or critique AI outputs.

As an example, consider a generative AI educational tool for producing content personalized to each student. By engaging with a diverse group of educators and students in the tool’s testing phase, developers can gather valuable feedback on the AI’s ability to adjust to different learning styles and needs, different cultural backgrounds, and other variables. This feedback can guide the fine-tuning of later generative AI models to ensure they create accurate content that is tailored to engage with the different ways students learn. Some of the techniques that AI foundation model developers rely on are equally important for applications built upon the technology.

Red Teaming in Generative AI

Red teaming, a concept borrowed from cybersecurity, involves adopting an adversarial approach to test software systems. Generative AI red teaming challenges AI models with variable prompts to ensure their output matches the intent of the developers regardless of circumstances. Red teaming has been useful for spotting vulnerabilities in traditional AI systems, but its significance is magnified with generative AI. By simulating attacks or probing for weaknesses in a generative AI model’s responses, developers fix potential issues before they affect end users.

Red teaming checks not just for safety but also for how well the AI responds on different topics. Good red teaming employs both subject experts and people from a variety of backgrounds as generalists. The quality of the red teaming is tied closely to the quality of the team and the variability introduced. This enables software developers to increase the likelihood of identifying outlier behavior that introduces risk.

Model Fine-Tuning as Testing

Fine-tuning a generative AI model involves adjusting the model’s parameters based on specific sets of prompts and responses. This is both a technical change and a way of measuring how well the model responds in a form that matches expectations. That might mean making the model more of an expert on a specific topic or simply better at avoiding problematic answers. Since generative AI models evolve through user input and added data, continuous testing and monitoring are key to the process.

At its best, fine-tuning is both an art and a science. It requires people with a deep understanding of the model and its application domain as well as the creativity to guide the AI so that its outputs fit any relevant use cases. Human testing is a key mechanism for both discovering unwanted behavior and improving model performance.

Conclusion

Bringing a generative AI application to market is a multifaceted challenge that goes beyond traditional software development hurdles. A robust human testing approach helps mitigate risks and improve performance in ways that automated scripts can’t fulfill on their own. Human insight, adversarial testing, and continuous fine-tuning are all needed if the goal is an application that is resilient enough to cope with actual customers and fully delivers on the promise of the technology’s benefits.

It’s a dynamic relationship that requires ongoing adjustments, but generative AI’s value is making its widespread adoption inevitable. Many key innovations today are focused on how to deliver positive outcomes. Some people are surprised that this last mile of AI effectiveness often comes back to human contributions.

Webinar

Testing Generative AI Applications

Explore critical elements to include in an effective testing strategy to lower risk and increase user adoption for Gen AI apps.

Want to see more like this?
Published: February 20, 2024
Reading Time: 8 min

Usability Testing for Agentic Interactions: Ensuring Intuitive AI-Powered Smart Device Assistants

See why early usability testing is a critical investment in building agentic AI systems that respect user autonomy and enhance collaboration.

Do Your IVR And Chatbot Experiences Empower Your Customers?

A recent webinar offers key points for organizations to consider as they evaluate the effectiveness of their customer-facing IVRs and chatbots.

Agentic Workflows in the Enterprise

As the level of interest in building agentic workflows in the enterprise increases, there is a corresponding development in the “AI Stack” that enables agentic deployments at scale.

What is Agentic AI?

Learn what differentiates agentic AI from generative AI and traditional AI and how agentic raises the stakes for software developers.

How Crowdtesters Reveal AI Chatbot Blind Spots

You can’t fix what AI can’t see

A Snapshot of the State of Digital Quality in AI

Explore the results of our annual survey on trends in developing and testing AI applications, and how those applications are living up to consumer expectations.
No results found.