The Path To Exceptional AI Apps
As organizations look to capitalize on AI, they’re developing new apps and adding AI features to existing experiences. But not all of these new AI offerings work seamlessly. In a recent webinar, Chris Sheehan, Executive Vice President, High Tech & AI and Chris Munroe, Vice President, AI Strategic Customer Programs, shared best practices for training and testing all types of AI.
Sheehan and Munroe work closely with many Applause customers developing AI applications. They’ve witnessed firsthand what leading organizations do to reduce risk and ensure their AI applications create value for users. Here are some of the key points they shared during the webinar, “AI Testing: The Path to Exceptional Apps.”
AI applications pose unique challenges.
“The primary goal of AI testing is actually pretty straightforward: to create safe, responsible, and high-quality AI experiences that deliver value to your users,” Munroe said. “But as many of you know, that’s much easier said than done. Getting AI to behave safely and predictably while still delivering value is a complex challenge that many companies are still trying to figure out.”
Munroe walked through the differences between traditional and generative AI (Gen AI) and outlined why testing can be so challenging. The key difference lies in the fact that Gen AI is probabilistic and not deterministic. Deterministic systems serve up consistent, predictable outputs every time you input the same data. On the other hand, generative AI models are probabilistic, which means they integrate some element of randomness and unpredictability. “Because of that unpredictability, testing and quality assurance become much, much more complex,” Munroe said. “A quality testing roadmap is absolutely essential to ensure that your AI behaves the way that you intend.”
The role of human expertise in AI
Sheehan explained that a quality testing strategy centers around people, process and technology. “In working with our clients, we’ve found that there are five general areas where human expertise comes into play that can increase the quality of your AI system,“ Sheehan said. The five critical areas are:
- Data collection: build the foundation for strong AI systems
- Model fine-tuning data: adjust an LLM to improve performance
- Model evaluation: identify weaknesses and areas for improvement
- Red teaming: assess risks and vulnerabilities to increase safety
- Release testing and monitoring: get real user perspectives
Sheehan added, “The extent of using human expertise to do your testing depends very much on your use case. So you can imagine if you’re building a large language model-based application, say, in the health space, having a lot of human expertise at every step is going to be critical. The level of accuracy needed for that application is going to be extremely high. Now, you could contrast that with building a general purpose chatbot to do summarization for a small internal team. So a completely different use case and a very narrow use case. Your level of using human expertise may be less than what’s needed in the medical example.”
Collecting diverse, real-world data
“Let’s start with looking at data collection. This is really the foundation of building quality,“ Sheehan said. “You need high-quality data. And that can come from many different places, but human data sets with fidelity and diversity is critical: that starts with an experienced partner that knows how to do this at scale.” While gathering diverse, fit-for-purpose data is essential, few organizations can quickly source data across multiple countries while adhering to the relevant privacy and data protection laws.
Fine-tune the model with additional data
Munroe explained that additional focused data is crucial for fine-tuning models. “The type and variety of data you use can really make or break your model’s performance. Companies often use a mix of public data, proprietary data acquired through licensing, and synthetic data to fill in gaps. But here’s the reality. There will always be cases where synthetic data is not enough.”
As an example, use cases like specialized medical or financial applications call for real-world data with higher fidelity. The data could come from documents, like medical insurance policies, rental agreements or bills. An application may require custom data from domain experts in fields such as STEM, finance, or medicine to train a model on specific behaviors or responses. “Overall, the goal is to ensure that your AI systems understand the nuances of your specific company’s domain and the requirements from a training perspective around the variety of data sources that are needed,” Munroe said.
Case Study
How A Tech Company Improved Its LLM Chat App
Learn how a high-tech company worked to improve its LLM-based chat app’s responses and reduce common AI risks prior to launch.
Evaluate the model’s responses
Once you’ve trained the model, there’s still more work to do.
“This is where continuous evaluation comes in. And human judgment plays a critical role here,” Munroe said. “While some aspects [of testing] can be automated, factors like accuracy, relevancy, clarity, and language or tone often require human and subjective feedback, especially for nuanced situations.” He explained that Applause often collaborates with customers to develop customized prompt and response evaluation frameworks tailored to fit their needs.
Testers assess factors such as accuracy, relevance, clarity, language quality, tone and style. Each one is graded. Accuracy and relevance typically carry the most weight, often making up around 60 or more percent of the overall evaluation score. “Depending on the industry and our risk profile, we may use different scoring methods from a binary one, where the answer is true or false or yes or no to a more detailed three or five-point scoring scale, where more subjective feedback is appropriate,” Munroe said.
Reduce risks with red teaming
Red teaming is a crucial, but often overlooked aspect of AI development. The concept is fairly simple. It’s about thinking like an attacker. “The goal is to test your model in ways that identify its weak points before it goes live. This helps you uncover potential vulnerabilities that could be exploited to produce harmful content,” Munroe said.
Munroe went on to explain the approach to AI red teaming to identify vulnerabilities before they become real issues. The process starts with assessing the AI system’s capabilities. Next, the team creates risk assessment documents that outline the scope, objectives, and detailed requirements of the red teaming engagement. Munroe stressed the importance of ensuring that all stakeholders have a clear understanding of the expected tests. This includes agreement on evaluation criteria, error types, or harm categories. “Overall, it serves as a reference to align the red team’s effort with the organization’s risk assessment goals and AI vulnerabilities,” Munroe said.
Next, the team plans specific attack strategies. These tests are designed to simulate real-world threats to stress test the AI’s robustness and resilience. Some common adversarial techniques include things like data poisoning, bias detection, misinformation testing, sensitive topic probes, or the use of slang or jargon to try to bypass safeguards.
Testing teams conduct red team exercises in a controlled environment to see how well the system holds up against simulated attacks. Testers closely monitor the system’s responses to determine where security may need improvement. The final step is to analyze results and provide recommendations to enhance the AI’s security and resilience.
“Overall, it’s a continuous loop of testing, learning, and improving to ensure that your AI system is not just functional, but also is very safe, secure, and robust,” Munroe said.
Test before launch
Monroe pointed out that pre-launch testing is critical, as applications that work in a controlled environment can often fail under real-world conditions. He shared examples of companies that ran into “very public, embarrassing, and in some cases, very expensive issues, which could have likely been mitigated or avoided through comprehensive pre-launch testing.”
Establish a continuous testing process
Munroe and Sheehan agree that after an AI system is live, the work does not stop. Models can degrade over time, especially as they’re exposed to new data or changing user needs. Continuous testing and fine-tuning are absolutely essential to keep an AI system performing at a high level.
“Think of it almost like maintaining a car. Just because it runs well today doesn’t mean that you can ignore it. Regular maintenance is key to keep things running smoothly and ensuring, obviously, you’re meeting your users’ expectations,” Munroe said. “Overall, it’s important to retrain your model based on new data and assess how it adapts to that new data. And as you continue to train, fine-tune, and safeguard your system through the necessary post-release testing and the AI programs that I discussed earlier, that is the key to creating a high quality AI system.“
Operationalize AI to ensure quality and consistency
Sheehan closed out the webinar with guidance for organizations looking for ways to implement a comprehensive AI training and testing program. He shared three practical tips.
Start with a quality and a risk framework. Fine-tuning, model evals and red teaming all rely on some form of quality framework and some form of risk framework. Start by documenting the risks that you’re concerned about for your particular use case or use cases? How are you going to assess quality? “That becomes the guiding light, the guiding framework, as you start to operationalize this through your development process,“ Sheehan said. He emphasized that these should be living documents which are regularly reviewed and updated.
Decide how to fit these testing and quality procedures into your SDLC. Sheehan pointed out that organizations have the ability to conduct fine-tuning, model evaluations and red teaming whenever they’re touching a Gen AI system. “You want to be able to build that capability, both internally and externally, and have stage gates, much like you have in an SDLC,” Sheehan said.
Conduct ongoing, real-world testing to optimize the model and improve user experience. “You want to be able to get a larger group of users into your system for an extended period of time,” Sheehan said. He recommends having testers use the system for several weeks as they would in the real world. “It’s going to uncover a lot. It’s going to uncover not only bugs, but you’re going to get useful feedback. You’re also going to understand all of the things that you did in fine-tuning and red teaming and system prompts. Are there edge cases? Did I think of everything? How is this actually going to behave in the real-world?”