Testing AI in 2026: Progress, Priorities and Plateaus
Over the last few years, rapid innovations in generative AI and agentic AI have promised to reshape the world. AI has become ubiquitous across workplace tools, recommendation engines, and customer support chatbots. Teams across all industries are working to embed AI into backend systems that power millions of transactions every day to make businesses better, faster and more profitable. Some organizations have made massive strides in rolling out AI applications in a short time. Most, however, are still working to put the people, tools, and processes in place that will set them up for success.
See key findings from this year’s State of Digital Quality in Testing AI and learn what sets successful teams apart.
Moving AI applications from POC to production
In our recent survey of software development, QA, data science, AI research and product management professionals, more than half of respondents (54.5%) reported that their organization has already released AI features. Chatbots and generative AI are the most common types of AI features in development:
What types of AI applications or features is your organization developing?
- Chatbots/customer support tools: 60.9%
- Generative AI (text/code/image): 60.3%
- Image recognition: 38.8%
- Predictive analytics: 34.2%
- Personalization: 32.7%
n=1,636
While the number of teams releasing AI features is promising, 44.1% of respondents said their organization has deactivated live AI features in the last year because the operational costs outweighed the user value. In other cases, testing has revealed that AI features aren’t ready for release, delaying deployment. So what’s preventing teams from getting it right the first time?
In many cases, AI tools lose context over the course of complex, multi-turn conversations. This was a common complaint when we conducted UX research on AI-assisted shopping a few months ago. Hallucinations are still rampant as well — 40.2% of more than 4,600 consumer survey respondents indicated they had encountered an AI hallucination since January 1 this year.
Alex Waldmann, Vice President, Applause, has encountered many examples of AI failures in the wild. As a result, he encourages organizations to adopt more rigorous testing protocols prior to release to avoid damaging customer trust, particularly for generative and agentic AI. “It is still commonplace that AI systems produce plausible-sounding hallucinations. With agentic AI able to take action on behalf of a customer, it is even more crucial that systems are tested in depth continuously with real users. Without proper fine-tuning and user testing, these systems can behave erratically and may harm your brand,” he said.
“If you are a product manager, for example, trying to drive towards a board metric of adding something with AI to check that box, you may negatively affect customer perception of your brand because the AI might not hit the correct tone or offer the experience that your customers expect,” Waldmann said. “Thinking of the Kano model helps companies understand that the level of novelty that customers now expect to be delighted is different than turning an FAQ into a chatbot. As more companies adopt the same systems, they all begin to sound the same. Without thorough testing, organizations can lose control and part of their identity and customer relationship.”
Adopting new testing techniques to validate nondeterministic outputs
Applause’s EVP of AI and High Tech Chris Sheehan says that many teams currently face resource bottlenecks and lack the talent and technical maturity required to train, test, and tune models on an ongoing basis. “The frontier labs and hyperscalers are dealing with data volumes that far surpass human capacity — this is leading to an overreliance on automation that hits computational limits without providing the necessary refinement.” On the enterprise side, there’s a lack of expertise in executing evaluations and red teaming, as well as tool immaturity. “Existing AI stack frameworks for observability, evals, and the like often provide low-resolution, low-fidelity data that is unreliable for high-stakes decisions,” Sheehan said.
To fill gaps, teams are turning to AI and automation: 32.8% of survey respondents indicated that they use AI-assisted functional testing to assess their AI features. Those tools, however, also demand up-front investments and ongoing attention to ensure they work as intended and find defects, rather than just enabling organizations to churn out flawed code faster. While AI and automation can help with execution, many teams are trusting AI to develop strategy and write tests, which typically results in massive gaps in coverage. Finding — and implementing — the optimal blend of testing techniques poses a challenge for organizations already lacking AI talent.
“That's kind of the double standard when you're working with AI: you want to leverage AI as much as possible, but you can't rely solely on AI to test. When you do that, that's when you find biases.”
Daniela Silber, Technical Project Manager, IML/Red Teaming, Applause
Validating AI’s nondeterministic outputs requires organizations to expand beyond the typical QA pass/fail mindset and traditional UX research. Techniques like evals rely on domain expertise and an understanding of nuance, while red teaming calls for creativity and unpredictability. AI and automation can play a role, but humans are essential in setting the parameters and criteria for these types of tests.
As an example, Applause recently performed evaluations on a shopping agent for a luxury retailer. The Applause team worked with the retailer to develop an evaluation framework built around six factors:
- Accuracy: Is the information factually correct?
- Relevance: Does the response directly address intent?
- Completeness: Is all essential information included?
- Clarity: Is the language concise and easy to follow?
- Personalization: Is the response tailored to the user?
- Tone & Brand Alignment: Does the response maintain role fidelity and reflect the brand?
Testers reported that while the agent maintained a polite, on-brand persona, it often failed to retain user preferences across multi-turn conversations. The agent also frequently failed to provide clear, complete, concise answers or follow-up actions, such as adding an item to the shopping cart. Beyond uncovering these critical functionality gaps, the Applause team recommended ways the retailer could improve the model with a carefully curated golden dataset for training, focused on retaining context throughout interactions and delivering complete, correct and concise responses.
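To make the framework above concrete, here is a minimal sketch of how such a rubric might be encoded for structured human scoring. All names here (CRITERIA, EvalRecord, aggregate) are hypothetical illustrations, not Applause’s actual tooling.

```python
# A minimal sketch of the six-factor rubric above, encoded for structured
# human scoring. All names are hypothetical, not Applause's actual tooling.
from dataclasses import dataclass

CRITERIA = [
    "accuracy",         # is the information factually correct?
    "relevance",        # does the response directly address intent?
    "completeness",     # is all essential information included?
    "clarity",          # is the language concise and easy to follow?
    "personalization",  # is the response tailored to the user?
    "tone_brand",       # does the response stay in role and on brand?
]

@dataclass
class EvalRecord:
    """One human rating of a single agent response, scored 1-5 per criterion."""
    conversation_id: str
    turn: int
    scores: dict[str, int]  # criterion name -> 1..5
    notes: str = ""

def aggregate(records: list[EvalRecord]) -> dict[str, float]:
    """Mean score per criterion across all rated turns, to surface weak dimensions."""
    n = max(len(records), 1)
    return {c: sum(r.scores[c] for r in records) / n for c in CRITERIA}

# Example: two rated turns from one multi-turn shopping conversation.
records = [
    EvalRecord("conv-001", 1, {"accuracy": 5, "relevance": 5, "completeness": 4,
                               "clarity": 4, "personalization": 2, "tone_brand": 5}),
    EvalRecord("conv-001", 2, {"accuracy": 4, "relevance": 3, "completeness": 2,
                               "clarity": 4, "personalization": 1, "tone_brand": 5},
               notes="Lost the size preference stated in turn 1."),
]
print(aggregate(records))  # low personalization/completeness flags context loss
```

Averaging per criterion across turns is one simple way to surface exactly the weak dimensions the testers flagged, such as personalization and completeness, while a consistently high tone score confirms the persona held.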
Incorporating human insight and feedback throughout the development lifecycle
When developing AI applications, human input is crucial at all stages of the process. Only humans can determine what constitutes the right dataset for training — and in many instances, depending on the intended purpose, provide sufficient quantities of diverse data to train the application. Our survey found that 54.4% of teams use human-generated prompt and response data sets for fine-tuning AI and 60.8% conduct evaluations with humans. Human experts are required to curate golden datasets for training and evaluate AI outputs for accuracy, toxicity, bias, and other flaws.
“The right tone and responses may look different for each client,” said Silber. “It’s important to really understand the goals and target audience to be able to develop the right data collection and testing approaches to reduce bias, offensive responses, and other outcomes that could damage a company’s reputation.”
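To make the idea of human-generated prompt and response data concrete, here is a rough sketch of how one fine-tuning record might be stored as a JSONL entry with provenance and review labels attached. The field names are assumptions for illustration, not a standard schema.

```python
# A rough, hypothetical sketch of one human-generated prompt/response record
# for fine-tuning, stored as a JSONL line. Field names are illustrative only.
import json

record = {
    "prompt": "Do you have this blazer in petite sizes?",
    "response": ("Yes, the blazer is available in petite sizes 0-12. "
                 "Would you like me to check stock in your usual size?"),
    "metadata": {
        "locale": "en-US",
        "persona": "luxury-retail-concierge",  # the tone/brand target
        "reviewed_by": "human-rater-042",      # provenance: written and vetted by a person
        "labels": {"toxicity": "none", "bias": "none"},
    },
}

# Append the record to a growing fine-tuning dataset, one JSON object per line.
with open("finetune_seed.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```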
Examples of where human input determines AI success at all stages of development:
- Problem definition and scoping: Product leaders and engineers still define the business problem, set objectives (KPIs), determine what type of AI is appropriate, and assess feasibility.
- Data acquisition and preparation: Human teams establish data requirements, including the volume and diversity of artifacts needed to adequately train the model. They also set parameters for data quality including how to handle missing, duplicate, or low-fidelity records, and prepare datasets for training.
- Model design and development: Engineers choose the appropriate algorithms and techniques and train the model to learn patterns.
- Model evaluation and validation: Humans curate the golden dataset, establish what good looks like, and evaluate characteristics that are hard to score automatically, such as tone and user experience. For teams using LLM-as-judge pipelines, humans conduct error analysis (see the sketch after this list), and human experts remain essential for behavioral boundary mapping.
- Model deployment: In our survey, 46.5% of respondents stated they rely on human sentiment and usability feedback to determine whether an AI feature is production ready.
- Monitoring and maintenance: Humans must establish a plan for continuously monitoring the model for accuracy and performance degradation (model drift), and provide updates as necessary: 30.7% of survey respondents reported that they use human-in-the-loop (HITL) monitoring.
While AI can help with scale, humans set the scope and strategy at each stage.
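As a sketch of the LLM-as-judge error analysis mentioned in the evaluation stage above, one common pattern is to have a judge model score every response, then route all failures plus a random sample of passes to human reviewers. The function and field names below are hypothetical, and call_judge_model is a placeholder for whatever model API a team uses.

```python
# A hedged sketch of LLM-as-judge triage: a judge model scores every response,
# and humans re-review every failure plus a random sample of passes.
# call_judge_model is a placeholder, not a real API.
import random

def call_judge_model(prompt: str, response: str) -> int:
    """Placeholder for a judge-model call that returns a 1-5 quality score."""
    raise NotImplementedError("wire this up to your model provider")

def triage_for_human_review(items, sample_rate=0.10, fail_threshold=3, seed=7):
    """Route all judge failures, plus a random slice of passes, to human review.

    Reviewing every failure bounds the cost of false negatives; sampling
    passes is what catches the judge's own blind spots (false positives).
    """
    rng = random.Random(seed)
    return [
        item for item in items
        if item["judge_score"] < fail_threshold or rng.random() < sample_rate
    ]

# Example, with judge scores assumed to be precomputed via call_judge_model.
items = [
    {"id": "r1", "judge_score": 5},
    {"id": "r2", "judge_score": 2},  # judge failure: always goes to a human
    {"id": "r3", "judge_score": 4},
]
print([i["id"] for i in triage_for_human_review(items)])
```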
How crowdtesting can support quality AI development
Organizations that lack the skills required to design and deploy effective evaluations for their AI features can benefit from working with a crowdtesting partner with deep AI expertise. From collecting high-fidelity, fit-for-purpose data at scale to red teaming, standing up automation and executing expert evals, Applause can help.
State of Digital Quality in Testing AI 2026
Read the State of Digital Quality in Testing AI report to learn what sets high-performing AI teams apart.