The State of Digital Quality in Testing AI 2026

2026 Annual Report

See the latest trends in assessing AI apps and features and learn the techniques and tactics that set successful teams apart.

AI is everywhere. It’s taken over search engine results, customer service interactions, meeting minutes, call transcripts, report generation and all sorts of other tasks – with no signs of slowing. Organizations are in a rush to release AI features and capitalize on the efficiencies they can create. But AI is scaling faster than organizations’ ability to test it — creating a widening ‘quality gap’ that is driving costly rollbacks. This year’s State of Digital Quality in Testing AI looks at where organizations are succeeding and where they fall short when it comes to releasing AI features at scale.

Moving from Proof of Concept to Production

Last year, 72.3% of the software development, testing, and product professionals who responded to Applause’s AI survey indicated that their organization was working on AI applications or features. This year’s survey found that 54.5% have already released AI features. While this demonstrates strong progress, it’s only part of the story – 44.1% have deactivated live AI features in the last year because the operational costs outweighed user value.

AI Releases and Rollbacks

have already released AI features

have deactivated AI features because cost outweighed value

Of the 1,000+ survey respondents working in software development, QA, data science, AI research and product management, 40.3% report that more than half of their organizations’ AI initiatives have reached full-scale production.

What percentage of your AI initiatives have successfully graduated from POC to full-scale production?

Distribution of AI initiatives successfully progressing from proof of concept (POC) to full-scale production. The largest share of respondents (25.3%) report that 26% to 50% of their initiatives reach production, followed by 21.6% reporting 51% to 75%. Smaller groups report more than 75% (18.7%) or 10% to 25% (18.6%) success rates. Fewer respondents report less than 10% (8.2%), and 7.6% are unsure.

n=1,097

Though there are many reasons projects fail to move beyond POC, integration challenges and costs are the most common. “Right now, many teams are struggling with internal workflows or finding that their data isn’t in the proper format for an agent,” said Chris Sheehan, Applause’s EVP, High Tech and AI. “In some cases, we’ve seen customers opt not to release agentic AI features or Gen AI chatbots because testing showed they’re not ready. It doesn’t mean that things are bad, it’s just the natural state of the technology as teams are trying to rapidly build and train AI. There’s a steep learning curve to get tools to accurately understand nuance and correctly interpret user intent, especially when there are multiple layers of context and requirements.”

An example: one credit card provider wanted to roll out a dining recommendation chatbot for customers. The company ran into difficulties trying to move from a structured system to natural language processing. The chatbot often ignored or misclassified the requested cuisine or misunderstood contextual cues such as “fancy,” “breakfast,” or “romantic.” Ultimately, the organization opted to delay the chatbot’s release based on feedback from Applause’s tester community.

Main reason projects fail to move past POC

Integration

Cost

Security

Hallucinations

n=1,967

Many teams early in the AI journey lack the skills to develop and test the technology in-house. “In the AI space, one of the prohibitive factors that some companies are facing is just lack of experience, lack of expertise, when it comes to things like red teaming and ensuring that you're launching not only a high quality product, but a safe and secure product," said Chris Munroe, Vice President, AI Programs.

Top Concerns And Challenges In Developing Safe and Reliable AI

Sourcing enough high-fidelity, fit-for-purpose training data

17.8%

Preventing bias, toxicity and hallucinations

16.4%

Identifying safety vulnerabilities

12.9%

Fine-tuning the model

10.5%

Creating strong user experiences

9.8%

n=1,112

Once projects reach production, teams must continue to monitor and fine-tune AI behavior as the model shifts over time.

“Companies can ensure sustained reliability of their evolving AI models by making sure that they are implementing continuous feedback loops.”

Chris Munroe

Vice President of AI Programs, Applause

How Teams are Testing AI

Most teams are using a variety of techniques to test AI, blending humans, automation, and AI itself. Organizations have had to evolve their testing strategies to go beyond traditional QA to assess the risks inherent in non-deterministic systems.

To achieve thorough test coverage for AI tools, teams are increasingly turning to automation and relying on AI to write test scripts. Adonis Celestine, Senior Director, Automation Practice Leader at Applause, cautions test teams not to exclusively rely on AI to develop automation. “One of the main objectives of having a human tester is that they will also validate the requirements,” he said. “A good tester will ask the implications and risks of building a feature, and now using AI to test we’re often bypassing that validation of requirements and automating based on a misunderstanding or something that is incorrect – creating tests for things that don’t really matter.”

Techniques teams are using to test AI

n=1,215

Human feedback is essential to train models for focused tasks with domain-specific knowledge – particularly in areas where nuance and compliance matter. Reinforcement Learning from Human Feedback (RLHF) teaches models to be helpful, honest, and harmless. When humans define the boundaries of appropriate content, teams reduce the risk of AI generating biased or harmful material.

Fine tuning

Fine tuning with unique human generated prompt/response datasets

Fine tuning with synthetic datasets

Evaluations and Safety Testing

While many teams recognize the imperative of using human testers for evaluations and safety testing, internal teams are rarely equipped to provide the breadth and depth of coverage necessary for thorough testing. Without diverse perspectives, teams may miss outputs that are culturally offensive or fail to conform to regional requirements and expectations.

Evaluations

Human

Crowdtesting

LLM as judge

Safety Testing

Human

Crowdtesting

Automated

AI-first web testing agents

AI-assisted functional testing (cross-platform)

Prompt engineering/optimizing

Human-in-the-loop (HITL) monitoring

"While businesses say they are conducting evals, many lack the specialized methodology and statistical rigor required to make those evaluations meaningful at scale. Without the right frameworks, consistent inter-rater reliability practices, and domain expertise, teams end up with scores that feel credible but don't accurately reflect how their AI will behave in production," said Sheehan.

Applause has built an LLM-as-judge infrastructure designed to operate at enterprise scale across the full range of AI modalities — text, image, and audio — without compromising data security or evaluation integrity.

The evaluation methodology follows four stages:

Build the golden dataset

Domain experts curate a golden dataset of human-validated examples that serves as the authoritative quality benchmark. This is not a one-time exercise. The dataset compounds over time, incorporating resolved edge cases and expert decisions into a durable regression testing asset.

Generate prompts at scale

Local models synthetically expand the golden dataset into a large, diverse volume of evaluation prompts — covering anticipated failure modes, adversarial inputs, and real-world edge cases at a volume no human team could produce manually.

Deploy a multi-model jury

A minimum of three independent frontier models from different model families evaluate each output in parallel. Using models from distinct providers, as opposed to a single vendor, prevents monoculture bias and ensures no single model's blind spots dominate the results. Before a result is recorded, 98% of outputs receive a second confirmatory review.

Validate with human experts

Domain specialists audit a statistically sampled percentage of results, deliberately oversampling from cases where the AI judges disagree. These are the decision boundaries where domain authority is irreplaceable — and where the golden dataset grows with every resolution.

The architecture reflects a foundational design principle: AI judges deliver scale and consistency, but human experts anchor the ground truth. The question for enterprise teams is not whether to use AI in evaluation — it is whether their methodology is rigorous enough to trust the results it produces.

This process itself serves as an example that even AI testing AI still requires human judgment at key points in the process.

How has your AI testing budget changed in the last 12 months compared to your AI development budget?

n=1,097

Distribution of changes in AI testing budgets relative to AI development budgets over the past 12 months. The largest share of respondents (29.6%) report that testing budgets grew faster than development, followed closely by 28.7% indicating both budgets stayed flat or increased at the same rate. Meanwhile, 22.9% say development budgets grew faster than testing. Smaller groups report having no dedicated AI testing budget (9.9%), and 8.8% are unsure.

As teams work to release AI features that deliver ROI, crowdtesting can help provide access to the specific skills, knowledge and datasets they need to produce quality outcomes.

Building Your AI Testing Toolkit

Learn the critical elements for a comprehensive AI testing plan. Experts share how to reduce risk, validate safety and create intuitive user experiences.

Read the Transcript

Where Humans Fit in the AI Training and Testing Process

Applause consistently emphasizes the role of diverse, high-quality, fit-for-purpose data as the foundation for training safe, reliable AI applications. Munroe said, “training is at the heart of what leads to a quality AI model, and that really comes down to creating massive, diverse, high quality data sets. That is what all the AI frontier model companies have been focused on for the last seven years or so. Now that Gen AI has taken center stage over the last two to three years, it's all about fine-tuning these models for specific business cases and business purposes.”

Fine-tuning models for specific business use cases relies on expert insight. One of Applause’s fintech customers has developed AI-powered month-end dashboards to provide meaningful insights to CFOs and boards. “These are the types of intelligence which typically would take a small team of accountants days, if not weeks, to produce,” Munroe said. “This company is able to glean those insights as to why revenue or margins or profitability has moved one way or the other literally in minutes.” Members of Applause’s testing community, including 30 different CFOs, provided in-depth expertise that showed the fintech where the model performed well and where it needed further training and tuning.

“The human part in the equation is essential for the models to move from pre-training all the way through to post-training and then actually to be useful in deployment.”

Chris Sheehan

EVP of High Tech and AI, Applause

In these early days of aligning AI to specific business cases, most teams are still relying on human oversight. Less than a quarter of survey respondents indicated that they’re developing AI systems that operate independently, with almost no human intervention. Fully agentic workflows are not on the roadmap for many organizations.

How much autonomy does the AI system you’re developing have?

None: AI acts as a suggestion engine, human takes all action

Partial: AI automates workflows but requires human oversight or approval for key steps

High: AI operates independently with human intervention only in circumstances of absolute failure

n=1,111

While safety testing is crucial for most types of AI, not all teams are conducting red team testing, and some who do rely on the original developers to identify vulnerabilities in their own work — a process that can miss critical flaws.

Who primarily performs red teaming or safety evaluations for your AI features?

Internal QA team

Original developers

Internal security team

External third party testers

n=607

An effective red teaming strategy relies on both breadth and depth: testing with a blend of generalists and domain experts. Sheehan explained that while experts go deep to identify domain-specific harms and inaccuracies, the generalists reflect the chaos and the unpredictability that's out there in the real world. “We use a breadth and depth approach so we pick up the full wide range of risks and harms that the model may have. We always advise when putting together a red team strategy that you should use both cohorts of testers to fully test an application.”

What High-Performing AI Teams Do Differently

Incorporate continuous evaluation loops

To bridge the trust gap and deliver enterprise-grade reliability, teams must include independent, human-led evaluation throughout the SDLC.

Use hybrid human + AI testing models

While AI testing AI improves speed and scale, human validation is essential to uncover bias, parse nuance and validate user experiences.

Involve domain experts

Teams that rely on subject matter experts to validate model accuracy and ensure they are fine-tuned for specific business use cases launch high-quality, trustworthy applications that deliver real value to users.

Include structured red teaming to reduce risk

Ongoing risk assessments that include both generalist and expert testers help ensure safe, secure enterprise applications.

Focus on cost-aware deployment strategies

To avoid costly delays and rollbacks, smart AI teams consider all costs early in the product design stage and look for opportunities to make optimized decisions around design and development without sacrificing quality.

Report Methodology

In February and March 2026, Applause conducted a survey of members of the uTest community as well as other software development, QA, product, AI and data science professionals, with the following goals:

Understand what types of AI features, applications and experiences organizations are prioritizing
Learn how how organizations are training and testing their AI features, applications and experiences

We also conducted interviews with technology leaders.

Explore Additional Digital Quality Insights

A software engineer working on an AI application

BLOG

Testing AI in 2026: Progress, Priorities and Plateaus

See highlights from Applause’s 2026 State of Digital Quality in Testing AI report

Read Now

VIDEO

Building An AI Testing Toolkit

Learn the critical elements for a comprehensive AI testing plan. Experts share how to blend AI, automation, and human testing for the best outcomes.

Watch Now

REPORT

Applause’s Testing and UX Frameworks

Learn how to improve your testing efforts across multiple dimensions

Learn More