
The State of Digital Quality in Testing AI 2026

2026 Annual Report

See the latest trends in assessing AI apps and features and learn the techniques and tactics that set successful teams apart.

    AI is everywhere. It’s taken over search engine results, customer service interactions, meeting minutes, call transcripts, report generation and all sorts of other tasks – with no signs of slowing. Organizations are in a rush to release AI features and capitalize on the efficiencies they can create. But AI is scaling faster than organizations’ ability to test it — creating a widening ‘quality gap’ that is driving costly rollbacks. This year’s State of Digital Quality in Testing AI looks at where organizations are succeeding and where they fall short when it comes to releasing AI features at scale.

    Moving from Proof of Concept to Production

    Last year, 72.3% of the software development, testing, and product professionals who responded to Applause’s AI survey indicated that their organization was working on AI applications or features. This year’s survey found that 54.5% have already released AI features. While this demonstrates strong progress, it’s only part of the story – 44.1% have deactivated live AI features in the last year because the operational costs outweighed user value.

    AI Releases and Rollbacks

    54.5% have already released AI features

    44.1% have deactivated AI features because cost outweighed value

    Of the 1,000+ survey respondents working in software development, QA, data science, AI research and product management, 40.3% report that more than half of their organizations’ AI initiatives have reached full-scale production.

    What percentage of your AI initiatives have successfully graduated from POC to full-scale production?

    Less than 10% | 8.2%
    10% to 25% | 18.6%
    26% to 50% | 25.3%
    51% to 75% | 21.6%
    More than 75% | 18.7%
    I'm not sure | 7.6%

    Distribution of AI initiatives successfully progressing from proof of concept (POC) to full-scale production. The largest share of respondents (25.3%) report that 26% to 50% of their initiatives reach production, followed by 21.6% reporting 51% to 75%. Smaller groups report more than 75% (18.7%) or 10% to 25% (18.6%) success rates. Fewer respondents report less than 10% (8.2%), and 7.6% are unsure.

    n=1,097

    Though there are many reasons projects fail to move beyond POC, integration challenges and costs are the most common. “Right now, many teams are struggling with internal workflows or finding that their data isn’t in the proper format for an agent,” said Chris Sheehan, Applause’s EVP, High Tech and AI. “In some cases, we’ve seen customers opt not to release agentic AI features or Gen AI chatbots because testing showed they’re not ready. It doesn’t mean that things are bad, it’s just the natural state of the technology as teams are trying to rapidly build and train AI. There’s a steep learning curve to get tools to accurately understand nuance and correctly interpret user intent, especially when there are multiple layers of context and requirements.”

    An example: one credit card provider wanted to roll out a dining recommendation chatbot for customers. The company ran into difficulties trying to move from a structured system to natural language processing. The chatbot often ignored or misclassified the requested cuisine or misunderstood contextual cues such as “fancy,” “breakfast,” or “romantic.” Ultimately, the organization opted to delay the chatbot’s release based on feedback from Applause’s tester community.

    Main reason projects fail to move past POC

    Integration
    Cost
    Security
    Hallucinations

    n=1,967

    Many teams early in the AI journey lack the skills to develop and test the technology in-house. “In the AI space, one of the prohibitive factors that some companies are facing is just lack of experience, lack of expertise, when it comes to things like red teaming and ensuring that you're launching not only a high quality product, but a safe and secure product," said Chris Munroe, Vice President, AI Programs.

    Top Concerns and Challenges in Developing Safe and Reliable AI

    Sourcing enough high-fidelity, fit-for-purpose training data | 17.8%
    Preventing bias, toxicity and hallucinations | 16.4%
    Identifying safety vulnerabilities | 12.9%
    Fine-tuning the model | 10.5%
    Creating strong user experiences | 9.8%

    n=1,112

    Once projects reach production, teams must continue to monitor and fine-tune AI behavior as the model shifts over time.

    “Companies can ensure sustained reliability of their evolving AI models by making sure that they are implementing continuous feedback loops.”

    Chris Munroe
    Vice President of AI Programs, Applause

    How Teams are Testing AI

    Most teams are using a variety of techniques to test AI, blending humans, automation, and AI itself. Organizations have had to evolve their testing strategies to go beyond traditional QA to assess the risks inherent in non-deterministic systems.

    To achieve thorough test coverage for AI tools, teams are increasingly turning to automation and relying on AI to write test scripts. Adonis Celestine, Senior Director, Automation Practice Leader at Applause, cautions test teams not to exclusively rely on AI to develop automation. “One of the main objectives of having a human tester is that they will also validate the requirements,” he said. “A good tester will ask the implications and risks of building a feature, and now using AI to test we’re often bypassing that validation of requirements and automating based on a misunderstanding or something that is incorrect – creating tests for things that don’t really matter.”

    Techniques teams are using to test AI

    n=1,215

    Human feedback is essential to train models for focused tasks with domain-specific knowledge – particularly in areas where nuance and compliance matter. Reinforcement Learning from Human Feedback (RLHF) teaches models to be helpful, honest, and harmless. When humans define the boundaries of appropriate content, teams reduce the risk of AI generating biased or harmful material.
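    As a rough illustration of the RLHF idea above — the record, scores, and reward values are all hypothetical — a reward model is trained so that the response a human labeler preferred scores higher than the rejected one, using a Bradley-Terry style pairwise loss:

```python
# Toy illustration of RLHF's pairwise preference signal (hypothetical data).
# A reward model should score the human-chosen response above the rejected one;
# the Bradley-Terry loss below penalizes it when it does not.
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): near 0 when the model
    already ranks the preferred response higher, large otherwise."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

# One hypothetical preference record from a human labeler.
record = {
    "prompt": "Recommend a romantic dinner spot.",
    "chosen": "Try a quiet candlelit bistro; book a window table.",
    "rejected": "IDK, just search online.",
}

print(round(preference_loss(2.0, -1.0), 3))   # preference respected: low loss
print(round(preference_loss(-1.0, 2.0), 3))   # preference violated: high loss
```

    Minimizing this loss over many such human-labeled pairs is what pushes the model toward the "helpful, honest, and harmless" boundaries the labelers define.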

    Fine Tuning

    Fine tuning with unique human generated prompt/response datasets
    Fine tuning with synthetic datasets

    Evaluations and Safety Testing

    While many teams recognize the imperative of using human testers for evaluations and safety testing, internal teams are rarely equipped to provide the breadth and depth of coverage necessary for thorough testing. Without diverse perspectives, teams may miss outputs that are culturally offensive or fail to conform to regional requirements and expectations.

    Evaluations

    Human
    Crowdtesting
    LLM as judge

    Safety Testing

    Human
    Crowdtesting
    Automated

    AI-first web testing agents
    AI-assisted functional testing (cross-platform)
    Prompt engineering/optimizing
    Human-in-the-loop (HITL) monitoring

    "While businesses say they are conducting evals, many lack the specialized methodology and statistical rigor required to make those evaluations meaningful at scale. Without the right frameworks, consistent inter-rater reliability practices, and domain expertise, teams end up with scores that feel credible but don't accurately reflect how their AI will behave in production," said Sheehan.

    Applause has built an LLM-as-judge infrastructure designed to operate at enterprise scale across the full range of AI modalities — text, image, and audio — without compromising data security or evaluation integrity.

    The evaluation methodology follows four stages:

    Build the golden dataset

    Domain experts curate a golden dataset of human-validated examples that serves as the authoritative quality benchmark. This is not a one-time exercise. The dataset compounds over time, incorporating resolved edge cases and expert decisions into a durable regression testing asset.

    Generate prompts at scale

    Local models synthetically expand the golden dataset into a large, diverse volume of evaluation prompts — covering anticipated failure modes, adversarial inputs, and real-world edge cases at a volume no human team could produce manually.

    Deploy a multi-model jury

    A minimum of three independent frontier models from different model families evaluate each output in parallel. Using models from distinct providers, as opposed to a single vendor, prevents monoculture bias and ensures no single model's blind spots dominate the results. Before a result is recorded, 98% of outputs receive a second confirmatory review.

    Validate with human experts

    Domain specialists audit a statistically sampled percentage of results, deliberately oversampling from cases where the AI judges disagree. These are the decision boundaries where domain authority is irreplaceable — and where the golden dataset grows with every resolution.

    The architecture reflects a foundational design principle: AI judges deliver scale and consistency, but human experts anchor the ground truth. The question for enterprise teams is not whether to use AI in evaluation — it is whether their methodology is rigorous enough to trust the results it produces.

    This process itself serves as an example that even AI testing AI still requires human judgment at key points in the process.
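    The jury-and-escalation stages above can be sketched as follows. This is a minimal illustration, not Applause's implementation: the `judges` here are hypothetical stand-ins for calls to three frontier models from different providers.

```python
# Illustrative sketch of a multi-model jury with human-review escalation.
from collections import Counter

def jury_verdict(output, judges):
    """Each judge independently labels the output; the majority label wins.
    Split decisions are flagged for human domain-expert review."""
    votes = [judge(output) for judge in judges]
    label, count = Counter(votes).most_common(1)[0]
    # Disagreements mark the decision boundaries that human experts audit
    # first -- and whose resolutions later grow the golden dataset.
    return {"label": label, "votes": votes,
            "needs_human_review": count < len(votes)}

# Hypothetical judges: in practice, three models from distinct providers.
judges = [
    lambda out: "pass" if "refund" in out else "fail",
    lambda out: "pass" if len(out) > 20 else "fail",
    lambda out: "pass",
]

verdict = jury_verdict("Your refund was issued and will arrive soon.", judges)
print(verdict["label"], verdict["needs_human_review"])  # unanimous: no escalation
```

    Using judges from different model families is what prevents the monoculture bias described above: a blind spot shared by one vendor's models is unlikely to be shared by all three.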

    How has your AI testing budget changed in the last 12 months compared to your AI development budget?

    Testing budget grew faster than dev | 29.6%
    Both stayed flat/increased at the same rate | 28.7%
    Dev budget grew faster than testing | 22.9%
    We have no dedicated AI testing budget | 9.9%
    I'm not sure | 8.8%

    n=1,097

    Distribution of changes in AI testing budgets relative to AI development budgets over the past 12 months. The largest share of respondents (29.6%) report that testing budgets grew faster than development, followed closely by 28.7% indicating both budgets stayed flat or increased at the same rate. Meanwhile, 22.9% say development budgets grew faster than testing. Smaller groups report having no dedicated AI testing budget (9.9%), and 8.8% are unsure.

    As teams work to release AI features that deliver ROI, crowdtesting can help provide access to the specific skills, knowledge and datasets they need to produce quality outcomes.

    Building Your AI Testing Toolkit

    Learn the critical elements for a comprehensive AI testing plan. Experts share how to reduce risk, validate safety and create intuitive user experiences.


    Where Humans Fit in the AI Training and Testing Process

    Applause consistently emphasizes the role of diverse, high-quality, fit-for-purpose data as the foundation for training safe, reliable AI applications. Munroe said, “training is at the heart of what leads to a quality AI model, and that really comes down to creating massive, diverse, high quality data sets. That is what all the AI frontier model companies have been focused on for the last seven years or so. Now that Gen AI has taken center stage over the last two to three years, it's all about fine-tuning these models for specific business cases and business purposes.”

    Fine-tuning models for specific business use cases relies on expert insight. One of Applause’s fintech customers has developed AI-powered month-end dashboards to provide meaningful insights to CFOs and boards. “These are the types of intelligence which typically would take a small team of accountants days, if not weeks, to produce,” Munroe said. “This company is able to glean those insights as to why revenue or margins or profitability has moved one way or the other literally in minutes.” Members of Applause’s testing community, including 30 different CFOs, provided in-depth expertise that showed the fintech where the model performed well and where it needed further training and tuning.

    “The human part in the equation is essential for the models to move from pre-training all the way through to post-training and then actually to be useful in deployment.”

    Chris Sheehan
    EVP of High Tech and AI, Applause

    In these early days of aligning AI to specific business cases, most teams are still relying on human oversight. Less than a quarter of survey respondents indicated that they’re developing AI systems that operate independently, with almost no human intervention. Fully agentic workflows are not on the roadmap for many organizations.

    How much autonomy does the AI system you’re developing have?

    None: AI acts as a suggestion engine, human takes all action
    Partial: AI automates workflows but requires human oversight or approval for key steps
    High: AI operates independently with human intervention only in circumstances of absolute failure

    n=1,111

    While safety testing is crucial for most types of AI, not all teams are conducting red team testing, and some who do rely on the original developers to identify vulnerabilities in their own work — a process that can miss critical flaws.

    Who primarily performs red teaming or safety evaluations for your AI features?

    Internal QA team
    Original developers
    Internal security team
    External third party testers

    n=607

    An effective red teaming strategy relies on both breadth and depth: testing with a blend of generalists and domain experts. Sheehan explained that while experts go deep to identify domain-specific harms and inaccuracies, the generalists reflect the chaos and the unpredictability that's out there in the real world. “We use a breadth and depth approach so we pick up the full wide range of risks and harms that the model may have. We always advise when putting together a red team strategy that you should use both cohorts of testers to fully test an application.”

    What High-Performing AI Teams Do Differently

    Incorporate continuous evaluation loops

    To bridge the trust gap and deliver enterprise-grade reliability, teams must include independent, human-led evaluation throughout the SDLC.

    Use hybrid human + AI testing models

    While AI testing AI improves speed and scale, human validation is essential to uncover bias, parse nuance and validate user experiences.

    Involve domain experts

    Teams that rely on subject matter experts to validate model accuracy and ensure they are fine-tuned for specific business use cases launch high-quality, trustworthy applications that deliver real value to users.

    Include structured red teaming to reduce risk

    Ongoing risk assessments that include both generalist and expert testers help ensure safe, secure enterprise applications.

    Focus on cost-aware deployment strategies

    To avoid costly delays and rollbacks, smart AI teams consider all costs early in the product design stage and look for opportunities to make optimized decisions around design and development without sacrificing quality.

    Report Methodology

    In February and March 2026, Applause conducted a survey of members of the uTest community as well as other software development, QA, product, AI and data science professionals, with the following goals:

    • Understand what types of AI features, applications and experiences organizations are prioritizing
    • Learn how organizations are training and testing their AI features, applications and experiences

    We also conducted interviews with technology leaders.
