ODSC East 2024 at Hynes Convention Center in Boston.

Enthusiasm and Guardrails Around AI Adoption

BOSTON — There’s plenty to stoke the creative and intellectual fires of data science and engineering professionals, with near-daily AI and cloud services innovations enabling more power and sophistication.

The Open Data Science Conference (ODSC) East 2024 in April showcased some of these unique possibilities. Speaker sessions included topics ranging from scaling data quality in real time at Netflix to digital transformation within the game-day stadium operations for the Super Bowl champion Kansas City Chiefs. Each speaker track brought a unique lens to modern data science, engineering and analytics possibilities.

The palpable enthusiasm was tempered by the occasional word of caution — sometimes blunt caution.

“Who is this going to f—?” asked Cathy O’Neil, CEO of O’Neil Risk Consulting and Algorithmic Auditing and author of the New York Times best-seller Weapons of Math Destruction. Her question implies that algorithmic systems discriminate by nature; the fairness, or lack thereof, of that discrimination must be scrutinized to avoid societal harm.

She used a simple, if discomforting, metaphor to explain the necessity of human intervention to mitigate the harms of algorithmic decisions. “I want you to think about going into an airplane [cockpit] and looking left and realizing, ‘They’re flying this plane with no pilot.’ That’s how I think most algorithms are being flown. We wouldn’t let that happen to us [as passengers].”

In her keynote address on the third day of the conference, O’Neil explained that the challenge in rooting out biases and inaccuracies in predictive models is that those models are designed to discern. When those models are trained on discriminatory data, they use it indiscriminately, carrying on the proverbial sins of the father and drawing potentially harmful conclusions. Stakeholder success criteria, which are business-oriented and simply defined, insufficiently account for harms against protected classes, which is why too many harms go unchecked.

To address the growing AI bias challenge, O’Neil recommends comprehensive risk assessments and audits to identify where products cause harm or fail to meet regulatory guidelines. Algorithmic audits fall into three primary buckets, depending on their cause and purpose: adversarial audits, invitational audits and third-party audits.

O’Neil created an ethical matrix to identify which individuals a system fails, with rows as stakeholders (such as customers, Black customers, customers with disabilities, etc.) and columns as concerns (accuracy, usability, etc.). These matrices use a color-coded system to make it clear where concerns arise. The information that populates these problem areas comes directly from conversations with stakeholders.
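
For readers who want a concrete picture, the sketch below represents such a matrix as a small stakeholders-by-concerns table in pandas. The groups, concerns and flags are hypothetical; in practice, the entries come from those stakeholder conversations rather than a formula.

```python
# A minimal sketch of an ethical matrix: rows are stakeholder groups, columns
# are concerns, and each cell holds a qualitative, color-coded flag.
# All groups, concerns and flags below are hypothetical examples.
import pandas as pd

stakeholders = ["all customers", "Black customers", "customers with disabilities"]
concerns = ["accuracy", "usability", "time to decision"]

matrix = pd.DataFrame("green", index=stakeholders, columns=concerns)

# Flags are set from stakeholder interviews, not computed automatically.
matrix.loc["Black customers", "accuracy"] = "yellow"                   # hypothetical concern
matrix.loc["customers with disabilities", "time to decision"] = "red"  # hypothetical harm

print(matrix)
```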

One of many use cases involved a disability insurance company, in which the audit examined areas such as claim approval time, length of initial claim, number of extensions and more. The audit process involved inferring protected-class status to categorize the potential harm, then measuring outcomes for those categories. The auditors then dig through the data, looking for significant differences in outcomes that cannot be explained by acceptable model discernment or by the nature of the claim itself; the duration of a pregnancy claim, for example, might differ quite a bit from that of a long-term injury. These inconsistencies and issues are presented to organizations for remediation.
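
As a rough illustration of that comparison step, the sketch below groups claims by an inferred protected-class category and compares a single outcome while holding claim type fixed. The column names and values are hypothetical, not drawn from the audit O’Neil described.

```python
# A hedged sketch of the audit's outcome-comparison step. Column names and
# values (inferred_group, claim_type, days_to_approval) are hypothetical.
import pandas as pd

claims = pd.DataFrame({
    "inferred_group":   ["A", "A", "B", "B", "B", "A"],
    "claim_type":       ["injury", "pregnancy", "injury", "injury", "pregnancy", "injury"],
    "days_to_approval": [12, 30, 11, 14, 55, 13],
})

# Compare outcomes across inferred groups while controlling for claim type,
# so that differences explained by the claim itself (pregnancy vs. long-term
# injury, say) are not mistaken for algorithmic harm.
summary = claims.groupby(["claim_type", "inferred_group"])["days_to_approval"].mean()
print(summary)
```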

The whole concept might be a foreign or uncomfortable one for some data scientists, who must rely on qualitative, human-led feedback, rather than a metric pointing to a course of action.

“This is not a math formula; this is a legal and ethical question that is up to lawyers to negotiate among themselves,” O’Neil said. “Explainable fairness is extremely narrowly constructed. The whole point of it is to separate the math and the data science, which is the easy part, from the hard part, which is public policy and legal theories.”

O’Neil mentioned that calls with many clients start with good intentions only to be shut down in a second call at the legal level, with many companies preferring a blind eye to a black eye. However, as regulations change and the demand for AI visibility grows, she cautions that the value of an audit and remediation will eventually outweigh the risk, and organizations should not close the door on the concept.

Modeling LLM efficiency

The term AI conjures up many different images and interpretations, apocalyptic or otherwise. That’s owed to the origin of the term itself, coined in 1956 largely as a marketing hook to attract funding for a research project. In the roller-coaster ride of hype vs. reality since, the success of AI projects has been determined by popularity more than capability.

“The success of AI is dependent upon people’s impression of it,” said Dr. William Streilein, Chief Technology Officer for the Chief Digital and Artificial Intelligence Office of the U.S. Department of Defense, in his keynote address. “There’s a lot of promise, and you have to believe in it to try it out and do it in an auditable and also responsible way.”

The malleable nature of that AI interpretation over time leads to success ultimately being tied to reliable, usable and responsible approaches. “We are the bearers of responsible AI principles at the DoD,” he said.

While some areas of the department are more AI-enabled than others — enterprise operations, for example, rather than warfare operations — adoption challenges arise across the spectrum, including inaccessible data, siloed development and lack of domain knowledge.

To remove impediments to adoption and trust, Streilein’s organization approached the task in several ways. First, it took an Agile approach to iterative AI adoption at scale, implementing user feedback and a product-based strategy. Streilein visualized this approach as a “food pyramid of AI,” with quality data serving as the broad base, followed by governance, insightful analytics/metrics, assurance and responsible AI at the top. This approach ultimately helped the department achieve concrete business outcomes, such as removing policy barriers, improving data management and expanding digital talent management.

The DoD created Task Force Lima to accelerate adoption in accordance with these risk assessments. In addition to defining ethical principles, the task force is developing a maturity model to reconcile the promise of the technology against the risk of any individual AI project and to help the department develop formulas for productivity increases. An LLM use case rubric helps simplify this evaluation, plotting “scale” and “consequence” on separate axes to gauge the impact of model failure.
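
As a rough sketch of how such a rubric might work in practice, the function below maps qualitative scale and consequence ratings to a disposition. The labels, thresholds and example use cases are illustrative assumptions, not the task force’s actual rubric.

```python
# An illustrative scale-vs-consequence triage for LLM use cases.
# Ratings, dispositions and example use cases are assumptions for illustration.
def triage(scale: str, consequence: str) -> str:
    """Map qualitative 'low'/'high' ratings to a rough disposition."""
    if consequence == "high":
        return "proceed only with rigorous evaluation and human oversight"
    if scale == "high":
        return "strong candidate to accelerate, with monitoring"
    return "low-risk pilot"

use_cases = {
    "summarizing internal policy documents": ("high", "low"),
    "drafting operational guidance":         ("low", "high"),
}
for name, (scale, consequence) in use_cases.items():
    print(f"{name}: {triage(scale, consequence)}")
```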

“[The goal is] to accelerate the most promising [options], to evaluate the use cases for LLMs and bring generative AI into the DoD space so that we can leverage it effectively,” said Streilein, emphasizing that the goal is not to serve as a blocker for LLM use cases, but rather help accelerate the options most likely to succeed from a risk and efficiency standpoint.

End-to-end evaluation of LLM outputs

“Where are the limitations of our systems?” asked Sebastian Gehrmann, Head of NLP at Bloomberg, in his session Model Evaluation in LLM-enhanced Products. The question reinforces a helpful starting point: yes, these systems are imperfect, and they require careful exploration to reduce inconsistencies and inaccuracies.

He explained that the methods used to evaluate the fluency of generated text have largely failed to keep up with the rapid pace of LLM output today. Generative AI platforms continue to push each other and improve, but despite a host of scoring models, qualitative methods, leaderboards and Elo scores, he finds true measures of quality lacking. This comes at a time when evaluation matters more than ever in determining readiness for clients and aligning with their expectations.
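
For context on the leaderboard mechanics Gehrmann references, pairwise Elo ratings of the kind used by public LLM leaderboards update roughly as in the sketch below; the starting ratings and K-factor are illustrative defaults, not any particular leaderboard’s settings.

```python
# A minimal Elo update for pairwise model comparisons; starting ratings and
# the K-factor are illustrative, not taken from any specific leaderboard.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

model_a, model_b = 1500.0, 1500.0
model_a, model_b = elo_update(model_a, model_b, a_wins=True)
print(round(model_a), round(model_b))  # 1516 1484
```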

These systems must also align with stakeholder expectations and goals — no easy task. Rather than rely on automated methods or proxy metrics to provide an incomplete picture that placates stakeholders, success should ultimately boil down to a simple concept: trust.

“Can we trust our own systems that we’re building?” Gehrmann asked. “Can our clients trust the systems that we’re showing to them?”

Part of the problem, Gehrmann says, is that LLMs serve broad, general-purpose tasks, and methods like leaderboards cater to that generalization. Thus the need for humans, and especially subject-matter experts across various specialized fields, to validate system outputs and categorize issues.

“You do have to spend some resources on evaluation; there’s no way around that,” he said. “You can try and build yourselves leaderboards, and a lot of our work is based on leaderboards as well…but nothing really beats in-depth evaluation and investigations where people really look carefully at different inputs.”

The challenge, Gehrmann said, is getting to system-level rather than model-level performance, as a system is only as good as its weakest link. He ultimately recommended a multi-faceted approach that includes:

  • emphasizing specificity over generalization in evaluating models
  • avoiding over-reliance on off-the-shelf solutions that offer unclear measurements
  • including a metric development step in product development
  • nurturing the individual steps in the evaluation pipeline
  • prioritizing the end-to-end experience (see the sketch after this list)
  • scrutinizing best practices so as not to blindly repeat others’ mistakes.
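
A hedged sketch of the step-level plus end-to-end idea, using stand-in retrieval and generation steps and hypothetical checks (none of these functions come from Gehrmann’s talk):

```python
# A hedged sketch of evaluating individual pipeline steps alongside the
# end-to-end output. The steps, checks and query are hypothetical stand-ins.
def retrieve(query: str) -> str:
    return "context about: " + query                    # stand-in retrieval step

def generate(query: str, context: str) -> str:
    return f"Answer to '{query}', based on {context}"   # stand-in generation step

def retrieval_ok(query: str, context: str) -> bool:
    return query in context                              # step-level check

def answer_ok(query: str, answer: str) -> bool:
    return query in answer                               # end-to-end check

query = "what moved bond yields today"
context = retrieve(query)
answer = generate(query, context)
print("retrieval step:", retrieval_ok(query, context))
print("end to end:", answer_ok(query, answer))
```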

As quality is a nuanced topic, details like performance, ROI and user adoption also play a big role in product success. Feedback from real users and collaboration with product teams provide visibility into these success criteria.

Humans in the loop

For all of AI’s powerful predictive capabilities, humans ultimately determine the success of AI systems. Likewise, humans are the standard bearers for improving and validating those systems.

Josh Poduska and Peter Pham from Applause highlighted the crucial integration of human feedback in refining AI models throughout the product lifecycle in their session Overcoming the Limitations of LLM Safety Parameters with Human Testing and Monitoring. The idea of using real customers extends beyond gathering training data; it’s about infusing AI applications with nuanced, real-world human feedback at multiple stages of development.

“Introducing the crowd prior to release of a working application and model can yield insights that you wouldn’t necessarily think of when you are working on the product yourself,” said Pham, a senior program manager at Applause, who later provided examples of generative AI failures in various industries pulled right from the headlines.

The session included practical examples, such as the use of crowdsourced testing and feedback to navigate complex ethical issues and to fine-tune AI responses to align with human expectations. Crowd-based validation of LLM outputs not only mitigates risks, he said, but also elevates the utility of AI by embedding diverse perspectives directly into the technology’s development process.

Human benchmarking grounds AI technology in reality and ensures interpretable and useful outputs, but not all methods are alike. Reinforcement learning from human feedback (RLHF), while a cornerstone in AI development, shows limitations that necessitate greater human involvement. The duo detailed instances where RLHF could be manipulated and pointed to vulnerabilities in this approach, citing research from the University of Illinois and Stanford University that indicates a success rate for attackers as high as 95% in removing RLHF protections.

“The takeaway here is that RLHF has its limitations, and as AI evolves, you’re going to need more human involvement than just human preference in the RLHF process,” Poduska said.

A human-centric approach throughout the AI training and deployment phases can also help reduce unintended harms and biases. While collecting diverse data sets introduces operational challenges, it’s a strategic, regulatory and ethical necessity to do right by your customers. A proactive approach to ethics goes a long way toward achieving this goal, by involving diverse stakeholder groups early and often throughout the AI development process. But the massive scale of this task is still a challenge for many businesses.

“When it comes to training data, diversity is usually the most important factor,” Pham said. “Many large companies are able to add a so-called ‘human touch’ using only their employees. However, even though some of them have a workforce comparable to a small country’s population, they are still not diverse enough to meet the needs of their product teams.”

The pair argues that this collaborative approach between AI and crowdtesters will be critical in steering the development of AI technologies toward accurate yet inclusive outcomes.
