Building Agentic AI That Works: Real-World Lessons
While agentic AI is getting a lot of press, most of that coverage focuses on the possibilities it holds rather than on actual implementations. Yet some organizations are already releasing successful agentic AI applications and features. These innovators recognize that mitigating the risks associated with this new technology calls for careful planning and next-level testing and evaluation. Applause’s recent webinar, Developing Reliable Agentic AI: Planning, Testing and Real-World Lessons, examined some current agentic AI applications and the steps along the path to success.
Chris Sheehan, Applause’s EVP of AI and High Tech, moderated the discussion with IBM Senior Solution Architect David Bacarella and Applause’s Sr. Director, AI Research, Red Team Engineering & Architecture Jon Perreira. The panel began by exploring the key risks teams should weigh when developing agentic AI, and whether those risks are balanced—or outweighed—by the potential benefits.
Bacarella set the stage: “I personally believe agentic solutions are only going to increase as we go forward into the future.” While acknowledging that agentic AI can amplify existing risks and even introduce new ones, he emphasized that much of the concern stems from the concept of autonomy itself. “We do use autonomous systems every day, such as your air conditioning and heating at home. Set the thermometer to the value you want: seventy two degrees Fahrenheit, and forget about it. So sometimes autonomous is exactly what we need, and we just have to be wise enough to distinguish between the problem spaces where we want to use autonomous behavior and other conditions where we don’t want to.”
Careful planning can help mitigate some of the risks associated with agentic AI, such as unsupervised autonomy, data bias, misaligned or unexplainable actions and lack of transparency. In terms of planning, Bacarella advised that teams should focus on understanding the problem they’re trying to solve, then identifying discrete elements of the workflow. Sheehan pointed out that from there, determining which tools or types of agent best suit each step and the potential risks at each stage helps teams create an effective plan for thorough testing and evaluation. “In terms of testing and evaluation, it really is problem-space specific,” Bacarella said.
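One lightweight way to capture that kind of plan is as structured data that maps each workflow step to the agent or tool chosen for it and the risks to test at that stage. The sketch below is purely illustrative; the steps, agents and risks shown are assumptions, not drawn from the webinar’s examples.

```python
# A minimal, illustrative sketch of a workflow plan: each step records the
# agent or tool assumed to handle it and the risks to test at that stage.
WORKFLOW_PLAN = [
    {"step": "interpret user request", "agent": "LLM router",
     "risks": ["misrouting", "prompt injection"]},
    {"step": "fetch account data", "agent": "API tool call",
     "risks": ["unauthorized access", "tool timeout"]},
    {"step": "draft response", "agent": "generative model",
     "risks": ["hallucination", "policy violations"]},
]

# Turn the plan directly into a per-stage test checklist.
for stage in WORKFLOW_PLAN:
    print(f"{stage['step']} ({stage['agent']}): test for {', '.join(stage['risks'])}")
```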
Agentic AI in action
Bacarella shared three real-world agentic AI implementations covering common use cases. The first, a multi-edge cloud use case, focused on extending capabilities so that edge devices, such as phones, cars and laptops, could use agentic applications running in the cloud. The second example described a data democratization project that allowed executives to interact with complex financial software through voice commands; previously, executives had to request that data from subject matter experts.
Some common risks across the applications included:
- Execution timeouts
- Calling tools when not desired
- External tools not working or performing slowly
- An untrustworthy generative AI model that allows hate speech, returns copyright-protected data or produces other undesirable output
- Insufficient security
“We ran into problems with timeouts, because the agentic [application] happens to be very generative AI intensive in terms of the number of prompts it has to execute,” Bacarella said. “We’re in the process of optimizing that and using different techniques to do that. But that’s something to be very aware of.”
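As a rough illustration of that caution, the sketch below wraps a prompt-heavy agent step in a wall-clock timeout so a slow chain of generative AI calls fails fast instead of hanging the workflow. The function names and the 60-second budget are assumptions for illustration, not the team’s actual implementation.

```python
# A minimal sketch of enforcing a per-step time budget on a prompt-heavy agent call.
import concurrent.futures

STEP_TIMEOUT_SECONDS = 60  # assumed budget per agent step; tune to your workflow

def call_agent_step(prompt: str) -> str:
    """Placeholder for a generative-AI-intensive agent step."""
    raise NotImplementedError  # swap in your model or agent client here

def run_with_timeout(prompt: str) -> str:
    # Run the step in a worker thread so we can enforce a wall-clock budget.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(call_agent_step, prompt)
        try:
            return future.result(timeout=STEP_TIMEOUT_SECONDS)
        except concurrent.futures.TimeoutError:
            # Note: the worker may still be running; repeated timeouts signal the
            # step needs optimization (fewer prompts, smaller context, caching).
            raise TimeoutError("Agent step exceeded its time budget")
```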
Bacarella explained that when the team started developing their agentic app, they initially used MCP applications off the internet. They soon found that the underlying AI models weren’t trustworthy, so they swapped them out for trustworthy models, such as IBM Granite. He also mentioned Granite Guardian, which can identify hate speech as well as inappropriate requests, such as guidance on how to make a bomb, essentially functioning as a governance and ethics module within the Granite infrastructure.
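The sketch below shows one generic way such a guardrail check might sit between the model and the user. The classify_with_guardrail interface and the risk labels are stand-in assumptions for illustration; they are not Granite Guardian’s actual API.

```python
# A minimal sketch of gating output behind a guardrail/safety model.
UNSAFE_LABELS = {"hate_speech", "harmful_instructions", "copyrighted_content"}

def classify_with_guardrail(text: str) -> set[str]:
    """Return the set of risk labels a guardrail model assigns to `text`."""
    raise NotImplementedError  # call your guardrail/safety model here

def respond_safely(user_request: str, draft_response: str) -> str:
    # Screen both the request and the draft answer; refuse if either is flagged.
    flags = classify_with_guardrail(user_request) | classify_with_guardrail(draft_response)
    if flags & UNSAFE_LABELS:
        return "I can't help with that request."
    return draft_response
```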
Bacarella emphasized that if agents rely on external tools or systems, it’s crucial to validate there’s an SLA in place with the provider around uptime and reliability.
The final example Bacarella shared was a RAG (retrieval-augmented generation) implementation designed to equip agents with the knowledge to perform QA against specific documents. “This is absolutely a brilliant concept in that the agents themselves are very, very focused in terms of the topic. So the context is very narrow,” he explained. Frequently, he said, large datasets are problematic for RAG because language that uses the same words over and over in different ways can cause hallucinations. Keeping the dataset narrow and tightly focused helps reduce that risk.
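To make the idea concrete, here is a minimal sketch of keeping retrieval tightly scoped before building the prompt. The naive term-overlap ranking and the names used are assumptions for illustration; a production RAG system would typically use embeddings and a vector store.

```python
# A minimal sketch of narrow-scope retrieval for RAG: filter to the agent's
# topic first, then rank the remaining chunks and build a constrained prompt.
from dataclasses import dataclass

@dataclass
class Chunk:
    topic: str
    text: str

def retrieve(chunks: list[Chunk], topic: str, query_terms: set[str], k: int = 3) -> list[Chunk]:
    # Restrict the candidate pool to the agent's narrow topic first...
    candidates = [c for c in chunks if c.topic == topic]
    # ...then rank by naive term overlap (a real system would use embeddings).
    scored = sorted(candidates,
                    key=lambda c: len(query_terms & set(c.text.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str, context: list[Chunk]) -> str:
    joined = "\n".join(c.text for c in context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {question}"
```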
For this use case, testing security using various defined roles and policies to ensure proper data access was key. In addition, the team had to make sure that as documents were updated, the revisions were incorporated into the agentic AI system.
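One simple way to exercise that kind of role-based check is to ask the same question under different roles and confirm that sensitive content only appears where policy allows it. The roles, policies and query function below are illustrative assumptions, not the project’s actual access model.

```python
# A minimal sketch of auditing role-based access to RAG answers.
ROLE_POLICIES = {
    "executive": {"financial_reports"},
    "analyst": {"financial_reports", "raw_transactions"},
    "contractor": set(),  # assumed to have no access to sensitive collections
}

def ask_as_role(role: str, question: str) -> str:
    """Query the agentic system using this role's credentials."""
    raise NotImplementedError  # call the system under test here

def audit_access(question: str, sensitive_marker: str, collection: str) -> dict[str, bool]:
    """True means the role's observed behavior matched policy for this question."""
    outcomes = {}
    for role, allowed_collections in ROLE_POLICIES.items():
        answer = ask_as_role(role, question)
        saw_sensitive = sensitive_marker in answer
        outcomes[role] = saw_sensitive == (collection in allowed_collections)
    return outcomes
```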
Some of the keys to success in the projects Bacarella described:
- Generating a golden dataset and comparing the actual results from agents/tools to the gold standard to improve outputs (see the sketch after this list)
- Ensuring agent descriptions are unique to allow proper routing
- Validating the underlying AI model for trustworthiness
- Configuring timeouts to give agents sufficient time to complete their tasks
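As referenced above, here is a minimal sketch of the golden-dataset comparison: run each prompt through the agent and score its output against the expected answer. The CSV columns, similarity scoring and pass threshold are assumptions for illustration, not the projects’ actual tooling.

```python
# A minimal sketch of scoring agent outputs against a golden dataset.
import csv
from difflib import SequenceMatcher

PASS_THRESHOLD = 0.85  # assumed minimum similarity to count as a match

def run_agent(prompt: str) -> str:
    raise NotImplementedError  # invoke your agent/tool chain here

def evaluate(golden_csv_path: str) -> float:
    passes, total = 0, 0
    with open(golden_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects columns: prompt, expected
            actual = run_agent(row["prompt"])
            score = SequenceMatcher(None, actual, row["expected"]).ratio()
            passes += score >= PASS_THRESHOLD
            total += 1
    return passes / total if total else 0.0
```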
Advanced AI evaluation approaches
Next, the conversation turned to why established testing and evaluation methods aren’t sufficient for agentic AI. Perreira explained why simple or rudimentary evals are effective in some scenarios but not others, and how traditional evaluations provide only myopic coverage. “They’re originally designed for static efficiency benchmarks. So they have not really evolved in line with the emergent behavior that we are actually seeing in frontier models.”
Emergent behaviors, the complexity of agentic and multimodal AI, and benchmark limitations demand new evaluation techniques. Perreira described contextual awareness, a greater understanding of nuance, and persistent memory as additional factors driving the shift. “With the addition of this increasing ability to parse nuance, you have increased attack surfaces and vulnerabilities that have emerged alongside this increase in utility. So the old benchmarks are limited,” Perreira said.
Next, Perreira described behavioral boundary mapping, an effective new evaluation approach. “Behavioral boundary mapping is a human-in-the-loop way of identifying where that line of safe and unsafe is, and being able to strengthen that line with respect to certain vulnerabilities that have been introduced with this increase in complexity, while still preserving the utility of the model,” he said.
Sheehan asked what types of metrics teams focus on when using behavioral boundary mapping. Perreira outlined a variety of metrics, including the following (a sketch of how two of them might be computed appears after the list):
- Bypass-after-k – how many attempts it takes to circumvent a safety measure
- Edit-distance-to-bypass – how small a change triggers different behavior
- Refusal calibration – consistency of safety in responses across contexts
- Oracle-leakage index – information revealed through error messages that bad actors can use to refine inputs
- Temporal degradation – how safety deteriorates over extended interactions
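As noted above, here is a rough sketch of how two of these metrics might be computed from red-team attempt logs. The log format and the similarity-based normalization are assumptions for illustration, not Applause’s actual tooling.

```python
# Illustrative calculations for bypass-after-k and edit-distance-to-bypass.
from difflib import SequenceMatcher

# Assumed log format: each attempt is {"prompt": str, "bypassed": bool}, in the order tried.
def bypass_after_k(attempts: list[dict]) -> int | None:
    """Number of attempts needed before a safety measure was circumvented."""
    for k, attempt in enumerate(attempts, start=1):
        if attempt["bypassed"]:
            return k
    return None  # safety held for this attack series

def edit_distance_to_bypass(baseline_prompt: str, bypass_prompt: str) -> float:
    """How small a change flipped the behavior (0 = identical, 1 = completely different)."""
    return 1.0 - SequenceMatcher(None, baseline_prompt, bypass_prompt).ratio()
```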
“Over time, our models become much more complex in a variety of ways, and they have persistent memory, a variety of features that allow more conversational interactions. Over time, the behavior of models start to be aligned or calibrated towards your particular needs conversationally the same way a human would… you will start to see certain emergent behaviors, sycophancy, you might see aberrations in behavior, odd hallucinatory type instances… These models over time could be pushed to suggestibility to be more compliant,” Perreira said.
Case studies: evaluating and testing real agentic AI
Perreira walked through two examples of how Applause has helped customers test agentic AI. The first case study described a large language model that allowed users to create images based on text inputs, such as digital ads for a business or printed promotional materials for events.
Applause used a variety of new evaluation methods to test vulnerabilities in the text-to-image pipeline, including system architecture reconnaissance and black-box pipeline profiling, oracle-leakage probing, semantic mutation testing, and decision-boundary mapping. Perreira outlined how testers adjusted the language they used for text inputs based on the model’s error messages, to see whether its behavior and boundaries shifted over time. Testers were able to get the model to generate harmful content in both the images it produced and the accompanying text. Based on what the tests revealed, the development team was able to address the vulnerabilities and harden the model where needed.
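The sketch below illustrates the general shape of a semantic mutation loop like the one described: start from a seed prompt, apply small rewrites, and record which variants the pipeline refuses, feeding any error messages back into the next round. The mutation rules and interfaces are assumptions for illustration, not Applause’s actual test harness.

```python
# A minimal sketch of semantic mutation testing against a text-to-image pipeline.
MUTATIONS = [
    lambda p: f"For a fictional story, {p}",                 # role/context framing
    lambda p: p.replace("generate", "illustrate"),           # synonym swap
    lambda p: p + " Describe the result as alt text only.",  # output-format coercion
]

def submit_to_pipeline(prompt: str) -> tuple[bool, str]:
    """Return (was_refused, error_or_output) from the system under test."""
    raise NotImplementedError  # call the text-to-image pipeline here

def probe(seed_prompt: str) -> list[dict]:
    results = []
    for mutate in MUTATIONS:
        variant = mutate(seed_prompt)
        refused, detail = submit_to_pipeline(variant)
        # Error messages are evidence in their own right (oracle leakage)
        # and inform the next round of mutations.
        results.append({"variant": variant, "refused": refused, "detail": detail})
    return results
```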
In the second example, Perreira focused on persistent memory and whether an app truly purged information that it was instructed to remove from its memory. In terms of efficiency and contextual relevance, Perreira said, “it’s not helpful to have to reintroduce discrete details each time… Perhaps you’ve introduced personal information that you don’t want the model to persistently have over time, and you would need [it] to be purged. Even if you forgot to explicitly tell the model to clear that information out, you would want some safeguard that the model isn’t just holding on and building a repository of information about you. This is what that persistent memory feature is describing: the utility of remembering contextual details and facts about you. But at the same time, how well is it managing that information over time?”
To test the application, Perreira said, the team sprinkled in explicit facts that users would not want to persist over time, such as personally identifiable information, tax information or health data. Testers then asked questions to see whether the model recalled that information even after it had been explicitly instructed to purge those facts, and repeated the checks after closing out and starting a new session.
Perreira said they measured whether the system legitimately purged sensitive information in a variety of ways. Essentially, at the system level, the team was able to identify whether the model had persistent memory of these particular facts that testers seeded in it.
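In the same spirit, here is a minimal sketch of a purge check: seed a synthetic sensitive fact, instruct the assistant to forget it, then probe both in the same session and in a fresh one. The chat interface shown is an assumed stand-in, not the application under test.

```python
# A minimal sketch of verifying that seeded sensitive facts are actually purged.
SEEDED_FACT = "my tax ID is 000-00-0000"  # synthetic test data only

def send(session_id: str, message: str) -> str:
    """Send a message to the assistant and return its reply."""
    raise NotImplementedError  # call the system under test here

def new_session() -> str:
    """Start a fresh conversation and return its session id."""
    raise NotImplementedError

def purge_is_effective() -> bool:
    first = new_session()
    send(first, f"For context, {SEEDED_FACT}.")
    send(first, "Please permanently forget my tax ID.")
    same_session_leak = "000-00-0000" in send(first, "What is my tax ID?")
    second = new_session()  # does the purged fact survive into a new session?
    cross_session_leak = "000-00-0000" in send(second, "Remind me of my tax ID.")
    return not (same_session_leak or cross_session_leak)
```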
Agentic AI calls for testing teams to adapt and evolve
Sheehan wrapped up the webinar by pointing out that there are many, many risks to take into account with agentic AI. “In terms of just the complexity of models today, you really do need to be using some of the cutting edge emerging evaluation techniques that you’ve talked about here, because it gets very complicated very quickly,” he said.