AI experiments, real-life projects, and what they teach us about AI agents

AI Studio

As the tech industry’s next big thing, AI agents have become an object of almost comical hype. Gartner recently reported that out of thousands of products marketed as AI agents, only about 130 had genuine agentic capabilities, while the rest were merely rebranded AI assistants and chatbots. This type of rebranding even got its own name — “agent washing.”

For companies exploring AI agents, cutting through this noise is critical. So what truly sets an AI agent apart from other GenAI solutions? And how can businesses leverage these differences to build agentic systems that succeed in the real world? To answer these questions, let’s briefly examine an experiment from one of the AI industry leaders, Anthropic.

Anthropic experiment: taking an LLM too far

In June, Anthropic shared the results of a month-long experiment in which a barebones AI agent, nicknamed Claudius, managed a small business. The AI was essentially put in charge of a makeshift vending machine, with complete control over inventory, pricing, customer communications, and supplier management.

In the company’s own words, they were impressed with “how close it was to success — and the curious ways that it failed.”

On the positive side, Claudius consistently succeeded at using its web search tool to identify products requested by customers. It also adapted to a wide range of customer behaviors and resisted most jailbreak attempts. However, it also made costly (and, at times, hilarious) mistakes, such as selling items at a loss, hallucinating payment and restocking details, and even claiming to visit a fictional address from The Simpsons.

What makes Anthropic’s experiment particularly interesting is that it shows the importance of having proper AI architecture, workflow decomposition, and guardrails in place. From an engineer’s perspective, Claudius illustrates what happens when you push a single AI model too far. The AI used in the experiment wasn’t some hyper-optimized shopkeeper, but rather a simple instance of Claude Sonnet 3.7.

Agentic AI case study: AI architecture and workflow decomposition

On a more practical level, the lesson from Anthropic’s experiment is clear: successful AI agents require more than just a robust ML model. Let’s compare Claudius to a real-world AI agent, using one of our recent projects as an example.

At AgileEngine, we recently developed a custom agentic solution for a B2B energy platform that enables enterprises to manage energy contracts, invoices, and bills effectively. Our solution enhances several critical capabilities of the platform by automating the extraction and validation of data from these documents, as well as handling invoice reconciliation for energy contracts. Here’s a quick overview of what the solution architecture looked like:

AI pipeline incorporating multiple large language models (LLMs)
Optical character recognition (OCR) for the extraction of data from PDF, DOC, and PNG files
AI-driven system comparing extracted data with contractual terms and highlighting inconsistencies
Web scraper leveraging deep search to retrieve PDFs and other relevant documents for analysis
Flexible and adaptive calculation framework that supports a variety of invoice types
Solution enabling a reliable evaluation process within the AI pipeline to reduce reliance on having humans in the loop

As you can see, this is a bit more complex than Anthropic’s experiment, with dedicated AI systems handling separate workflows. However, despite the complexity under the hood, our solution provides a reasonably simple user experience. From an end user’s perspective, it’s a single agent adding documents to the system, updating dashboards, and occasionally requesting human intervention.
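To make the reconciliation step from the overview more concrete, here is a minimal sketch of comparing an extracted invoice line with contractual terms and flagging inconsistencies. It is an illustration only, not the code from our project: the ContractTerms and InvoiceLine types, their fields, and the one-cent tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ContractTerms:
    # Hypothetical fields; a real energy contract carries many more terms.
    contract_id: str
    price_per_kwh: Decimal


@dataclass(frozen=True)
class InvoiceLine:
    # Hypothetical shape of one line produced by the extraction step.
    contract_id: str
    kwh: Decimal
    billed_amount: Decimal


def reconcile(line: InvoiceLine, terms: ContractTerms,
              tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return human-readable inconsistencies; an empty list means the line matches."""
    if line.contract_id != terms.contract_id:
        return [f"contract mismatch: {line.contract_id} vs {terms.contract_id}"]
    issues = []
    # Decimal avoids the rounding surprises of binary floats in money math.
    expected = (line.kwh * terms.price_per_kwh).quantize(Decimal("0.01"))
    if abs(expected - line.billed_amount) > tolerance:
        issues.append(f"billed {line.billed_amount}, expected {expected}")
    return issues


terms = ContractTerms("C-1", Decimal("0.12"))
ok_line = InvoiceLine("C-1", Decimal("1000"), Decimal("120.00"))
bad_line = InvoiceLine("C-1", Decimal("1000"), Decimal("135.50"))
print(reconcile(ok_line, terms))   # no issues for a matching line
print(reconcile(bad_line, terms))  # one issue describing the overbilling
```

A deterministic check like this is easy to audit, which is one reason a comparison step does not have to be handled by a language model.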

Complex solutions like this also don’t necessarily require a long time to build. In this specific client engagement, we completed all aspects of the AI functionality, UI, backend, Google Cloud-based infrastructure, and DevOps within just two months.

But what’s our most important lesson learned here? In our experience, successful agentic AI projects often require a careful decomposition of workflows into clearly defined processes. These processes are typically managed by dedicated AI models that interact with external systems, humans, and other models.
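As a rough illustration of what this decomposition can look like in code, the sketch below splits document handling into small stages that each do one job, with a final stage that decides whether a human needs to step in. The stage names, the Document type, and the "key: value" parsing are hypothetical stand-ins for OCR and model calls, not our production design.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    raw_text: str
    fields: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)
    needs_human: bool = False


Stage = Callable[[Document], Document]


def extract(doc: Document) -> Document:
    # Stand-in for OCR plus an extraction model: parse "key: value" lines.
    for line in doc.raw_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            doc.fields[key.strip().lower()] = value.strip()
    return doc


def validate(doc: Document) -> Document:
    # Stand-in for a validation step: check that required fields are present.
    for required in ("invoice_id", "amount"):
        if required not in doc.fields:
            doc.issues.append(f"missing field: {required}")
    return doc


def escalate(doc: Document) -> Document:
    # Any unresolved issue routes the document to a human reviewer.
    doc.needs_human = bool(doc.issues)
    return doc


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    # Stages run in order, each receiving the output of the previous one.
    for stage in stages:
        doc = stage(doc)
    return doc


PIPELINE = [extract, validate, escalate]
complete = run_pipeline(Document("invoice_id: INV-7\namount: 120.00"), PIPELINE)
partial = run_pipeline(Document("invoice_id: INV-8"), PIPELINE)
print(complete.needs_human, partial.needs_human, partial.issues)
```

Because every stage has the same signature, one stage can be swapped for a different model or for plain rules without touching the rest of the pipeline.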

Other considerations when launching agentic AI initiatives

Architecture is one of the key challenges companies face when launching AI initiatives. Other strategic considerations critical for agentic AI projects include governance, security, and data readiness.

Not everything has to be GenAI

While large language models are central to most AI agents, they’re not necessarily the best tool for every task involved. In our invoice management project, we replaced several LLMs with narrow-purpose ML models and rules-based logic, which proved faster, more precise, and more cost-efficient for specific tasks. In our experience, a successful AI architecture can often blend cutting-edge GenAI with tried-and-true approaches.
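One common way to blend the two approaches is a rules-first design: try a cheap deterministic rule, and call a model only when the rule finds nothing. The sketch below assumes a hypothetical invoice-total extraction task; the regular expression is illustrative, and fake_llm is a placeholder for a real model call rather than any actual API.

```python
import re
from typing import Callable, Optional

# Matches amounts such as "Total: $1,234.56" (illustrative pattern only).
TOTAL_PATTERN = re.compile(r"total\s*[:=]\s*\$?([\d,]+\.\d{2})", re.IGNORECASE)


def extract_total_with_rules(text: str) -> Optional[str]:
    match = TOTAL_PATTERN.search(text)
    return match.group(1).replace(",", "") if match else None


def extract_total(text: str, llm_fallback: Callable[[str], str]) -> tuple[str, str]:
    """Return (value, source); the model is called only when the rule finds nothing."""
    value = extract_total_with_rules(text)
    if value is not None:
        return value, "rules"
    return llm_fallback(text), "llm"


def fake_llm(text: str) -> str:
    # Placeholder for a real model call; returns a fixed answer for the demo.
    return "98.10"


print(extract_total("Invoice 17\nTotal: $1,234.56", fake_llm))
print(extract_total("Amount payable ninety-eight dollars ten", fake_llm))
```

Returning the source alongside the value makes it possible to track how often the fallback fires, which is a useful signal when deciding where a model is really needed.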

Not all good data is AI-ready

AI systems are only as good as the underlying data. When developing the document management agent from our case study, reducing noise in the pipeline was crucial for making the AI perform up to standard. On a more strategic level, ensuring AI readiness requires more effort than traditional data management, as companies must qualify and govern data within the context of specific AI use cases.

Treat hallucinations and model drift as inevitable

Remember when Zillow lost $300 million due to property overestimation caused by AI model drift? Although this story is somewhat dated, model drift and AI hallucinations remain as significant a challenge today as they were a few years ago. What has changed since then is that we have better monitoring tools to help keep AI agents from turning on you like Agent Smith in The Matrix.
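To show the basic idea behind drift monitoring, here is a minimal sketch that compares recent prediction error with a known baseline and raises a flag when the gap grows too large. The DriftMonitor class, the window size, and the factor of 2 are assumptions chosen for the example. Dedicated monitoring tools do far more than this, and the sketch assumes actual values are never zero.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags drift when recent relative error exceeds a baseline by a set factor."""

    def __init__(self, baseline_error: float, window: int = 50, factor: float = 2.0):
        self.baseline_error = baseline_error
        self.factor = factor
        # A bounded deque keeps only the most recent `window` errors.
        self.errors = deque(maxlen=window)

    def record(self, predicted: float, actual: float) -> None:
        # Relative error; assumes `actual` is non-zero.
        self.errors.append(abs(predicted - actual) / abs(actual))

    def drifting(self) -> bool:
        # Wait for a full window so a few outliers do not trigger an alert.
        if len(self.errors) < self.errors.maxlen:
            return False
        return mean(self.errors) > self.factor * self.baseline_error


monitor = DriftMonitor(baseline_error=0.05, window=3)
for predicted, actual in [(100, 98), (205, 200), (51, 50)]:
    monitor.record(predicted, actual)
healthy = monitor.drifting()

for predicted, actual in [(130, 100), (260, 200), (70, 50)]:
    monitor.record(predicted, actual)
degraded = monitor.drifting()
print(healthy, degraded)
```

In practice, a flag like this would typically pause automated decisions or route them to a human rather than just print a value.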

Security is a bigger priority with AI agents

Another AI story that made headlines a while ago was about people exploiting ChatGPT integrations in third-party apps to access premium functionality at the expense of app developers. Closing and preventing loopholes like this is integral to our work on apps that incorporate GenAI APIs. Still, LLMs are prone to attacks like prompt injection, jailbreaks, and reverse psychology, and the attack surface is often larger with AI agents. As a result, a strong cybersecurity focus is a must-have for companies that have serious plans for agentic AI.
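One simple safeguard against the first kind of abuse is a per-user usage budget enforced on your own backend before any request reaches a paid GenAI API. The sketch below is a hypothetical illustration: the QuotaGuard class and token numbers are assumptions, usage is held in memory only, and it does nothing against prompt injection or jailbreaks, which need separate defenses.

```python
from collections import defaultdict


class QuotaGuard:
    """Tracks per-user token spend and rejects requests that exceed a budget."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        # In-memory only; a real service would persist and periodically reset this.
        self.spent = defaultdict(int)

    def allow(self, user_id: str, requested_tokens: int) -> bool:
        if self.spent[user_id] + requested_tokens > self.token_budget:
            return False
        self.spent[user_id] += requested_tokens
        return True


guard = QuotaGuard(token_budget=1000)
first = guard.allow("user-1", 600)    # within budget
second = guard.allow("user-1", 600)   # would exceed the budget, so rejected
third = guard.allow("user-2", 600)    # budgets are tracked per user
print(first, second, third)
```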

Securing long-term success for agentic AI initiatives

If you’ve been following the news about AI agents, you may have noticed that the technology is in a weird place right now. On the one hand, companies across all industries are exploring the potential for adoption. On the other hand, research firms like Gartner aren’t too optimistic about the outcomes. In its June report, for instance, the firm predicted that more than 40% of agentic AI projects would be canceled by the end of 2027, naming hype, misapplication, and inability to scale among key reasons.

The key to success lies in marrying innovation with execution. Watching how leaders like Anthropic, OpenAI, and Google experiment with agentic architectures can provide valuable insights — but commercial success requires more than following trends. It demands:

A careful decomposition of workflows based on a thorough understanding of business processes
A pragmatic architecture that selects the right AI tool for every task involved
Governance, security, and data discipline

If you’re exploring agentic AI, now is the time to move beyond the hype and build with clarity. Feel free to check this post on LinkedIn for more information on the dynamics, potential, and limitations of agentic AI. And if you need in-depth guidance, a proof of concept, or production-grade AI engineering, our team is ready to help. Contact us for a free consultation and share your challenge with our AI Studio.