What No One Tells You About Shipping AI Agents in an Enterprise

I spent the last six months leading a team shipping AI agents inside a major financial institution. We developed five finance use cases in parallel on a shared platform we built from scratch. Here's what I learned about the technology, the organization, and leading a team through uncharted territory.

The org wanted agents but had no framework for trusting them

When we started, there were no enterprise guidelines for AI agents. No policies for how agents should handle sensitive financial data. No governance framework for LLM-powered automation. No security playbook for a system that makes decisions with varying degrees of confidence. Business stakeholders wanted agents that could handle complex financial processes flawlessly end-to-end, autonomously, immediately. At the same time, the organization was deeply cautious. The same people asking "can the agent do X?" would also ask "but who's responsible when it's wrong?" Both instincts were reasonable. But the gap between what was expected and what the organization was ready to support nearly stalled us before we wrote a single line of code.

I'd envisioned interconnected agents reaching into internal systems and external services. That vision hit reality fast — every integration point raised security questions nobody had answers to, every autonomous decision raised governance questions no policy covered. My role became as much about navigating these blockers as it was about system design. Translating engineering constraints into risk language. Helping compliance teams understand what they were actually approving. Making the case that our architectural choices — not just our promises — were what made the system trustworthy.

The lesson: If you're building AI in a regulated enterprise, your biggest blocker won't be technical. Budget serious time for organizational readiness. Treat governance as a deliverable, not an obstacle, and build trust into the architecture itself so you have something concrete to point to instead of reassurances alone.

Shiny demos vs. boring engines

There was constant pressure, from stakeholders and from our own excitement, to build impressive demos fast. And honestly, it's tempting. You can get a prototype that looks magical in a week. But I'd seen where that leads: something exciting that doesn't work reliably, followed by months of hardening that never quite catches up. We chose the boring path and managed the pressure by showing incremental progress instead, demoing early agents in a proof-of-concept state while the underlying engine was still taking shape. I designed a shared execution engine with a workflow-as-graph abstraction, where each financial process is a directed graph of steps. Workers pull tasks from a queue, heartbeat their progress, and self-recover from crashes. It's infrastructure work. It doesn't demo well. Nobody claps when you show them a fault-tolerant message queue. But this bet is what made parallel delivery possible.
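To make the "pull, heartbeat, self-recover" idea concrete, here is a minimal sketch of one common way to get that behavior: a task lease that expires when heartbeats stop. It illustrates the pattern and is not the engine described above. The names LeaseQueue, claim, heartbeat, and lease_seconds are hypothetical, the queue is in-memory, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    task_id: str
    owner: Optional[str] = None   # worker currently holding the lease
    last_heartbeat: float = 0.0   # time of the latest claim or heartbeat
    done: bool = False


class LeaseQueue:
    """In-memory stand-in for a durable task queue (hypothetical)."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self.tasks: Dict[str, Task] = {}

    def add(self, task_id: str) -> None:
        self.tasks[task_id] = Task(task_id)

    def claim(self, worker: str, now: float) -> Optional[Task]:
        """Hand out a task that is unowned or whose lease has expired."""
        for task in self.tasks.values():
            if task.done:
                continue
            expired = now - task.last_heartbeat > self.lease_seconds
            if task.owner is None or expired:
                task.owner = worker
                task.last_heartbeat = now
                return task
        return None

    def heartbeat(self, task_id: str, worker: str, now: float) -> bool:
        """Extend the lease; False means the worker no longer owns the task."""
        task = self.tasks[task_id]
        if task.done or task.owner != worker:
            return False
        task.last_heartbeat = now
        return True

    def complete(self, task_id: str, worker: str) -> bool:
        """Mark the task done, but only for the current lease holder."""
        task = self.tasks[task_id]
        if task.owner != worker:
            return False
        task.done = True
        return True


queue = LeaseQueue(lease_seconds=30.0)
queue.add("reconcile-ledger")

first = queue.claim("worker-a", now=0.0)        # worker-a takes the task
blocked = queue.claim("worker-b", now=10.0)     # lease still valid
recovered = queue.claim("worker-b", now=45.0)   # worker-a went silent
stale = queue.heartbeat("reconcile-ledger", "worker-a", now=46.0)
finished = queue.complete("reconcile-ledger", "worker-b")
```

The point of the lease is that nobody has to detect the crash. A worker that stops heartbeating simply loses the task, another worker picks it up, and a late heartbeat from the original worker is rejected.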

I structured the team so each developer owned one finance use case end-to-end, paired with a domain expert. They built workflow graphs against the platform's abstractions. I owned the shared layer — execution, fault tolerance, scaling, encryption. Five workstreams ran simultaneously without stepping on each other because the architecture was the coordination mechanism. When four of the five workflows shipped to production at the same time, with the fifth following less than a month later, that was the platform paying off. Not because the team moved faster, but because the boring foundation made parallel delivery structurally possible.

The lesson: The choice between "shiny but fragile" and "boring but reliable" isn't just a technical preference. It determines whether your team can scale beyond one use case. Build the engine.

Start with the workflow, not the AI

The biggest mistake in enterprise AI is starting with "what can the technology do?" instead of "what does the human process actually look like?" Finance has almost no room for error. A wrong number cascades through downstream systems, reports, and decisions. So we hand-designed many of the steps that agents navigate — business rules as conditional logic, deterministic validation between LLM calls, human-judgment gates at critical points. The LLM is powerful, but it's one node type among many. The workflow graph is where the real design intelligence lives.
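As an illustration of "one node type among many," here is a small sketch of a workflow graph in which an LLM step is followed by a deterministic check and, when the check fails, a human gate. Everything in it is an assumption made for the example: the Step type, the run_workflow function, the node kinds, and the stand-in for the model call. Amounts are integer cents so the comparison is exact.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


@dataclass
class Step:
    name: str
    kind: str                                    # "llm", "validate", "rule", "human_gate"
    run: Callable[[State], State]                # does the work, returns new state
    next_step: Callable[[State], Optional[str]]  # picks the outgoing edge


def run_workflow(steps: Dict[str, Step], start: str, state: State) -> State:
    """Walk the graph until a step has no successor or a human is needed."""
    trail: List[str] = []
    current: Optional[str] = start
    while current is not None:
        step = steps[current]
        state = step.run(state)
        trail.append(step.name)
        if state.get("needs_human"):
            break  # pause here; a person decides what happens next
        current = step.next_step(state)
    state["trail"] = trail
    return state


def fake_llm_extract(state: State) -> State:
    # Stand-in for a model call; a real LLM client would go here.
    return {**state, "extracted_total": state["model_output"]}


def validate_total(state: State) -> State:
    # Deterministic check: the extracted total must equal the line items.
    expected = sum(state["line_items"])
    return {**state, "valid": state["extracted_total"] == expected}


def human_gate(state: State) -> State:
    return {**state, "needs_human": True}


def post_entry(state: State) -> State:
    return {**state, "posted": True}


STEPS = {
    "extract": Step("extract", "llm", fake_llm_extract, lambda s: "validate"),
    "validate": Step("validate", "validate", validate_total,
                     lambda s: "post" if s["valid"] else "review"),
    "review": Step("review", "human_gate", human_gate, lambda s: None),
    "post": Step("post", "rule", post_entry, lambda s: None),
}

ok = run_workflow(STEPS, "extract",
                  {"line_items": [1000, 2500], "model_output": 3500})
bad = run_workflow(STEPS, "extract",
                   {"line_items": [1000, 2500], "model_output": 3600})
```

In this sketch the model never decides whether its own answer is acceptable. The graph does: a correct extraction flows to posting, and a wrong one is routed to review without ever reaching the ledger.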

But getting there required something harder than architecture: genuine domain knowledge transfer. Engineers needed to understand why a particular financial convention matters. Finance experts needed to articulate rules they'd been following by instinct for years. We ran many structured sessions where domain experts walked engineers through real processes step by step. It was slow. It was frustrating for both sides. But it was necessary groundwork for the real turning point of the project: the finance experts stopped being passive reviewers and became active co-designers, proactively flagging edge cases, suggesting guardrails, and refining the rules that governed agent behavior.

It's easy to get excited and try to build fully autonomous agents that can handle an entire financial process without human involvement. But this domain taught me that constraint-driven design leads to better outcomes. Finance punishes overconfidence, so designing for that reality produced a system people actually trust and use.

The lesson: Start with the human workflow. Make the domain experts co-designers, not just reviewers. And accept that in high-stakes domains, less autonomy often means more value, because trust is what gets you to production.

What I'd tell the next technical leader

Put in the groundwork early. It will pay off. The upfront investment in foundations, shared abstractions, and team alignment is what makes parallel delivery possible later.

Structure your team around use cases, not technical layers. One developer plus one domain expert per workflow, shared platform underneath. That pairing produces more grounded solutions than separating "AI engineers" from "domain analysts" ever could.

Design for the error case. LLM outputs are inherently uncertain. In finance, that uncertainty has a dollar sign attached. Your architecture should assume the model will be wrong, and make it cheap to catch, correct, and continue.
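One way to read "cheap to catch, correct, and continue" is a bounded loop around every model call: check the output deterministically, feed the failure back on retry, and escalate when attempts run out. The sketch below is an assumption-laden illustration, not the production code. Outcome, call_with_checks, make_scripted_model, and check_total are hypothetical names, and the "model" is a scripted stand-in that returns a fixed sequence of answers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    value: Optional[int]   # accepted value in cents, or None if escalated
    attempts: int
    escalated: bool        # True when a human has to take over
    errors: List[str]      # every rejected attempt, kept for the audit trail


def call_with_checks(
    model: Callable[[List[str]], int],
    check: Callable[[int], Optional[str]],
    max_attempts: int = 3,
) -> Outcome:
    """Call the model, validate deterministically, retry, then escalate."""
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = model(errors)    # earlier errors are passed as feedback
        problem = check(candidate)   # None means the check passed
        if problem is None:
            return Outcome(candidate, attempt, False, errors)
        errors.append(problem)
    return Outcome(None, max_attempts, True, errors)


def make_scripted_model(answers: List[int]) -> Callable[[List[str]], int]:
    # Stand-in for an LLM call: returns the scripted answers in order.
    remaining = iter(answers)

    def model(previous_errors: List[str]) -> int:
        return next(remaining)

    return model


def check_total(expected: int) -> Callable[[int], Optional[str]]:
    def check(candidate: int) -> Optional[str]:
        if candidate != expected:
            return f"total {candidate} does not match ledger {expected}"
        return None

    return check


recovered = call_with_checks(make_scripted_model([3600, 3500]), check_total(3500))
gave_up = call_with_checks(make_scripted_model([1, 2, 3]), check_total(3500))
```

Here the first call recovers on its second attempt, and the second exhausts its attempts and escalates with a record of every rejected answer. Being wrong costs one extra call or one human review, never a bad number downstream.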

Governance is a feature. Encryption, audit trails, human checkpoints — these aren't constraints imposed on your system. They're what separate a demo that impresses in a meeting room from a system that runs in production. Build them in from day one; don't bolt them on later.

Be honest about what the AI can't do. The organizations that ship successfully aren't the ones with the most sophisticated models. They're the ones that know exactly where to trust the model and where to put a guardrail instead.

The hard part was never the AI. It was everything around it — the people, the governance, the team, and the discipline to choose reliability over cleverness at every turn.

Next week: the technical deep dive — multi-model orchestration, tool-augmented verification loops, encrypted state management, and the design patterns that emerged from running five finance workflows on a single engine.