About half of companies using AI today start with chatbots—yet most still feel like clunky phone menus. In this episode, you’re dropped into three real meetings where leaders must decide: should this task go to a bot, a model, or a human? And what happens if they guess wrong?
Thirty to fifty percent: that’s how often RPA fails when you simply bolt bots onto existing workflows. Redesign the processes first and success jumps above 80%. The gap isn’t technology—it’s how precisely you define the work and who you assign it to. In this episode, we move from “we need AI” to “we need the right agents in the right roles.”
You’ll learn a practical way to deconstruct a business process into discrete tasks, then decide whether each belongs to a rule-based bot, a predictive model, a generative agent, or a hybrid. We’ll look at why some firms get a 9–14 month payback while others quietly shut down AI pilots, and how treating agents as specialised co-workers—backed by governance, data quality, and human oversight—dramatically lowers risk. By the end, you’ll have a playbook to design your own AI workforce, task by task.
Now we go one level deeper: not just “what task goes where,” but “what does good look like for each agent?” High performers don’t launch a customer‑support bot; they specify that it must resolve 60% of tickets under 2 minutes, with <3% escalation errors, using only approved knowledge. A forecasting model isn’t “deployed”; it’s required to beat last year’s manual forecast by 10% on MAPE within one quarter. We’ll walk through concrete thresholds, feedback loops, and handoff rules so every AI agent has clear, measurable expectations—just like any strong hire.
Start with one process that’s painful but contained—say, handling refund requests, KYC checks, or Level‑1 IT issues. Map the current metrics: volume per week, average handling time, error rate, escalation rate, and impact when things go wrong. If you can’t quantify those, you’re not ready to pick agents yet.
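To make that readiness test concrete, here is a minimal sketch in Python. The field names and the `refunds` example are illustrative assumptions, not figures from any real process:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ProcessBaseline:
    """Baseline metrics for one candidate process (field names are placeholders)."""
    weekly_volume: Optional[int] = None
    avg_handling_minutes: Optional[float] = None
    error_rate: Optional[float] = None        # fraction of cases handled wrongly
    escalation_rate: Optional[float] = None   # fraction passed up to a human
    cost_per_error: Optional[float] = None    # rough impact when things go wrong

def missing_metrics(b: ProcessBaseline) -> list:
    """Return the metrics still unknown; an empty list means you can pick agents."""
    return [f.name for f in fields(b) if getattr(b, f.name) is None]

# Hypothetical refund process: two metrics are still unquantified.
refunds = ProcessBaseline(weekly_volume=1200, avg_handling_minutes=6.5, error_rate=0.04)
print(missing_metrics(refunds))
```

If the list is non-empty, the honest next step is measurement, not agent selection.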
Next, write a one‑page “role description” for each candidate agent in that process, the way you would for a new hire. Include:
- **Mission:** One sentence. Example: “Reduce manual review of low‑risk refund tickets by 60% within 3 months.”
- **Scope:** What it touches and what it must not touch. Be explicit: “Only orders under $500, within 30 days, with no prior disputes.”
- **Quality bar:** Concrete thresholds. For a classification model: “≥92% precision, ≥88% recall on last quarter’s data.” For a generative agent summarising tickets: “<2 factual errors per 1,000 summaries; 95% of summaries under 120 words.”
- **Guardrails:** Forbidden behaviours. “Never issue refunds over $200 without human sign‑off.” “Use only the approved policy corpus updated weekly.”
- **Handoff rules:** When it must call a human or another agent. “Auto‑escalate if confidence <0.8, or if the customer uses any of 15 high‑risk phrases.”
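The scope, guardrail, and handoff rules above are mechanical enough to encode directly. A toy sketch, assuming a ticket dictionary and a model confidence score; the field names and the three stand-in phrases are hypothetical, not the full list of 15:

```python
HIGH_RISK_PHRASES = {"lawyer", "chargeback", "fraud"}  # stand-in for the real list of 15

def decide(ticket: dict, confidence: float) -> str:
    # Scope: only orders under $500, within 30 days, with no prior disputes qualify.
    if ticket["amount"] >= 500 or ticket["age_days"] > 30 or ticket["prior_disputes"]:
        return "human"
    # Guardrail: never issue refunds over $200 without human sign-off.
    if ticket["amount"] > 200:
        return "human_signoff"
    # Handoff rules: low confidence or high-risk language auto-escalates.
    if confidence < 0.8 or any(p in ticket["text"].lower() for p in HIGH_RISK_PHRASES):
        return "escalate"
    return "auto_refund"
```

The point is that every bullet in the role description maps to one testable branch, so the role can be audited line by line.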
Treat these as pre‑conditions for going live. If a model can’t hit the bar in offline testing, it doesn’t “grow into” the job in production—you redesign the role or improve the training data. That’s where fine‑tuning on 50–100k domain‑specific records often earns its keep, especially for generative agents that need to stop “inventing” policies.
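That quality bar can be enforced as a literal gate in your offline evaluation harness. A minimal sketch using the example thresholds from the role description; the function names are ours, not from any library:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def cleared_for_production(y_true, y_pred, min_precision=0.92, min_recall=0.88):
    """Go-live gate: the model ships only if it clears both thresholds offline."""
    p, r = precision_recall(y_true, y_pred)
    return p >= min_precision and r >= min_recall
```

Run it on last quarter’s labelled data before launch; a failing gate means redesigning the role or the training data, not shipping anyway.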
Now layer in **division of labour**. Resist the temptation to hand an entire workflow to one powerful model. A pragmatic pattern:
- A light‑weight classifier routes each case (fast, cheap).
- A rules engine approves or rejects the obvious 60–70%.
- A specialised predictive model handles the next 20–25%, where patterns are learnable but non‑trivial.
- A generative agent plus human reviewer tackles the final 5–10% of messy edge cases.
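The four tiers above can be sketched as a single routing function, with the classifier, rules engine, and risk model injected as callables. Everything here is an illustrative assumption, including the 0.5 approval cut‑off:

```python
def route(case, classifier, rules_engine, risk_model):
    """Tiered routing: cheap deterministic checks first, expensive judgment last."""
    label = classifier(case)                    # fast, cheap triage
    verdict = rules_engine(case, label)
    if verdict in ("approve", "reject"):        # the obvious 60-70%
        return verdict, "rules"
    score = risk_model(case)                    # learnable middle 20-25%
    if score is not None:                       # None = model declines to score
        return ("approve" if score < 0.5 else "reject"), "model"
    return "needs_review", "generative+human"   # the messy 5-10% of edge cases
```

Because each tier is a plain callable, you can swap in a retrained model or tightened rules without touching the routing logic.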
Quantify ownership: “This agent should touch 8,000 of 10,000 monthly tickets and fully resolve at least 5,500.” Review that split monthly. If your “edge case” bucket keeps growing, that’s a signal to retrain models or tighten rules—not to throw a bigger LLM at the problem.
In one global logistics firm, three “colleagues” now share invoice handling: a rules bot, a fraud model, and a generative reviewer. Of 120,000 invoices a month, about 78,000 are cleared automatically by the rules bot in under 2 seconds. Another 30,000 go to a fraud‑risk model that flags roughly 2,100 (7%) as suspicious. Only the riskiest 6,000 in total—5% of volume—reach the generative‑agent‑plus‑human pair, who spend an average of 90 seconds per invoice. Net effect: finance touches 1 in 20 invoices instead of nearly all of them.
To design splits like this, test candidate agents on *real* historical flows. For example, replay 50,000 past tickets and measure how many each agent could have safely owned at your actual error tolerance. Then re‑segment: maybe the rules bot should handle 10% less, while the model picks up 15% more.
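A replay like this is a short loop: run a candidate agent over historical tickets, let it decline anything it is unsure about, and measure its error rate on what it attempted. A sketch, assuming a hypothetical `agent(ticket)` interface that returns a decision or `None` to decline:

```python
def safe_ownership(history, agent, error_tolerance=0.02):
    """Replay (ticket, true_outcome) pairs and measure how much the agent
    could have safely owned at the given error tolerance."""
    attempted = correct = 0
    for ticket, true_outcome in history:
        decision = agent(ticket)      # None means the agent declines the ticket
        if decision is None:
            continue
        attempted += 1
        correct += (decision == true_outcome)
    error_rate = 1 - correct / attempted if attempted else 0.0
    return {"attempted": attempted,
            "error_rate": error_rate,
            "within_tolerance": error_rate <= error_tolerance}
```

Running this per agent over the same 50,000 tickets gives you the data to re‑segment ownership, rather than guessing.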
One useful framing: treat human experts as the “farm team.” Any pattern they handle repeatedly at >98% accuracy for three months becomes a candidate to graduate into an agent’s scope.
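That graduation rule is easy to automate once you log human accuracy per pattern per month. A sketch, assuming a simple dict of monthly accuracy readings (the data shape and pattern names are hypothetical):

```python
def graduation_candidates(monthly_accuracy, min_accuracy=0.98, months=3):
    """monthly_accuracy maps a pattern name to its human accuracy per month,
    oldest first. A pattern graduates into an agent's scope once its last
    `months` readings all clear the bar."""
    return [pattern for pattern, accs in monthly_accuracy.items()
            if len(accs) >= months and all(a > min_accuracy for a in accs[-months:])]
```

One dip below the bar resets the clock, which keeps flaky patterns out of automated scope.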
Next‑gen agents will reshape how you design teams, not just workflows. As long‑term memory improves, a single agent might own a 5‑day sequence: draft a contract, track redlines, propose revisions, then brief a human in 4 bullet points. Expect regulators to demand per‑decision logs—who (or what) did what, using which data. That means investing early in “agent management” tooling: versioned prompts, replayable traces, and change histories, so you can audit 10,000 automated decisions as easily as 10.
Track results like you would a sales funnel: set design targets for “tickets touched,” “errors per 1,000,” and “escalations resolved within 4 hours.” Then review weekly, and kill or retrain any agent that misses its targets two cycles in a row. Your challenge this week: instrument one pilot so precisely that, within 30 days, you can prove (or disprove) at least a 15% uplift.
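The “miss two cycles in a row” rule might look like this in code; the metric names and results shape are placeholders for whatever your funnel actually tracks:

```python
def weekly_review(agent_results, targets):
    """agent_results maps each agent to its weekly metric dicts, oldest first.
    Flag agents whose last two cycles both missed any target: kill or retrain."""
    def missed(week):
        return (week["errors_per_1000"] > targets["errors_per_1000"]
                or week["resolved_share"] < targets["resolved_share"])
    return [agent for agent, weeks in agent_results.items()
            if len(weeks) >= 2 and missed(weeks[-1]) and missed(weeks[-2])]
```

A single bad week triggers a look; two in a row triggers action. That asymmetry keeps you from thrashing agents over noise.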

