
What Three Weeks at MIT Taught Us About Building Agentic AI in the Real World
17 JUNE 2026


By Matt Riddall, SVP Group Client Solutions, and Ella Grice, Head of People and Operations, Arum Global

Matt sponsors the programme overall and leads the proposal generation workstream. Ella, as Head of People, leads people development, using the same agentic principles to think differently about how we develop and grow people across the business. These are two quite different problems, which turned out to be useful, because it meant testing every idea against two contexts rather than one.


 

Ella and I recently completed MIT's Agentic AI course together. We didn't do it for the certificate: Arum was already building an AI-assisted system internally, our 80/AI programme. We wanted to pressure-test what we were doing against the best thinking available rather than carry on with our own assumptions.

It's worth being upfront about something. This course was useful precisely because we weren't starting from zero. Arum has been using AI since 2019, and over the past year we've run a global RFI across 47 providers, worked hands-on with a technology partner through a hackathon, and built and tested real workflows in collections and recoveries. The course gave us a sharper vocabulary for problems we were already living with. It didn't hand us those problems for the first time.

That distinction matters, because it's shaping how we think about the next stage: bringing voice-enabled agentic AI into collections and recoveries (C&R) for clients. Everything below (the organisational lessons, the autonomy thinking, the measurement questions) is informing that build, alongside everything we'd already learned the harder way through our own internal work.

What follows isn't a course review. It's what changed in how we both think, now that we're back applying it, internally and in client-facing work.

We're sharing this because we're running a webinar later in the summer to talk through it properly. This post is the warm-up.

We didn't go in naive

Agentic AI is a different category of problem to the AI we've used until now. These systems take multi-step actions and operate with a degree of autonomy that changes the question entirely.

It stops being "can the model do this?" and becomes "who decides what it's allowed to do, and how do we know if it's doing it well?" That reframing was the single most useful thing the course gave us, partly because it put language around instincts we'd already started forming.

The biggest constraint isn't the technology, it's the organisation

This was said so often, in so many contexts, that it stopped feeling like a throwaway line and started feeling like the thesis.

Every technical demo we saw worked. The case studies that struggled didn't struggle because the AI couldn't do what was asked; they struggled because nobody had decided:

  • who owned the rules
  • who was accountable when something went wrong
  • how the organisation would know if it was actually delivering value

And that lines up uncomfortably well with our own early instincts going into 80/AI.

Our first conversations were about which model to use and what the architecture should look like. All reasonable questions, but none of them were the hard part.

For Matt, on proposals, the hard part was deciding who in the business owns the proposal logic itself:

  • the judgement about what makes a good proposal
  • when to escalate
  • what tone fits which client

That's a business decision dressed up as a technical one, and conflating the two means building a system nobody trusts or maintains.

For Ella, on people development, the question looked different but was no less central. Whether an agentic system could personalise development plans or flag skills gaps was never really in doubt. The harder question was:

  • who owns the judgement about what good development looks like for a given person
  • how you build something that supports a manager's thinking rather than quietly replacing it

Get that wrong, and you end up with a system that works technically but that nobody, manager or employee, actually wants to use.

And the same lesson holds for our client work in C&R. Voice-enabled agentic AI handling a collections conversation raises this question with higher stakes:

  • who decides what the agent is allowed to say
  • when it must hand off to a human
  • how the organisation knows it's treating customers fairly

That isn't a question the technology answers. It's an operating model question, and it's one we'd already been working through with clients before this course gave us better language for it.

Pick the workflow that teaches you the most, not the safest one or the flashiest one

Our instinct, like most people's, was to pick the easiest possible first project. Low risk, quick win.

The course pushed hard against that. The advice was to deliberately choose a workflow that forces the hard questions early: exception handling, edge cases, ambiguity. Too easy a first project, and you learn very little about how the organisation actually needs to operate. You get a nice demo and not much else.

For Matt, that's why proposal generation became the first 80/AI use case; proposals are messy, and they require judgement about client context, relevant case studies, tone, what success even looks like. Handle that responsibly, with the right human checkpoints, and you learn far more than automating something simpler.

For Ella, though, the same logic led somewhere different.

The obvious starting point for people development would have been something administrative, scheduling or basic reporting. Instead, the workstream focuses on development conversations and growth planning, precisely because that's where the ambiguity lives. What does "ready for the next role" actually mean? How do you personalise development at scale without it feeling generic, or intrusive?

It's no coincidence that both workstreams pointed in the same direction: pick the area where the organisational learning is richest, not where the technical lift is lowest. It's also why our work on voice agentic AI in C&R has deliberately gone after the harder conversations (vulnerability, payment negotiation, exceptions) rather than starting with simple automated reminders. We learned that the hard way before the course validated it as the right call.

Autonomy is a dial, not a switch

This is the idea we keep coming back to since the course ended, because it's such a simple reframe, and yet most organisations, us included early on, treat AI deployment as binary. Either a human does it, or the AI does it.

The more useful model is a dial. For any decision point, how much autonomy should sit here?

  • fully supervised, where a human reviews everything first
  • partially supervised, where the AI acts and a human spot-checks
  • fully delegated, where a human only gets involved if something flags as unusual

In Matt's proposal system, draft generation can run with relatively high autonomy. Worst case, a mediocre first draft, easily caught at review. Anything touching commercial terms or pricing needs to stay much further toward supervised, because the cost of getting it wrong is asymmetric.

In Ella's world, the dial looks different but the thinking is identical. Surfacing a relevant learning resource can sit at high autonomy. It's a suggestion, easily ignored. But anything feeding into a performance conversation or a sensitive personal circumstance needs to stay firmly supervised, because a wrong call there can do real damage to someone's trust in the process.

In voice agentic AI for C&R, the dial does real work too. A routine balance enquiry can run at high autonomy. A customer disclosing financial difficulty needs the dial turned hard toward supervised, with a clear and fast route to a human. This is the single most important design decision in any client deployment we're involved in, and it's one we were already making with clients before MIT gave us a cleaner way to describe it.

That discipline is valuable regardless of workstream. It forces you to map out where the risk actually sits, rather than making one blanket decision about how much AI to use.
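One way to make the dial concrete is to record an autonomy level per decision point rather than per system. The sketch below is illustrative only and does not describe how 80/AI is built. The `Autonomy` enum, the decision-point names, the specific settings in `DIAL`, and the `needs_human_before_acting` helper are all hypothetical, chosen to mirror the examples in this section.

```python
from enum import Enum


class Autonomy(Enum):
    FULLY_SUPERVISED = 1      # a human reviews everything first
    PARTIALLY_SUPERVISED = 2  # the AI acts and a human spot-checks
    FULLY_DELEGATED = 3       # a human is involved only when something is flagged


# Hypothetical decision points and settings (assumptions for illustration).
DIAL = {
    "proposal.draft_generation": Autonomy.PARTIALLY_SUPERVISED,
    "proposal.pricing_terms": Autonomy.FULLY_SUPERVISED,
    "people.suggest_learning_resource": Autonomy.FULLY_DELEGATED,
    "people.performance_input": Autonomy.FULLY_SUPERVISED,
    "voice.balance_enquiry": Autonomy.FULLY_DELEGATED,
    "voice.financial_difficulty": Autonomy.FULLY_SUPERVISED,
}


def needs_human_before_acting(decision_point: str, flagged: bool = False) -> bool:
    """Return True if a human must review before the system acts."""
    # Unknown decision points default to the most cautious setting.
    level = DIAL.get(decision_point, Autonomy.FULLY_SUPERVISED)
    if level is Autonomy.FULLY_SUPERVISED:
        return True
    # Partially supervised and fully delegated both let the AI act first;
    # a human steps in beforehand only if the item was flagged as unusual.
    # (Spot-check sampling for the partial case would happen after the fact.)
    return flagged
```

The design choice worth noting is the default: a decision point nobody has classified falls back to fully supervised, so risk has to be mapped explicitly before autonomy is granted.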

Iterate fast, but keep humans genuinely in the loop, not just in name

There's a version of human-in-the-loop that's real, and a version that's theatre. We've seen both.

Theatre looks like a human clicking approve on something they don't have time to properly read, because volume is too high and the review step has become a formality. That's worse than no oversight at all because it creates the appearance of governance without the substance.

Real human-in-the-loop design means the review step is genuinely useful: the right information, at the right point, for someone with the context and time to exercise judgement. That sometimes means slowing things down deliberately, which feels counterintuitive when the whole point of agentic AI is speed. But a fast system nobody trusts is slower in the end than a slightly slower one people actually rely on.

On proposals, we're iterating in tight cycles: build a piece, test it against real scenarios, see where a reviewer instinctively wants to intervene, adjust. On people development, Ella has taken the same approach, but the stakes feel different.

A mediocre proposal draft is a minor inconvenience. A development suggestion that lands badly is much harder to undo once trust is dented; that's pushed the iteration cycles even tighter, and the human review step even more deliberately unhurried.

We're carrying the same discipline into client work. A voice agent handling collections calls needs the same tight iteration: real calls and real escalation moments reviewed properly, not a pilot signed off on volume alone.

Specification is the real asset, not the model

If there's one idea from the course worth sitting with, it's this one.

The temptation is to think the valuable thing is the AI model itself. Pick the best one, and you're most of the way there, but that's backwards. Models are increasingly commoditised. What's hard to replicate is the specification: the rules, thresholds, escalation logic, the judgement about what good looks like in your specific context.

That specification lives in the heads of experienced people.

For Matt, it's the proposal writers who could tell you instinctively what makes a winning proposal versus a forgettable one.

For Ella, it's the managers who can sense, almost intuitively, when someone is ready for a stretch assignment versus when they need more support first.

That instinct is real expertise, just rarely written down, because until now there was no need to.

In C&R, the specification is the collections expertise we've built over years:

  • what good vulnerability identification looks like
  • what a fair payment conversation sounds like
  • when policy says escalate and when judgement says escalate anyway

That knowledge predates this course by a long way, and it's the actual reason clients should want to work with us on this rather than going straight to a technology vendor. Getting that knowledge out of people's heads and into a system that can be reasoned about, tested, and improved is the real work, on every workstream we run. The model is just the engine that executes the specification, which means the asset compounds. A well-specified system can be ported to a new model relatively easily, while one where the judgement is baked into prompts nobody fully understands is much harder to evolve.
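To illustrate what "out of people's heads and into a system" might look like, here is a minimal Python sketch in which escalation rules are written as named, owned, testable data, separate from whichever model executes them. Everything in it is an assumption for illustration: the `Rule` type, the rule names, the owners, the `confidence` field, and the 0.7 threshold are hypothetical and do not come from Arum's actual specification.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping


@dataclass(frozen=True)
class Rule:
    name: str
    owner: str  # the person or team accountable for this rule
    applies: Callable[[Mapping[str, Any]], bool]


# Hypothetical rules and thresholds, for illustration only.
SPEC: List[Rule] = [
    Rule(
        name="vulnerability_disclosed",
        owner="collections policy lead",
        applies=lambda facts: bool(facts.get("vulnerability_disclosed", False)),
    ),
    Rule(
        name="low_model_confidence",
        owner="operations lead",
        # Missing confidence is treated as zero, so it escalates by default.
        applies=lambda facts: float(facts.get("confidence", 0.0)) < 0.7,
    ),
]


def escalation_reasons(facts: Mapping[str, Any], spec: List[Rule] = SPEC) -> List[str]:
    """Return the name of every rule that says: hand off to a human."""
    return [rule.name for rule in spec if rule.applies(facts)]
```

Because the rules live outside any prompt, they can be reviewed by their owners and unit-tested without calling a model, and they carry over unchanged if the underlying model is swapped.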

Measure outcomes, not just activity

This is something we're still actively working through. We said as much in a recent client workshop: we don't think it's fully solved yet, but it's worth naming clearly.

Traditional KPIs measure activity:

  • proposals produced
  • turnaround time
  • training hours logged
  • calls handled

Those metrics made sense when a human did every step manually, but they make far less sense once an agentic system is doing some of the work, because they mask exactly what you most need to know.

On proposals, success is less about volume and more about whether reviewers trust what the system gives them enough to build on it rather than start from scratch.

On people development, the shift is arguably starker. Training hours and course completions were always a weak proxy for whether development actually happened. An agentic system makes that weakness impossible to ignore. Ella's view, which we share, is that the right metrics look more like career progression, retention, and honest feedback from the people on the receiving end, not completion rates.

In C&R, the same problem applies to client conversations. Calls handled and average handling time tell you almost nothing about whether a voice agent is actually serving customers well. We think the better questions are whether arrangements hold, whether vulnerability is being caught earlier, and whether a customer would say they were treated fairly. We don't have this fully solved either, but we think asking the right question matters more right now than having the perfect answer.
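The difference between an activity metric and an outcome metric can be shown in a few lines. This is a sketch under stated assumptions, not a measurement framework we use: the `CallRecord` fields and both function names are hypothetical, and the sample records are made up purely to exercise the code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallRecord:
    arrangement_made: bool
    arrangement_kept: bool  # hypothetical field: did the arrangement hold at review?


def calls_handled(calls: List[CallRecord]) -> int:
    # Activity metric: counts work done, says nothing about the result.
    return len(calls)


def arrangement_hold_rate(calls: List[CallRecord]) -> Optional[float]:
    # Outcome metric: of the arrangements made, what share held?
    made = [c for c in calls if c.arrangement_made]
    if not made:
        return None  # no arrangements were made, so the rate is undefined
    return sum(c.arrangement_kept for c in made) / len(made)


# Made-up records for illustration only; these are not real results.
sample = [
    CallRecord(arrangement_made=True, arrangement_kept=True),
    CallRecord(arrangement_made=True, arrangement_kept=False),
    CallRecord(arrangement_made=False, arrangement_kept=False),
    CallRecord(arrangement_made=True, arrangement_kept=True),
]
```

Two systems could report the same `calls_handled` figure while differing widely on `arrangement_hold_rate`, which is the gap the activity metrics hide.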

Build for scale, but prove value in phases

The last theme worth calling out is scope discipline. The course was consistent: pilot narrowly, but design as though it's going to grow, because if the pilot works, it rarely stays a pilot.

That's the posture we've taken with 80/AI. Proposal generation and people development are running as parallel phase one efforts, built on shared foundations: the same governance thinking, the same approach to oversight, the same instinct to measure outcomes rather than activity, so that what Matt's team learns on proposals transfers cleanly to Ella's team on people development, and vice versa.

It's the same posture we're taking into client conversations on voice agentic AI in C&R. Start narrow, prove it properly, and design the governance from day one so it can scale without being rebuilt later.

Where this leaves us

None of this stayed theoretical for long. We both came back and immediately started re-examining decisions already made, Matt on proposals, Ella on people development. Some held up. A few didn't, and those forced harder internal conversations than we'd had before.

It's also sharpened how we're approaching client conversations. We came into this course with real experience (the RFI, the hackathon, the internal builds), and a course doesn't replace any of that. What it gave us was a more rigorous way to organise what we already knew, and sharper questions to bring into how we help clients think about voice-enabled agentic AI in collections and recoveries.

That's the honest summary of what three weeks at MIT gave us. Not a finished playbook, but sharper questions to ask before mistaking technical progress for organisational readiness.

We'll share more detail (what we've built, what we'd do differently, and an open Q&A) at a webinar later in the summer. If you're wrestling with similar questions, in client-facing work or people processes, we'd genuinely like to hear what you're finding.

We don't think anyone has this fully figured out yet. We certainly don't.


Matt Riddall is SVP Group Client Solutions and Executive Sponsor of the 80/AI Programme at Arum Global, leading the proposal generation workstream. Ella Grice is Head of People at Arum Global, leading the people development workstream within the same programme. Details of the summer webinar will follow soon. 

If you would like to speak to Arum about Agentic AI within your business, please fill out the form below. 

 


 

      

 
