Why do AI agents fail in production after passing all their tests?

AI agents often fail in production because test inputs are too clean and predictable compared to what real users actually type. Other common causes include tools that return vague responses the agent cannot reason about, and a lack of decision logging that makes it nearly impossible to diagnose what went wrong.

How do I stop my AI agent from choosing the wrong tool for a user request?

Write tool descriptions as retrieval hints rather than documentation, and use specific parameter names that clearly signal their purpose. Ambiguous phrasing from real users causes the agent to score tools incorrectly, so tighter tool contracts reduce misrouting significantly.

What is silent failure in an AI agent and why does it matter?

Silent failure happens when an action inside an agent workflow fails but the agent still generates a confident-sounding response, giving the user no indication anything went wrong. This is dangerous because the underlying task was never completed, and the problem may not surface until days or weeks later.

When should I add decision logging to an AI agent?

Decision logging should be built in before the agent goes live, not added after something breaks. Without it, you can see that the agent ran but not why it chose a particular tool or produced a specific response, which makes debugging little more than guesswork.

How does Copilot Studio agent tool selection actually work at runtime?

Rather than simply reading a list of tools and picking the best match, the orchestrator runs a planning pass that scores each tool based on relevance to the current turn. It factors in the tool name, description, input parameter names, and any enum values before selecting a candidate and resolving the required inputs. This scoring process is closer to retrieval than traditional routing.

Why does my Copilot Studio agent keep picking the wrong tool?

Vague tool names and thin descriptions are the most common cause, as the orchestrator scores tools for relevance and poorly described tools rank badly even if they are logically the right choice. Writing descriptions as retrieval hints rather than documentation, and being specific with parameter names, will improve selection accuracy significantly.

How do I write better tool descriptions for a Copilot Studio agent?

Instead of writing descriptions like reference documentation, treat them as signals that help the orchestrator match the tool to user intent. Be specific about what the tool does, what inputs it expects, and when it should be used rather than a similar tool. Precise parameter names and enum values also feed into the scoring process.

When should I split one Copilot Studio agent into multiple agents instead of adding more tools?

As the number of attached tools grows, the scoring pass has more candidates to evaluate and the risk of the wrong tool being selected increases. If you find yourself adding a large number of tools to a single agent, splitting responsibilities across multiple agents can improve reliability and reduce latency caused by retry loops.

How do I get started running Ollama locally for automation prototyping?

You can install Ollama on a machine with sufficient RAM (32GB is a comfortable starting point) and pull models suited to your tasks, such as a smaller 8B model for fast iteration or a larger 14B model for more complex reasoning. From there you can test prompts and extraction logic against local documents without touching any hosted endpoints or incurring API costs.

What are the main benefits of using a local AI model instead of a hosted one?

Local models let you iterate quickly without worrying about rate limits, token costs, or network latency. They also allow you to work with sensitive or uncleared documents that you would not want to send to an external service during early experimentation.

When should I use a local model versus a hosted AI endpoint for automation work?

Local models are well suited to the prototyping and prompt-tuning phase, where speed and privacy matter more than peak output quality. For production automation workflows, hosted models typically offer better reasoning quality and reliability, making them the stronger choice once you move past early experimentation.

Why does model size matter when choosing an Ollama model for automation tasks?

Smaller models respond faster and use less memory, which suits quick iterative work, but they reason less reliably on complex tasks compared to larger models. Understanding this tradeoff locally builds practical intuition that directly applies when selecting and configuring models in production agentic workflows.

Category: Artificial Intelligence in Business

Why Do AI Agents Fail in Production When They Worked Fine in Testing?

Short answer: AI agents fail in production because test conversations are too clean, tool descriptions are too vague, and nobody logs the decisions the agent actually makes. The agent looked fine in testing because you asked it the questions you already knew it could answer. Real users do not do that.

This is the question I keep getting from people at other organisations who just deployed their first Copilot Studio agent or LangGraph build and watched it fall over in week one. So here is the longer version.

The longer answer on why AI agents fail in production

Three things go wrong, and they usually go wrong together.

1. The input distribution shifts the moment real users show up. In testing you type “what is the status of order 12345”. In production someone types “hey can you check where my thing from last tuesday is at, i think it was for the warehouse team”. The orchestrator now has to route ambiguous phrasing across multiple topics or tools, and it picks one confidently and gets it wrong. I wrote about this specific failure mode in how Copilot Studio agent tool selection actually works under the hood. The planner is scoring, not dispatching. Noisy input equals noisy scores.

2. Tools return status instead of state. A tool that returns {"result": "done"} gives the agent nothing to reason about on the next turn. When something goes sideways the agent cannot recover because it does not know what actually happened. I have hit this the hard way. The fix is boring: tools return the state the agent needs to make the next decision, not a success flag.

3. Nobody logged the decisions. Run logs tell you the flow executed. They do not tell you why the agent picked tool A over tool B, or why it summarised the ticket that way. Without a decision log you are debugging blind. You end up guessing at prompts.

There is also a fourth thing that gets ignored. Silent failure. A Power Automate action inside a Copilot Studio topic can fail and the agent will still generate a confident-sounding response for work that was never done. The user gets an answer. Nothing throws. You find out three weeks later when someone asks why the record was never created.

How to fix it

Start with the tool contracts. Every tool your agent can call needs a precise name, a description written as a retrieval hint rather than documentation, and parameters with concrete names. customerAccountNumber not id. If you are on Copilot Studio, the Microsoft Learn docs cover the schema but not the hint-writing style. That part you learn by breaking things.

Then cap the tool count. Past roughly 10 to 15 tools on a single agent, selection quality degrades because the relevance signal gets noisy. Split the agent before you add another tool.

Log the decision, not just the execution. For every turn: what did the user say, what tools did the orchestrator consider, what did it pick, what did the tool return, what did the agent do with that. Store it. Query it. This is the only way to improve the policy over time. I go deeper on this in the post on decision ownership.

Add adversarial test cases. Not “does it work”. Test misspellings, mixed languages, requests that sit between two topics, requests that reference something from two turns ago, requests where the user gives incomplete information and expects the agent to ask. This is where production breaks. Test it before production does. If you want a structured way to think about adversarial evaluation, the red team methodology in Anthropic’s jailbreak safeguards framework for Fable transfers directly to internal agent testing.

Fail loudly. If a tool call fails, the agent should know and say so, not paper over it. Wire actual error handling into every Power Automate step the agent can invoke. Return the error state to the orchestrator so it can decide what to do next.

Related gotchas

Two more worth flagging.

Context window drift. Long conversations quietly push earlier turns out of context. The agent forgets what the user told it in turn one by turn twelve. If your use case has long sessions, either summarise state into a persistent variable or split the interaction. I have written more about when a conversational interface is even the right choice in Copilot Studio is not always the answer.

Model updates. The underlying model gets updated. Your prompts that worked last month behave differently this month. This is not theoretical. Anthropic and OpenAI both ship model updates that shift behaviour on edge cases. The Anthropic Fable 5 redeploy postmortem is a useful template for understanding how regressions surface and what a proper rollback and recovery process looks like. If you are running on Claude or any hosted model, version-pin where you can and regression-test when you cannot.

None of this is glamorous. It is the boring work that separates an agent that survives in production from one that gets rolled back in week two. I talk about more of these lessons on LinkedIn if you want to compare notes.

Frequently Asked Questions

Why do AI agents fail in production after passing all their tests?

AI agents often fail in production because test inputs are too clean and predictable compared to what real users actually type. Other common causes include tools that return vague responses the agent cannot reason about, and a lack of decision logging that makes it nearly impossible to diagnose what went wrong.

How do I stop my AI agent from choosing the wrong tool for a user request?

Write tool descriptions as retrieval hints rather than documentation, and use specific parameter names that clearly signal their purpose. Ambiguous phrasing from real users causes the agent to score tools incorrectly, so tighter tool contracts reduce misrouting significantly.

What is silent failure in an AI agent and why does it matter?

Silent failure happens when an action inside an agent workflow fails but the agent still generates a confident-sounding response, giving the user no indication anything went wrong. This is dangerous because the underlying task was never completed, and the problem may not surface until days or weeks later.

When should I add decision logging to an AI agent?

Decision logging should be built in before the agent goes live, not added after something breaks. Without it, you can see that the agent ran but not why it chose a particular tool or produced a specific response, which makes debugging little more than guesswork.

July 8, 2026
How Copilot Studio Agent Tool Selection Actually Works Under the Hood

Most people building agents think copilot studio agent tool selection works like this: you attach a few tools, write a description for each, and the LLM reads the list and picks the right one. That is directionally correct and completely misses what is actually happening at runtime. The orchestrator runs a planning pass. It scores your tools against the current turn. Descriptions, input schemas, and even the order of your tools all feed that scoring.

Once you see the mechanism, you stop writing tool descriptions like documentation and start writing them like retrieval hints. That changes how you name inputs, how many tools you attach to one agent, and when you split an agent instead of adding a fourteenth tool.

What you see from the maker portal

In the maker studio, you attach a tool to an agent. You give it a name, a description, and an input schema (either from a connector, a Power Automate flow, an MCP server, or a prompt). At runtime, you type a message, the agent thinks for a moment, and calls one of the tools. The trace view shows you which tool was picked and what inputs were passed.

That surface makes it look like the model reads the list top to bottom and picks the best match. It is not that simple. The trace hides the planning pass, and the planning pass is where 80% of the reliability of your agent lives. Microsoft’s generative orchestration docs hint at this but do not spell it out in the way a builder needs.

What the orchestrator is actually doing

Between the user turn and the tool call, the orchestrator does something closer to retrieval than dispatch. It takes the current turn, the conversation history, and the agent instructions, and it scores each attached tool for relevance. The scoring uses the tool name, description, input parameter names, parameter descriptions, and enum values if present. Tools with vague names and thin descriptions score badly regardless of how logically correct they are.

Then the planner picks a candidate tool, resolves inputs from the turn context (or asks the user for missing ones), and invokes. If the invocation fails or the result is empty, the planner may retry with a different tool. That retry loop is where token budget disappears and latency creeps up.

This is the same pattern you see in the Copilot Studio release plans that describe how tools and knowledge sources get grounded per turn. It is a retrieval problem wearing a routing costume.

Where the mechanism breaks down

Three failure modes show up over and over. I wrote about the schema version of this in the Dataverse MCP server tool shape post, but it applies to every tool surface.

Overlapping descriptions. Two tools both say something like “Get information about an order.” The planner cannot tell them apart at the description layer, so it falls back to parameter matching, which is noisier. You get silent misrouting where the agent confidently picks the wrong tool.

Vague input schemas. A parameter called id of type string tells the planner nothing. A parameter called customerAccountNumber with a description like “6-digit customer account, not the order number” gives the planner something to bind against.

Long tool lists. Once you attach more than roughly 10 to 15 tools, scoring quality degrades. The signal gets noisy. This mirrors what happens when you stuff too much into a system prompt, which I covered in the business skills post.

How to build once you know this

Write tool descriptions as retrieval hints, not documentation. State what the tool does, when to use it, and critically, when not to use it. “Use this to look up an order by its order number. Do not use this for customer profile lookups.” That negative clause is doing work.

Name parameters like a human would search for them. orderNumber beats id. Add a description on every parameter, even the obvious ones. Enum values are gold because they narrow the planner’s search space to something concrete.

Cap tool count per agent. If you find yourself attaching a fourteenth tool, split the agent by domain and use multi-agent orchestration to route between them. A focused agent with 6 well-described tools outperforms a monster agent with 20 tools every time. The same principle applies when deciding where your agent reads its data from — something I break down in SharePoint vs Dataverse as a Copilot Studio Knowledge Source.

Test the routing, not just the tools. Write a set of representative user turns and check which tool the planner picks. If two tools tie or the wrong one wins, fix the descriptions before you touch the model or the instructions. That is the fastest debugging loop I have found, and it is one I keep coming back to in my own work.

The mechanism is not magic. It is retrieval with extra steps. Once you build for that, your agents get more predictable, cheaper to run, and easier to explain to a stakeholder.

Frequently Asked Questions

How does Copilot Studio agent tool selection actually work at runtime?

Rather than simply reading a list of tools and picking the best match, the orchestrator runs a planning pass that scores each tool based on relevance to the current turn. It factors in the tool name, description, input parameter names, and any enum values before selecting a candidate and resolving the required inputs. This scoring process is closer to retrieval than traditional routing.

Why does my Copilot Studio agent keep picking the wrong tool?

Vague tool names and thin descriptions are the most common cause, as the orchestrator scores tools for relevance and poorly described tools rank badly even if they are logically the right choice. Writing descriptions as retrieval hints rather than documentation, and being specific with parameter names, will improve selection accuracy significantly.

How do I write better tool descriptions for a Copilot Studio agent?

Instead of writing descriptions like reference documentation, treat them as signals that help the orchestrator match the tool to user intent. Be specific about what the tool does, what inputs it expects, and when it should be used rather than a similar tool. Precise parameter names and enum values also feed into the scoring process.

When should I split one Copilot Studio agent into multiple agents instead of adding more tools?

As the number of attached tools grows, the scoring pass has more candidates to evaluate and the risk of the wrong tool being selected increases. If you find yourself adding a large number of tools to a single agent, splitting responsibilities across multiple agents can improve reliability and reduce latency caused by retry loops.

July 7, 2026
Anthropic Published a Jailbreak Safeguards Framework for Fable and the Red Team Methodology Is What I Am Studying

Anthropic just published the Fable Safeguards Jailbreak Framework, a detailed writeup of how they stress-test and defend Fable against adversarial prompts. The interesting part of the anthropic fable jailbreak safeguards framework is not that a creative-writing model has guardrails. It is that the methodology behind those guardrails is reproducible, and it maps almost one-to-one to how internal enterprise agents should be evaluated.

I read it twice. The second time I read it with a notebook open, because half of it applies directly to the agents I see people building on Copilot Studio and Power Platform.

What Anthropic actually shipped

The framework covers four things in concrete terms. First, a taxonomy of jailbreak attempts: role-play escalation, incremental context poisoning, persona hijack, and multi-turn drift. Each category comes with example prompts and the failure signature they produce. Second, a red team process: how attacks are generated, how they are scored, and how the results feed back into training and system prompt tuning. Third, an evaluation harness that runs adversarial suites on every candidate model before release, with pass thresholds per category. Fourth, a post-deployment monitoring loop that treats jailbreak patterns as a live signal, not a launch-time checkbox.

The document is Fable-specific in places. The character stability and narrative coherence categories are unique to a long-form fiction model. But the process scaffolding around it is general. It is the same shape you would want for any agent that talks to real users.

Why this framework matters beyond Fable

Most internal agent rollouts I hear about have no adversarial testing at all. The eval suite is a set of happy-path questions the product owner wrote in a spreadsheet. If the answers look reasonable, ship. I wrote about this pattern after the Fable 5 rollback, and this framework is essentially Anthropic answering the question of what a real pre-launch eval looks like.

The transferable parts are the ones I care about. A taxonomy of failure modes for your agent. A generation process for adversarial prompts (LLM-generated attacks scored by a separate LLM works surprisingly well). Pass thresholds per category, not one aggregate score. And a monitoring loop that logs suspicious prompts in production and feeds them back into the next eval run.

The Fable-specific parts do not transfer. Nobody deploying an HR agent needs to worry about narrative coherence over 20,000 tokens. But the persona hijack category absolutely applies. If your Copilot Studio bot has a system prompt that says “you are a helpful HR assistant,” someone will try to convince it it is now a poet, or a Linux terminal, or an unrestricted chatbot. Anthropic has a category for this with concrete test prompts. You can borrow it.

The other reason this matters: Anthropic is publishing the methodology openly. That is a shift. Most vendors treat red team work as internal-only. When the process is public, builders like me can use it as a contract. If your agent cannot pass a stripped-down version of this suite, you have not finished building it.

What I would do with it this week

Three concrete things.

One, take the taxonomy and map it to whatever agent you have in production or staging. Persona hijack, role-play escalation, incremental context poisoning, multi-turn drift. Write down what each of these would look like against your specific bot. For a Copilot Studio agent connected to a Dataverse knowledge source, persona hijack usually means getting the bot to answer questions outside its scope. Incremental context poisoning means feeding it a benign document that contains hidden instructions.

Two, generate an adversarial prompt set. I would use Claude itself for this. Give it your system prompt, your agent description, and the taxonomy, and ask it to produce 30 attack prompts per category. This takes an afternoon.

Three, run the suite before your next prompt change or model swap. Not just the happy path. If your only pre-deployment test is “does it still answer the FAQ correctly,” you are shipping the same launch mistake that caused the Fable 5 rollback. Regressions live in the edge cases. The Claude Fable 5 Mythos 5 release is a good reminder that even well-resourced teams treat persona stability and long-context behavior as areas that need dedicated eval coverage, not afterthoughts.

I have been writing more about this kind of pre-deployment discipline on LinkedIn, because it is the gap I keep seeing between teams that ship reliable agents and teams that ship one and spend six months patching it.

The framework is not a silver bullet. But it is the closest thing to a public template for red-teaming an LLM agent that I have seen, and I plan to keep it open in a tab for a while.

This post was inspired by Fable Safeguards Jailbreak Framework via Anthropic.

July 3, 2026
Anthropic Redeployed Fable 5 After Rolling It Back and the Postmortem Is What I Am Reading Twice

Anthropic pulled Fable 5 shortly after its initial release, then redeployed it with fixes and a public writeup. The anthropic fable 5 redeploy is not the story I care about. The postmortem is.

Rolling back a flagship model publicly is rare. Doing it with a clear explanation of what broke and how they verified the fix is rarer. I have been reading this one twice because it maps almost directly onto how anyone deploying agents internally should think about regressions.

What Anthropic actually shipped

Fable 5 is the narrative-tuned variant in the Claude lineup, the one I wrote about when it first landed alongside Mythos 5. The redeploy brings the model back online after Anthropic identified regressions the initial release had introduced, including drift in persona stability and inconsistencies the launch evals had missed.

The redeployed build ships with expanded regression tests, additional persona stability checks, and a documented verification pass before the rollout was reopened. In the writeup Anthropic names what broke, what they added to catch it next time, and what they changed in their pre-release process. That is the part I keep coming back to.

You can read the full note on Anthropic’s site. It is short. It is worth your ten minutes.

Why the rollback and redeploy matters

Most frontier model releases treat launch as a one-way door. Ship, patch quietly, move on. Public rollback of a flagship variant tells you something about how the company thinks about the contract with the people building on top of it.

Three things stand out.

First, the transparency raises the bar. If you are building an agent on top of a model and the vendor tells you exactly what regressed and how they verified the fix, you can decide whether your workflow is affected. If the vendor patches silently, you find out through user complaints. I would rather have the writeup.

Second, the operational discipline is the template. Naming the regression, adding tests that would have caught it, running a verification pass, and only then reopening the rollout is exactly the pattern I want internal agent deployments to follow. Most internal agent rollouts I hear about from people at other organisations skip at least two of those four steps.

Third, this raises expectations for the whole space. When one vendor publishes a postmortem like this, the ones that keep patching silently start looking like they have something to hide. That is good pressure on the market.

The uncomfortable read is what it says about launch evals in general. Fable 5 shipped, passed whatever gates it passed, and still needed a rollback. If that can happen at Anthropic, it is happening everywhere. The difference is whether you hear about it. That same assumption problem shows up in workforce planning too, and the OpenAI EU AI workforce report makes the case that the workflow change bucket is where most of the silent breakage actually lives.

What I would do with this news this week

Two concrete things.

One. Write down what a rollback looks like for the agents you have in production. Not the theory. The actual steps. Who decides to pull it, how you flip the switch, what the fallback behavior is, and how you tell users. If you cannot answer those four in a paragraph, you do not have a rollback plan, you have a wish.

Two. Look at your regression tests for agent behavior. If the only thing you check before a prompt change or a model swap is whether the happy path still works, you are shipping the same way Fable 5 shipped the first time. Add persona drift checks. Add tool-use reliability checks. Add a few adversarial prompts you know used to fail. This is exactly the kind of discipline I flagged when Opus 4.8 landed, because tail failures are what kill agentic workflows in production.

If you are running Claude in a Power Automate flow or a Copilot Studio agent, the practical version is simple. Pin the model version in your connector config. Do not auto-upgrade. Keep a small eval set you can rerun on every model change. Before you reach for a desktop flow to automate that verification step, it is worth checking whether a cloud flow would do the job instead. The docs make version pinning straightforward, and it costs you nothing until the day it saves you.

I have written more on how I think about model selection and agent reliability over on LinkedIn. The short version is this. Every model you depend on will regress at some point. The vendors that tell you about it are the ones worth building on.

The next flagship release will land soon. The question is whether the postmortem discipline sticks.

This post was inspired by Redeploying Fable 5 via Anthropic.

July 1, 2026
Anthropic Launched Claude Science for AI Workbench and Lab Workflows Are the Target

Anthropic shipped Claude Science, a research-focused capability in the AI Workbench aimed squarely at scientific workflows. Literature synthesis, experiment design, biological data analysis. This is the Claude Science AI Workbench play, and it lines up with what they did with Claude Finance earlier this year. The pattern is now obvious: Anthropic is building vertical agents, not chasing one general model that does everything passably.

I am not a scientist. I build automation inside a large enterprise. I still think this release is worth paying attention to, because the shape of how vertical AI gets packaged is changing in front of us.

What Claude Science actually does

Claude Science lives inside the AI Workbench and targets lab workflows specifically. The headline use cases Anthropic calls out: synthesizing literature across long sets of papers, helping design experiments, and analyzing biological data including things like protocol drafting and interpretation of assay results. It is not a chat wrapper. It is a Workbench surface tuned for how research actually happens, with connectors and tool use shaped for the domain.

The model behind it is Claude with science-tuned behavior. That means persona stability across long sessions, better grounding when the input is a 60-page paper, and tool use that does not fall apart when you ask it to chain a literature search into a protocol draft. Same Claude family I already know from Anthropic Shipped Claude Opus 4.8, shaped for a different job.

It is for researchers, lab leads, and computational biology teams. Not for me, not directly. But the Workbench is the same Workbench, and the patterns are visible from the outside.

Why this release matters

Claude Finance, then narrative-tuned variants like Fable and Mythos, now Claude Science. The lineup is getting longer and more segmented on purpose. The question you ask when picking a model is no longer which is smartest. It is which is shaped for the job.

This matters for automation work even if you never touch a pipette. Three reasons.

First, vertical specialization is going to hit the procurement conversation. Buying “Claude” used to be a single decision. Soon it is going to look more like buying SAP modules. Finance team wants Claude Finance, R and D wants Claude Science, marketing wants something else. The licensing and governance story for enterprise gets more complex, and the people who figure out how to map vertical models to internal use cases first are going to look smart in twelve months.

Second, the technical work behind a vertical model is the same work that makes any agent reliable. Persona stability, tool-use accuracy, grounded outputs on long documents. The improvements Anthropic ships for scientists end up in the base behavior eventually. I have written before that coding quality and tool-use reliability are the same problem. Same logic here. Lab workflow reliability and enterprise agent reliability are the same problem.

Third, this is a signal about where domain-specific Claude releases go next. Legal and medical are the obvious bets. Compliance is the one I am watching. A Claude tuned for regulatory text would land hard in any enterprise that has to track policy changes across jurisdictions. The broader question of how AI reshapes specialized roles is one the OpenAI EU AI workforce report mapped out in useful detail, and the workflow change bucket it describes fits vertical model adoption closely.

What I would do with it this week

I do not have a wet lab. I do have access to the Workbench. Here is the concrete plan.

Open the Anthropic Workbench and look at what Claude Science exposes as tools and connectors. The tool shape is the interesting bit. If they expose a literature search tool, a protocol generator, and a data analysis surface, that tells me how Anthropic is thinking about composing scientific work into discrete steps. That decomposition pattern translates directly to how I would design a multi-step agent for any domain.

Then I would feed it a long dense document I actually have, something like a 40-page internal architecture review, and see how it handles synthesis. Not because that is what it is for, but because long-document grounding is where most agents fall over and this is the variant tuned for it.

Last, I would compare the system prompt structure to what I see in the general Workbench. If Anthropic is shipping science-shaped defaults, those defaults are a free lesson in how they think prompts should be organized for reliability. That alone is worth an afternoon.

The vertical Claude lineup is going to keep growing, and the teams that learn to read these releases as architectural signals rather than product news are going to ship better agents.

This post was inspired by Claude Science Ai Workbench via Anthropic.

June 30, 2026
Outlook Add-in Mailbox 1.16 Hit GA and the COM to Web Gap Just Got Smaller

Microsoft just shipped Mailbox requirement set 1.16 for Outlook add-ins to GA. If you have ever tried to retire a legacy COM/VSTO add-in and hit a wall on signed or encrypted mail, the Outlook add-ins Mailbox 1.16 release is the one that finally moves the needle. The official announcement is on the Microsoft 365 Developer Blog and it is worth a careful read.

I have been watching this gap close slowly for years. This one is a real jump.

What shipped in Mailbox 1.16

Mailbox 1.16 is the next API surface for Outlook web add-ins. The headline capabilities focus on message and information security. Add-ins can now decrypt protected messages and attachments inside an event-based workflow, read and act on sensitivity labels, and handle a broader set of signed and encrypted mail scenarios that previously forced teams to keep a COM add-in in production.

For anyone who has not been tracking it, the Mailbox requirement set is the contract between your add-in manifest and the Outlook host. You declare the minimum requirement set you need, and Outlook tells you whether the host can run it. Hitting 1.16 GA means it is now safe to target this version on production Outlook surfaces across Windows, Mac, web, and the new Outlook for Windows.

What it actually does

The two pieces I care about most are event-based decryption and sensitivity label handling.

Event-based decryption is the one that unblocks real migrations. Before 1.16, a web add-in that needed to read the body of a protected message basically could not, which is why so many compliance and DLP add-ins stayed on COM. You can now wire an OnMessageRead or similar event, decrypt the protected payload, and run your logic against the actual content. That is the difference between a working add-in and a stub that fails on every rights-protected mail.

Sensitivity label APIs let the add-in read the applied label, react to label changes, and gate behavior on classification. If you are building anything that has to respect Purview labels, this is the surface you wanted.

It is not a full COM parity. Deep MAPI access is still off the table. If your old add-in pokes at properties through extended MAPI, you are still doing a rewrite, not a port.

Why it matters

COM add-ins are on a deprecation path. Microsoft has been signaling this for a long time, and the new Outlook for Windows does not run them the way classic Outlook does. Every team I talk to with a legacy Outlook add-in has the same backlog item: figure out the web add-in story before the runway runs out.

Until now, that backlog item had a hard blocker for any add-in that touched protected or classified mail. Compliance add-ins, encryption helpers, journaling tools, archive integrations. They all needed APIs that did not exist on the web side. So the migration plan stalled and the COM add-in stayed.

Mailbox 1.16 removes that blocker for a real chunk of those scenarios. Not all of them. But enough that the conversation changes from “we cannot migrate” to “we have a path, let us scope it.”

The other reason this matters: event-based activation plus decryption means you can do server-light add-ins that react to mail as it is read or sent, without standing up a full middle tier. For internal tooling teams, that is a smaller surface to maintain.

What I would do with it this week

If I had a legacy COM add-in I was trying to retire, I would do three things.

First, pull the manifest of the old add-in and list every property and event it touches. Map each one to the current Outlook JavaScript API requirement sets. The ones that now map cleanly to 1.16 are your quick wins. The ones that still do not map are the work you have to scope separately.

Second, build a small proof of concept that handles one protected message end to end. Event-based activation, decrypt, read the body, write something useful back. If that pipeline works on your tenant with your label policy, you have de-risked the hardest part of the migration.

Third, think about where the add-in talks to the rest of your stack. If you are landing extracted data into Dataverse or kicking off a Power Automate flow from the add-in, the same question I covered in Stop Reaching for Desktop Flows When a Cloud Flow Would Do the Job applies here too — choose the right automation layer before you wire anything up. Decryption failures, label mismatches, and event timeouts all need explicit handling, not a generic retry.

And if you are prototyping any of the logic locally before wiring it into the add-in, the notes from Thirty Days Running Ollama Locally for Automation Work are worth a look — the same trade-offs around local versus cloud processing show up when you are deciding where to run label inspection or decryption logic.

I have been waiting for this one. Time to see if it holds up on a real mailbox. More notes on my LinkedIn once I have something running.

The COM to web migration story for Outlook just got a lot more believable.

This post was inspired by Mailbox requirement set 1.16 now available for Outlook add-ins via Microsoft 365 Developer Blog.

June 30, 2026

SharePoint vs Dataverse as a Copilot Studio Knowledge Source

Every Copilot Studio agent I have built or seen built starts with the same decision. Where does the knowledge live. SharePoint or Dataverse. The copilot studio knowledge sources picker makes it look like a flat choice. It is not. The two behave very differently once real content and real users hit them, and the wrong call shows up around month three when answers start drifting.

I have been reading a lot about this lately and talking to people at other organisations who hit the same wall. Here is how I actually compare the two now, across four dimensions that matter.

Content ceiling and file limits

SharePoint as a knowledge source caps out at 4 SharePoint sites per agent, and Copilot Studio indexes the documents through Graph search. That sounds generous until you realise the same site is often wired into three agents with different filters, and indexing latency on freshly uploaded files can run several minutes before the agent sees them.

Dataverse knowledge sources let you attach tables with up to 25 columns indexed for semantic search, and you can scope to specific rows with security roles. The ceiling is higher and the control is finer. The tradeoff is you have to actually get the content into Dataverse rows in the first place, which is real work if your source of truth is a folder of PDFs.

Governance and ownership

This is where I keep seeing teams pick wrong. SharePoint feels easy because the content is already there. But the same SharePoint site ends up wired four different ways by four different makers with no shared governance, which I wrote about in more detail when Dataverse got knowledge sources and agent feedback loops.

Dataverse forces ownership. Each knowledge source is a managed record with an owner, a solution layer, and environment promotion. You can audit who changed what. With SharePoint sites, the agent sees whatever a site owner decides to upload that afternoon. That is not governance, that is hoping.

Latency and answer quality

Dimension	SharePoint	Dataverse
Indexing latency	2-10 minutes typical	Near real-time on row update
Content ceiling	4 sites per agent	Multiple tables, row-level filters
Citation quality	File and page reference	Row-level with column context
Governance	Site owner discretion	Solution-bound, owned record
Setup effort	Low if content already in SharePoint	Higher, needs data model

Citation quality matters more than people admit. When an agent cites a 200-page PDF in SharePoint, users still have to find the answer inside the document. With Dataverse row-level citations, the agent points at the specific record, which makes hallucinations easier to spot and correct. The new Dataverse MCP server tool shape splits metadata inspection, querying, and search into cleaner boundaries that make this even more precise for agents.

Cost and licensing

SharePoint knowledge sources use the standard SharePoint connector, which keeps things in the base licensing envelope for most internal scenarios. Dataverse knowledge sources mean Dataverse capacity, and depending on how much you index, that adds up. If near real-time data freshness is part of your argument for Dataverse, the low-latency sync from Dataverse to Fabric hitting GA is worth factoring into the broader data architecture conversation at the same time. Do not pay for Dataverse capacity you do not need.

Microsoft documents the current connector and capacity behaviour in the Copilot Studio docs, and it is worth checking before you commit a topology.

Choose SharePoint if, choose Dataverse if

Choose SharePoint as your Copilot Studio knowledge source if the content is already in SharePoint, lives as documents rather than structured data, the agent serves under a few hundred users, and you can live with multi-minute indexing latency. It is the right call for a policy lookup agent pointed at an existing HR site.

Choose Dataverse as your knowledge source if the content is structured, ownership and auditability matter, you need row-level security, or the agent is going to be promoted across environments under ALM. It is the right call for any agent that touches process logic, business skills, or anything a regulator could ask about later.

The decision is not which one is better. It is which one matches the shape of your content and the seriousness of the use case. I have made the wrong call on this and paid for it in debugging time. Pick deliberately.

Frequently Asked Questions

What are the best copilot studio knowledge sources for enterprise agents?

The two main copilot studio knowledge sources are SharePoint and Dataverse, and the right choice depends on your governance needs and content structure. SharePoint is quicker to set up if content already exists there, but Dataverse offers better control, row-level filtering, and near real-time indexing for more demanding use cases.

When should I use Dataverse instead of SharePoint as a knowledge source in Copilot Studio?

Dataverse is the better choice when you need strict governance, auditable ownership, or row-level security over your content. It also suits scenarios where indexing latency matters, since Dataverse updates are reflected near real-time compared to SharePoint’s typical 2-10 minute delay.

Why does my Copilot Studio agent not pick up newly uploaded SharePoint files immediately?

Copilot Studio indexes SharePoint content through Graph search, which can introduce a delay of several minutes before freshly uploaded files become visible to the agent. This latency is a known tradeoff of using SharePoint as a knowledge source rather than a structured data store like Dataverse.

How do I improve answer quality in a Copilot Studio agent?

Switching from SharePoint to Dataverse as your knowledge source can improve citation quality, since Dataverse returns row-level references with column context rather than broad file or page links. Scoping your knowledge source to well-structured tables with relevant columns indexed for semantic search also helps the agent return more precise answers.

June 30, 2026

OpenAI Published a Map of AI’s Impact on EU Jobs and the Workflow Change Bucket Is Where I Live

OpenAI published Mapping Europe’s AI Workforce Opportunity this week, a report that tries to sort EU occupations by how AI will hit them. The OpenAI EU AI workforce report breaks roles into three buckets: jobs facing automation, jobs likely to grow, and the massive middle where workflows get reshaped without the role itself disappearing. That middle bucket is where I have spent the last several years of my career.

The doom headlines will focus on bucket one. I want to talk about bucket two, because that is where the actual work is.

What the OpenAI EU AI workforce report actually does

The report maps occupations across the EU labour market against AI exposure, using task-level analysis rather than blunt job-title categorisation. It pulls from O*NET-style task decomposition and overlays current model capabilities to score how much of each role can be done by AI, augmented by AI, or left largely untouched.

Three buckets come out the other end.

The automation bucket holds roles where a large share of tasks are model-doable today. Think structured data entry, basic translation, first-line content moderation. The growth bucket holds roles that get more valuable because AI exists, including AI-adjacent engineering, training data work, and oversight roles. The workflow change bucket is the biggest of the three by headcount, and it covers knowledge workers whose individual tasks shift but whose overall job sticks around.

The report is careful. It does not predict timelines. It does not claim to know how regulation, adoption rates, or organisational inertia will shape the actual outcome. It is a map of exposure, not a prophecy.

Why it matters

The workflow change bucket is the entire job description of a Power Platform developer, an RPA engineer, an automation consultant, or anyone who builds Copilot Studio agents for a living. We are the people who go into a role, decompose the tasks, and figure out which ones get handed to a flow, an agent, or a model call, and which ones the human still owns.

The report is essentially describing the next five years of demand for this work.

Where I think it is right: the middle bucket is huge, and most people underestimate it. Headlines about full job replacement get clicks, but the operational reality is task-level reshaping inside roles that keep their name on the org chart. A finance analyst is still a finance analyst, but half their reconciliation work now runs through an agent and they spend more time on exception handling and commentary.

Where I think it understates the reality: the report treats workflow change as if it happens because the technology exists. It does not. Workflow change happens when someone redraws decision rights, and most organisations avoid that conversation because it is uncomfortable. I have watched plenty of automation projects stall not because the tech failed but because nobody was willing to say who owns the decision after the agent makes its recommendation.

The report also does not capture the latency problem. A reshaped workflow where the AI step takes two seconds and the human approval step takes two days is not actually reshaped. It just has a faster front end and a longer queue.

What I would do with it this week

If you build automations for a living, read the report and find the occupations in your organisation that sit in the workflow change bucket. Not the automation bucket. The middle one. Those are the roles where you have the most leverage in the next twelve months, because the people in them are not afraid of being replaced. They want the boring parts gone.

Then pick one task inside one of those roles. Just one. Decompose it. Figure out which steps a Power Automate cloud flow handles, which steps need a Copilot Studio agent, and where the human stays in the loop. Build a small version. Ship it to one team.

This is the work. It is unglamorous. It is also exactly what the OpenAI EU AI workforce report says the EU economy is going to need a lot of, for a long time.

I have been doing this work for a while and have written about the patterns that show up across these projects. The report does not change my day-to-day. It validates it.

The next five years are going to be a lot of careful task decomposition inside roles that keep their names. That is fine by me.

This post was inspired by Mapping Europe’s AI Workforce Opportunity via OpenAI.

June 29, 2026
Stop Reaching for Desktop Flows When a Cloud Flow Would Do the Job

I keep seeing the same pattern on LinkedIn and in community calls. Someone builds a desktop flow to read an Outlook inbox. Someone else builds a desktop flow to write rows into SharePoint. Then they wonder why the bot crashes every time IT pushes a Windows update. The cloud flows vs desktop flows decision is not a style choice. It is an architecture choice, and most teams get it backwards.

The problem

People reach for desktop flows because they saw a demo where someone automated a click on a button and it felt powerful. So now every automation idea starts with Power Automate Desktop open and a recorder running. That is how you end up with a UI automation reading an Exchange mailbox that has had a documented Graph API for years.

Here is the rule I use, and it has not failed me yet.

If the system has an API or a connector, use a cloud flow. If it does not, use a desktop flow. That is it.

Everything else is rationalisation. Outlook has a connector. SharePoint has a connector. Dataverse has a connector. SAP has a connector. ServiceNow has a connector. Salesforce has a connector. If you are clicking buttons in any of these with Power Automate Desktop, you are building a fragile silo for a problem that was already solved.

The fix

Before you open Power Automate Desktop, run through three checks.

Check 1. Does the target system have a Power Automate connector? Go to the connector reference and search. There are over 1000 connectors. If yours is there, the conversation is over. Build a cloud flow.

Check 2. Does the target system have a documented REST API? If yes, and there is no premade connector, build a custom connector or use the HTTP action in a cloud flow. Still cloud. A custom connector is a one-time investment that pays back for years. A desktop flow against the same system needs babysitting forever.

Check 3. Is the target a legacy desktop app, a Citrix session, a thick client with no API, or a website with no public API and aggressive anti-automation? Now you have a real desktop flow use case. Build it, but build it knowing the maintenance cost.

The reason this matters is not aesthetics. It is failure modes. Cloud flows fail on API errors you can read and handle. Desktop flows fail because someone moved a window, changed a font scaling setting, or the machine running the unattended bot rebooted for patches at 3am. I wrote about this exact trade-off in RPA vs AI automation. Determinism is the whole point of RPA. The moment you put RPA in front of a system that already exposes a stable contract through an API, you are throwing away the determinism you came for and replacing it with selector fragility.

One more pattern I see and want to call out. Hybrid flows are fine. A cloud flow that triggers a desktop flow only for the one screen that has no API is a perfectly clean design. What is not clean is a desktop flow that opens Outlook, opens SharePoint, opens Excel Online, all through the UI, when every single one of those has a first-party connector. That is not automation. That is a recorded macro pretending to be an enterprise solution. If you are prototyping flow logic and want a faster feedback loop without spinning up connectors every time, Thirty Days Running Ollama Locally for Automation Work covers where local AI actually saves time during that phase and where it falls short.

The Microsoft Learn docs lay this out clearly under Power Automate documentation, but the guidance gets lost in the noise. The short version, from what I have built and watched others build: API first, connector second, custom connector third, desktop flow last. Reverse that order and you will spend more time fixing the automation than you saved by building it. And if your flows are starting to touch Dataverse directly, the Power Platform May 2026 Update has a few things worth turning on before you go further, particularly Power Fx UDTs hitting GA.

June 29, 2026
Thirty Days Running Ollama Locally for Automation Work

I spent the last thirty days running Ollama locally for automation prototyping. Power Automate flow drafts, Copilot Studio prompt iteration, extraction tests on documents I did not want to push to a hosted endpoint during early experimentation. This is the honest review.

The short version: local AI earned a permanent spot in my prototyping loop. It did not earn a spot in production. The gap between those two things is bigger than most tutorials admit, and the hardware reality check is the part nobody writes about until they have lived through it.

What I used it for and the setup

Workstation is a 32GB RAM machine with a decent GPU. Nothing exotic. I ran Ollama with a rotation of models: Llama 3.1 8B for fast iteration, Qwen 2.5 14B for anything that needed actual reasoning, and a quantized Mistral variant for extraction work. Pulled, swapped, benchmarked over four weeks.

The work itself was three buckets. Drafting Copilot Studio topic prompts and testing how they handled vague phrasing before I touched the real environment. Prototyping extraction logic for unstructured text where I wanted to see the shape of the output before deciding on a hosted model. And quick scratch work: rewriting a Power Fx expression in plain English, sketching a flow outline, asking dumb questions I did not want logged anywhere.

What it does well

Iteration speed. This is the win. When I am tuning a system prompt for a Copilot Studio topic, I want to run it twenty times with small variations and see what breaks. Doing that against a hosted endpoint means watching token costs, hitting rate limits, and waiting on network round trips. Locally, I iterate as fast as I can type. That alone saved me probably six hours over the month.

Working with sensitive drafts. There are documents I would not paste into a cloud chat during a five-minute exploration. Internal text, draft policies, anything that has not been cleared for an external endpoint. Having a local model means I can think out loud against real text instead of synthetic placeholders. The placeholders always lie to you about how the real prompt will behave.

Offline. I traveled twice this month. Trains, hotel wifi that pretends to exist. Ollama did not care. My prototyping loop kept moving.

Learning the failure modes up close. Running models locally forced me to actually see how an 8B model reasons versus a 14B, where quantization hurts, where context windows bite. That intuition transfers directly to how I think about agentic workflows, because the reasoning layer is the reasoning layer regardless of where it runs.

Where it falls short

Quality drop on anything that needed real reasoning. I tried wiring a local model into a Power Automate flow through a custom connector for an extraction task that hosted models handle cleanly. The 8B model produced confidently wrong JSON about thirty percent of the time. The 14B was better but slower than I could tolerate inside a flow that triggered on document upload.

Hardware reality. Most tutorials show a 7B model running snappy and call it a day. Try running a 14B model with a 16K context window while you also have Teams, a browser with forty tabs, and Power Apps Studio open. My machine started swapping. Cooling fans did things I did not know they could do. If you do not have a dedicated GPU with serious VRAM, you are running small models or you are waiting.

Tool calling is rough. Hosted models have spent a year getting better at structured tool use. Local models I tested were inconsistent at best. For an automation developer, this matters. The whole point of an LLM in a flow is reliable structured output. When the local model returns malformed JSON for the fourth time in a row, you go back to the hosted endpoint and you do not feel bad about it. The work being done on the Dataverse plugin for coding agents is a good example of how the hosted side is pulling ahead on exactly this problem.

No place in production. I want to be direct about this. I would not put a locally hosted model behind a production Power Automate flow at my desk. It is not redundant, it is not monitored, it is not anyone’s responsibility but mine. Production is a hosted endpoint with an SLA or it is self-hosted properly in infrastructure that someone else also watches. The OpenAI and Dell Codex on-prem partnership is the more honest version of what on-premise AI actually requires to work at that level.

Where I would (and would not) reach for it next time

I will keep running Ollama locally for prompt iteration, exploratory extraction tests, and any draft work that should not leave my machine. That is the lane. It is a real lane and it saves me real hours.

I will not reach for it when I need consistent tool calling, when the task needs the reasoning quality of a frontier model, or when anyone other than me depends on the output. For those, I go back to hosted. I share my thinking on this kind of tradeoff regularly on LinkedIn, and the pattern is always the same: prototype locally, ship hosted.

Should you start learning local AI right now? Yes, if you prototype a lot or handle drafts you would rather not send to a cloud endpoint. The intuition you build about model size, quantization, and context limits pays off everywhere else. Just do not let anyone sell you that your laptop is going to replace a hosted model in production. It is not. Not yet.

Frequently Asked Questions

How do I get started running Ollama locally for automation prototyping?

You can install Ollama on a machine with sufficient RAM (32GB is a comfortable starting point) and pull models suited to your tasks, such as a smaller 8B model for fast iteration or a larger 14B model for more complex reasoning. From there you can test prompts and extraction logic against local documents without touching any hosted endpoints or incurring API costs.

What are the main benefits of using a local AI model instead of a hosted one?

Local models let you iterate quickly without worrying about rate limits, token costs, or network latency. They also allow you to work with sensitive or uncleared documents that you would not want to send to an external service during early experimentation.

When should I use a local model versus a hosted AI endpoint for automation work?

Local models are well suited to the prototyping and prompt-tuning phase, where speed and privacy matter more than peak output quality. For production automation workflows, hosted models typically offer better reasoning quality and reliability, making them the stronger choice once you move past early experimentation.

Why does model size matter when choosing an Ollama model for automation tasks?

Smaller models respond faster and use less memory, which suits quick iterative work, but they reason less reliably on complex tasks compared to larger models. Understanding this tradeoff locally builds practical intuition that directly applies when selecting and configuring models in production agentic workflows.

June 28, 2026

Category: Artificial Intelligence in Business

The longer answer on why AI agents fail in production

How to fix it

Related gotchas

Frequently Asked Questions

Why do AI agents fail in production after passing all their tests?

How do I stop my AI agent from choosing the wrong tool for a user request?

What is silent failure in an AI agent and why does it matter?

When should I add decision logging to an AI agent?

What you see from the maker portal

What the orchestrator is actually doing

Where the mechanism breaks down

How to build once you know this

Frequently Asked Questions

How does Copilot Studio agent tool selection actually work at runtime?

Why does my Copilot Studio agent keep picking the wrong tool?

How do I write better tool descriptions for a Copilot Studio agent?

When should I split one Copilot Studio agent into multiple agents instead of adding more tools?

What Anthropic actually shipped

Why this framework matters beyond Fable

What I would do with it this week

What Anthropic actually shipped

Why the rollback and redeploy matters

What I would do with this news this week

What Claude Science actually does

Why this release matters

What I would do with it this week

What shipped in Mailbox 1.16

What it actually does

Why it matters

What I would do with it this week

Content ceiling and file limits

Governance and ownership

Latency and answer quality

Cost and licensing

Choose SharePoint if, choose Dataverse if

Frequently Asked Questions

What are the best copilot studio knowledge sources for enterprise agents?

When should I use Dataverse instead of SharePoint as a knowledge source in Copilot Studio?

Why does my Copilot Studio agent not pick up newly uploaded SharePoint files immediately?

How do I improve answer quality in a Copilot Studio agent?

What the OpenAI EU AI workforce report actually does

Why it matters

What I would do with it this week

The problem

The fix

What I used it for and the setup

What it does well

Where it falls short

Where I would (and would not) reach for it next time

Frequently Asked Questions

How do I get started running Ollama locally for automation prototyping?

What are the main benefits of using a local AI model instead of a hosted one?

When should I use a local model versus a hosted AI endpoint for automation work?

Why does model size matter when choosing an Ollama model for automation tasks?