Latency Is the Quiet Killer of Agentic Workflows and Almost Nobody Talks About It

Figure: Agentic workflow latency across multiple model calls in a Copilot Studio and Power Automate loop

Everyone obsesses over model quality, tool design, and prompt structure when building agents. The thing that actually kills adoption in production is something else entirely. Agentic workflow latency is the quiet killer, and most Power Platform and Copilot Studio builders do not think about it until users start abandoning the tool.

I came across a post from OpenAI about using WebSockets and connection-scoped caching in the Responses API to speed up their Codex loop. It confirmed something I keep running into building multi-step agents internally. The math is brutal once you do it honestly.

Why Agent Loops Feel Slow Even When Each Call Is Fast

A single model call at 800ms feels fine. A tool call at 300ms feels fine. A Dataverse lookup at 500ms feels fine. Everyone looks at these numbers in isolation and says the platform is fast enough.

Then you build an actual agent. It reasons, calls a tool, reads the result, reasons again, calls another tool, checks a condition, calls a third tool, summarises, responds. That is 8 to 15 round trips for one user request. Each round trip carries connection setup, authentication overhead, token streaming setup, and the model’s own time to first token.

A 400ms overhead per call sounds small. Multiply by 12 calls and you get 4.8 seconds of pure overhead before any actual thinking or work happens. Add the time the calls themselves take and the total climbs well past that. Users do not wait 15 seconds, even for a confident answer. They ask once, get nothing for a few seconds, and switch back to the old way of doing it.
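To make that arithmetic concrete, here is a minimal Python sketch. The function name is made up for illustration, the figures are the ones used above, and it assumes every round trip pays the same fixed overhead and that calls run strictly one after another.

```python
def loop_overhead_ms(round_trips: int, overhead_per_call_ms: float) -> float:
    """Total fixed overhead for a strictly sequential agent loop.

    Assumes every round trip pays the same overhead, which is a simplification.
    """
    return round_trips * overhead_per_call_ms


# The 8-to-15 round trip range, with 12 as the typical case
for trips in (8, 12, 15):
    seconds = loop_overhead_ms(trips, 400) / 1000
    print(f"{trips} round trips -> {seconds:.1f} s of overhead")
```

None of this counts the model or tool time itself. It is only the tax paid for making the calls at all.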

I have watched this kill internal tools that were technically correct. The agent did the right thing. Nobody used it.

What OpenAI Just Shipped and Why It Matters Beyond Codex

The short version of what they did: move from repeated HTTP requests to a persistent WebSocket connection, and keep cache state scoped to that connection so repeat context does not need to be re-processed on every turn.

This is not a Codex-only trick. It is a general pattern. Connection-scoped caching means the expensive part of a call, the part that handles your system prompt and tool definitions and prior context, does not get redone from scratch every time your agent takes another step.
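A toy model shows why this matters. The sketch below is not the Responses API and makes no claim about how OpenAI implements it. `ToyConnection` is a hypothetical class, token counts stand in for processing cost, and it assumes the context only ever grows by appending to what was sent before.

```python
from dataclasses import dataclass


@dataclass
class ToyConnection:
    """Hypothetical persistent connection that remembers processed context."""

    cached_tokens: int = 0  # context already processed on this connection

    def send_turn(self, total_context_tokens: int) -> int:
        """Return how many tokens still need processing for this turn."""
        to_process = total_context_tokens - self.cached_tokens
        self.cached_tokens = total_context_tokens
        return to_process


def tokens_processed_stateless(context_sizes: list) -> int:
    # Every request starts cold and reprocesses the whole context.
    return sum(context_sizes)


def tokens_processed_cached(context_sizes: list) -> int:
    # One connection for the whole loop, so only new tokens are processed.
    conn = ToyConnection()
    return sum(conn.send_turn(size) for size in context_sizes)


# Assumed workload: a 3,000-token prefix that grows by 500 tokens per step.
sizes = [3000 + 500 * step for step in range(5)]
print("stateless:", tokens_processed_stateless(sizes))
print("cached:   ", tokens_processed_cached(sizes))
```

Under these assumptions the stateless loop processes 20,000 tokens across five steps and the cached loop processes 5,000. The exact ratio depends entirely on the workload, but the gap widens with every extra step.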

For anyone building agents that loop, this is the shape of the next year of infrastructure work. The platforms that expose this properly will feel instant. The ones that do not will feel like they are thinking through molasses.

What This Looks Like Inside Copilot Studio and Power Automate

Here is where it gets uncomfortable. In Copilot Studio, you do not see the round trips. You see a topic, a few actions, a generative answer node. The platform hides every call behind its own orchestration.

That hiding is the problem. A Copilot Studio agent doing generative orchestration with three tool calls backed by Power Automate flows is making far more round trips than most builders realise. Each tool call is a Power Automate HTTP trigger plus whatever that flow does internally, often including another connector call to SharePoint, Dataverse, or an external API. The agent then reads the response and decides what to do next, which is another model call. And if you are hitting Power Automate throttling limits under real load, every one of those round trips gets longer.
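A rough way to count what the platform is hiding, sketched in Python. The counting model is my assumption, not something Copilot Studio documents: one initial model call to plan, then for each tool call one flow trigger, the flow's own connector calls, and one more model call to read the result.

```python
def estimate_round_trips(tool_calls: int, connector_calls_per_flow: int = 1) -> int:
    """Rough count of network round trips for one user request.

    Assumed model: one planning call, then per tool call a flow trigger,
    the flow's connector calls, and a model call to decide the next step.
    """
    initial_model_call = 1
    per_tool = 1 + connector_calls_per_flow + 1  # trigger + connectors + model
    return initial_model_call + tool_calls * per_tool


print(estimate_round_trips(tool_calls=3))
print(estimate_round_trips(tool_calls=3, connector_calls_per_flow=2))
```

With three tool calls and one connector call per flow, this model gives 10 round trips. Two connector calls per flow pushes it to 13.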

I built one recently that felt snappy in testing with one user. In production with ten concurrent sessions, response times doubled. Nothing in the flow was slow on its own. The sum was slow, and throttling on shared connectors made it worse. This is the same class of problem I wrote about in "Most Agentic Workflows Are Just Fancy If/Then Logic in a Trench Coat". The difference between a real agent loop and a glorified flow shows up in latency first.

How I Would Budget Latency Before I Build the Agent

I treat latency as a first-class design constraint now, not something I measure after the fact. Before I build, I do this:

  • Estimate the number of model calls per user request. Not best case. Typical case.
  • Estimate the number of tool calls and what each one hits. A SharePoint list call in the same tenant does not cost the same as a Graph API call with an auth handshake.
  • Set a budget. I aim for under 4 seconds total for anything conversational, under 10 seconds for anything that is clearly doing work.
  • Cut calls aggressively. Can two tools be one? Can I pre-fetch context in a single call instead of three? Can the agent skip a reasoning step when the intent is obvious?
  • Parallelise where I can. Power Automate lets you run actions in parallel branches. Most builders do not use them.
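
The checklist above can be turned into a quick estimate before anything is built. This is a sketch under stated assumptions: the function names are mine, the per-call latencies are the illustrative 800ms, 300ms, and 500ms figures from earlier, and a plan is modelled as sequential stages where the calls inside one stage run in parallel.

```python
CONVERSATIONAL_BUDGET_MS = 4000   # target for conversational requests
WORK_BUDGET_MS = 10000            # target for requests that are clearly doing work


def estimate_latency_ms(stages: list) -> int:
    """Stages run one after another; calls inside a stage run in parallel.

    A stage therefore costs as much as its slowest call.
    """
    return sum(max(stage) for stage in stages)


def within_budget(stages: list, budget_ms: int) -> bool:
    return estimate_latency_ms(stages) <= budget_ms


# Model call, tool, model call, tool, ... all strictly sequential.
sequential_plan = [[800], [300], [800], [500], [800], [300], [800]]

# One planning call, three independent tools in parallel, one final call.
parallel_plan = [[800], [300, 500, 300], [800]]

for name, plan in (("sequential", sequential_plan), ("parallel", parallel_plan)):
    total = estimate_latency_ms(plan)
    verdict = "ok" if within_budget(plan, CONVERSATIONAL_BUDGET_MS) else "over budget"
    print(f"{name}: {total} ms ({verdict})")
```

The parallel plan wins twice here. The three tools cost only as much as the slowest one, and because they are requested together the loop also drops two intermediate model calls. That only works when the tools do not depend on each other's output.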

The other thing I stopped doing: chaining LLM calls for steps that do not need reasoning. If a step is deterministic, I call the tool directly, not through the model. Every model call I can remove from the loop gives me back 500 to 1500ms.
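
In a loop you control from code, that rule can be as simple as a lookup table in front of the model. Everything named below is hypothetical: `call_model` is a placeholder for a real model call, and `lookup_order_status` stands in for any deterministic tool.

```python
from typing import Callable, Dict


def call_model(prompt: str) -> str:
    """Placeholder for a real model call (hypothetical)."""
    return f"model-answer-for:{prompt}"


def lookup_order_status(order_id: str) -> str:
    """Placeholder deterministic tool (hypothetical)."""
    return f"status-for-{order_id}"


# Steps that need no reasoning are mapped straight to their tool.
DETERMINISTIC_STEPS: Dict[str, Callable[[str], str]] = {
    "order_status": lookup_order_status,
}


def run_step(step_name: str, payload: str, stats: Dict[str, int]) -> str:
    handler = DETERMINISTIC_STEPS.get(step_name)
    if handler is not None:
        stats["direct_calls"] += 1   # no model round trip spent
        return handler(payload)
    stats["model_calls"] += 1        # only reasoning steps reach the model
    return call_model(f"{step_name}: {payload}")


stats = {"direct_calls": 0, "model_calls": 0}
run_step("order_status", "A-1001", stats)
run_step("summarise", "three open tickets", stats)
print(stats)
```

The counters are there so the saving is visible. Each entry in `direct_calls` is a model round trip that never happened.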

Latency is also where the question of who owns the decision in an agentic workflow becomes a performance problem, not just a governance one. Every checkpoint that routes back to a human approver adds another wait state to the loop. The more of those you have, the more your total response time is dominated by human latency, not model latency.

I have written more about my approach to this kind of trade-off on my LinkedIn, because I keep having the same conversation with people at other organisations who hit the wall when their demo hits real users.

The agents that win in production are not the smartest ones. They are the ones that answer before the user gives up.

Frequently Asked Questions

Why does agentic workflow latency get so bad in multi-step agents?

Each individual call in an agent loop may seem fast, but the overhead adds up across 8 to 15 round trips per user request. Connection setup, authentication, and token streaming costs stack on every single step, turning individually acceptable delays into a frustrating overall wait time.

What is connection-scoped caching and how does it help agent performance?

Connection-scoped caching keeps expensive context like system prompts, tool definitions, and prior conversation state ready across multiple calls instead of reprocessing it each time. This avoids redundant work on every step of an agent loop and significantly reduces the overhead that accumulates across a multi-turn interaction.

How do I reduce latency in Copilot Studio and Power Automate agents?

Start by auditing how many round trips your agent actually makes for a single user request, since this is where most hidden latency lives. Look for opportunities to batch tool calls, reduce unnecessary steps in your loop, and watch for platform-level improvements like persistent connections that reduce per-call overhead.
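
If the loop runs in your own code, the audit can start as a small timing wrapper. This is a minimal sketch: `RoundTripAudit` is a hypothetical helper, and `time.sleep` stands in for real network calls. It does not apply to calls that Copilot Studio makes internally, since the platform hides those.

```python
import time
from contextlib import contextmanager


class RoundTripAudit:
    """Records a label and elapsed time for every wrapped call."""

    def __init__(self) -> None:
        self.records = []  # list of (label, elapsed_ms) tuples

    @contextmanager
    def track(self, label: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.records.append((label, elapsed_ms))

    def count(self) -> int:
        return len(self.records)

    def total_ms(self) -> float:
        return sum(ms for _, ms in self.records)


audit = RoundTripAudit()
for label in ("model", "tool", "model"):
    with audit.track(label):
        time.sleep(0.01)  # stand-in for a real network call

print(f"{audit.count()} round trips, {audit.total_ms():.0f} ms total")
```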

Why do users abandon AI agents even when the agent gives correct answers?

If the response takes too long, users lose confidence and revert to familiar alternatives before the agent finishes. Technical correctness does not matter if the experience feels slow enough to suggest something has gone wrong.

This post was inspired by Speeding up agentic workflows with WebSockets in the Responses API via OpenAI.
