April 30, 2026

Flat tool lists don't scale — and agents are about to find out

Your agent has 12 tools and it's fine. So you've never thought of the tool list as a design problem. It's just a list. Lists are free. You need a new capability, you append it.

That instinct is about to cost you. Not because the list gets expensive — a separate post covers the token bill — but because a flat list stops being a usable structure long before you notice it has. If you run Claude Code or Cursor against a growing pile of MCP servers all day, this is the wall you're walking toward.

A flat list has no idea what's legal

Start with the deepest problem, the one that has nothing to do with size. A flat tool list is stateless. Every tool in it is always "available." Always callable. Right now.

But your actual system has order. You can't issue the refund before the charge settles. You don't deploy before the tests pass. The PR gets created after the branch exists, not before. A list can't express any of that. It has no slot for "this comes after that" or "this isn't legal yet."

So that knowledge has to live somewhere else — and "somewhere else" is your system prompt, or the model's guesses. At 12 tools you paper over it: a few lines of prompt instructions, "always check X before Y," and it mostly holds. At 200 tools, across a dozen subsystems, there is no prompt long enough. Every "do X only after Y" rule is really a tiny state machine — the legal order your system already enforces in code. With a flat list none of that lives anywhere the model can check, so it's left improvising your state machine on every turn, and you've given it no way to know when it's wrong.
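To make "tiny state machine" concrete, here is a minimal Python sketch of the refund rule from above. The state names, the TRANSITIONS table, and both helpers are hypothetical and exist only for illustration. The point is that legality becomes a lookup on the current state, which is exactly the slot a flat list doesn't have.

```python
# Hypothetical sketch: "refund only after the charge settles" written as
# an explicit transition table. All names here are illustrative.

# state -> {action: next_state}
TRANSITIONS = {
    "charge_pending": {"settle_charge": "charge_settled"},
    "charge_settled": {"issue_refund": "refunded"},
    "refunded": {},
}


def legal_actions(state: str) -> list[str]:
    """Return the actions that are legal in the given state."""
    return sorted(TRANSITIONS[state])


def step(state: str, action: str) -> str:
    """Advance the machine, or refuse an action that is not legal yet."""
    if action not in TRANSITIONS[state]:
        raise ValueError(f"{action!r} is not legal in state {state!r}")
    return TRANSITIONS[state][action]
```

With a table like this, an early issue_refund fails loudly instead of depending on whether the model remembered a line in the prompt.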

A flat list has no notion of relevance

The second problem is navigation. The model needs one tool out of 200. The list offers exactly one way to find it: consider all 200.

There's no "search the tools." There's only "scan the tools." And the scan gets harder in exactly the way lists naturally grow: you've wired in three issue trackers, so now there are three create_issue variants with near-identical descriptions, and the model has to disambiguate lookalikes on every call.

A list doesn't rank. It doesn't cluster. It doesn't surface what matters now. It just grows, and every entry you add makes every other entry slightly harder to find.

Why this becomes everyone's problem

You might be thinking 200 tools is someone else's stress test. Look at the direction of travel.

Every serious SaaS vendor is shipping an MCP server. An enterprise agent doesn't connect to one of them — it connects to the issue tracker, the cloud console, the data warehouse, the CRM, the deploy system, the docs, the internal tools team's six servers. Each one brings ten to fifty tools. You don't decide to have 200 tools. You wire in the ninth useful integration and you're there. 200 tools isn't a stress test; it's a Tuesday in any reasonably ambitious deployment.

The fix is structure, not a bigger list

flowgate replaces the flat list with two things a list never had: search and state.

Search. Instead of scanning, the model passes a query to flowgate.query, which scores capabilities against their title, description, and tags and returns the hits. The 200 capabilities still exist; they're discovered on demand instead of crowding the model's context. (It's lexical scoring, so good titles and tags matter; more on that limit below.)
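As a rough picture of what lexical scoring over title, description, and tags can look like, here is a small Python sketch. It is not flowgate's scorer: the Capability shape, the tokenizer, and the 3/2/1 field weights are all assumptions chosen for illustration.

```python
# Hypothetical sketch of lexical scoring over capability metadata.
# The field weights and the tokenizer are assumptions, not flowgate's.
import re
from dataclasses import dataclass, field


@dataclass
class Capability:
    name: str
    title: str
    description: str
    tags: list[str] = field(default_factory=list)


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumerics."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, cap: Capability) -> int:
    """Count query-term matches, weighting title and tags above description."""
    q = tokens(query)
    return (
        3 * len(q & tokens(cap.title))
        + 2 * len(q & tokens(" ".join(cap.tags)))
        + 1 * len(q & tokens(cap.description))
    )


def search(query: str, caps: list[Capability], limit: int = 5) -> list[Capability]:
    """Return the best-scoring capabilities, dropping zero-score entries."""
    scored = [(score(query, c), c) for c in caps]
    hits = [(s, c) for s, c in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in hits[:limit]]
```

A scorer like this only matches words that literally appear in the metadata, which is why vague titles and missing tags hurt.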

State. In flowgate every capability is a transition in a state machine. A flat set of tools is just the trivial case — it compiles to a single workflow called proxy_default, one state, every tool looping back to it. The moment you want ordering, you wrap the relevant tools in a real workflow with real states. Now "what's legal now" isn't a prompt instruction the model might forget — it's computed by the engine and returned as links on every response.
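Here is one way to picture the "flat list is the trivial case" idea in Python. Only the name proxy_default comes from the description above; the dict layout, the ready state, the ship_change workflow, and both helpers are hypothetical.

```python
# Hypothetical sketch: a flat tool list as a one-state workflow, next to
# an ordered workflow. Only the name proxy_default comes from the post.


def compile_flat(tools: list[str]) -> dict:
    """Every tool becomes a transition that loops back to the single state."""
    return {
        "name": "proxy_default",
        "states": {"ready": {tool: "ready" for tool in tools}},
    }


def links(workflow: dict, state: str) -> list[str]:
    """What is legal now: the transitions leaving the current state."""
    return sorted(workflow["states"][state])


ORDERED = {
    "name": "ship_change",  # illustrative workflow name
    "states": {
        "branch_ready": {"run_tests": "tests_passed"},
        "tests_passed": {"create_pr": "pr_open"},
        "pr_open": {},
    },
}
```

In the one-state version, links always returns every tool. In the ordered version, create_pr is simply absent from the links until the workflow reaches tests_passed.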

And you don't rewrite what you already have. proxy.import points at an existing MCP server, lists its tools, and folds them into the registry — discoverable through search, with no hand-copied definitions, and each imported tool logged with a capability.discovered audit event so you can see exactly what joined. The tools are there. They're just not in the model's face.
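A sketch of what an import step could do, in Python. The event name capability.discovered is from the description above and the executor shape mirrors the config style used in this post; the registry layout, the connection-prefixed capability names, and the import_tools helper are assumptions. The tools argument stands in for whatever the upstream server returns when asked to list its tools.

```python
# Hypothetical sketch of importing tools from an upstream MCP server.
# Only the event name capability.discovered comes from the post; the
# registry shape and the import_tools helper are assumptions.


def import_tools(registry: dict, audit_log: list, connection: str, tools: list[dict]) -> None:
    """Fold each upstream tool into the registry and record an audit event."""
    for tool in tools:
        cap_name = f"{connection}.{tool['name']}"
        registry[cap_name] = {
            "title": tool["name"],
            "description": tool.get("description", ""),
            "executor": {"kind": "mcp", "connection": connection, "tool": tool["name"]},
        }
        audit_log.append({"event": "capability.discovered", "capability": cap_name})
```

Nothing is copied by hand here: the definitions come from the upstream listing, and the audit log records one entry per tool that joined.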

Three concepts a flat list collapses into one

Take a single action you've already seen — create_pr. In a flat list it's one entry, and that one entry is silently doing three jobs at once. It's a thing that exists (the gateway knows how to open a PR). It's a thing that's offered to the agent (it shows up in the tool list, under that name). And it's a thing that's callable right now (nothing stops the model from firing it before the tests run). A flat list fuses those three into one undifferentiated entry. flowgate keeps them separate — and the separation is exactly where structure becomes possible:

A capability is what the gateway can do — "open a pull request," defined once, reusable.
An exposure is what it publishes to callers — whether create_pr is discoverable at all, and under what name the agent sees it.
A workflow is when that capability may run — the states, the order ("only after tests pass"), the guards.

Pull them apart and each can carry its own structure. You can define create_pr once and expose it under three names. You can have a capability that exists but is only reachable from inside a particular workflow, never called directly. A flat list can't make any of those distinctions — it has one axis, and one axis doesn't scale.
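The separation can be sketched as three small record types in Python. The field names, the extra published names, and the merge_pr capability are illustrative inventions, not flowgate's schema.

```python
# Hypothetical data model for the three concepts. Field names are
# illustrative and are not taken from flowgate's schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str  # what the gateway can do, defined once


@dataclass(frozen=True)
class Exposure:
    published_name: str  # the name the agent sees
    capability: str  # which capability it points at


@dataclass(frozen=True)
class WorkflowStep:
    workflow: str
    state: str  # the state in which the capability may run
    capability: str


CAPABILITIES = [Capability("create_pr"), Capability("merge_pr")]

# One capability, three published names.
EXPOSURES = [
    Exposure("create_pr", "create_pr"),
    Exposure("open_pull_request", "create_pr"),
    Exposure("new_pr", "create_pr"),
]

# merge_pr has no exposure: it is reachable only from inside a workflow.
STEPS = [WorkflowStep("ship_change", "pr_approved", "merge_pr")]


def directly_callable(capability: str) -> bool:
    """True if at least one exposure publishes this capability."""
    return any(e.capability == capability for e in EXPOSURES)
```

Existence, publication, and timing each live in their own table, so changing one doesn't mean cloning an entry in the others.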

One action, many policies — without duplicate entries

Watch how a flat list grows in practice. You have create_pr. Then one team needs a version that requires tests first, so you add create_pr_safe. Then an admin path with fewer checks — create_pr_admin. Three list entries, one real action, differing only in policy. The list grew by cloning.

The three concepts handle this with wrappers: one base capability, with thin layers that add policy on top.

gateway.yaml

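capabilities:
  # Base capability: the one real implementation.
  raw.create_pr:
    executor: { kind: mcp, connection: github, tool: create_pull_request }

  # Same action, plus an evidence guard: tests must have passed.
  safe.create_pr:
    wraps: raw.create_pr
    guards: [{ kind: evidence, requires: [tests_passed] }]

  # Same action again, plus a permission guard on top of safe.create_pr.
  audited.create_pr:
    wraps: safe.create_pr
    guards: [{ kind: permission, permission: github.write }]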

raw.create_pr is the one real implementation. safe.create_pr inherits it and adds a guard; audited.create_pr adds another. Different teams get exposed different wrapper levels of the same underlying action — and when the action itself changes, you change it in one place. The list didn't grow by three lookalikes. The capability graph grew by intent, and every layer says exactly what policy it adds.
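To see why the layers add up, here is a Python sketch that walks the wraps chain from the gateway.yaml above and collects every guard on the way down. The dict mirrors that config; resolve is an assumption about how layered guards could be flattened, not flowgate's implementation.

```python
# Hypothetical resolver for the wraps chain in the gateway.yaml above.
# The dict mirrors that config; resolve() is an assumption about how
# layered guards could be flattened, not flowgate's implementation.

CAPABILITIES = {
    "raw.create_pr": {
        "executor": {"kind": "mcp", "connection": "github", "tool": "create_pull_request"},
    },
    "safe.create_pr": {
        "wraps": "raw.create_pr",
        "guards": [{"kind": "evidence", "requires": ["tests_passed"]}],
    },
    "audited.create_pr": {
        "wraps": "safe.create_pr",
        "guards": [{"kind": "permission", "permission": "github.write"}],
    },
}


def resolve(name: str) -> tuple[dict, list[dict]]:
    """Walk the wraps chain; return the base executor and every guard met."""
    guards: list[dict] = []
    seen: set[str] = set()
    while True:
        if name in seen:
            raise ValueError(f"wrap cycle at {name!r}")
        seen.add(name)
        cap = CAPABILITIES[name]
        # Outermost guards are collected first, inner ones after.
        guards.extend(cap.get("guards", []))
        if "wraps" not in cap:
            return cap["executor"], guards
        name = cap["wraps"]
```

Resolving audited.create_pr yields the single executor from raw.create_pr plus both guards, so changing the underlying tool means editing one entry.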

When you outgrow one gateway: they stack

Almost everything above lives in a single gateway running right next to your agent — the same one you'd run inside Claude Code or Cursor. That's where most of this work happens, and for a lot of readers it's the whole story. You can stop here.

But there's one more structural move worth knowing exists, for the day a single config gets unwieldy. A gateway exposes two MCP tools (flowgate.query and flowgate.command) — and a gateway is itself an MCP server. Which means a gateway can sit in front of another gateway. The model at the top still sees exactly two tools, whether there are 20 capabilities underneath or 2,000, because every layer exposes the same two-tool surface and passes the rest through.
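The stacking idea fits in a few lines of Python. The Gateway class, its method names, and the substring match in query are hypothetical; only the two tool names and the "same surface at every layer" idea come from the paragraph above.

```python
# Hypothetical sketch of stacked gateways. Each layer exposes the same
# two operations and forwards what it does not own. The class and method
# names are illustrative; only the two-tool idea comes from the post.
from typing import Callable, Optional


class Gateway:
    def __init__(self, local: dict[str, Callable[[dict], object]],
                 upstream: Optional["Gateway"] = None) -> None:
        self.local = local
        self.upstream = upstream

    def tool_surface(self) -> list[str]:
        """The model-facing tools: always two, whatever sits underneath."""
        return ["flowgate.query", "flowgate.command"]

    def query(self, text: str) -> list[str]:
        """Find matching capability names here and in every layer below."""
        hits = [name for name in self.local if text in name]
        if self.upstream is not None:
            hits += self.upstream.query(text)
        return hits

    def command(self, name: str, args: dict) -> object:
        """Run a capability locally, or pass the call down the stack."""
        if name in self.local:
            return self.local[name](args)
        if self.upstream is None:
            raise KeyError(name)
        return self.upstream.command(name, args)
```

However many layers are chained through upstream, tool_surface stays the same two entries, and each layer only has to know about its own capabilities.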

That's what lets a config stay small as the system grows: each layer owns one concern and inherits everything below it, so you never grow one file forever. Even within a single gateway the same instinct applies — an include: list composes a config from responsibility-shaped slices, one file per workflow, one per bounded context. The config scales the way the system does, by splitting along concerns, not by piling everything into one list. (The stacking pattern and how to author multi-tier setups are in the architecture guide linked below; you don't need them to get value from a single gateway.)

That's the difference between a list and a system. A list scales by getting longer. A system scales by getting layered.

The honest limits

This is structure, and structure has costs. Search is lexical, not semantic: it leans on you writing clear titles and tags, and vague metadata gives vague results. A workflow takes more effort to author than a one-line list entry. The two-tool surface adds a search-then-start hop before the real work.

And a flat list is genuinely the right answer when you have a handful of tools and no ordering to enforce. Don't build stacked gateways and workflows for eight tools — that's structure for its own sake, and the docs say as much: if one MCP server with no guardrail needs is all you have, point your host straight at it.

The argument isn't "always add structure." It's that a flat list is a default, not a plan — and somewhere on the way to 200 tools you need a plan.

The failure is quiet

A flat tool list works right up until it doesn't, and the breakdown never announces itself. There's no error, no alert. Just an agent that gets gradually worse at picking the right tool and slower to get there, and you can't point at the day it changed.

The teams that handle the scaling curve well are the ones who treated the tool list as a design surface before it forced them to. The tool count where that decision is cheap is the one you're at right now.

The full composition model — capabilities, exposures, workflows, and the stacked gateway pattern — is in the MCP control architecture guide.