Flat tool lists don't scale — and agents are about to find out

Your agent has 12 tools and it's fine. So you've never thought of the tool list as a design problem. It's just a list. Lists are free. You need a new capability, you append it.

That instinct is about to cost you. Not because the list gets expensive — a separate post covers the token bill — but because a flat list stops being a usable structure long before you notice it has.

A flat list has no idea what's legal

Start with the deepest problem, the one that has nothing to do with size. A flat tool list is stateless. Every tool in it is always "available." Always callable. Right now.

But your actual system has order. You can't issue the refund before the charge settles. You don't deploy before the tests pass. The PR gets created after the branch exists, not before. A list can't express any of that. It has no slot for "this comes after that" or "this isn't legal yet."

So that knowledge has to live somewhere else — and "somewhere else" is your system prompt, or the model's guesses. At 12 tools you paper over it: a few lines of prompt instructions, "always check X before Y," and it mostly holds. At 200 tools, across a dozen subsystems, there is no prompt long enough. The model is improvising your state machine, and you've given it no way to know when it's wrong.

A flat list has no notion of relevance

The second problem is navigation. The model needs one tool out of 200. The list offers exactly one way to find it: consider all 200.

There's no "search the tools." There's only "scan the tools." And the scan gets harder through perfectly ordinary growth: you've wired in three issue trackers, so now there are three create_issue variants with near-identical descriptions, and the model has to disambiguate lookalikes on every call.

A list doesn't rank. It doesn't cluster. It doesn't surface what matters now. It just grows, and every entry you add makes every other entry slightly harder to find.

Why this becomes everyone's problem

You might be thinking 200 tools is someone else's stress test. Look at the direction of travel.

Every serious SaaS vendor is shipping an MCP server. An enterprise agent doesn't connect to one of them — it connects to the issue tracker, the cloud console, the data warehouse, the CRM, the deploy system, the docs, the internal tools team's six servers. Each one brings ten to fifty tools. You don't decide to have 200 tools. You wire in the ninth useful integration and you're there. 200 tools isn't a stress test; it's a Tuesday in any reasonably ambitious deployment.

The fix is structure, not a bigger list

mcp-flowgate replaces the flat list with two things a list never had: search and state.

Search. Instead of the model scanning, gateway.search scores capabilities against their title, description, and tags and returns the hits. The 200 capabilities still exist — they're just discovered on demand instead of crowding the model's context. (It's lexical scoring, so good titles and tags matter; more on that limit below.)
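To make "lexical scoring" concrete, here is a minimal sketch of the idea in Python. It is not mcp-flowgate's implementation: the Capability class, the tokenizer, and the 3/2/1 field weights are all assumptions made for illustration. The only thing taken from the description above is that hits are scored against title, description, and tags.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    # Hypothetical record; the field names mirror the prose, not a real schema.
    name: str
    title: str
    description: str
    tags: list[str] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumerics (a deliberately crude tokenizer)."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def score(query: str, cap: Capability) -> int:
    """Count query-term overlaps per field; the 3/2/1 weights are assumed."""
    terms = tokenize(query)
    title_hits = len(terms & tokenize(cap.title))
    tag_hits = len(terms & tokenize(" ".join(cap.tags)))
    desc_hits = len(terms & tokenize(cap.description))
    return 3 * title_hits + 2 * tag_hits + desc_hits


def search(query: str, registry: list[Capability], limit: int = 5) -> list[Capability]:
    """Return the best-scoring capabilities, dropping anything with no overlap."""
    scored = [(score(query, cap), cap) for cap in registry]
    hits = [(s, cap) for s, cap in scored if s > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [cap for _, cap in hits[:limit]]


registry = [
    Capability("jira.create_issue", "Create Jira issue", "Open a ticket in Jira", ["jira", "issue"]),
    Capability("linear.create_issue", "Create Linear issue", "Open a ticket in Linear", ["linear", "issue"]),
    Capability("github.create_pr", "Create pull request", "Open a PR on GitHub", ["github", "pr"]),
]

results = search("create linear issue", registry)
print([cap.name for cap in results])
```

The lookalike problem shows up even here: the Jira entry still scores on "create" and "issue", and only the word "linear" in the title, tags, and description pushes the right capability to the top. A query with no overlapping terms returns nothing at all, which is the practical meaning of "good titles and tags matter."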

State. In mcp-flowgate every capability is a transition in a state machine. A flat set of tools is just the trivial case — it compiles to a single workflow called proxy_default, one state, every tool looping back to it. The moment you want ordering, you wrap the relevant tools in a real workflow with real states. Now "what's legal now" isn't a prompt instruction the model might forget — it's computed by the engine and returned as links on every response.
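A rough Python sketch of that idea follows. The Workflow and Transition classes, the state names, and the shape of the response dict are hypothetical and are not the engine's real types. What the sketch tries to show is the two claims above: a flat tool set is the one-state special case, and "what's legal now" is computed from the current state and handed back as links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    capability: str
    source: str
    target: str


class Workflow:
    def __init__(self, name, initial, transitions):
        self.name = name
        self.state = initial
        self.transitions = list(transitions)

    def links(self):
        """Capabilities that are legal from the current state."""
        return [t.capability for t in self.transitions if t.source == self.state]

    def run(self, capability):
        """Apply a transition if it is legal; always report the current links."""
        for t in self.transitions:
            if t.source == self.state and t.capability == capability:
                self.state = t.target
                return {"ok": True, "state": self.state, "links": self.links()}
        return {"ok": False, "state": self.state, "links": self.links()}


def proxy_default(tools):
    """A flat list as the trivial workflow: one state, every tool loops back to it."""
    return Workflow("proxy_default", "ready", [Transition(t, "ready", "ready") for t in tools])


# Hypothetical ordered workflow: tests before the PR, the PR before the deploy.
release = Workflow("release", "branch_created", [
    Transition("run_tests", "branch_created", "tests_passed"),
    Transition("create_pr", "tests_passed", "pr_open"),
    Transition("deploy", "pr_open", "deployed"),
])

print(release.links())        # only run_tests is legal at the start
print(release.run("deploy"))  # rejected; the state does not change
```

The rejected call still comes back with links, so the caller learns what it could do instead of having to guess.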

And you don't rewrite what you already have. proxy.import points at an existing MCP server, lists its tools, and folds them into the registry — discoverable through search, with no hand-copied definitions, and each imported tool logged with a capability.discovered audit event so you can see exactly what joined. The tools are there. They're just not in the model's face.

Three concepts a flat list collapses into one

Here's the root of why a list breaks: it collapses three different questions into one undifferentiated thing. Every entry in a flat list is, all at once, "a thing that exists," "a thing that's offered to the caller," and "a thing that's callable right now." mcp-flowgate keeps those three separate — and the separation is exactly where structure becomes possible.

  • A capability is what the gateway can do — defined once, reusable.
  • An exposure is what it publishes to callers — which capabilities are discoverable, and under what name.
  • A workflow is when a capability may run — the states, the order, the guards.

Pull them apart and each can carry its own structure. You can define a capability once and expose it under three names. You can have a capability that exists but is only reachable from inside a particular workflow, never directly. A flat list can't make any of those distinctions — it has one axis, and one axis doesn't scale.
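As a sketch of what keeping the three apart could look like in code, here is a small Python model. Every name in it is made up for illustration (the Gateway class, the public names, the "approved" state), and it says nothing about how mcp-flowgate stores these internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Gateway:
    # What exists: capabilities defined once.
    capabilities: set[str] = field(default_factory=set)
    # What is offered: public name -> underlying capability.
    exposures: dict[str, str] = field(default_factory=dict)
    # When it may run: workflow state -> capabilities legal in that state.
    workflow_states: dict[str, set[str]] = field(default_factory=dict)

    def discoverable(self) -> list[str]:
        """Public names a caller can find."""
        return sorted(self.exposures)

    def resolve(self, public_name: str) -> Optional[str]:
        """Map a public name back to the capability behind it."""
        return self.exposures.get(public_name)

    def legal_in(self, state: str) -> set[str]:
        """Capabilities that may run in the given workflow state."""
        return self.workflow_states.get(state, set())


gw = Gateway(
    capabilities={"create_pr", "merge_pr"},
    # One capability, three public names.
    exposures={
        "github.create_pr": "create_pr",
        "team.open_pr": "create_pr",
        "ci.create_pr": "create_pr",
    },
    # merge_pr exists but is never exposed directly; it is only legal
    # from inside a workflow that has reached the "approved" state.
    workflow_states={"approved": {"merge_pr"}},
)

print(gw.discoverable())
print(gw.legal_in("approved"))
```

A flat list would need three entries for the first case and has no way to express the second.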

One action, many policies — without duplicate entries

Watch how a flat list grows in practice. You have create_pr. Then one team needs a version that requires tests first, so you add create_pr_safe. Then an admin path with fewer checks — create_pr_admin. Three list entries, one real action, differing only in policy. The list grew by cloning.

The three-way split handles this with wrappers: one base capability, with thin layers that add policy on top.

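gateway.yaml
capabilities:
  # The one real implementation: calls the GitHub MCP tool directly.
  raw.create_pr:
    executor: { kind: mcp, connection: github, tool: create_pull_request }

  # Same action, guarded by tests_passed evidence.
  safe.create_pr:
    wraps: raw.create_pr
    guards: [{ kind: evidence, requires: [tests_passed] }]

  # Adds a github.write permission check on top of safe.create_pr.
  audited.create_pr:
    wraps: safe.create_pr
    guards: [{ kind: permission, permission: github.write }]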

raw.create_pr is the one real implementation. safe.create_pr inherits it and adds a guard; audited.create_pr adds another. Different teams are exposed to different wrapper levels of the same underlying action, and when the action itself changes, you change it in one place. The list didn't grow by three lookalikes. The capability graph grew by intent, and every layer says exactly what policy it adds.
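One way to picture how a wrapper chain might be resolved: walk the wraps links down to the base executor and collect guards along the way. The Python below is an illustrative sketch, not the gateway's code. The resolve and allowed functions, the outermost-first guard order, and the way evidence and permissions are passed in are all assumptions; only the three capability definitions are copied from the config above.

```python
CAPABILITIES = {
    "raw.create_pr": {
        "executor": {"kind": "mcp", "connection": "github", "tool": "create_pull_request"},
    },
    "safe.create_pr": {
        "wraps": "raw.create_pr",
        "guards": [{"kind": "evidence", "requires": ["tests_passed"]}],
    },
    "audited.create_pr": {
        "wraps": "safe.create_pr",
        "guards": [{"kind": "permission", "permission": "github.write"}],
    },
}


def resolve(name, capabilities):
    """Follow the wraps chain; return (base executor, guards outermost-first)."""
    guards = []
    seen = set()
    current = name
    while True:
        if current in seen:
            raise ValueError(f"wrap cycle at {current}")
        seen.add(current)
        spec = capabilities[current]
        guards.extend(spec.get("guards", []))
        if "wraps" not in spec:
            return spec["executor"], guards
        current = spec["wraps"]


def allowed(name, capabilities, evidence, permissions):
    """True only if every guard collected along the chain is satisfied."""
    _, guards = resolve(name, capabilities)
    for guard in guards:
        if guard["kind"] == "evidence" and not set(guard["requires"]) <= evidence:
            return False
        if guard["kind"] == "permission" and guard["permission"] not in permissions:
            return False
    return True


executor, guards = resolve("audited.create_pr", CAPABILITIES)
print(executor["tool"], [g["kind"] for g in guards])
```

All three names resolve to the same executor, so a change to the underlying action touches one entry, while each wrapper contributes exactly the guard it declares.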

The pattern that actually scales: gateways stack

Here's the structural move that makes this hold up. A gateway exposes seven MCP tools — and a gateway is itself an MCP server. Which means a gateway can sit in front of another gateway.

That turns one big flat list into a layered system, each layer owning exactly one concern and passing the rest through:

  • Enterprise gateway — identity, global audit export, org-wide limits.
  • Team gateway — team roles, team audit, approval queues.
  • Project gateway — project workflows, resource locks, scoped capabilities.
  • Local-dev gateway — the actual MCP servers and CLIs.

Each layer's config stays small because it only does its own job: the layer beneath it already enforces its own rules and everything further down. Here's a team gateway that imports the project gateway beneath it and adds one wrapper plus team-wide audit:

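team-gateway.yaml
# Connect to the project gateway as an ordinary MCP server.
connections:
  project:
    kind: mcp
    url: http://project-gateway.svc:8000/mcp

# Import its tools under the "project" prefix.
proxy:
  import:
    - { connection: project, prefix: project }

# One team-level wrapper: guard PR creation with security_scanned evidence.
capabilities:
  team.create_pr:
    wraps: project.github.create_pr
    guards: [{ kind: evidence, requires: [security_scanned] }]

# Team-wide audit log written to a file.
audit:
  sink: file
  path: /var/log/team-gateway/audit.jsonl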

That's a handful of meaningful lines. Everything else — the capabilities, the project workflows, the local tools — came from the gateway below. And because every layer exposes the same seven tools, the layers above don't care how deep the stack runs. The model at the top still sees seven tools, whether there are 20 capabilities underneath or 2,000.
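The stacking claim can be sketched in a few lines of Python. This is a toy model under stated assumptions: the Gateway class is hypothetical, ci.run_tests is an invented capability, and the seven tools are represented only by their count, since the point is that the count stays fixed while the capability set grows through prefixed imports.

```python
class Gateway:
    SURFACE_SIZE = 7  # per the article: every gateway exposes seven MCP tools

    def __init__(self, name, own_capabilities, below=None, prefix=None):
        self.name = name
        self.own = list(own_capabilities)
        self.below = below    # the gateway this one imports from, if any
        self.prefix = prefix  # prefix applied to imported capability names

    def capabilities(self):
        """Own capabilities plus everything imported from the layer below."""
        names = list(self.own)
        if self.below is not None:
            names += [f"{self.prefix}.{n}" for n in self.below.capabilities()]
        return names

    def tools_visible_to_model(self):
        """The surface stays constant no matter how much sits underneath."""
        return self.SURFACE_SIZE


# ci.run_tests is a hypothetical second capability on the project gateway.
project = Gateway("project", ["github.create_pr", "ci.run_tests"])
team = Gateway("team", ["team.create_pr"], below=project, prefix="project")

print(team.capabilities())
print(team.tools_visible_to_model())
```

The prefixing is why the team config can wrap project.github.create_pr: the name is the project gateway's github.create_pr seen through the import.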

Within a single layer, the same instinct applies to the file itself: an include: list composes a config from responsibility-shaped slices — one file per workflow, one per bounded context. The config scales the way the system does, by splitting along concerns, not by growing one file forever.

That's the difference between a list and a system. A list scales by getting longer. A system scales by getting layered.

The honest limits

This is structure, and structure has costs. Search is lexical, not semantic: it leans on you writing clear titles and tags, and vague metadata gives vague results. A workflow takes more effort to author than a one-line list entry. The seven-tool surface adds a search-then-start hop before the real work.

And a flat list is genuinely the right answer when you have a handful of tools and no ordering to enforce. Don't build a four-layer gateway stack for eight tools — that's structure for its own sake, and the docs say as much: if one MCP server with no governance needs is all you have, point your host straight at it.

The argument isn't "always add structure." It's that a flat list is a default, not a plan — and somewhere on the way to 200 tools you need a plan.

The failure is quiet

A flat tool list works right up until it doesn't, and the breakdown never announces itself. There's no error, no alert. Just an agent that's gradually worse at picking the right tool and slower to get there, and you can't point at the day it changed.

The teams that handle the scaling curve well are the ones who treated the tool list as a design surface before it forced them to. The tool count where that decision is cheap is the one you're at right now.

The full composition model — capabilities, exposures, workflows, and the hierarchical gateway pattern — is in the MCP control architecture guide.