Show HN: I benchmarked MCP vs. CLI for browser automation. MCP wins by 25x

Achiyacohen · 2026-04-11T23:30:55 1775950255

Author here. Some context that didn't fit in the title.

I built safari-mcp a few weeks ago — a macOS-native Safari automation MCP server (no Chrome, no headless, keeps Safari logins). 84 tools via the Model Context Protocol, used directly by Claude Code, Cursor, Cline, etc.

When I saw HKUDS/CLI-Anything (29k stars, auto-wraps open-source software as agent-native CLIs), I wondered if wrapping safari-mcp as a CLI was actually a good idea — so I benchmarked it before shipping.

The numbers, measured live against real Safari:

  Per-call latency (10 runs of list_tabs, warm cache):
    MCP (persistent stdio session):   119ms median
    CLI (subprocess per call):      3,023ms median
    MCP is 25.3x faster.

  5-op reactive workflow:
    MCP:                  2.7s
    CLI sequential:      15.3s
    CLI shell pipeline:  15.2s
    MCP 5.6x faster (pipelining does NOT amortize npx spawn).

  Token overhead per API call (real tools.json, cl100k_base tokenizer):
    MCP (84 tool definitions):  7,986 tokens
    CLI (just `bash` tool def):    95 tokens
    CLI 84x fewer per-call tokens.

  Accuracy: byte-identical output (both paths hit the same safari-mcp).
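
For anyone who wants to reproduce the shape of the latency comparison, here is a minimal sketch of a median-latency harness. It is not the benchmark code from the writeup: `spawn_per_call` and `in_process_call` are hypothetical stand-ins (a bare Python subprocess and a no-op) for the CLI and MCP paths, so the absolute numbers will not match the ones above. It only shows the per-call process-spawn cost that a persistent session avoids.

```python
import statistics
import subprocess
import sys
import time


def median_latency_ms(call, runs=10):
    """Run `call` `runs` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def spawn_per_call():
    # Hypothetical stand-in for the CLI path: a fresh process per operation.
    subprocess.run([sys.executable, "-c", "pass"], check=True)


def in_process_call():
    # Hypothetical stand-in for the MCP path: the session already exists,
    # so a call pays only for the work itself (here, nothing).
    pass


if __name__ == "__main__":
    cli = median_latency_ms(spawn_per_call)
    mcp = median_latency_ms(in_process_call)
    print(f"spawn per call: {cli:.1f} ms median")
    print(f"persistent:     {mcp:.3f} ms median")
```

Using the median rather than the mean keeps one slow outlier (a cold disk cache, a GC pause) from skewing a 10-sample run.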

So for Claude Code / Cursor / Cline users, MCP is the right answer — 25x lower latency per call. I say this up front in the harness's README and SKILL.md.

The CLI exists for a different audience:

- Agents that don't speak MCP (Codex CLI, GitHub Copilot CLI, older frameworks, bash scripts)
- CI / cron: subprocess-friendly, jq-pipeable JSON output
- Long Opus sessions where tool-def tokens dominate cost. At $15/MTok input, sending 7,986 tokens of tool definitions on every API call adds up. 100-turn session: ~$12 in tool-def overhead for MCP vs ~$0.22 for CLI. Prompt caching narrows the gap to ~10x but it's still real money at scale.
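
The tool-def cost arithmetic can be checked with a back-of-envelope sketch. It assumes the simplest possible model: the tool definitions are resent in full on every turn, billed at the input rate, with no prompt caching. The function name `tool_def_cost_usd` is made up for illustration. Under this model the MCP side reproduces the ~$12 figure; the CLI side comes out lower than ~$0.22, so that figure presumably includes overhead this sketch does not model.

```python
def tool_def_cost_usd(tokens_per_call, turns, usd_per_mtok=15.0):
    """Cost of resending tool definitions on every turn, with no caching."""
    return tokens_per_call * turns * usd_per_mtok / 1_000_000


MCP_TOKENS = 7_986  # 84 tool definitions
CLI_TOKENS = 95     # single bash tool definition
TURNS = 100

mcp_cost = tool_def_cost_usd(MCP_TOKENS, TURNS)
cli_cost = tool_def_cost_usd(CLI_TOKENS, TURNS)

if __name__ == "__main__":
    print(f"MCP tool-def overhead: ${mcp_cost:.2f}")
    print(f"CLI tool-def overhead: ${cli_cost:.2f}")
    print(f"token ratio: {MCP_TOKENS / CLI_TOKENS:.0f}x")
```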

The harness is schema-driven: an offline parser reads safari-mcp's Zod definitions, emits a JSON bundle, and at import time safari_cli.py generates 84 Click commands from it — zero manual mapping, parity tests pin the result. The parser went through 5 review rounds before I caught everything, including a sneaky nested-schema bug where .describe() was picked from the inner field instead of the outer.
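
To make "schema-driven" concrete, here is a minimal sketch of generating one subcommand per tool from a JSON bundle. Everything in it is an assumption for illustration: the bundle shape, the `build_parser` and `to_request` helpers, and the request dict are hypothetical, and it uses the standard library's argparse instead of Click to stay dependency-free. The real safari_cli.py and its bundle format may look quite different.

```python
import argparse
import json

# Hypothetical bundle shape; the real generated bundle may differ.
BUNDLE = json.loads("""
[
  {"name": "list_tabs", "description": "List open tabs", "params": []},
  {"name": "navigate", "description": "Open a URL in the current tab",
   "params": [{"name": "url", "type": "string", "required": true,
               "description": "Address to load"}]}
]
""")

# Map schema type names to Python converters for argparse.
TYPES = {"string": str, "integer": int, "number": float}


def build_parser(bundle):
    """Create one subcommand per tool, one --option per parameter."""
    parser = argparse.ArgumentParser(prog="safari-cli-sketch")
    sub = parser.add_subparsers(dest="tool", required=True)
    for tool in bundle:
        cmd = sub.add_parser(tool["name"], help=tool["description"])
        for p in tool["params"]:
            cmd.add_argument(
                "--" + p["name"].replace("_", "-"),
                dest=p["name"],
                type=TYPES[p["type"]],
                required=p.get("required", False),
                help=p.get("description"),
            )
    return parser


def to_request(argv, bundle=BUNDLE):
    """Parse CLI arguments into a tool name plus an arguments dict."""
    ns = vars(build_parser(bundle).parse_args(argv))
    tool = ns.pop("tool")
    # Drop optional parameters the caller did not supply.
    args = {k: v for k, v in ns.items() if v is not None}
    return {"tool": tool, "arguments": args}


if __name__ == "__main__":
    print(to_request(["navigate", "--url", "https://example.com"]))
```

Because the commands are derived from the bundle at startup, adding a tool to the schema adds a command with no hand-written mapping, which is what makes a parity test (bundle tool names versus generated command names) cheap to write.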

Happy to answer questions about the architecture, the benchmark methodology, or why it took 5 review rounds to find all the bugs.

Full writeup with methodology and the bug post-mortems: https://dev.to/achiya-automation/mcp-vs-cli-for-browser-auto...

safari-mcp repo: https://github.com/achiya-automation/safari-mcp