Sebastian Tirelli
2026-03-08 · 9 min read · Tooling

A code graph for Claude Code cut my investigation tokens by 59%

AI coding agents burn tokens reconstructing codebase structure on every session: grep, read, grep again, piece together what a symbol graph already knows. I built CodeGraph, a local MCP server that parses a repo with tree-sitter and exposes a pre-computed call graph to Claude Code through six tools. I use it daily to keep Claude Code token bills sane. A headless benchmark on a 484-file FastAPI stack measured a 59% drop in tokens, a 60% drop in turns, and 82 seconds less wall time per investigation, with file-level recall held at 100%.

Every time Claude Code starts investigating an unfamiliar codebase, it rebuilds the call graph from scratch with grep, read, and pattern-matching. For a single “who calls this?” question, that is routinely five to ten tool calls and twenty to fifty thousand tokens of context. The agent is re-deriving information that a compiler-style symbol index already has.

CodeGraph is a local MCP server I built to pre-compute that graph and hand it to the agent in one tool call. I now run it in every Claude Code session I open against a codebase larger than a toy; the token savings compound to real dollars over a week of work. The numbers below are from a headless benchmark against a 484-file FastAPI + Celery stack. Investigation tasks used 59% fewer tokens and 60% fewer turns, and finished 82 seconds faster, while the set of files the agent touched stayed identical. Refactor tasks showed no meaningful improvement; that result is honest and reproducible from the benchmark harness.

The tool itself is private (the benchmark repo is under NDA). This post walks through what it does, the one design decision that made it work, and the numbers.


What the six tools do

| Tool | What it answers |
| --- | --- |
| search_symbols | Find functions, classes, methods, or routes by name or concept. FTS5 BM25 with partial and conceptual matching (“campaign state machine” resolves to the class even if the phrase is never literal). |
| get_context | Full picture of one symbol: every caller, every callee, class membership, file and line. One call replaces the usual five-grep investigation. |
| trace_calls | Recursive call chain, inbound (“who reaches this eventually?”) or outbound (“what does this touch?”), with depth control. Cycle-safe. |
| find_dead_code | Symbols with no inbound references, excluding entry points, framework decorators, and exports. |
| get_schema | Index stats: symbols by kind, references, languages, last indexed, stale file count. Cheap sanity check before a large query. |
| index_repository | Full or incremental reindex. Normally the hooks keep the index fresh; this is the escape hatch. |
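The partial and conceptual matching behind search_symbols rides on SQLite FTS5. A minimal sketch of the core idea, assuming symbol names are split into words at index time; the table layout and the tokenizing regex here are illustrative, not CodeGraph's actual schema:

```python
import re
import sqlite3

def split_name(name: str) -> str:
    # "CampaignStateMachine" -> "campaign state machine"
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)
    return " ".join(p.lower() for p in parts)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE symbols USING fts5(tokens, name UNINDEXED, path UNINDEXED)"
)
for name, path in [
    ("CampaignStateMachine", "src/models/campaign.py"),
    ("StateMachineBase", "src/core/base.py"),
    ("parse_payload", "src/webhooks/resend.py"),
]:
    conn.execute(
        "INSERT INTO symbols VALUES (?, ?, ?)", (split_name(name), name, path)
    )

# bm25() ranks matches; lower is better, so ORDER BY ascending.
hits = conn.execute(
    "SELECT name, path FROM symbols WHERE symbols MATCH ? ORDER BY bm25(symbols)",
    ("campaign state machine",),
).fetchall()
# hits[0] -> ("CampaignStateMachine", "src/models/campaign.py")
```

A real query also needs to tolerate missing words; FTS5's OR and prefix operators (campaign*) cover that.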

A concrete example from the benchmark harness, a 484-file FastAPI codebase:

search_symbols("campaign state machine")
→ CampaignStateMachine (class)   src/models/campaign.py:42
→ StateMachineBase (class)       src/core/base.py:11

get_context("CampaignStateMachine")
→ Callers (4): CampaignService.update, CampaignRouter.patch, tests/test_campaign.py, ...
→ Callees (7): StateMachineBase.transition, validate_state, emit_event, ...
→ Class: CampaignStateMachine → StateMachineBase

trace_calls("process_webhook", direction="out", depth=4)
→ process_webhook
   ├─ parse_payload → validate_schema → JSONSchema.validate
   ├─ update_campaign → CampaignStateMachine.transition → emit_event
   └─ notify_crm → CRMClient.post → httpx.AsyncClient.request

The same investigation against the same codebase done with Grep and Read took 15 to 20 tool calls and missed indirect references through object variables and inherited dispatch.

Type-aware resolution, not just grep

The resolver is what separates CodeGraph from a plain text search over def and class. Three levels, all at index time:

Level 1. self.method() intra-class resolution. When a method calls self.validate(), the resolver looks up validate within the same class and creates a direct reference. No guessing by name overlap across unrelated classes.

Level 2. Type-annotation bindings. Parameter annotations (user: User) and variable annotations (db: Session = Depends(get_db)) register type bindings. When the function later calls user.update(), the resolver knows user is a User and resolves to User.update. FastAPI Depends() injection is common in the codebases I tested; a naïve grep-based finder misses most of the real control flow there.

Level 3. Constructor tracking. Assignments like self.service = CampaignService() in __init__ register bindings that propagate to every method in the class. A call to self.service.create() three methods down resolves to CampaignService.create without any annotation.

Applied to Python and TypeScript. On a 484-file production codebase, this lifted the share of methods with resolved callers from 33% to 35% (roughly 4% more total references). Not dramatic on paper; what matters is that the newly resolved references are exactly the indirect calls grep cannot see, which is where investigation tasks spend most of their time.

Unresolvable calls (dynamic dispatch, external libraries, importlib, string-based lookup) are silently skipped. The tool does not invent references.
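The three levels can be sketched with Python's stdlib ast module. The production resolver works on tree-sitter parse trees and covers far more cases; this toy version only shows the binding logic:

```python
import ast

def resolve(src: str) -> list[tuple[str, str]]:
    """Toy three-level resolver: returns (caller, callee) pairs."""
    tree = ast.parse(src)
    refs = []
    for cls in (n for n in tree.body if isinstance(n, ast.ClassDef)):
        methods = {m.name for m in cls.body if isinstance(m, ast.FunctionDef)}
        # Level 3: self.attr = SomeClass() in __init__ registers a binding.
        attr_types = {}
        for m in cls.body:
            if isinstance(m, ast.FunctionDef) and m.name == "__init__":
                for stmt in ast.walk(m):
                    if (isinstance(stmt, ast.Assign)
                            and isinstance(stmt.targets[0], ast.Attribute)
                            and isinstance(stmt.value, ast.Call)
                            and isinstance(stmt.value.func, ast.Name)):
                        attr_types[stmt.targets[0].attr] = stmt.value.func.id
        for m in cls.body:
            if not isinstance(m, ast.FunctionDef):
                continue
            # Level 2: parameter annotations (user: User) register bindings.
            param_types = {a.arg: a.annotation.id for a in m.args.args
                           if isinstance(a.annotation, ast.Name)}
            caller = f"{cls.name}.{m.name}"
            for node in ast.walk(m):
                if not (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)):
                    continue
                base, meth = node.func.value, node.func.attr
                if isinstance(base, ast.Name) and base.id == "self" and meth in methods:
                    refs.append((caller, f"{cls.name}.{meth}"))                   # Level 1
                elif (isinstance(base, ast.Attribute)
                        and isinstance(base.value, ast.Name)
                        and base.value.id == "self"
                        and base.attr in attr_types):
                    refs.append((caller, f"{attr_types[base.attr]}.{meth}"))      # Level 3
                elif isinstance(base, ast.Name) and base.id in param_types:
                    refs.append((caller, f"{param_types[base.id]}.{meth}"))       # Level 2
                # Anything else (dynamic dispatch, externals) is skipped.
    return refs

SRC = '''
class User:
    def update(self): ...

class CampaignService:
    def create(self): ...

class CampaignRouter:
    def __init__(self):
        self.service = CampaignService()

    def validate(self): ...

    def patch(self, user: User):
        self.validate()         # Level 1: intra-class
        self.service.create()   # Level 3: constructor binding
        user.update()           # Level 2: annotation binding
'''

print(resolve(SRC))
# [('CampaignRouter.patch', 'CampaignRouter.validate'),
#  ('CampaignRouter.patch', 'CampaignService.create'),
#  ('CampaignRouter.patch', 'User.update')]
```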

The hook that beat the agent override

My first instinct for teaching Claude Code to prefer MCP tools over grep was a custom .claude/agents/Explore.md subagent with a system prompt telling it to prefer search_symbols over Grep. It was the natural move.

It does not work reliably. Tool substitution ratio in the benchmark ranged from −0.25 to −6.33 across runs and model variants: depending on instruction ordering and prior training, the model would sometimes prefer the MCP tool and sometimes fall straight back to grep. No single prompt shape fixed it consistently.

The deterministic version is a PreToolUse hook that matches ^Agent$, inspects the subagent_type field in the dispatch, and rewrites "Explore" to "codegraph-explore" whenever the project has a CodeGraph index. The model never sees the original dispatch. It cannot fall back because the rewrite happens at the harness level, before the subagent starts. Same benchmark, with the hook in place: +1.75× tool substitution ratio, consistently.
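A minimal sketch of the shape of such a hook script. The stdin fields and the response schema here are illustrative placeholders rather than the exact Claude Code hook contract, and the .codegraph marker path is hypothetical:

```python
"""Hypothetical PreToolUse redirect hook (JSON field names are illustrative)."""
import json
import os
import sys

INDEX_MARKER = ".codegraph"  # hypothetical: presence means the repo is indexed

def rewrite(call: dict, project_dir: str = ".") -> dict:
    """Rewrite an Explore subagent dispatch to codegraph-explore."""
    tool_input = dict(call.get("tool_input", {}))
    if (call.get("tool_name") == "Agent"
            and tool_input.get("subagent_type") == "Explore"
            and os.path.exists(os.path.join(project_dir, INDEX_MARKER))):
        tool_input["subagent_type"] = "codegraph-explore"
    return tool_input

def main() -> None:
    # Wire this in as the hook's entry point. The exact response schema for
    # modifying tool input is version-specific; check the Claude Code hooks
    # documentation for your install.
    call = json.load(sys.stdin)
    json.dump({"tool_input": rewrite(call, call.get("cwd", "."))}, sys.stdout)
```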

The generalisable lesson is narrow: when you need guaranteed behavior around tool choice, move it out of the prompt and into the harness. Agent instructions are an influence, not a contract.

The benchmark harness, built with Claude headless

The numbers in this post are from a benchmark suite bundled with the project at src/codegraph/benchmark/. Tasks are YAML specs under tools/benchmark/tasks/ that describe a natural-language prompt and a ground truth (expected files, optional validation command). The runner uses claude -p to execute each task twice, once against a clean clone of the target repo and once with CodeGraph installed. It captures the full stream of tool calls, token counts, turns, wall time, cost, and the diff of files produced. Output lands as metrics.json, summary.md, and per-run .diff files in timestamped result directories.
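A stripped-down sketch of the runner's shape. The task schema and the result fields are assumptions, not the harness's actual code, and the claude CLI's JSON output fields vary by version; verify against your install:

```python
import json
import subprocess
import time
from pathlib import Path

def file_recall(expected: set[str], touched: set[str]) -> float:
    """Fraction of ground-truth files the agent actually visited."""
    return len(expected & touched) / len(expected) if expected else 1.0

def run_task(task_path: str, repo_dir: str) -> dict:
    import yaml  # pyyaml; task specs are YAML in the real harness

    task = yaml.safe_load(Path(task_path).read_text())
    start = time.monotonic()
    # `claude -p` runs one headless prompt; --output-format json makes the
    # final result object (usage, turns) machine-readable.
    proc = subprocess.run(
        ["claude", "-p", task["prompt"], "--output-format", "json"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    result = json.loads(proc.stdout)
    return {
        "wall_s": time.monotonic() - start,
        "usage": result.get("usage", {}),
        "num_turns": result.get("num_turns"),
    }
```

Running each task twice, against a clean clone and an indexed one, is then a loop over this function with two repo directories.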

The investigation task reported above asked the agent to trace the complete Resend bounce webhook flow from HTTP entry to database state change:

| Metric | Without CodeGraph | With CodeGraph | Delta |
| --- | --- | --- | --- |
| Total tokens | 2,169,442 | 885,842 | −59% |
| Turns | 20 | 8 | −60% |
| Tool calls | 77 | 51 | −34% |
| Bash (grep/cat) calls | 28 | 4 | −86% |
| Wall time | 337 s | 255 s | −82 s |
| Files recall | 100% | 100% | preserved |

Cost in USD is intentionally not in the table. Across the full set of investigation runs the dollar delta was noisy, between −$0.11 and +$0.40, because Anthropic pricing is not linear in total tokens: input, output, cache-write, and cache-read each have different rates, and the mix shifts from run to run even when the task and repo are identical. The story at current pricing is “roughly the same dollars, half the context, fewer turns, faster.” Tokens, turns, and wall time move in the same direction on every investigation run; dollars are downstream noise.
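To make that nonlinearity concrete, a toy calculation with illustrative per-million-token rates (placeholders, not a real price sheet):

```python
# Illustrative rates in USD per million tokens -- placeholders only.
RATES = {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30}

def cost(usage: dict) -> float:
    """Dollar cost of one run given a token-count breakdown."""
    return sum(usage[k] / 1_000_000 * RATES[k] for k in RATES)

# Run B uses roughly half the total tokens of run A, but a pricier mix:
# more fresh input, less cheap cache-read.
run_a = {"input": 200_000, "output": 40_000,
         "cache_write": 300_000, "cache_read": 1_600_000}
run_b = {"input": 600_000, "output": 50_000,
         "cache_write": 100_000, "cache_read": 300_000}

# B ends up *more expensive* despite far fewer tokens.
print(sum(run_a.values()), cost(run_a))
print(sum(run_b.values()), cost(run_b))
```

Half the tokens, higher dollar cost: that is why the token delta is the stable signal and the dollar delta is noise.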

Refactor tasks (rename a logging pattern, wire a new config key) landed between −4% and +12% tokens across runs. That is expected. Mechanical edits spend their tokens on the write, not on navigation. CodeGraph is an investigation tool.

What I learned

Three things, in order.

Pre-computing structure is the cheapest unlock in the agentic-coding stack. Everything an agent does with grep and read over a 500-file repo is reinventing a call graph badly. Building the graph once, keeping it fresh incrementally via content hash, and exposing it through tools collapses the investigation phase to a size that fits comfortably inside the context window. The delta is large enough that the overhead of running a second process is invisible at this scale.
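The incremental-freshness part of that is plain content hashing. A minimal sketch, with a hypothetical manifest location:

```python
import hashlib
import json
from pathlib import Path

def stale_files(root: str, manifest_rel: str = ".codegraph/hashes.json") -> list[str]:
    """Return source files whose content hash changed since the last index,
    and persist the new hashes. Manifest path is hypothetical."""
    manifest = Path(root) / manifest_rel
    old = json.loads(manifest.read_text()) if manifest.exists() else {}
    new, stale = {}, []
    for path in sorted(Path(root).rglob("*.py")):
        rel = str(path.relative_to(root))
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[rel] = digest
        if old.get(rel) != digest:
            stale.append(rel)
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text(json.dumps(new, indent=2))
    return stale
```

Only the files this returns get re-parsed; an unchanged 484-file repo reindexes as a no-op.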

Deterministic harness hooks beat brittle agent instructions. Every time I tried to fix behavior by rewriting a system prompt, the win evaporated as soon as the model version, instruction ordering, or prior training shifted. Every time I moved the same behavior into a PreToolUse hook that intercepted the call before the model got to decide, the win held. The rule of thumb: if you need a guarantee, move the logic out of the prompt.

Benchmark your own tool, honestly, before you ship it. I discovered the refactor-tasks flat result by running the benchmark. Without that, I would have shipped the post claiming a general-purpose win. The benchmark also caught the early version of the Explore redirect as a regression, which is how the hook approach became the production one. The cost of building the harness was one long evening, and it paid for itself inside the first week.

If the methodology or numbers in this post are useful, the harness is the reproducible part; the specific tool is less important than the measurement discipline around it.