Sebastian Tirelli
2026-03-08 · 9 min read · Tooling

A code graph for Claude Code cut my investigation tokens by 59%

AI coding agents burn tokens reconstructing codebase structure on every session: grep, read, grep again, piece together what a symbol graph already knows. I built CodeGraph, a local MCP server that parses a repo with tree-sitter and exposes a pre-computed call graph to Claude Code through six tools. I use it daily to keep Claude Code token bills sane. A headless benchmark on a 484-file FastAPI stack measured a 59% drop in tokens, a 60% drop in turns, and 82 seconds less wall time per investigation, with file-level recall held at 100%.

Every time Claude Code starts investigating an unfamiliar codebase, it rebuilds the call graph from scratch with grep, read, and pattern-matching. For a single “who calls this?” question, that is routinely five to ten tool calls and twenty to fifty thousand tokens of context. The agent is re-deriving information that a compiler-style symbol index already has.

CodeGraph is a local MCP server I built to pre-compute that graph and hand it to the agent in one tool call. I now run it in every Claude Code session I open against a codebase larger than a toy; the token savings compound to real dollars over a week of work. The numbers below are from a headless benchmark against a 484-file FastAPI + Celery stack. Investigation tasks used 59% fewer tokens and 60% fewer turns, and finished 82 seconds faster, while the set of files the agent touched stayed identical. Refactor tasks showed no meaningful improvement; that result is honest and reproducible from the benchmark harness.

The tool itself is private (the benchmark repo is under NDA). This post walks through what it does, the one design decision that made it work, and the numbers.


What the six tools do

| Tool | What it answers |
| --- | --- |
| search_symbols | Find functions, classes, methods, or routes by name or concept. FTS5 BM25 with partial and conceptual matching (“campaign state machine” resolves to the class even if the phrase is never literal). |
| get_context | Full picture of one symbol: every caller, every callee, class membership, file and line. One call replaces the usual five-grep investigation. |
| trace_calls | Recursive call chain, inbound (“who reaches this eventually?”) or outbound (“what does this touch?”), with depth control. Cycle-safe. |
| find_dead_code | Symbols with no inbound references, excluding entry points, framework decorators, and exports. |
| get_schema | Index stats: symbols by kind, references, languages, last indexed, stale file count. Cheap sanity check before a large query. |
| index_repository | Full or incremental reindex. Normally the hooks keep the index fresh; this is the escape hatch. |
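The partial and conceptual matching behind search_symbols rides on SQLite FTS5. A minimal sketch of the core idea, assuming symbol names are split into words at index time; the table layout and the tokenizing regex here are illustrative, not CodeGraph's actual schema:

```python
import re
import sqlite3

def split_name(name: str) -> str:
    # "CampaignStateMachine" -> "campaign state machine"
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)
    return " ".join(p.lower() for p in parts)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE symbols USING fts5(tokens, name UNINDEXED, path UNINDEXED)"
)
for name, path in [
    ("CampaignStateMachine", "src/models/campaign.py"),
    ("StateMachineBase", "src/core/base.py"),
    ("parse_payload", "src/webhooks/resend.py"),
]:
    conn.execute(
        "INSERT INTO symbols VALUES (?, ?, ?)", (split_name(name), name, path)
    )

# bm25() ranks matches; lower is better, so ORDER BY ascending.
hits = conn.execute(
    "SELECT name, path FROM symbols WHERE symbols MATCH ? ORDER BY bm25(symbols)",
    ("campaign state machine",),
).fetchall()
# hits[0] -> ("CampaignStateMachine", "src/models/campaign.py")
```

A real query also needs to tolerate missing words; FTS5's OR and prefix operators (campaign*) cover that.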

A concrete example from the benchmark harness, a 484-file FastAPI codebase:

search_symbols("campaign state machine")
→ CampaignStateMachine (class)   src/models/campaign.py:42
→ StateMachineBase (class)       src/core/base.py:11

get_context("CampaignStateMachine")
→ Callers (4): CampaignService.update, CampaignRouter.patch, tests/test_campaign.py, ...
→ Callees (7): StateMachineBase.transition, validate_state, emit_event, ...
→ Class: CampaignStateMachine → StateMachineBase

trace_calls("process_webhook", direction="out", depth=4)
→ process_webhook
   ├─ parse_payload → validate_schema → JSONSchema.validate
   ├─ update_campaign → CampaignStateMachine.transition → emit_event
   └─ notify_crm → CRMClient.post → httpx.AsyncClient.request

The same investigation against the same codebase done with Grep and Read took 15 to 20 tool calls and missed indirect references through object variables and inherited dispatch.

Type-aware resolution, not just grep

The resolver is what separates CodeGraph from a plain text search over def and class. Three levels, all at index time:

Level 1. self.method() intra-class resolution. When a method calls self.validate(), the resolver looks up validate within the same class and creates a direct reference. No guessing by name overlap across unrelated classes.

Level 2. Type-annotation bindings. Parameter annotations (user: User) and variable annotations (db: Session = Depends(get_db)) register type bindings. When the function later calls user.update(), the resolver knows user is a User and resolves to User.update. FastAPI Depends() injection is common in the codebases I tested; a naïve grep-based finder misses most of the real control flow there.

Level 3. Constructor tracking. Assignments like self.service = CampaignService() in __init__ register bindings that propagate to every method in the class. A call to self.service.create() three methods down resolves to CampaignService.create without any annotation.

Applied to Python and TypeScript. On a 484-file production codebase, this lifted the share of methods with resolved callers from 33% to 35% (roughly 4% more total references). Not dramatic on paper; what matters is that the newly resolved references are exactly the indirect calls grep cannot see, which is where investigation tasks spend most of their time.

Unresolvable calls (dynamic dispatch, external libraries, importlib, string-based lookup) are silently skipped. The tool does not invent references.
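The three levels can be sketched with Python's stdlib ast module. The production resolver works on tree-sitter parse trees and covers far more cases; this toy version only shows the binding logic:

```python
import ast

def resolve(src: str) -> list[tuple[str, str]]:
    """Toy three-level resolver: returns (caller, callee) pairs."""
    tree = ast.parse(src)
    refs = []
    for cls in (n for n in tree.body if isinstance(n, ast.ClassDef)):
        methods = {m.name for m in cls.body if isinstance(m, ast.FunctionDef)}
        # Level 3: self.attr = SomeClass() in __init__ registers a binding.
        attr_types = {}
        for m in cls.body:
            if isinstance(m, ast.FunctionDef) and m.name == "__init__":
                for stmt in ast.walk(m):
                    if (isinstance(stmt, ast.Assign)
                            and isinstance(stmt.targets[0], ast.Attribute)
                            and isinstance(stmt.value, ast.Call)
                            and isinstance(stmt.value.func, ast.Name)):
                        attr_types[stmt.targets[0].attr] = stmt.value.func.id
        for m in cls.body:
            if not isinstance(m, ast.FunctionDef):
                continue
            # Level 2: parameter annotations (user: User) register bindings.
            param_types = {a.arg: a.annotation.id for a in m.args.args
                           if isinstance(a.annotation, ast.Name)}
            caller = f"{cls.name}.{m.name}"
            for node in ast.walk(m):
                if not (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)):
                    continue
                base, meth = node.func.value, node.func.attr
                if isinstance(base, ast.Name) and base.id == "self" and meth in methods:
                    refs.append((caller, f"{cls.name}.{meth}"))                   # Level 1
                elif (isinstance(base, ast.Attribute)
                        and isinstance(base.value, ast.Name)
                        and base.value.id == "self"
                        and base.attr in attr_types):
                    refs.append((caller, f"{attr_types[base.attr]}.{meth}"))      # Level 3
                elif isinstance(base, ast.Name) and base.id in param_types:
                    refs.append((caller, f"{param_types[base.id]}.{meth}"))       # Level 2
                # Anything else (dynamic dispatch, externals) is skipped.
    return refs

SRC = '''
class User:
    def update(self): ...

class CampaignService:
    def create(self): ...

class CampaignRouter:
    def __init__(self):
        self.service = CampaignService()

    def validate(self): ...

    def patch(self, user: User):
        self.validate()         # Level 1: intra-class
        self.service.create()   # Level 3: constructor binding
        user.update()           # Level 2: annotation binding
'''

print(resolve(SRC))
# [('CampaignRouter.patch', 'CampaignRouter.validate'),
#  ('CampaignRouter.patch', 'CampaignService.create'),
#  ('CampaignRouter.patch', 'User.update')]
```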

The hook that beat the agent override

My first instinct for teaching Claude Code to prefer MCP tools over grep was a custom .claude/agents/Explore.md subagent with a system prompt telling it to prefer search_symbols over Grep. It was the natural move.

It does not work reliably. Tool substitution ratio in the benchmark ranged from −0.25 to −6.33 across runs and model variants: depending on instruction ordering and prior training, the model would sometimes prefer the MCP tool and sometimes fall straight back to grep. No single prompt shape fixed it consistently.

The deterministic version is a PreToolUse hook that matches ^Agent$, inspects the subagent_type field in the dispatch, and rewrites "Explore" to "codegraph-explore" whenever the project has a CodeGraph index. The model never sees the original dispatch. It cannot fall back because the rewrite happens at the harness level, before the subagent starts. Same benchmark, with the hook in place: +1.75× tool substitution ratio, consistently.
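A minimal sketch of the shape of such a hook script. The stdin fields and the response schema here are illustrative placeholders rather than the exact Claude Code hook contract, and the .codegraph marker path is hypothetical:

```python
"""Hypothetical PreToolUse redirect hook (JSON field names are illustrative)."""
import json
import os
import sys

INDEX_MARKER = ".codegraph"  # hypothetical: presence means the repo is indexed

def rewrite(call: dict, project_dir: str = ".") -> dict:
    """Rewrite an Explore subagent dispatch to codegraph-explore."""
    tool_input = dict(call.get("tool_input", {}))
    if (call.get("tool_name") == "Agent"
            and tool_input.get("subagent_type") == "Explore"
            and os.path.exists(os.path.join(project_dir, INDEX_MARKER))):
        tool_input["subagent_type"] = "codegraph-explore"
    return tool_input

def main() -> None:
    # Wire this in as the hook's entry point. The exact response schema for
    # modifying tool input is version-specific; check the Claude Code hooks
    # documentation for your install.
    call = json.load(sys.stdin)
    json.dump({"tool_input": rewrite(call, call.get("cwd", "."))}, sys.stdout)
```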

The generalisable lesson is narrow: when you need guaranteed behavior around tool choice, move it out of the prompt and into the harness. Agent instructions are an influence, not a contract.

The benchmark harness, built with Claude headless

The numbers in this post are from a benchmark suite bundled with the project at src/codegraph/benchmark/. Tasks are YAML specs under tools/benchmark/tasks/ that describe a natural-language prompt and a ground truth (expected files, optional validation command). The runner uses claude -p to execute each task twice, once against a clean clone of the target repo and once with CodeGraph installed. It captures the full stream of tool calls, token counts, turns, wall time, cost, and the diff of files produced. Output lands as metrics.json, summary.md, and per-run .diff files in timestamped result directories.
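A stripped-down sketch of the runner's shape. The task schema and the result fields are assumptions, not the harness's actual code, and the claude CLI's JSON output fields vary by version; verify against your install:

```python
import json
import subprocess
import time
from pathlib import Path

def file_recall(expected: set[str], touched: set[str]) -> float:
    """Fraction of ground-truth files the agent actually visited."""
    return len(expected & touched) / len(expected) if expected else 1.0

def run_task(task_path: str, repo_dir: str) -> dict:
    import yaml  # pyyaml; task specs are YAML in the real harness

    task = yaml.safe_load(Path(task_path).read_text())
    start = time.monotonic()
    # `claude -p` runs one headless prompt; --output-format json makes the
    # final result object (usage, turns) machine-readable.
    proc = subprocess.run(
        ["claude", "-p", task["prompt"], "--output-format", "json"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    result = json.loads(proc.stdout)
    return {
        "wall_s": time.monotonic() - start,
        "usage": result.get("usage", {}),
        "num_turns": result.get("num_turns"),
    }
```

Running each task twice, against a clean clone and an indexed one, is then a loop over this function with two repo directories.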

The investigation task reported above asked the agent to trace the complete Resend bounce webhook flow from HTTP entry to database state change:

| Metric | Without CodeGraph | With CodeGraph | Delta |
| --- | --- | --- | --- |
| Total tokens | 2,169,442 | 885,842 | −59% |
| Turns | 20 | 8 | −60% |
| Tool calls | 77 | 51 | −34% |
| Bash (grep/cat) calls | 28 | 4 | −86% |
| Wall time | 337 s | 255 s | −82 s |
| Files recall | 100% | 100% | preserved |

Cost in USD is intentionally not in the table. Across the full set of investigation runs the dollar delta was noisy, between −$0.11 and +$0.40, because Anthropic pricing is not linear in total tokens: input, output, cache-write, and cache-read each have different rates, and the mix shifts from run to run even when the task and repo are identical. The story at current pricing is “roughly the same dollars, half the context, fewer turns, faster.” Tokens, turns, and wall time move in the same direction on every investigation run; dollars are downstream noise.
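To make that nonlinearity concrete, a toy calculation with illustrative per-million-token rates (placeholders, not a real price sheet):

```python
# Illustrative rates in USD per million tokens -- placeholders only.
RATES = {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30}

def cost(usage: dict) -> float:
    """Dollar cost of one run given a token-count breakdown."""
    return sum(usage[k] / 1_000_000 * RATES[k] for k in RATES)

# Run B uses roughly half the total tokens of run A, but a pricier mix:
# more fresh input, less cheap cache-read.
run_a = {"input": 200_000, "output": 40_000,
         "cache_write": 300_000, "cache_read": 1_600_000}
run_b = {"input": 600_000, "output": 50_000,
         "cache_write": 100_000, "cache_read": 300_000}

# B ends up *more expensive* despite far fewer tokens.
print(sum(run_a.values()), cost(run_a))
print(sum(run_b.values()), cost(run_b))
```

Half the tokens, higher dollar cost: that is why the token delta is the stable signal and the dollar delta is noise.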

Refactor tasks (rename a logging pattern, wire a new config key) landed between −4% and +12% tokens across runs. That is expected. Mechanical edits spend their tokens on the write, not on navigation. CodeGraph is an investigation tool.

What I learned

Three things, in order.

Pre-computing structure is the cheapest unlock in the agentic-coding stack. Everything an agent does with grep and read over a 500-file repo is reinventing a call graph badly. Building the graph once, keeping it fresh incrementally via content hash, and exposing it through tools collapses the investigation phase to a size that fits comfortably inside the context window. The delta is large enough that the overhead of running a second process is invisible at this scale.
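The incremental-freshness part of that is plain content hashing. A minimal sketch, with a hypothetical manifest location:

```python
import hashlib
import json
from pathlib import Path

def stale_files(root: str, manifest_rel: str = ".codegraph/hashes.json") -> list[str]:
    """Return source files whose content hash changed since the last index,
    and persist the new hashes. Manifest path is hypothetical."""
    manifest = Path(root) / manifest_rel
    old = json.loads(manifest.read_text()) if manifest.exists() else {}
    new, stale = {}, []
    for path in sorted(Path(root).rglob("*.py")):
        rel = str(path.relative_to(root))
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[rel] = digest
        if old.get(rel) != digest:
            stale.append(rel)
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text(json.dumps(new, indent=2))
    return stale
```

Only the files this returns get re-parsed; an unchanged 484-file repo reindexes as a no-op.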

Deterministic harness hooks beat brittle agent instructions. Every time I tried to fix behavior by rewriting a system prompt, the win evaporated as soon as the model version, instruction ordering, or prior training shifted. Every time I moved the same behavior into a PreToolUse hook that intercepted the call before the model got to decide, the win held. The rule of thumb: if you need a guarantee, move the logic out of the prompt.

Benchmark your own tool, honestly, before you ship it. I discovered the refactor-tasks flat result by running the benchmark. Without that, I would have shipped the post claiming a general-purpose win. The benchmark also caught the early version of the Explore redirect as a regression, which is how the hook approach became the production one. The cost of building the harness was one long evening, and it paid for itself inside the first week.

If the methodology or numbers in this post are useful, the harness is the reproducible part; the specific tool is less important than the measurement discipline around it.