7 Things Eating Your Claude Code Tokens (and How to Fix Them)

Author: vybecoding

Beginner · 14m read · Full-stack developers

A 12,578-byte CLAUDE.md ships to the model on every single message before you type a word. That is one of seven quiet token drains. Here is how to find what is eating your Claude Code budget, with the fix for each.

Primary Focus

ai development

AI Tools Covered

AI-first, Next.js, Convex

What You'll Learn

✓ Use `/context` Before You Try to Optimize Anything
✓ Your CLAUDE.md Is Probably Costing You on Every Turn
✓ Wrong Model for the Job: When Sonnet or Haiku Beats Opus
✓ Skipping Plan Mode Burns Tokens on the Wrong Path
✓ Late `/compact` Is Worse Than No `/compact`
✓ Sub-Agent Overhead: When Agent Calls Cost 7–10x

Guide Curriculum

Diagnose Where Your Tokens Actually Go

Measure before you optimize. Find the file that's quietly costing you tokens on every single turn.

2 lessons

• Use `/context` Before You Try to Optimize Anything (2m)
• Your CLAUDE.md Is Probably Costing You on Every Turn (2m)

Pick the Right Model and Mode for the Work

Match the task to the cheapest model that can do it, and force a plan before expensive multi-file work.

2 lessons

• Wrong Model for the Job: When Sonnet or Haiku Beats Opus (2m)
• Skipping Plan Mode Burns Tokens on the Wrong Path (2m)

Manage the Context Lifecycle

Compact early, skip the sub-agent tax, and adopt three habits that cut spend on every long session.

3 lessons

• Late `/compact` Is Worse Than No `/compact` (2m)
• Sub-Agent Overhead: When Agent Calls Cost 7–10x (2m)
• Context Hygiene: `/clear`, Batching, and Surgical File Refs (2m)

Preview: First Lesson

Diagnose Where Your Tokens Actually Go

Use `/context` Before You Try to Optimize Anything

Stop guessing. Run one command.

When a token budget feels tight, the reflex is to start optimizing something, anything, before you know what's actually eating the budget. That reflex is the mistake. Claude Code ships a diagnostic command precisely so you don't have to guess.

Token waste is almost never where you think it is. People blame their long messages. The real bloat is usually quieter: a CLAUDE.md you forgot was loaded, a sub-agent that re-read a 40 KB project file, a tool schema sitting at several thousand tokens before you've typed a single character.

So measure first. Run /context and read what it gives back. It lists everything currently loaded into the conversation, grouped by source: system prompt, CLAUDE.md, recent tool results, file contents, the running conversation. The quiet offenders surface in seconds.

/context

A typical breakdown looks like this:

| Source | Tokens | Note |
|---|---|---|
| System prompt | 4,200 | Fixed |
| CLAUDE.md (project) | 6,800 | <-- this is large |
| CLAUDE.md (user) | 1,400 | |
| Tool schemas | 7,900 | |
| File reads (this turn) | 2,100 | |
| Conversation so far | 12,400 | |

See a 6,800-token CLAUDE.md near the top of that list? You've just found the single highest-leverage file to trim, with no guesswork. The next lesson is about acting on exactly that kind of finding.

The verdict is simple. If you only adopt one habit from this whole guide, make it this one: never optimize a number you haven't looked at.


This guide includes:

3 modules with 7 lessons

14m estimated reading time

About the Author

✨ Vibe Coder

@hiram-clark

Hiram Clark is the founder and managing editor of vybecoding.ai and sets editorial direction for the guides and news published here. Articles are drafted with AI assistance and edited before publication. He works hands-on with the AI development tools, workflows, and infrastructure covered on the site.

Full Guide Content


Module 1: Diagnose Where Your Tokens Actually Go

1.1 Use `/context` Before You Try to Optimize Anything

Stop guessing. Run one command.

/context

A typical breakdown looks like this:

| Source | Tokens | Note |
|---|---|---|
| System prompt | 4,200 | Fixed |
| CLAUDE.md (project) | 6,800 | <-- this is large |
| CLAUDE.md (user) | 1,400 | |
| Tool schemas | 7,900 | |
| File reads (this turn) | 2,100 | |
| Conversation so far | 12,400 | |

See a 6,800-token CLAUDE.md near the top of that list? You've just found the single highest-leverage file to trim, with no guesswork. The next lesson is about acting on exactly that kind of finding.
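To make the reading step concrete, here is a minimal sketch that ranks the sources by share of the total. The numbers are copied from the illustrative breakdown above, not from real `/context` output, and `rank_sources` is a hypothetical helper name.

```python
# Illustrative numbers copied from the breakdown table; not real /context output.
breakdown = {
    "System prompt": 4_200,
    "CLAUDE.md (project)": 6_800,
    "CLAUDE.md (user)": 1_400,
    "Tool schemas": 7_900,
    "File reads (this turn)": 2_100,
    "Conversation so far": 12_400,
}


def rank_sources(sources: dict[str, int]) -> list[tuple[str, int, float]]:
    """Return (name, tokens, share of total), sorted largest first."""
    total = sum(sources.values())
    ranked = sorted(sources.items(), key=lambda item: item[1], reverse=True)
    return [(name, tokens, tokens / total) for name, tokens in ranked]


for name, tokens, share in rank_sources(breakdown):
    print(f"{name:<24} {tokens:>7,} {share:6.1%}")
```

In this example the conversation and the tool schemas are larger in raw tokens, but the project CLAUDE.md is the largest single file you can edit directly, which is why it is the first thing to trim.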

The verdict is simple. If you only adopt one habit from this whole guide, make it this one: never optimize a number you haven't looked at.

1.2 Your CLAUDE.md Is Probably Costing You on Every Turn

Here's a number I measured by hand. The CLAUDE.md in this project is 12,578 bytes. Run wc -c CLAUDE.md and you get the same figure. At the standard four-characters-per-token estimate that's about 3,100 tokens, and a word-count cross-check (wc -w, roughly 0.75 words per token) lands close to the same place. That file loads into the prompt for every message you send.

CLAUDE.md is the "always on" instruction file. It loads once per session conceptually, but it rides along in the prompt on every turn. That makes it feel free. It is the opposite of free.

A 3,100-token CLAUDE.md costs 3,100 tokens per message. Thirty messages into a session, you've spent roughly 93,000 tokens on the same block of instructions. And if half that file is reference material the model rarely needs (MCP server tables, framework cheat sheets, full architectural decision logs), you're paying every turn for content that sits idle most of the time.
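The arithmetic above can be sketched in a few lines. This is a rough estimate only: it assumes the four-characters-per-token rule of thumb and mostly ASCII text, and the function names are hypothetical.

```python
from pathlib import Path


def estimate_tokens(byte_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a byte count (assumes mostly ASCII text)."""
    return int(byte_count / chars_per_token)


def file_tokens(path: str) -> int:
    """Estimate tokens for a file on disk, the same figure `wc -c` would feed in."""
    return estimate_tokens(Path(path).stat().st_size)


def cumulative_tokens(tokens_per_turn: int, turns: int) -> int:
    """Tokens spent re-sending the same always-on block over a session."""
    return tokens_per_turn * turns


per_turn = estimate_tokens(12_578)      # 3144, rounded to ~3,100 in the text
print(per_turn, cumulative_tokens(per_turn, 30))
```

The unrounded figure comes out slightly above the 93,000 quoted above because the text rounds 3,144 down to 3,100 before multiplying.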

The fix is to treat CLAUDE.md as a lookup index, not a manual. Pull the heavy reference sections into separate files and link to them. The model reads a linked file on demand when a task actually touches that area. When a task doesn't, the tokens never get spent.

Before, one bloated file (~4,800 tokens):

# Project Rules
[20 lines of core rules]
[150 lines of MCP server table]
[100 lines of GitNexus directive details]
[80 lines of key references table]

After, a trim file (~1,400 tokens) plus on-demand links:

# Project Rules
[20 lines of core rules]

## MCP Servers
-> docs/claude-ref-mcp.md

## GitNexus
-> docs/claude-ref-gitnexus.md

## Key References
-> docs/claude-ref-key-references.md

That's roughly a 70% cut on the always-loaded file. The reference material is still there. Claude reads the linked file when a task needs it, and most sessions never will.
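To decide which sections to pull out, it helps to see which ones are heaviest. The following is a minimal sketch, assuming the file uses `## ` headings for its sections and the same four-characters-per-token estimate; `section_sizes` is a hypothetical helper, not part of Claude Code.

```python
import re


def section_sizes(markdown: str, chars_per_token: float = 4.0) -> list[tuple[str, int]]:
    """Split on '## ' headings and estimate tokens per section, largest first."""
    # Zero-width split: each part starts at a line beginning with '## '.
    parts = re.split(r"(?m)^(?=## )", markdown)
    sizes = []
    for part in parts:
        if not part.strip():
            continue
        title = part.splitlines()[0].lstrip("# ").strip() or "(untitled)"
        sizes.append((title, int(len(part) / chars_per_token)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


sample = (
    "# Project Rules\ncore rule\n"
    "## MCP Servers\n" + "| server | purpose |\n" * 150 +
    "## GitNexus\nshort note\n"
)
for title, tokens in section_sizes(sample):
    print(f"{title:<16} ~{tokens:,} tokens")
```

Sections that dominate the list and are only needed for occasional tasks are the natural candidates to move into linked reference files.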

One real gotcha I hit while writing this. Claude Code's persistent memory file has a hard cap, and the project memory here is 27,475 bytes. My own session logged a warning that it had blown past the 24.4 KB limit and only part of it loaded. Oversized always-on files don't just cost tokens. Past a certain point they get silently truncated, so you're paying for a file the model can't even fully see. A reasonable target: keep the always-on CLAUDE.md under 2,000 tokens, and watch any memory file before it crosses its own ceiling.
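A small check like the one below can flag both problems before they bite. It is a sketch under stated assumptions: the 2,000-token target is the one suggested above, the byte ceiling is passed in by the caller (here 24,400, reading the logged 24.4 KB as decimal kilobytes, which is a guess), and `always_on_warnings` is a hypothetical name.

```python
from typing import Optional


def always_on_warnings(
    name: str,
    byte_count: int,
    token_target: int = 2_000,
    byte_ceiling: Optional[int] = None,
    chars_per_token: float = 4.0,
) -> list[str]:
    """Return human-readable warnings for an always-loaded file."""
    warnings = []
    tokens = int(byte_count / chars_per_token)
    if tokens > token_target:
        warnings.append(
            f"{name}: ~{tokens:,} tokens exceeds the {token_target:,}-token target"
        )
    if byte_ceiling is not None and byte_count > byte_ceiling:
        warnings.append(
            f"{name}: {byte_count:,} bytes exceeds the {byte_ceiling:,}-byte ceiling"
        )
    return warnings


print(always_on_warnings("CLAUDE.md", 12_578))
print(always_on_warnings("project memory", 27_475, byte_ceiling=24_400))
```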

Module 2: Pick the Right Model and Mode for the Work

2.1 Wrong Model for the Job: When Sonnet or Haiku Beats Opus

Default to Opus for everything and you're paying roughly 1.7 to 5 times what a large chunk of your work actually needs: about 1.7 times the Sonnet rate, and 5 times the Haiku rate.

The current Anthropic pricing (May 2026) makes the gap concrete:

| Model | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |

Opus output costs 5x what Haiku output costs. Sonnet sits in the middle at 3x the Haiku rate.
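To see what those rates mean for a whole session, here is a minimal cost sketch. The prices are copied from the table above; the session size (400,000 input tokens, 40,000 output tokens) is a made-up example, and the calculation ignores prompt-caching discounts.

```python
# USD per million tokens (input, output), copied from the pricing table above.
PRICES = {
    "opus": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD at list price, ignoring any caching discount."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical session size, chosen only for illustration.
for model in PRICES:
    print(f"{model:<7} ${session_cost(model, 400_000, 40_000):.2f}")
```

Because every model in the table charges five times as much for output as for input, the ratios between models stay the same whatever the input/output mix: Opus is about 1.67 times Sonnet and 5 times Haiku.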

Most coding sessions are a mix. Some genuinely hard reasoning, and a much larger pile of routine implementation, mechanical edits, lookups, and formatting. Push all of it through Opus and you're paying premium rates for work a smaller model handles just as well.

Route by complexity instead:

/model haiku    # mechanical: rename, format, lookup, simple shell, find-where-X-is
/model sonnet   # daily: implementation, refactors, debugging, code review
/model opus     # hard: architecture, multi-file design, cross-cutting refactors

A heuristic that holds up. If you can state the task in one sentence and the answer fits in three tool calls, Haiku is right. If the task involves design judgment or files you don't yet understand, Sonnet is the floor. Opus earns its rate only when reasoning depth changes the outcome.
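The heuristic can be written down as a tiny decision function. This is one possible encoding, not a Claude Code feature: the `Task` fields and `pick_model` are hypothetical, and the judgment calls (is this design work, does depth matter) still have to come from you.

```python
from dataclasses import dataclass


@dataclass
class Task:
    one_sentence: bool                  # can you state the task in one sentence?
    expected_tool_calls: int            # rough guess at tool calls needed
    needs_design_judgment: bool = False
    unfamiliar_files: bool = False
    reasoning_depth_matters: bool = False


def pick_model(task: Task) -> str:
    """Map the lesson's heuristic onto a model name."""
    if task.reasoning_depth_matters:
        return "opus"
    if task.needs_design_judgment or task.unfamiliar_files:
        return "sonnet"
    if task.one_sentence and task.expected_tool_calls <= 3:
        return "haiku"
    return "sonnet"  # default for everyday work


print(pick_model(Task(one_sentence=True, expected_tool_calls=2)))
print(pick_model(Task(one_sentence=False, expected_tool_calls=12, unfamiliar_files=True)))
```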

This compounds fast. A 30-minute session run entirely on Opus that could have run on Sonnet costs about 1.7 times what it should ($5 versus $3 per million input tokens, $25 versus $15 for output). Across a week, that's a real bill, not a rounding error.

The verdict: make Sonnet your default and reach up to Opus deliberately, not by habit. Reserve Haiku for the mechanical jobs it does just fine.

2.2 Skipping Plan Mode Burns Tokens on the Wrong Path

Press Shift+Tab to switch into plan mode before a big task (the key cycles through modes, so it may take more than one press). It's the cheapest insurance Claude Code offers.

Plan mode is the feature where the model writes an implementation plan before it touches any code, and you approve or revise that plan before execution starts. On complex work, the worst token waste isn't a verbose answer. It's a wrong-direction implementation you only catch after the model has written 800 lines, run a dozen tool calls, and burned 25,000 tokens. By the time you say "wait, wrong approach," the spend is gone. Fixing it usually costs another 25,000 in redirection, partial reverts, and re-explaining.

Watch the same task with and without a plan.

Without plan mode:

You: "Refactor the auth flow to use the new session helper"
Claude: [reads 12 files, edits 8, runs tests, finds the approach was wrong]
You: "No, the session helper is async - you need to await it"
Claude: [re-reads, undoes most edits, re-implements]
Total: ~50,000 tokens

With plan mode:

You: [Shift+Tab] "Refactor the auth flow to use the new session helper"
Claude: [reads 4 key files, produces a plan]
You: "The session helper is async, adjust the plan"
Claude: [adjusts plan]
You: "Approved"
Claude: [executes the correct plan]
Total: ~18,000 tokens

Same outcome. The plan itself is cheap, usually 1,000 to 3,000 output tokens. A wrong plan costs you 500 tokens of dialogue to correct, and the model hasn't written a line yet. Once the plan is right, execution is direct and the expensive flailing never happens.
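One way to reason about the trade-off is as an expected-cost comparison. The sketch below uses the two totals from the example above (about 50,000 tokens for the wrong path, about 18,000 with a plan). The 15,000-token figure for "got it right first time without a plan" is an assumption, derived by subtracting a 3,000-token plan from the planned total.

```python
def expected_cost_without_plan(p_wrong: float, direct_cost: int, wrong_path_cost: int) -> float:
    """Expected tokens when skipping the plan, given a chance of a wrong first attempt."""
    return p_wrong * wrong_path_cost + (1 - p_wrong) * direct_cost


def break_even_probability(plan_cost: int, direct_cost: int, wrong_path_cost: int) -> float:
    """Chance of a wrong first attempt at which planning and not planning cost the same."""
    return (plan_cost - direct_cost) / (wrong_path_cost - direct_cost)


PLAN_COST = 18_000        # from the example: total with plan mode
WRONG_PATH_COST = 50_000  # from the example: total without plan mode, wrong approach
DIRECT_COST = 15_000      # ASSUMPTION: right first time with no plan (18,000 minus a 3,000-token plan)

p = break_even_probability(PLAN_COST, DIRECT_COST, WRONG_PATH_COST)
print(f"break-even chance of a wrong first attempt: {p:.1%}")
print(expected_cost_without_plan(0.30, DIRECT_COST, WRONG_PATH_COST))
```

Under these assumptions the plan pays for itself once the chance of a wrong first attempt is above roughly 9%, which is a low bar for cross-cutting work.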

When does it pay off? When the task spans multiple files, when you suspect a non-obvious constraint, or when you'd struggle to spot a wrong answer until the model had already finished. For a single-file edit, skip it. For anything cross-cutting, use it every time.

Module 3: Manage the Context Lifecycle

3.1 Late `/compact` Is Worse Than No `/compact`

The worst time to compact is right before the model runs out of room. That's also when most people do it.

Claude Code auto-compacts a session (summarizing prior turns into a shorter form) when the context window nears its limit, at around 95% full. By the time that fires, the model has already been working inside a saturated context for a while, and quality has already slipped.

Wait for auto-compact and three things have already gone wrong. You've paid full token cost for a bloated context across several turns. The model has been making calls inside a context stuffed with stale exploration and dead ends. And the compact itself is now summarizing that low-quality recent reasoning, baking the mess into the summary.

Run /compact yourself at around the 60% mark instead, while the key information is clear and the conversation hasn't filled with clutter. The summary comes out cleaner, the session continues with sharper context, and you haven't spent the last several turns paying full freight for a context that was already saturated.

/context           # check fill
# If shown ~60% full and the current task is winding down:
/compact           # summarize now, before quality degrades

A practical trigger: every time you finish a discrete piece of work (a story, a bug fix, an investigation) and you're about to start something unrelated, that's a /compact moment. Often it's a /clear moment instead (Lesson 3.3). Either way, don't let a finished investigation sit in context dragging down the next task.
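The rule of thumb can be expressed as a small decision helper. It is only a sketch: the 60% and 95% thresholds come from this lesson, the 200,000-token window in the example is an assumption, and `compact_advice` is a hypothetical name rather than anything Claude Code exposes.

```python
def compact_advice(
    used_tokens: int,
    window_tokens: int,
    task_winding_down: bool,
    manual_threshold: float = 0.60,
    auto_threshold: float = 0.95,
) -> str:
    """Suggest what to do based on how full the context window is."""
    fill = used_tokens / window_tokens
    if fill >= auto_threshold:
        return "auto-compact imminent"  # too late for a clean summary
    if fill >= manual_threshold and task_winding_down:
        return "compact now"
    return "keep going"


# ASSUMPTION: a 200,000-token context window, used here only as an example.
print(compact_advice(120_000, 200_000, task_winding_down=True))
print(compact_advice(192_000, 200_000, task_winding_down=False))
```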

3.2 Sub-Agent Overhead: When Agent Calls Cost 7–10x

The Agent tool spawns a sub-agent that runs on its own and hands back a summary. It's a strong primitive. It is also one of the most expensive things you can do, roughly 7 to 10 times the cost of equivalent inline work.

Why so much? A sub-agent inherits none of the parent conversation's prompt cache. Every system prompt, every CLAUDE.md, every tool schema gets reprocessed at full cost on its first turn. Then it has to re-read all the context the parent already had loaded (project structure, the story spec, the neighboring code) before it can do anything. And its results come back through a final summarization turn that costs more on top of the parent's own continued work. A 15-minute job that runs ~5,000 tokens inline routinely runs 35,000 to 50,000 as a sub-agent for the same result.

So the rule of thumb: anything under about 15 minutes of work should execute inline. The 7-to-10x overhead only pays back when the task is big enough to amortize that fixed cost across many turns.

Spawn a sub-agent when:

• The work is genuinely independent and parallelizes with other long-running work.
• The story is large (30+ minutes, multiple files, deep research), so the overhead is a small fraction of the total.
• Context isolation is the actual goal: exploring an unfamiliar area without polluting the parent conversation.

Keep it inline when:

• It's a one-off lookup, a single-file edit, or a mechanical rename.
• The parent could finish it in one to three tool calls.
• The parent already has all the context loaded.

Reaching for Agent on a sub-15-minute task? The answer is almost always to do it inline.
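The criteria above can be folded into a quick check, together with the 7-to-10x overhead estimate. This is one interpretation of the lists, not a rule built into Claude Code; both function names and the way the conditions are ordered are assumptions.

```python
from typing import Optional


def subagent_token_range(inline_tokens: int) -> tuple[int, int]:
    """Apply the 7x to 10x overhead estimate to an inline token count."""
    return inline_tokens * 7, inline_tokens * 10


def should_spawn_subagent(
    est_minutes: float,
    parallel_independent: bool = False,
    isolation_is_goal: bool = False,
    est_tool_calls: Optional[int] = None,
) -> bool:
    """Return True when a sub-agent is likely worth its fixed overhead."""
    if isolation_is_goal:
        return True   # keeping the parent context clean is the point
    if est_tool_calls is not None and est_tool_calls <= 3:
        return False  # the parent can finish this in a few calls
    if est_minutes < 15:
        return False  # too small to amortize the overhead
    return est_minutes >= 30 or parallel_independent


print(subagent_token_range(5_000))
print(should_spawn_subagent(10), should_spawn_subagent(45))
```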

3.3 Context Hygiene: `/clear`, Batching, and Surgical File Refs

Three habits beat any single dramatic optimization. They cost nothing to adopt and they compound.

First, /clear between unrelated tasks. Compounding context across unrelated work is the number-one token waste in long sessions. Finish a bug fix, pivot to a feature, and the bug fix's exploration is still riding along in every prompt. Clear it.

/clear   # before starting an unrelated task

Second, batch multi-step work into one prompt. Three sequential messages cost about three times what one combined message costs, because the system prompt and context get reprocessed every turn. If the plan is "read these files, find the bug, propose a fix, then implement it," ask for all four at once.

# Wasteful, three turns
"Read auth/session.ts"
"Now find the bug"
"Now fix it"

# Efficient, one turn
"In auth/session.ts lines 30-90, identify the session-expiry bug
 and propose a fix. If the fix is small, implement it."
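A rough model shows where the "about three times" figure comes from. It assumes every turn re-sends a fixed overhead plus the conversation so far, and it ignores model replies and prompt caching (which discounts repeated prefixes). The 20,300-token overhead is the sum of the system prompt, both CLAUDE.md files, and the tool schemas from the illustrative `/context` breakdown in Lesson 1.1; the 20-token message sizes are made up.

```python
def input_tokens(fixed_overhead: int, message_sizes: list[int]) -> int:
    """Total input tokens processed when each turn re-sends overhead plus history."""
    total = 0
    history = 0
    for size in message_sizes:
        history += size                  # the new message joins the conversation
        total += fixed_overhead + history
    return total


FIXED = 4_200 + 6_800 + 1_400 + 7_900    # 20,300 tokens of always-on context

three_turns = input_tokens(FIXED, [20, 20, 20])  # hypothetical short messages
one_turn = input_tokens(FIXED, [60])             # same content, batched
print(three_turns, one_turn, round(three_turns / one_turn, 2))
```

With a large fixed overhead and short messages, the ratio sits just under three, because the overhead dominates each turn.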

Third, be surgical with file references. "Look through the auth code" triggers expensive multi-file exploration. An exact path and line range keeps the model on rails.

# Wasteful
"Look through the auth code for the bug"

# Surgical
"Compare src/auth/session.ts lines 30-90 with
 src/api/login.ts lines 10-60. There's a session
 lifecycle mismatch between them."

That's it. /clear, batch, point precisely. No new tooling, no config, no plugin. Adopt all three and a long session routinely runs 20 to 40% cheaper than the same work done sloppily.