
How to Save Tokens with Claude Code and Codex Without Slowing Your Team Down
App Web Dev Ltd
7 April 2026
Learn how to save tokens with Claude Code and Codex using cleaner prompts, tighter context, smarter model choice, and leaner coding workflows.
If you want to save tokens with Claude Code and Codex, the bad news is that there is no magic flag you can flip. The good news is that most token waste comes from boring workflow mistakes, not some mysterious pricing trick. Bloated context, repeated file reads, sprawling sessions, huge instruction files, and overpowered models for routine jobs are what usually blow the budget.
That matters more now because coding agents are no longer just toys for one-off code snippets. They sit inside real engineering workflows. They inspect repositories, run commands, read documentation, review pull requests, and keep multi-step sessions alive for hours. When that workflow is tight, the spend feels fair. When it is sloppy, costs spiral before anyone notices.
Anthropic's own Claude Code docs say the average cost is around $6 per developer per day, with 90% of users staying below $12 daily, but they also warn that context size is the main cost driver and that agent teams scale spend with each extra context window. OpenAI is moving Codex pricing toward direct token-based rates for applicable plans, which makes the economics even more visible. In plain English, the larger and messier the context, the more you pay.
So this is not a Claude-vs-Codex fanboy piece. It is a practical guide to where tokens really get burned, what to change in your habits, and when paying for a stronger model is actually worth it.
Where tokens get wasted in real coding-agent workflows
Most teams assume token usage is driven by how hard the task is. Sometimes that is true, but more often the real culprit is how much irrelevant material the agent has to carry around while doing the task.
Think about a typical session. You open a coding agent, paste a long objective, let it inspect half the repo, ask it to read several docs, run tests, explain a failure, propose a plan, revise the plan, then chat back and forth while it keeps the entire conversation alive. By the time you ask a simple follow-up like "rename that helper" or "fix the lint issue", the agent is dragging along a huge memory of everything that came before.
That is the hidden tax. Nimble Approach, a UK consultancy, published a case study after one developer accidentally used 66 million tokens in a single workflow. The big lesson was not that AI tools are reckless by default. It was that long sessions, excessive documentation loading, and tiny follow-up messages all inherit the cost of the whole conversation.
In practice, the biggest token leaks usually come from five places:
1. Carrying stale context between unrelated tasks
If you finish one feature and then start a different problem in the same session, you are paying to keep old context alive. Anthropic explicitly recommends clearing between tasks for exactly this reason.
2. Reading too much code too early
Agents are fast enough that they can feel cheap while exploring. Then you realise they read ten files when two would have done. If the task was narrow, that exploration was not free. It became part of the session's baseline.
3. Dumping whole documentation sets into context
This one is common with coding agents because developers love thoroughness. "Read the docs for all of this" sounds sensible, but it can be absurdly expensive if only one endpoint or one config block matters.
4. Using a premium model for low-judgement work
There is no glory in burning expensive reasoning tokens to rename variables, reformat tests, or make a small copy tweak. Strong models should be reserved for ambiguity, architecture, and decision-heavy work.
5. Letting verbose output flood the session
Large test outputs, long logs, giant diffs, and chatty status updates all eat context. Anthropic's docs specifically point to hooks, skills, and subagents as ways to filter or isolate that noise before it bloats the main thread.

The shared principles that save money in both Claude Code and Codex
Before splitting into tool-specific advice, it helps to get the universal rules clear. These work regardless of whether you prefer Claude Code, Codex, or a mixed workflow.
Keep the task scope brutally narrow
A cheap session starts with a tight brief. Instead of saying, "review this authentication flow and suggest improvements", say, "inspect src/auth/session.ts, find why expired sessions fail to refresh, patch it, and run the relevant tests only".
That does three useful things. It reduces initial exploration, it cuts the chance of irrelevant file reads, and it tells the agent what success looks like. Anthropic's best-practices guide makes the same point from another angle: precise instructions reduce corrections, and fewer corrections mean fewer tokens.
Plan when uncertainty is high, not by habit
Planning saves tokens when the change is fuzzy, cross-cutting, or risky. It wastes tokens when the task is obvious. The efficient pattern is explore first, create a short plan, then execute. The inefficient pattern is asking for a grand strategy document before changing a three-line bug.
This matters because planning itself consumes context. OpenAI's Codex prompting guide also warns against unnecessary status chatter and overhead. In other words, if the diff can be described in a sentence, stop turning it into a workshop.
Start fresh more often than feels comfortable
Developers are used to continuity. We like staying in one thread because it preserves momentum. With coding agents, continuity can be expensive. Once the session contains a lot of dead context, even a small request becomes more costly than starting a new, focused thread.
A good rule of thumb is simple: if the next task would require you to explain "ignore most of what we just did", you probably need a fresh session.
Filter before the model sees the noise
This is where mature workflows separate themselves from casual usage. Anthropic's cost guidance specifically recommends preprocessing with hooks and offloading high-volume tasks to skills or subagents. That is smart because tokens spent on irrelevant output are still tokens spent.
If a test suite prints 10,000 lines and only 20 matter, your job is not to ask the model to be patient. Your job is to stop feeding it the other 9,980.
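As a concrete sketch of that filtering step, you can trim a noisy log down to the lines that matter before anything reaches the session. The log format and the FAIL/ERROR markers here are assumptions about your test runner; adapt the patterns to whatever yours prints:

```shell
# Simulate a noisy test log; in practice this comes from your test runner.
printf 'PASS test_login\nFAIL test_refresh: expected 200 got 401\nPASS test_logout\nERROR setup: db unreachable\n' > test.log

# Keep only failures and errors -- this is what goes into the agent session,
# not the full log.
grep -E '^(FAIL|ERROR)' test.log > summary.txt

cat summary.txt
```

Two lines survive instead of the whole transcript, and the agent still has everything it needs to reason about the failure.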
Choose the cheapest model that still has enough judgement
This is probably the most underused habit. Developers often keep one model running for the whole workflow because context switching feels annoying. Financially, that is backwards.
Use a stronger model when the problem is genuinely hard: architecture trade-offs, deep debugging, messy legacy logic, or cases where the agent must infer intent from partial evidence. Use a cheaper or smaller model when the task is mechanical: generating tests, wiring boilerplate, renaming functions, drafting docs, or cleaning low-risk code.
OpenAI explicitly suggests switching to GPT-5.4-mini for routine Codex tasks to make usage limits last longer. Anthropic says much the same in different words: Sonnet is enough for most coding, while more expensive models should be reserved for harder reasoning.
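One lightweight way to make that habit stick is to encode the routing decision so the cheap tier is the default and the premium tier has to be earned. This is a sketch only: the task categories and model names are placeholders, not real CLI identifiers:

```shell
# Hypothetical helper: route a task type to a model tier.
# "mini" and "premium" are placeholders -- substitute your tools' real model names.
pick_model() {
  case "$1" in
    rename|format|tests|docs|boilerplate) echo "mini" ;;    # mechanical work
    *)                                    echo "premium" ;; # judgement-heavy work
  esac
}

pick_model docs      # prints: mini
pick_model debug     # prints: premium
```

The point is not the function itself but the default: anything not explicitly recognised as hard falls through to a deliberate choice, rather than premium-by-inertia.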
How to reduce token usage in Claude Code specifically
Claude Code is excellent when you want strong reasoning across a messy codebase, but it gets expensive fastest when you let the context sprawl.
Watch context like you would watch memory or CI minutes
Anthropic is blunt about this in its docs: context is the key resource. Claude Code even gives you native ways to inspect cost and context usage. If you are serious about saving money, checking usage cannot be an afterthought. It should be part of the normal workflow.
The mindset shift is important. Token usage is not a weird AI billing detail. It is now an engineering resource. You already watch build times, cloud spend, and test runtime. Context should sit in the same mental bucket.
Keep CLAUDE.md lean
One of Anthropic's best recommendations is moving specialised instructions out of CLAUDE.md and into on-demand skills. That is exactly right. If your main instruction file contains every database rule, deployment policy, review checklist, migration pattern, and team convention, you are paying for all of that even when the session only needs one tiny part.
Base instruction files should carry durable essentials only. Anything niche belongs in a more targeted layer that is loaded when needed.
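To make "durable essentials only" concrete, here is an illustrative shape for a lean CLAUDE.md. Every project detail in it is hypothetical; the structure is the point:

```markdown
# CLAUDE.md (illustrative -- all project details are hypothetical)

## Build and test
- Install: `pnpm install`
- Test one package: `pnpm test --filter <package>`

## Conventions
- TypeScript strict mode; no default exports
- One logical change per commit

<!-- Database rules, deployment policy, review checklists, and migration
     patterns live in on-demand skills, loaded only when a task needs them. -->
```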
Reduce MCP and tool overhead
This point gets missed because fancy tool integrations feel productive. Anthropic notes that MCP tool definitions and extra servers add to context overhead, even with deferred loading improvements. If a normal CLI tool will do the job, using that simpler path often costs less context.
That does not mean "never use MCP". It means stop collecting integrations like browser extensions. Every extra surface adds noise unless it is actively helping the current task.
Use compaction and resets deliberately
Compaction is useful, but it is not a license to keep one immortal session open all day. Summaries are still summaries. Once the session has wandered through enough unrelated work, the cleaner choice is often to rename it for reference, clear it, and start again.
If you do compact, tell the tool what matters. Anthropic supports custom compaction instructions for a reason. A summary that preserves the code change, failing test, and next action is much cheaper than one that tries to remember every branch of the conversation.
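A custom compaction instruction can be as short as this. The wording is yours to adapt, and treat the exact command form as an assumption about your setup:

```
/compact Keep only: the applied diff, the failing test name and error,
and the agreed next step. Drop exploration notes and rejected plans.
```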
Be careful with subagents and agent teams
Subagents can save the main session from noisy output, which is great. But they are not free. Anthropic's cost docs say agent teams roughly scale with the number of active teammates because each one gets its own context window, and plan-mode teams can use about seven times more tokens than a standard session.
So use them for isolation, not theatre. If a subagent can summarise documentation or process logs and return a tight answer, that is efficient. If you spawn a small army because it looks impressive, your finance sheet will notice before your users do.

How to reduce token usage in Codex specifically
Codex has its own flavour of waste. The tool is flexible, fast, and increasingly transparent about how usage limits and credit consumption work, which is good news if you are disciplined and bad news if you are not.
Treat AGENTS.md like a budget document
OpenAI's pricing guidance explicitly mentions reducing the size of AGENTS.md and nesting instructions where possible. That is not a minor note. It is one of the clearest ways to stop paying context tax on every interaction.
If your AGENTS.md reads like a novella, you are making every task carry a backpack full of rules it may not need. Split global rules from project-specific rules. Push specialised guidance closer to the directories that actually need it.
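Concretely, nesting might look like this, with directory names that are purely illustrative. Global rules stay tiny at the root, and area-specific rules only enter context when work actually happens in that area:

```
repo/
├── AGENTS.md              # global: build, test, and commit rules only
├── services/billing/
│   └── AGENTS.md          # billing invariants, local test commands
└── packages/ui/
    └── AGENTS.md          # component conventions, visual test notes
```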
Use mini models for boring work
OpenAI says GPT-5.4-mini can stretch usage limits by roughly 2.5x to 3.3x for routine tasks. That is a massive lever. It means a team that mindlessly uses a premium model for everything can burn through capacity several times faster than a team that distinguishes between judgement-heavy work and straightforward execution.
That distinction is what matters. Bug triage in a strange subsystem might justify a stronger model. Converting a group of tests, updating docs, or making a narrow refactor probably does not.
Remove prompt overhead that sounds helpful but is not
OpenAI's Codex prompting guide makes an interesting point that many teams ignore: unnecessary preambles, constant status narration, and forced up-front plans can hurt efficiency. Some overhead helps coordination, but too much of it becomes ceremonial token spend.
You see this most clearly in teams that make agents write a mini essay before every change. It feels safe. In reality, it often adds cost without improving outcomes. Ask for the plan when the plan itself is valuable. Otherwise, let the agent get on with the work.
Control the number of external systems in play
OpenAI also warns that every MCP server or similar integration adds context. The principle is identical to Claude Code even if the surface area differs. Each extra tool increases the amount the agent has to remember, choose between, and sometimes describe.
A lean setup almost always wins over a maximal one.
When Claude Code is worth the spend, and when Codex or a cheaper model is enough
This is the part people usually want reduced to a slogan. That would be a mistake.
Claude Code is worth paying for when the work is ambiguous, the repo is unfamiliar, and the value of better reasoning is higher than the value of cheap throughput. Think architectural decisions, ugly debugging, risky migrations, or cases where the real job is understanding intent before touching code.
Codex, or a smaller model inside Codex, is often the better bargain when the problem is already framed. Once the path is clear, you do not need constant premium reasoning. You need competent execution, sensible edits, and enough intelligence to verify the result.
A lot of teams would save money by splitting the workflow like this:
- Use the stronger model to scope the problem.
- Freeze the plan while the context is still clean.
- Switch to the cheaper model for the implementation grind.
- Bring the stronger model back only if the task gets weird again.
That is the balanced approach. Not loyalty to one tool, but model arbitrage based on the kind of judgement required.
This also matches how experienced engineers work. You do not use senior-level attention for every tiny task all day. You spend it where uncertainty is expensive.

A practical low-cost workflow you can actually use
If you want one workflow to steal and adapt, use this:
Step 1: Frame the problem narrowly
Name the exact file, bug, output, or acceptance criteria. Do not ask the agent to explore the whole codebase unless that exploration is genuinely necessary.
Step 2: Limit discovery
Ask it to inspect only the most relevant paths first. If more files are needed, it can branch out from there. Make it earn the wider context.
Step 3: Decide whether planning is worth it
For a broad or risky task, get a short plan. For an obvious task, skip the planning ceremony.
Step 4: Keep logs and docs filtered
Do not pipe giant outputs into the main session unless the volume itself is the point. Summarise first. Grep first. Trim first.
Step 5: Switch models on purpose
Use the premium model to understand and de-risk. Use the cheaper one to execute routine work.
Step 6: End the session before it goes stale
When the job is done, close it out. Do not let yesterday's migration tax tomorrow's bugfix.
That is not glamorous advice, but it is what works.
Final thought
The biggest mistake teams make with AI coding tools is treating token spend as something that happens to them. It does not. It is a product of workflow design.
Claude Code and Codex can both be cost-effective. They can also both become expensive fast if you let context bloat, over-read the repo, flood the session with output, and use the strongest model for tasks that barely need judgement.
The cheapest token is still the one you never send. Keep the task tight, keep the context clean, and be honest about when you need brilliance versus when you just need a competent pair of hands.
If your team wants help designing leaner AI coding workflows, or you want to build agent-driven tooling without letting costs drift out of control, talk to us at appwebdev.co.uk. We help businesses turn AI from an expensive experiment into something operational, measurable, and actually useful.
About App Web Dev Ltd
UK-based AI agency specialising in business automation and intelligent AI solutions