Cost Optimization
Optimize model selection, caching, and token management for efficiency.
What You’ll Learn
Using AI coding tools at scale has real cost implications. Every API call consumes tokens, and tokens cost money. But cost optimization is not about using the cheapest model for everything --- it is about using the right model for each task.
By the end, you’ll understand:
- What drives cost in AI-assisted coding
- How to select the right model for each task
- The model cascade pattern
- Token management and context efficiency
- Caching strategies that reduce redundant work
The Problem
A developer who uses the most capable model for every task --- including simple file searches, listing directories, and reading configuration --- is overspending dramatically.
Typical session breakdown:
5 file searches (simple) ← Haiku could do this
3 code explorations (moderate) ← Sonnet is ideal
1 architecture decision (complex) ← Opus is worth it
10 small edits (simple) ← Sonnet handles fine
2 test investigations (moderate) ← Sonnet is ideal
All 21 ops with Opus → expensive
5 Haiku + 15 Sonnet + 1 Opus → same quality, fraction of cost
The difference compounds over days and weeks across hundreds of sessions.
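A quick back-of-envelope sketch of that gap, using made-up per-operation token counts and per-million-token prices (the tier names are real, but every number here is a placeholder, not a published rate; plug in your provider's current pricing):

```python
# Hypothetical pricing and context sizes -- placeholders, not real rates.
PRICE_PER_MTOK = {"haiku": 1.0, "sonnet": 5.0, "opus": 25.0}  # $ per 1M input tokens (assumed)
TOKENS_PER_OP = 10_000  # rough average context sent per operation (assumed)

def session_cost(ops_by_model: dict[str, int]) -> float:
    """Total session cost in dollars, given {model: operation_count}."""
    return sum(
        count * TOKENS_PER_OP * PRICE_PER_MTOK[model] / 1_000_000
        for model, count in ops_by_model.items()
    )

all_opus = session_cost({"opus": 21})
mixed = session_cost({"haiku": 5, "sonnet": 15, "opus": 1})
print(f"all Opus: ${all_opus:.2f}  mixed: ${mixed:.2f}")
```

With these placeholder rates the mixed session costs a fifth of the all-Opus one; the exact ratio depends on real pricing, but the shape of the saving is the same.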
How It Works
Understanding Cost Drivers
┌──────────────────────────────────────────────────┐
│ Cost Drivers │
│ │
│ 1. INPUT TOKENS │
│ System prompt, CLAUDE.md, conversation │
│ history, file contents, tool outputs │
│ │
│ 2. OUTPUT TOKENS │
│ Reasoning, responses, tool call arguments, │
│ generated code │
│ │
│ 3. MODEL CHOICE │
│ Haiku (lowest) → Sonnet (moderate) → Opus │
│ │
│ 4. NUMBER OF API CALLS │
│ Each tool use = another round trip │
│ Each call re-sends the full context │
│ │
└──────────────────────────────────────────────────┘
The critical detail: input tokens are sent on every API call. If your context is 50,000 tokens and the agent makes 20 tool calls, that is 50,000 tokens sent 20 times. A lean context pays dividends on every single call.
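The arithmetic can be sketched directly. The one simplifying assumption is that the context stays the same size on every call; in reality it grows as tool results are appended, which only makes the multiplier worse:

```python
# Cumulative input tokens when the same context is re-sent on every call.
def total_input_tokens(context_tokens: int, calls: int) -> int:
    # Simplification: context size held flat; in practice it grows per call.
    return context_tokens * calls

print(total_input_tokens(50_000, 20))  # 1,000,000 input tokens
print(total_input_tokens(20_000, 20))  # 400,000 -- leaner context, same work
```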
Model Selection Strategy
┌─────────────────────────────────────────────────┐
│ Model Selection Guide │
│ │
│ HAIKU (fast, lowest cost) │
│ ├── File search and exploration │
│ ├── Listing directory contents │
│ ├── Simple text transformations │
│ ├── Reading and summarizing files │
│ └── Pattern matching across codebase │
│ │
│ SONNET (balanced, moderate cost) │
│ ├── Writing and modifying code │
│ ├── Debugging and fixing errors │
│ ├── Refactoring, writing tests │
│ └── Most day-to-day coding tasks │
│ │
│ OPUS (powerful, highest cost) │
│ ├── Architecture decisions │
│ ├── Complex multi-file refactors │
│ ├── Resolving subtle bugs │
│ └── Critical code that must be correct │
│ │
└─────────────────────────────────────────────────┘
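The guide above can be encoded as a simple lookup. The category names and tier labels here are illustrative, not an official API; substitute whatever model identifiers your provider documents:

```python
# Pre-assigned model per task category -- names are illustrative assumptions.
MODEL_FOR_CATEGORY = {
    "search": "haiku",       # file search, listing, summarizing
    "transform": "haiku",    # simple text transformations
    "code": "sonnet",        # writing, debugging, refactoring, tests
    "architecture": "opus",  # design decisions, subtle bugs, critical code
}

def pick_model(category: str) -> str:
    # Default to the balanced tier when a task fits no known category.
    return MODEL_FOR_CATEGORY.get(category, "sonnet")

print(pick_model("search"))  # haiku
```

Defaulting unknown categories to the middle tier is a deliberate choice: it bounds both the cost of over-assigning and the quality risk of under-assigning.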
The Model Cascade Pattern
Start with the cheapest model, escalate only when needed:
┌──────────────────────────────────────────────────┐
│ Model Cascade │
│ │
│ Task arrives │
│ │ │
│ ▼ │
│ Try with HAIKU │
│ ├── Quality OK? → Done (lowest cost) │
│ │ │
│ ▼ │
│ Escalate to SONNET │
│ ├── Quality OK? → Done (moderate cost) │
│ │ │
│ ▼ │
│ Escalate to OPUS → Done (highest quality) │
│ │
└──────────────────────────────────────────────────┘
In practice, you pre-assign models based on task category rather than cascading every individual task. The cascade is a mental model for deciding which tasks deserve which model.
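A minimal sketch of the cascade, assuming you supply `run` (backed by a real API call) and `good_enough` (a real quality check such as passing tests or a rubric); both are placeholders, not library functions:

```python
from typing import Callable

# Cheapest tier first; the names are illustrative assumptions.
TIERS = ["haiku", "sonnet", "opus"]

def cascade(task: str,
            run: Callable[[str, str], str],
            good_enough: Callable[[str], bool]) -> tuple[str, str]:
    """Return (model_used, result), escalating through TIERS until the
    quality check passes or the top tier is reached."""
    for model in TIERS:
        result = run(model, task)
        if good_enough(result) or model == TIERS[-1]:
            return model, result
    raise RuntimeError("unreachable")
```

The top tier always returns, pass or fail: at that point there is nothing left to escalate to, and the check becomes a signal for human review rather than another retry.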
Token Management
Keeping context lean reduces the cost of every subsequent API call:
1. TARGETED FILE READS
Bad: "Read all files in src/" → 50 files, 40K tokens
Good: "Read src/auth/middleware.ts" → 1 file, 800 tokens
2. SPECIFIC SEARCHES
Bad: Search "config" (200 hits)
Good: Search "databaseConfig" in src/db/ (3 hits)
3. SUBAGENT ISOLATION
Bad: Explore in main context → 20 reads in history
Good: Spawn explore subagent → only summary returned
4. PROACTIVE COMPACTION
Bad: Let context fill to 95%
Good: Use /compact when switching tasks
Caching Strategies
System Prompt Caching:
First call: 3000 tokens → CACHE MISS → full rate
Later calls: 3000 tokens → CACHE HIT → reduced rate
Avoid Re-Reading:
Read a file once and rely on the copy already in the conversation history; reading it again re-sends the same content as wasted tokens.
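The same habit can be sketched as a small cache. This illustrates the pattern, not a feature of any particular tool:

```python
# First read hits disk (and, in an agent loop, would add tokens);
# repeat reads return the stored copy.
_read_cache: dict[str, str] = {}

def read_cached(path: str) -> str:
    if path not in _read_cache:
        with open(path, encoding="utf-8") as f:
            _read_cache[path] = f.read()
    return _read_cache[path]
```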
Batch Related Reads:
Bad: Read A, B, C sequentially (3 round trips)
Good: Read A, B, C in parallel (1 round trip)
Lean CLAUDE.md:
Bloated: 5000 tokens × every API call
Lean: 800 tokens × every API call
Savings: 4200 tokens × every call in session
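Scaled over a session, the saving from that trim is just multiplication; the 30-call session length is an assumed figure:

```python
# Tokens saved by trimming CLAUDE.md, re-sent on every call.
bloated, lean, calls = 5_000, 800, 30  # calls-per-session is an assumption
saved = (bloated - lean) * calls
print(saved)  # 126,000 tokens saved over the session
```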
Subagent Model Selection
Assigning a model per subagent is a single decision that can reduce costs by 50-70% in exploration-heavy workflows:
Exploration → Haiku ("Find all test files", "List API routes")
Implementation → Sonnet ("Write unit test", "Refactor to async/await")
Architecture → Opus ("Design data model", "Evaluate trade-offs")
Key Insight
Cost optimization is not about using the cheapest model. It is about using the right model for each task, which often means a mix of models at different price points.
A workflow using Haiku for search, Sonnet for coding, and Opus for architecture produces the same quality output as using Opus for everything --- at a fraction of the cost.
The second insight: context size is a multiplier. Because the full context is sent on every API call, reducing context from 50K to 20K tokens saves 30K tokens on every subsequent call. Over a 30-call session, that is 900K tokens saved.
This is why the patterns from earlier sessions --- subagent isolation (Session 4), context compaction (Session 6), targeted tool use --- are not just about quality. They are directly about cost.
Hands-On Example
Configuring a Cost-Optimized Workflow
Step 1: Categorize your tasks
Feature: Add search functionality
Exploration (Haiku):
- Find existing search-related code
- List database tables
- Check package.json for search libraries
Implementation (Sonnet):
- Write search API endpoint
- Create search index
- Build search UI component, write tests
Architecture (Opus):
- Decide: full-text vs fuzzy match vs external service
- Design search index schema
Step 2: Keep CLAUDE.md lean
# Expensive (2000 tokens):
This project is a web application built with React 18,
TypeScript 5.3, Tailwind CSS 3.4, Prisma ORM 5.x...
[extensive descriptions, standards, deployment procedures]
# Lean (400 tokens):
Tech: React 18, TypeScript, Tailwind, Prisma, PostgreSQL
Style: Prettier defaults, no semicolons
Test: Vitest, run with "pnpm test"
Build: "pnpm build", deploys via GitHub Actions
Key dirs: src/api/, src/components/, src/db/
Step 3: Know the cost-quality tradeoff
SAVE money on: INVEST money on:
├── File search ├── Security-critical code
├── Directory listing ├── Data migration logic
├── Running commands ├── Architecture decisions
├── Boilerplate generation ├── Complex debugging
└── Repetitive edits └── API design
Rule of thumb: if getting it wrong costs more than getting it right, use the best model. If the task is mechanical and easily verified, use the cheapest.
What Changed
| Unoptimized | Cost-Optimized |
|---|---|
| One model for everything | Right model per task |
| Large CLAUDE.md (2000+ tokens) | Lean CLAUDE.md (< 500 tokens) |
| Read all files in directory | Targeted, specific reads |
| Explore in main context | Explore via Haiku subagents |
| Context fills up, re-sent bloated | Context kept lean, lower per-call cost |
| No awareness of token cost | Conscious model and context decisions |
Next Session
Session 22 covers Human-in-the-Loop --- how to design workflows where the AI operates autonomously on routine tasks but pauses for human approval on critical decisions, balancing speed and safety.