Cost Optimization
Optimize model selection, caching, and token management for efficiency.
What You’ll Learn
Using AI coding tools at scale has real cost implications. Every API call consumes tokens, and tokens cost money. But cost optimization is not about using the cheapest model for everything --- it is about using the right model for each task.
By the end, you’ll understand:
- What drives cost in AI-assisted coding
- How to select the right model for each task
- The model cascade pattern
- Token management and context efficiency
- Caching strategies that reduce redundant work
The Problem
A developer who uses the most capable model for every task --- including simple file searches, listing directories, and reading configuration --- is overspending dramatically.
Typical session breakdown:
5 file searches (simple) ← Haiku could do this
3 code explorations (moderate) ← Sonnet is ideal
1 architecture decision (complex) ← Opus is worth it
10 small edits (simple) ← Sonnet handles fine
2 test investigations (moderate) ← Sonnet is ideal
All 21 ops with Opus → expensive
5 Haiku + 15 Sonnet + 1 Opus → same quality, fraction of cost
The difference compounds over days and weeks across hundreds of sessions.
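A quick back-of-envelope sketch of that gap, using made-up per-operation token counts and per-million-token prices (the tier names are real, but every number here is a placeholder, not a published rate; plug in your provider's current pricing):

```python
# Hypothetical pricing and context sizes -- placeholders, not real rates.
PRICE_PER_MTOK = {"haiku": 1.0, "sonnet": 5.0, "opus": 25.0}  # $ per 1M input tokens (assumed)
TOKENS_PER_OP = 10_000  # rough average context sent per operation (assumed)

def session_cost(ops_by_model: dict[str, int]) -> float:
    """Total session cost in dollars, given {model: operation_count}."""
    return sum(
        count * TOKENS_PER_OP * PRICE_PER_MTOK[model] / 1_000_000
        for model, count in ops_by_model.items()
    )

all_opus = session_cost({"opus": 21})
mixed = session_cost({"haiku": 5, "sonnet": 15, "opus": 1})
print(f"all Opus: ${all_opus:.2f}  mixed: ${mixed:.2f}")
```

With these placeholder rates the mixed session costs a fifth of the all-Opus one; the exact ratio depends on real pricing, but the shape of the saving is the same.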
How It Works
Understanding Cost Drivers
┌──────────────────────────────────────────────────┐
│ Cost Drivers │
│ │
│ 1. INPUT TOKENS │
│ System prompt, CLAUDE.md, conversation │
│ history, file contents, tool outputs │
│ │
│ 2. OUTPUT TOKENS │
│ Reasoning, responses, tool call arguments, │
│ generated code │
│ │
│ 3. MODEL CHOICE │
│ Haiku (lowest) → Sonnet (moderate) → Opus │
│ │
│ 4. NUMBER OF API CALLS │
│ Each tool use = another round trip │
│ Each call re-sends the full context │
│ │
└──────────────────────────────────────────────────┘
The critical detail: input tokens are sent on every API call. If your context is 50,000 tokens and the agent makes 20 tool calls, that is 50,000 tokens sent 20 times. A lean context pays dividends on every single call.
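The arithmetic can be sketched directly. The one simplifying assumption is that the context stays the same size on every call; in reality it grows as tool results are appended, which only makes the multiplier worse:

```python
# Cumulative input tokens when the same context is re-sent on every call.
def total_input_tokens(context_tokens: int, calls: int) -> int:
    # Simplification: context size held flat; in practice it grows per call.
    return context_tokens * calls

print(total_input_tokens(50_000, 20))  # 1,000,000 input tokens
print(total_input_tokens(20_000, 20))  # 400,000 -- leaner context, same work
```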
Model Selection Strategy
┌─────────────────────────────────────────────────┐
│ Model Selection Guide │
│ │
│ HAIKU (fast, lowest cost) │
│ ├── File search and exploration │
│ ├── Listing directory contents │
│ ├── Simple text transformations │
│ ├── Reading and summarizing files │
│ └── Pattern matching across codebase │
│ │
│ SONNET (balanced, moderate cost) │
│ ├── Writing and modifying code │
│ ├── Debugging and fixing errors │
│ ├── Refactoring, writing tests │
│ └── Most day-to-day coding tasks │
│ │
│ OPUS (powerful, highest cost) │
│ ├── Architecture decisions │
│ ├── Complex multi-file refactors │
│ ├── Resolving subtle bugs │
│ └── Critical code that must be correct │
│ │
└─────────────────────────────────────────────────┘
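The guide above can be encoded as a simple lookup. The category names and tier labels here are illustrative, not an official API; substitute whatever model identifiers your provider documents:

```python
# Pre-assigned model per task category -- names are illustrative assumptions.
MODEL_FOR_CATEGORY = {
    "search": "haiku",       # file search, listing, summarizing
    "transform": "haiku",    # simple text transformations
    "code": "sonnet",        # writing, debugging, refactoring, tests
    "architecture": "opus",  # design decisions, subtle bugs, critical code
}

def pick_model(category: str) -> str:
    # Default to the balanced tier when a task fits no known category.
    return MODEL_FOR_CATEGORY.get(category, "sonnet")

print(pick_model("search"))  # haiku
```

Defaulting unknown categories to the middle tier is a deliberate choice: it bounds both the cost of over-assigning and the quality risk of under-assigning.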
The Model Cascade Pattern
Start with the cheapest model, escalate only when needed:
┌──────────────────────────────────────────────────┐
│ Model Cascade │
│ │
│ Task arrives │
│ │ │
│ ▼ │
│ Try with HAIKU │
│ ├── Quality OK? → Done (lowest cost) │
│ │ │
│ ▼ │
│ Escalate to SONNET │
│ ├── Quality OK? → Done (moderate cost) │
│ │ │
│ ▼ │
│ Escalate to OPUS → Done (highest quality) │
│ │
└──────────────────────────────────────────────────┘
In practice, you pre-assign models based on task category rather than cascading every individual task. The cascade is a mental model for deciding which tasks deserve which model.
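A minimal sketch of the cascade, assuming you supply `run` (backed by a real API call) and `good_enough` (a real quality check such as passing tests or a rubric); both are placeholders, not library functions:

```python
from typing import Callable

# Cheapest tier first; the names are illustrative assumptions.
TIERS = ["haiku", "sonnet", "opus"]

def cascade(task: str,
            run: Callable[[str, str], str],
            good_enough: Callable[[str], bool]) -> tuple[str, str]:
    """Return (model_used, result), escalating through TIERS until the
    quality check passes or the top tier is reached."""
    for model in TIERS:
        result = run(model, task)
        if good_enough(result) or model == TIERS[-1]:
            return model, result
    raise RuntimeError("unreachable")
```

The top tier always returns, pass or fail: at that point there is nothing left to escalate to, and the check becomes a signal for human review rather than another retry.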
Token Management
Keeping context lean reduces the cost of every subsequent API call:
1. TARGETED FILE READS
Bad: "Read all files in src/" → 50 files, 40K tokens
Good: "Read src/auth/middleware.ts" → 1 file, 800 tokens
2. SPECIFIC SEARCHES
Bad: Search "config" (200 hits)
Good: Search "databaseConfig" in src/db/ (3 hits)
3. SUBAGENT ISOLATION
Bad: Explore in main context → 20 reads in history
Good: Spawn explore subagent → only summary returned
4. PROACTIVE COMPACTION
Bad: Let context fill to 95%
Good: Use /compact when switching tasks
Caching Strategies
System Prompt Caching:
First call: 3000 tokens → CACHE MISS → full rate
Later calls: 3000 tokens → CACHE HIT → reduced rate
Avoid Re-Reading:
Read a file once and rely on the copy already in the conversation history; reading it again re-sends the same content as wasted tokens.
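The same habit can be sketched as a small cache. This illustrates the pattern, not a feature of any particular tool:

```python
# First read hits disk (and, in an agent loop, would add tokens);
# repeat reads return the stored copy.
_read_cache: dict[str, str] = {}

def read_cached(path: str) -> str:
    if path not in _read_cache:
        with open(path, encoding="utf-8") as f:
            _read_cache[path] = f.read()
    return _read_cache[path]
```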
Batch Related Reads:
Bad: Read A, B, C sequentially (3 round trips)
Good: Read A, B, C in parallel (1 round trip)
Lean CLAUDE.md:
Bloated: 5000 tokens × every API call
Lean: 800 tokens × every API call
Savings: 4200 tokens × every call in session
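Scaled over a session, the saving from that trim is just multiplication; the 30-call session length is an assumed figure:

```python
# Tokens saved by trimming CLAUDE.md, re-sent on every call.
bloated, lean, calls = 5_000, 800, 30  # calls-per-session is an assumption
saved = (bloated - lean) * calls
print(saved)  # 126,000 tokens saved over the session
```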
Subagent Model Selection
Assigning a model per subagent is a single decision that can reduce costs by 50-70% in exploration-heavy workflows:
Exploration → Haiku ("Find all test files", "List API routes")
Implementation → Sonnet ("Write unit test", "Refactor to async/await")
Architecture → Opus ("Design data model", "Evaluate trade-offs")
Key Insight
Cost optimization is not about using the cheapest model. It is about using the right model for each task, which often means a mix of models at different price points.
A workflow using Haiku for search, Sonnet for coding, and Opus for architecture produces the same quality output as using Opus for everything --- at a fraction of the cost.
The second insight: context size is a multiplier. Because the full context is sent on every API call, reducing context from 50K to 20K tokens saves 30K tokens on every subsequent call. Over a 30-call session, that is 900K tokens saved.
This is why the patterns from earlier sessions --- subagent isolation (Session 4), context compaction (Session 6), targeted tool use --- are not just about quality. They are directly about cost.
Hands-On Example
Configuring a Cost-Optimized Workflow
Step 1: Categorize your tasks
Feature: Add search functionality
Exploration (Haiku):
- Find existing search-related code
- List database tables
- Check package.json for search libraries
Implementation (Sonnet):
- Write search API endpoint
- Create search index
- Build search UI component, write tests
Architecture (Opus):
- Decide: full-text vs fuzzy match vs external service
- Design search index schema
Step 2: Keep CLAUDE.md lean
# Expensive (2000 tokens):
This project is a web application built with React 18,
TypeScript 5.3, Tailwind CSS 3.4, Prisma ORM 5.x...
[extensive descriptions, standards, deployment procedures]
# Lean (400 tokens):
Tech: React 18, TypeScript, Tailwind, Prisma, PostgreSQL
Style: Prettier defaults, no semicolons
Test: Vitest, run with "pnpm test"
Build: "pnpm build", deploys via GitHub Actions
Key dirs: src/api/, src/components/, src/db/
Step 3: Know the cost-quality tradeoff
SAVE money on: INVEST money on:
├── File search ├── Security-critical code
├── Directory listing ├── Data migration logic
├── Running commands ├── Architecture decisions
├── Boilerplate generation ├── Complex debugging
└── Repetitive edits └── API design
Rule of thumb: if getting it wrong costs more than getting it right, use the best model. If the task is mechanical and easily verified, use the cheapest.
What Changed
| Unoptimized | Cost-Optimized |
|---|---|
| One model for everything | Right model per task |
| Large CLAUDE.md (2000+ tokens) | Lean CLAUDE.md (< 500 tokens) |
| Read all files in directory | Targeted, specific reads |
| Explore in main context | Explore via Haiku subagents |
| Context fills up, re-sent bloated | Context kept lean, lower per-call cost |
| No awareness of token cost | Conscious model and context decisions |
Next Session
Session 22 covers Human-in-the-Loop --- how to design workflows where the AI operates autonomously on routine tasks but pauses for human approval on critical decisions, balancing speed and safety.