Error Recovery
Build retry logic, error classification, and fallback strategies.
What You’ll Learn
Errors are inevitable in agent workflows. APIs return 429s. Builds fail. Tests break. Files are missing. What separates a fragile script from a resilient agent is how it responds to failure --- not whether it encounters it.
By the end, you’ll understand:
- How to classify errors into actionable categories
- The retry pattern with exponential backoff
- Fallback strategies when the primary approach fails
- How Claude Code turns errors into information
- Common failure modes and their recovery patterns
The Problem
A naive agent treats every error the same way: it stops. But not all errors are equal.
Error: ENOENT: no such file or directory ← Wrong path, fixable
Error: 429 Too Many Requests ← Temporary, just wait
Error: Cannot find module 'express' ← Missing dependency, install it
Error: SyntaxError: Unexpected token ← Code bug, must fix
Error: EPERM: operation not permitted ← Needs user action
Each requires a completely different response. The key is classification: understanding what kind of error you are dealing with before deciding what to do.
How It Works
Error Classification
For recovery purposes, nearly every error fits one of three categories:
┌──────────────────────────────────────────────────┐
│ Error Classification │
│ │
│ TRANSIENT → Retry with backoff │
│ │ Rate limits, network errors, timeouts, │
│ │ server 5xx. Will succeed if you wait. │
│ │ │
│ PERMANENT → Change approach │
│ │ File not found, syntax error, type mismatch, │
│ │ missing module. Retrying won't help. │
│ │ │
│ USER-ACTIONABLE → Ask the user │
│ Permission errors, auth required, config │
│ missing. The agent cannot resolve alone. │
│ │
└──────────────────────────────────────────────────┘
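The three categories above can be sketched as a small classifier. This is a minimal illustration, not how Claude Code classifies errors internally: the `ErrorKind` enum, the `classify_error` function, and the marker strings are all assumptions chosen to match the examples in this session. Substring matching is deliberately crude (a message that merely contains "429" would be treated as a rate limit), so a real system would more likely inspect status codes or exception types.

```python
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"              # retry with backoff
    PERMANENT = "permanent"              # change approach
    USER_ACTIONABLE = "user-actionable"  # ask the user


# Illustrative substring markers, checked in order; not an exhaustive list.
_MARKERS = [
    (ErrorKind.USER_ACTIONABLE, ("eperm", "eacces", "permission denied", "401", "403")),
    (ErrorKind.TRANSIENT, ("429", "timeout", "timed out", "econnreset", "502", "503")),
]


def classify_error(message: str) -> ErrorKind:
    """Map raw error text to a category using simple substring checks."""
    text = message.lower()
    for kind, markers in _MARKERS:
        if any(marker in text for marker in markers):
            return kind
    # Anything unrecognized is treated as permanent, so it is never retried blindly.
    return ErrorKind.PERMANENT
```

Defaulting to PERMANENT is a conservative choice: an unknown error leads to a change of approach rather than a retry loop.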
The Retry Pattern
For transient errors, use exponential backoff:
Attempt 1: Try → Fails (429) → Wait 1s
Attempt 2: Try → Fails (429) → Wait 2s
Attempt 3: Try → Fails (429) → Wait 4s
Attempt 4: Try → Succeeds!
Max retries: 3-5 for most operations
Max wait: cap at 30-60 seconds
Doubling the wait after each failure avoids hammering a rate-limited API
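The retry schedule above can be sketched in a few lines. The names here (`TransientError`, `retry_with_backoff`, and its parameters) are hypothetical and exist only for this example. The `sleep` argument is injectable so the schedule can be checked without actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (429, timeout, 5xx)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, waiting 1s, 2s, 4s, ... between failed attempts."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller escalate
            # Double the wait each time, but never exceed the cap.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable when max_retries >= 0")
```

With the defaults, an operation that fails three times and then succeeds waits 1, 2, and 4 seconds, matching the trace above. Only `TransientError` is caught, so permanent errors propagate immediately instead of being retried.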
Fallback Strategies
When the primary approach fails permanently, try alternatives:
┌──────────────────────────────────────────────────┐
│ Fallback Chain │
│ │
│ Primary: npm install express │
│ │ FAILS (npm not found) │
│ ▼ │
│ Fallback 1: yarn add express │
│ │ FAILS (yarn not found) │
│ ▼ │
│ Fallback 2: pnpm add express │
│ │ SUCCEEDS → Continue with pnpm │
│ ▼ │
│ All failed → Report what was tried │
│ │
└──────────────────────────────────────────────────┘
Fallbacks work at every level:
File Reading: Try config.ts → config.js → config.json → search
Build Commands: Try pnpm build → npm run build → yarn build → read package.json
Test Runners: Try pnpm test → npx jest → npx vitest → find config
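A fallback chain like the ones above can be sketched as an ordered list of named strategies. `run_with_fallbacks` and `AllFallbacksFailed` are hypothetical names for this illustration; each strategy is any zero-argument callable that raises on failure.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllFallbacksFailed(Exception):
    """Raised when every strategy fails; carries a record of what was tried."""

    def __init__(self, attempts: List[Tuple[str, str]]):
        self.attempts = attempts
        summary = "; ".join(f"{name}: {error}" for name, error in attempts)
        super().__init__(f"all strategies failed ({summary})")


def run_with_fallbacks(
    strategies: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (name, strategy) in order and return the first success."""
    attempts: List[Tuple[str, str]] = []
    for name, strategy in strategies:
        try:
            return name, strategy()
        except Exception as error:
            # Record the failure so the final report shows what was tried.
            attempts.append((name, str(error)))
    raise AllFallbacksFailed(attempts)
```

Returning the name of the strategy that worked lets the caller keep using it (for example, continuing with pnpm for later commands), and the exception preserves the "report what was tried" step from the diagram.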
How Claude Code Handles Tool Errors
When a tool call fails, the error text is returned to the model as a tool_result marked is_error: true. The model reads it and adapts:
┌──────────────────────────────────────────────────┐
│ Error as Tool Result │
│ │
│ AI calls: Bash("cat src/auth.ts") │
│ │
│ Tool returns: │
│ "cat: src/auth.ts: No such file or directory" │
│ is_error: true │
│ │
│ AI reasons: "Wrong path. Let me search." │
│ │
│ AI calls: Grep(pattern="auth", glob="*.ts") │
│ │
│ Tool returns: │
│ "src/middleware/authenticate.ts" │
│ "src/lib/auth-utils.ts" │
│ │
│ AI adapts: reads the correct file │
│ │
└──────────────────────────────────────────────────┘
This is the “error as information” principle. The error is not a stop signal --- it is data that helps the AI make a better next decision.
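To make the principle concrete, here is a minimal sketch of a tool wrapper that never raises and always hands the outcome back as data. The dictionary fields follow the `tool_result` content block of the Anthropic Messages API (`type`, `tool_use_id`, `content`, `is_error`). The `run_bash_tool` function itself is hypothetical and is not Claude Code's actual implementation.

```python
import subprocess


def run_bash_tool(tool_use_id: str, command: str, timeout: float = 30.0) -> dict:
    """Run a shell command and package the outcome as a tool_result block."""
    try:
        completed = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        failed = completed.returncode != 0
        # Keep both streams: many tools print useful diagnostics to stdout.
        output = completed.stdout + completed.stderr
    except subprocess.TimeoutExpired:
        failed = True
        output = f"command timed out after {timeout}s"
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": output.strip(),
        "is_error": failed,
    }
```

Because a failure comes back in the same shape as a success, the agent loop does not need a separate error path: the model sees the message and picks its next tool call.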
The Error Recovery Loop
┌──────────────────────────────────────────┐
│ Error Recovery Loop │
│ │
│ 1. Attempt the operation │
│ │ │
│ ▼ │
│ 2. Succeed? YES → Continue │
│ NO → Classify │
│ │ │
│ ▼ │
│ 3. TRANSIENT → Retry (max 3-5x) │
│ PERMANENT → Try fallback │
│ USER-ACT → Report and ask │
│ │ │
│ ▼ │
│ 4. Recovery worked? YES → Continue │
│ NO → Escalate │
│ │
└──────────────────────────────────────────┘
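The loop above can be sketched as one function that ties classification, retries, and fallbacks together. Everything here is an assumption made for illustration: `recover`, its parameters, the string labels returned by the caller-supplied `classify` function, and the `(status, value)` return shape.

```python
import time
from typing import Callable, List, Sequence, Tuple, Union


def recover(
    operation: Callable[[], object],
    classify: Callable[[Exception], str],  # "transient" | "permanent" | "user-actionable"
    fallbacks: Sequence[Callable[[], object]] = (),
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Union[object, List[str]]]:
    """Return ("ok", value), ("ask_user", errors), or ("escalate", errors)."""
    errors: List[str] = []
    for candidate in [operation, *fallbacks]:
        retries = 0
        while True:
            try:
                return ("ok", candidate())       # step 2: succeeded, continue
            except Exception as error:
                errors.append(str(error))
                kind = classify(error)           # step 2: failed, classify
            if kind == "user-actionable":
                return ("ask_user", errors)      # step 3: report and ask
            if kind == "transient" and retries < max_retries:
                sleep(2 ** retries)              # step 3: retry with backoff
                retries += 1
                continue
            break  # permanent, or retries exhausted: move to the next fallback
    return ("escalate", errors)                  # step 4: nothing worked
```

The accumulated `errors` list doubles as the report of what was tried, whether the outcome is a question for the user or an escalation.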
Key Insight
Good error recovery is what separates a fragile script from a resilient agent. The AI should treat errors as learning opportunities, not stop conditions.
When an error occurs, it contains information:
- “File not found” tells you the path is wrong --- search for the right one
- “Module not found” tells you a dependency is missing --- install it
- “Type error” tells you the code has a bug --- read the types and fix it
- “Permission denied” tells you the operation needs elevated access --- ask the user
Each error narrows the solution space. A common mistake in agent design is treating all errors as fatal; another is retrying everything indiscriminately. Classification is what makes recovery effective.
Hands-On Example
Building a Resilient Package Installation Flow
Install the "sharp" image processing library.
If it fails, diagnose the error and try alternative approaches.
The agent’s behavior with good error recovery:
Step 1: pnpm add sharp
→ Error: node-gyp build failed (missing libvips)
→ Classify: PERMANENT (missing system dependency)
Step 2: Fallback → Try pre-built binaries
→ pnpm add sharp --ignore-scripts && npx sharp-install
→ Still fails
Step 3: Fallback → Try alternative library
→ pnpm add jimp (pure JavaScript, no native deps)
→ Succeeds!
Step 4: Update code to use jimp instead of sharp
→ Adjust imports and API calls
→ Run tests to verify
Common Failure Modes and Recovery
| Failure | Classification | Recovery |
|---|---|---|
| ENOENT: file not found | Permanent | Search for file, check spelling |
| 429 Too Many Requests | Transient | Exponential backoff, max 5 retries |
| EACCES: permission denied | User-actionable | Report, suggest chmod or sudo |
| Build failed: type error | Permanent | Read error, fix the type mismatch |
| Test failed: assertion | Permanent | Read test, fix logic or update test |
| Connection timeout | Transient | Retry with longer timeout |
| Module not found | Permanent | Install missing dependency |
| Git conflict | Permanent | Read conflict markers, resolve |
Error-Aware CLAUDE.md Instructions
Encode recovery strategies directly in your project instructions:
# Error Recovery Rules
- Build fails: read FULL error output, not just last line
- Type error: check the relevant type definitions
- Missing import: search for the correct path
- Dependency issue: run pnpm install first
- Test fails: run failing test in isolation, read both test and implementation
- Never retry more than 3 times for the same error
- If stuck after 3 attempts, report what you tried
What Changed
| Without Error Recovery | With Error Recovery |
|---|---|
| Agent stops at first error | Agent classifies and adapts |
| All errors treated the same | Transient, permanent, user-actionable |
| No retry logic | Exponential backoff for transient errors |
| No fallback options | Alternative approaches tried |
| Errors are failures | Errors are information |
| User must intervene constantly | Agent self-heals when possible |
Next Session
Session 21 covers Cost Optimization --- how to select the right model for each task, manage tokens efficiently, and design workflows that balance quality and cost.