Error Recovery

Build retry logic, error classification, and fallback strategies.

March 20, 2026 · 18 min read

What You’ll Learn

Errors are inevitable in agent workflows. APIs return 429s. Builds fail. Tests break. Files are missing. What separates a fragile script from a resilient agent is how it responds to failure, not whether it encounters it.

By the end, you’ll understand:

How to classify errors into actionable categories
The retry pattern with exponential backoff
Fallback strategies when the primary approach fails
How Claude Code turns errors into information
Common failure modes and their recovery patterns

The Problem

A naive agent treats every error the same way: it stops. But not all errors are equal.

Error: ENOENT: no such file or directory     ← Wrong path, fixable
Error: 429 Too Many Requests                 ← Temporary, just wait
Error: Cannot find module 'express'          ← Missing dependency, install it
Error: SyntaxError: Unexpected token         ← Code bug, must fix
Error: EPERM: operation not permitted        ← Needs user action

Each requires a completely different response. The key is classification: understanding what kind of error you are dealing with before deciding what to do.

How It Works

Error Classification

Every error falls into one of three categories:

┌──────────────────────────────────────────────────┐
│            Error Classification                   │
│                                                   │
│  TRANSIENT → Retry with backoff                   │
│  │  Rate limits, network errors, timeouts,        │
│  │  server 5xx. Will succeed if you wait.         │
│  │                                                │
│  PERMANENT → Change approach                      │
│  │  File not found, syntax error, type mismatch,  │
│  │  missing module. Retrying won't help.          │
│  │                                                │
│  USER-ACTIONABLE → Ask the user                   │
│     Permission errors, auth required, config      │
│     missing. The agent cannot resolve alone.      │
│                                                   │
└──────────────────────────────────────────────────┘
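
The three categories above can be sketched as a small classifier. This is a minimal illustration, not a complete rule set: the `ErrorClass` enum, the `classify` function, and the regular expressions are all hypothetical names and patterns chosen for this example, and a real agent would tune them for its own toolchain.

```python
import re
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"              # retry with backoff
    PERMANENT = "permanent"              # change approach
    USER_ACTIONABLE = "user-actionable"  # ask the user


# Illustrative patterns only (assumed, not exhaustive). Order matters:
# the first matching rule wins.
_RULES = [
    (ErrorClass.TRANSIENT,
     r"\b429\b|\b50[0234]\b|timeout|timed out|ECONNRESET|ETIMEDOUT"),
    (ErrorClass.USER_ACTIONABLE,
     r"EPERM|EACCES|permission denied|\b401\b|\b403\b|unauthorized"),
    (ErrorClass.PERMANENT,
     r"ENOENT|no such file|SyntaxError|TypeError|Cannot find module"),
]


def classify(message: str) -> ErrorClass:
    """Return the first category whose pattern matches the error text."""
    for error_class, pattern in _RULES:
        if re.search(pattern, message, re.IGNORECASE):
            return error_class
    # Unknown errors default to PERMANENT so they are never retried blindly.
    return ErrorClass.PERMANENT
```

Defaulting unknown errors to PERMANENT is a deliberate choice in this sketch: it keeps the agent from retrying something it does not understand.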

The Retry Pattern

For transient errors, use exponential backoff:

Attempt 1: Try → Fails (429) → Wait 1s
Attempt 2: Try → Fails (429) → Wait 2s
Attempt 3: Try → Fails (429) → Wait 4s
Attempt 4: Try → Succeeds!

  Max retries: 3-5 for most operations
  Max wait: cap at 30-60 seconds
  Doubling the wait avoids hammering rate-limited APIs
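
A minimal sketch of this retry pattern follows. `TransientError` and `retry_with_backoff` are hypothetical names for illustration, and the sketch assumes the caller has already classified the failure as transient. The `sleep` parameter exists only so the waits can be observed or stubbed out.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (429, timeout, 5xx)."""


def retry_with_backoff(operation, max_retries=4, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(); on TransientError wait 1s, 2s, 4s, ... then retry."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller escalate
            # Double the delay on each attempt, capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

Production implementations often add random jitter to each delay and honor a server's Retry-After header when one is present; both are left out here to keep the doubling logic visible.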

Fallback Strategies

When the primary approach fails permanently, try alternatives:

┌──────────────────────────────────────────────────┐
│           Fallback Chain                          │
│                                                   │
│  Primary: npm install express                     │
│     │  FAILS (npm not found)                      │
│     ▼                                             │
│  Fallback 1: yarn add express                     │
│     │  FAILS (yarn not found)                     │
│     ▼                                             │
│  Fallback 2: pnpm add express                     │
│     │  SUCCEEDS → Continue with pnpm              │
│     ▼                                             │
│  All failed → Report what was tried               │
│                                                   │
└──────────────────────────────────────────────────┘

Fallbacks work at every level:

File Reading:     Try config.ts → config.js → config.json → search
Build Commands:   Try pnpm build → npm run build → yarn build → read package.json
Test Runners:     Try pnpm test → npx jest → npx vitest → find config
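
A fallback chain like the ones above can be sketched as an ordered list of strategies. `run_fallback_chain` and `command_strategy` are hypothetical helpers written for this example. The sketch treats any exception as a reason to move to the next strategy and keeps a record of what was tried for the final report.

```python
import shutil
import subprocess


def run_fallback_chain(strategies):
    """Try (name, callable) pairs in order; return (name, result, attempts)."""
    attempts = []  # failed attempts so far, kept for the final report
    for name, strategy in strategies:
        try:
            result = strategy()
        except Exception as exc:  # broad on purpose: any failure moves on
            attempts.append(f"{name}: {exc}")
            continue
        return name, result, attempts
    raise RuntimeError("All strategies failed:\n" + "\n".join(attempts))


def command_strategy(argv):
    """Build a strategy that runs a command and fails if it is missing
    from PATH or exits with a non-zero status."""
    def strategy():
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found")
        subprocess.run(argv, check=True, capture_output=True, text=True)
        return argv
    return strategy
```

The chain from the diagram would then be a list such as `[("npm", command_strategy(["npm", "install", "express"])), ("yarn", command_strategy(["yarn", "add", "express"])), ("pnpm", command_strategy(["pnpm", "add", "express"]))]` passed to `run_fallback_chain`.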

How Claude Code Handles Tool Errors

When a tool call fails, the error is returned as a tool_result. The AI reads it and adapts:

┌──────────────────────────────────────────────────┐
│        Error as Tool Result                       │
│                                                   │
│  AI calls: Bash("cat src/auth.ts")                │
│                                                   │
│  Tool returns:                                    │
│    "cat: src/auth.ts: No such file or directory"  │
│    is_error: true                                 │
│                                                   │
│  AI reasons: "Wrong path. Let me search."         │
│                                                   │
│  AI calls: Grep(pattern="auth", glob="*.ts")      │
│                                                   │
│  Tool returns:                                    │
│    "src/middleware/authenticate.ts"               │
│    "src/lib/auth-utils.ts"                        │
│                                                   │
│  AI adapts: reads the correct file                │
│                                                   │
└──────────────────────────────────────────────────┘

This is the “error as information” principle. The error is not a stop signal. It is data that helps the AI make a better next decision.
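
To make the principle concrete, here is a small sketch of a tool runner that never raises and instead returns every outcome as data. It is not Claude Code's implementation: `run_tool` is a hypothetical helper, and the result shape (a `content` string plus an `is_error` flag) is an assumption modeled on the diagram above.

```python
import subprocess
import sys


def run_tool(argv, timeout=30):
    """Run a command and always return a result dict instead of raising."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing executable or timeout: still reported as data.
        return {"content": str(exc), "is_error": True}
    if proc.returncode != 0:
        # Prefer stderr: that is usually where the diagnostic message lives.
        return {"content": (proc.stderr or proc.stdout).strip(),
                "is_error": True}
    return {"content": proc.stdout.strip(), "is_error": False}


# Usage: a failing command comes back as a value the caller can inspect.
result = run_tool([sys.executable, "-c", "open('src/auth.ts')"])
if result["is_error"]:
    print("tool failed; next step: search for the file")
```

Because the failure is a return value rather than an exception, the calling loop can hand it to the model in the same way as a successful result.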

The Error Recovery Loop

┌──────────────────────────────────────────┐
│          Error Recovery Loop              │
│                                           │
│  1. Attempt the operation                 │
│          │                                │
│          ▼                                │
│  2. Succeed? YES → Continue               │
│              NO  → Classify               │
│          │                                │
│          ▼                                │
│  3. TRANSIENT → Retry (max 3-5x)          │
│     PERMANENT → Try fallback              │
│     USER-ACT  → Report and ask            │
│          │                                │
│          ▼                                │
│  4. Recovery worked? YES → Continue       │
│                      NO  → Escalate       │
│                                           │
└──────────────────────────────────────────┘
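
The loop above can be sketched in a few lines. This is one possible arrangement, not a reference implementation: `recover`, `TransientError`, and `UserActionRequired` are hypothetical names, and the sketch assumes the operation signals its error category through the exception type it raises.

```python
import time


class TransientError(Exception):
    """Hypothetical: the failure may clear up if we wait."""


class UserActionRequired(Exception):
    """Hypothetical: only the user can resolve this."""


def recover(operation, fallbacks=(), max_retries=3, base_delay=1.0,
            sleep=time.sleep):
    """Steps 1-4 of the loop: attempt, classify, recover, escalate."""
    tried = []
    for candidate in (operation, *fallbacks):
        for attempt in range(max_retries + 1):
            try:
                return candidate()             # steps 1-2: attempt, succeed
            except TransientError as exc:      # step 3: TRANSIENT -> retry
                tried.append(f"transient: {exc}")
                if attempt < max_retries:
                    sleep(base_delay * 2 ** attempt)
            except UserActionRequired:         # step 3: USER-ACT -> ask
                raise
            except Exception as exc:           # step 3: PERMANENT -> fallback
                tried.append(f"permanent: {exc}")
                break
    # Step 4: nothing worked, so escalate with the full history.
    raise RuntimeError("Escalating after: " + "; ".join(tried))
```

When retries on one candidate run out, this sketch moves on to the next fallback rather than failing immediately. That is a design choice; a stricter agent could escalate as soon as the retry budget is spent.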

Key Insight

Good error recovery is what separates a fragile script from a resilient agent. The AI should treat errors as learning opportunities, not stop conditions.

When an error occurs, it contains information:

“File not found” tells you the path is wrong: search for the right one
“Module not found” tells you a dependency is missing: install it
“Type error” tells you the code has a bug: read the types and fix it
“Permission denied” tells you the operation needs elevated access: ask the user

Each error narrows the solution space. The most common mistake in agent design is treating all errors as fatal. The second is retrying everything indiscriminately. Classification is the skill that makes recovery effective.

Hands-On Example

Building a Resilient Package Installation Flow

Install the "sharp" image processing library.
If it fails, diagnose the error and try alternative approaches.

The agent’s behavior with good error recovery:

Step 1: pnpm add sharp
  → Error: node-gyp build failed (missing libvips)
  → Classify: PERMANENT (missing system dependency)

Step 2: Fallback → Try pre-built binaries
  → npm install --include=optional sharp (fetch the prebuilt platform binary)
  → Still fails

Step 3: Fallback → Try alternative library
  → pnpm add jimp (pure JavaScript, no native deps)
  → Succeeds!

Step 4: Update code to use jimp instead of sharp
  → Adjust imports and API calls
  → Run tests to verify

Common Failure Modes and Recovery

Failure	Classification	Recovery
`ENOENT: no such file or directory`	Permanent	Search for file, check spelling
`429 Too Many Requests`	Transient	Exponential backoff, max 5 retries
`EACCES: permission denied`	User-actionable	Report, suggest `chmod` or `sudo`
`Build failed: type error`	Permanent	Read error, fix the type mismatch
`Test failed: assertion`	Permanent	Read test, fix logic or update test
`Connection timeout`	Transient	Retry with longer timeout
`Module not found`	Permanent	Install missing dependency
`Git conflict`	Permanent	Read conflict markers, resolve

Error-Aware CLAUDE.md Instructions

Encode recovery strategies directly in your project instructions:

# Error Recovery Rules
- Build fails: read FULL error output, not just last line
- Type error: check the relevant type definitions
- Missing import: search for the correct path
- Dependency issue: run pnpm install first
- Test fails: run failing test in isolation, read both test and implementation
- Never retry more than 3 times for the same error
- If stuck after 3 attempts, report what you tried
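
The last two rules (a cap of three attempts per error, then a report of what was tried) can also be enforced in code around the agent. The sketch below is one hypothetical way to do it; `AttemptLog` and its methods are illustrative names, and the error signatures in the usage lines are made-up examples.

```python
from collections import Counter


class AttemptLog:
    """Track attempts per error signature and enforce a retry budget."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.counts = Counter()
        self.history = []

    def record(self, error_signature, action):
        """Log one failed attempt; return True while another try is allowed."""
        self.counts[error_signature] += 1
        self.history.append((error_signature, action))
        return self.counts[error_signature] < self.max_attempts

    def report(self):
        """Summarize everything tried, for handing back to the user."""
        return "\n".join(f"- {sig}: tried {action}"
                         for sig, action in self.history)


# Usage: the third failure on the same signature exhausts the budget.
log = AttemptLog(max_attempts=3)
log.record("type error in auth.ts", "widened the type")
log.record("type error in auth.ts", "added a cast")
if not log.record("type error in auth.ts", "rewrote the function"):
    print(log.report())
```

Keying the counter on an error signature rather than on the command means a new, different error gets its own budget, which matches the "same error" wording of the rule.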

What Changed

Without Error Recovery	With Error Recovery
Agent stops at first error	Agent classifies and adapts
All errors treated the same	Transient, permanent, user-actionable
No retry logic	Exponential backoff for transient errors
No fallback options	Alternative approaches tried
Errors are failures	Errors are information
User must intervene constantly	Agent self-heals when possible

Next Session

Session 21 covers Cost Optimization: how to select the right model for each task, manage tokens efficiently, and design workflows that balance quality and cost.