Task 1.7

Error Recovery & Resilience

Production agentic systems encounter failures at every level: API rate limits, tool execution errors, model hallucinations, network timeouts, and context window overflows. Building resilient agents means anticipating these failures and implementing recovery strategies that keep the system operational.

API Error Handling

Claude API errors fall into several categories. Rate limit errors (429) require backoff and retry. Overload errors (529) indicate temporary capacity issues. Authentication errors (401) are configuration problems. Context length errors mean your conversation exceeds the model's context window.

For rate limits and overload, implement exponential backoff with jitter. For context length errors, implement conversation summarization or truncation. For authentication errors, fail fast — retrying won't help.
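A minimal sketch of this classification combined with exponential backoff and full jitter. The `APIError` exception and `request` callable are hypothetical stand-ins for your API client; the status-code sets reflect the categories above.

```python
import random
import time

class APIError(Exception):
    """Hypothetical stand-in for an API client error carrying an HTTP status."""
    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status

RETRYABLE = {429, 529}   # rate limit, overload: back off and retry
FATAL = {400, 401}       # bad request, auth: retrying won't help

def call_with_backoff(request, max_retries=5, base_delay=1.0):
    """Call `request` (a zero-argument callable), retrying retryable errors
    with exponential backoff plus full jitter. Fatal errors fail fast."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except APIError as err:
            if err.status not in RETRYABLE or attempt == max_retries:
                raise  # fatal or retries exhausted: surface the error
            # Full jitter: sleep a random amount up to the exponential cap
            # (~1s, ~2s, ~4s, ... for base_delay=1.0).
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Note the maximum retry count: without it, backoff alone can delay a response indefinitely.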

Tool Failure Recovery

When a tool call fails, the agent needs to know about the failure so it can adapt. Return a clear error message as the tool result rather than throwing an exception that breaks the loop. The model can then decide whether to retry, try an alternative approach, or inform the user.

Common tool failure patterns: external API timeouts (implement tool-level timeouts), invalid parameters (validate before execution), permission errors (check before calling), and resource not found errors. Each should return a structured error that the model can act on.
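One way to sketch this is a wrapper that catches tool exceptions and converts them into `tool_result` blocks, using the Messages API's `is_error` flag to mark failures. The helper name `run_tool_safely` is hypothetical.

```python
import json

def run_tool_safely(tool_fn, tool_use_id, **kwargs):
    """Run a tool and always return a tool_result block, never raise.
    Errors become structured results the model can act on."""
    try:
        output = tool_fn(**kwargs)
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": json.dumps(output),
        }
    except TimeoutError:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": "Error: the external service timed out. "
                       "Retry, or try an alternative approach.",
            "is_error": True,
        }
    except Exception as err:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Error: {err}",
            "is_error": True,
        }
```

Because the wrapper never raises, the agent loop keeps running and the model sees a readable description of what went wrong.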

Graceful Degradation

When components fail, the system should degrade gracefully rather than crash entirely. If a search tool is unavailable, the agent can still answer from its knowledge. If a database is down, the agent can queue operations for later. If the primary model is rate-limited, fall back to a secondary model.

Graceful degradation requires designing the system with fallback paths from the start. Each critical dependency should have an alternative or a way to continue with reduced functionality.
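A generic fallback path can be sketched as a small combinator, assuming `primary` and `fallback` are any interchangeable callables (a search-backed answer and a model-only answer, say):

```python
def with_fallback(primary, fallback):
    """Return a callable that tries `primary` and degrades to
    `fallback` on any failure, instead of crashing."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Primary dependency unavailable: continue with
            # reduced functionality rather than failing entirely.
            return fallback(*args, **kwargs)
    return call
```

The same shape covers the examples above: wrap a search tool with a knowledge-only answer, a database write with a queue-for-later operation, or a primary model with a secondary one.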

Key Concept

Feed Errors Back to the Model

When a tool call fails, return the error as a tool_result rather than crashing the loop. Claude can interpret error messages and adapt — retrying with different parameters, trying alternative tools, or explaining the issue to the user. Hiding errors from the model removes its ability to self-correct.
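In the conversation itself, the failed call comes back as a `tool_result` content block in a user turn. A minimal sketch, assuming `messages` is the running conversation list and `tool_use_id` came from the model's `tool_use` block:

```python
messages = []          # running conversation (assumed)
tool_use_id = "toolu_01"  # from the model's tool_use block (assumed)

# Feed the failure back instead of crashing the loop; the model
# can retry, switch tools, or explain the problem to the user.
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": "Error: search API returned 503",
        "is_error": True,
    }],
})
```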

Exam Traps

EXAM TRAP

Retrying all errors with the same strategy

Different errors need different handling. Rate limits need backoff; auth errors need configuration fixes; context length errors need conversation management. The exam tests whether you can match error types to recovery strategies.

EXAM TRAP

Crashing the loop on tool errors

Tool errors should be returned to the model as error results, not thrown as exceptions. The model can often recover by trying a different approach.

EXAM TRAP

Infinite retry loops

Always set a maximum retry count. Exponential backoff without a maximum can delay responses indefinitely.

Check Your Understanding

An agent is using a web search tool that returns a 429 (rate limit) error. What is the correct recovery strategy?

Build Exercise

Build Resilient API Calls

Beginner · 30 minutes

What you'll learn

  • Implement exponential backoff with jitter
  • Handle different error types appropriately
  • Return tool errors to the model
  • Add circuit breaker pattern for persistent failures

Steps

  1. Create a retry wrapper function that implements exponential backoff with jitter for API calls.

    WHY: Exponential backoff is the foundation of resilient API communication.

    YOU SHOULD SEE: The wrapper retries with increasing delays: ~1s, ~2s, ~4s.

  2. Add error classification: retry on 429 and 529, fail fast on 401 and 400, and summarize context on context length errors.

    WHY: Different errors need different strategies. Retrying an auth error wastes time.

    YOU SHOULD SEE: Rate limits are retried; auth errors fail immediately with clear messages.

  3. Create a tool wrapper that catches errors and returns them as structured tool results instead of throwing.

    WHY: The model needs to see errors to adapt its strategy.

    YOU SHOULD SEE: When a tool fails, the model receives an error message and can decide what to do next.

  4. Implement a simple circuit breaker: after 5 consecutive failures for a tool, disable it for 60 seconds before allowing retries.

    WHY: Circuit breakers prevent repeated calls to a broken service, reducing load and improving recovery time.

    YOU SHOULD SEE: After 5 failures, the tool returns 'temporarily unavailable' without making the actual call.
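The circuit breaker in step 4 can be sketched as a small class. This is one possible shape, not a reference implementation; the class name and the use of `time.monotonic` are choices, and the defaults match the exercise (5 failures, 60 seconds).

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, short-circuit calls
    for `cooldown` seconds instead of hitting the broken service."""

    def __init__(self, threshold=5, cooldown=60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # time the circuit opened, or None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Circuit open: refuse without calling the tool.
                return "Error: tool temporarily unavailable"
            self.opened_at = None  # cooldown over: allow a trial call

        try:
            result = fn(*args, **kwargs)
        except Exception as err:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # open the circuit
            return f"Error: {err}"

        self.failures = 0  # success resets the count
        return result
```

Returning the "temporarily unavailable" string as an error result, rather than raising, keeps the pattern consistent with step 3: the model sees the outage and can adapt.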
