Rate Limit Types
The Claude API has several rate limit dimensions: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). These limits vary by model and API tier. Higher-tier plans have higher limits.
When a rate limit is exceeded, the API returns a 429 status code with a Retry-After header indicating how long to wait. Applications must handle this gracefully — not every 429 calls for the same wait, so honor the header rather than retrying after a fixed delay.
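A minimal sketch of how a client might decide the wait time for a 429, assuming a plain dict of response headers; it prefers the server's Retry-After value (in seconds) and falls back to capped exponential backoff when the header is missing or unparseable:

```python
def wait_time_from_429(headers: dict, attempt: int,
                       base: float = 1.0, cap: float = 60.0) -> float:
    """Return how many seconds to sleep before retrying a rate-limited request.

    Prefers the server's Retry-After header; falls back to capped
    exponential backoff when the header is absent or malformed.
    """
    retry_after = headers.get("retry-after") or headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # e.g. an HTTP-date value; fall through to backoff
    return min(cap, base * (2 ** attempt))
```

The function name and header-dict shape are illustrative; real SDKs usually surface the header through their own response or exception types.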
Rate Limit Strategies
Strategies for working within rate limits include:
- Request queuing: buffer requests and process them within the rate limit.
- Exponential backoff with jitter: retry after increasing, randomized delays.
- Request batching: combine multiple small requests where possible.
- Traffic shaping: spread requests evenly over time rather than bursting.
The most robust approach combines a request queue with exponential backoff for retries. The queue ensures requests are processed in order, and backoff handles transient rate limit hits.
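The queue half of that combination can be sketched as a simple pacing loop; this is an illustrative design, not a specific library API:

```python
import time
from collections import deque

class RateLimitedQueue:
    """Process queued calls in order, at no more than `rpm` requests per minute."""

    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm   # minimum seconds between requests
        self.queue = deque()
        self._last_sent = 0.0

    def enqueue(self, fn, *args, **kwargs):
        self.queue.append((fn, args, kwargs))

    def drain(self):
        """Run queued calls, sleeping as needed to hold the configured rate."""
        results = []
        while self.queue:
            fn, args, kwargs = self.queue.popleft()
            wait = self.interval - (time.monotonic() - self._last_sent)
            if wait > 0:
                time.sleep(wait)
            self._last_sent = time.monotonic()
            results.append(fn(*args, **kwargs))
        return results
```

In production you would drain the queue on a background worker and wrap each call with the backoff logic described above, rather than blocking the caller.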
Quota Management
Usage quotas limit total spending over a period (daily or monthly). Monitor usage proactively — don't discover you've hit the quota when a critical request fails. Implement usage tracking, alert thresholds (e.g., alert at 80% of quota), and graceful degradation when approaching limits.
For multi-tenant applications, implement per-tenant quotas to prevent one tenant from consuming the entire allocation. This requires tracking usage per tenant and enforcing limits before making API calls.
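A per-tenant quota check might look like the following sketch; the class and method names are assumptions, and the "period" reset would be driven by a scheduler in a real system:

```python
class TenantQuota:
    """Track per-tenant request usage and enforce a limit before each API call."""

    def __init__(self, limit_per_tenant: int):
        self.limit = limit_per_tenant
        self.usage = {}   # tenant_id -> requests used this period

    def try_acquire(self, tenant_id: str) -> bool:
        """Reserve one request for the tenant, or refuse if it is at its limit."""
        used = self.usage.get(tenant_id, 0)
        if used >= self.limit:
            return False          # tenant has consumed its share
        self.usage[tenant_id] = used + 1
        return True

    def reset(self):
        """Call at the start of each quota period (e.g. daily)."""
        self.usage.clear()
```

The key property is that the check happens before the API call is made, so one tenant's burst is rejected locally instead of consuming the shared allocation.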
Key Concept
Design for the Limit, Not Against It
Rate limits are not obstacles to work around — they are constraints to design for. A well-designed application handles rate limits gracefully, queues requests efficiently, and degrades gracefully when approaching quotas. Attempting to circumvent rate limits (multiple API keys, aggressive retries) creates fragile systems that fail unpredictably.
Exam Traps
Retrying immediately on rate limit errors
Immediate retries hit the rate limit again. Always use backoff (exponential with jitter). Check the Retry-After header for the recommended wait time.
Not monitoring usage against quotas
Discovering you've hit a quota during a critical operation is a production incident. Proactive monitoring with alerts at 80% prevents surprises.
Using multiple API keys to circumvent rate limits
This violates terms of service and creates complex, fragile systems. The correct approach is to request higher limits or optimize usage.
Check Your Understanding
Your application processes 100 documents per hour. Each document requires 3 API calls. Your rate limit is 60 RPM. At current throughput, you are hitting rate limits during peak processing. What is the best solution?
Build Exercise
Build a Rate Limiter
What you'll learn
- Implement a request queue with rate limiting
- Handle 429 responses with exponential backoff
- Track usage against quotas
- Implement graceful degradation near limits
Create a rate-limited request queue that processes at most N requests per minute. Enqueue API calls and process them at the configured rate.
WHY: A request queue prevents bursts that trigger rate limits.
YOU SHOULD SEE: Requests are processed at a steady rate, not in bursts.
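One common way to enforce "at most N requests per minute" is a token bucket, which also smooths bursts; this is a minimal sketch under the assumption that callers check `acquire()` before each request:

```python
import time

class TokenBucket:
    """Allow at most `rpm` requests per minute, smoothing out bursts."""

    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.refill_rate = rpm / 60.0      # tokens added per second
        self._last = time.monotonic()

    def acquire(self) -> bool:
        """Take one token if available; return False when over the limit."""
        now = time.monotonic()
        elapsed = now - self._last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller that gets `False` can either sleep briefly and retry or push the request back onto the queue.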
Add exponential backoff with jitter for 429 responses. When a rate limit is hit, wait and retry with increasing delays.
WHY: Exponential backoff is the standard recovery mechanism for rate limit errors.
YOU SHOULD SEE: 429 errors trigger delays of ~1s, ~2s, ~4s with random jitter.
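The retry loop for this step can be sketched as follows; `RateLimitError` here is a stand-in for whatever exception your API client raises on a 429, and the "full jitter" variant (random delay between 0 and the backoff cap) is one common choice:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 exception your API client raises."""

def retry_with_backoff(call, max_retries=5, base=1.0, cap=60.0):
    """Retry `call` on RateLimitError, sleeping with full-jitter backoff."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise                      # out of retries; surface the error
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
```

Jitter matters because many clients backing off in lockstep would otherwise retry at the same instant and hit the limit together.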
Implement usage tracking: count requests and tokens per hour/day. Alert when usage reaches 80% of the quota.
WHY: Proactive monitoring prevents surprise quota exhaustion.
YOU SHOULD SEE: A usage dashboard showing current usage vs. limits, with alerts at 80%.
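A minimal tracker for this step might look like the sketch below; the 80% threshold matches the alert level suggested above, and wiring `should_alert()` to a real alerting channel is left to the surrounding system:

```python
class UsageTracker:
    """Count token usage against a quota and flag the alert threshold."""

    def __init__(self, quota_tokens: int, alert_fraction: float = 0.8):
        self.quota = quota_tokens
        self.alert_fraction = alert_fraction
        self.used = 0

    def record(self, tokens: int) -> None:
        self.used += tokens

    @property
    def fraction_used(self) -> float:
        return self.used / self.quota

    def should_alert(self) -> bool:
        return self.fraction_used >= self.alert_fraction

    def exhausted(self) -> bool:
        return self.used >= self.quota
```

The same shape works for request counts; a production version would also window usage per hour/day and persist it across restarts.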
Implement graceful degradation: when approaching the quota limit, switch to a smaller model, reduce request frequency, or queue non-critical requests for later.
WHY: Graceful degradation ensures critical functionality continues even when approaching limits.
YOU SHOULD SEE: The system automatically reduces resource usage when approaching limits.
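The degradation policy itself can be as simple as a threshold function; the thresholds and mode names below are illustrative, not prescribed:

```python
def choose_mode(fraction_used: float) -> str:
    """Pick an operating mode based on how much of the quota is spent."""
    if fraction_used < 0.8:
        return "normal"            # full model, all requests
    if fraction_used < 0.95:
        return "degraded"          # smaller model, defer non-critical work
    return "critical-only"         # queue everything non-essential
```

The dispatch layer then consults this mode before each call, e.g. swapping in a cheaper model or deferring batch work while in "degraded" mode.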
Sources
- Rate Limits — Anthropic Documentation
- Error Handling — Anthropic Documentation