Context Window Budget
The context window is a shared budget: system prompt + tool definitions + conversation history + response must all fit. Plan token budgets explicitly.
| Component | Typical Size | Optimization |
|---|
| System prompt | 500-2000 tokens | Keep focused; move dynamic content to user messages |
| Tool definitions | 100-500 per tool | Minimize description verbosity; limit tool count |
| Conversation history | Grows over time | Summarize, truncate, or use sliding window |
| Expected response | Reserve 1000-4000 | Set max_tokens to cap this allocation |
Prompt Caching Strategy
Place cache_control breakpoints after stable content that repeats across requests. Up to 4 breakpoints allowed. Cache TTL is 5 minutes, refreshed on each hit.
| Breakpoint Placement | Effectiveness |
|---|
| After system prompt | High — system prompt rarely changes |
| After tool definitions | High — tools are static per session |
| After few-shot examples | Medium — stable but may vary by task |
| After conversation history | Low — changes every turn |
Cost: 1.25x write cost for initial cache creation. 0.1x read cost on cache hits. Breakeven at ~1.4 subsequent reads.
Long Conversation Strategies
| Strategy | When to Use | Key Characteristic |
|---|
| Summarization | General long conversations | Replace old messages with a summary |
| Sliding window | Recent context matters most | Keep last N messages, drop oldest |
| Selective retention | Key facts scattered throughout | Keep important messages, summarize rest |
| Anti-Pattern | Why It Fails |
|---|
| Never truncating | Eventually hits context limit, request fails |
| Aggressive truncation | Loses critical context, Claude contradicts earlier statements |
| Summarizing too late | First failure is user-visible — summarize proactively |
Rate Limiting and Retries
| Status Code | Meaning | Action |
|---|
| 429 | Rate limited | Retry with exponential backoff + jitter |
| 500 | Server error | Retry (transient) |
| 529 | API overloaded | Retry with longer backoff |
| 400 | Bad request | Do not retry — fix the request |
| 401 | Unauthorized | Do not retry — fix authentication |
Exponential backoff formula: delay = min(base * 2^attempt + random_jitter, max_delay). Always add jitter to prevent thundering herd.
Production Reliability Patterns
| Pattern | When to Use | Key Characteristic |
|---|
| Retry with backoff | Transient failures (429, 500) | Automatic recovery from temporary issues |
| Circuit breaker | Sustained failures | Fail fast after threshold, avoid hammering a down service |
| Fallback model | Primary model unavailable | Switch to smaller/cheaper model for degraded service |
| Graceful degradation | Non-critical features fail | Return cached or simplified response |
| Health checks | Continuous monitoring | Detect issues before users do |
Key metrics to monitor: time-to-first-token, total latency, input/output token counts, error rate by status code, cost per request, and tool call frequency.