Quick Reference: Context Management & Reliability

Context Window Budget

The context window is a shared budget: system prompt + tool definitions + conversation history + response must all fit. Plan token budgets explicitly.

Component	Typical Size	Optimization
System prompt	500-2000 tokens	Keep focused; move dynamic content to user messages
Tool definitions	100-500 per tool	Minimize description verbosity; limit tool count
Conversation history	Grows over time	Summarize, truncate, or use sliding window
Expected response	Reserve 1000-4000	Set max_tokens to cap this allocation

Prompt Caching Strategy

Place cache_control breakpoints after stable content that repeats across requests. Up to 4 breakpoints allowed. Cache TTL is 5 minutes, refreshed on each hit.

Breakpoint Placement	Effectiveness
After system prompt	High — system prompt rarely changes
After tool definitions	High — tools are static per session
After few-shot examples	Medium — stable but may vary by task
After conversation history	Low — changes every turn

Cost: 1.25x write cost for initial cache creation. 0.1x read cost on cache hits. Breakeven at ~1.4 subsequent reads.

Long Conversation Strategies

Strategy	When to Use	Key Characteristic
Summarization	General long conversations	Replace old messages with a summary
Sliding window	Recent context matters most	Keep last N messages, drop oldest
Selective retention	Key facts scattered throughout	Keep important messages, summarize rest

Anti-Pattern	Why It Fails
Never truncating	Eventually hits context limit, request fails
Aggressive truncation	Loses critical context, Claude contradicts earlier statements
Summarizing too late	First failure is user-visible — summarize proactively

Rate Limiting and Retries

Status Code	Meaning	Action
429	Rate limited	Retry with exponential backoff + jitter
500	Server error	Retry (transient)
529	API overloaded	Retry with longer backoff
400	Bad request	Do not retry — fix the request
401	Unauthorized	Do not retry — fix authentication

Exponential backoff formula: delay = min(base * 2^attempt + random_jitter, max_delay). Always add jitter to prevent thundering herd.

Production Reliability Patterns

Pattern	When to Use	Key Characteristic
Retry with backoff	Transient failures (429, 500)	Automatic recovery from temporary issues
Circuit breaker	Sustained failures	Fail fast after threshold, avoid hammering a down service
Fallback model	Primary model unavailable	Switch to smaller/cheaper model for degraded service
Graceful degradation	Non-critical features fail	Return cached or simplified response
Health checks	Continuous monitoring	Detect issues before users do

Key metrics to monitor: time-to-first-token, total latency, input/output token counts, error rate by status code, cost per request, and tool call frequency.

Quick Reference: Domain 5 — Context Management & Reliability

Context Window Budget

Prompt Caching Strategy

Long Conversation Strategies

Rate Limiting and Retries

Production Reliability Patterns