Context Management & Reliability
Key terms and definitions for Domain 5. Each entry includes a concise definition, exam context, and links to the relevant lesson.
cache_control
An API parameter that marks content as cacheable for prompt caching. Placed as a property on message content blocks with type: "ephemeral". Content up to and including the marked block is cached for reuse in subsequent requests.
Exam Context
Place cache_control breakpoints strategically: after the system prompt and after tool definitions. Up to 4 breakpoints allowed. Cache TTL is 5 minutes (refreshed on each hit).
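The breakpoint placement described above can be sketched as follows. This is a minimal illustration of the request shape, assuming the Messages API structure; the prompt text and tool definition are hypothetical placeholders.

```python
# Sketch: marking the system prompt and the last tool definition as
# cacheable with cache_control breakpoints (hypothetical content).
def build_request(system_text, tools, messages):
    """Return request kwargs with two cache_control breakpoints."""
    system = [{
        "type": "text",
        "text": system_text,
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: after system prompt
    }]
    if tools:
        tools = [dict(t) for t in tools]  # avoid mutating the caller's list
        tools[-1]["cache_control"] = {"type": "ephemeral"}  # breakpoint 2: after tools
    return {"system": system, "tools": tools, "messages": messages}

req = build_request(
    "You are a support assistant.",
    [{"name": "lookup_order", "description": "Find an order",
      "input_schema": {"type": "object"}}],
    [{"role": "user", "content": "Where is my order?"}],
)
```

Because the cache covers everything up to and including each marked block, placing breakpoints after the stable prefix (system prompt, tools) lets the variable conversation history follow without invalidating the cache.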
Context Window
The maximum number of tokens Claude can process in a single API call, including both input and output. Claude's context window size varies by model. All content (system prompt, conversation history, tool definitions, and response) must fit within this limit.
Exam Context
Know that the context window is shared between input and output. Requests that extend into very long contexts may be billed at higher per-token rates. Plan token budgets: system prompt + tools + history + expected output < window size.
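The budget rule above can be expressed as a pre-flight check. A minimal sketch, assuming the token counts come from elsewhere (e.g. a token-counting endpoint); the 200K default is illustrative.

```python
# Sketch: verify a planned request fits the context window before sending.
def fits_in_window(system_tokens, tool_tokens, history_tokens,
                   max_output_tokens, window_size=200_000):
    """Input plus expected output must fit within the context window.

    Returns (fits, remaining_headroom_in_tokens).
    """
    used = system_tokens + tool_tokens + history_tokens + max_output_tokens
    return used <= window_size, window_size - used

ok, headroom = fits_in_window(1_200, 3_000, 150_000, 8_192)
```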
Long Conversation
A conversation that approaches or exceeds the context window limit. Requires management strategies such as summarization, sliding window truncation, or context compression to continue meaningfully without losing critical information.
Exam Context
Know the three main strategies: summarization (compress old messages), truncation (drop oldest messages), and selective retention (keep important messages, compress others). Each has tradeoffs.
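The truncation strategy, with a nod to selective retention, can be sketched as a sliding window that always preserves the first message (often a task brief). A minimal illustration, not a complete context-management policy.

```python
# Sketch: sliding-window truncation that keeps the first message
# (selective retention of one "important" message) plus the most recent.
def truncate_sliding_window(messages, max_messages):
    """Drop the oldest messages when over budget, keeping the first."""
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]
```

The tradeoff named above applies: everything between the first and the retained tail is lost outright, which is why summarization is often layered on top.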
Monitoring
Observing and measuring the behavior of Claude-powered systems in production. Key metrics include latency, token usage, error rates, tool call patterns, and output quality scores. Essential for maintaining reliability and controlling costs.
Exam Context
Track: latency (time-to-first-token, total), cost (input/output tokens), errors (rate, type), and quality (user feedback, automated evals). Set alerts on anomalies.
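The metrics listed above can be collected with a small in-process recorder. A sketch only; production systems would export these to a real metrics backend.

```python
# Sketch: minimal in-process collection of per-call metrics
# (latency, token usage, errors) for a Claude-powered service.
class CallMetrics:
    def __init__(self):
        self.calls = []

    def record(self, latency_s, input_tokens, output_tokens, error=None):
        self.calls.append({
            "latency_s": latency_s,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "error": error,
        })

    def error_rate(self):
        """Fraction of recorded calls that failed."""
        if not self.calls:
            return 0.0
        return sum(1 for c in self.calls if c["error"]) / len(self.calls)
```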
Production Reliability
Patterns for building robust Claude-powered systems: retry logic with exponential backoff, circuit breakers, fallback models, graceful degradation, health checks, and deployment strategies like canary releases.
Exam Context
Know the retry pattern (exponential backoff + jitter), circuit breaker pattern (fail fast after threshold), and fallback pattern (switch to simpler model or cached response).
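The circuit breaker pattern named above can be reduced to a few lines. A deliberately minimal sketch: it tracks consecutive failures and opens past a threshold, omitting the half-open/recovery-timer state a production breaker would have.

```python
# Sketch: fail-fast circuit breaker (no half-open timer logic).
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        """When open, callers should fail fast (or use a fallback)."""
        return self.failures >= self.threshold

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
```

When the breaker is open, the fallback pattern takes over: route to a simpler model or serve a cached response instead of hammering a failing dependency.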
Prompt Caching
An API feature that caches repeated prompt prefixes to reduce cost and latency on subsequent calls. Marked with cache_control breakpoints. Cached content is reused when the prefix matches exactly, reducing input token costs by up to 90%.
Exam Context
Know the cache_control breakpoint placement strategy. Content up to and including the breakpoint is cached. Any change to cached content invalidates the cache. Place breakpoints after stable content (system prompt, tool definitions).
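The exact-match requirement can be illustrated by fingerprinting the cached prefix. This is illustrative only, not the API's actual mechanism: any byte-level change to the prefix yields a different fingerprint, i.e. a cache miss.

```python
import hashlib
import json

# Sketch: a one-character edit to the system prompt changes the
# prefix fingerprint, illustrating exact-match cache invalidation.
def prefix_fingerprint(system, tools):
    blob = json.dumps({"system": system, "tools": tools}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

a = prefix_fingerprint("You are helpful.", [{"name": "search"}])
b = prefix_fingerprint("You are helpful!", [{"name": "search"}])  # one char changed
```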
Rate Limiting
API-enforced limits on request frequency and token throughput. Claude's API has limits on requests per minute (RPM) and tokens per minute (TPM). Exceeding limits returns 429 status codes. Managed with queuing, backoff, and request batching.
Exam Context
Know the HTTP 429 response and retry-after header. Implement exponential backoff with jitter. Use request queuing for high-throughput applications. Different tiers have different rate limits.
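The retry-after handling above can be sketched as a delay function: prefer the server's hint when present, otherwise fall back to exponential backoff with full jitter. The parameter values are illustrative.

```python
import random

# Sketch: compute the wait before retrying a 429 response.
def delay_after_429(attempt, retry_after=None, base=1.0, cap=60.0):
    """Prefer the server's retry-after hint (in seconds); otherwise
    exponential backoff with full jitter, capped at `cap` seconds."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```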
Retry Strategy
A pattern for handling transient API failures by retrying requests with increasing delays. Exponential backoff (doubling wait time) with random jitter prevents thundering herd problems. Only retry on retryable errors (429, 500, 529).
Exam Context
Know which status codes are retryable: 429 (rate limit), 500 (server error), 529 (overloaded). Do not retry 400 (bad request) or 401 (auth). Always add jitter to prevent synchronized retries.
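Putting the pieces above together, a retry loop looks like the sketch below. `send` is a hypothetical callable standing in for an API request that returns a status code and a body.

```python
import random
import time

RETRYABLE = {429, 500, 529}  # retry these; never retry 400 or 401

# Sketch: exponential backoff with jitter around a hypothetical
# `send()` callable that returns (status_code, body).
def call_with_retries(send, max_attempts=5, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        # Full jitter: sleep a random fraction of the doubled delay,
        # preventing synchronized "thundering herd" retries.
        time.sleep(min(cap, base * 2 ** attempt) * random.random())
    return status, body
```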
Summarization
A context management technique where older conversation messages are compressed into a concise summary. The summary replaces the original messages, freeing context window space while preserving key information for continued conversation.
Exam Context
Summarization trades fidelity for space. Summarize when conversation exceeds a token threshold. Keep the summary in a dedicated system or user message. Re-summarize periodically for very long conversations.
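The threshold-based flow above can be sketched as follows. `summarize` is a hypothetical callable (in practice, a model call that compresses the old messages); the threshold and `keep_last` values are illustrative.

```python
# Sketch: replace old messages with one summary message once the
# conversation crosses a token threshold.
def summarize_if_needed(messages, token_count, threshold, summarize, keep_last=4):
    """Keep the most recent messages verbatim; compress the rest."""
    if token_count <= threshold or len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {
        "role": "user",
        "content": "[Summary of earlier conversation] " + summarize(old),
    }
    return [summary] + recent
```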
Token Counting
Measuring the number of tokens in prompts and responses to manage costs and stay within context window limits. Anthropic provides a token counting API endpoint. Accurate counting is essential for budget management and context window planning.
Exam Context
Use the token counting API for accurate estimates before sending expensive requests. Tokens are not characters or words: a token is roughly 3-4 characters for English text.
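For rough pre-flight estimates before calling the token counting API, the character heuristic above can be written as a one-liner. The 3.5 chars/token figure is an assumption (the midpoint of the 3-4 range) and applies to English text only.

```python
# Sketch: rough English-text token estimate; use the token counting
# API, not this heuristic, for real budget decisions.
def estimate_tokens(text, chars_per_token=3.5):
    return max(1, round(len(text) / chars_per_token))
```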