Context Management & Reliability
Key terms and definitions for Domain 5. Each entry includes a concise definition, exam context, and links to the relevant lesson.
cache_control
An API parameter that marks content as cacheable for prompt caching. Placed as a property on message content blocks with type: "ephemeral". Content up to and including the marked block is cached for reuse in subsequent requests.
Exam Context
Place cache_control breakpoints strategically: after the system prompt and after tool definitions. Up to 4 breakpoints allowed. Cache TTL is 5 minutes (refreshed on each hit).
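The breakpoint placement described above can be sketched as follows. This is a minimal illustration of the request shape, assuming the Messages API structure; the prompt text and tool definition are hypothetical placeholders.

```python
# Sketch: marking the system prompt and the last tool definition as
# cacheable with cache_control breakpoints (hypothetical content).
def build_request(system_text, tools, messages):
    """Return request kwargs with two cache_control breakpoints."""
    system = [{
        "type": "text",
        "text": system_text,
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: after system prompt
    }]
    if tools:
        tools = [dict(t) for t in tools]  # avoid mutating the caller's list
        tools[-1]["cache_control"] = {"type": "ephemeral"}  # breakpoint 2: after tools
    return {"system": system, "tools": tools, "messages": messages}

req = build_request(
    "You are a support assistant.",
    [{"name": "lookup_order", "description": "Find an order",
      "input_schema": {"type": "object"}}],
    [{"role": "user", "content": "Where is my order?"}],
)
```

Because the cache covers everything up to and including each marked block, placing breakpoints after the stable prefix (system prompt, tools) lets the variable conversation history follow without invalidating the cache.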
Context Window
The maximum number of tokens Claude can process in a single API call, including both input and output. Claude's context window size varies by model. All content (system prompt, conversation history, tool definitions, and response) must fit within this limit.
Exam Context
Know that the context window is shared between input and output. Requests that extend into very long contexts may be billed at higher per-token rates. Plan token budgets: system prompt + tools + history + expected output < window size.
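The budget rule above can be expressed as a pre-flight check. A minimal sketch, assuming the token counts come from elsewhere (e.g. a token-counting endpoint); the 200K default is illustrative.

```python
# Sketch: verify a planned request fits the context window before sending.
def fits_in_window(system_tokens, tool_tokens, history_tokens,
                   max_output_tokens, window_size=200_000):
    """Input plus expected output must fit within the context window.

    Returns (fits, remaining_headroom_in_tokens).
    """
    used = system_tokens + tool_tokens + history_tokens + max_output_tokens
    return used <= window_size, window_size - used

ok, headroom = fits_in_window(1_200, 3_000, 150_000, 8_192)
```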
Long Conversation
A conversation that approaches or exceeds the context window limit. Requires management strategies such as summarization, sliding window truncation, or context compression to continue meaningfully without losing critical information.
Exam Context
Know the three main strategies: summarization (compress old messages), truncation (drop oldest messages), and selective retention (keep important messages, compress others). Each has tradeoffs.
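The truncation strategy, with a nod to selective retention, can be sketched as a sliding window that always preserves the first message (often a task brief). A minimal illustration, not a complete context-management policy.

```python
# Sketch: sliding-window truncation that keeps the first message
# (selective retention of one "important" message) plus the most recent.
def truncate_sliding_window(messages, max_messages):
    """Drop the oldest messages when over budget, keeping the first."""
    if len(messages) <= max_messages:
        return messages
    return [messages[0]] + messages[-(max_messages - 1):]
```

The tradeoff named above applies: everything between the first and the retained tail is lost outright, which is why summarization is often layered on top.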
Monitoring
Observing and measuring the behavior of Claude-powered systems in production. Key metrics include latency, token usage, error rates, tool call patterns, and output quality scores. Essential for maintaining reliability and controlling costs.
Exam Context
Track: latency (time-to-first-token, total), cost (input/output tokens), errors (rate, type), and quality (user feedback, automated evals). Set alerts on anomalies.
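The metrics listed above can be collected with a small in-process recorder. A sketch only; production systems would export these to a real metrics backend.

```python
# Sketch: minimal in-process collection of per-call metrics
# (latency, token usage, errors) for a Claude-powered service.
class CallMetrics:
    def __init__(self):
        self.calls = []

    def record(self, latency_s, input_tokens, output_tokens, error=None):
        self.calls.append({
            "latency_s": latency_s,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "error": error,
        })

    def error_rate(self):
        """Fraction of recorded calls that failed."""
        if not self.calls:
            return 0.0
        return sum(1 for c in self.calls if c["error"]) / len(self.calls)
```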
Production Reliability
Patterns for building robust Claude-powered systems: retry logic with exponential backoff, circuit breakers, fallback models, graceful degradation, health checks, and deployment strategies like canary releases.
Exam Context
Know the retry pattern (exponential backoff + jitter), circuit breaker pattern (fail fast after threshold), and fallback pattern (switch to simpler model or cached response).
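The circuit breaker pattern named above can be reduced to a few lines. A deliberately minimal sketch: it tracks consecutive failures and opens past a threshold, omitting the half-open/recovery-timer state a production breaker would have.

```python
# Sketch: fail-fast circuit breaker (no half-open timer logic).
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        """When open, callers should fail fast (or use a fallback)."""
        return self.failures >= self.threshold

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
```

When the breaker is open, the fallback pattern takes over: route to a simpler model or serve a cached response instead of hammering a failing dependency.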
Prompt Caching
An API feature that caches repeated prompt prefixes to reduce cost and latency on subsequent calls. Marked with cache_control breakpoints. Cached content is reused when the prefix matches exactly, reducing input token costs by up to 90%.
Exam Context
Know the cache_control breakpoint placement strategy. Content up to and including the breakpoint is cached. Any change to cached content invalidates the cache. Place breakpoints after stable content (system prompt, tool definitions).
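The exact-match requirement can be illustrated by fingerprinting the cached prefix. This is illustrative only, not the API's actual mechanism: any byte-level change to the prefix yields a different fingerprint, i.e. a cache miss.

```python
import hashlib
import json

# Sketch: a one-character edit to the system prompt changes the
# prefix fingerprint, illustrating exact-match cache invalidation.
def prefix_fingerprint(system, tools):
    blob = json.dumps({"system": system, "tools": tools}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

a = prefix_fingerprint("You are helpful.", [{"name": "search"}])
b = prefix_fingerprint("You are helpful!", [{"name": "search"}])  # one char changed
```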
Rate Limiting
API-enforced limits on request frequency and token throughput. Claude's API has limits on requests per minute (RPM) and tokens per minute (TPM). Exceeding limits returns 429 status codes. Managed with queuing, backoff, and request batching.
Exam Context
Know the HTTP 429 response and retry-after header. Implement exponential backoff with jitter. Use request queuing for high-throughput applications. Different tiers have different rate limits.
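The retry-after handling above can be sketched as a delay function: prefer the server's hint when present, otherwise fall back to exponential backoff with full jitter. The parameter values are illustrative.

```python
import random

# Sketch: compute the wait before retrying a 429 response.
def delay_after_429(attempt, retry_after=None, base=1.0, cap=60.0):
    """Prefer the server's retry-after hint (in seconds); otherwise
    exponential backoff with full jitter, capped at `cap` seconds."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```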
Retry Strategy
A pattern for handling transient API failures by retrying requests with increasing delays. Exponential backoff (doubling wait time) with random jitter prevents thundering herd problems. Only retry on retryable errors (429, 500, 529).
Exam Context
Know which status codes are retryable: 429 (rate limit), 500 (server error), 529 (overloaded). Do not retry 400 (bad request) or 401 (auth). Always add jitter to prevent synchronized retries.
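Putting the pieces above together, a retry loop looks like the sketch below. `send` is a hypothetical callable standing in for an API request that returns a status code and a body.

```python
import random
import time

RETRYABLE = {429, 500, 529}  # retry these; never retry 400 or 401

# Sketch: exponential backoff with jitter around a hypothetical
# `send()` callable that returns (status_code, body).
def call_with_retries(send, max_attempts=5, base=1.0, cap=30.0):
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        # Full jitter: sleep a random fraction of the doubled delay,
        # preventing synchronized "thundering herd" retries.
        time.sleep(min(cap, base * 2 ** attempt) * random.random())
    return status, body
```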
Summarization
A context management technique where older conversation messages are compressed into a concise summary. The summary replaces the original messages, freeing context window space while preserving key information for continued conversation.
Exam Context
Summarization trades fidelity for space. Summarize when conversation exceeds a token threshold. Keep the summary in a dedicated system or user message. Re-summarize periodically for very long conversations.
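The threshold-based flow above can be sketched as follows. `summarize` is a hypothetical callable (in practice, a model call that compresses the old messages); the threshold and `keep_last` values are illustrative.

```python
# Sketch: replace old messages with one summary message once the
# conversation crosses a token threshold.
def summarize_if_needed(messages, token_count, threshold, summarize, keep_last=4):
    """Keep the most recent messages verbatim; compress the rest."""
    if token_count <= threshold or len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {
        "role": "user",
        "content": "[Summary of earlier conversation] " + summarize(old),
    }
    return [summary] + recent
```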
Token Counting
Measuring the number of tokens in prompts and responses to manage costs and stay within context window limits. Anthropic provides a token counting API endpoint. Accurate counting is essential for budget management and context window planning.
Exam Context
Use the token counting API for accurate estimates before sending expensive requests. Tokens are not characters or words: a token is roughly 3-4 characters for English text.
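For rough pre-flight estimates before calling the token counting API, the character heuristic above can be written as a one-liner. The 3.5 chars/token figure is an assumption (the midpoint of the 3-4 range) and applies to English text only.

```python
# Sketch: rough English-text token estimate; use the token counting
# API, not this heuristic, for real budget decisions.
def estimate_tokens(text, chars_per_token=3.5):
    return max(1, round(len(text) / chars_per_token))
```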