Task 5.1

Context Window Management

The context window is the total amount of text (measured in tokens) that Claude can process in a single API call. It includes the system prompt, conversation history, tool definitions, and the model's response.

Context Window Composition

The context window is shared between input and output. Input includes: system prompt, conversation messages (user + assistant turns), tool definitions, and tool results. Output is the model's response. Larger context windows (200K tokens for Claude) accommodate more input, but every input token sent also adds cost and latency.

In agentic loops, the context grows with each iteration because each tool call and result is added to the conversation history. Without management, context exhaustion is inevitable for long-running agents.
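To make the growth concrete, here is a minimal sketch that simulates an agentic loop appending a tool call and its result each iteration. The `estimate_tokens` heuristic (roughly 4 characters per token) and the message sizes are illustrative assumptions, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def simulate_context_growth(iterations: int, result_chars: int = 2000) -> list[int]:
    """Track cumulative context size as each tool call and result is appended."""
    total = estimate_tokens("You are a helpful agent.")  # system prompt (assumed)
    sizes = []
    for _ in range(iterations):
        total += estimate_tokens("tool_use: search(query=...)")  # assistant turn
        total += estimate_tokens("x" * result_chars)             # tool result, kept verbatim
        sizes.append(total)
    return sizes

growth = simulate_context_growth(50)
# Context only ever grows: every iteration adds tokens, none are removed.
```

The point the simulation makes: without pruning or summarization, the token count is strictly increasing, so any fixed limit will eventually be exceeded.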

Context Management Strategies

Key strategies include:

  • Conversation summarization: periodically compress the conversation history
  • Sliding window: keep only the most recent N turns
  • Selective pruning: remove tool results that are no longer relevant
  • Context-aware prompting: include only information relevant to the current step
  • Output-aware budgeting: reserve tokens for the model's response

The best strategy depends on the application. Summarization preserves important information but loses detail. Sliding window is simple but may lose critical early context. Selective pruning is most flexible but requires understanding what is relevant.
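The sliding window is the simplest strategy to sketch. The message format below (dicts with "role" and "content" keys) mirrors the Messages API; the `pinned` parameter is an assumption added here to address the "may lose critical early context" weakness by always retaining the opening message(s):

```python
def sliding_window(messages: list[dict], keep_last: int, pinned: int = 1) -> list[dict]:
    """Keep the first `pinned` messages (e.g. the original task statement)
    plus the `keep_last` most recent turns; drop everything in between."""
    if len(messages) <= pinned + keep_last:
        return messages
    return messages[:pinned] + messages[-keep_last:]

convo = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
trimmed = sliding_window(convo, keep_last=4)
# Retains turn 0 (pinned) plus turns 16-19.
```

A production version would also need to keep role alternation valid after trimming (e.g. not starting the recent window on an orphaned tool result).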

Token Counting

Token counting is essential for context management. Use the token counting API or client library to measure conversation size before sending requests. This prevents context overflow errors and allows proactive management.

A common pattern is: before each API call, count the total tokens (system prompt + history + tools), compare against the model's limit minus a reserved buffer for output, and trigger context management (summarization, pruning) if the count exceeds the threshold.
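The check described above can be sketched as a single predicate. The default values (200K limit, 8K output reserve) follow the figures used in this section; in practice the token counts would come from the token counting API rather than being passed in directly:

```python
def needs_compaction(system_tokens: int, history_tokens: int, tool_tokens: int,
                     context_limit: int = 200_000, output_reserve: int = 8_000) -> bool:
    """True when the total input would crowd out the reserved output buffer."""
    total_input = system_tokens + history_tokens + tool_tokens
    return total_input > context_limit - output_reserve

needs_compaction(2_000, 185_000, 6_000)  # 193K input vs. 192K threshold -> compact
needs_compaction(2_000, 100_000, 6_000)  # well under threshold -> proceed
```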

Key Concept

Budget for Output, Not Just Input

When managing context, remember that the model needs room to generate its response. If you fill the context window with input, the model's response will be truncated. Always reserve a buffer for output tokens — typically 4K-8K tokens for complex responses, more for code generation. Available input budget = context_window_limit - system_prompt_tokens - tool_tokens - reserved_output.
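The budget formula above translates directly into code. This is a sketch; the example numbers are illustrative:

```python
def input_budget(context_limit: int, system_tokens: int,
                 tool_tokens: int, output_reserve: int) -> int:
    """Tokens left for conversation history after fixed costs are paid."""
    budget = context_limit - system_tokens - tool_tokens - output_reserve
    if budget <= 0:
        raise ValueError("fixed costs exceed the context limit")
    return budget

input_budget(200_000, 2_000, 3_000, 8_000)  # 187,000 tokens for history
```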

Exam Traps

EXAM TRAP

Ignoring tool definitions in context budget

Tool definitions consume input tokens. With many tools, this can be significant. The exam may test whether you account for tool tokens in context calculations.
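One way to avoid this trap is to estimate tool-definition cost explicitly. The sketch below serializes the tool schemas and applies the same rough chars/4 heuristic used earlier; the actual cost depends on the model's tokenizer and how the API formats tools into the prompt, so treat this as a lower-bound approximation:

```python
import json

def tool_definition_tokens(tools: list[dict]) -> int:
    """Very rough estimate of input tokens consumed by tool definitions."""
    return len(json.dumps(tools)) // 4

# Illustrative tool schema (hypothetical example, not a real definition).
search_tool = {
    "name": "web_search",
    "description": "Search the web and return the top results.",
    "input_schema": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]},
}
# Ten such tools cost roughly ten times as many input tokens as one.
```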

EXAM TRAP

Summarizing too aggressively

Over-summarization loses important details. The model may make incorrect decisions based on incomplete summaries. Balance compression with information retention.

EXAM TRAP

Not reserving output buffer

Filling the entire context window with input leaves no room for the response. Always reserve tokens for model output.

Check Your Understanding

An agentic loop has been running for 50 iterations. The context is at 180K tokens (limit: 200K). The agent needs to continue working. What is the best approach?

Build Exercise

Implement Context Management

Intermediate · 45 minutes

What you'll learn

  • Count tokens in conversation context
  • Implement conversation summarization
  • Build a sliding window with important turn retention
  • Test context management under load

  1. Create a function that counts the total tokens in a conversation (system prompt + messages + tools). Test with conversations of various sizes.

    WHY: Token counting is the foundation of context management — you cannot manage what you do not measure.

    YOU SHOULD SEE: Accurate token counts for conversations of different sizes.

  2. Implement a summarization function: when context exceeds 80% of the limit, summarize all but the last 5 messages into a single summary message.

    WHY: Proactive summarization prevents context overflow before it happens.

    YOU SHOULD SEE: The conversation is compressed while retaining recent context and a summary of earlier content.

  3. Implement selective pruning: remove large tool results that are no longer relevant (e.g., search results that have already been processed).

    WHY: Selective pruning is more efficient than full summarization because it targets specific bloat.

    YOU SHOULD SEE: Large, irrelevant tool results are replaced with brief summaries.

  4. Test your context management by running a long agentic loop (50+ iterations) and verifying the context stays within bounds while the agent remains effective.

    WHY: End-to-end testing validates that context management works without breaking agent behavior.

    YOU SHOULD SEE: The agent runs 50+ iterations without context overflow, maintaining task progress.
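Step 2 of the exercise can be sketched as follows. The thresholds (80%, last 5 messages) come from the exercise; `count_tokens` and `summarize` are caller-supplied callables standing in for the token counting API and a summarization model call, so this is a skeleton under those assumptions:

```python
def compact_if_needed(messages: list[dict], count_tokens, summarize,
                      limit: int = 200_000, threshold: float = 0.8,
                      keep_last: int = 5) -> list[dict]:
    """When context exceeds `threshold` of `limit`, collapse all but the
    last `keep_last` messages into a single summary message."""
    if count_tokens(messages) <= limit * threshold or len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "user",
               "content": "[Summary of earlier conversation] " + summarize(older)}
    return [summary] + recent
```

For testing (step 4), stub both callables so the loop is deterministic, then swap in real implementations and verify the agent still completes its task after compaction.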
