Task 5.2

Prompt Caching

Prompt caching reduces costs and latency by caching the processing of repeated prompt content. When the same system prompt, tools, or conversation prefix is sent across multiple requests, cached content is processed faster and at a reduced cost.

How Prompt Caching Works

When you add cache_control breakpoints to your prompt, the content up to that breakpoint is cached after the first request. Subsequent requests that include the same content prefix get a cache hit, reducing input token costs and latency.

Cached content must be identical — any change invalidates the cache. The cache is also short-lived: the default ephemeral cache expires after about five minutes of inactivity, with the timer refreshed each time the cached content is used. This is why system prompts and tool definitions (which rarely change) benefit most from caching, while conversation messages (which change every turn) benefit least.
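As a minimal sketch, a breakpoint is a `cache_control` field on a content block in the request. The model name and prompt text below are illustrative; the request is built as plain data so you can see exactly where the cached prefix ends.

```python
# Sketch of a cached system prompt for the Anthropic Messages API.
# The model name and prompt text are illustrative placeholders.
system_blocks = [
    {
        "type": "text",
        "text": "You are a support assistant for ExampleCorp. Follow policy X...",
        # Everything up to and including this block is cached after
        # the first request; later identical prefixes are cache reads.
        "cache_control": {"type": "ephemeral"},
    }
]

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": system_blocks,
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
# client.messages.create(**request) would send this request; any
# change to system_blocks invalidates the cached prefix.
```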

Cache Breakpoint Strategy

Place cache_control breakpoints strategically: after the system prompt, after tool definitions, and after any large static content. The content before the breakpoint is cached; content after it is processed fresh each time.

For agentic loops, the system prompt and tools are the same every iteration — cache them. The conversation history changes every iteration — it cannot be fully cached. This pattern means the first request is more expensive (cache write) but subsequent requests are cheaper (cache read).
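A sketch of that pattern, assuming an illustrative tool schema and model name: the breakpoint on the last tool caches the whole tool list, the system prompt gets its own breakpoint, and only the per-turn history is rebuilt.

```python
# Sketch: cache tools and system prompt in an agentic loop.
# The tool schema, prompt text, and model name are illustrative.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
]
# A breakpoint on the last tool caches the entire tool list.
tools[-1]["cache_control"] = {"type": "ephemeral"}

system_blocks = [
    {"type": "text", "text": "You are a weather agent.",
     "cache_control": {"type": "ephemeral"}},
]

def build_request(history):
    # Only `history` changes each iteration; it sits after the
    # cached prefix (tools + system), so the prefix stays cacheable.
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": tools,
        "system": system_blocks,
        "messages": history,
    }
```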

Cost Implications

Anthropic's caching pricing: cache writes cost 25% more than the base input token price, while cache reads cost 90% less. The math favors caching quickly: the 0.25x write surcharge is outweighed by the 0.9x saved on each read, so the breakeven point is roughly 0.28 reads per write — a single cache hit already more than pays for the write, and every hit after that saves money.

For applications with many requests sharing the same system prompt (chatbots, agents, batch processing), caching provides significant cost savings. For one-off requests with unique prompts, caching adds cost without benefit.
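The breakeven arithmetic above can be checked directly. This sketch expresses costs relative to the base input-token rate (write = 1.25x, read = 0.10x, per the pricing stated here); token counts are illustrative.

```python
# Break-even arithmetic for prompt caching, priced relative to the
# base input-token rate: cache write = 1.25x, cache read = 0.10x.
WRITE, READ = 1.25, 0.10

def cost_with_cache(n_requests, prefix_tokens):
    # One cache write, then n-1 cache reads of the same prefix.
    return prefix_tokens * (WRITE + READ * (n_requests - 1))

def cost_without_cache(n_requests, prefix_tokens):
    return prefix_tokens * n_requests

# With a 2000-token prefix, a single request is pricier with caching...
print(cost_with_cache(1, 2000) > cost_without_cache(1, 2000))   # True
# ...but caching is already cheaper by the second request.
print(cost_with_cache(2, 2000) < cost_without_cache(2, 2000))   # True
# Breakeven in reads per write: surcharge / per-read savings.
print((WRITE - 1.0) / (1.0 - READ))  # ~0.28
```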

Key Concept

Cache What Doesn't Change

The rule of prompt caching is simple: cache content that is identical across requests. System prompts, tool definitions, few-shot examples, and reference documentation are ideal candidates. Conversation messages that change every turn are poor candidates. The bigger and more stable the cached content, the greater the savings.

Exam Traps

EXAM TRAP

Caching content that changes frequently

Any change in cached content invalidates the cache. Caching volatile content (like conversation history) provides no benefit and adds the 25% write surcharge.

EXAM TRAP

Thinking caching reduces output token costs

Prompt caching only reduces input token costs. Output tokens are always charged at full price regardless of caching.

EXAM TRAP

Not considering the minimum cache size

There is a minimum token count for cacheable content (1024 tokens for Claude Sonnet/Opus, 2048 for Haiku). Content below this threshold cannot be cached.

Check Your Understanding

An application sends 1000 requests per hour, all using the same 2000-token system prompt and 500-token tool definitions. Conversation messages average 1000 tokens and are unique per request. Where should cache breakpoints be placed?

Build Exercise

Implement Prompt Caching

Beginner · 30 minutes

What you'll learn

  • Add cache_control breakpoints to prompts
  • Measure cache hit rates and cost savings
  • Identify optimal caching strategies
  • Understand caching limitations

Steps
  1. Add cache_control breakpoints to a system prompt and send 5 consecutive requests. Track which requests get cache hits vs. misses.

    WHY: Hands-on experience with cache breakpoints reveals how caching works in practice.

    YOU SHOULD SEE: First request is a cache miss (write). Requests 2-5 are cache hits (read) with lower token costs.

  2. Calculate the cost savings: compare total input token costs with and without caching over 100 simulated requests.

    WHY: Quantifying savings helps justify caching implementation in production systems.

    YOU SHOULD SEE: Significant savings (50%+) on input tokens when system prompt is large relative to conversation messages.

  3. Experiment with cache invalidation: change one word in the system prompt and observe how it affects cache hits.

    WHY: Understanding cache invalidation prevents unexpected cost spikes in production.

    YOU SHOULD SEE: Any change to the cached content causes a cache miss and a new cache write.

  4. Implement a monitoring dashboard that tracks cache hit rate, cost savings, and cache invalidation events.

    WHY: Monitoring ensures caching continues to provide value and alerts you to unexpected invalidations.

    YOU SHOULD SEE: A dashboard showing cache hit rate, total savings, and any invalidation events.
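For steps 1 and 4, the Messages API reports caching activity in each response's usage object (the `cache_creation_input_tokens` and `cache_read_input_tokens` fields). A minimal sketch of classifying responses from those fields; the sample usage dicts are illustrative:

```python
# Sketch: classify a response's cache behavior from the usage object
# the Messages API returns. Field names follow Anthropic's usage
# reporting; the sample values below are illustrative.
def classify_cache(usage):
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "miss (write)"
    return "uncached"

# e.g. usage dicts pulled from two consecutive responses:
samples = [
    {"cache_creation_input_tokens": 2048, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2048},
]
print([classify_cache(u) for u in samples])  # ['miss (write)', 'hit']
```

Feeding these classifications into a counter gives the hit rate a monitoring dashboard needs.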
