Failover and Redundancy
Production Claude applications should have failover strategies for API outages. Options include: multi-region API endpoints (if available), fallback to alternative models (e.g., Claude Sonnet as fallback for Opus), cached responses for common queries, and graceful degradation to non-AI functionality.
The Claude API is available through multiple providers (direct, AWS Bedrock, Google Vertex AI). Using multiple providers creates redundancy — if one provider experiences issues, traffic can be routed to another.
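The routing idea can be sketched as a chain of provider callables tried in order. Everything below is a hypothetical stand-in with a uniform interface, not the real Anthropic, Bedrock, or Vertex SDKs — in practice each entry would wrap the corresponding client.

```python
# Minimal sketch of multi-provider failover. Each provider is a (name, fn)
# pair; on failure we record the error and fall through to the next one.
class ProviderError(Exception):
    pass

def call_with_provider_failover(prompt, providers):
    """Try each (name, call_fn) provider in order; return the first success."""
    errors = {}
    for name, call_fn in providers:
        try:
            return name, call_fn(prompt)
        except ProviderError as exc:
            errors[name] = str(exc)  # record the failure, try the next provider
    raise RuntimeError(f"All providers failed: {errors}")

# Fake providers for illustration: the direct API fails, Bedrock succeeds.
def direct_api(prompt):
    raise ProviderError("503 from direct API")

def bedrock(prompt):
    return f"answer to: {prompt}"

provider_chain = [("direct", direct_api), ("bedrock", bedrock)]
```

A real chain would also track consecutive failures per provider (a circuit breaker) so a known-down provider is skipped rather than retried on every request.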
Deployment Best Practices
Deploy AI application changes carefully. Prompt changes can have unpredictable effects — always test against your evaluation set before deploying. Use staged rollouts (deploy to 5% of traffic, monitor, then expand). Keep rollback capability for prompt changes, not just code changes.
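A staged rollout needs a stable way to decide which users see the new prompt. One common approach, sketched here under the assumption that users have stable IDs, is deterministic hashing: each user lands in the same bucket on every request, and raising the percentage only ever adds users to the cohort.

```python
import hashlib

def in_rollout(user_id: str, percent: int, salt: str = "prompt-v2") -> bool:
    """Deterministically bucket a user into [0, 100) and compare to percent.

    Hashing (salt + user_id) gives a stable assignment, so a user stays in
    the same cohort across requests; changing the salt reshuffles cohorts.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# A 5% rollout admits roughly 5 of every 100 users.
admitted = sum(in_rollout(f"user-{i}", 5) for i in range(1000))
```

The salt (`"prompt-v2"` here is an arbitrary example) ties the cohort to a specific change, so successive prompt rollouts don't always hit the same unlucky users first.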
Version your prompts alongside your code. A prompt change is as impactful as a code change and should go through the same review and deployment process.
Incident Response for AI Systems
AI system incidents differ from traditional software incidents. A model can produce incorrect or harmful output without throwing any errors. Incident detection requires quality monitoring, not just error monitoring.
Incident response playbooks for AI systems should include: how to quickly switch to a safer model or disable AI features, how to review recent model outputs for quality issues, how to roll back prompt changes, and how to communicate AI-related incidents to users.
Key Concept
AI Reliability Requires Quality Monitoring, Not Just Uptime
A Claude application can be 'up' (no errors, fast responses) while simultaneously producing low-quality or harmful output. Traditional reliability metrics (uptime, error rate) are necessary but insufficient. Production reliability requires continuous quality monitoring: checking that the model's outputs remain accurate, relevant, and safe over time.
Exam Traps
Equating uptime with reliability
An AI system can have 100% uptime while producing increasingly poor results. Reliability includes output quality, not just availability.
Not versioning prompts
Prompt changes can break application behavior. Treating prompts as code (versioned, reviewed, staged) prevents production issues.
No fallback for API outages
The Claude API can experience outages. Production systems need fallback strategies — alternative models, cached responses, or graceful degradation to non-AI functionality.
Check Your Understanding
Your Claude-powered application suddenly starts producing lower-quality responses. The API is responding normally — no errors, normal latency. What is the most likely cause and appropriate response?
Build Exercise
Build a Production Reliability System
What you'll learn
- Implement model failover
- Build quality monitoring
- Create a prompt versioning system
- Design an incident response playbook
Implement a model failover system: a primary model (Sonnet) with automatic fallback to a secondary model (Haiku) if the primary fails or exceeds latency thresholds.
WHY: Failover ensures your application continues functioning during model or provider issues.
YOU SHOULD SEE: When the primary model fails, requests automatically route to the fallback model.
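One possible shape for this step, assuming a `call_model(model, prompt)` wrapper around your actual API client (the wrapper and the bare model names are placeholders, not real SDK identifiers):

```python
import time

def with_failover(prompt, call_model, primary="sonnet", fallback="haiku",
                  max_latency_s=2.0):
    """Call the primary model; on error or excessive latency, use the fallback.

    Returns (route, result) so callers can log which path served the request.
    """
    try:
        start = time.monotonic()
        result = call_model(primary, prompt)
        if time.monotonic() - start <= max_latency_s:
            return primary, result
        # Primary answered but too slowly: treat as a failure for this request.
    except Exception:
        pass  # primary errored; fall through to the fallback
    return fallback, call_model(fallback, prompt)

# Fake caller for illustration: the primary fails, the fallback succeeds.
def fake_call(model, prompt):
    if model == "sonnet":
        raise RuntimeError("overloaded")
    return f"{model}: {prompt}"
```

Returning the route label alongside the result makes fallback usage visible in metrics, which is how you'd notice a primary that is quietly degraded.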
Implement a quality monitor: for every Nth response, run a model-based evaluation that scores quality (relevance, accuracy, format). Alert if quality drops below threshold.
WHY: Quality monitoring catches degradation that error monitoring misses.
YOU SHOULD SEE: A quality score trend that alerts when scores drop below the threshold.
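A minimal sketch of the sampling-and-alerting loop. The evaluator is injected as `score_fn` (in production this would be a model-based grader returning a 0.0-1.0 score); the class name and parameters are illustrative, not from any library.

```python
import statistics
from collections import deque

class QualityMonitor:
    """Score every Nth response and alert when the rolling mean drops.

    score_fn(prompt, response) -> float in [0.0, 1.0] is assumed to be a
    model-based evaluator; it is injected so the sketch stays self-contained.
    """
    def __init__(self, score_fn, sample_every=10, window=20, threshold=0.7):
        self.score_fn = score_fn
        self.sample_every = sample_every
        self.window = deque(maxlen=window)  # rolling window of recent scores
        self.threshold = threshold
        self.seen = 0

    def observe(self, prompt, response):
        self.seen += 1
        if self.seen % self.sample_every != 0:
            return None  # not sampled; evaluation is too costly to run on all
        score = self.score_fn(prompt, response)
        self.window.append(score)
        mean = statistics.mean(self.window)
        return {"score": score, "rolling_mean": mean,
                "alert": mean < self.threshold}
```

Alerting on the rolling mean rather than individual scores avoids paging on one bad response while still catching sustained drift.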
Create a prompt versioning system: store prompts in files with version numbers. Support switching between versions and rolling back.
WHY: Prompt versioning enables safe changes and quick rollbacks.
YOU SHOULD SEE: Prompts are loaded from versioned files. Version switches are instant.
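One way to structure the versioned files, assuming a naming scheme like `prompts/summarize.v3.txt` (the scheme and class below are illustrative choices, not a standard):

```python
from pathlib import Path

class PromptStore:
    """Load prompts from versioned files named <task>.v<N>.txt.

    Rollback is just pointing the active version at the previous file, so
    switching versions is instant and needs no redeploy.
    """
    def __init__(self, root):
        self.root = Path(root)
        self.active = {}  # task name -> currently active version number

    def versions(self, task):
        """All version numbers on disk for a task, ascending."""
        return sorted(int(p.suffixes[0][2:])          # '.v3' -> 3
                      for p in self.root.glob(f"{task}.v*.txt"))

    def load(self, task, version=None):
        """Load a specific version, or the active/latest one."""
        version = version or self.active.get(task) or self.versions(task)[-1]
        self.active[task] = version
        return (self.root / f"{task}.v{version}.txt").read_text()

    def rollback(self, task):
        """Step the active version back to the previous one and load it."""
        older = [v for v in self.versions(task) if v < self.active[task]]
        self.active[task] = older[-1]
        return self.load(task)
```

Because the prompts live in files, they travel through the same review and deployment pipeline as code, which is exactly the practice the section recommends.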
Write an incident response playbook for AI quality incidents. Include detection, triage, mitigation, and communication steps.
WHY: A playbook ensures consistent, fast response when AI quality issues arise.
YOU SHOULD SEE: A structured playbook with clear steps for each phase of incident response.
Sources
- Error Handling — Anthropic Documentation
- Claude on AWS Bedrock — Anthropic Documentation
- Claude on Google Vertex AI — Anthropic Documentation