Task 1.3

Guardrails & Safety

Guardrails are programmatic checks that constrain an agent's behavior to prevent harmful, off-topic, or unintended actions. In Claude-powered systems, guardrails operate at multiple layers: input validation, output filtering, tool-level permissions, and system-level constraints.

Input Guardrails

Input guardrails validate and sanitize user inputs before they reach the model. Common techniques include content moderation (screening for harmful content), prompt injection detection (identifying attempts to override system instructions), and input validation (checking format, length, and type constraints).

For Claude-based systems, a lightweight classifier model can screen inputs before they are processed by the main agent. This adds latency but prevents the agent from acting on malicious or irrelevant inputs.
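As an illustration, a minimal input screen might combine a length check with pattern matching. The patterns below are hypothetical examples; a production system would pair a broader, maintained pattern list with a dedicated lightweight classifier model.

```python
import re

# Hypothetical injection patterns -- illustrative only, not a complete list.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"reveal (the )?system prompt",
]

MAX_INPUT_CHARS = 4000  # assumed limit for this sketch

def screen_input(text: str) -> tuple[bool, str]:
    """Return (allowed, reason). Runs before input reaches the main agent."""
    if len(text) > MAX_INPUT_CHARS:
        return False, "input exceeds length limit"
    lowered = text.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, f"matched injection pattern: {pattern}"
    return True, "ok"
```

Because this check runs before the model call, a rejected input costs only a regex pass rather than a full inference round trip.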

Output Guardrails

Output guardrails validate the model's responses before they reach the user or trigger actions. This includes checking for harmful content, verifying factual claims, ensuring responses stay within scope, and validating structured output format.

A critical output guardrail for agentic systems is tool call validation. Before executing a tool call, your application should verify that the parameters are within expected bounds, the tool is appropriate for the current context, and the action won't cause irreversible harm without confirmation.
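A sketch of such a check, assuming a sandbox root of /tmp/sandbox and a hypothetical delete_file tool that requires explicit confirmation before it can run:

```python
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/sandbox").resolve()  # assumed sandbox for this sketch
DESTRUCTIVE_TOOLS = {"delete_file"}            # tools needing confirmation

def validate_tool_call(tool_name: str, params: dict,
                       confirmed: bool = False) -> tuple[bool, str]:
    """Check a proposed tool call before executing it."""
    path = params.get("path")
    if path is not None:
        # Resolve to catch ../ traversal before comparing against the sandbox.
        resolved = Path(path).resolve()
        if not resolved.is_relative_to(ALLOWED_ROOT):
            return False, f"path {resolved} is outside the sandbox"
    if tool_name in DESTRUCTIVE_TOOLS and not confirmed:
        return False, "destructive action requires explicit confirmation"
    return True, "ok"
```

Note that the path is resolved before the containment check; comparing the raw string would let a traversal like `/tmp/sandbox/../../etc/passwd` slip through.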

The Guardrail Stack

Production systems typically implement guardrails at multiple layers: system prompt instructions (soft guardrails), input classification and filtering (pre-processing), tool parameter validation (execution time), output validation and filtering (post-processing), and rate limiting and usage controls (system level).

The system prompt is the weakest guardrail because it can be overridden by prompt injection. Hard-coded programmatic checks are the strongest because they cannot be bypassed by model behavior.
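The layering can be sketched as a simple dispatch function; `agent` and the guard callables here are placeholders for whatever the application actually uses, not a real API:

```python
def handle_request(user_input, agent, input_guards, output_guards):
    """Run a request through stacked guardrail layers (sketch).

    input_guards:  callables returning (allowed, reason); any rejection
                   stops the request before it reaches the model.
    output_guards: callables that may rewrite or redact the response.
    """
    for guard in input_guards:
        allowed, reason = guard(user_input)
        if not allowed:
            return f"Request blocked: {reason}"
    response = agent(user_input)  # model call (placeholder)
    for guard in output_guards:
        response = guard(response)
    return response
```

Checks wired in like this run regardless of what the model generates, which is exactly what makes them stronger than system-prompt instructions.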

Key Concept

Defense in Depth

Never rely on a single guardrail layer. System prompt instructions can be circumvented by prompt injection. Model-based classifiers can be fooled by adversarial inputs. Only hard-coded programmatic checks (input validation, output filtering, tool parameter bounds) provide reliable protection. Layer multiple guardrail types for robust safety.

Exam Traps

EXAM TRAP

Treating system prompt instructions as reliable guardrails

System prompts are soft guardrails. They can be overridden by sophisticated prompt injection. The exam expects you to know that programmatic checks are needed in addition to prompt-level instructions.

EXAM TRAP

Applying guardrails only to user input

Guardrails must be applied to both inputs AND outputs. The model can generate harmful content or make dangerous tool calls even from benign inputs.

EXAM TRAP

Forgetting tool-level guardrails

In agentic systems, the most dangerous actions happen through tools. Tool parameter validation and confirmation steps for destructive actions are essential guardrails.

EXAM TRAP

Over-constraining the agent

Guardrails that are too restrictive prevent the agent from completing legitimate tasks. The exam may test your ability to balance safety with utility.

Check Your Understanding

An agent has access to a delete_file tool. A user asks the agent to clean up temporary files. Which guardrail approach is most appropriate?

Build Exercise

Build a Guardrail Pipeline

Intermediate · 45 minutes

What you'll learn

  • Implement input validation guardrails
  • Add tool parameter checking
  • Build output validation
  • Test guardrails against adversarial inputs

Steps

  1. Create a function that validates user input: check length limits, screen for common prompt injection patterns, and classify intent.

    WHY: Input guardrails are the first line of defense against misuse.

    YOU SHOULD SEE: The function rejects inputs that are too long, contain injection patterns, or have malicious intent.

  2. Create a tool wrapper that validates parameters before execution. For a file_write tool, check that the path is within an allowed directory.

    WHY: Tool-level guardrails prevent the agent from taking dangerous actions even when the model's judgment fails.

    YOU SHOULD SEE: The wrapper allows writes to /tmp/sandbox/ but rejects writes to /etc/ or /home/.

  3. Create an output validator that checks the model's response for PII patterns (emails, phone numbers, SSNs) and redacts them.

    WHY: Output guardrails prevent accidental data leakage even when the model doesn't realize it's sharing sensitive data.

    YOU SHOULD SEE: The validator detects and redacts email addresses, phone numbers, and SSN patterns in model output.

  4. Test your guardrail pipeline with adversarial inputs: prompt injections, path traversal attempts, and inputs designed to elicit PII.

    WHY: Guardrails that haven't been tested against adversarial inputs provide false security.

    YOU SHOULD SEE: All adversarial inputs are caught and handled appropriately by the pipeline.
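Step 3's output validator might look like the following sketch. The regexes are simplified illustrations; real PII detection needs far broader coverage (international phone formats, context-aware matching, and so on).

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a [REDACTED-<TYPE>] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label}]", text)
    return text
```

For step 4, the same adversarial spirit applies here: feed the redactor outputs that embed PII in unusual formatting to see which variants the simple patterns miss.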
