Claude Prompt Caching: How to Cut API Costs (2026)

If your app sends the same large system prompt, tool definitions, or document context on every request, you’re paying full price to re-process those tokens every single time. Prompt caching lets Claude reuse the processed representation of a prompt prefix across requests — cache hits cost 90% less than normal input tokens. This guide covers how caching actually works, the pricing math for writes vs reads, where to place cache breakpoints, and worked cost examples for RAG apps and agents.

Table of Contents

Prerequisites

You need the anthropic Python SDK and an API key. See the Claude API Tutorial if you’re new to the Messages API, and Claude API Pricing Explained for base per-model pricing referenced below.

pip install anthropic

How Prompt Caching Works

Caching works on prefixes. A request is processed top to bottom — system prompt, then tool definitions, then messages. You mark a cache breakpoint by adding "cache_control": {"type": "ephemeral"} to a content block. Everything from the start of the prompt up to and including that block is cached as a unit.

On the next request, if the prefix up to a breakpoint is byte-for-byte identical to a cached prefix, Claude reads the cached version instead of reprocessing it — you only pay full price for whatever comes after the last matching breakpoint. If anything before a breakpoint changes (even one character), that cache entry misses and gets rewritten.

Cached prefixes expire after 5 minutes by default, refreshed on every cache hit
An extended 1-hour TTL is available for less frequent traffic (covered below)
Each cacheable block must be at least 1024 tokens for Sonnet and Opus models, or 2048 tokens for Haiku — shorter blocks are simply not cached (no error, no discount)
You can set up to 4 cache breakpoints per request

Pricing: Cache Writes vs Cache Reads

Caching adds two new token categories on top of normal input/output pricing. Using Claude Sonnet 4.6 ($3 / $15 per 1M tokens, input/output) as an example:

Base input tokens — priced normally: $3.00 / 1M
Cache write, 5-minute TTL (default) — 1.25× base price: $3.75 / 1M
Cache write, 1-hour TTL (extended) — 2× base price: $6.00 / 1M
Cache read (cache hit) — 0.1× base price, a 90% discount: $0.30 / 1M
Output tokens — unaffected by caching: $15.00 / 1M as usual

The same multipliers (1.25×, 2×, 0.1×) apply to every model’s own base price — for Haiku 4.5 ($0.25 / $1.25 per 1M) a cache read costs $0.025 / 1M; for Opus 4.7 ($15 / $75 per 1M) it costs $1.50 / 1M.

Cost Example: A RAG Chatbot

Say your RAG chatbot sends a 10,000-token retrieved-context block as the cached prefix, and handles 100 requests per day at a steady pace (each request within 5 minutes of the last, so the cache stays warm).

Without caching: 100 requests × 10,000 tokens × $3.00/1M = $3.00/day just for the context, before output costs.

With caching: the first request writes the cache (10,000 × $3.75/1M = $0.0375). The remaining 99 requests read from cache (99 × 10,000 × $0.30/1M = $0.297). Total: ≈ $0.33/day — about an 89% reduction, and the saving compounds with every additional request per day.

Basic Example: Caching a System Prompt

The system parameter accepts either a string or a list of content blocks. To cache it, pass a list and add cache_control to the block you want cached:

from anthropic import Anthropic

client = Anthropic()

# e.g. a knowledge base, persona instructions, or few-shot examples — 8,000+ tokens
LARGE_SYSTEM_PROMPT = open("knowledge_base.md").read()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the refund policy."}],
)

print(response.usage)
# Usage(input_tokens=12, cache_creation_input_tokens=8000, cache_read_input_tokens=0, ...)

The first call writes the cache — cache_creation_input_tokens shows the full 8,000 tokens billed at the write rate. Send the same system prompt again within 5 minutes, and the response shows cache_read_input_tokens=8000, cache_creation_input_tokens=0 — those 8,000 tokens now cost 90% less.

Caching Tool Definitions

Tool schemas count as input tokens on every call, including all the multi-step calls in a tool-use loop. If you maintain a large toolset, cache it by adding cache_control to the last tool in the tools list:

tools = [
    {"name": "get_stock_price", "description": "...", "input_schema": {...}},
    {"name": "get_exchange_rate", "description": "...", "input_schema": {...}},
    # ... more tools ...
    {
        "name": "get_current_time",
        "description": "Get the current date and time in UTC.",
        "input_schema": {"type": "object", "properties": {}},
        "cache_control": {"type": "ephemeral"},
    },
]

This caches all tool definitions up to and including get_current_time as a single prefix block. Every step of a tool-use loop re-sends the same tools array, so this is one of the highest-leverage places to cache for agents.

Caching Long Documents

For document Q&A or RAG, put the document in its own content block with cache_control, and keep the actual question in a separate, uncached block after it:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>{long_document}</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": "What are the payment terms in section 4?",
                },
            ],
        }
    ],
)

Follow-up questions about the same document, sent within 5 minutes, only pay full price for the new question — the document itself is read from cache.

Extended TTL: 1-Hour Cache

The default 5-minute TTL refreshes on every cache hit, so it works well for steady traffic. For workloads with longer gaps — batch jobs, low-traffic endpoints, or overnight processing — use the 1-hour TTL instead:

{
    "type": "text",
    "text": LARGE_SYSTEM_PROMPT,
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}

The 1-hour cache costs 2× the base input price to write (vs 1.25× for 5-minute), but reads are still 0.1×. Do the math for your traffic pattern: a 1-hour cache pays for itself once a cached prefix is reused at least twice within the hour. For traffic denser than every 5 minutes, the default TTL is usually cheaper since you never pay the higher write multiplier.

Multiple Cache Breakpoints

You can set up to 4 breakpoints per request — useful when different parts of a prompt change at different rates. A common pattern for an agent:

system=[
    {
        "type": "text",
        "text": STATIC_INSTRUCTIONS,           # rarely changes — shared across all users
        "cache_control": {"type": "ephemeral"},
    },
    {
        "type": "text",
        "text": USER_PROFILE_CONTEXT,          # changes per user, reused within their session
        "cache_control": {"type": "ephemeral"},
    },
]

Claude checks for the longest matching cached prefix. Ordering blocks from least-to-most volatile means a change to USER_PROFILE_CONTEXT only invalidates that block — STATIC_INSTRUCTIONS stays cached and keeps earning its discount across every user.

Best Practices

Put stable content first (system instructions, tool definitions, shared context) and volatile content last (the current user message) — only what comes after the last cache hit is billed at full price
Don’t bother caching anything under ~1024 tokens (2048 for Haiku) — there’s no discount below the minimum
For multi-turn conversations, cache the growing message history so each new turn only pays for the new message plus the model’s reply
For agents, cache the tools array — it’s resent unchanged on every step of the tool-use loop
Track usage.cache_read_input_tokens vs usage.cache_creation_input_tokens in production — a healthy hit rate for a steady-traffic, system-prompt-heavy app should approach 90%+
Switch to the 1-hour TTL only if requests are typically more than 5 minutes apart but still reuse the same prefix within an hour

Summary

Prompt caching reuses the processed representation of a prompt prefix, marked with cache_control: {"type": "ephemeral"}
Cache reads cost 0.1× the base input price (90% off); 5-minute writes cost 1.25×, 1-hour writes cost 2×
Minimum cacheable size is 1024 tokens (2048 for Haiku) — shorter blocks aren’t cached
Default TTL is 5 minutes, refreshed on every hit; use "ttl": "1h" for sparser traffic
Up to 4 breakpoints per request — order from least to most volatile
Best targets: large system prompts, tool definitions in agent loops, RAG documents, and growing conversation history
Check response.usage.cache_read_input_tokens and cache_creation_input_tokens to verify caching is actually hitting

Further reading: Claude API Pricing Explained for base per-model rates, and Claude API Function Calling for tool-use loops that benefit most from caching tool definitions.

Subscribe to my newsletter — practical guides on Claude API, AI agents, RAG, and automation.