Deep Dive
Prompt Caching: The Optimization Most AI Developers Miss
Your LLM is doing the same work over and over. Here is how to make it stop.
Every time you send a prompt to an LLM, the model processes your entire input from scratch. Your carefully crafted system prompt with detailed instructions? Processed again. Those tool definitions for your AI agent? Processed again. The 50-page document you uploaded for Q&A? Processed again. Every single request.
This is wildly inefficient. If you are building any production LLM application with repeated context, you are likely burning money on redundant computation.
One developer shared that they dropped their monthly API bill from $720 to $72 after implementing a simple change. Another reported going from $8,000 per month to $800. The technique they used is called prompt caching, and although it has been available from major providers for over a year, adoption remains surprisingly low.
What Exactly Is Prompt Caching?
To understand prompt caching, you need to understand what happens when an LLM processes your prompt.
When you send text to a model like Claude or GPT, the model does not just "read" your prompt. It performs complex mathematical operations on each token, building internal representations called key-value tensors. These tensors capture how each token in your prompt relates to every other token. This is the "attention" mechanism that makes transformers work.
Computing these tensors is expensive. For a long prompt, this computation can take significant time and resources. And here is the problem: if you send the same system prompt with a different user question, the model recomputes those tensors for the entire system prompt from scratch.
Prompt caching solves this by storing the computed tensors after the first request. When you send another request that starts with the same prefix (the beginning portion of your prompt), the model retrieves the stored tensors instead of recomputing them. It then only needs to process the new, dynamic part of your prompt.
A critical distinction: This is not caching the response. The model still generates a fresh, unique output for each request. What gets cached is the computational work of understanding your prompt prefix. The quality and variability of responses remain unchanged.
How Prompt Caching Works

⚡ Cache check: hash the prompt prefix and look up stored KV tensors.
✔ Cache hit: load the stored tensors and process only the new tokens. 90% cheaper, 85% faster.
✘ Cache miss: compute all KV tensors and store them for future requests. Full price, full latency.
Generate response: the output is identical either way.
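Conceptually, the mechanism is plain memoization keyed on the prompt prefix. The toy sketch below is not how any provider actually implements it (real systems store attention key-value tensors on the serving infrastructure), but it shows why only an exact prefix match lets earlier work be reused:

```python
import hashlib

kv_cache: dict[str, str] = {}  # prefix hash -> precomputed state for that prefix

def expensive_prefill(text: str) -> str:
    """Stand-in for the costly attention computation over a span of tokens."""
    return f"kv-tensors-for-{len(text)}-chars"

def process_prompt(static_prefix: str, dynamic_suffix: str) -> None:
    key = hashlib.sha256(static_prefix.encode()).hexdigest()
    if key in kv_cache:                      # cache hit: reuse earlier work
        print("cache hit: only the suffix gets processed")
    else:                                    # cache miss: pay full price once
        kv_cache[key] = expensive_prefill(static_prefix)
        print("cache miss: full prefix processed and stored")
    expensive_prefill(dynamic_suffix)        # the dynamic part is always fresh

process_prompt("SYSTEM PROMPT " * 500, "What is the refund policy?")   # miss
process_prompt("SYSTEM PROMPT " * 500, "Do you ship internationally?") # hit
```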
Why This Matters for Your Applications
The impact is twofold: cost and latency.
Cost savings: Anthropic charges only 10% of the normal input token price for cache reads, a 90% reduction. OpenAI offers 50% off, and Google 75%. If you have a 20,000 token system prompt and make 1,000 requests per day, you go from paying full price on 20 million input tokens daily to paying full price once for the 20,000-token cache write, then the discounted read rate on every request after that.
Latency reduction: Because the model skips the computation for cached tokens, time-to-first-token drops dramatically. Anthropic reports up to 85% latency reduction for long prompts. In practice, this means your chatbot responds faster, your code assistant feels snappier, and your document Q&A system handles queries more responsively.
Consider a real scenario: You build a customer support bot with a 15,000 token system prompt containing company policies, product information, and response guidelines. Without caching, every customer interaction processes those 15,000 tokens. With 500 conversations per day, that is 7.5 million tokens daily just for the system prompt. With caching, you pay full price once, then 90% less for every subsequent conversation.
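Here is a back-of-the-envelope version of that scenario in code. The $3 per million input tokens price is an assumed figure for illustration (plug in your model's actual rate), the multipliers are Anthropic-style, and the sketch assumes the cache never expires between conversations:

```python
PRICE_PER_MTOK = 3.00      # USD per million input tokens (assumed for illustration)
CACHE_WRITE_MULT = 1.25    # Anthropic-style cache write premium
CACHE_READ_MULT = 0.10     # Anthropic-style cache read discount

system_prompt_tokens = 15_000
conversations_per_day = 500

def cost(tokens: int, multiplier: float = 1.0) -> float:
    return tokens / 1_000_000 * PRICE_PER_MTOK * multiplier

no_cache = cost(system_prompt_tokens) * conversations_per_day
with_cache = (
    cost(system_prompt_tokens, CACHE_WRITE_MULT)                                  # first write
    + cost(system_prompt_tokens, CACHE_READ_MULT) * (conversations_per_day - 1)   # cached reads
)

print(f"system prompt cost without caching: ${no_cache:.2f}/day")   # ~$22.50
print(f"system prompt cost with caching:    ${with_cache:.2f}/day")  # ~$2.30
```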
How Each Provider Implements It
Each major provider handles prompt caching differently. Understanding these differences helps you choose the right approach for your use case.
Anthropic (Claude)
Anthropic gives you explicit control. You add a cache_control parameter to mark exactly where your cacheable prefix ends. Cache writes cost 1.25x the base input price (a 25% premium), but cache reads cost only 0.1x (90% savings). The cache lasts 5 minutes by default, refreshing each time it is used. You can also pay for 1-hour cache duration at 2x the base price for the initial write. Minimum cacheable size is 1,024 tokens for most models.
OpenAI (GPT)
OpenAI makes it automatic. No code changes required. If your prompt exceeds 1,024 tokens, OpenAI automatically attempts to cache and reuse matching prefixes. Cache reads cost 50% of normal input pricing. There is no premium for cache writes. The cache typically lasts 5-60 minutes depending on usage patterns. The tradeoff is less control but zero implementation effort.
Google (Gemini)
Google calls it "context caching" and offers the most flexibility. You explicitly create a cache object with a configurable time-to-live that can extend from minutes to weeks. Pricing is storage-based rather than per-write. Cache reads cost 25% of normal input pricing. Gemini 2.5 models also support implicit caching similar to OpenAI. This hybrid approach suits applications with varying cache lifetime needs.
The Golden Rule: Static First, Dynamic Last
Prompt caching relies on exact prefix matching. This is not fuzzy matching or semantic similarity. If even a single character differs between your current prompt prefix and what is in the cache, you get a cache miss and pay full price.
This requirement dictates how you must structure your prompts. Everything static goes at the beginning, everything dynamic goes at the end.
Optimal Prompt Structure for Caching

1. Tool Definitions: function schemas, API specs
2. System Prompt: instructions, persona, rules, constraints
3. Reference Documents: knowledge base, manuals, code
4. Few-Shot Examples: input-output pairs for guidance
5. User Query: the actual question or task

Layers 1 through 4 are CACHED: computed once and reused across all requests at 0.1x cost per read. Layer 5 is DYNAMIC: processed fresh on each request at full (1.0x) cost.
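As a minimal, provider-agnostic sketch of that ordering (the content strings are placeholders and the chat-message format is purely illustrative), the idea is to assemble one byte-for-byte identical static prefix and append only the user's query per request:

```python
# Placeholder content; the point is the ordering, not the specific text.
TOOL_DEFINITIONS = "...function schemas, serialized deterministically..."
SYSTEM_INSTRUCTIONS = "...persona, rules, constraints..."
REFERENCE_DOCUMENTS = "...knowledge base, manuals, code..."
FEW_SHOT_EXAMPLES = "...input-output example pairs..."

# Identical on every call, so it is cacheable as a prefix.
STATIC_PREFIX = "\n\n".join(
    [TOOL_DEFINITIONS, SYSTEM_INSTRUCTIONS, REFERENCE_DOCUMENTS, FEW_SHOT_EXAMPLES]
)

def build_messages(user_query: str) -> list[dict]:
    """Only the final user message changes between requests."""
    return [
        {"role": "system", "content": STATIC_PREFIX},
        {"role": "user", "content": user_query},  # dynamic part, always last
    ]
```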
Where Prompt Caching Delivers the Most Value
Conversational agents and chatbots: Your system prompt defines the bot's personality, capabilities, and constraints. This prompt is identical for every user conversation. Cache it once, then every turn in every conversation benefits from reduced cost and latency. For high-volume customer service bots, this alone can cut costs by an order of magnitude.
Coding assistants: Tools like Cursor and GitHub Copilot work with large codebases. By caching the codebase summary or relevant file contents, each autocomplete suggestion and code question can reference the full context without reprocessing it. This is why modern AI coding tools feel fast despite working with massive contexts.
Document Q&A and RAG systems: When users ask questions about a document, the document itself stays constant while questions vary. Upload a 200-page manual once, cache it, and answer unlimited questions at a fraction of the cost. One enterprise team reported handling 50,000 document queries per month for $8,000 instead of $45,000.
Agentic workflows: AI agents that use tools make multiple API calls per task. Each call typically includes the same tool definitions and instructions. Caching this context means each step in a multi-step workflow runs faster and cheaper. For agents making 10-20 tool calls per task, the savings compound significantly.
Detailed instruction sets: Some applications benefit from extensive examples in the prompt. Without caching, including 20+ few-shot examples is prohibitively expensive. With caching, you can load the model with comprehensive examples that dramatically improve output quality without proportional cost increases.
Common Mistakes That Break Caching
Putting dynamic content at the start: This is the most common mistake. If you include timestamps, user IDs, session tokens, or any request-specific data at the beginning of your prompt, every single request will miss the cache. Move all dynamic content to the end.
Inconsistent formatting: Remember, prefix matching is exact. If your JSON tool definitions have different key ordering between requests, or if whitespace varies, the cache will miss. Use deterministic serialization. Sort your JSON keys. Trim whitespace consistently.
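If your tool schemas or other prompt data pass through Python dictionaries on the way into the prompt, deterministic serialization keeps the prefix byte-identical across requests. A small sketch (the get_weather schema is just an example):

```python
import json

tool_schema = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}

# Sorted keys and fixed separators mean the same dict always produces the same
# bytes, so the cached prefix keeps matching from one request to the next.
canonical = json.dumps(tool_schema, sort_keys=True, separators=(",", ":"))
```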
Prompts that are too short: Most providers require a minimum of 1,024 tokens for caching. If your system prompt is only 500 tokens, it will not cache regardless of your settings. Either expand your prompt with useful context or accept that caching will not help for this use case.
Infrequent requests: Caches expire. Anthropic's default is 5 minutes, OpenAI's varies from 5-60 minutes. If your application only makes a few requests per hour, the cache may expire between uses, and you will pay cache write costs without cache read benefits.
Not monitoring cache hits: Without tracking, you cannot know if your caching strategy works. Check the cache-usage fields in API responses (cached_tokens for OpenAI, cache_read_input_tokens for Anthropic). If they are always zero, something is wrong with your prompt structure. Debug by comparing your prompts character by character.
How to Implement Prompt Caching
For Anthropic: Add the cache_control parameter to your system message or the last content block you want cached. The value should be {"type": "ephemeral"} for 5-minute caching. Place this marker at the end of your static content, right before the dynamic user query begins.
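A minimal sketch with the Anthropic Python SDK; the model name and prompt text are placeholders, and the system prompt needs to clear the roughly 1,024-token minimum before anything is actually cached:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...policies, product info, response guidelines..."  # static, 1,024+ tokens

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use your model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What is your refund policy?"}],  # dynamic
)

print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```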
For OpenAI: No code changes needed. Simply structure your prompt with static content first and dynamic content last. Ensure your prompt exceeds 1,024 tokens. OpenAI automatically handles caching behind the scenes. Monitor the cached_tokens field in responses to verify it is working.
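A sketch of the same idea with the OpenAI Python SDK; nothing caching-specific appears in the request, you simply keep the static content first and read back the cache counter (model and prompt text are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...more than 1,024 tokens of static instructions..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use your model
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},            # static, first
        {"role": "user", "content": "Summarize our refund policy."},  # dynamic, last
    ],
)

# Non-zero on repeat requests means automatic caching is kicking in.
print(response.usage.prompt_tokens_details.cached_tokens)
```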
For Google: Create a cache explicitly using CachedContent.create(), specifying your desired TTL. Then reference this cache when initializing the GenerativeModel. For simpler use cases, Gemini 2.5 models support automatic implicit caching similar to OpenAI.
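One possible shape of that flow with the google-generativeai Python SDK, which exposes the CachedContent.create() call mentioned above; the model name, TTL, and content are placeholders, and explicit caches have their own (larger) minimum token requirement:

```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder

LARGE_DOCUMENT = "...a long reference document, above the caching minimum..."

# Explicitly create the cache with a TTL of your choosing.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",  # placeholder; use a caching-capable model
    system_instruction="Answer questions using only the attached document.",
    contents=[LARGE_DOCUMENT],
    ttl=datetime.timedelta(hours=1),
)

# Models built from the cache reuse the stored context on every call.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
answer = model.generate_content("What does section 3 cover?")
print(answer.text)
```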
Verification: After implementation, make two identical requests in quick succession. The second request should show cached tokens in the response metadata. If both show zero cached tokens, review your prompt structure and ensure the prefix is truly identical between requests.
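A quick self-test along those lines, sketched with the Anthropic SDK (the same pattern works with any provider if you read that provider's cache-usage fields instead):

```python
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "...the same 1,024+ token static prompt as above..."  # placeholder

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder
        max_tokens=256,
        system=[{"type": "text", "text": LONG_SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )

first = ask("What is your refund policy?")
second = ask("What is your refund policy?")

# Expect the first call to report cache_creation_input_tokens > 0
# and the second to report cache_read_input_tokens > 0.
print(first.usage.cache_creation_input_tokens, first.usage.cache_read_input_tokens)
print(second.usage.cache_creation_input_tokens, second.usage.cache_read_input_tokens)
```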
When Does Caching Pay Off?
With Anthropic's pricing (1.25x for cache writes, 0.1x for cache reads), you break even after just two requests using the same cached prefix. The first request costs 1.25x, the second costs 0.1x, for a total of 1.35x versus 2.0x without caching. Every subsequent request adds to your savings.
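That arithmetic is easy to sanity-check. Cumulative prefix cost, in multiples of the uncached price of a single request, using Anthropic-style multipliers (swap in your provider's rates):

```python
def cumulative_cost(n_requests: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total prefix cost after n requests, relative to one uncached request."""
    return write_mult + read_mult * (n_requests - 1)

for n in (1, 2, 3, 10):
    print(f"{n} requests: {cumulative_cost(n):.2f}x cached vs {float(n):.1f}x uncached")
# 2 requests: 1.35x cached vs 2.0x uncached -> caching is already ahead by request two.
```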
With OpenAI's pricing (no write premium, 0.5x for reads), you save from the very first cache hit. There is no break-even threshold to reach.
The economics get more compelling at scale. If you make 1,000 requests per day with a 20,000 token prefix, you save roughly 18 million tokens worth of computation daily. At typical API prices, that translates to hundreds or thousands of dollars per month.
The Bottom Line
Prompt caching is one of the highest-ROI optimizations available for LLM applications. The implementation is straightforward, the savings are immediate and measurable, and there is no impact on output quality.
If you are building anything with repeated context, whether that is system prompts, tool definitions, document analysis, or complex instructions, you should be using prompt caching. The five minutes it takes to restructure your prompts could save you thousands of dollars annually.
The infrastructure exists. The APIs are ready. All that remains is to use them.
Further Reading
Anthropic Documentation: The official guide covers implementation details, pricing, and best practices for Claude models. Visit docs.anthropic.com and search for "prompt caching" in the Build with Claude section.
OpenAI Cookbook: OpenAI provides a practical tutorial with code examples showing caching in action. Search "prompt caching 101" in the OpenAI Cookbook.
Google Cloud Documentation: For Gemini users, Google's Vertex AI documentation covers context caching setup and configuration options.
Found this useful? Forward it to someone building AI applications.