Deep Dive
Prompt Caching: The Optimization Most AI Developers Miss
Your LLM is doing the same work over and over. Here is how to make it stop.
Every time you send a prompt to an LLM, the model processes your entire input from scratch. Your carefully crafted system prompt with detailed instructions? Processed again. Those tool definitions for your AI agent? Processed again. The 50-page document you uploaded for Q&A? Processed again. Every single request.
This is wildly inefficient. If you are building any production LLM application with repeated context, you are likely burning money on redundant computation.
One developer shared that they dropped their monthly API bill from $720 to $72 after implementing a simple change. Another reported going from $8,000 per month to $800. The technique they used is called prompt caching, and although it has been available from major providers for over a year, adoption remains surprisingly low.
What Exactly Is Prompt Caching?
To understand prompt caching, you need to understand what happens when an LLM processes your prompt.
When you send text to a model like Claude or GPT, the model does not just "read" your prompt. It performs complex mathematical operations on each token, building internal representations called key-value tensors. These tensors capture how each token in your prompt relates to every other token. This is the "attention" mechanism that makes transformers work.
Computing these tensors is expensive. For a long prompt, this computation can take significant time and resources. And here is the problem: if you send the same system prompt with a different user question, the model recomputes those tensors for the entire system prompt from scratch.
Prompt caching solves this by storing the computed tensors after the first request. When you send another request that starts with the same prefix (the beginning portion of your prompt), the model retrieves the stored tensors instead of recomputing them. It then only needs to process the new, dynamic part of your prompt.
A critical distinction: This is not caching the response. The model still generates a fresh, unique output for each request. What gets cached is the computational work of understanding your prompt prefix. The quality and variability of responses remain unchanged.
How Prompt Caching Works

⚡ Cache check: hash the prompt prefix and look up stored KV tensors.
✔ Cache hit: load the stored tensors and process only the new tokens. 90% cheaper, 85% faster.
✘ Cache miss: compute all KV tensors and store them for future requests. Full price, full latency.
Generate response: the output is identical either way.
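Conceptually, the mechanism is plain memoization keyed on the prompt prefix. The toy sketch below is not how any provider actually implements it (real systems store attention key-value tensors on the serving infrastructure), but it shows why only an exact prefix match lets earlier work be reused:

```python
import hashlib

kv_cache: dict[str, str] = {}  # prefix hash -> precomputed state for that prefix

def expensive_prefill(text: str) -> str:
    """Stand-in for the costly attention computation over a span of tokens."""
    return f"kv-tensors-for-{len(text)}-chars"

def process_prompt(static_prefix: str, dynamic_suffix: str) -> None:
    key = hashlib.sha256(static_prefix.encode()).hexdigest()
    if key in kv_cache:                      # cache hit: reuse earlier work
        print("cache hit: only the suffix gets processed")
    else:                                    # cache miss: pay full price once
        kv_cache[key] = expensive_prefill(static_prefix)
        print("cache miss: full prefix processed and stored")
    expensive_prefill(dynamic_suffix)        # the dynamic part is always fresh

process_prompt("SYSTEM PROMPT " * 500, "What is the refund policy?")   # miss
process_prompt("SYSTEM PROMPT " * 500, "Do you ship internationally?") # hit
```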
Why This Matters for Your Applications
The impact is twofold: cost and latency.
Cost savings: Anthropic charges only 10% of the normal input token price for cache reads, a 90% reduction. OpenAI offers 50% off, and Google 75%. If you have a 20,000 token system prompt and make 1,000 requests per day, you go from paying full price on 20 million input tokens daily to paying full price once for the 20,000-token cache write, then the discounted read rate on every request after that.
Latency reduction: Because the model skips the computation for cached tokens, time-to-first-token drops dramatically. Anthropic reports up to 85% latency reduction for long prompts. In practice, this means your chatbot responds faster, your code assistant feels snappier, and your document Q&A system handles queries more responsively.
Consider a real scenario: You build a customer support bot with a 15,000 token system prompt containing company policies, product information, and response guidelines. Without caching, every customer interaction processes those 15,000 tokens. With 500 conversations per day, that is 7.5 million tokens daily just for the system prompt. With caching, you pay full price once, then 90% less for every subsequent conversation.
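Here is a back-of-the-envelope version of that scenario in code. The $3 per million input tokens price is an assumed figure for illustration (plug in your model's actual rate), the multipliers are Anthropic-style, and the sketch assumes the cache never expires between conversations:

```python
PRICE_PER_MTOK = 3.00      # USD per million input tokens (assumed for illustration)
CACHE_WRITE_MULT = 1.25    # Anthropic-style cache write premium
CACHE_READ_MULT = 0.10     # Anthropic-style cache read discount

system_prompt_tokens = 15_000
conversations_per_day = 500

def cost(tokens: int, multiplier: float = 1.0) -> float:
    return tokens / 1_000_000 * PRICE_PER_MTOK * multiplier

no_cache = cost(system_prompt_tokens) * conversations_per_day
with_cache = (
    cost(system_prompt_tokens, CACHE_WRITE_MULT)                                  # first write
    + cost(system_prompt_tokens, CACHE_READ_MULT) * (conversations_per_day - 1)   # cached reads
)

print(f"system prompt cost without caching: ${no_cache:.2f}/day")   # ~$22.50
print(f"system prompt cost with caching:    ${with_cache:.2f}/day")  # ~$2.30
```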
How Each Provider Implements It
Each major provider handles prompt caching differently. Understanding these differences helps you choose the right approach for your use case.
Anthropic (Claude)
Anthropic gives you explicit control. You add a cache_control parameter to mark exactly where your cacheable prefix ends. Cache writes cost 1.25x the base input price (a 25% premium), but cache reads cost only 0.1x (90% savings). The cache lasts 5 minutes by default, refreshing each time it is used. You can also pay for 1-hour cache duration at 2x the base price for the initial write. Minimum cacheable size is 1,024 tokens for most models.
OpenAI (GPT)
OpenAI makes it automatic. No code changes required. If your prompt exceeds 1,024 tokens, OpenAI automatically attempts to cache and reuse matching prefixes. Cache reads cost 50% of normal input pricing. There is no premium for cache writes. The cache typically lasts 5-60 minutes depending on usage patterns. The tradeoff is less control but zero implementation effort.
Google (Gemini)
Google calls it "context caching" and offers the most flexibility. You explicitly create a cache object with a configurable time-to-live that can extend from minutes to weeks. Pricing is storage-based rather than per-write. Cache reads cost 25% of normal input pricing. Gemini 2.5 models also support implicit caching similar to OpenAI. This hybrid approach suits applications with varying cache lifetime needs.
The Golden Rule: Static First, Dynamic Last
Prompt caching relies on exact prefix matching. This is not fuzzy matching or semantic similarity. If even a single character differs between your current prompt prefix and what is in the cache, you get a cache miss and pay full price.
This requirement dictates how you must structure your prompts. Everything static goes at the beginning, everything dynamic goes at the end.
Optimal Prompt Structure for Caching

1. Tool Definitions: function schemas, API specs
2. System Prompt: instructions, persona, rules, constraints
3. Reference Documents: knowledge base, manuals, code
4. Few-Shot Examples: input-output pairs for guidance
5. User Query: the actual question or task

Layers 1 through 4 are CACHED: computed once and reused across all requests at 0.1x cost per read. Layer 5 is DYNAMIC: processed fresh on each request at full (1.0x) cost.
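As a minimal, provider-agnostic sketch of that ordering (the content strings are placeholders and the chat-message format is purely illustrative), the idea is to assemble one byte-for-byte identical static prefix and append only the user's query per request:

```python
# Placeholder content; the point is the ordering, not the specific text.
TOOL_DEFINITIONS = "...function schemas, serialized deterministically..."
SYSTEM_INSTRUCTIONS = "...persona, rules, constraints..."
REFERENCE_DOCUMENTS = "...knowledge base, manuals, code..."
FEW_SHOT_EXAMPLES = "...input-output example pairs..."

# Identical on every call, so it is cacheable as a prefix.
STATIC_PREFIX = "\n\n".join(
    [TOOL_DEFINITIONS, SYSTEM_INSTRUCTIONS, REFERENCE_DOCUMENTS, FEW_SHOT_EXAMPLES]
)

def build_messages(user_query: str) -> list[dict]:
    """Only the final user message changes between requests."""
    return [
        {"role": "system", "content": STATIC_PREFIX},
        {"role": "user", "content": user_query},  # dynamic part, always last
    ]
```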
Where Prompt Caching Delivers the Most Value
Conversational agents and chatbots: Your system prompt defines the bot's personality, capabilities, and constraints. This prompt is identical for every user conversation. Cache it once, then every turn in every conversation benefits from reduced cost and latency. For high-volume customer service bots, this alone can cut costs by an order of magnitude.
Coding assistants: Tools like Cursor and GitHub Copilot work with large codebases. By caching the codebase summary or relevant file contents, each autocomplete suggestion and code question can reference the full context without reprocessing it. This is why modern AI coding tools feel fast despite working with massive contexts.
Document Q&A and RAG systems: When users ask questions about a document, the document itself stays constant while questions vary. Upload a 200-page manual once, cache it, and answer unlimited questions at a fraction of the cost. One enterprise team reported handling 50,000 document queries per month for $8,000 instead of $45,000.
Agentic workflows: AI agents that use tools make multiple API calls per task. Each call typically includes the same tool definitions and instructions. Caching this context means each step in a multi-step workflow runs faster and cheaper. For agents making 10-20 tool calls per task, the savings compound significantly.
Detailed instruction sets: Some applications benefit from extensive examples in the prompt. Without caching, including 20+ few-shot examples is prohibitively expensive. With caching, you can load the model with comprehensive examples that dramatically improve output quality without proportional cost increases.
Common Mistakes That Break Caching
Putting dynamic content at the start: This is the most common mistake. If you include timestamps, user IDs, session tokens, or any request-specific data at the beginning of your prompt, every single request will miss the cache. Move all dynamic content to the end.
Inconsistent formatting: Remember, prefix matching is exact. If your JSON tool definitions have different key ordering between requests, or if whitespace varies, the cache will miss. Use deterministic serialization. Sort your JSON keys. Trim whitespace consistently.
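If your tool schemas or other prompt data pass through Python dictionaries on the way into the prompt, deterministic serialization keeps the prefix byte-identical across requests. A small sketch (the get_weather schema is just an example):

```python
import json

tool_schema = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}

# Sorted keys and fixed separators mean the same dict always produces the same
# bytes, so the cached prefix keeps matching from one request to the next.
canonical = json.dumps(tool_schema, sort_keys=True, separators=(",", ":"))
```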
Prompts that are too short: Most providers require a minimum of 1,024 tokens for caching. If your system prompt is only 500 tokens, it will not cache regardless of your settings. Either expand your prompt with useful context or accept that caching will not help for this use case.
Infrequent requests: Caches expire. Anthropic's default is 5 minutes, OpenAI's varies from 5-60 minutes. If your application only makes a few requests per hour, the cache may expire between uses, and you will pay cache write costs without cache read benefits.
Not monitoring cache hits: Without tracking, you cannot know if your caching strategy works. Check the cache-usage fields in API responses (cached_tokens for OpenAI, cache_read_input_tokens for Anthropic). If they are always zero, something is wrong with your prompt structure. Debug by comparing your prompts character by character.
How to Implement Prompt Caching
For Anthropic: Add the cache_control parameter to your system message or the last content block you want cached. The value should be {"type": "ephemeral"} for 5-minute caching. Place this marker at the end of your static content, right before the dynamic user query begins.
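A minimal sketch with the Anthropic Python SDK; the model name and prompt text are placeholders, and the system prompt needs to clear the roughly 1,024-token minimum before anything is actually cached:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...policies, product info, response guidelines..."  # static, 1,024+ tokens

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use your model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block is the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What is your refund policy?"}],  # dynamic
)

print(response.usage)  # cache_creation_input_tokens / cache_read_input_tokens
```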
For OpenAI: No code changes needed. Simply structure your prompt with static content first and dynamic content last. Ensure your prompt exceeds 1,024 tokens. OpenAI automatically handles caching behind the scenes. Monitor the cached_tokens field in responses to verify it is working.
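A sketch of the same idea with the OpenAI Python SDK; nothing caching-specific appears in the request, you simply keep the static content first and read back the cache counter (model and prompt text are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...more than 1,024 tokens of static instructions..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use your model
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},            # static, first
        {"role": "user", "content": "Summarize our refund policy."},  # dynamic, last
    ],
)

# Non-zero on repeat requests means automatic caching is kicking in.
print(response.usage.prompt_tokens_details.cached_tokens)
```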
For Google: Create a cache explicitly using CachedContent.create(), specifying your desired TTL. Then reference this cache when initializing the GenerativeModel. For simpler use cases, Gemini 2.5 models support automatic implicit caching similar to OpenAI.
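One possible shape of that flow with the google-generativeai Python SDK, which exposes the CachedContent.create() call mentioned above; the model name, TTL, and content are placeholders, and explicit caches have their own (larger) minimum token requirement:

```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder

LARGE_DOCUMENT = "...a long reference document, above the caching minimum..."

# Explicitly create the cache with a TTL of your choosing.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",  # placeholder; use a caching-capable model
    system_instruction="Answer questions using only the attached document.",
    contents=[LARGE_DOCUMENT],
    ttl=datetime.timedelta(hours=1),
)

# Models built from the cache reuse the stored context on every call.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
answer = model.generate_content("What does section 3 cover?")
print(answer.text)
```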
Verification: After implementation, make two identical requests in quick succession. The second request should show cached tokens in the response metadata. If both show zero cached tokens, review your prompt structure and ensure the prefix is truly identical between requests.
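A quick self-test along those lines, sketched with the Anthropic SDK (the same pattern works with any provider if you read that provider's cache-usage fields instead):

```python
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "...the same 1,024+ token static prompt as above..."  # placeholder

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder
        max_tokens=256,
        system=[{"type": "text", "text": LONG_SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )

first = ask("What is your refund policy?")
second = ask("What is your refund policy?")

# Expect the first call to report cache_creation_input_tokens > 0
# and the second to report cache_read_input_tokens > 0.
print(first.usage.cache_creation_input_tokens, first.usage.cache_read_input_tokens)
print(second.usage.cache_creation_input_tokens, second.usage.cache_read_input_tokens)
```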
When Does Caching Pay Off?
With Anthropic's pricing (1.25x for cache writes, 0.1x for cache reads), you break even after just two requests using the same cached prefix. The first request costs 1.25x, the second costs 0.1x, for a total of 1.35x versus 2.0x without caching. Every subsequent request adds to your savings.
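That arithmetic is easy to sanity-check. Cumulative prefix cost, in multiples of the uncached price of a single request, using Anthropic-style multipliers (swap in your provider's rates):

```python
def cumulative_cost(n_requests: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Total prefix cost after n requests, relative to one uncached request."""
    return write_mult + read_mult * (n_requests - 1)

for n in (1, 2, 3, 10):
    print(f"{n} requests: {cumulative_cost(n):.2f}x cached vs {float(n):.1f}x uncached")
# 2 requests: 1.35x cached vs 2.0x uncached -> caching is already ahead by request two.
```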
With OpenAI's pricing (no write premium, 0.5x for reads), you save from the very first cache hit. There is no break-even threshold to reach.
The economics get more compelling at scale. If you make 1,000 requests per day with a 20,000 token prefix, you save roughly 18 million tokens worth of computation daily. At typical API prices, that translates to hundreds or thousands of dollars per month.
The Bottom Line
Prompt caching is one of the highest-ROI optimizations available for LLM applications. The implementation is straightforward, the savings are immediate and measurable, and there is no impact on output quality.
If you are building anything with repeated context, whether that is system prompts, tool definitions, document analysis, or complex instructions, you should be using prompt caching. The five minutes it takes to restructure your prompts could save you thousands of dollars annually.
The infrastructure exists. The APIs are ready. All that remains is to use them.
Further Reading
Anthropic Documentation: The official guide covers implementation details, pricing, and best practices for Claude models. Visit docs.anthropic.com and search for "prompt caching" in the Build with Claude section.
OpenAI Cookbook: OpenAI provides a practical tutorial with code examples showing caching in action. Search "prompt caching 101" in the OpenAI Cookbook.
Google Cloud Documentation: For Gemini users, Google's Vertex AI documentation covers context caching setup and configuration options.
Found this useful? Forward it to someone building AI applications.