Prompt Caching
Prompt caching reduces repeated-context cost and latency when the same prompt segments are reused. In practice, the main work is prompt design: stable prefixes are easier to reuse than constantly changing prompts.
This guide focuses on application design. Exact cache behavior, eligibility, and savings can vary by model family, account setup, provider path, and plan, so treat your usage metrics as the source of truth.
Good use cases
- stable system prompts
- repeated long documents
- reusable instruction blocks
- shared prompt templates across a team
When caching is worth designing for
| Situation | Why it helps |
|---|---|
| Many requests share the same policy or system prompt | The stable prefix can be reused across calls |
| Users ask different questions against the same long reference | The reference can stay before volatile user input |
| Agents reuse the same tool or schema instructions | Tool descriptions and output constraints can remain stable |
| A team maintains one approved prompt template | Small changes are controlled instead of rebuilt per request |
Practical guidance
- keep reusable context stable
- separate static instructions from user-specific content
- place large shared context before volatile user input
- monitor actual usage data instead of assuming a fixed discount across all models
A good prompt layout
1. Static policy and instructions.
2. Reusable long-form reference material.
3. Tool definitions or schema instructions.
4. Request-specific user content.
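The ordering above can be sketched as a small assembly function. This is a minimal illustration, not a provider API: `PromptParts`, `stable_prefix`, `build_prompt`, and the sample strings are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    """Hypothetical container for the static sections of the layout."""
    policy: str     # static policy and instructions
    reference: str  # reusable long-form reference material
    tools: str      # tool definitions or schema instructions


def stable_prefix(parts: PromptParts) -> str:
    """Join the static sections in one fixed order."""
    return "\n\n".join([
        "SYSTEM:\n" + parts.policy.strip(),
        "REFERENCE:\n" + parts.reference.strip(),
        "TOOLS OR OUTPUT CONTRACT:\n" + parts.tools.strip(),
    ])


def build_prompt(parts: PromptParts, user_content: str) -> str:
    """Append request-specific content after the stable prefix."""
    return stable_prefix(parts) + "\n\nUSER:\n" + user_content.strip()


parts = PromptParts(
    policy="Answer politely and cite the reference.",
    reference="Product manual text shared by every request.",
    tools='Reply as JSON: {"answer": string}',
)
first = build_prompt(parts, "How do I reset my password?")
second = build_prompt(parts, "What is the refund window?")

# Both prompts share the identical prefix; only the tail differs.
assert first.startswith(stable_prefix(parts))
assert second.startswith(stable_prefix(parts))
```

Keeping the static sections in one function means there is a single place where ordering or formatting can change, which makes accidental prefix drift easier to spot in review.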
Example layout
SYSTEM:
Stable product policy, tone, and safety rules.
REFERENCE:
Shared documentation, schemas, or long context reused by many requests.
TOOLS OR OUTPUT CONTRACT:
Stable tool definitions or JSON shape requirements.
USER:
The current user's question, file, image, or task-specific data.
What breaks reuse
- injecting timestamps or random values into the static prefix
- rebuilding prompts in a slightly different order on every request
- interleaving reusable and request-specific content in one block, so no stable prefix remains
- formatting the same reference material differently for each request
- placing user-specific text before the shared reusable context
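One way to catch these problems early is to fingerprint the static prefix and compare it across requests. The sketch below is a local diagnostic only: it assumes nothing about how a provider matches cached content, and `prefix_fingerprint` and the sample strings are hypothetical. It only shows whether your own prefix changed byte for byte.

```python
import hashlib


def prefix_fingerprint(static_prefix: str) -> str:
    """Return a short SHA-256 digest of the static prefix text."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:16]


stable = "SYSTEM:\nFollow the product policy."
# A timestamp injected into the prefix changes it on every request.
with_timestamp = stable + "\nNow: 2024-01-01T00:00:00Z"
# Different whitespace is also a different prefix.
reformatted = "SYSTEM:\n  Follow the product policy."

baseline = prefix_fingerprint(stable)

assert prefix_fingerprint(stable) == baseline          # unchanged text, same digest
assert prefix_fingerprint(with_timestamp) != baseline  # volatile value breaks stability
assert prefix_fingerprint(reformatted) != baseline     # formatting drift breaks stability
```

Logging this digest next to each request makes it possible to see how many distinct prefixes a workload actually sends.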
How to measure
| Metric | What to compare |
|---|---|
| Latency | Similar requests before and after stabilizing the prompt prefix |
| Cost | Actual billed usage for repeated workloads, not a theoretical discount |
| Cache eligibility | Whether the same model family and prompt layout are used consistently |
| Quality | Whether separating static and dynamic content changed output behavior |
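A before-and-after comparison can be as simple as summarizing request logs for the same workload. The sketch below is an assumption-heavy illustration: the field names `latency_ms`, `input_tokens`, and `cached_input_tokens` are placeholders, and the sample numbers are made up to show the calculation. Substitute whatever your provider's usage data actually reports.

```python
from statistics import median


def summarize(records):
    """Return median latency and the share of input tokens reported as cached."""
    if not records:
        raise ValueError("no records to summarize")
    total_input = sum(r["input_tokens"] for r in records)
    total_cached = sum(r["cached_input_tokens"] for r in records)
    return {
        "median_latency_ms": median(r["latency_ms"] for r in records),
        "cached_share": total_cached / total_input if total_input else 0.0,
    }


# Made-up sample logs, for illustration only.
before = [
    {"latency_ms": 900, "input_tokens": 1000, "cached_input_tokens": 0},
    {"latency_ms": 1100, "input_tokens": 1000, "cached_input_tokens": 0},
]
after = [
    {"latency_ms": 600, "input_tokens": 1000, "cached_input_tokens": 800},
    {"latency_ms": 700, "input_tokens": 1000, "cached_input_tokens": 600},
]

print("before:", summarize(before))
print("after: ", summarize(after))
```

Compare like with like: the same model family, the same kind of request, and a similar prompt length on both sides of the change.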
Common mistakes
| Mistake | Better approach |
|---|---|
| Adding the current timestamp to the system prompt | Put volatile data near the request-specific user content |
| Rebuilding the prompt from unordered objects | Use a deterministic prompt assembly order |
| Optimizing for cache before validating quality | First prove the prompt works, then stabilize reusable sections |
| Assuming every model supports the same savings | Check usage metrics for the route and account you actually use |
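The "deterministic prompt assembly order" row can be illustrated with tool definitions stored in dictionaries. This is a minimal sketch: `render_tools` and the tool names are hypothetical, and sorted-key JSON is one possible convention rather than a requirement of any provider.

```python
import json


def render_tools(tools: dict) -> str:
    """Serialize tool definitions with sorted keys and fixed separators."""
    return json.dumps(
        tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


# The same definitions, built in a different key order.
a = {
    "search": {"description": "Search the docs", "args": ["query"]},
    "lookup": {"args": ["id"], "description": "Fetch one record"},
}
b = {
    "lookup": {"description": "Fetch one record", "args": ["id"]},
    "search": {"args": ["query"], "description": "Search the docs"},
}

# Insertion order differs, but the rendered text is identical.
assert render_tools(a) == render_tools(b)
```

Sorting applies to dictionary keys only; list items keep their order, so argument lists should also be built in a consistent order.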