Prompt Caching

Prompt caching reduces repeated-context cost and latency when the same prompt segments are reused. In practice, the main work is prompt design: stable prefixes are easier to reuse than constantly changing prompts.

This guide focuses on application design. Exact cache behavior, eligibility, and savings can vary by model family, account setup, provider path, and plan, so treat your usage metrics as the source of truth.

Good use cases

stable system prompts
repeated long documents
reusable instruction blocks
shared prompt templates across a team

When caching is worth designing for

Situation	Why it helps
Many requests share the same policy or system prompt	The stable prefix can be reused across calls
Users ask different questions against the same long reference	The reference can stay before volatile user input
Agents reuse the same tool or schema instructions	Tool descriptions and output constraints can remain stable
A team maintains one approved prompt template	Changes are made deliberately in one place, so the prefix is not rebuilt differently on each request

Practical guidance

keep reusable context stable
separate static instructions from user-specific content
place large shared context before volatile user input
monitor actual usage data instead of assuming a fixed discount across all models

A good prompt layout

1. Static policy and instructions.
2. Reusable long-form reference material.
3. Tool definitions or schema instructions.
4. Request-specific user content.

Example layout

SYSTEM:
  Stable product policy, tone, and safety rules.

REFERENCE:
  Shared documentation, schemas, or long context reused by many requests.

TOOLS OR OUTPUT CONTRACT:
  Stable tool definitions or JSON shape requirements.

USER:
  The current user's question, file, image, or task-specific data.
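
The layout above can be sketched as a small assembly function. This is a minimal illustration, not a provider API: the names `StaticPrefix` and `build_prompt` are hypothetical, and the section labels simply mirror the example layout. The point is that the static sections are always emitted in the same order, with the user content last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticPrefix:
    """Sections that should stay byte-identical across requests (hypothetical type)."""
    system: str
    reference: str
    tools: str


def build_prompt(prefix: StaticPrefix, user_content: str) -> str:
    # Fixed section order: stable content first, volatile user content last.
    sections = [
        ("SYSTEM", prefix.system),
        ("REFERENCE", prefix.reference),
        ("TOOLS OR OUTPUT CONTRACT", prefix.tools),
        ("USER", user_content),
    ]
    return "\n\n".join(f"{label}:\n{body}" for label, body in sections)


prefix = StaticPrefix(
    system="Stable product policy, tone, and safety rules.",
    reference="Shared documentation reused by many requests.",
    tools="Respond with a JSON object containing an 'answer' field.",
)
prompt_a = build_prompt(prefix, "How do I reset my password?")
prompt_b = build_prompt(prefix, "What is the refund window?")
```

Here `prompt_a` and `prompt_b` are identical up to and including the `USER:` label, so only the final section differs between requests. Whether that shared prefix is actually cached still depends on the model and route you use.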

What breaks reuse

injecting timestamps or random values into the static prefix
rebuilding prompts in a slightly different order on every request
interleaving reusable and request-specific content so that no stable prefix exists
formatting the same reference material differently for each request
placing user-specific text before the shared reusable context
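
One way to catch these problems early is to fingerprint the static prefix and log it with each request. The sketch below is an assumption about how you might do that in your own application; `prefix_fingerprint` is a hypothetical helper, and the hash is only a local drift detector, not something a provider reads.

```python
import hashlib


def prefix_fingerprint(static_sections: list[str]) -> str:
    # Hash the exact text that would be sent, so any change in content,
    # order, or whitespace produces a different fingerprint.
    joined = "\n\n".join(static_sections)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


policy = "Answer politely and cite the reference."
reference = "Shared product documentation."

stable_a = prefix_fingerprint([policy, reference])
stable_b = prefix_fingerprint([policy, reference])

# A timestamp in the static prefix changes the fingerprint on every request.
with_timestamp = prefix_fingerprint(
    [f"Current time: 2024-01-01T00:00:01Z\n{policy}", reference]
)

# The same sections in a different order are a different prefix.
reordered = prefix_fingerprint([reference, policy])
```

If the fingerprint changes between requests that were supposed to share a prefix, something in the list above is probably happening.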

How to measure

Metric	What to compare
Latency	Similar requests before and after stabilizing the prompt prefix
Cost	Actual billed usage for repeated workloads, not a theoretical discount
Cache eligibility	Whether the same model family and prompt layout are used consistently
Quality	Whether separating static and dynamic content changed output behavior
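
A before-and-after comparison can be as simple as summarizing logged requests. The sketch below assumes you record latency and token counts yourself; the `RequestRecord` type and the `cached_input_tokens` field are hypothetical, since the name and availability of cached-token counts differ by provider.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    latency_ms: float
    input_tokens: int
    cached_input_tokens: int  # hypothetical field; map it from your provider's usage data


def summarize(records: list[RequestRecord]) -> dict:
    total_input = sum(r.input_tokens for r in records)
    total_cached = sum(r.cached_input_tokens for r in records)
    return {
        "median_latency_ms": median(r.latency_ms for r in records),
        # Share of input tokens reported as served from cache.
        "cached_fraction": total_cached / total_input if total_input else 0.0,
    }


# Illustrative numbers only, not measured results.
before = [RequestRecord(900, 1000, 0), RequestRecord(1100, 1000, 0)]
after = [RequestRecord(500, 1000, 800), RequestRecord(700, 1000, 800)]

before_summary = summarize(before)
after_summary = summarize(after)
```

Compare the two summaries for similar requests on the same model and route, and check billed cost separately, because a higher cached fraction does not imply a fixed discount.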

Common mistakes

Mistake	Better approach
Adding the current timestamp to the system prompt	Put volatile data near the request-specific user content
Rebuilding the prompt from unordered objects	Use a deterministic prompt assembly order
Optimizing for cache before validating quality	First prove the prompt works, then stabilize reusable sections
Assuming every model supports the same savings	Check usage metrics for the route and account you actually use
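
The "unordered objects" mistake often appears when tool or schema definitions are stored in dictionaries and serialized on each request. A minimal sketch of a deterministic rendering step, using Python's standard `json` module, is shown below; `render_tool_block` and the tool definitions are hypothetical examples.

```python
import json


def render_tool_block(tools: dict) -> str:
    # sort_keys fixes key order at every nesting level, and a fixed indent
    # keeps whitespace identical between requests.
    return json.dumps(tools, sort_keys=True, indent=2, ensure_ascii=False)


# The same definitions, built in a different insertion order.
tools_a = {
    "search": {"description": "Search the docs.", "parameters": {"query": "string"}},
    "lookup": {"parameters": {"id": "string"}, "description": "Fetch one record."},
}
tools_b = {
    "lookup": {"description": "Fetch one record.", "parameters": {"id": "string"}},
    "search": {"parameters": {"query": "string"}, "description": "Search the docs."},
}

naive_matches = json.dumps(tools_a) == json.dumps(tools_b)
stable_matches = render_tool_block(tools_a) == render_tool_block(tools_b)
```

Without sorted keys, the two dictionaries serialize to different text even though they are equal as objects, so `naive_matches` is false while `stable_matches` is true.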
