AnyInt Docs
Guides

Prompt Caching

Prompt caching reduces repeated-context cost and latency when the same prompt segments are reused. In practice, the main work is prompt design: stable prefixes are easier to reuse than constantly changing prompts.

Prompt caching reduces repeated-context cost and latency when the same prompt segments are reused. In practice, the main work is prompt design: stable prefixes are easier to reuse than constantly changing prompts.

This guide focuses on application design. Exact cache behavior, eligibility, and savings can vary by model family, account setup, provider path, and plan, so treat your usage metrics as the source of truth.

Good use cases

  • stable system prompts
  • repeated long documents
  • reusable instruction blocks
  • shared prompt templates across a team

When caching is worth designing for

SituationWhy it helps
Many requests share the same policy or system promptThe stable prefix can be reused across calls
Users ask different questions against the same long referenceThe reference can stay before volatile user input
Agents reuse the same tool or schema instructionsTool descriptions and output constraints can remain stable
A team maintains one approved prompt templateSmall changes are controlled instead of rebuilt per request

Practical guidance

  • keep reusable context stable
  • separate static instructions from user-specific content
  • place large shared context before volatile user input
  • monitor actual usage data instead of assuming a fixed discount across all models

A good prompt layout

  1. Static policy and instructions.
  2. Reusable long-form reference material.
  3. Tool definitions or schema instructions.
  4. Request-specific user content.

Example layout

SYSTEM:
  Stable product policy, tone, and safety rules.

REFERENCE:
  Shared documentation, schemas, or long context reused by many requests.

TOOLS OR OUTPUT CONTRACT:
  Stable tool definitions or JSON shape requirements.

USER:
  The current user's question, file, image, or task-specific data.

What breaks reuse

  • injecting timestamps or random values into the static prefix
  • rebuilding prompts in a slightly different order on every request
  • mixing cached and non-cached content into one unstable blob
  • formatting the same reference material differently for each request
  • placing user-specific text before the shared reusable context

How to measure

MetricWhat to compare
LatencySimilar requests before and after stabilizing the prompt prefix
CostActual billed usage for repeated workloads, not a theoretical discount
Cache eligibilityWhether the same model family and prompt layout are used consistently
QualityWhether separating static and dynamic content changed output behavior

Common mistakes

MistakeBetter approach
Adding the current timestamp to the system promptPut volatile data near the request-specific user content
Rebuilding the prompt from unordered objectsUse a deterministic prompt assembly order
Optimizing for cache before validating qualityFirst prove the prompt works, then stabilize reusable sections
Assuming every model supports the same savingsCheck usage metrics for the route and account you actually use

On this page