Prompt Caching

Supported Models and Savings

| Provider | Models | Trigger Condition | Savings | Your Action |
| --- | --- | --- | --- | --- |
| OpenAI / Azure | GPT-4.1, GPT-4o, GPT-5, o3, o4-mini (all GPT models) | Automatic when prefix >= 1024 tokens | 50% on cached input tokens | None required |
| Anthropic | All 9 Claude models (Opus 4, Sonnet 4, Haiku 3.5, etc.) | Gateway auto-injects cache_control when system prompt >= 3000 characters | 90% on cached input tokens | None required (OpenAI protocol) |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Chat | Automatic disk caching, 64-token alignment | 90% on cached input tokens | None required |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash | Implicit caching, automatic | 90% on cached input tokens | None required |
| Alibaba (Bailian) | GLM-5, MiniMax M2.5, Qwen models | Automatic | Varies by model | None required |

How It Works

OpenAI / Azure Models

OpenAI caches automatically. When your request prefix (system prompt + early messages) is at least 1024 tokens and matches a previous request, OpenAI returns cached tokens at 50% of the input price.

You do not need to change anything. The gateway passes your request through, and the response usage field reports how many tokens were cached:

```json
{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}
```
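The cache hit ratio can be computed directly from this usage payload. A minimal sketch, using the field names from the response above (the helper name is our own, not part of the API):

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of prompt tokens that were served from cache."""
    details = usage.get("prompt_tokens_details") or {}
    prompt = usage.get("prompt_tokens", 0)
    return details.get("cached_tokens", 0) / prompt if prompt else 0.0

# Usage payload from the response above
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
print(f"Cache hit rate: {cache_hit_rate(usage):.0%}")  # Cache hit rate: 50%
```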

Reading Cache Results

After each request, check the usage.prompt_tokens_details field in the response:

```json
{
  "usage": {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "total_tokens": 15500,
    "prompt_tokens_details": {
      "cached_tokens": 14000,
      "cache_creation_tokens": 0
    }
  }
}
```
  • cached_tokens: Tokens served from cache (charged at the discounted rate).
  • cache_creation_tokens: Tokens written to cache for the first time (Anthropic charges a small write premium; other providers do not).
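Putting these fields together, you can estimate the effective input cost of a request. A sketch, assuming the discount rates from the table above and ignoring Anthropic's small cache-write premium for simplicity (the function name is illustrative):

```python
def effective_input_cost(usage: dict, price_per_m: float, cache_discount: float) -> float:
    """Estimate input cost in dollars, billing cached tokens at the discounted rate."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    fresh = usage["prompt_tokens"] - cached
    billable = fresh + cached * (1 - cache_discount)  # cached tokens cost a fraction
    return billable * price_per_m / 1_000_000

# Usage from the response above: Claude Sonnet at $3/M input, 90% cache discount
usage = {"prompt_tokens": 15000, "prompt_tokens_details": {"cached_tokens": 14000}}
print(f"${effective_input_cost(usage, 3.0, 0.90):.4f}")  # $0.0072 (vs $0.0450 uncached)
```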

Cost Comparison: Claude Code User

A typical Claude Code user sends ~50 requests per day, each with a ~20,000-token system prompt.

| Metric | Without Caching | With Caching (90% savings) |
| --- | --- | --- |
| Daily input tokens (system prompt only) | 1,000,000 | 1,000,000 |
| Daily input cost (Claude Sonnet, $3/M) | $3.00 | $0.30 |
| Monthly input cost (30 days) | $90.00 | $9.00 |
| Monthly savings | | $81.00 |

The first request of each 5-minute window pays full price and writes to cache. All subsequent requests within that window pay the cached rate.
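The table's arithmetic can be reproduced in a few lines. This sketch treats every request as fully cached; in practice the full-price cache write at the start of each 5-minute window, noted above, makes real savings slightly lower:

```python
requests_per_day = 50
system_prompt_tokens = 20_000
price_per_m = 3.0   # Claude Sonnet input price, $ per million tokens
discount = 0.90     # cached-token discount

daily_tokens = requests_per_day * system_prompt_tokens   # 1,000,000
daily_cost = daily_tokens / 1_000_000 * price_per_m      # $3.00 without caching
daily_cached = daily_cost * (1 - discount)               # $0.30 with caching
monthly_savings = (daily_cost - daily_cached) * 30       # $81.00
print(f"Monthly savings: ${monthly_savings:.2f}")
```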

Code Examples

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# The gateway auto-injects cache_control for Anthropic models
# when the system prompt exceeds 3000 characters.
# For OpenAI/DeepSeek models, caching is fully automatic.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here..." * 100,  # Must be >= 3000 chars for auto-injection
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
    max_tokens=1024,
)

# Check cache usage
usage = response.usage
print(f"Total input tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")
```

Tips

  • Keep your system prompt stable. Caching works on exact prefix matches. If you change a single character in your system prompt, the cache is invalidated.
  • Front-load static content. Place instructions and reference documents at the beginning of the conversation. Dynamic content (user messages) goes at the end.
  • Reuse conversations. Multi-turn conversations naturally benefit from caching -- the entire conversation prefix up to the latest message can be cached.
  • Monitor via the Generation API. Use GET /v1/generation to see the exact cost breakdown including cached vs. non-cached tokens.

Next Steps