# Prompt Caching

## Supported Models and Savings
| Provider | Models | Trigger Condition | Savings | Your Action |
|---|---|---|---|---|
| OpenAI / Azure | GPT-4.1, GPT-4o, GPT-5, o3, o4-mini (all GPT models) | Automatic when prefix >= 1024 tokens | 50% on cached input tokens | None required |
| Anthropic | All 9 Claude models (Opus 4, Sonnet 4, Haiku 3.5, etc.) | Gateway auto-injects cache_control when system prompt >= 3000 characters | 90% on cached input tokens | None required (OpenAI protocol) |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Chat | Automatic disk caching, 64-token alignment | 90% on cached input tokens | None required |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash | Implicit caching, automatic | 90% on cached input tokens | None required |
| Alibaba (Bailian) | GLM-5, MiniMax M2.5, Qwen models | Automatic | Varies by model | None required |
## How It Works

### OpenAI / Azure Models
OpenAI caches automatically. When your request prefix (system prompt + early messages) is at least 1024 tokens and matches a previous request, OpenAI returns cached tokens at 50% of the input price.
You do not need to change anything. The gateway passes your request through, and the response usage field reports how many tokens were cached:
```json
{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": { "cached_tokens": 1024 }
  }
}
```
## Reading Cache Results
After each request, check the `usage.prompt_tokens_details` field in the response:
```json
{
  "usage": {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "total_tokens": 15500,
    "prompt_tokens_details": {
      "cached_tokens": 14000,
      "cache_creation_tokens": 0
    }
  }
}
```
- `cached_tokens`: Tokens served from cache (charged at the discounted rate).
- `cache_creation_tokens`: Tokens written to cache for the first time (Anthropic charges a small write premium; other providers do not).
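To make the discount concrete, here is a small sketch that computes the effective input cost from a usage payload. The discount rates are taken from the table above (0.9 for Anthropic/DeepSeek/Gemini, 0.5 for OpenAI); actual provider pricing may differ.

```python
def effective_input_cost(usage: dict, price_per_m: float,
                         cache_discount: float = 0.9) -> float:
    """Return the input cost in dollars for one request.

    price_per_m: full input price per million tokens.
    cache_discount: fraction saved on cached tokens (0.9 = 90% savings).
    """
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    uncached = usage["prompt_tokens"] - cached
    rate = price_per_m / 1_000_000
    # Uncached tokens pay full price; cached tokens pay the discounted rate.
    return uncached * rate + cached * rate * (1 - cache_discount)


# Using the example response above: 1,000 uncached + 14,000 cached tokens
usage = {"prompt_tokens": 15000,
         "prompt_tokens_details": {"cached_tokens": 14000}}
print(f"${effective_input_cost(usage, price_per_m=3.0):.4f}")
```

Without caching the same 15,000 input tokens would cost $0.045 at $3/M; with 14,000 of them cached at a 90% discount, the request costs $0.0072.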
## Cost Comparison: Claude Code User
A typical Claude Code user sends ~50 requests per day, each with a ~20,000-token system prompt.
| Metric | Without Caching | With Caching (90% savings) |
|---|---|---|
| Daily input tokens (system prompt only) | 1,000,000 | 1,000,000 |
| Daily input cost (Claude Sonnet, $3/M) | $3.00 | $0.30 |
| Monthly input cost (30 days) | $90.00 | $9.00 |
| Monthly savings | — | $81.00 |
The first request of each 5-minute window pays full price and writes to cache. All subsequent requests within that window pay the cached rate.
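The table's figures follow from simple arithmetic, reproduced in this sketch (prices from the table above; the small first-request cache-write premium is ignored, as in the table):

```python
requests_per_day = 50
system_prompt_tokens = 20_000
price_per_m = 3.0        # Claude Sonnet input price, $ per million tokens
cache_discount = 0.9     # 90% off cached input tokens

daily_tokens = requests_per_day * system_prompt_tokens       # 1,000,000
daily_cost_full = daily_tokens / 1_000_000 * price_per_m     # $3.00
daily_cost_cached = daily_cost_full * (1 - cache_discount)   # $0.30

monthly_full = daily_cost_full * 30        # $90.00
monthly_cached = daily_cost_cached * 30    # $9.00
savings = monthly_full - monthly_cached    # $81.00
print(f"monthly savings: ${savings:.2f}")
```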
## Code Examples
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# The gateway auto-injects cache_control for Anthropic models
# when the system prompt exceeds 3000 characters.
# For OpenAI/DeepSeek models, caching is fully automatic.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            # Must be >= 3000 chars for auto-injection
            "content": "Your long system prompt here..." * 100,
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
    max_tokens=1024,
)

# Check cache usage
usage = response.usage
print(f"Total input tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {usage.prompt_tokens_details.cached_tokens}")
```
## Tips
- Keep your system prompt stable. Caching works on exact prefix matches. If you change a single character in your system prompt, the cache is invalidated.
- Front-load static content. Place instructions and reference documents at the beginning of the conversation. Dynamic content (user messages) goes at the end.
- Reuse conversations. Multi-turn conversations naturally benefit from caching -- the entire conversation prefix up to the latest message can be cached.
- Monitor via the Generation API. Use `GET /v1/generation` to see the exact cost breakdown, including cached vs. non-cached tokens.
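The tips above can be automated. The sketch below queries the Generation API and summarizes the cache hit rate. Only the `/v1/generation` path comes from this page; the `id` query parameter and the response field names (`cached_tokens`, `prompt_tokens`) are illustrative assumptions — consult the Generation API reference for the actual schema.

```python
import json
import urllib.request


def fetch_generation(generation_id: str, api_key: str,
                     base_url: str = "https://api.chuizi.ai") -> dict:
    """Fetch one generation record from the gateway.

    NOTE: the ?id= query parameter is an assumption, not a documented
    contract -- check the Generation API reference.
    """
    req = urllib.request.Request(
        f"{base_url}/v1/generation?id={generation_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize(gen: dict) -> str:
    """Summarize cache effectiveness from a generation record.

    Field names here are hypothetical; adapt to the real schema.
    """
    cached = gen.get("cached_tokens", 0)
    total = gen.get("prompt_tokens", 0)
    hit_rate = cached / total if total else 0.0
    return f"cache hit rate: {hit_rate:.0%} ({cached}/{total} input tokens)"
```

For example, `summarize({"cached_tokens": 14000, "prompt_tokens": 15000})` reports a 93% hit rate.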
## Next Steps
- Cost Optimization — additional strategies beyond caching
- Cache Pricing — detailed cache hit/miss pricing per provider
- Chat Completions API — `cache_control` parameter reference