Prompt Caching
Supported Models and Savings
| Provider | Models | Trigger Condition | Savings | Your Action |
|---|---|---|---|---|
| OpenAI / Azure | GPT-4.1, GPT-4o, GPT-5, o3, o4-mini (all GPT models) | Automatic when prefix >= 1024 tokens | 50% on cached input tokens | None required |
| Anthropic | All 9 Claude models (Opus 4, Sonnet 4, Haiku 3.5, etc.) | Gateway auto-injects cache_control when system prompt >= 3000 characters | 90% on cached input tokens | None required (OpenAI protocol) |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Chat | Automatic disk caching, 64-token alignment | 90% on cached input tokens | None required |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash | Implicit caching, automatic | 90% on cached input tokens | None required |
| Alibaba (Bailian) | GLM-5, MiniMax M2.5, Qwen models | Automatic | Varies by model | None required |
How It Works
OpenAI / Azure Models
OpenAI caches automatically. When your request prefix (system prompt plus early messages) is at least 1024 tokens and matches the prefix of a previous request, OpenAI bills the cached tokens at 50% of the input price.
You do not need to change anything. The gateway passes your request through, and the response usage field reports how many tokens were cached:
```json
{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}
```
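As a minimal sketch, the cached share of a prompt can be computed from that usage object. The helper name `cached_fraction` is illustrative and not part of any SDK; it assumes the usage payload has already been parsed into a plain dict shaped like the example above.

```python
def cached_fraction(usage: dict) -> float:
    """Return the share of prompt tokens served from cache (0.0 to 1.0)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # Defensive: treat a missing or null details object as "nothing cached".
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
print(f"{cached_fraction(usage):.0%} of the prompt was cached")
# 50% of the prompt was cached
```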
Anthropic / Claude Models
Anthropic requires explicit cache_control markers in the request to enable caching. When you use Chuizi.AI through the OpenAI-compatible protocol (/v1/chat/completions), the gateway automatically injects cache_control: { "type": "ephemeral" } on your system prompt when it exceeds 3000 characters. You get the 90% discount on cached input tokens with no code changes.
When you use the Anthropic native protocol (/anthropic/v1/messages), the gateway does not modify your request body. You need to add cache_control yourself:
```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "Your long system prompt with instructions, context, documentation, etc. This needs to be at least 1024 tokens for Anthropic to cache it.",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Your question here" }
  ]
}
```
Cache entries persist for 5 minutes after the last access. Every cache hit resets the TTL.
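The sliding expiry described above can be sketched as a small simulation. This is an illustrative model of a single cached prefix, not Anthropic's or the gateway's implementation; the function name `classify_requests` is made up for this example, and treating a request at exactly 5 minutes as a hit is an assumption.

```python
TTL_SECONDS = 5 * 60  # cache lifetime after the last access, per the text above


def classify_requests(timestamps: list[float], ttl: float = TTL_SECONDS) -> list[str]:
    """Label each request time (in seconds) as a cache 'hit' or 'miss'.

    Models one cached prefix: a miss writes the entry, and every access
    (hit or fresh write) restarts the TTL countdown.
    """
    results = []
    last_access = None
    for t in sorted(timestamps):
        if last_access is not None and t - last_access <= ttl:
            results.append("hit")
        else:
            results.append("miss")
        last_access = t  # both hits and fresh writes reset the TTL
    return results


# Requests at 0, 4, 8 and 12 minutes keep the entry warm;
# the request at 20 minutes arrives after the entry has expired.
times = [0, 240, 480, 720, 1200]
print(classify_requests(times))
# ['miss', 'hit', 'hit', 'hit', 'miss']
```

Note that the request at 12 minutes is still a hit even though more than 5 minutes have passed since the first write, because each intermediate hit reset the timer.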
DeepSeek Models
DeepSeek uses automatic disk-based caching. The input is segmented into 64-token blocks, and matching prefix blocks are served from cache at 90% discount. No action required on your side.
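To illustrate the effect of 64-token alignment, the sketch below estimates how many tokens of a new request could be served from cache given a previous request. It assumes that only complete 64-token blocks of the shared prefix are reused, which is one reading of the description above; the helper name and the integer token IDs are hypothetical.

```python
BLOCK_SIZE = 64  # tokens per cache block, per the description above


def cacheable_prefix_tokens(previous: list[int], current: list[int],
                            block_size: int = BLOCK_SIZE) -> int:
    """Count tokens of `current` that lie in whole blocks shared with `previous`."""
    shared = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        shared += 1
    # Assumption: a partially matching trailing block is not reused.
    return (shared // block_size) * block_size


prev = list(range(200))
curr = list(range(150)) + [-1] * 50  # diverges after 150 shared tokens
print(cacheable_prefix_tokens(prev, curr))
# 128
```

Under this assumption, 150 shared tokens yield two full blocks (128 tokens) of cache reuse, and the remaining 22 shared tokens are billed at the normal rate.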
Gemini 2.5+ Models
Google Gemini 2.5 and later models use implicit caching. When the API detects repeated prefixes, it automatically caches them. Savings appear in the response usage field.
Reading Cache Results
After each request, check the usage.prompt_tokens_details field in the response:
```json
{
  "usage": {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "total_tokens": 15500,
    "prompt_tokens_details": {
      "cached_tokens": 14000,
      "cache_creation_tokens": 0
    }
  }
}
```
- cached_tokens: Tokens served from cache (charged at the discounted rate).
- cache_creation_tokens: Tokens written to cache for the first time (Anthropic charges a small write premium; other providers do not).
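These two fields are enough for a rough estimate of what a request's input cost. The sketch below is an approximation under stated assumptions, not the gateway's billing logic: it assumes prompt_tokens is the total and already includes both cached and newly written tokens, and it takes the discount and write premium as parameters because the exact rates vary by provider. The function name `input_cost_usd` is illustrative.

```python
def input_cost_usd(usage: dict, price_per_million: float,
                   cache_discount: float, write_premium: float = 0.0) -> float:
    """Estimate input cost in USD from a parsed usage dict."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    created = details.get("cache_creation_tokens", 0) or 0
    # Assumption: prompt_tokens includes cached and cache-creation tokens.
    regular = usage["prompt_tokens"] - cached - created
    per_token = price_per_million / 1_000_000
    return per_token * (
        regular
        + cached * (1 - cache_discount)   # discounted cache reads
        + created * (1 + write_premium)   # cache writes, premium is provider-specific
    )


usage = {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "total_tokens": 15500,
    "prompt_tokens_details": {"cached_tokens": 14000, "cache_creation_tokens": 0},
}
cost = input_cost_usd(usage, price_per_million=3.0, cache_discount=0.9)
print(f"${cost:.4f}")
# $0.0072
```

For comparison, the same 15,000 input tokens at $3 per million without caching would cost $0.045.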
Cost Comparison: Claude Code User
A typical Claude Code user sends ~50 requests per day, each with a ~20,000-token system prompt.
| Metric | Without Caching | With Caching (90% savings) |
|---|---|---|
| Daily input tokens (system prompt only) | 1,000,000 | 1,000,000 |
| Daily input cost (Claude Sonnet, $3/M) | $3.00 | $0.30 |
| Monthly input cost (30 days) | $90.00 | $9.00 |
| Monthly savings | — | $81.00 |
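The table shows the best case, in which every request hits the cache. In practice, the first request after the cache has expired (no access for 5 minutes) pays full price and writes to the cache. Subsequent requests pay the cached rate as long as each one arrives within 5 minutes of the previous one.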
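The figures in the table can be reproduced with a short calculation, which also makes it easy to see how cache misses change the result. This is a simplified sketch: it ignores any cache-write premium, and the "five misses per day" scenario is a hypothetical input, not a measured value.

```python
def monthly_prompt_cost(requests_per_day: int, prompt_tokens: int,
                        price_per_million: float, cache_discount: float,
                        misses_per_day: int = 0, days: int = 30) -> float:
    """Monthly cost of a repeated system prompt; misses are billed at full price."""
    per_request_full = prompt_tokens * price_per_million / 1_000_000
    per_request_cached = per_request_full * (1 - cache_discount)
    hits_per_day = requests_per_day - misses_per_day
    daily = misses_per_day * per_request_full + hits_per_day * per_request_cached
    return daily * days


no_cache = monthly_prompt_cost(50, 20_000, 3.0, cache_discount=0.0)
best_case = monthly_prompt_cost(50, 20_000, 3.0, cache_discount=0.9)
five_misses = monthly_prompt_cost(50, 20_000, 3.0, cache_discount=0.9, misses_per_day=5)
print(f"{no_cache:.2f} {best_case:.2f} {five_misses:.2f}")
# 90.00 9.00 17.10
```

Even with five cold starts a day in this hypothetical scenario, the monthly cost stays well below the uncached $90.00.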
Code Examples
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# The gateway auto-injects cache_control for Anthropic models
# when the system prompt exceeds 3000 characters.
# For OpenAI/DeepSeek models, caching is fully automatic.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            # Must be >= 3000 chars for auto-injection
            "content": "Your long system prompt here..." * 100,
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
    max_tokens=1024,
)

# Check cache usage
usage = response.usage
print(f"Total input tokens: {usage.prompt_tokens}")
# prompt_tokens_details is optional in the SDK's types, so guard before reading it.
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
print(f"Cached tokens: {cached}")
```
Tips
- Keep your system prompt stable. Caching works on exact prefix matches. If you change a single character in your system prompt, the cache is invalidated.
- Front-load static content. Place instructions and reference documents at the beginning of the conversation. Dynamic content (user messages) goes at the end.
- Reuse conversations. Multi-turn conversations naturally benefit from caching -- the entire conversation prefix up to the latest message can be cached.
- Monitor via the Generation API. Use GET /v1/generation to see the exact cost breakdown including cached vs. non-cached tokens.
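The first three tips can be combined into one message-building pattern, sketched below. The helper names and prompt text are hypothetical. Real caches match on tokens rather than whole messages, so counting identical leading messages is only a rough proxy for how much of a request can be reused.

```python
# Static content lives in one constant so it is byte-for-byte identical on every request.
STATIC_SYSTEM_PROMPT = "You are a code reviewer. Follow the style guide below.\n" + "rule\n" * 50


def build_messages(history: list[dict], user_message: str) -> list[dict]:
    """Static system prompt first, prior turns next, the new user message last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_message},
    ]


def shared_prefix_messages(a: list[dict], b: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


first = build_messages([], "Review utils.py")
# The second turn replays the earlier turns unchanged, then appends new content.
second = build_messages(
    first[1:] + [{"role": "assistant", "content": "Looks fine."}],
    "Now review main.py",
)
print(shared_prefix_messages(first, second))
# 2
```

Putting anything dynamic, such as a timestamp, into the system prompt would drop the shared prefix to zero messages and defeat caching for the whole request.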
Next Steps
- Cost Optimization: additional strategies beyond caching
- Cache Pricing: detailed cache hit/miss pricing per provider
- Chat Completions API: cache_control parameter reference