Cache Discount Pricing
Prompt caching lets providers cache repeated system prompts and long context prefixes on their side. Subsequent requests that hit the cache pay only a fraction of the input price for the cached tokens. For long-conversation tools such as Claude Code and Cursor, this can cut input costs by 80-90%.
What Is Prompt Caching
When you send multiple requests with the same system prompt or conversation prefix, the provider caches the processed tokens. Cached tokens do not need to be reprocessed and are billed at a significantly reduced rate.
Key point: The Chuizi.AI gateway automatically injects cache_control markers for Anthropic models when the system prompt is 3000 characters or longer. You do not need to change any code.
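To make the auto-injection concrete, the sketch below shows roughly what such a transformation could look like on an Anthropic-style request body. The function name `inject_cache_control` and the constant `MIN_SYSTEM_CHARS` are hypothetical, and this is not the gateway's actual implementation; only the `cache_control` block with type `ephemeral` follows the shape used by Anthropic's Messages API.

```python
import copy

MIN_SYSTEM_CHARS = 3000  # threshold stated above; the constant name is illustrative


def inject_cache_control(payload: dict) -> dict:
    """Return a copy of an Anthropic-style request with the system prompt marked cacheable.

    Hypothetical sketch, not the gateway's real code.
    """
    result = copy.deepcopy(payload)
    system = result.get("system")
    # Only plain-string system prompts are handled in this sketch.
    if isinstance(system, str) and len(system) >= MIN_SYSTEM_CHARS:
        result["system"] = [
            {
                "type": "text",
                "text": system,
                "cache_control": {"type": "ephemeral"},
            }
        ]
    return result
```

Requests with a shorter system prompt pass through unchanged, which matches the 3000-character rule described above.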
Providers with Cache Support
| Provider | Cache Type | Trigger | Savings on Cached Tokens |
|---|---|---|---|
| Anthropic | Explicit caching | Gateway auto-injects cache_control | 90% |
| OpenAI | Automatic caching | Prefix >= 1024 tokens | 50% |
| DeepSeek | Disk caching | Automatic, 64-token alignment | 90% |
| Google Gemini | Implicit caching | Automatic | 90% |
Four Token Types
With caching enabled, each request's usage may include four token types:
| Token Type | Description | Price Multiplier |
|---|---|---|
| input_tokens | Non-cached standard input tokens | 1x (standard input price) |
| output_tokens | Model-generated output tokens | 1x (standard output price) |
| cache_write_tokens | Tokens written to cache for the first time | 1.25x (125% of input price) |
| cache_read_tokens | Tokens read from cache | 0.1x (10% of input price) |
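The multipliers above imply that caching pays for itself quickly. As a rough illustration, the snippet below compares the relative input cost of a prefix that is sent once and then reused a number of times, with and without caching. The function names are illustrative; the 1.25x and 0.1x figures are taken from the table.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # first request writes the prefix to the cache
CACHE_READ_MULTIPLIER = 0.10   # later requests read it back


def relative_cost_cached(reads: int) -> float:
    """One cache write plus `reads` cache reads, in units of one uncached pass."""
    return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * reads


def relative_cost_uncached(reads: int) -> float:
    """The same prefix sent (1 + reads) times at the standard input price."""
    return 1.0 + reads


for reads in (0, 1, 5, 20):
    cached = relative_cost_cached(reads)
    uncached = relative_cost_uncached(reads)
    print(f"{reads:>2} reuses: cached {cached:.2f}x vs uncached {uncached:.2f}x")
```

With no reuse, caching costs 25% more than a plain request. From the first reuse onward it is cheaper (1.35x versus 2.00x).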
Cache Price Table (Anthropic)
| Model | Input ($/1M) | Cache Write ($/1M) | Cache Read ($/1M) | Output ($/1M) |
|---|---|---|---|---|
| Claude Opus 4-6 | $15.00 | $18.75 | $1.50 | $75.00 |
| Claude Sonnet 4-6 | $3.00 | $3.75 | $0.30 | $15.00 |
| Claude Haiku 4-5 | $1.00 | $1.25 | $0.10 | $5.00 |
Prices above are upstream costs. Actual billing multiplies them by 1.05. See Billing Model for details on the multiplier.
Billing Formula
The complete cache billing formula:
```
cost = (
    cache_write_tokens × cache_write_price
  + cache_read_tokens  × cache_read_price
  + input_tokens       × input_price
  + output_tokens      × output_price
) × 1.05
```
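The formula translates directly into code. The following is a minimal sketch, assuming the Sonnet 4-6 prices from the price table (quoted per 1M tokens) and the 1.05 multiplier; the function and constant names are illustrative and are not part of any Chuizi.AI SDK.

```python
from decimal import Decimal

# Sonnet 4-6 upstream prices in USD per 1M tokens, copied from the price table.
SONNET_PRICES = {
    "input": Decimal("3.00"),
    "cache_write": Decimal("3.75"),
    "cache_read": Decimal("0.30"),
    "output": Decimal("15.00"),
}
MULTIPLIER = Decimal("1.05")
PER_MILLION = Decimal(1_000_000)


def request_cost(input_tokens=0, output_tokens=0, cache_write_tokens=0,
                 cache_read_tokens=0, prices=SONNET_PRICES) -> Decimal:
    """Apply the billing formula to one request's token counts."""
    upstream = (
        cache_write_tokens * prices["cache_write"]
        + cache_read_tokens * prices["cache_read"]
        + input_tokens * prices["input"]
        + output_tokens * prices["output"]
    ) / PER_MILLION
    return upstream * MULTIPLIER


# Example: 10K cached tokens read, 1K fresh input tokens, 500 output tokens.
cost = request_cost(input_tokens=1_000, output_tokens=500, cache_read_tokens=10_000)
print(f"${cost:.6f}")
```

`Decimal` is used here so that small per-request amounts do not pick up floating-point rounding noise.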
Real-World Example: Claude Code Session
A typical Claude Code conversation:
| Step | Request Details | Cost (Sonnet 4-6, incl. 1.05x multiplier) |
|---|---|---|
| Request 1 | 20K cache_write + 2K input + 1K output | $0.1008 |
| Request 2 | 20K cache_read + 3K input + 1.5K output | $0.0394 |
| Request 3 | 20K cache_read + 4K input + 2K output | $0.0504 |
| Request 10 | 20K cache_read + 8K input + 3K output | $0.0788 |
Request 2 without caching: (23K input + 1.5K output) at standard prices x 1.05 = $0.0961
Request 2 with caching: $0.0394 (59% savings)
Savings scale with the share of each request that is served from cache: the larger the cached prefix relative to new input, the greater the discount.
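The comparison for a single request can be reproduced with a few lines of arithmetic. This sketch assumes the Sonnet 4-6 prices from the price table and treats every cached token as one that would otherwise have been billed as standard input; the helper name `savings_from_cache` is illustrative.

```python
INPUT_PRICE = 3.00        # USD per 1M tokens
CACHE_READ_PRICE = 0.30
OUTPUT_PRICE = 15.00
MULTIPLIER = 1.05


def savings_from_cache(cache_read_tokens, input_tokens, output_tokens):
    """Return (cost_with_cache, cost_without_cache, fractional_savings)."""
    with_cache = (
        cache_read_tokens * CACHE_READ_PRICE
        + input_tokens * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000 * MULTIPLIER
    # Without caching, the cached tokens are billed as ordinary input.
    without_cache = (
        (cache_read_tokens + input_tokens) * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000 * MULTIPLIER
    return with_cache, without_cache, 1 - with_cache / without_cache


# Request 2 from the table: 20K cache_read + 3K input + 1.5K output
with_cache, without_cache, saved = savings_from_cache(20_000, 3_000, 1_500)
print(f"with cache: ${with_cache:.4f}, without: ${without_cache:.4f}, saved: {saved:.0%}")
```

Because output tokens are never discounted, the overall saving on a request is always smaller than the 90% discount on the cached tokens themselves.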
View Cache Details in the Generation API
Use the Generation API to inspect cache token breakdowns for each request:
```bash
curl "https://api.chuizi.ai/v1/generation?id=gen-abc123" \
  -H "Authorization: Bearer ck-your-key"
```
```json
{
  "id": "gen-abc123",
  "model": "anthropic/claude-sonnet-4-6",
  "input_tokens": 2000,
  "output_tokens": 500,
  "cache_read_tokens": 18000,
  "cache_write_tokens": 0,
  "cost": "0.01984500",
  "created_at": "2026-04-04T10:30:00Z"
}
```
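A response like this can also be inspected programmatically. The sketch below uses only the Python standard library, with the endpoint and header taken from the curl example and the field names taken from the sample response. It assumes that `input_tokens` excludes cached tokens, as the token type table describes; the helper names are illustrative.

```python
import json
import urllib.request


def fetch_generation(generation_id: str, api_key: str) -> dict:
    """Fetch one generation record (endpoint and header as in the curl example)."""
    request = urllib.request.Request(
        f"https://api.chuizi.ai/v1/generation?id={generation_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def cache_read_share(record: dict) -> float:
    """Fraction of prompt-side tokens that were served from cache."""
    cached = record.get("cache_read_tokens", 0)
    prompt_total = (
        record.get("input_tokens", 0)
        + record.get("cache_write_tokens", 0)
        + cached
    )
    return cached / prompt_total if prompt_total else 0.0
```

For the sample record, 18,000 of the 20,000 prompt-side tokens came from cache, so `cache_read_share` returns 0.9.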
Maximizing Cache Hit Rate
- Keep system prompts identical — even a single character difference invalidates the cache
- Use long system prompts — the longer the cached prefix, the greater the savings
- Maintain request frequency — caches have a TTL (approximately 5 minutes for Anthropic), so infrequent requests may cause cache expiration
- Use Anthropic models — Chuizi.AI auto-injects cache_control with zero configuration
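Since a single changed character breaks the cache, it can help to verify on the client side that the system prompt really is identical across requests; dynamic content such as timestamps is a common source of accidental drift. The following is a hypothetical client-side check, not a Chuizi.AI feature, and the function names are illustrative.

```python
import hashlib


def prefix_fingerprint(system_prompt: str) -> str:
    """Short SHA-256 digest of the exact prompt text, byte for byte."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


def distinct_fingerprints(prompts: list[str]) -> set[str]:
    """Return the distinct fingerprints seen; more than one means the prefix is not stable."""
    return {prefix_fingerprint(p) for p in prompts}


# A trailing space is enough to produce a different fingerprint.
sent = ["You are a helpful assistant.", "You are a helpful assistant. "]
if len(distinct_fingerprints(sent)) > 1:
    print("system prompt changed between requests; expect cache misses")
```

Logging the fingerprint alongside each request makes it easy to correlate cache misses with prompt changes.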
For more optimization strategies, see the Cost Optimization guide.
Next Steps
- Billing Model — Understand the three billing types and the 1.05x multiplier
- Cost Optimization — Strategies to reduce your overall API spend
- Prompt Caching Guide — Step-by-step guide to enabling caching in your requests