Cache Discount Pricing

Prompt Caching allows repeated system prompts and long context prefixes to be cached on the provider side. Subsequent requests that hit the cache pay only a fraction of the input price. For long-conversation tools like Claude Code and Cursor, this saves 80-90% on input costs.

What Is Prompt Caching

When you send multiple requests with the same system prompt or conversation prefix, the provider caches the processed tokens. Cached tokens do not need to be reprocessed and are billed at a significantly reduced rate.

Key point: The Chuizi.AI gateway automatically injects cache_control markers for Anthropic models when the system prompt is 3000 characters or longer. You do not need to change any code.

Providers with Cache Support

Provider	Cache Type	Trigger	Savings
Anthropic	Explicit caching	Gateway auto-injects cache_control	90%
OpenAI	Automatic caching	Prefix >= 1024 tokens	50%
DeepSeek	Disk caching	Automatic, 64-token alignment	90%
Google Gemini	Implicit caching	Automatic	90%

Four Token Types

With caching enabled, each request's usage may include four token types:

Token Type	Description	Price Multiplier
`input_tokens`	Non-cached standard input tokens	1x (standard price)
`output_tokens`	Model-generated output tokens	1x (standard price)
`cache_write_tokens`	Tokens written to cache for the first time	1.25x (125% of input price)
`cache_read_tokens`	Tokens read from cache	0.1x (10% of input price)

Cache Price Table (Anthropic)

Model	Input ($/1M)	Cache Write ($/1M)	Cache Read ($/1M)	Output ($/1M)
Claude Opus 4-6	$15.00	$18.75	$1.50	$75.00
Claude Sonnet 4-6	$3.00	$3.75	$0.30	$15.00
Claude Haiku 4-5	$1.00	$1.25	$0.10	$5.00

Prices above are upstream costs. Actual billing applies the x1.05 multiplier. See Billing Model for details on the multiplier.

Billing Formula

The complete cache billing formula:

cost = (
  cache_write_tokens x cache_write_price
  + cache_read_tokens x cache_read_price
  + input_tokens x input_price
  + output_tokens x output_price
) x 1.05

Real-World Example: Claude Code Session

A typical Claude Code conversation:

Step	Request Details	Cost (Sonnet 4-6)
Request 1	20K cache_write + 2K input + 1K output	$0.0907
Request 2	20K cache_read + 3K input + 1.5K output	$0.0378
Request 3	20K cache_read + 4K input + 2K output	$0.0480
Request 10	20K cache_read + 8K input + 3K output	$0.0780

Request 2 without caching: (22K input + 1.5K output) x standard price = $0.0885

Request 2 with caching: $0.0378 (57% savings)

As the conversation continues, the proportion of cache_read tokens grows and savings increase.

View Cache Details in the Generation API

Use the Generation API to inspect cache token breakdowns for each request:

terminal

bash

curl "https://api.chuizi.ai/v1/generation?id=gen-abc123" \
  -H "Authorization: Bearer ck-your-key"

config.json

json

{
  "id": "gen-abc123",
  "model": "anthropic/claude-sonnet-4-6",
  "input_tokens": 2000,
  "output_tokens": 500,
  "cache_read_tokens": 18000,
  "cache_write_tokens": 0,
  "cost": "0.01417500",
  "created_at": "2026-04-04T10:30:00Z"
}

Maximizing Cache Hit Rate

Keep system prompts identical — even a single character difference invalidates the cache
Use long system prompts — the longer the cached prefix, the greater the savings
Maintain request frequency — caches have a TTL (approximately 5 minutes for Anthropic), so infrequent requests may cause cache expiration
Use Anthropic models — Chuizi.AI auto-injects cache_control with zero configuration

For more optimization strategies, see the Cost Optimization guide.

Next Steps

Billing Model — Understand the three billing types and the 1.05x multiplier
Cost Optimization — Strategies to reduce your overall API spend
Prompt Caching Guide — Step-by-step guide to enabling caching in your requests