Cache Discount Pricing

Prompt caching lets the provider store the processed form of repeated system prompts and long context prefixes. Subsequent requests that hit the cache pay only a fraction of the input price for the cached tokens. For long-conversation tools like Claude Code and Cursor, this can cut input costs by 80-90%.

What Is Prompt Caching

When you send multiple requests with the same system prompt or conversation prefix, the provider caches the processed tokens. Cached tokens do not need to be reprocessed and are billed at a significantly reduced rate.

Key point: The Chuizi.AI gateway automatically injects cache_control markers for Anthropic models when the system prompt is 3000 characters or longer. You do not need to change any code.
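
For illustration only, the injection described above might look roughly like the sketch below. The gateway's real implementation is not shown in this document, so the function name `inject_cache_control` and the constant `MIN_SYSTEM_CHARS` are hypothetical. The block shape (a `system` list of text blocks carrying `cache_control` of type `ephemeral`) follows Anthropic's Messages API.

```python
# Hypothetical sketch of gateway-side injection; not Chuizi.AI's actual code.
MIN_SYSTEM_CHARS = 3000  # threshold stated in the text above


def inject_cache_control(request: dict) -> dict:
    """Return an Anthropic-style request whose system prompt is marked cacheable."""
    system = request.get("system")
    # Only plain-string system prompts at or above the threshold are rewritten.
    if not isinstance(system, str) or len(system) < MIN_SYSTEM_CHARS:
        return request
    patched = dict(request)  # shallow copy so the caller's dict is untouched
    patched["system"] = [
        {
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return patched
```

Requests with a short system prompt pass through unchanged, which matches the "no code changes" behavior described above.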

Providers with Cache Support

| Provider | Cache Type | Trigger | Savings |
| --- | --- | --- | --- |
| Anthropic | Explicit caching | Gateway auto-injects cache_control | 90% |
| OpenAI | Automatic caching | Prefix >= 1024 tokens | 50% |
| DeepSeek | Disk caching | Automatic, 64-token alignment | 90% |
| Google Gemini | Implicit caching | Automatic | 90% |

Four Token Types

With caching enabled, each request's usage may include four token types:

| Token Type | Description | Price Multiplier |
| --- | --- | --- |
| input_tokens | Non-cached standard input tokens | 1x (standard price) |
| output_tokens | Model-generated output tokens | 1x (standard price) |
| cache_write_tokens | Tokens written to cache for the first time | 1.25x (125% of input price) |
| cache_read_tokens | Tokens read from cache | 0.1x (10% of input price) |
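
The write and read multipliers imply a break-even point. The sketch below compares the cost of sending the same prefix several times with and without caching, measured in units of one uncached send. It assumes every request after the first hits the cache (that is, arrives before the cache expires); the function name is illustrative.

```python
CACHE_WRITE_MULT = 1.25  # first request writes the prefix to cache
CACHE_READ_MULT = 0.10   # later requests read it back


def relative_prefix_cost(requests: int, cached: bool) -> float:
    """Cost of sending one prefix `requests` times, in units of one uncached send."""
    if requests <= 0:
        return 0.0
    if not cached:
        return float(requests)
    # One cache write, then (requests - 1) cache reads.
    return CACHE_WRITE_MULT + CACHE_READ_MULT * (requests - 1)


for n in (1, 2, 10):
    uncached = relative_prefix_cost(n, cached=False)
    cached = relative_prefix_cost(n, cached=True)
    print(n, uncached, round(cached, 2))
# 1 1.0 1.25
# 2 2.0 1.35
# 10 10.0 2.15
```

Under these assumptions, caching costs 25% more on a single request and is already cheaper from the second request on.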

Cache Price Table (Anthropic)

| Model | Input ($/1M) | Cache Write ($/1M) | Cache Read ($/1M) | Output ($/1M) |
| --- | --- | --- | --- | --- |
| Claude Opus 4-6 | $15.00 | $18.75 | $1.50 | $75.00 |
| Claude Sonnet 4-6 | $3.00 | $3.75 | $0.30 | $15.00 |
| Claude Haiku 4-5 | $1.00 | $1.25 | $0.10 | $5.00 |

Prices above are upstream costs. Actual billing applies the x1.05 multiplier. See Billing Model for details on the multiplier.

Billing Formula

The complete cache billing formula, with each price expressed per token (the $/1M table price divided by 1,000,000):

cost = (
  cache_write_tokens x cache_write_price
  + cache_read_tokens x cache_read_price
  + input_tokens x input_price
  + output_tokens x output_price
) x 1.05
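
As a minimal sketch, the formula can be written in Python as follows. The Sonnet 4-6 prices are copied from the table above; the `PRICES` layout, the model key, and the function name `billed_cost` are assumptions made for this example. `Decimal` is used to avoid floating-point rounding in currency arithmetic.

```python
from decimal import Decimal

MARKUP = Decimal("1.05")          # gateway multiplier from the formula
PER_MILLION = Decimal(1_000_000)  # table prices are quoted per 1M tokens

# Upstream prices in $ per 1M tokens (Claude Sonnet 4-6 row of the price table).
PRICES = {
    "claude-sonnet-4-6": {
        "input": "3.00",
        "cache_write": "3.75",
        "cache_read": "0.30",
        "output": "15.00",
    },
}


def billed_cost(model, input_tokens=0, output_tokens=0,
                cache_write_tokens=0, cache_read_tokens=0) -> Decimal:
    """Apply the billing formula and return the billed cost in dollars."""
    p = {k: Decimal(v) / PER_MILLION for k, v in PRICES[model].items()}
    upstream = (
        cache_write_tokens * p["cache_write"]
        + cache_read_tokens * p["cache_read"]
        + input_tokens * p["input"]
        + output_tokens * p["output"]
    )
    return upstream * MARKUP


# 20K cache_read + 4K input + 2K output on Sonnet 4-6
cost = billed_cost("claude-sonnet-4-6", input_tokens=4_000,
                   output_tokens=2_000, cache_read_tokens=20_000)
print(f"${cost:.4f}")  # $0.0504
```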

Real-World Example: Claude Code Session

A typical Claude Code conversation:

| Step | Request Details | Upstream Cost (Sonnet 4-6) |
| --- | --- | --- |
| Request 1 | 20K cache_write + 2K input + 1K output | $0.0960 |
| Request 2 | 20K cache_read + 3K input + 1.5K output | $0.0375 |
| Request 3 | 20K cache_read + 4K input + 2K output | $0.0480 |
| Request 10 | 20K cache_read + 8K input + 3K output | $0.0750 |

Costs are computed from the upstream Sonnet 4-6 prices, before the x1.05 multiplier.

Request 2 without caching: (23K input + 1.5K output) x standard price = $0.0915

Request 2 with caching: $0.0375 (59% savings)

Savings track the share of cache_read tokens in a request: as more of a long conversation is served from cache, the savings increase.
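
The Request 2 comparison can be recomputed from the Sonnet 4-6 upstream prices with a short sketch. It assumes that, without caching, the whole input side (the 20K prefix plus the 3K of new input) is billed at the standard input price, and it leaves out the x1.05 multiplier because that applies equally to both cases. The function name is illustrative.

```python
# Sonnet 4-6 upstream prices in $ per 1M tokens (from the price table).
INPUT_PRICE, CACHE_READ_PRICE, OUTPUT_PRICE = 3.00, 0.30, 15.00


def caching_savings(prefix_tokens: int, new_input_tokens: int, output_tokens: int):
    """Return (cached_cost, uncached_cost, savings_fraction) for one request."""
    output_cost = output_tokens * OUTPUT_PRICE
    cached = (prefix_tokens * CACHE_READ_PRICE
              + new_input_tokens * INPUT_PRICE + output_cost) / 1_000_000
    # Without caching, the prefix is billed as ordinary input.
    uncached = ((prefix_tokens + new_input_tokens) * INPUT_PRICE
                + output_cost) / 1_000_000
    return cached, uncached, 1 - cached / uncached


cached, uncached, savings = caching_savings(20_000, 3_000, 1_500)
print(f"cached=${cached:.4f} uncached=${uncached:.4f} savings={savings:.1%}")
```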

View Cache Details in the Generation API

Use the Generation API to inspect cache token breakdowns for each request:

```bash
curl "https://api.chuizi.ai/v1/generation?id=gen-abc123" \
  -H "Authorization: Bearer ck-your-key"
```

Example response:

```json
{
  "id": "gen-abc123",
  "model": "anthropic/claude-sonnet-4-6",
  "input_tokens": 2000,
  "output_tokens": 500,
  "cache_read_tokens": 18000,
  "cache_write_tokens": 0,
  "cost": "0.01984500",
  "created_at": "2026-04-04T10:30:00Z"
}
```
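
A small Python sketch of the same lookup, plus a cache hit ratio derived from the response, is shown below. The endpoint, header, and field names are taken from the example above. The helper names and the definition of hit ratio (cache reads divided by all input-side tokens) are assumptions for illustration, not part of the API.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.chuizi.ai/v1/generation"


def fetch_generation(gen_id: str, api_key: str) -> dict:
    """Fetch one generation record (same request as the curl example)."""
    url = f"{BASE_URL}?{urllib.parse.urlencode({'id': gen_id})}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def cache_hit_ratio(gen: dict) -> float:
    """Share of input-side tokens that were served from cache."""
    read = gen.get("cache_read_tokens", 0)
    total = read + gen.get("cache_write_tokens", 0) + gen.get("input_tokens", 0)
    return read / total if total else 0.0
```

For the example response, `cache_hit_ratio` gives 18000 / 20000 = 0.9.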

Maximizing Cache Hit Rate

  1. Keep system prompts identical — even a single character difference invalidates the cache
  2. Use long system prompts — the longer the cached prefix, the greater the savings
  3. Maintain request frequency — caches have a TTL (approximately 5 minutes for Anthropic), so infrequent requests may cause cache expiration
  4. Use Anthropic models — Chuizi.AI auto-injects cache_control with zero configuration
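
Points 1 and 3 can be checked on the client side with a sketch like the one below. It fingerprints the system prompt so that accidental changes (for example, an embedded timestamp) become visible, and it flags gaps longer than the cache TTL. The helper names are hypothetical, and the 300-second default is taken from the approximate Anthropic TTL mentioned above.

```python
import hashlib

ANTHROPIC_TTL_SECONDS = 300  # approximately 5 minutes, per the list above


def prompt_fingerprint(system_prompt: str) -> str:
    """Short, stable hash of the system prompt; any character change alters it."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]


def likely_expired(seconds_since_last_request: float,
                   ttl_seconds: float = ANTHROPIC_TTL_SECONDS) -> bool:
    """True if the gap between requests probably let the cache entry expire."""
    return seconds_since_last_request >= ttl_seconds


base = "You are a coding assistant."
# A single trailing space is enough to produce a different prefix.
print(prompt_fingerprint(base) == prompt_fingerprint(base + " "))  # False
```

Logging the fingerprint with each request makes it easy to see when a supposedly fixed prompt has drifted.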

For more optimization strategies, see the Cost Optimization guide.

Next Steps