Prompt Caching

Supported Models and Savings

Provider	Models	Trigger Condition	Savings	Your Action
OpenAI / Azure	GPT-4.1, GPT-4o, GPT-5, o3, o4-mini (all GPT models)	Automatic when prefix >= 1024 tokens	50% on cached input tokens	None required
Anthropic	All 9 Claude models (Opus 4, Sonnet 4, Haiku 3.5, etc.)	Gateway auto-injects `cache_control` when system prompt >= 3000 characters	90% on cached input tokens	None required (OpenAI protocol)
DeepSeek	DeepSeek V3, DeepSeek R1, DeepSeek Chat	Automatic disk caching, 64-token alignment	90% on cached input tokens	None required
Google	Gemini 2.5 Pro, Gemini 2.5 Flash	Implicit caching, automatic	90% on cached input tokens	None required
Alibaba (Bailian)	GLM-5, MiniMax M2.5, Qwen models	Automatic	Varies by model	None required

How It Works

OpenAI / Azure Models

OpenAI caches automatically. When your request prefix (system prompt plus early messages) is at least 1024 tokens and matches the prefix of a previous request, OpenAI bills the matching cached tokens at 50% of the input price.

You do not need to change anything. The gateway passes your request through, and the response usage field reports how many tokens were cached:

config.json

json

{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}

Anthropic / Claude Models

Anthropic requires explicit cache_control markers in the request to enable caching. When you use Chuizi.AI through the OpenAI-compatible protocol (/v1/chat/completions), the gateway automatically injects cache_control: { "type": "ephemeral" } on your system prompt when it exceeds 3000 characters. You get 90% savings with zero code changes.
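To make the injection step concrete, here is a minimal sketch of the transformation described above. It is an illustration, not the gateway's actual implementation: the function name `inject_cache_control` and the constant `CACHE_THRESHOLD_CHARS` are hypothetical, and the sketch assumes the 3000-character threshold is inclusive.

```python
from typing import Any, Dict, List

# Threshold taken from this guide; treating it as inclusive is an assumption.
CACHE_THRESHOLD_CHARS = 3000


def inject_cache_control(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `messages` with long system prompts marked cacheable.

    Hypothetical helper for illustration only. Short system prompts and all
    non-system messages are passed through unchanged.
    """
    result = []
    for message in messages:
        content = message.get("content")
        if (
            message.get("role") == "system"
            and isinstance(content, str)
            and len(content) >= CACHE_THRESHOLD_CHARS
        ):
            # Convert the plain string into a content block so that a
            # cache_control marker can be attached to it.
            message = {
                **message,
                "content": [
                    {
                        "type": "text",
                        "text": content,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            }
        result.append(message)
    return result
```

The input list is not mutated, so the same messages can be reused for a request to a provider that does not understand `cache_control`.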

When you use the Anthropic native protocol (/anthropic/v1/messages), the gateway does not modify your request body. You need to add cache_control yourself:

config.json

json

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "Your long system prompt with instructions, context, documentation, etc. This needs to be at least 1024 tokens for Anthropic to cache it.",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Your question here" }
  ]
}

Cache entries persist for 5 minutes after the last access. Every cache hit resets the TTL.
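The sliding TTL can be modeled with a few lines of Python. This is a simplified sketch of the behavior described above, not Anthropic's implementation: `count_cache_hits` is a hypothetical helper, and it assumes a request arriving exactly at the 5-minute mark still counts as a hit.

```python
from typing import List, Optional, Tuple

CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL described above


def count_cache_hits(
    request_times: List[float], ttl: float = CACHE_TTL_SECONDS
) -> Tuple[int, int]:
    """Return (hits, writes) for request timestamps given in seconds.

    Timestamps must be sorted in ascending order. A request is a hit when it
    arrives within `ttl` seconds of the previous request; otherwise the entry
    has expired and the request writes a fresh one.
    """
    hits = 0
    writes = 0
    last_access: Optional[float] = None
    for t in request_times:
        if last_access is not None and t - last_access <= ttl:
            hits += 1
        else:
            writes += 1
        # Hits and writes both refresh the entry, so the TTL restarts here.
        last_access = t
    return hits, writes
```

For example, requests at 0, 120, 400, 800, and 1101 seconds produce two hits and three writes: the request at 400 seconds is a hit because the hit at 120 seconds reset the TTL, while the gaps before 800 and 1101 seconds are longer than 5 minutes.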

DeepSeek Models

DeepSeek uses automatic disk-based caching. The input is segmented into 64-token blocks, and matching prefix blocks are served from cache at a 90% discount. No action is required on your side.
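The effect of 64-token alignment can be estimated as follows. This is a rough model of the behavior described above, not DeepSeek's internal matching logic: `cacheable_prefix_tokens` is a hypothetical helper, and the token IDs are placeholders for whatever the tokenizer produces.

```python
from typing import List

BLOCK_SIZE = 64  # block size described above


def cacheable_prefix_tokens(
    previous: List[int], current: List[int], block_size: int = BLOCK_SIZE
) -> int:
    """Estimate how many tokens of `current` could be served from cache.

    Counts the tokens shared at the start of both sequences, then rounds
    down to a whole number of blocks, since only complete blocks can match.
    """
    shared = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        shared += 1
    return (shared // block_size) * block_size
```

Under this model, two requests that share their first 150 tokens reuse 128 of them (two full blocks), and a shared prefix shorter than 64 tokens reuses none.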

Gemini 2.5+ Models

Google Gemini 2.5 and later models use implicit caching. When the API detects repeated prefixes, it automatically caches them. Savings appear in the response usage field.

Reading Cache Results

After each request, check the usage.prompt_tokens_details field in the response:

config.json

json

{
  "usage": {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "total_tokens": 15500,
    "prompt_tokens_details": {
      "cached_tokens": 14000,
      "cache_creation_tokens": 0
    }
  }
}

cached_tokens: Tokens served from cache (charged at the discounted rate).
cache_creation_tokens: Tokens written to cache for the first time (Anthropic charges a small write premium; other providers do not).
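A small helper can turn these fields into a hit ratio for logging or dashboards. This is a sketch that works on the usage object as a plain dictionary; `summarize_cache_usage` is a hypothetical name, and it assumes `prompt_tokens` includes the cached tokens, as in the example above.

```python
from typing import Any, Dict


def summarize_cache_usage(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize cache statistics from a response `usage` dictionary.

    Missing or null fields are treated as zero so the helper also works for
    responses that carry no cache details.
    """
    details = usage.get("prompt_tokens_details") or {}
    prompt_tokens = usage.get("prompt_tokens") or 0
    cached = details.get("cached_tokens") or 0
    created = details.get("cache_creation_tokens") or 0
    hit_ratio = cached / prompt_tokens if prompt_tokens else 0.0
    return {
        "cached_tokens": cached,
        "cache_creation_tokens": created,
        # Assumes prompt_tokens includes the cached tokens.
        "uncached_tokens": prompt_tokens - cached,
        "hit_ratio": hit_ratio,
    }
```

For the response above, the summary reports 1,000 uncached tokens and a hit ratio of 14,000 / 15,000, or roughly 93%.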

Cost Comparison: Claude Code User

A typical Claude Code user sends ~50 requests per day, each with a ~20,000-token system prompt.

Metric	Without Caching	With Caching (90% savings)
Daily input tokens (system prompt only)	1,000,000	1,000,000
Daily input cost (Claude Sonnet, $3/M)	$3.00	$0.30
Monthly input cost (30 days)	$90.00	$9.00
Monthly savings	—	$81.00

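The table shows the best case, in which every request hits the cache. In practice, the first request after the cache expires pays the full input price and writes to the cache (Anthropic adds a small write premium). Every later request that arrives within 5 minutes of the previous one pays the cached rate and resets the TTL.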
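The figures in the table can be reproduced, and adjusted for cache misses, with a short calculation. This is a sketch using the numbers from this section; `daily_input_cost` is a hypothetical helper, and it ignores the cache write premium for simplicity.

```python
def daily_input_cost(
    requests_per_day: int,
    prompt_tokens: int,
    price_per_million: float,
    cache_discount: float = 0.0,
    uncached_requests: int = 0,
) -> float:
    """Daily system-prompt input cost in dollars.

    `uncached_requests` is the number of requests that miss the cache and pay
    the full price; the rest pay the discounted rate. The cache write premium
    is not modeled.
    """
    full_tokens = uncached_requests * prompt_tokens
    cached_tokens = (requests_per_day - uncached_requests) * prompt_tokens
    full_cost = full_tokens * price_per_million / 1_000_000
    cached_cost = cached_tokens * price_per_million * (1 - cache_discount) / 1_000_000
    return full_cost + cached_cost


without_cache = daily_input_cost(50, 20_000, 3.0)
with_cache = daily_input_cost(50, 20_000, 3.0, cache_discount=0.9)
monthly_savings = (without_cache - with_cache) * 30
print(f"Daily: ${without_cache:.2f} vs ${with_cache:.2f}")
print(f"Monthly savings: ${monthly_savings:.2f}")
```

If, say, 5 of the 50 daily requests miss the cache, `daily_input_cost(50, 20_000, 3.0, cache_discount=0.9, uncached_requests=5)` comes to about $0.57 per day instead of $0.30, which is still well below the uncached $3.00.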

Code Examples

example.py

python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# The gateway auto-injects cache_control for Anthropic models
# when the system prompt exceeds 3000 characters.
# For OpenAI/DeepSeek models, caching is fully automatic.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here..." * 100,  # Must be >= 3000 chars for auto-injection
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
    max_tokens=1024,
)

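# Check cache usage. prompt_tokens_details may be missing on some responses,
# so fall back to 0 instead of raising AttributeError.
usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached_tokens = getattr(details, "cached_tokens", 0) or 0
print(f"Total input tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {cached_tokens}")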

Tips

Keep your system prompt stable. Caching works on exact prefix matches. If you change a single character in your system prompt, the prefix no longer matches from that point on, and everything after the change is billed at the full input rate.
Front-load static content. Place instructions and reference documents at the beginning of the conversation. Dynamic content (user messages) goes at the end.
Reuse conversations. Multi-turn conversations naturally benefit from caching -- the entire conversation prefix up to the latest message can be cached.
Monitor via the Generation API. Use GET /v1/generation to see the exact cost breakdown including cached vs. non-cached tokens.
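The first three tips can be combined into one message-building pattern: keep the static content in a fixed system prompt, append conversation history in order, and put the new user message last. The sketch below is illustrative only; `STATIC_SYSTEM_PROMPT`, `REFERENCE_DOCS`, `build_messages`, and `shared_prefix_length` are hypothetical names, and the prompt text is a placeholder.

```python
import json
from typing import Dict, List

# Static content: defined once and never edited between requests.
STATIC_SYSTEM_PROMPT = "You are a support assistant. Follow the policies below.\n"
REFERENCE_DOCS = "Policy 1: placeholder text\nPolicy 2: placeholder text\n"

Message = Dict[str, str]


def build_messages(history: List[Message], question: str) -> List[Message]:
    """Static prefix first, then prior turns, then the new user message."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT + REFERENCE_DOCS},
        *history,
        {"role": "user", "content": question},
    ]


def shared_prefix_length(a: List[Message], b: List[Message]) -> int:
    """Count the leading messages that are identical in both requests."""
    count = 0
    for x, y in zip(a, b):
        if json.dumps(x, sort_keys=True) != json.dumps(y, sort_keys=True):
            break
        count += 1
    return count


first = build_messages([], "How do I reset my password?")
history = [first[-1], {"role": "assistant", "content": "Open Settings, then Security."}]
second = build_messages(history, "And if I lost my phone?")
print(shared_prefix_length(first, second))
```

Here the second request reuses the system prompt and the first user message unchanged, so both fall inside the cacheable prefix. Inserting a timestamp or request ID into the system prompt would reduce the shared prefix to zero.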

Next Steps

Cost Optimization — additional strategies beyond caching
Cache Pricing — detailed cache hit/miss pricing per provider
Chat Completions API — cache_control parameter reference