Prompt Caching

Supported Models and Savings

| Provider | Models | Trigger Condition | Savings | Your Action |
|---|---|---|---|---|
| OpenAI / Azure | GPT-4.1, GPT-4o, GPT-5, o3, o4-mini (all GPT models) | Automatic when prefix >= 1024 tokens | 50% on cached input tokens | None required |
| Anthropic | All 9 Claude models (Opus 4, Sonnet 4, Haiku 3.5, etc.) | Gateway auto-injects cache_control when system prompt >= 3000 characters | 90% on cached input tokens | None required (OpenAI protocol) |
| DeepSeek | DeepSeek V3, DeepSeek R1, DeepSeek Chat | Automatic disk caching, 64-token alignment | 90% on cached input tokens | None required |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash | Implicit caching, automatic | 90% on cached input tokens | None required |
| Alibaba (Bailian) | GLM-5, MiniMax M2.5, Qwen models | Automatic | Varies by model | None required |

How It Works

OpenAI / Azure Models

OpenAI caches automatically. When your request prefix (system prompt + early messages) is at least 1024 tokens and matches the prefix of a previous request, OpenAI bills the cached tokens at 50% of the input price.

You do not need to change anything. The gateway passes your request through, and the response usage field reports how many tokens were cached:

json
{
  "usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {
      "cached_tokens": 1024
    }
  }
}

Anthropic / Claude Models

Anthropic requires explicit cache_control markers in the request to enable caching. When you use Chuizi.AI through the OpenAI-compatible protocol (/v1/chat/completions), the gateway automatically injects cache_control: { "type": "ephemeral" } on your system prompt when it exceeds 3000 characters. You get 90% savings on cached input tokens with zero code changes.
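The injection step described above can be pictured with a small sketch. This is not the gateway's actual code: the function name `to_anthropic_system` is hypothetical, the system content is assumed to be a plain string, and the threshold is treated as "at least 3000 characters", following the table at the top of this page.

```python
from typing import Any

# Threshold stated in this guide; treated here as ">= 3000" (an assumption).
CACHE_THRESHOLD_CHARS = 3000


def to_anthropic_system(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn OpenAI-style system messages into Anthropic-style system blocks.

    Illustrative only: it mimics the behavior this page describes and is
    not the gateway's real implementation.
    """
    system_text = "\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    if not system_text:
        return []
    block: dict[str, Any] = {"type": "text", "text": system_text}
    # Long system prompts get the ephemeral cache marker; short ones do not.
    if len(system_text) >= CACHE_THRESHOLD_CHARS:
        block["cache_control"] = {"type": "ephemeral"}
    return [block]
```

A short prompt comes back as a plain text block, while a prompt at or above the threshold gains the same cache_control field you would otherwise write by hand on the native protocol.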

When you use the Anthropic native protocol (/anthropic/v1/messages), the gateway does not modify your request body. You need to add cache_control yourself:

json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "Your long system prompt with instructions, context, documentation, etc. This needs to be at least 1024 tokens for Anthropic to cache it.",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Your question here" }
  ]
}

Cache entries persist for 5 minutes after the last access. Every cache hit resets the TTL.
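To see how the sliding TTL plays out, here is a minimal simulation. It assumes a single cached prefix and request times given in seconds; the function name `classify_requests` is made up for illustration, and a request arriving exactly at the expiry instant is treated as a miss.

```python
TTL_SECONDS = 5 * 60  # 5-minute lifetime described above


def classify_requests(timestamps: list[float], ttl: float = TTL_SECONDS) -> list[str]:
    """Label each request as a cache 'write' (miss) or a cache 'hit'."""
    labels = []
    expires_at = None  # no cache entry exists before the first request
    for t in sorted(timestamps):
        if expires_at is not None and t < expires_at:
            labels.append("hit")
        else:
            labels.append("write")
        # Both a fresh write and a hit restart the 5-minute clock.
        expires_at = t + ttl
    return labels


print(classify_requests([0, 120, 400, 800]))
# ['write', 'hit', 'hit', 'write']
```

The request at 400 s still hits because the hit at 120 s pushed the expiry to 420 s; the request at 800 s misses because nothing touched the cache after 400 s.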

DeepSeek Models

DeepSeek uses automatic disk-based caching. The input is segmented into 64-token blocks, and matching prefix blocks are served from cache at a 90% discount. No action is required on your side.
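As a rough illustration of the 64-token alignment, the sketch below estimates how many tokens of a new request could be served from cache given the previous request. The token lists are placeholder integers (real token IDs come from the provider's tokenizer), the helper names are hypothetical, and rounding the shared prefix down to whole blocks is an assumption based on the description above.

```python
BLOCK_TOKENS = 64  # block size stated above


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cacheable_tokens(prev: list[int], curr: list[int], block: int = BLOCK_TOKENS) -> int:
    """Shared prefix length, rounded down to a whole number of blocks."""
    shared = common_prefix_len(prev, curr)
    return (shared // block) * block


prev_request = list(range(200))
curr_request = list(range(150)) + [-1] * 50  # diverges after 150 tokens
print(cacheable_tokens(prev_request, curr_request))  # 128
```

Here 150 tokens match, but only two full blocks (128 tokens) fit inside the shared prefix, so the remaining 22 matching tokens would be billed at the normal rate under this model.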

Gemini 2.5+ Models

Google Gemini 2.5 and later models use implicit caching. When the API detects repeated prefixes, it automatically caches them. Savings appear in the response usage field.

Reading Cache Results

After each request, check the usage.prompt_tokens_details field in the response:

json
{
  "usage": {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "total_tokens": 15500,
    "prompt_tokens_details": {
      "cached_tokens": 14000,
      "cache_creation_tokens": 0
    }
  }
}
  • cached_tokens: Tokens served from cache (charged at the discounted rate).
  • cache_creation_tokens: Tokens written to cache for the first time (Anthropic charges a small write premium; other providers do not).
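The two fields can be turned into an estimated input cost for a single request. The sketch below is an approximation under stated assumptions: `prompt_tokens` is taken to include both cached and cache-creation tokens, the 90% discount comes from the table above, and the 1.25x write multiplier reflects Anthropic's published 5-minute cache write pricing rather than anything this gateway guarantees. The function name `input_cost_usd` is hypothetical.

```python
def input_cost_usd(
    usage: dict,
    price_per_mtok: float,
    cached_discount: float = 0.90,   # discount on cached tokens (from the table)
    write_multiplier: float = 1.25,  # assumed write premium; provider-specific
) -> float:
    """Estimate the input-side cost of one request from its usage object."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    created = details.get("cache_creation_tokens") or 0
    # Assumption: prompt_tokens is the total, so subtract the special buckets.
    regular = usage["prompt_tokens"] - cached - created
    billed_token_equivalents = (
        regular
        + cached * (1 - cached_discount)
        + created * write_multiplier
    )
    return billed_token_equivalents * price_per_mtok / 1_000_000


usage = {
    "prompt_tokens": 15000,
    "completion_tokens": 500,
    "prompt_tokens_details": {"cached_tokens": 14000, "cache_creation_tokens": 0},
}
print(round(input_cost_usd(usage, price_per_mtok=3.0), 4))  # 0.0072
```

For the usage object shown above at $3 per million input tokens, this comes to about $0.0072, compared with $0.045 if all 15,000 tokens were billed at full price.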

Cost Comparison: Claude Code User

A typical Claude Code user sends ~50 requests per day, each with a ~20,000-token system prompt.

| Metric | Without Caching | With Caching (90% savings) |
|---|---|---|
| Daily input tokens (system prompt only) | 1,000,000 | 1,000,000 |
| Daily input cost (Claude Sonnet, $3/M) | $3.00 | $0.30 |
| Monthly input cost (30 days) | $90.00 | $9.00 |
| Monthly savings | | $81.00 |

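These figures assume every request hits the cache, so they are a best case. In practice, the first request after the cache has expired (no hit for 5 minutes) pays full price and writes to cache. Each following request pays the cached rate as long as it arrives within 5 minutes of the previous one.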
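The arithmetic behind the table, including the effect of cache misses, can be reproduced with a few lines. This is a simplified model: misses are billed at the plain input price (any write premium is ignored), and the function name `daily_input_cost` and the example miss count of 5 are illustrative assumptions.

```python
def daily_input_cost(
    requests: int,
    prompt_tokens: int,
    price_per_mtok: float,
    misses: int = 0,
    cached_discount: float = 0.90,
) -> float:
    """Daily system-prompt cost when `misses` requests pay full price."""
    full_price = prompt_tokens * price_per_mtok / 1_000_000  # one uncached request
    hits = requests - misses
    return misses * full_price + hits * full_price * (1 - cached_discount)


no_cache = daily_input_cost(50, 20_000, 3.0, misses=50)   # every request uncached
best_case = daily_input_cost(50, 20_000, 3.0, misses=0)   # the table's assumption
five_misses = daily_input_cost(50, 20_000, 3.0, misses=5)

print(f"{no_cache:.2f} {best_case:.2f} {five_misses:.2f}")  # 3.00 0.30 0.57
```

With five cold starts per day the daily cost rises from $0.30 to $0.57, which is still well below the uncached $3.00.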

Code Examples

example.py
python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# The gateway auto-injects cache_control for Anthropic models
# when the system prompt exceeds 3000 characters.
# For OpenAI/DeepSeek models, caching is fully automatic.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here..." * 100,  # Must be >= 3000 chars for auto-injection
        },
        {"role": "user", "content": "Summarize the key points."},
    ],
    max_tokens=1024,
)

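# Check cache usage. prompt_tokens_details can be missing on some responses,
# so fall back to 0 instead of raising AttributeError.
usage = response.usage
details = usage.prompt_tokens_details
cached = (details.cached_tokens or 0) if details else 0
print(f"Total input tokens: {usage.prompt_tokens}")
print(f"Cached tokens: {cached}")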

Tips

  • Keep your system prompt stable. Caching works on exact prefix matches. If you change a single character in your system prompt, the cache is invalidated.
  • Front-load static content. Place instructions and reference documents at the beginning of the conversation. Dynamic content (user messages) goes at the end.
  • Reuse conversations. Multi-turn conversations naturally benefit from caching -- the entire conversation prefix up to the latest message can be cached.
  • Monitor via the Generation API. Use GET /v1/generation to see the exact cost breakdown including cached vs. non-cached tokens.
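The first three tips can be combined in a small message builder. The sketch below is one possible arrangement, not a required pattern: `STATIC_SYSTEM_PROMPT`, `build_messages`, and `shared_prefix_messages` are hypothetical names, and comparing whole messages is only a stand-in for the token-level prefix matching the providers perform.

```python
from typing import Any

Message = dict[str, Any]

# Static content: identical on every request, so it can form a cached prefix.
# Avoid embedding timestamps or request IDs here.
STATIC_SYSTEM_PROMPT = "You are a support assistant. Answer from the reference notes only."


def build_messages(history: list[Message], user_message: str) -> list[Message]:
    """Static system prompt first, prior turns next, the new user message last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": user_message},
    ]


def shared_prefix_messages(a: list[Message], b: list[Message]) -> int:
    """Count how many leading messages two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


turn1 = build_messages([], "How do I reset my password?")
turn2 = build_messages(
    turn1[1:] + [{"role": "assistant", "content": "Open Settings, then Security."}],
    "And if I lost my phone?",
)
print(shared_prefix_messages(turn1, turn2))  # 2
```

The second turn repeats the system prompt and the first user message unchanged, so that shared prefix is what a provider could serve from cache; only the new assistant and user messages are fresh input.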
