# Cost Optimization
## Strategy 1: Prompt Caching (50-90% Savings)
Prompt caching is the single biggest cost lever. If you send the same system prompt across many requests, caching avoids re-processing those tokens on every call.
| Provider | Savings | Trigger |
|---|---|---|
| OpenAI / Azure | 50% | Automatic, prefix >= 1024 tokens |
| Anthropic | 90% | Gateway auto-injects `cache_control` (system prompt >= 3000 chars) |
| DeepSeek | 90% | Automatic disk caching |
| Gemini 2.5+ | 90% | Implicit caching |
Impact example: A Claude Code user sending 50 requests/day with 20K-token system prompts saves ~$81/month on input tokens alone.
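The arithmetic behind that figure can be sketched as follows, assuming Anthropic's $3.00/M input price and a 90% discount on cache reads (the exact cache-read rate varies by provider; check current pricing):

```python
# Rough monthly savings from caching a 20K-token system prompt.
# Prices and discount below are assumptions; substitute your provider's rates.
REQUESTS_PER_DAY = 50
PROMPT_TOKENS = 20_000
DAYS = 30
INPUT_PRICE_PER_M = 3.00    # $ per million input tokens
CACHE_READ_DISCOUNT = 0.90  # 90% off cached prefix tokens

monthly_tokens = REQUESTS_PER_DAY * PROMPT_TOKENS * DAYS  # 30M tokens
full_cost = monthly_tokens / 1_000_000 * INPUT_PRICE_PER_M
cached_cost = full_cost * (1 - CACHE_READ_DISCOUNT)
savings = full_cost - cached_cost

print(f"Without caching: ${full_cost:.2f}/month")    # $90.00
print(f"With caching:    ${cached_cost:.2f}/month")  # $9.00
print(f"Savings:         ${savings:.2f}/month")      # $81.00
```

This ignores the per-request user message and output tokens, which are billed normally either way.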
See the Prompt Caching guide for details.
## Strategy 2: Optimal Routing
The same model is often available through multiple providers. Chuizi.AI automatically routes each request to the provider offering the best combination of cost and performance. You benefit from this without any configuration: just use the model name (e.g., `deepseek/deepseek-chat`) and the gateway handles provider selection.
For models available through multiple providers, this automatic routing can result in significant cost savings compared to using a single provider directly.
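In practice this means the request body looks the same regardless of which provider ends up serving it. A minimal sketch (the endpoint path is the usual OpenAI-compatible one; the base URL is a placeholder you'd take from your Chuizi.AI dashboard):

```python
import json

# Only the model name is specified -- the gateway selects the provider.
payload = {
    "model": "deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Send with any OpenAI-compatible client or plain HTTP, e.g.:
#   POST {base_url}/chat/completions  with this JSON body.
body = json.dumps(payload)
```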
## Strategy 3: Model Tiering
Not every task needs the most capable model. Match model capability to task difficulty:
| Task Type | Recommended Model | Input Price | Output Price |
|---|---|---|---|
| Classification, tagging, simple extraction | openai/gpt-4.1-nano | $0.10/M | $0.40/M |
| Summarization, translation, Q&A | openai/gpt-4.1-mini | $0.40/M | $1.60/M |
| Code generation, analysis | anthropic/claude-sonnet-4-6 | $3.00/M | $15.00/M |
| Complex reasoning, research | openai/gpt-5 | $2.63/M | $15.00/M |
| Hardest problems, agentic workflows | anthropic/claude-opus-4-6 | $15.00/M | $75.00/M |
Cost difference for processing 1M input + 100K output tokens:
- With GPT-4.1-nano: $0.14
- With Claude Opus 4-6: $22.50
- Ratio: 160x
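Those numbers follow directly from the table prices:

```python
def cost(input_tokens: int, output_tokens: int,
         in_price: float, out_price: float) -> float:
    """Dollar cost given token counts and per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

nano = cost(1_000_000, 100_000, 0.10, 0.40)    # $0.10 + $0.04 = $0.14
opus = cost(1_000_000, 100_000, 15.00, 75.00)  # $15.00 + $7.50 = $22.50
print(f"nano: ${nano:.2f}, opus: ${opus:.2f}, ratio: {int(opus / nano)}x")
```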
Use a routing strategy in your application: classify the task first with a cheap model, then dispatch to the appropriate tier.
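A minimal sketch of that classify-then-dispatch pattern. The tier keys, keyword heuristic, and `classify_task` function are illustrative; in a real application `classify_task` would ask a cheap model (e.g., `openai/gpt-4.1-nano`) to label the prompt:

```python
# Map task categories to model tiers (from the table above).
TIERS = {
    "simple": "openai/gpt-4.1-nano",
    "standard": "openai/gpt-4.1-mini",
    "code": "anthropic/claude-sonnet-4-6",
    "reasoning": "openai/gpt-5",
    "hardest": "anthropic/claude-opus-4-6",
}

def classify_task(prompt: str) -> str:
    """Placeholder classifier. In practice, send the prompt to a cheap
    model and ask it to return one of the TIERS keys."""
    text = prompt.lower()
    if "prove" in text or "design" in text:
        return "reasoning"
    if "def " in prompt or "function" in text:
        return "code"
    return "simple"

def pick_model(prompt: str) -> str:
    return TIERS[classify_task(prompt)]

print(pick_model("Tag this support ticket"))        # openai/gpt-4.1-nano
print(pick_model("Write a function to parse CSV"))  # anthropic/claude-sonnet-4-6
```

The classification call itself costs a fraction of a cent, so it pays for itself whenever it downgrades even one request from a premium tier.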
## Strategy 4: Reduce Token Usage
### Set `max_tokens` explicitly

Without `max_tokens`, some models generate until they hit their context limit. Always set a reasonable cap:
```json
{
  "model": "openai/gpt-4.1",
  "messages": [{"role": "user", "content": "Translate to French: Hello"}],
  "max_tokens": 100
}
```
### Trim conversation history
Long conversation histories inflate input tokens. Keep only the messages the model needs:
```python
# Instead of sending the entire history:
messages = conversation_history[-10:]  # Keep last 10 turns

# Or summarize older messages:
messages = [
    {"role": "system", "content": "Previous context summary: ..."},
    *recent_messages,
]
```
### Use `detail: "low"` for vision
When analyzing images, use low detail mode unless you need OCR or fine-grained analysis:
```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg",
    "detail": "low"
  }
}
```
This reduces the per-image token cost from over 1,000 tokens to ~85 tokens.
## Next Steps
- Prompt Caching — detailed caching guide for all providers
- Billing Model — understand how costs are calculated
- Balance Alerts — get notified before running out of credits