Cost Optimization

Strategy 1: Prompt Caching (50-90% Savings)

The single biggest cost lever. If you send the same system prompt across many requests, prompt caching avoids re-processing those tokens.

| Provider | Savings | Trigger |
|----------|---------|---------|
| OpenAI / Azure | 50% | Automatic, prefix >= 1024 tokens |
| Anthropic | 90% | Gateway auto-injects `cache_control` (system prompt >= 3000 chars) |
| DeepSeek | 90% | Automatic disk caching |
| Gemini 2.5+ | 90% | Implicit caching |

Impact example: A Claude Code user sending 50 requests/day with 20K-token system prompts saves ~$81/month on input tokens alone.
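The ~$81 figure can be reproduced with quick arithmetic. This is a sketch: the workload numbers come from the example above, the $3.00/M input price is Claude Sonnet's from the tiering table below, and the 90% discount is Anthropic's cache rate.

```python
# Back-of-the-envelope prompt-caching savings for the example above.
requests_per_day = 50
days_per_month = 30
system_prompt_tokens = 20_000       # cached on every request after the first
input_price_per_mtok = 3.00         # $/M input tokens (claude-sonnet-4-6)
cache_discount = 0.90               # Anthropic: 90% off cached input tokens

cached_tokens = requests_per_day * days_per_month * system_prompt_tokens
savings = cached_tokens / 1e6 * input_price_per_mtok * cache_discount
print(f"~${savings:.0f}/month")  # → ~$81/month
```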

See the Prompt Caching guide for details.

Strategy 2: Optimal Routing

The same model is often available through multiple providers. Chuizi.AI automatically routes each request to the provider with the best cost and performance. You benefit from this without any configuration -- just use the model name (e.g., deepseek/deepseek-chat) and the gateway handles provider selection.

For models available through multiple providers, this automatic routing can result in significant cost savings compared to using a single provider directly.
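As a sketch, a routed request looks like any OpenAI-compatible chat completion; only the namespaced model name matters. The endpoint path mentioned in the comment is an assumption, not a documented value.

```python
import json

# The gateway resolves "deepseek/deepseek-chat" to whichever provider is
# currently optimal; the request body needs no provider-specific fields.
payload = {
    "model": "deepseek/deepseek-chat",
    "messages": [{"role": "user", "content": "Hello"}],
}
body = json.dumps(payload)
# POST `body` to the gateway's /v1/chat/completions endpoint (assumed path)
# with your gateway API key; provider selection happens server-side.
```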

Strategy 3: Model Tiering

Not every task needs the most capable model. Match model capability to task difficulty:

| Task Type | Recommended Model | Input Price | Output Price |
|-----------|-------------------|-------------|--------------|
| Classification, tagging, simple extraction | openai/gpt-4.1-nano | $0.10/M | $0.40/M |
| Summarization, translation, Q&A | openai/gpt-4.1-mini | $0.40/M | $1.60/M |
| Code generation, analysis | anthropic/claude-sonnet-4-6 | $3.00/M | $15.00/M |
| Complex reasoning, research | openai/gpt-5 | $2.63/M | $15.00/M |
| Hardest problems, agentic workflows | anthropic/claude-opus-4-6 | $15.00/M | $75.00/M |

Cost difference: Processing 1M input + 100K output tokens:

  • With GPT-4.1-nano: $0.14
  • With Claude Opus 4-6: $22.50
  • Ratio: 160x
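The comparison above can be checked with a one-line cost function (prices from the tiering table; a sketch, not a billing calculator):

```python
def cost_usd(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Simple estimate: tokens / 1M * ($/M price), input plus output."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1e6

nano = cost_usd(1_000_000, 100_000, 0.10, 0.40)    # gpt-4.1-nano
opus = cost_usd(1_000_000, 100_000, 15.00, 75.00)  # claude-opus-4-6
print(f"${nano:.2f} vs ${opus:.2f}")  # ratio is ~160x
```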

Use a routing strategy in your application: classify the task first with a cheap model, then dispatch to the appropriate tier.
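The classify-then-dispatch pattern can be sketched as below. The tier map and `classify_task()` stub are illustrative assumptions; in practice the classification step would itself be a call to a cheap model through the gateway.

```python
# Two-step routing sketch: classify the task, then dispatch to a tier.
TIERS = {
    "simple": "openai/gpt-4.1-nano",        # classification, extraction
    "standard": "openai/gpt-4.1-mini",      # summarization, Q&A
    "code": "anthropic/claude-sonnet-4-6",  # code generation, analysis
    "reasoning": "openai/gpt-5",            # complex reasoning, research
}

def classify_task(prompt: str) -> str:
    # Placeholder: in production, ask a cheap model (e.g. gpt-4.1-nano)
    # to label the task; here a trivial keyword heuristic stands in.
    if "def " in prompt or "code" in prompt.lower():
        return "code"
    return "standard"

def pick_model(prompt: str) -> str:
    return TIERS[classify_task(prompt)]

print(pick_model("Write code to parse CSV"))  # → anthropic/claude-sonnet-4-6
```

The cheap classification call costs a fraction of a cent, so it pays for itself whenever it keeps even one request off the top tier.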

Strategy 4: Reduce Token Usage

Set max_tokens explicitly

Without max_tokens, some models generate until their context limit. Always set a reasonable cap:

config.json

```json
{
  "model": "openai/gpt-4.1",
  "messages": [{"role": "user", "content": "Translate to French: Hello"}],
  "max_tokens": 100
}
```

Trim conversation history

Long conversation histories inflate input tokens. Keep only the messages the model needs:

example.py

```python
# Instead of sending the entire history, keep the most recent messages:
messages = conversation_history[-10:]  # last 10 messages

# Or summarize older messages and keep only the recent ones:
messages = [
    {"role": "system", "content": "Previous context summary: ..."},
    *recent_messages,
]
```

Use detail: "low" for vision

When analyzing images, use low detail mode unless you need OCR or fine-grained analysis:

config.json

```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg",
    "detail": "low"
  }
}
```

This reduces the image's token cost from ~1000+ tokens (high detail) to a flat ~85 tokens.

Next Steps