Production Best Practices
API Key Security
Never hardcode keys
Store API keys in environment variables, not in source code:
    import os
    from openai import OpenAI

    # Do this
    client = OpenAI(
        base_url="https://api.chuizi.ai/v1",
        api_key=os.environ["CHUIZI_API_KEY"],
    )

    # Never this
    client = OpenAI(
        base_url="https://api.chuizi.ai/v1",
        api_key="ck-abc123...",  # Leaked if committed to git
    )
Use separate keys per environment
Create distinct API keys for development, staging, and production. This lets you:
- Revoke a compromised dev key without affecting production.
- Set different rate limits per environment.
- Track usage separately in the dashboard.
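One way to wire up per-environment keys is to select the key by environment name at startup. This is a minimal sketch; the variable names (`CHUIZI_API_KEY_DEV` and so on) and the helper `api_key_for` are hypothetical and not defined by the gateway.

```python
import os

# Hypothetical variable names; use whatever your deployment tooling defines.
KEY_VARS = {
    "development": "CHUIZI_API_KEY_DEV",
    "staging": "CHUIZI_API_KEY_STAGING",
    "production": "CHUIZI_API_KEY_PROD",
}


def api_key_for(env_name, environ=None):
    """Return the API key for the given environment, failing loudly if unset."""
    environ = os.environ if environ is None else environ
    try:
        var = KEY_VARS[env_name]
    except KeyError:
        raise ValueError(f"unknown environment: {env_name!r}") from None
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key
```

Failing at startup when a key is missing tends to be easier to diagnose than an authentication error on the first request.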
Restrict keys
In the dashboard, configure each key with:
- Allowed models: Limit which models the key can access.
- IP whitelist: Restrict to your server IPs.
- Daily spending limit: Cap maximum daily cost.
- RPM limit: Set per-key requests-per-minute.
Rotate keys regularly
Rotate production keys every 90 days. The process:
- Create a new key in the dashboard.
- Update your environment variables to the new key.
- Deploy the change.
- Verify the new key works in production.
- Deactivate the old key.
Timeout Configuration
Large models can take 30-120+ seconds to respond, especially for long outputs or complex reasoning tasks (o3, Opus 4).
Recommended timeouts
| Scenario | Timeout |
|---|---|
| Simple chat (GPT-4.1-mini, Haiku) | 30 seconds |
| Standard chat (GPT-4.1, Sonnet) | 60 seconds |
| Complex reasoning (o3, GPT-5, Opus 4) | 120 seconds |
| Image/video generation | 300 seconds |
| Streaming (any model) | 300 seconds (connection timeout) |
Python:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.chuizi.ai/v1",
        api_key="ck-your-key-here",
        timeout=120.0,  # 120 seconds
    )

JavaScript:

    import OpenAI from 'openai';

    const client = new OpenAI({
      baseURL: 'https://api.chuizi.ai/v1',
      apiKey: 'ck-your-key-here',
      timeout: 120 * 1000, // 120 seconds in milliseconds
    });
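If one service calls several kinds of models, a single client-wide timeout is either too short for reasoning models or too long for simple chat. The sketch below encodes the recommended timeouts table as a lookup; the scenario names and the `timeout_for` helper are illustrative assumptions, not part of the gateway or the SDK.

```python
# Timeouts in seconds, copied from the recommended timeouts table.
TIMEOUTS = {
    "simple": 30.0,      # GPT-4.1-mini, Haiku
    "standard": 60.0,    # GPT-4.1, Sonnet
    "reasoning": 120.0,  # o3, GPT-5, Opus 4
    "media": 300.0,      # image/video generation
    "streaming": 300.0,  # any model, connection timeout
}


def timeout_for(scenario, stream=False):
    """Pick a timeout for a request; unknown scenarios get the reasoning value."""
    if stream:
        return TIMEOUTS["streaming"]
    return TIMEOUTS.get(scenario, TIMEOUTS["reasoning"])
```

The OpenAI Python SDK accepts a per-request `timeout` argument, so the result can be passed as `client.chat.completions.create(..., timeout=timeout_for("simple"))`.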
Concurrency Control
Rate limits
The default rate limit is 60 requests per minute (RPM) per API key. If you need higher throughput:
- Request a higher limit through the dashboard.
- Distribute requests across multiple API keys.
- Implement client-side request queuing.
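Client-side queuing can be as simple as refusing to send more than the RPM limit in any rolling minute. The following is a minimal single-threaded sketch of a sliding-window limiter; the class name and its parameters are assumptions for illustration, and a multi-threaded or multi-process deployment would need locking or a shared store.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window` seconds."""

    def __init__(self, max_requests=60, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self._sent = deque()  # timestamps of recent requests, oldest first

    def wait_time(self):
        """Seconds until the next request may be sent (0.0 if a slot is free)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window - (now - self._sent[0])

    def acquire(self, sleep=time.sleep):
        """Block until a slot is free, then record the request."""
        while (delay := self.wait_time()) > 0:
            sleep(delay)
        self._sent.append(self.clock())
```

Call `limiter.acquire()` immediately before each API request. The defaults match the 60 RPM default limit; lower `max_requests` slightly if other processes share the same key.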
Request queue pattern
Python:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        base_url="https://api.chuizi.ai/v1",
        api_key="ck-your-key-here",
    )

    # Semaphore limits concurrent requests
    semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests

    async def chat(messages):
        async with semaphore:
            return await client.chat.completions.create(
                model="openai/gpt-4.1-mini",
                messages=messages,
                max_tokens=1024,
            )

    async def process_batch(items):
        tasks = [chat([{"role": "user", "content": item}]) for item in items]
        return await asyncio.gather(*tasks, return_exceptions=True)

JavaScript:

    import OpenAI from 'openai';
    import pLimit from 'p-limit';

    const client = new OpenAI({
      baseURL: 'https://api.chuizi.ai/v1',
      apiKey: 'ck-your-key-here',
    });

    const limit = pLimit(10); // Max 10 concurrent requests

    async function processBatch(items) {
      const tasks = items.map((item) =>
        limit(() =>
          client.chat.completions.create({
            model: 'openai/gpt-4.1-mini',
            messages: [{ role: 'user', content: item }],
            max_tokens: 1024,
          })
        )
      );
      return Promise.allSettled(tasks);
    }
Request Monitoring
Generation ID tracking
Every response includes a generation ID in the x_chuizi field and the x-chuizi-generation-id response header. Log this for every request:
    import logging

    logger = logging.getLogger(__name__)

    response = client.chat.completions.create(
        model="openai/gpt-4.1",
        messages=[{"role": "user", "content": "Hello"}],
    )

    # Log the generation ID for debugging
    x_chuizi = (response.model_extra or {}).get("x_chuizi", {})
    gen_id = x_chuizi.get("generation_id")
    logger.info(f"Request completed: gen_id={gen_id}, model={response.model}")
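Because the generation ID is available in both the response body and a response header, a small helper can prefer the body field and fall back to the header. This is a sketch under the assumption that the body field is a mapping with a `generation_id` entry, as in the logging example; the helper name `generation_id` is hypothetical.

```python
def generation_id(body_extra, headers):
    """Prefer the x_chuizi body field; fall back to the response header."""
    x_chuizi = (body_extra or {}).get("x_chuizi") or {}
    gen_id = x_chuizi.get("generation_id")
    if gen_id:
        return gen_id
    # HTTP header names are case-insensitive; normalise before lookup.
    lowered = {k.lower(): v for k, v in (headers or {}).items()}
    return lowered.get("x-chuizi-generation-id")
```

With the OpenAI Python SDK, headers are reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and a `.parse()` method that returns the usual response object.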
Querying request details
Use the Generation API to retrieve full cost and usage details:
    # Quote the URL so the shell does not interpret the "?"
    curl "https://api.chuizi.ai/v1/generation?id=gen-xxxxxxxxxxxxxxxx" \
      -H "Authorization: Bearer ck-your-key-here"
This returns input tokens, output tokens, cached tokens, cost, latency, and status code for the request.
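The same lookup can be done from Python with only the standard library. This sketch mirrors the curl command; it assumes the endpoint returns a JSON body, and it does not assume any particular field names in that body.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.chuizi.ai/v1"


def build_generation_request(gen_id, api_key):
    """Build the GET request for one generation record."""
    query = urllib.parse.urlencode({"id": gen_id})
    return urllib.request.Request(
        f"{BASE_URL}/generation?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_generation(gen_id, api_key, timeout=10.0):
    """Return the decoded JSON body (assumed JSON; fields are not specified here)."""
    req = build_generation_request(gen_id, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```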
Health checks
Monitor the gateway with periodic health checks:
    # Simple health check
    curl -s -o /dev/null -w "%{http_code}" https://api.chuizi.ai/v1/models \
      -H "Authorization: Bearer ck-your-key-here"
    # Prints 200 if healthy
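For a scheduled job or a readiness probe, the same check can be written in Python. This is a minimal sketch using the standard library; the function name `gateway_healthy` and the injectable `opener` parameter are assumptions added to make it easy to test.

```python
import urllib.request

MODELS_URL = "https://api.chuizi.ai/v1/models"


def gateway_healthy(api_key, timeout=10.0, opener=urllib.request.urlopen):
    """Return True only if the models endpoint answers with HTTP 200."""
    req = urllib.request.Request(
        MODELS_URL,
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx), timeouts, and socket errors.
        return False
```

Keep the probe timeout short so that a hung gateway is reported as unhealthy instead of stalling the monitor.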
Cost Guardrails
Daily spending limits
Set a daily spending cap on each API key in the dashboard. When the limit is reached, requests return 402 until the next day.
Max tokens per request
Always set max_tokens to prevent runaway generation:
    {
      "model": "openai/gpt-4.1",
      "messages": [{"role": "user", "content": "Write a summary."}],
      "max_tokens": 500
    }
Budget alerts
Monitor your daily spend through the dashboard. Set up alerts for when spending exceeds thresholds (e.g., 80% of daily limit).
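If you also want an in-process alert, you can accumulate the per-request cost you already log and flag the first time it crosses a threshold. This is a rough sketch: `SpendTracker` is a hypothetical helper, it assumes the cost value is in the same currency unit as the daily limit, and it leaves the daily reset and the alert delivery to the caller.

```python
class SpendTracker:
    """Track spend in one process and flag when a share of the limit is used."""

    def __init__(self, daily_limit, alert_ratio=0.8):
        self.daily_limit = daily_limit
        self.alert_ratio = alert_ratio
        self.spent = 0.0
        self.alerted = False

    def record(self, cost):
        """Add one request's cost; return True the first time the threshold is crossed."""
        self.spent += cost or 0.0  # treat a missing cost as zero
        crossed = self.spent >= self.daily_limit * self.alert_ratio
        if crossed and not self.alerted:
            self.alerted = True
            return True
        return False
```

A tracker like this only sees traffic from one process, so treat it as an early warning; the dashboard remains the source of truth for actual spend.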
Model restrictions
Restrict API keys to specific models. A key that only has access to gpt-4.1-mini cannot accidentally be used with the 10x more expensive claude-opus-4-6.
Logging Best Practices
Structure your logs to include the data you need for debugging:
    import logging
    import time

    logger = logging.getLogger(__name__)

    # Assumes `client` is the OpenAI client configured earlier.
    def chat_with_logging(messages, model="openai/gpt-4.1"):
        start = time.time()
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1024,
            )
            latency = time.time() - start
            x_chuizi = response.model_extra.get("x_chuizi", {})
            logger.info(
                "chat_completion",
                extra={
                    "generation_id": x_chuizi.get("generation_id"),
                    "model": response.model,
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "cost": x_chuizi.get("cost"),
                    "latency_ms": int(latency * 1000),
                },
            )
            return response
        except Exception as e:
            latency = time.time() - start
            logger.error(
                "chat_completion_error",
                extra={
                    "model": model,
                    "error": str(e),
                    "latency_ms": int(latency * 1000),
                },
            )
            raise
Checklist
Before going to production, verify:
- API keys are stored in environment variables, not source code
- Separate API keys for dev/staging/production
- Client timeout set to >= 120 seconds
- Retry logic with exponential backoff for 429/5xx errors
- max_tokens set on all requests
- Daily spending limit configured on production keys
- Generation IDs logged for every request
- Error responses handled and surfaced to users gracefully
- Streaming enabled for user-facing chat interfaces
- Health check monitoring configured
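The retry item in the checklist can be sketched as follows. This is a minimal illustration of exponential backoff with full jitter, not the gateway's prescribed policy: the helper names, the base delay, and the cap are assumptions. It relies on the raised exception carrying a `status_code` attribute, which the OpenAI Python SDK's status errors do. A 402 from an exhausted daily limit is deliberately not retried.

```python
import random
import time


def is_retryable(status_code):
    """Retry rate limits and server errors; never retry other 4xx (such as 402)."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(fn, max_attempts=5, sleep=time.sleep):
    """Call fn(); retry when it raises an error with a retryable status_code."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            last_attempt = attempt == max_attempts - 1
            if status is None or not is_retryable(status) or last_attempt:
                raise
            sleep(backoff_delay(attempt))
```

The OpenAI SDK clients also accept a `max_retries` option with built-in backoff; if you wrap calls in your own retry loop, consider setting it to 0 so the two layers do not multiply.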
Next Steps
- Error Handling — retry strategies and error format reference
- API Key Best Practices — detailed key security guidance
- Cost Optimization — reduce costs with caching and model tiering