Production Best Practices

API Key Security

Never hardcode keys

Store API keys in environment variables, not in source code:

example.py
python
import os
from openai import OpenAI

# Do this
client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key=os.environ["CHUIZI_API_KEY"],
)

# Never this
client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-abc123...",  # Leaked if committed to git
)

Use separate keys per environment

Create distinct API keys for development, staging, and production. This lets you:

  • Revoke a compromised dev key without affecting production.
  • Set different rate limits per environment.
  • Track usage separately in the dashboard.
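
One way to wire this up is to pick the key from the deployment environment at startup. The sketch below is illustrative only: the variable names `CHUIZI_API_KEY_DEV`, `CHUIZI_API_KEY_STAGING`, and `CHUIZI_API_KEY_PROD`, and the helper `api_key_for`, are assumptions rather than names the gateway defines.

```python
import os

# Hypothetical variable names; use whatever your deployment defines.
KEY_VARS = {
    "development": "CHUIZI_API_KEY_DEV",
    "staging": "CHUIZI_API_KEY_STAGING",
    "production": "CHUIZI_API_KEY_PROD",
}


def api_key_for(env_name, environ=os.environ):
    """Return the API key for the named environment, failing loudly if absent."""
    try:
        var = KEY_VARS[env_name]
    except KeyError:
        raise ValueError(f"Unknown environment: {env_name!r}") from None
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key
```

Failing at startup when the variable is missing avoids silently falling back to a key from another environment.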

Restrict keys

In the dashboard, configure each key with:

  • Allowed models: Limit which models the key can access.
  • IP whitelist: Restrict to your server IPs.
  • Daily spending limit: Cap maximum daily cost.
  • RPM limit: Set per-key requests-per-minute.

Rotate keys regularly

Rotate production keys every 90 days. The process:

  1. Create a new key in the dashboard.
  2. Update your environment variables to the new key.
  3. Deploy the change.
  4. Verify the new key works in production.
  5. Deactivate the old key.
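
A scheduled job can flag keys that are due for rotation. This is a minimal sketch that assumes you record each key's creation date yourself; `rotation_due` is a hypothetical helper, not part of any SDK.

```python
from datetime import date, timedelta

ROTATION_PERIOD = timedelta(days=90)


def rotation_due(created_on, today=None, period=ROTATION_PERIOD):
    """Return True when a key created on `created_on` has reached the rotation age."""
    today = today or date.today()
    return today - created_on >= period
```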

Timeout Configuration

Large models can take 30-120+ seconds to respond, especially for long outputs or complex reasoning tasks (o3, Opus 4).

Scenario                              | Timeout
Simple chat (GPT-4.1-mini, Haiku)     | 30 seconds
Standard chat (GPT-4.1, Sonnet)       | 60 seconds
Complex reasoning (o3, GPT-5, Opus 4) | 120 seconds
Image/video generation                | 300 seconds
Streaming (any model)                 | 300 seconds (connection timeout)

example.py
python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
    timeout=120.0,  # 120 seconds
)
index.mjs
javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.chuizi.ai/v1',
  apiKey: 'ck-your-key-here',
  timeout: 120 * 1000, // 120 seconds in milliseconds
});
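
If one service calls several kinds of models, the timeout can be chosen per request instead of once per client. The sketch below encodes the timeout table as a lookup. The tier names and the model-to-tier mapping are assumptions for illustration; unknown models fall back to the 120-second tier.

```python
# Timeouts in seconds, taken from the table above.
TIMEOUTS = {
    "simple": 30.0,
    "standard": 60.0,
    "reasoning": 120.0,
    "streaming": 300.0,
}

# Hypothetical assignment of model IDs to tiers; extend for your own models.
MODEL_TIER = {
    "openai/gpt-4.1-mini": "simple",
    "openai/gpt-4.1": "standard",
}


def timeout_for(model, stream=False, default_tier="reasoning"):
    """Pick a timeout for one request based on the model and streaming mode."""
    if stream:
        return TIMEOUTS["streaming"]
    return TIMEOUTS[MODEL_TIER.get(model, default_tier)]
```

The returned value can be passed as the `timeout` argument of an individual `client.chat.completions.create(...)` call, which the openai Python SDK accepts as a per-request override of the client default.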

Concurrency Control

Rate limits

The default rate limit is 60 requests per minute (RPM) per API key. If you need higher throughput:

  1. Request a higher limit through the dashboard.
  2. Distribute requests across multiple API keys.
  3. Implement client-side request queuing.

Request queue pattern

example.py
python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# Semaphore limits concurrent requests
semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests


async def chat(messages):
    async with semaphore:
        return await client.chat.completions.create(
            model="openai/gpt-4.1-mini",
            messages=messages,
            max_tokens=1024,
        )


async def process_batch(items):
    tasks = [chat([{"role": "user", "content": item}]) for item in items]
    return await asyncio.gather(*tasks, return_exceptions=True)
index.mjs
javascript
import OpenAI from 'openai';
import pLimit from 'p-limit';

const client = new OpenAI({
  baseURL: 'https://api.chuizi.ai/v1',
  apiKey: 'ck-your-key-here',
});

const limit = pLimit(10); // Max 10 concurrent requests

async function processBatch(items) {
  const tasks = items.map((item) =>
    limit(() =>
      client.chat.completions.create({
        model: 'openai/gpt-4.1-mini',
        messages: [{ role: 'user', content: item }],
        max_tokens: 1024,
      })
    )
  );
  return Promise.allSettled(tasks);
}
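
A concurrency cap limits how many requests are in flight, but it does not by itself keep you under a requests-per-minute limit: ten fast requests at a time can still exceed 60 RPM. A rolling-window limiter is one way to cover that. The sketch below is an assumption-laden illustration (the class `RpmLimiter` is hypothetical), not the gateway's own algorithm.

```python
import asyncio
import time
from collections import deque


class RpmLimiter:
    """Allow at most `rpm` acquisitions in any rolling window of `window` seconds."""

    def __init__(self, rpm=60, window=60.0, clock=time.monotonic, sleep=asyncio.sleep):
        self.rpm = rpm
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of requests still inside the window
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            while True:
                now = self.clock()
                # Drop timestamps that have left the window.
                while self.sent and now - self.sent[0] >= self.window:
                    self.sent.popleft()
                if len(self.sent) < self.rpm:
                    self.sent.append(now)
                    return
                # Wait until the oldest request leaves the window.
                await self.sleep(self.window - (now - self.sent[0]))
```

Calling `await limiter.acquire()` immediately before each API call, inside the concurrency-limited section, combines both controls.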

Request Monitoring

Generation ID tracking

Every response includes a generation ID in the x_chuizi field and the x-chuizi-generation-id response header. Log this for every request:

example.py
python
response = client.chat.completions.create(
    model="openai/gpt-4.1",
    messages=[{"role": "user", "content": "Hello"}],
)

# Log the generation ID for debugging
gen_id = response.model_extra.get("x_chuizi", {}).get("generation_id")
logger.info(f"Request completed: gen_id={gen_id}, model={response.model}")
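
The same ID can also be read from the response header. The sketch below relies on the openai Python SDK's `with_raw_response` accessor, which exposes the HTTP headers alongside the parsed body. The helper name `create_with_generation_id` is hypothetical.

```python
def create_with_generation_id(client, **params):
    """Create a chat completion and return (parsed_response, generation_id)."""
    raw = client.chat.completions.with_raw_response.create(**params)
    # Header name as documented above; None if the header is absent.
    generation_id = raw.headers.get("x-chuizi-generation-id")
    return raw.parse(), generation_id
```

Reading the header can be useful when the body is consumed elsewhere, for example by a streaming handler.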

Querying request details

Use the Generation API to retrieve full cost and usage details:

terminal
bash
curl "https://api.chuizi.ai/v1/generation?id=gen-xxxxxxxxxxxxxxxx" \
  -H "Authorization: Bearer ck-your-key-here"

This returns input tokens, output tokens, cached tokens, cost, latency, and status code for the request.
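
The same lookup can be made from Python with the standard library. This is a sketch that assumes the endpoint returns a JSON body. It deliberately does not assume any field names, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.chuizi.ai/v1"


def build_generation_request(generation_id, api_key):
    """Build the GET request for the generation lookup shown above."""
    query = urllib.parse.urlencode({"id": generation_id})
    return urllib.request.Request(
        f"{BASE_URL}/generation?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_generation(generation_id, api_key, timeout=30.0):
    """Send the request and return the parsed JSON body (assumed to be JSON)."""
    request = build_generation_request(generation_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```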

Health checks

Monitor the gateway with periodic health checks:

terminal
bash
# Simple health check
curl -s -o /dev/null -w "%{http_code}" https://api.chuizi.ai/v1/models \
  -H "Authorization: Bearer ck-your-key-here"
# Prints 200 if the gateway is reachable and the key is accepted

Cost Guardrails

Daily spending limits

Set a daily spending cap on each API key in the dashboard. When the limit is reached, requests are rejected with HTTP 402 until the limit resets the next day.

Max tokens per request

Always set max_tokens to prevent runaway generation:

config.json
json
{
  "model": "openai/gpt-4.1",
  "messages": [{"role": "user", "content": "Write a summary."}],
  "max_tokens": 500
}

Budget alerts

Monitor your daily spend through the dashboard. Set up alerts for when spending exceeds thresholds (e.g., 80% of daily limit).
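
If you also want an in-process signal, you can accumulate the per-request cost that your application sees and compare it with the daily limit. The sketch below is an approximation under stated assumptions: `SpendTracker` is hypothetical, the cost values are whatever your code reads from each response, and the daily reset is left to the caller.

```python
class SpendTracker:
    """Accumulate request costs and report when an alert threshold is reached."""

    def __init__(self, daily_limit, alert_ratio=0.8):
        if daily_limit <= 0:
            raise ValueError("daily_limit must be positive")
        self.daily_limit = daily_limit
        self.alert_ratio = alert_ratio
        self.spent = 0.0

    def record(self, cost):
        """Add one request's cost; return True once spend reaches the threshold."""
        if cost:  # ignore missing (None) or zero costs
            self.spent += cost
        return self.spent >= self.alert_ratio * self.daily_limit
```

A tracker like this only sees traffic from one process, so the dashboard remains the authoritative view of spend.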

Model restrictions

Restrict API keys to specific models. A key restricted to gpt-4.1-mini cannot accidentally be used with the 10x more expensive claude-opus-4-6.

Logging Best Practices

Structure your logs to include the data you need for debugging:

example.py
python
import logging
import os
import time

from openai import OpenAI

logger = logging.getLogger(__name__)

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key=os.environ["CHUIZI_API_KEY"],
)


def chat_with_logging(messages, model="openai/gpt-4.1"):
    # perf_counter is monotonic, so latency is unaffected by system clock changes
    start = time.perf_counter()
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=1024,
        )
        latency = time.perf_counter() - start
        # x_chuizi is the gateway's extra field on the response body
        x_chuizi = (response.model_extra or {}).get("x_chuizi") or {}
        usage = response.usage  # guard against a missing usage block

        logger.info(
            "chat_completion",
            extra={
                "generation_id": x_chuizi.get("generation_id"),
                "model": response.model,
                "prompt_tokens": usage.prompt_tokens if usage else None,
                "completion_tokens": usage.completion_tokens if usage else None,
                "cost": x_chuizi.get("cost"),
                "latency_ms": int(latency * 1000),
            },
        )
        return response
    except Exception as e:
        latency = time.perf_counter() - start
        logger.error(
            "chat_completion_error",
            extra={
                "model": model,
                "error": str(e),
                "latency_ms": int(latency * 1000),
            },
        )
        raise

Checklist

Before going to production, verify:

  • API keys are stored in environment variables, not source code
  • Separate API keys for dev/staging/production
  • Client timeout matched to the slowest model you call (at least 120 seconds for reasoning models)
  • Retry logic with exponential backoff for 429/5xx errors
  • max_tokens set on all requests
  • Daily spending limit configured on production keys
  • Generation IDs logged for every request
  • Error responses handled and surfaced to users gracefully
  • Streaming enabled for user-facing chat interfaces
  • Health check monitoring configured
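
The retry item in the checklist can be sketched as follows. This is a minimal illustration, not the gateway's or the SDK's own retry policy: `call_with_backoff` and `is_retryable` are hypothetical helpers, and the code only assumes that the raised exception carries a `status_code` attribute, as the openai SDK's `APIStatusError` does. Errors such as 402 are raised immediately because retrying cannot fix them.

```python
import random
import time


def is_retryable(status_code):
    """429 (rate limited) and 5xx (server errors) are worth retrying."""
    return status_code == 429 or (status_code is not None and 500 <= status_code < 600)


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            # Give up on non-retryable errors or when attempts are exhausted.
            if not is_retryable(status) or attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The openai SDK also has a built-in `max_retries` client option, so check whether that already covers your case before layering a second retry loop on top.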

Next Steps