Provider-Specific Behavior

While Chuizi.AI normalizes API responses to a consistent format, upstream providers have behavioral differences that can affect your application. This page documents the key differences you should be aware of.

Azure OpenAI

Content Filtering

Azure OpenAI applies a mandatory content filter that is stricter than the standard OpenAI API. Requests that succeed on OpenAI may be rejected on Azure.

Behavior	Details
Filter categories	Hate, self-harm, sexual content, violence
Default severity	Medium threshold for all categories
Response on block	`400 bad_request` with filter category in the error message
Impact	Some creative writing, medical, or educational prompts may be blocked

When a request is blocked by the content filter, the error message includes which category triggered the block. You cannot disable the filter through the gateway.

config.json

json

{
  "error": {
    "message": "Content blocked by Azure content filter: violence (severity: medium)",
    "type": "invalid_request_error",
    "code": "bad_request"
  }
}

Responses API

Azure supports the OpenAI Responses API for select models (codex and pro variants). The gateway translates Responses API requests to the appropriate Azure endpoints. Note that tool_choice and some parameters may behave differently compared to direct OpenAI access.

Amazon Bedrock (Anthropic Models)

Max Tokens

Some Claude models have lower max_tokens limits when served through Bedrock compared to the Anthropic direct API.

Model	Anthropic Direct	Bedrock
Claude Opus 4.6	32,768	4,096 (default)
Claude Sonnet 4.6	8,192	4,096 (default)

If you need higher output limits on Bedrock-routed models, explicitly set max_tokens in your request. The gateway passes this through but cannot exceed Bedrock's hard limits.

Streaming Format

Bedrock uses a different streaming wire format (AWS event stream) compared to Anthropic's SSE. The gateway handles this conversion transparently. You always receive standard SSE regardless of the upstream provider.

Google Gemini

Safety Settings

Gemini applies safety filters by default that are more conservative than most other providers. Content may be blocked across multiple categories.

Category	Default Threshold
Harassment	`BLOCK_MEDIUM_AND_ABOVE`
Hate speech	`BLOCK_MEDIUM_AND_ABOVE`
Sexually explicit	`BLOCK_MEDIUM_AND_ABOVE`
Dangerous content	`BLOCK_MEDIUM_AND_ABOVE`

When content is blocked, the response includes a finish_reason of safety instead of stop. The gateway maps this to the standard response format but preserves the SAFETY finish reason.

Grounding and Citations

Gemini models may include grounding metadata and web citations in responses. The gateway preserves this information in the response when present but does not add it for non-Gemini models.

Function Calling

Gemini's function calling uses a different schema format internally. The gateway translates between OpenAI's tools format and Gemini's native format. Minor behavioral differences may occur.

Difference	Details
`tool_choice: "required"`	Supported but may produce different selection behavior.
Parallel tool calls	Gemini may return multiple tool calls in one response more aggressively.
Tool result format	Handled by gateway. Send results in OpenAI format.

DeepSeek

Reasoning Tokens

DeepSeek models that support chain-of-thought reasoning return reasoning_tokens in the usage object. These tokens represent the model's internal reasoning process.

config.json

json

{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 200,
    "total_tokens": 350,
    "reasoning_tokens": 80
  }
}

Reasoning tokens are billed as output tokens. The completion_tokens count includes reasoning tokens.

FIM (Fill-in-the-Middle)

DeepSeek Coder models support FIM completion via the standard completions endpoint. Use the suffix parameter. This is not available through the chat completions endpoint.

Alibaba Qwen (via DashScope)

Function Calling

Qwen supports function calling but with subtle behavioral differences.

Difference	Details
`tool_choice`	Supports `"auto"` and `"none"`. Named tool choice (`{"type": "function", "function": {"name": "xxx"}}`) may not always force the specified tool.
Tool descriptions	Qwen is more sensitive to tool description quality. Vague descriptions may produce unexpected tool selections.
Streaming tool calls	Tool call arguments may arrive in differently sized chunks.

Long Context

Qwen-Long models support up to 10M tokens of context. Requests with very long context may have higher latency. The gateway does not impose additional context limits beyond what the upstream model supports.

Volcengine Doubao

Response Format

Doubao's structured output support (response_format: { type: "json_object" }) is available but less reliable than OpenAI's implementation for complex schemas. Test thoroughly with your specific use case.

Streaming Differences

Doubao may send larger chunks during streaming compared to OpenAI or Anthropic. This does not affect the content but may cause more "bursty" streaming behavior in your UI.

General Differences

Streaming Chunk Size

Different providers send SSE chunks at different granularities.

Provider	Typical Chunk Size	Notes
OpenAI	1-3 tokens	Very granular, smooth streaming
Anthropic	1-5 tokens	Slightly larger chunks
Google	5-20 tokens	Larger, less frequent chunks
DeepSeek	1-3 tokens	Similar to OpenAI
Chinese providers	Variable	Tends toward larger chunks

The gateway passes through chunks as-is without rebuffering. Streaming smoothness depends on the upstream provider.

Model Version Pinning

Some providers use date-based model versions. The gateway maps aliases to the latest stable version.

Alias	Resolves To	Provider
`gpt-4o`	Latest stable GPT-4o snapshot	OpenAI
`claude-sonnet-4-6`	`claude-sonnet-4-6`	Anthropic
`gemini-2.5-flash`	Latest stable 2.5 Flash	Google

To pin a specific version, use the full versioned model name (e.g., openai/gpt-4o).

Token Counting Variations

Different providers count tokens differently for the same input text. The gateway reports whatever the upstream provider returns in the usage field.

Factor	Impact
Tokenizer differences	The same text may be 100 tokens on OpenAI and 110 on Anthropic.
System prompt handling	Some providers count system prompts differently.
Image tokens	Token costs for vision inputs vary significantly between providers.

Billing is always based on the upstream provider's reported token counts, not an independent count by the gateway. See Billing Model for how token counts translate to costs.

Next Steps

Status Code Mapping — How upstream errors are normalized to gateway error codes
Choose a Model — Pick the right model considering provider-specific trade-offs
Streaming Guide — Handle varying chunk sizes across providers