Provider-Specific Behavior
While Chuizi.AI normalizes API responses to a consistent format, upstream providers have behavioral differences that can affect your application. This page documents the key differences you should be aware of.
Azure OpenAI
Content Filtering
Azure OpenAI applies a mandatory content filter that is stricter than the standard OpenAI API. Requests that succeed on OpenAI may be rejected on Azure.
| Behavior | Details |
|---|---|
| Filter categories | Hate, self-harm, sexual content, violence |
| Default severity | Medium threshold for all categories |
| Response on block | 400 bad_request with filter category in the error message |
| Impact | Some creative writing, medical, or educational prompts may be blocked |
When a request is blocked by the content filter, the error message includes which category triggered the block. You cannot disable the filter through the gateway.
{ "error": { "message": "Content blocked by Azure content filter: violence (severity: medium)", "type": "invalid_request_error", "code": "bad_request" } }
Responses API
Azure supports the OpenAI Responses API for select models (codex and pro variants). The gateway translates Responses API requests to the appropriate Azure endpoints. Note that tool_choice and some parameters may behave differently compared to direct OpenAI access.
Amazon Bedrock (Anthropic Models)
Max Tokens
Some Claude models have lower max_tokens limits when served through Bedrock compared to the Anthropic direct API.
| Model | Anthropic Direct | Bedrock |
|---|---|---|
| Claude Opus 4.6 | 32,768 | 4,096 (default) |
| Claude Sonnet 4.6 | 8,192 | 4,096 (default) |
If you need higher output limits on Bedrock-routed models, explicitly set max_tokens in your request. The gateway passes this through but cannot exceed Bedrock's hard limits.
Streaming Format
Bedrock uses a different streaming wire format (AWS event stream) compared to Anthropic's SSE. The gateway handles this conversion transparently. You always receive standard SSE regardless of the upstream provider.
Google Gemini
Safety Settings
Gemini applies safety filters by default that are more conservative than most other providers. Content may be blocked across multiple categories.
| Category | Default Threshold |
|---|---|
| Harassment | BLOCK_MEDIUM_AND_ABOVE |
| Hate speech | BLOCK_MEDIUM_AND_ABOVE |
| Sexually explicit | BLOCK_MEDIUM_AND_ABOVE |
| Dangerous content | BLOCK_MEDIUM_AND_ABOVE |
When content is blocked, the response includes a finish_reason of safety instead of stop. The gateway maps this to the standard response format but preserves the SAFETY finish reason.
Grounding and Citations
Gemini models may include grounding metadata and web citations in responses. The gateway preserves this information in the response when present but does not add it for non-Gemini models.
Function Calling
Gemini's function calling uses a different schema format internally. The gateway translates between OpenAI's tools format and Gemini's native format. Minor behavioral differences may occur.
| Difference | Details |
|---|---|
tool_choice: "required" | Supported but may produce different selection behavior. |
| Parallel tool calls | Gemini may return multiple tool calls in one response more aggressively. |
| Tool result format | Handled by gateway. Send results in OpenAI format. |
DeepSeek
Reasoning Tokens
DeepSeek models that support chain-of-thought reasoning return reasoning_tokens in the usage object. These tokens represent the model's internal reasoning process.
{ "usage": { "prompt_tokens": 150, "completion_tokens": 200, "total_tokens": 350, "reasoning_tokens": 80 } }
Reasoning tokens are billed as output tokens. The completion_tokens count includes reasoning tokens.
FIM (Fill-in-the-Middle)
DeepSeek Coder models support FIM completion via the standard completions endpoint. Use the suffix parameter. This is not available through the chat completions endpoint.
Alibaba Qwen (via DashScope)
Function Calling
Qwen supports function calling but with subtle behavioral differences.
| Difference | Details |
|---|---|
tool_choice | Supports "auto" and "none". Named tool choice ({"type": "function", "function": {"name": "xxx"}}) may not always force the specified tool. |
| Tool descriptions | Qwen is more sensitive to tool description quality. Vague descriptions may produce unexpected tool selections. |
| Streaming tool calls | Tool call arguments may arrive in differently sized chunks. |
Long Context
Qwen-Long models support up to 10M tokens of context. Requests with very long context may have higher latency. The gateway does not impose additional context limits beyond what the upstream model supports.
Volcengine Doubao
Response Format
Doubao's structured output support (response_format: { type: "json_object" }) is available but less reliable than OpenAI's implementation for complex schemas. Test thoroughly with your specific use case.
Streaming Differences
Doubao may send larger chunks during streaming compared to OpenAI or Anthropic. This does not affect the content but may cause more "bursty" streaming behavior in your UI.
General Differences
Streaming Chunk Size
Different providers send SSE chunks at different granularities.
| Provider | Typical Chunk Size | Notes |
|---|---|---|
| OpenAI | 1-3 tokens | Very granular, smooth streaming |
| Anthropic | 1-5 tokens | Slightly larger chunks |
| 5-20 tokens | Larger, less frequent chunks | |
| DeepSeek | 1-3 tokens | Similar to OpenAI |
| Chinese providers | Variable | Tends toward larger chunks |
The gateway passes through chunks as-is without rebuffering. Streaming smoothness depends on the upstream provider.
Model Version Pinning
Some providers use date-based model versions. The gateway maps aliases to the latest stable version.
| Alias | Resolves To | Provider |
|---|---|---|
gpt-4o | Latest stable GPT-4o snapshot | OpenAI |
claude-sonnet-4-6 | claude-sonnet-4-6 | Anthropic |
gemini-2.5-flash | Latest stable 2.5 Flash |
To pin a specific version, use the full versioned model name (e.g., openai/gpt-4o).
Token Counting Variations
Different providers count tokens differently for the same input text. The gateway reports whatever the upstream provider returns in the usage field.
| Factor | Impact |
|---|---|
| Tokenizer differences | The same text may be 100 tokens on OpenAI and 110 on Anthropic. |
| System prompt handling | Some providers count system prompts differently. |
| Image tokens | Token costs for vision inputs vary significantly between providers. |
Billing is always based on the upstream provider's reported token counts, not an independent count by the gateway. See Billing Model for how token counts translate to costs.
Next Steps
- Status Code Mapping — How upstream errors are normalized to gateway error codes
- Choose a Model — Pick the right model considering provider-specific trade-offs
- Streaming Guide — Handle varying chunk sizes across providers