Streaming

Enabling Streaming

Set "stream": true in your request body:

{
  "model": "anthropic/claude-sonnet-4-6",
  "messages": [{"role": "user", "content": "Write a haiku about APIs."}],
  "stream": true,
  "max_tokens": 256
}

The response arrives as a series of Server-Sent Events (SSE) over a single HTTP connection.

SSE Event Format

Each event is a line prefixed with data: followed by a JSON payload. Events are separated by a blank line (two consecutive newlines). The stream ends with a data: [DONE] sentinel, which is not JSON and should not be passed to a JSON parser.

data: {"id":"gen-xxx","object":"chat.completion.chunk",...}

data: {"id":"gen-xxx","object":"chat.completion.chunk",...}

data: [DONE]
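
The framing described above can be parsed with a few lines of Python. This is a minimal sketch rather than a full SSE parser: iter_sse_data is a hypothetical helper name, the sample payloads are trimmed stand-ins for real chunks, and it assumes each event carries a single data: line.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data: line until [DONE]."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators and any non-data fields are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        yield json.loads(payload)


# Trimmed sample stream (illustrative payloads only).
sample = [
    'data: {"id":"gen-xxx","object":"chat.completion.chunk"}\n',
    "\n",
    'data: {"id":"gen-xxx","object":"chat.completion.chunk"}\n',
    "\n",
    "data: [DONE]\n",
]
events = list(iter_sse_data(sample))
```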

Protocol Differences

OpenAI Protocol (/v1/chat/completions)

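This is the default protocol. Each chunk contains a delta object with partial content: the first chunk carries the role, later chunks carry content fragments, and the final chunk has an empty delta with finish_reason set. The usage field in the final chunk below appears only when stream_options.include_usage is enabled (see Getting Usage in Streaming Responses):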

data: {"id":"gen-abc123","object":"chat.completion.chunk","created":1712000000,"model":"openai/gpt-4.1","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"gen-abc123","object":"chat.completion.chunk","created":1712000000,"model":"openai/gpt-4.1","choices":[{"index":0,"delta":{"content":"In"},"finish_reason":null}]}

data: {"id":"gen-abc123","object":"chat.completion.chunk","created":1712000000,"model":"openai/gpt-4.1","choices":[{"index":0,"delta":{"content":" the"},"finish_reason":null}]}

data: {"id":"gen-abc123","object":"chat.completion.chunk","created":1712000000,"model":"openai/gpt-4.1","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"completion_tokens":17,"total_tokens":31}}

data: [DONE]
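
To rebuild the full message, a client concatenates the content fragments from each delta. The sketch below assumes the payloads have already been stripped of their data: prefix. collect_openai_text is a hypothetical helper, and the sample chunks are shortened versions of the ones shown above.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_openai_text(payloads: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join delta.content fragments and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for payload in payloads:
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The role-only first chunk and the final chunk have no content.
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


payloads = [
    '{"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    '{"choices":[{"index":0,"delta":{"content":"In"},"finish_reason":null}]}',
    '{"choices":[{"index":0,"delta":{"content":" the"},"finish_reason":null}]}',
    '{"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "[DONE]",
]
text, finish_reason = collect_openai_text(payloads)
```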

Anthropic Protocol (/anthropic/v1/messages)

The Anthropic protocol uses typed events with an event: field:

event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","model":"claude-sonnet-4-6","usage":{"input_tokens":25,"output_tokens":1}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"In"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" the"}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}

event: message_stop
data: {"type":"message_stop"}
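
Because every data: payload repeats the event type in its type field, a client can dispatch on the JSON alone and ignore the event: lines. The following sketch assumes that is acceptable for your use case. collect_anthropic_text is a hypothetical helper, and it handles only the text events shown above.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_anthropic_text(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join text_delta fragments and return (text, stop_reason)."""
    parts = []
    stop_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip event: lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        kind = event.get("type")
        if kind == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
        elif kind == "message_delta":
            stop_reason = event.get("delta", {}).get("stop_reason", stop_reason)
        elif kind == "message_stop":
            break
    return "".join(parts), stop_reason


lines = [
    "event: content_block_start",
    'data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"In"}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" the"}}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":17}}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
text, stop_reason = collect_anthropic_text(lines)
```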

Gemini Protocol (/gemini/v1beta/models/*/streamGenerateContent)

Gemini streaming uses its native format. Each chunk contains partial candidates:

{"candidates":[{"content":{"parts":[{"text":"In"}],"role":"model"}}],"usageMetadata":{"promptTokenCount":10,"candidatesTokenCount":1}}
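
Pulling the text out of a Gemini chunk means walking candidates, then content.parts. A minimal sketch, assuming the chunk has already been decoded into a dict and that only text parts matter (gemini_chunk_text is a hypothetical helper):

```python
def gemini_chunk_text(chunk: dict) -> str:
    """Concatenate the text of every part in every candidate of one chunk."""
    texts = []
    for candidate in chunk.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            texts.append(part.get("text", ""))  # non-text parts contribute nothing
    return "".join(texts)


chunk = {
    "candidates": [{"content": {"parts": [{"text": "In"}], "role": "model"}}],
    "usageMetadata": {"promptTokenCount": 10, "candidatesTokenCount": 1},
}
fragment = gemini_chunk_text(chunk)
```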

Getting Usage in Streaming Responses

By default, the OpenAI protocol does not include token usage in streaming responses. To receive usage data in the final chunk, add stream_options:

{
  "model": "openai/gpt-4.1",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}

The last chunk before [DONE] will contain a usage field with prompt_tokens, completion_tokens, and total_tokens.

For the Anthropic protocol, usage is always included in the message_start and message_delta events.
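
One way to read usage from already-decoded events on either protocol is sketched here. Both function names are hypothetical. The Anthropic helper assumes the output_tokens value in message_delta is a running total that supersedes the one in message_start, which matches the sample events shown earlier but should be verified against real traffic.

```python
from typing import Iterable, Optional


def openai_usage(chunks: Iterable[dict]) -> Optional[dict]:
    """Return the last non-empty usage object seen, or None if absent."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


def anthropic_usage(events: Iterable[dict]) -> dict:
    """Merge usage from message_start and message_delta events."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "message_start":
            totals.update(event.get("message", {}).get("usage", {}))
        elif event.get("type") == "message_delta":
            # Assumed to be cumulative, so it replaces the earlier count.
            totals.update(event.get("usage", {}))
    return totals


openai_chunks = [
    {"choices": [{"index": 0, "delta": {"content": "In"}, "finish_reason": None}]},
    {
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 14, "completion_tokens": 17, "total_tokens": 31},
    },
]
anthropic_events = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 25, "output_tokens": 1}}},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 17}},
]
openai_totals = openai_usage(openai_chunks)
anthropic_totals = anthropic_usage(anthropic_events)
```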

Code Examples

curl -X POST https://api.chuizi.ai/v1/chat/completions \
  -H "Authorization: Bearer ck-your-key-here" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Write a haiku about APIs."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'
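
The same request can be made from Python using only the standard library. Treat this as a sketch rather than an official client: build_stream_request and stream_text are hypothetical helper names, the 120-second default timeout is an illustrative choice, and error handling is omitted for brevity.

```python
import json
import urllib.request
from typing import Iterator

API_URL = "https://api.chuizi.ai/v1/chat/completions"


def build_stream_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = {
        "model": "anthropic/claude-sonnet-4-6",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_text(api_key: str, prompt: str, timeout: float = 120.0) -> Iterator[str]:
    """Yield content fragments as they arrive over the SSE connection."""
    request = build_stream_request(api_key, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw in response:  # the response object iterates line by line
            line = raw.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                break
            chunk = json.loads(payload)
            for choice in chunk.get("choices", []):
                text = choice.get("delta", {}).get("content")
                if text:
                    yield text
```

To print the response as it streams, iterate the generator: for piece in stream_text("ck-your-key-here", "Write a haiku about APIs."): print(piece, end="", flush=True).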

Error Handling in Streams

If an error occurs mid-stream, the server sends an error event before closing the connection:

data: {"error":{"message":"Upstream provider timeout","type":"server_error","code":"504"}}

data: [DONE]

Your client should handle connection drops and partial responses gracefully. See the Error Handling guide for retry strategies.
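
A client can detect the error payload by checking each decoded chunk for a top-level error key. The sketch below raises an exception while keeping whatever arrived before the failure. StreamError and iter_chunks_or_raise are hypothetical names, and the first sample chunk is shortened for illustration.

```python
import json
from typing import Iterable, Iterator, Optional


class StreamError(Exception):
    """Raised when the stream delivers an error payload."""

    def __init__(self, message: str, error_type: Optional[str] = None,
                 code: Optional[str] = None) -> None:
        super().__init__(message)
        self.error_type = error_type
        self.code = code


def iter_chunks_or_raise(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded chunks; raise StreamError on an error payload."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        if "error" in chunk:
            err = chunk["error"]
            raise StreamError(err.get("message", "unknown stream error"),
                              err.get("type"), err.get("code"))
        yield chunk


sample = [
    'data: {"choices":[{"index":0,"delta":{"content":"In"},"finish_reason":null}]}',
    "",
    'data: {"error":{"message":"Upstream provider timeout","type":"server_error","code":"504"}}',
    "",
    "data: [DONE]",
]
received = []
failure = None
try:
    for chunk in iter_chunks_or_raise(sample):
        received.append(chunk)  # partial output stays available to the caller
except StreamError as exc:
    failure = exc
```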

Tips

  • Always set max_tokens when streaming. Without it, some models may generate excessively long responses.
  • Use stream_options.include_usage if you need to track token consumption per request. Without it, the OpenAI protocol does not report usage in streaming mode.
  • Buffer for rendering. Rendering every single token update to the DOM can cause layout thrashing. Buffer a few tokens before updating the UI.
  • Set appropriate timeouts. Streaming connections can last several minutes for long responses. Set your HTTP client timeout to at least 120 seconds.
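
The buffering tip can be sketched as a small generator that groups tokens into larger pieces before they reach the UI. batch_tokens and the min_chars threshold are illustrative choices, not part of the API. A time-based flush may suit some interfaces better.

```python
from typing import Iterable, Iterator


def batch_tokens(tokens: Iterable[str], min_chars: int = 16) -> Iterator[str]:
    """Group streamed tokens into pieces of at least min_chars characters."""
    buffer = []
    size = 0
    for token in tokens:
        buffer.append(token)
        size += len(token)
        if size >= min_chars:
            yield "".join(buffer)
            buffer, size = [], 0
    if buffer:
        yield "".join(buffer)  # flush whatever remains when the stream ends


tokens = ["In", " the", " quiet", " wire", ","]
pieces = list(batch_tokens(tokens, min_chars=8))
```

Each yielded piece then maps to one UI update instead of one update per token.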
