Vision
Supported Models
Chuizi.AI supports 80 models with vision capabilities, including:
| Provider | Models |
|---|---|
| OpenAI / Azure | GPT-4.1, GPT-4.1-mini, GPT-4o, GPT-5, o3, o4-mini |
| Anthropic | Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.5, Claude Haiku 3.5 |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash |
| DeepSeek | DeepSeek V3 (via compatible endpoints) |
| Qwen | Qwen-VL-Max, Qwen-VL-Plus |
| Other | Llama 4 Scout, Llama 4 Maverick, Nova Pro/Lite |
Call GET /v1/models for the full list. Models with "vision": true in their capabilities accept image input.
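As a minimal sketch of that check, the helper below filters a model list for vision support. The response shape is an assumption (an OpenAI-style `data` array whose entries carry `id` and a `capabilities` object), and `vision_model_ids` is a hypothetical helper name, so adjust both to whatever the endpoint actually returns.

```python
from typing import Any


def vision_model_ids(models_response: dict[str, Any]) -> list[str]:
    """Return IDs of models whose capabilities include "vision": true.

    Assumed shape: {"data": [{"id": "...", "capabilities": {"vision": true}}]}
    """
    ids = []
    for model in models_response.get("data", []):
        # Treat a missing or null capabilities object as "no vision support".
        capabilities = model.get("capabilities") or {}
        if capabilities.get("vision") is True:
            ids.append(model["id"])
    return ids


# Hypothetical sample payload for illustration, not real API output.
sample = {
    "data": [
        {"id": "openai/gpt-4.1", "capabilities": {"vision": True}},
        {"id": "example/text-only", "capabilities": {"vision": False}},
        {"id": "example/no-capabilities"},
    ]
}
print(vision_model_ids(sample))  # ['openai/gpt-4.1']
```

In practice, `models_response` would be the parsed JSON body of the GET /v1/models call.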
Request Format
Images are sent as part of the content array in a message, alongside text:
```json
{
  "model": "openai/gpt-4.1",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image?" },
        { "type": "image_url", "image_url": { "url": "https://example.com/photo.jpg" } }
      ]
    }
  ],
  "max_tokens": 1024
}
```
Image Input Methods
Public URL
Pass a publicly accessible URL. The model fetches the image directly:
```json
{
  "type": "image_url",
  "image_url": { "url": "https://example.com/photo.jpg" }
}
```
Supported formats: JPEG, PNG, GIF, WebP.
Base64 Data URL
Encode the image as base64 and embed it in the request using a data URL:
```json
{
  "type": "image_url",
  "image_url": { "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQ..." }
}
```
Use base64 when:
- The image is not publicly accessible.
- You want to avoid an extra HTTP round-trip.
- The image is generated dynamically.
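One way to build such a data URL from a local file or from in-memory bytes is sketched below. The helper names `bytes_to_data_url` and `image_to_data_url` are hypothetical, and the MIME type is guessed from the file extension, which may not match the actual file contents.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_to_data_url(path: str) -> str:
    """Read an image file and return it as a data URL.

    The MIME type is inferred from the file extension only.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path!r}")
    return bytes_to_data_url(Path(path).read_bytes(), mime_type)
```

The returned string goes directly into the `url` field of an `image_url` content part. For dynamically generated images, `bytes_to_data_url` avoids writing a temporary file.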
Multiple Images
You can include multiple images in a single message:
```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Compare these two screenshots and list the differences."},
    {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}}
  ]
}
```
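When the number of images varies, it can be easier to assemble the content array in code. The sketch below shows one way to do that. `build_vision_message` is a hypothetical helper, not part of any SDK, and the optional `detail` argument maps to the setting described in the Detail Parameter section.

```python
from typing import Optional


def build_vision_message(
    text: str, image_urls: list[str], detail: Optional[str] = None
) -> dict:
    """Build a user message with one text part followed by one part per image."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        image_url = {"url": url}
        if detail is not None:
            # Only include "detail" when the caller sets it explicitly.
            image_url["detail"] = detail
        content.append({"type": "image_url", "image_url": image_url})
    return {"role": "user", "content": content}


message = build_vision_message(
    "Compare these two screenshots and list the differences.",
    ["https://example.com/before.png", "https://example.com/after.png"],
)
print(len(message["content"]))  # 3
```

The resulting dictionary can be placed in the `messages` list of a chat completion request.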
Detail Parameter
The detail parameter controls image resolution and token cost:
| Value | Behavior | Token Cost |
|---|---|---|
| "auto" | Model chooses based on image size (default) | Varies |
| "low" | Resized to 512x512. Faster, cheaper. | ~85 tokens |
| "high" | Full resolution analyzed in tiles. More accurate. | ~174 tokens per tile |
```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/diagram.png",
    "detail": "high"
  }
}
```
Use "low" for simple classification or presence detection. Use "high" for OCR, small text, or detailed analysis.
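To get a feel for how the two settings differ in cost, the sketch below turns the approximate figures from the table into a rough estimate. It assumes 512x512 tiles and no resizing before tiling, neither of which is stated on this page. Providers may downscale images first or add a fixed base cost, so treat the result as an order-of-magnitude guide rather than a billing figure.

```python
import math

LOW_DETAIL_TOKENS = 85   # approximate figure from the table above
TOKENS_PER_TILE = 174    # approximate figure from the table above
TILE_SIZE = 512          # assumption: tiles are 512x512 pixels


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Roughly estimate image token cost for "low" or "high" detail."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return LOW_DETAIL_TOKENS
    if detail != "high":
        # "auto" is decided by the model, so it cannot be estimated here.
        raise ValueError('detail must be "low" or "high"')
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(1024, 1024, "low"))   # 85
print(estimate_image_tokens(1024, 1024, "high"))  # 696
```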
Code Examples
```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# Using a public URL
response = client.chat.completions.create(
    model="openai/gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)

# Using base64
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
Tips
- Image size limits. Most models accept images up to 20 MB. Larger images are rejected. Resize before sending if needed.
- Token cost scales with resolution. A high-detail 2048x2048 image can consume 1000+ tokens. Use "detail": "low" when full resolution is not needed.
- Combine vision with tools. You can use vision and function calling together -- for example, analyze a receipt image and call a function to log the expense.
- Not all models support all formats. Some models only accept JPEG and PNG. Check the model's documentation if you encounter format errors.
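The receipt scenario from the tips above could look roughly like the sketch below, which only builds the request payload. The `log_expense` function and its parameters are hypothetical, and the tool definition follows the OpenAI-style `tools` format. Whether a given model accepts tools and images in the same request is an assumption to verify per model.

```python
def build_receipt_request(image_url: str, model: str = "openai/gpt-4.1") -> dict:
    """Build a chat request that pairs a receipt image with a tool definition."""
    # Hypothetical tool: the name and schema are illustrative only.
    log_expense_tool = {
        "type": "function",
        "function": {
            "name": "log_expense",
            "description": "Record an expense extracted from a receipt.",
            "parameters": {
                "type": "object",
                "properties": {
                    "merchant": {"type": "string"},
                    "total": {"type": "number"},
                    "currency": {"type": "string"},
                },
                "required": ["merchant", "total"],
            },
        },
    }
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Log the expense shown on this receipt."},
                    {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}},
                ],
            }
        ],
        "tools": [log_expense_tool],
        "max_tokens": 1024,
    }


request = build_receipt_request("https://example.com/receipt.jpg")
```

With the OpenAI Python SDK, the payload can be sent as `client.chat.completions.create(**request)`. If the model decides to call the function, the call appears in `response.choices[0].message.tool_calls`.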
Next Steps
- Chat Completions API — full parameter reference for vision requests
- Image Generation — generate images from text prompts
- Choose a Model — compare vision-capable models