Vision

Supported Models

Chuizi.AI supports 80 models with vision capabilities, including:

Provider	Models
OpenAI / Azure	GPT-4.1, GPT-4.1-mini, GPT-4o, GPT-5, o3, o4-mini
Anthropic	Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.5, Claude Haiku 3.5
Google	Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash
DeepSeek	DeepSeek V3 (via compatible endpoints)
Qwen	Qwen-VL-Max, Qwen-VL-Plus
Other	Llama 4 Scout, Llama 4 Maverick, Nova Pro/Lite

Check GET /v1/models for the full list. Models with "vision": true in their capabilities support image input.
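As a rough sketch of that check, the helper below filters a parsed /v1/models response down to vision-capable model IDs. The response layout used here (a top-level "data" list whose entries carry "id" and a "capabilities" object) is an assumption modeled on OpenAI-style list responses, and the sample payload is illustrative only; adjust the field names to whatever the endpoint actually returns.

```python
from typing import Any


def vision_model_ids(models_payload: dict[str, Any]) -> list[str]:
    """Return the IDs of models whose capabilities include "vision": true.

    Assumes a payload shaped like {"data": [{"id": ..., "capabilities": {...}}]}.
    """
    ids = []
    for model in models_payload.get("data", []):
        # Treat a missing or null capabilities field as "no vision support".
        capabilities = model.get("capabilities") or {}
        if capabilities.get("vision") is True:
            ids.append(model["id"])
    return ids


# Illustrative payload only; the real response may use a different layout.
sample = {
    "data": [
        {"id": "openai/gpt-4.1", "capabilities": {"vision": True}},
        {"id": "example/text-only", "capabilities": {"vision": False}},
        {"id": "example/no-capabilities"},
    ]
}

print(vision_model_ids(sample))  # ['openai/gpt-4.1']
```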

Request Format

Images are sent as part of the content array in a message, alongside text:

{
  "model": "openai/gpt-4.1",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/photo.jpg"
          }
        }
      ]
    }
  ],
  "max_tokens": 1024
}

Image Input Methods

Public URL

Pass a publicly accessible URL. The model fetches the image directly:

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg"
  }
}

Supported formats: JPEG, PNG, GIF, WebP.

Base64 Data URL

Encode the image as base64 and embed it in the request using a data URL:

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQ..."
  }
}

Use base64 when:

The image is not publicly accessible.
You want to avoid an extra HTTP round-trip.
The image is generated dynamically.
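A minimal sketch of building such a data URL in Python, covering both a file on disk and bytes generated in memory. The helper names and the suffix-to-MIME table are hypothetical, chosen for illustration; the table only covers the four formats listed above.

```python
import base64
from pathlib import Path

# Hypothetical lookup table covering the formats listed above.
MIME_BY_SUFFIX = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes (for example, a dynamically generated chart)."""
    if mime_type not in MIME_BY_SUFFIX.values():
        raise ValueError(f"Unsupported image type: {mime_type}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Encode an image file, inferring the MIME type from its extension."""
    mime_type = MIME_BY_SUFFIX.get(Path(path).suffix.lower())
    if mime_type is None:
        raise ValueError(f"Unsupported image extension: {path!r}")
    return bytes_to_data_url(Path(path).read_bytes(), mime_type)


def image_part(data_url: str) -> dict:
    """Wrap a data URL in the content-part shape used in the request."""
    return {"type": "image_url", "image_url": {"url": data_url}}
```

For example, image_part(file_to_data_url("screenshot.png")) yields a content part that can be placed next to a text part in the content array.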

Multiple Images

You can include multiple images in a single message:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Compare these two screenshots and list the differences."},
    {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}}
  ]
}
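When the number of images varies at runtime, it can be easier to assemble the content array programmatically. The helper below is a hypothetical convenience function, not part of any SDK; it produces the same message shape as the JSON above and optionally sets the detail field covered in the next section.

```python
from typing import Optional


def build_vision_message(
    prompt: str,
    image_urls: list[str],
    detail: Optional[str] = None,
) -> dict:
    """Build a user message with one text part followed by one part per image."""
    if detail not in (None, "auto", "low", "high"):
        raise ValueError(f"Unknown detail value: {detail!r}")

    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        image_url = {"url": url}
        if detail is not None:
            # Only include the field when the caller asked for a specific level.
            image_url["detail"] = detail
        content.append({"type": "image_url", "image_url": image_url})
    return {"role": "user", "content": content}


message = build_vision_message(
    "Compare these two screenshots and list the differences.",
    ["https://example.com/before.png", "https://example.com/after.png"],
)
```

The resulting message can be passed as an element of the messages list in a chat completion request.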

Detail Parameter

The detail parameter controls image resolution and token cost:

Value	Behavior	Token Cost
`"auto"`	Model chooses based on image size (default)	Varies
`"low"`	Resized to 512x512. Faster, cheaper.	~85 tokens
`"high"`	Full resolution analyzed in tiles. More accurate.	~174 tokens per tile

{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/diagram.png",
    "detail": "high"
  }
}

Use "low" for simple classification or presence detection. Use "high" for OCR, small text, or detailed analysis.
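To get a feel for how the two settings compare, here is a back-of-the-envelope estimator built on the approximate figures in the table above. The 512-pixel tile edge is an assumption that the table does not state, and the sketch ignores any downscaling a provider may apply before tiling, so treat the result as a rough estimate rather than a billing figure.

```python
import math

LOW_DETAIL_TOKENS = 85    # "~85 tokens" from the table above
TOKENS_PER_TILE = 174     # "~174 tokens per tile" from the table above
ASSUMED_TILE_EDGE = 512   # assumption: the tile size is not stated above


def estimate_image_tokens(
    width: int,
    height: int,
    detail: str = "high",
    tile_edge: int = ASSUMED_TILE_EDGE,
) -> int:
    """Roughly estimate the token cost of one image for "low" or "high" detail."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return LOW_DETAIL_TOKENS
    if detail != "high":
        # "auto" depends on the model's own choice, so it cannot be estimated here.
        raise ValueError('detail must be "low" or "high"')
    # Count the tiles needed to cover the image without any prior rescaling.
    tiles = math.ceil(width / tile_edge) * math.ceil(height / tile_edge)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(1024, 1024, "low"))   # 85
print(estimate_image_tokens(1024, 1024, "high"))  # 696
```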

Code Examples

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://api.chuizi.ai/v1",
    api_key="ck-your-key-here",
)

# Using a public URL
response = client.chat.completions.create(
    model="openai/gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)

# Using base64
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)

Tips

Image size limits. Most models accept images up to 20 MB. Larger images are rejected. Resize before sending if needed.
Token cost scales with resolution. A high-detail 2048x2048 image can consume 1000+ tokens. Use "detail": "low" when full resolution is not needed.
Combine vision with tools. You can use vision and function calling together. For example, analyze a receipt image, then call a function to log the expense.
Not all models support all formats. Some models only accept JPEG and PNG. Check the model's documentation if you encounter format errors.
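The receipt scenario from the tips above might look like the sketch below, which builds request arguments combining an image part with a tool definition. The log_expense function and its parameters are hypothetical, and the sketch assumes the gateway accepts the OpenAI-style tools field for the chosen model.

```python
def build_receipt_request(image_url: str) -> dict:
    """Build chat completion arguments that pair an image with a tool definition."""
    return {
        "model": "openai/gpt-4.1",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Log the expense shown on this receipt."},
                    {"type": "image_url", "image_url": {"url": image_url, "detail": "high"}},
                ],
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    # Hypothetical function; replace with your own schema.
                    "name": "log_expense",
                    "description": "Record an expense extracted from a receipt.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "merchant": {"type": "string"},
                            "total": {"type": "number"},
                            "currency": {"type": "string"},
                        },
                        "required": ["merchant", "total"],
                    },
                },
            }
        ],
        "max_tokens": 1024,
    }


request_args = build_receipt_request("https://example.com/receipt.jpg")
```

With the client from the Code Examples section, this could be sent as client.chat.completions.create(**request_args). If the model decides to call the function, the call appears in response.choices[0].message.tool_calls.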

Next Steps

Chat Completions API — full parameter reference for vision requests
Image Generation — generate images from text prompts
Choose a Model — compare vision-capable models