
In the newly released SpatialBench benchmark, Alibaba's Qwen3-VL and Qwen2.5-VL vision-language models secured the top two spots with scores of 13.5 and 12.9 respectively, significantly outperforming Gemini 3.0 Pro Preview (9.6) and GPT-5.1 (7.5), though all models remain far below the human baseline of 80 points. SpatialBench, a leading benchmark focused on 2D/3D spatial reasoning, covers complex tasks such as circuit analysis and CAD engineering and has been hailed as a "litmus test for embodied intelligence"; its results are considered a core indicator of an AI system's spatial understanding capabilities.
Technically, Qwen3-VL upgrades its 3D detection through rotated bounding box output and a depth estimation head, improving accuracy in occluded scenes by 18% and more reliably determining object orientation under viewpoint changes. Its visual programming feature can generate runnable Python code from an input sketch or short video, delivering a "what you see is what you get" experience. The model is also offered in sizes ranging from 2B to 235B parameters and outperforms Gemini 2.5 Pro by an average of 6.4 points across 32 core tests.
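The article does not show how rotated bounding boxes work, so here is a minimal, hypothetical sketch (not Qwen's actual implementation) of the common (center, width, height, angle) parameterization such a detection head would output, converted to the four corner points of the box:

```python
import math

def rotated_box_corners(cx, cy, w, h, theta):
    """Convert a rotated bounding box (center, size, angle in radians)
    into its four corner points, listed counter-clockwise."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Half-extents of the box before rotation.
    hw, hh = w / 2.0, h / 2.0
    corners = []
    for dx, dy in [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]:
        # Rotate each corner offset by theta, then translate to the center.
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners

# Axis-aligned case: theta = 0 reduces to an ordinary bounding box.
print(rotated_box_corners(0.0, 0.0, 4.0, 2.0, 0.0))
```

Unlike an axis-aligned box, the extra angle parameter lets the detector tightly enclose tilted objects, which is what makes orientation estimation possible in the first place.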
According to the open-source roadmap, Qwen2.5-VL is already fully open-source, while Qwen3-VL will release its weights and toolchain in the second quarter of 2025, launching simultaneously on the Qianwen App for free trial. Alibaba Cloud revealed that the model has been validated in scenarios such as logistics robots and AR assembly, with a spatial positioning error of less than 2 cm, and plans to launch a "vision-action" end-to-end model in 2026 to give robots real-time visual servoing capabilities.
This achievement marks a breakthrough for Chinese AI in the multimodal field. Industry evaluations indicate that the Qwen-VL series has surpassed GPT-4V in tasks such as document analysis and Chinese image understanding, placing it in a global top three alongside the Gemini and GPT families.