Alibaba's Qwen 3.7 Max ran autonomously for 35 hours, fired 1,158 tool calls without human input, and delivered a completed GPU kernel optimisation. That internal demonstration, disclosed at the Alibaba Cloud Summit on 20 May 2026, captures what the company is positioning as a decisive shift in frontier AI design.
What Was Released and When
Qwen 3.7 Max is Alibaba's current flagship in the Qwen 3.7 series. API access opened around 19–20 May 2026 via DashScope, Alibaba's developer platform. Sources differ slightly on the exact date: Artificial Analysis records 19 May and OpenRouter lists the model on 21 May, while 20 May is the date most review sources cite. The model carries a 1-million-token context window, up from 256K on its predecessor, with reworked long-context attention designed to sustain retrieval across long inputs. Extended thinking is built in natively. Weights are closed; no open-weight release has been announced.
Benchmark Position: Highest-Ranked Chinese Model at Launch
On the Artificial Analysis Intelligence Index v4.0, Qwen 3.7 Max scored 56.6 at launch, placing it fifth overall across the 150-plus models measured at the time and making it the highest-ranked Chinese model on that leaderboard at that snapshot. That score put it ahead of DeepSeek V4 Pro, which scored 52.0 on the same index. Live leaderboards shift continuously; the ranking has moved since launch. On task-specific tests, the figures Alibaba cited are vendor-stated: SWE-Bench Pro 60.6, SWE-Bench Verified 80.4, Terminal-Bench 2.0 69.7, and GPQA Diamond 92.4. SWE-Bench Pro and SWE-Bench Verified are distinct benchmarks; the 60.6 figure refers to the harder Pro variant, which evaluates complex multi-step coding tasks across professional repositories. For comparison, DeepSeek V4 Pro's vendor-stated SWE-Bench Pro score is 59.0, per multiple review sources. Independent replication of the full benchmark suite was pending at the time of writing.
The Agent-First Design
Alibaba is explicit that Qwen 3.7 Max was not built to win a single-prompt leaderboard. The model is engineered for long-horizon autonomous execution: running a multi-step task pipeline, calling external tools repeatedly, and course-correcting without a human in the loop. The 35-hour demonstration — 1,158 tool calls on an in-house accelerator project — is a marketing claim and has not been independently reproduced. That caveat aside, the architecture choices (extended context, native thinking mode, tool-call optimisation) are consistent with agent-first priorities.
Pricing and Competitive Context
At US$2.50 per million input tokens and US$7.50 per million output tokens, Qwen 3.7 Max is priced above DeepSeek V4 Pro. DeepSeek permanently reduced its V4 Pro pricing to US$0.435 per million input tokens and US$0.87 per million output tokens on 22 May 2026, and that rate is now confirmed as the standing price. DeepSeek also offers open weights, which matters to enterprises that need on-premise deployment. OpenRouter routes Qwen 3.7 Max requests at US$1.25 input and US$3.75 output per million tokens, a 50% discount from the DashScope list rate. Alibaba's counter-argument is capability: on the Artificial Analysis Intelligence Index, Qwen 3.7 Max scored 4.6 points higher than DeepSeek V4 Pro at launch (56.6 against 52.0). GPT-5.5 (60.2) and Claude Opus 4.7 (57.3) were ranked above it on the same index at the time, though both carry higher per-token prices at comparable tiers.
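To make the price gap concrete, the short Python sketch below applies the per-million-token rates quoted above to a single workload. The workload size (50 million input tokens and 5 million output tokens) is a hypothetical figure chosen for illustration, not one reported by any vendor, and the `Rate` class and `cost_usd` helper are illustrative names rather than part of any provider's SDK. It also assumes the DeepSeek and OpenRouter figures are input and output rates respectively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Price in US dollars per one million tokens."""
    input_per_m: float
    output_per_m: float


# Rates as quoted in the article; assumed to be input/output respectively.
RATES = {
    "Qwen 3.7 Max (DashScope)": Rate(2.50, 7.50),
    "Qwen 3.7 Max (OpenRouter)": Rate(1.25, 3.75),
    "DeepSeek V4 Pro": Rate(0.435, 0.87),
}


def cost_usd(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Return the total cost in US dollars for the given token counts."""
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


# Hypothetical workload, chosen only for illustration.
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 5_000_000

costs = {
    name: cost_usd(rate, INPUT_TOKENS, OUTPUT_TOKENS)
    for name, rate in RATES.items()
}

for name, cost in costs.items():
    print(f"{name}: US${cost:,.2f}")
```

Under these assumed token counts the DashScope list rate comes to US$162.50, the OpenRouter rate to US$81.25, and DeepSeek V4 Pro to US$26.10, so the list-price gap is roughly sixfold. The ratio depends on the input/output mix, which varies considerably between chat and long-running agent workloads.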
What This Signals for China's AI Race
For most of 2024 and early 2025, Chinese frontier labs competed primarily on benchmark scores for chat and reasoning tasks. Qwen 3.7 Max's positioning, and the framing of that 35-hour run as the headline achievement, suggests the competitive axis has moved. Long-horizon autonomous agents are where enterprise software automation deals are being won. Placing in the top five of a major global leaderboard matters commercially; it signals to procurement teams that Chinese models are now in the same capability tier as Western counterparts for the use case that drives enterprise contracts. Whether the benchmark results translate to production reliability at scale is a different question, and one that only deployment data will answer.