
Alibaba acknowledges that Qwen3's hybrid-thinking approach was ill-conceived

Chinese e-commerce heavyweight returns to dedicated instruction-tuned and reasoning-focused models, emphasizing quality over convenience.

Alibaba has recently released updated versions of its Qwen3 models, each tailored for specific tasks. The new models, identifiable by the 2507 date code in their names, come in dedicated instruct-tuned and thinking-tuned variants.
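For readers who want to try the split variants, here is a minimal sketch of how one of the 2507 checkpoints could be loaded with Hugging Face transformers. The repository name is an assumption based on Qwen's usual naming scheme (base model name plus an "Instruct" or "Thinking" tag and the 2507 date code); swap in the thinking-tuned name for the reasoning variant.

```python
# Minimal sketch: loading a 2507 Qwen3 variant via Hugging Face transformers.
# The repo name below is an assumption based on Qwen's naming convention,
# not a name confirmed by the article.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-235B-A22B-Instruct-2507"  # assumed; use a "-Thinking-2507" repo for the reasoning variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # picks up the native BF16 weights
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the difference between instruct- and thinking-tuned models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```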

Comparative Performance

The dedicated instruct-tuned Qwen3 model significantly outperforms leading models from OpenAI and DeepSeek on benchmarks involving instruction following, logical reasoning, coding, science, and tool use. It supports a context window of up to 256,000 tokens, a massive increase over earlier versions, allowing it to handle much larger inputs and perform better on complex tasks.

On the other hand, the thinking-tuned models show improvements on specialized math-heavy benchmarks such as the AIME25 test, scoring between 13% and 54% better than the original Qwen3 release. However, the uplift is less dramatic than the gains seen with the non-thinking, instruct-tuned models.

Original Release vs. Upgraded Models

The original Qwen3 releases had smaller context windows (32k tokens) and delivered solid performance, but they are significantly outclassed by the newest instruct-tuned 235B parameter models in both accuracy and task range. The upgraded models improve instruction fidelity, logical reasoning, and coding, and add broader multi-domain tool use, backed by refined tuning and the larger context capacity.

Future Plans for Hybrid Thinking Mode

Alibaba initially introduced a "hybrid thinking mode" in Qwen3 that aimed to blend instruction-following and deep reasoning abilities in one unified model. Despite some initial enthusiasm, the hybrid thinking mode underperformed in certain respects. Consequently, development on this mode is currently paused. The Alibaba team states that it has not abandoned the idea but is continuing research to resolve the quality issues before reintroducing hybrid thinking functionality in future Qwen models.
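For context, the sketch below illustrates how the hybrid mode was exposed in the original Qwen3 release, assuming the enable_thinking flag its chat template documented. The new 2507 variants drop this per-request switch, since instruct and thinking behavior now live in separate models.

```python
# Rough sketch of the original hybrid-mode toggle, assuming the enable_thinking
# flag documented for the first Qwen3 chat template. Not applicable to the
# dedicated 2507 instruct/thinking variants.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")  # original hybrid release (assumed repo name)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]

# Deep-reasoning path: the model prefixes its answer with a <think>...</think> trace.
reasoning_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Fast instruction-following path: no reasoning trace is generated.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```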

Additional Notes

Alibaba also unveiled Qwen3-Coder, a specialized AI model for coding tasks with a massive 480 billion total parameters and the ability to process context lengths of up to 1 million tokens, surpassing even GPT-4.1 in coding benchmarks. The overall trend of the Qwen3 releases is toward improving domain-specific capabilities (math, code, reasoning) and expanding the model's effective context length to support longer, more complex interactions.

The new Qwen releases extend the context window from 32k tokens to 256k. The updated models are being made available both in their native BF16 precision and as quantized FP8 variants.

The team has not said whether it will also release updated versions of the smaller 30-billion-parameter mixture-of-experts (MoE) model, nor has it detailed how the updated thinking-tuned models perform relative to their non-thinking counterparts.

Stay tuned for more updates as the Qwen team plans to roll out additional versions of its Qwen3 models in the coming days.

  1. The instruct-tuned Qwen3 235B model, with its larger 256k-token context window, outperforms leading models from OpenAI and DeepSeek on benchmarks covering instruction following, logical reasoning, coding, science, and tool use, and clearly outclasses the original Qwen3 release.
  2. The thinking-tuned Qwen3 models show gains on math-heavy benchmarks such as the AIME25 test, scoring between 13% and 54% better than the original Qwen3 release, although the uplift is less dramatic than that of the instruct-tuned models.
