GLM-5.1 Tops GPT-5.4 on SWE-Bench Pro — Open-Weight AI
GLM-5.1 beats GPT-5.4 on SWE-Bench Pro: 58.4% vs 57.7%. MIT-licensed open-weight AI, trained on Huawei chips. No Nvidia. Weights free on HuggingFace.
On April 7, 2026, an open-weight AI model called GLM-5.1 quietly claimed the #1 position on SWE-Bench Pro — a benchmark widely used to measure real-world coding ability by testing whether an AI can locate and fix actual bugs in live software repositories. Its score: 58.4%. GPT-5.4 scored 57.7%. Claude Opus 4.6 scored 57.3%. The model's maker, Z.ai (a Chinese AI research lab that listed on the Hong Kong Stock Exchange in January at a $31.3 billion valuation), published the full model weights under the MIT license with zero commercial restrictions. And it was built without a single Nvidia chip.
GLM-5.1's SWE-Bench Pro Score: Topping the AI Coding Leaderboard
SWE-Bench Pro is considered one of the hardest coding benchmarks available. Unlike tests that measure single-function generation, SWE-Bench Pro feeds AI models real GitHub bug reports and requires them to navigate entire codebases, identify root causes, and submit working patches — a test of end-to-end software engineering capability, not just autocomplete speed.
Here is where GLM-5.1 landed across key evaluation categories:
- SWE-Bench Pro: 58.4% — #1 globally (GPT-5.4: 57.7%, Claude Opus 4.6: 57.3%, Gemini 3.1 Pro: 54.2%)
- CyberGym: 68.7% vs. Claude Opus 4.6's 66.6% — cybersecurity task completion
- Terminal-Bench 2.0: 63.5% — command-line environment problem solving
- NL2Repo: 42.7% — generating full software repositories from plain-language descriptions
- AIME 2026: 95.3% (GPT-5.4 leads at 98.7%) — competition mathematics
- GPQA-Diamond: 86.2% (Gemini 3.1 Pro leads at 94.3%) — graduate-level science questions
- BrowseComp: 68.0% (79.3% with enhanced context management) — locating hard-to-find information on the web
Independent reviewer Elena Marchetti at Awesome Agents rated GLM-5.1 at 8.1/10, noting that the model can "maintain goal alignment across hundreds of tool calls without strategy drift" — meaning it stays focused on the original objective across very long autonomous sessions without gradually drifting off-task the way earlier agentic systems often did.
Architecture: 754 Billion Parameters, 40 Billion Active at a Time
GLM-5.1 is built on a Mixture-of-Experts architecture (MoE — a design where specialized sub-networks activate selectively for different tasks rather than firing all 754 billion parameters simultaneously). Only approximately 40 billion parameters are active during any single token generation, which dramatically reduces compute demand compared to a dense model of equivalent total size. The context window (the amount of text the model can process in a single session) is 200,000 tokens — roughly 150,000 words of code in one sitting — with a maximum output of 128,000 tokens per response.
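To put that compute saving in rough numbers, here is a back-of-the-envelope sketch in Python. It uses the common approximation of about 2 FLOPs per active parameter per generated token; that heuristic is a rule of thumb for illustration, not a figure published by Z.ai:

```python
# Rough FLOPs-per-token comparison: a hypothetical dense 754B model vs.
# GLM-5.1's MoE routing, using the ~2 FLOPs per active parameter heuristic.
TOTAL_PARAMS = 754e9   # all experts combined
ACTIVE_PARAMS = 40e9   # parameters actually routed per token

dense_flops = 2 * TOTAL_PARAMS   # every parameter fires on every token
moe_flops = 2 * ACTIVE_PARAMS    # only the routed experts fire

print(f"Dense 754B:      ~{dense_flops:.2e} FLOPs/token")
print(f"MoE, 40B active: ~{moe_flops:.2e} FLOPs/token")
print(f"Compute reduction: ~{dense_flops / moe_flops:.0f}x")   # about 19x
```

Under that approximation, GLM-5.1 generates each token at roughly the cost of a 40-billion-parameter dense model while retaining the capacity of 754 billion.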
No Nvidia Chips. Zero. The Hardware Twist Nobody Expected.
The benchmark score was notable. The training stack was the larger surprise. GLM-5.1 was developed entirely on Huawei Ascend chips, with zero Nvidia hardware involved at any stage. This matters because the US government has restricted export of Nvidia's highest-performance AI processors (including the A100 and H100 series — the GPUs that power most frontier AI training globally) to China since 2022, intensifying those controls through 2024 and 2025. The premise behind those restrictions was that blocking access to leading-edge chips would slow the development of frontier AI in China.
GLM-5.1's performance challenges that assumption directly. Z.ai produced a model that outperforms GPT-5.4 on the most-cited software engineering benchmark in AI, using only domestically manufactured Chinese hardware. The result is drawing attention in policy circles as evidence that US chip export restrictions may redirect AI development rather than halt it.
Eight Hours of Autonomous AI Coding — 655 Iterations, No Supervision
Beyond benchmark scores, GLM-5.1's most commercially significant capability is extended autonomous operation. The model can execute an internal "experiment, analyze, optimize" loop (a cycle where it tests its own output, diagnoses failures, and revises its approach — without a human reviewing each step) for up to eight consecutive hours without interruption. In a publicly documented demonstration, GLM-5.1 built a complete Linux desktop environment autonomously across 655 self-directed iterations, planning and correcting its own work throughout.
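Z.ai has not published the agent's internals, so the sketch below is a generic, hypothetical rendering of such an experiment-analyze-optimize loop in Python. The four callables (propose_patch, apply_patch, run_tests, diagnose) are placeholders standing in for model calls and a test harness, not GLM-5.1's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    passed: bool
    failures: list                  # failing-test details fed back to the model

def autonomous_loop(task: str,
                    propose_patch: Callable,   # model call: (task, feedback) -> patch
                    apply_patch: Callable,     # writes the patch into the workspace
                    run_tests: Callable,       # executes the project's test suite
                    diagnose: Callable,        # model call: failures -> analysis
                    max_iterations: int = 655) -> str:
    feedback = None
    for _ in range(max_iterations):
        patch = propose_patch(task, feedback)  # experiment: draft a change
        apply_patch(patch)
        result: TestResult = run_tests()       # experiment: run it
        if result.passed:
            return patch                       # objective met, stop early
        feedback = diagnose(result.failures)   # analyze, then loop to optimize
    raise RuntimeError("iteration budget exhausted before tests passed")
```

The key property the benchmarks above measure is not the loop structure itself but how many iterations the model can run before losing the thread of the original objective.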
In quantitative evaluations, the optimization loops produced substantial gains:
- 3.6x geometric mean speedup in machine learning workloads (KernelBench Level 3)
- 6.9x throughput improvement in vector database queries (databases optimized for AI-style similarity searches)
- ~1,700 autonomous steps sustained on long-horizon coding projects
For development teams exploring AI automation pipelines, this opens a different workflow: hand GLM-5.1 a complex multi-file refactoring task or performance optimization problem at end-of-day, and return the next morning to output the model has already tested and iterated on, ready for human review. Teams pursuing vibe coding approaches, where developers describe intent and let AI handle implementation, will find this autonomous capability especially valuable. No prompt-and-reply back-and-forth required.
GLM-5.1 API Pricing and Enterprise Hardware Requirements
The practical ceiling is real. Running the full GLM-5.1 model locally requires approximately 1.49 terabytes of storage and enterprise-grade GPU clusters — not a workstation deployment, and certainly not a laptop. For most individual developers and smaller teams, cloud API access is the realistic path. Z.ai currently prices API usage at $1.00 per million input tokens and $3.20 per million output tokens, with cached tokens (previously processed content held in memory so it does not need to be reprocessed) at $0.26 per million — competitive with mid-tier pricing from OpenAI and Anthropic.
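To make those rates concrete, here is a quick cost calculation in Python. The prices come straight from the paragraph above; the token counts in the example call are invented for illustration:

```python
# Per-request cost at Z.ai's published GLM-5.1 API rates (USD per 1M tokens).
INPUT_PER_M = 1.00    # fresh input tokens
OUTPUT_PER_M = 3.20   # output tokens
CACHED_PER_M = 0.26   # previously processed (cached) input tokens

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    fresh = input_tokens - cached_tokens
    return (fresh * INPUT_PER_M
            + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Hypothetical example: a 150k-token codebase prompt, 100k of it cached,
# producing a 20k-token patch.
print(f"${request_cost(150_000, 20_000, cached_tokens=100_000):.2f}")  # $0.14
```

At those numbers, even a large context-heavy agentic session costs cents per call, which is where the caching discount matters most for long autonomous runs.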
The weights — released as the actual trained model files that any organization can download — are available on HuggingFace in both standard and FP8 quantized (a compressed format that reduces memory requirements with minimal quality loss) versions. The MIT license allows commercial use, fine-tuning, redistribution, and integration into proprietary products without licensing fees. Deployment-ready support covers SGLang, vLLM, xLLM, Transformers, and KTransformers (all widely used open-source inference infrastructure tools).
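Since vLLM is among the supported runtimes, a minimal offline-inference sketch might look like the following. The repo id zai-org/GLM-5.1-FP8 and the tensor_parallel_size value are assumptions for illustration; check the actual HuggingFace model card, and remember the multi-GPU hardware requirements described above still apply:

```python
# Minimal vLLM offline-inference sketch for GLM-5.1. The repo id below is
# an assumed placeholder; substitute the id from the real model card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1-FP8",  # FP8 quantized variant to reduce memory
    tensor_parallel_size=8,       # assumption: shard across an 8-GPU node
)
params = SamplingParams(temperature=0.2, max_tokens=4096)
outputs = llm.generate(
    ["Locate and fix the off-by-one error in the pagination module: ..."],
    params,
)
print(outputs[0].outputs[0].text)
```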
For teams currently evaluating AI coding tools — including Claude Code users benchmarking open-source alternatives — GLM-5.1 belongs in the comparison shortlist alongside Claude Opus 4.6 and GPT-5.4. Its SWE-Bench Pro lead over GPT-5.4 is narrow — 0.7 points — but it is a genuine #1 on that leaderboard, not statistical noise. Access the API at Z.ai's developer platform or download the weights from HuggingFace directly. Read our AI tools evaluation guide to learn how to benchmark models against your own codebase before committing to a switch.