Mistral 128B AI Coding Agent Beats 397B Rival on SWE-Bench
Mistral Medium 3.5 scores 77.6% on SWE-Bench, outperforming a 3× larger rival. Cloud AI agents now complete GitHub tasks and file PRs automatically.
Mistral AI just answered one of software development's most persistent frustrations: having to sit and watch an AI agent work through your codebase. Mistral Medium 3.5 — a new 128B parameter model (128 billion trainable weights) built for AI automation coding workflows — scored 77.6% on SWE-Bench Verified, the benchmark that tests AI on real GitHub bug reports. That beats Qwen3.5 397B A17B, a model nearly three times its size, and the release ships alongside remote cloud agents for Vibe — Mistral's vibe coding platform — that complete multi-step coding tasks and automatically open pull requests (proposed code changes queued for human review) while developers focus elsewhere.
That last point is the bigger story. AI coding assistants have mostly required developers to stay present — approving each step, handling errors, watching the terminal. Mistral's remote agents change the equation: kick off a task from your CLI or browser, and the cloud handles execution until the PR is ready for human review.
77.6% on SWE-Bench: Mistral's AI Coding Agent Beats Bigger Rivals
SWE-Bench Verified (Software Engineering Benchmark — a standardized test using real-world GitHub issues from production open-source projects) measures whether an AI can understand existing codebases, locate the root cause of bugs, write working fixes, and pass automated tests. It's widely considered the most credible proxy for practical engineering ability because it uses real issues rather than synthetic puzzles.
Mistral Medium 3.5's 77.6% score beats two notable competitors:
- Devstral 2 — Mistral's own previous coding-specialized model
- Qwen3.5 397B A17B — Alibaba's latest flagship, with 397 billion total parameters through a Mixture-of-Experts architecture (MoE — a design where only a fraction of the total parameters activate per request, making large models feasible on available hardware). Mistral's model uses roughly one-third the parameters and still wins on this benchmark.
The model also scores 91.4 on τ³-Telecom — a domain-specific benchmark for telecommunications tasks, relevant for teams in networking and infrastructure. Its context window (total text processable per session without forgetting earlier content) sits at 256,000 tokens, equivalent to approximately 200,000 words — roughly two full-length novels, or an entire large codebase loaded at once without truncation.
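The 200,000-word figure follows from a common rule of thumb — roughly 0.75 English words per token — which is a heuristic for typical prose, not a Mistral-published conversion:

```python
# Rough capacity estimate for a 256K-token context window.
# 0.75 words per token is a common English-text heuristic, not an exact figure.
CONTEXT_TOKENS = 256_000
WORDS_PER_TOKEN = 0.75   # heuristic for typical English prose
NOVEL_WORDS = 95_000     # a typical full-length novel, also an assumption

approx_words = int(CONTEXT_TOKENS * WORDS_PER_TOKEN)
novels = approx_words / NOVEL_WORDS

print(approx_words)       # 192000
print(round(novels, 1))   # 2.0
```

Code tokenizes less efficiently than prose, so the "entire large codebase" framing depends heavily on the repository in question.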
Vibe Remote Agents: AI Automation That Runs While You Sleep
Earlier versions of Vibe ran coding agents locally — on the developer's own machine, tying up compute and requiring someone to stay present to handle errors and approve steps. Remote agents shift execution entirely to the cloud. The practical difference:
- Agents run in isolated cloud sandboxes (sealed temporary environments with their own filesystem, tools, and network access — preventing session interference and protecting your local machine)
- Execution is fully asynchronous — start a task, close your laptop, return hours later to a finished pull request
- Local CLI sessions can be "teleported" to the cloud mid-task without losing any conversation history or file state — the session continues seamlessly where it left off
- Completed tasks automatically open pull requests on GitHub, ready for human code review
- Multiple agents run in parallel, eliminating your developer machine as the bottleneck for concurrent workstreams
Vibe agents integrate natively with GitHub, Linear (a project management tool popular with engineering teams), Jira (Atlassian's issue tracking platform), Sentry (error monitoring software that captures production bugs with full stack traces), Slack, and Microsoft Teams. An agent resolving a Jira ticket can read the issue context, navigate the relevant codebase, write the fix, run tests, and open a PR — all linked back to the original ticket — without a human touching the keyboard.
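The ticket-to-PR loop described above can be pictured as a fixed sequence of stages. The sketch below is illustrative only — every class and function name is hypothetical, not Mistral's actual agent API:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the ticket-to-PR loop described in the article.
# All names here are hypothetical, not Mistral's API.

@dataclass
class Ticket:
    key: str
    summary: str

@dataclass
class AgentRun:
    ticket: Ticket
    log: list = field(default_factory=list)

    def step(self, name: str) -> None:
        self.log.append(name)

def resolve_ticket(ticket: Ticket) -> AgentRun:
    """Walk the five stages an agent performs, in order."""
    run = AgentRun(ticket)
    run.step("read_issue_context")    # pull description from the tracker
    run.step("locate_relevant_code")  # navigate the codebase
    run.step("write_fix")             # edit files in the cloud sandbox
    run.step("run_tests")             # verify the change
    run.step("open_pull_request")     # PR links back to the original ticket
    return run

run = resolve_ticket(Ticket("PROJ-123", "Fix null check in parser"))
print(run.log[-1])  # open_pull_request
```

The point of the sketch is the ordering: the PR is the terminal artifact, and everything before it happens without a human in the loop.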
As MarkTechPost's analysis describes it: "Think of it as a junior developer that never gets tired and can operate across your codebase." Agents are triggered from the Mistral Vibe CLI (command-line interface — a terminal-based tool for interacting with Mistral services) or from Le Chat, Mistral's web interface. The orchestration layer runs on Mistral Studio, which coordinates multi-step task sequences and tool selection behind the scenes.
Le Chat Work Mode: AI Automation Across Multiple Apps
Alongside the engineering-focused Vibe update, Le Chat gained Work mode — an operating state that enables multi-step agentic task execution across business tools simultaneously. Unlike standard chat assistants where connecting to tools requires manual setup per conversation, Work mode has all configured connectors active by default.
A single instruction can now span email, calendar, documents, Jira issues, and Slack messages at once — with the agent reasoning about the right sequence of actions across all of them. Every tool call and decision point is displayed in real time, giving full visibility into what the agent is doing and why.
Sensitive actions require explicit user approval. Before the agent sends a message, creates a document, or modifies a data record, it surfaces the pending action for confirmation. This permission model (a system that grades actions by risk level and gates higher-risk operations behind human sign-off) prevents silent, irreversible changes — the most common concern with fully autonomous agents operating across production systems.
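A risk-graded gate like this can be sketched in a few lines. The action tiers and the approval hook below are assumptions for illustration, not Mistral's published permission model:

```python
# Minimal sketch of a risk-graded permission gate.
# The tiers and the approval callback are illustrative assumptions.

READ_ONLY = {"search_email", "read_calendar", "read_document"}
WRITE = {"send_message", "create_document", "modify_record"}

def execute(action: str, approve) -> str:
    """Run read-only actions directly; gate writes behind human sign-off."""
    if action in READ_ONLY:
        return "executed"
    if action in WRITE:
        # Surface the pending action; proceed only on explicit confirmation.
        return "executed" if approve(action) else "blocked"
    raise ValueError(f"unknown action: {action}")

print(execute("read_calendar", approve=lambda a: False))  # executed
print(execute("send_message", approve=lambda a: False))   # blocked
```

The key property is that a denied approval produces a blocked action, never a silent write — exactly the failure mode the article says this model is designed to prevent.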
Open Weights and Per-Request Reasoning Control
Mistral Medium 3.5's weights are publicly available on Hugging Face — teams can download, self-host, and fine-tune the model independently of Mistral's cloud. This matters for organizations with data residency requirements or those that need domain-specific customization. It also means the model can be evaluated against your own codebase before committing to any cloud integration.
Two architectural decisions stand out from a practical standpoint:
Vision encoder trained from scratch. The vision component (the model layer that converts images into data the language model can reason about) was built without relying on CLIP-based pre-trained encoders (CLIP — Contrastive Language-Image Pretraining, the image-text alignment architecture used as a vision backbone in most multimodal models). Training from scratch means the model handles variable image sizes and aspect ratios natively, making it more reliable for screenshots, architecture diagrams, and design documents that don't conform to standard dimensions.
Configurable reasoning effort per request. For the first time in Mistral's lineup, reasoning depth is adjustable at the API call level (API — Application Programming Interface, the connection layer through which software sends instructions to the model). A quick chat reply doesn't need the same compute budget as debugging a complex race condition. Teams managing inference at scale can now match compute to task complexity without maintaining separate model deployments for different use cases — a meaningful cost lever for high-volume applications.
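The cost lever can be pictured as a per-request routing table. The effort levels and token budgets below are invented for illustration — they are not Mistral's documented API values:

```python
# Illustrative per-request effort routing.
# Level names and budgets are assumptions, not Mistral API parameters.

EFFORT_BUDGET = {
    "low": 512,        # quick chat replies
    "medium": 4_096,   # routine code edits
    "high": 32_768,    # e.g. debugging a complex race condition
}

def pick_effort(task: str) -> str:
    """Map a task category to a reasoning-effort level."""
    if task in ("chat", "summarize"):
        return "low"
    if task in ("code_edit", "refactor"):
        return "medium"
    return "high"

for task in ("chat", "refactor", "debug_race"):
    level = pick_effort(task)
    print(task, level, EFFORT_BUDGET[level])
```

One deployment serving all three tiers is the cost win: without per-request control, matching compute to task complexity would require separate model endpoints.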
Getting Started with AI Automation: Entry Points for Dev Teams
For teams already using GitHub, Jira, or Linear as their primary workflow, Vibe's remote agents offer the closest thing yet to a fully delegatable engineering task layer. A 77.6% SWE-Bench score means the model resolves nearly four out of five of the benchmark's real GitHub issues — not a replacement for senior engineers, but genuinely useful for triage, first-pass fixes, and dependency updates that humans review before merging.
- Start with model evaluation: Download the weights from Hugging Face and run Mistral Medium 3.5 against your own issue backlog before enabling any cloud integrations
- Test Work mode incrementally: Connect one read-only integration first (Slack read access is low-risk), observe the approval flow, and expand permissions as confidence builds
- Set approval gates before parallel runs: Remote agents that open bad PRs are easy to close; agents with unchecked write access to production are not — configure permission levels before scaling to parallel sessions
New to agentic workflows? Our AI automation setup guide walks through configuring your first cloud agent integration from scratch. For deeper frameworks, explore our AI automation guides on integrating agentic tools into existing engineering workflows without disrupting what already works. The open weights are free, Le Chat's Work mode is live now, and Vibe's remote agent infrastructure is available via the Mistral platform — the lowest-friction entry point is a single Hugging Face download and one test task in an isolated branch.