Cursor Composer 2 Scores 61 — Just Trained on Your Code
Cursor's Composer 2 hits 61.3 on CursorBench, runs 4x faster, and is trained on real developer sessions — your committed code shapes the next model.
Cursor Does Not Just Release a Model — It Trains on What You Wrote Last Month
Most AI companies release a new model and declare it better. Cursor does something different: it builds its own benchmark, trains on real developer sessions, and then shows you exactly how much better the model got. This feedback loop is what separates Cursor's approach from nearly every other AI coding tool on the market — and the latest results from Composer 2 and the CursorBench evaluation framework make for a compelling case study in how to build AI products that actually improve at the work developers do every day.
As of March 2026, Cursor is a dominant force in AI-assisted software development. The company — Anysphere — crossed $2 billion in ARR (Annual Recurring Revenue — the total yearly revenue a company would earn if all current subscriptions continued at their current rate) in February 2026, having gone from $500 million ARR in June 2025 to $1 billion in November 2025 to $2 billion just three months later. It now serves more than 1 million daily active users and is used by half of the Fortune 500. In November 2025, the company raised a Series D round of $2.3 billion at a valuation of $29.3 billion. These numbers are not incidental context — they explain why Cursor has the resources to build proprietary training infrastructure at scale.
What CursorBench Actually Measures
Most AI benchmarks measure performance on synthetic tasks — curated problems that may or may not reflect what engineers do at work. CursorBench takes a fundamentally different approach. It is built on real developer sessions captured through a system called Cursor Blame — a feedback mechanism that traces committed code back to the original AI request that generated it.
Here is how it works: when a developer uses Cursor and then commits code to their repository, Cursor Blame links the committed code to the conversation that produced it. Over time, this creates a large dataset of (prompt, output, acceptance) tuples — real examples of what developers actually asked for, what the AI produced, and whether that output was good enough to commit. CursorBench then uses this data to construct evaluation tasks that mirror real workloads.
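Cursor has not published how Blame is implemented, but a minimal sketch of the idea, with entirely hypothetical names and a naive exact-match rule, might look like this:

```python
# Hypothetical sketch of a Cursor-Blame-style linkage. Cursor has not
# published its implementation; the names, fields, and exact-match rule
# here are illustrative assumptions.
from dataclasses import dataclass
import hashlib


@dataclass
class SessionRecord:
    prompt: str     # what the developer asked for
    output: str     # the code the model produced
    accepted: bool  # did matching code end up in a commit?


def fingerprint(code: str) -> str:
    """Normalize whitespace and hash a snippet so it can be matched later."""
    normalized = "\n".join(line.strip() for line in code.splitlines() if line.strip())
    return hashlib.sha256(normalized.encode()).hexdigest()


def build_dataset(conversations, committed_hunks):
    """Join (prompt, output) pairs from the editor to hunks found in later commits."""
    committed = {fingerprint(hunk) for hunk in committed_hunks}
    return [
        SessionRecord(prompt, output, accepted=fingerprint(output) in committed)
        for prompt, output in conversations
    ]
```

In practice the matching would have to be fuzzier, since developers usually edit suggestions before committing them, but the shape of the result (real prompts paired with an acceptance signal) is what CursorBench draws its tasks from.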
The score progression tells a clear story: the original Composer scored 38.0 on CursorBench. Composer 1.5 improved to 44.2. And Composer 2 now scores 61.3 — a 61% improvement over the original model, achieved in a relatively short development cycle. On Terminal-Bench 2.0 — a separate evaluation for terminal and command-line tasks — Composer 2 scores 61.7, compared to Composer 1.5's 47.9. On SWE-bench Multilingual — a benchmark that measures the ability to fix real-world bugs submitted to open-source GitHub repositories — Composer 2 scores 73.7 vs 65.9 for the previous version.
Speed has also improved dramatically. Composer 2 runs at approximately 250 tokens per second — making it roughly 4x faster than Composer 1.5. Most tasks complete in under 30 seconds. For a developer waiting on an AI to scaffold a component or refactor a function, the difference between 30 seconds and 2 minutes is the difference between staying in flow and losing context entirely.
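To make the throughput claim concrete, here is a back-of-the-envelope calculation. The 5,000-token response size is an illustrative assumption, not a published Cursor figure, and the slower rate is simply derived from the 4x comparison above:

```python
# Back-of-the-envelope latency from throughput. The 5,000-token response
# size is an illustrative assumption, not a published Cursor number; the
# slower rate is derived from the 4x comparison above.
RESPONSE_TOKENS = 5_000

for model, tokens_per_second in [("Composer 2", 250), ("Composer 1.5", 62.5)]:
    seconds = RESPONSE_TOKENS / tokens_per_second
    print(f"{model}: ~{seconds:.0f} seconds for {RESPONSE_TOKENS} tokens")

# Composer 2: ~20 seconds for 5000 tokens
# Composer 1.5: ~80 seconds for 5000 tokens
```

On that assumed task size, the new model finishes comfortably under the 30-second mark while the previous one keeps the developer waiting well over a minute.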
Bugbot: Eleven Versions in Six Months
Composer 2 is the flagship, but it is not the only model Cursor has been iterating on rapidly. Bugbot, Cursor's automated code review system that proactively finds bugs in pull requests, shipped 11 versions between July 2025 and January 2026. Its resolution rate (the percentage of flagged bugs that developers agree are real and go on to fix) climbed from 52% to over 70% during that period.
This kind of iteration velocity is only possible when you have a tight feedback loop. Bugbot knows which of its suggestions developers accepted, which they dismissed, and which led to follow-up conversations. Each of those signals feeds back into the training process via asynchronous reinforcement learning (a training method where the model learns from delayed feedback — such as whether code was committed or rejected — rather than immediate supervision) running across thousands of Nvidia GPUs.
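Cursor describes this setup only at a high level. As a rough illustration of what "learning from delayed feedback" means in code, the toy sketch below maps editor signals to rewards and consumes them in a training loop that runs separately from serving; every name, signal, and reward value here is a made-up placeholder, not Cursor's pipeline:

```python
# Toy illustration of asynchronous reinforcement learning from delayed
# feedback. Cursor's real pipeline is not public: the signal names,
# reward values, and queue-based trainer below are illustrative only.
import queue
import threading

# Delayed outcomes observed long after the model acted, mapped to rewards.
REWARDS = {
    "committed": 1.0,   # suggestion survived into a git commit
    "accepted": 0.5,    # applied in the editor, not (yet) committed
    "dismissed": -0.5,  # developer rejected the suggestion
}

feedback_queue: queue.Queue = queue.Queue()


def record_feedback(prompt: str, output: str, signal: str) -> None:
    """Called whenever a delayed signal (commit, accept, dismiss) arrives."""
    feedback_queue.put((prompt, output, signal))


def training_worker() -> None:
    """Runs independently of model serving: turns queued feedback into updates."""
    while True:
        prompt, output, signal = feedback_queue.get()
        reward = REWARDS.get(signal, 0.0)
        # A real system would take a gradient step here, e.g.:
        # policy.update(prompt, output, reward)
        feedback_queue.task_done()


threading.Thread(target=training_worker, daemon=True).start()
```

The point of the asynchrony is that serving never waits on training: feedback accumulates whenever it happens to arrive, and the trainer consumes it on its own schedule.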
The pattern is consistent across Cursor's product line: build an evaluation that measures real developer behavior, deploy to production, collect feedback at scale, train on that feedback, and publish the results publicly. This is not just good engineering — it is a strategic moat. The more developers use Cursor, the more training data Cursor collects, and the better the models become for those exact workflows.
What Enterprise Customers Are Actually Seeing
Benchmarks are useful for comparing models, but the most compelling evidence comes from large-scale deployments. NVIDIA has integrated Cursor across its engineering organization and, by its own account, its 30,000 developers now produce three times more code. Salesforce reports a 30%+ increase in development velocity after adopting Cursor at scale.
These are the numbers that explain a $29.3 billion valuation. Enterprise software buyers are not paying for benchmark scores — they are paying for measurable improvements in the speed and quality of their engineering output. When a company with 30,000 developers produces three times as much code, the ROI calculation on a developer tool subscription becomes straightforward.
The broader implication for the AI developer tools market is significant. Cursor is demonstrating that a purpose-built model trained specifically on coding tasks — and more specifically on the exact coding tasks real developers perform — can outperform general-purpose foundation models on those same tasks. This is the same insight that drove the success of specialized models in domains like medical imaging and legal document analysis, now applied to software engineering.
The Flywheel and What It Means for You
If you are a Cursor user, the most important thing to understand is that your usage is not just producing output for you — it is generating training signal for the next version of the model. The code you commit, the suggestions you accept or reject, the prompts you write and refine — all of this flows back into CursorBench and ultimately into the training data for Composer 3.
This creates a genuine network effect, one that is different from the typical "more users = more revenue" flywheel. More users means better training data, which means better models, which means more useful output, which in turn attracts more users who commit more code. Each cycle narrows the gap between what the AI can do and what professional engineers do every day.
For developers evaluating AI code editors, the question is no longer just "which model scores higher on SWE-bench?" It is: "which product has the tightest feedback loop between user behavior and model improvement?" Based on the public evidence, Cursor's answer to that question is currently the most developed in the industry.
Sources: Cursor Blog — CursorBench | Cursor Blog — Composer 2 | Cursor Blog — Building Bugbot