AI for Automation
2026-03-17 | AI Benchmark, Enterprise AI, AI Agents, Claude, GPT, Productivity

We Gave AI Real Office Tasks — Only 37% Were Completed Successfully in a 1,150-Task Test

A ServiceNow research team tested AI on 1,150 real enterprise tasks including email, HR, and IT support. Even the best AI (Claude Opus 4.5) succeeded only 37.4% of the time. The bottleneck wasn't tool usage — it was a lack of strategic planning ability.


Summary: How well can an AI assistant handle real office work like sorting emails, responding to customer inquiries, and managing HR tasks? ServiceNow Research and Mila, Montreal's AI institute, tested 1,150 enterprise tasks across 8 domains — and even the best AI available, Claude Opus 4.5, succeeded only 37.4% of the time. This study delivers a sobering reality check on the expectation that "AI will soon replace office workers."

From Email to HR — We Gave AI Real Office Work

EnterpriseOps-Gym is not a simple chatbot test. It's a virtual office that faithfully replicates the tasks that happen every day in a real company. It contains 164 database tables, 512 workflow tools, and 1,150 expert-crafted tasks.

EnterpriseOps-Gym benchmark architecture — enterprise task simulation environment and AI agent evaluation pipeline

The test covers 8 domains:

Email — Sorting, forwarding, replying
Calendar — Scheduling meetings, rescheduling
Team Collaboration — Chat management, messaging
File Management — Organizing documents, sharing settings
Customer Service (CSM) — Handling inquiries, knowledge linking
Human Resources (HR) — Employee records, policy checks
IT Support (ITSM) — Incident tickets, system configuration
Cross-Department — Complex tasks spanning multiple teams

Each task requires an average of 9.15 steps to complete. Complex HR tasks need up to 34 steps, and each task must pass an average of 5.3 verification conditions to be counted as a success.
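To make the scoring rule concrete, here is a minimal sketch of all-or-nothing verification, assuming (as the description above suggests) that a task only counts when every one of its verification conditions passes. The state fields and checks are made up for illustration and are not taken from the benchmark's code.

```python
from typing import Callable, Dict, List

# Hypothetical types: a "state" is a snapshot of the simulated office after
# the agent finishes; a "check" inspects it and returns True or False.
State = Dict[str, object]
Check = Callable[[State], bool]


def task_succeeded(final_state: State, checks: List[Check]) -> bool:
    """A task counts as a success only if every verification check passes."""
    return all(check(final_state) for check in checks)


def success_rate(results: List[bool]) -> float:
    """Fraction of tasks that passed all of their checks."""
    return sum(results) / len(results) if results else 0.0


# Illustrative IT-support task with three made-up verification conditions.
checks: List[Check] = [
    lambda s: s.get("ticket_status") == "resolved",
    lambda s: s.get("assignee") == "it-helpdesk",
    lambda s: s.get("customer_notified") is True,
]
state: State = {
    "ticket_status": "resolved",
    "assignee": "it-helpdesk",
    "customer_notified": False,
}

# Two of three checks pass, but the task still counts as a failure.
print(task_succeeded(state, checks))
```

Under this kind of scoring there is no partial credit, which helps explain why success rates fall as tasks pick up more steps and more conditions.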

Even the Best AI Fails 6 Out of 10 Times

The research team tested 16 major AI models. Here are the results for the top performers:

AI Model Success Rates (Overall Average)

| AI Model | Success Rate | Cost per Task |
|---|---|---|
| Claude Opus 4.5 (Anthropic) | 37.4% 🥇 | ~$0.34 |
| Gemini 3 Flash (Google) | 31.9% | ~$0.03 |
| GPT-5.2 High (OpenAI) | 31.8% | |
| Claude Sonnet 4.5 | 30.9% | |
| GPT-5 (OpenAI) | 29.8% | |
| DeepSeek V3.2 (Top Open Source) | 24.5% | ~$0.01 |

Even the top performer, Claude Opus 4.5, succeeded fewer than 4 out of 10 times. It reached around 50% on relatively simple tasks like email and file management, but dropped sharply on complex work requiring policy compliance, such as IT support (23.8%) and cross-department tasks (30.7%).

AI's Real Weakness: Planning, Not Tools

The most surprising finding was this: when researchers gave the AI a human-written task plan in advance, success rates jumped by 14–35 percentage points. Meanwhile, padding the 512-tool set with extra distractor tools reduced success by only about 1%.

In other words, AI is good at choosing which tools to use, but struggles to strategically plan what to do and in what order. It's like a new employee who's great with tools but doesn't understand the workflow.

Even more striking was the small AI + good plan combination. When the lightweight 4B-parameter model Qwen3-4B was given a human-crafted plan, it matched or outperformed much larger models. In this benchmark, a good workflow manual did more for success rates than switching to a more expensive AI.
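As a rough illustration of what "giving the AI a plan" can look like in practice, the sketch below prepends a human-written, numbered plan to the task description. The `build_prompt` helper, the task wording, and the plan steps are all hypothetical; the article does not describe the exact prompt format the researchers used.

```python
from typing import List, Optional


def build_prompt(task: str, plan: Optional[List[str]] = None) -> str:
    """Assemble the text handed to an agent, optionally with a human-written plan."""
    parts = [f"Task: {task}"]
    if plan:
        parts.append("Follow this plan step by step:")
        parts.extend(f"{i}. {step}" for i, step in enumerate(plan, start=1))
    return "\n".join(parts)


# Made-up HR example: the plan spells out what to do and in what order.
plan = [
    "Look up the employee record by ID",
    "Check the leave policy for the employee's region",
    "Update the leave balance",
    "Send a confirmation email to the employee",
]
prompt = build_prompt("Approve the pending leave request for employee E001", plan)
print(prompt)
```

The model still picks and calls the tools itself; the plan only removes the need to work out the order of operations.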

When Given Impossible Tasks, AI Just Tries Anyway

The researchers deliberately included 30 impossible tasks — for example, editing records for a nonexistent employee or accessing a system without authorization. A proper AI assistant should refuse, saying "This cannot be done."

The results were alarming. Even the model that was best at refusing (GPT-5.2 Low) rejected only 53.9% of the impossible tasks, barely better than a coin flip. Claude Opus 4.5 scored exactly 50%. The rest of the time, the AI forced its way through the impossible task and corrupted data in the process. In a real work environment, this could lead to incorrect customer records or unauthorized system changes.
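One way to picture the behavior the researchers were hoping for is a feasibility check that runs before any data is touched. The sketch below is a simplified illustration with hypothetical names (`EditRequest`, `apply_edit`, `Refusal`); it is not how the benchmark or any of the tested models is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Set


class Refusal(Exception):
    """Raised when a request cannot or must not be carried out."""


@dataclass
class EditRequest:
    actor: str        # who is asking for the change
    employee_id: str  # whose record should change
    field: str
    value: str


def apply_edit(req: EditRequest,
               records: Dict[str, Dict[str, str]],
               authorized: Set[str]) -> None:
    """Check feasibility first; write to the data only if every check passes."""
    if req.actor not in authorized:
        raise Refusal(f"{req.actor} is not authorized to edit HR records")
    if req.employee_id not in records:
        raise Refusal(f"employee {req.employee_id} does not exist")
    records[req.employee_id][req.field] = req.value


records = {"E001": {"title": "Analyst"}}
authorized = {"hr-admin"}

try:
    # Employee E999 does not exist, so the request is refused.
    apply_edit(EditRequest("hr-admin", "E999", "title", "Manager"), records, authorized)
except Refusal as err:
    print(f"Refused: {err}")
```

Because both checks run before the write, a refused request leaves the records exactly as they were.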

AI Success Rates Plummet as Tasks Get Longer

On simple 4-step tasks, AI succeeded about 35% of the time, but on complex tasks with 16+ steps, the rate dropped below 20%. As steps increase, small early mistakes snowball. The decline was even steeper for open-source models.
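A toy calculation shows why long tasks are so unforgiving. Assume, purely for illustration, that every step succeeds independently with the same probability; the 90% figure below is a made-up number, and this is not the researchers' analysis or a fit to their results. The step counts 9 and 34 echo the average and maximum task lengths mentioned earlier.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    return per_step ** steps


# Hypothetical per-step reliability of 90%.
for steps in (4, 9, 16, 34):
    print(f"{steps:>2} steps: {chain_success(0.9, steps):.1%}")
```

Even with 90% reliability per step, the chance of getting through 16 steps without a single mistake falls below 20% in this simplified model. Real agents can sometimes recover from errors, so actual results need not follow this curve.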

Teaming Up Multiple AIs Doesn't Solve the Problem

"If one AI can't do it, why not use several?" The researchers tested this too. They tried splitting roles between a planning AI and an execution AI, or breaking tasks into subtasks distributed across multiple AIs. While slightly better than the baseline, these approaches fell far short of human-crafted plans. In some cases, splitting tasks actually hurt performance because it broke the contextual thread.

Best Value for Money: Google's Gemini 3 Flash

In terms of cost-effectiveness, Gemini 3 Flash was the most efficient at ~$0.03 per task with a 31.9% success rate. Claude Opus 4.5 had a higher success rate but cost ~$0.34 per task, roughly 11 times as much. Among open-source models, DeepSeek V3.2 was the most economical at ~$0.01 per task with 24.5% success.
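Another way to compare these numbers is cost per successful task: the price of one attempt divided by the chance that the attempt succeeds. The sketch below uses the approximate figures reported above and assumes that failed attempts cost the same as successful ones and cause no extra cleanup cost, which is a simplification.

```python
# (success rate, approximate cost per task in USD), as reported in the article
models = {
    "Claude Opus 4.5": (0.374, 0.34),
    "Gemini 3 Flash": (0.319, 0.03),
    "DeepSeek V3.2": (0.245, 0.01),
}


def cost_per_success(success_rate: float, cost_per_task: float) -> float:
    """Expected spend to get one successful task, ignoring the cost of cleaning up failures."""
    return cost_per_task / success_rate


for name, (rate, cost) in models.items():
    print(f"{name}: ~${cost_per_success(rate, cost):.2f} per successful task")
```

On this measure the cheaper models stay well ahead, although the calculation leaves out what a failed or wrongly completed task costs a business.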

For Those Looking to Deploy AI Assistants at Work

The core message of this research is clear:

1. Before delegating work to AI, create a manual first. AI handles tools well but can't plan on its own. Providing step-by-step procedures dramatically improves success rates.

2. Good prompts matter more than expensive AI. A small, free AI paired with well-crafted instructions can outperform premium models.

3. Don't blindly trust AI's "Done!" message. AI attempts impossible tasks half the time without flagging them. Always have a human verify results on important work.

4. Start with simple tasks like email and file management. AI achieves around 50% success in these areas, but complex tasks involving regulations and policies are still too risky to delegate.

The research team has open-sourced the paper and the full benchmark. If you'd like to explore it yourself, check out the code and data on the GitHub repository.
