AI for Automation
2026-04-22 | synthetic data generation | AI training data | Google AI research | machine learning | LLM fine-tuning | AI automation | synthetic dataset | Qwen 3.6

Google Simula Generates 512K Synthetic AI Training Examples

No training data? No problem. Google's Simula generates 512K synthetic AI training examples from scratch — and closed 83% of a cybersecurity accuracy gap.


Google and EPFL researchers just solved one of AI's most stubborn problems in synthetic data generation: where do you get training data when none exists publicly? Their answer is Simula — a framework that generates up to 512,000 specialized AI training examples entirely from scratch, with no seed data, no hand-crafted prompts, and no evolutionary algorithms required. For teams building AI automation workflows in cybersecurity, legal reasoning, or healthcare, this changes the math on what's possible.

The Synthetic Data Wall Blocking Specialized AI Training

ChatGPT and Gemini trained on the public internet — billions of web pages, books, and forums. That works for general conversation. But when you need an AI to map cybersecurity vulnerabilities, reason through Swiss federal law, or analyze medical records, that training data simply doesn't exist publicly. What does exist sits behind privacy walls, non-disclosure agreements, or government classification.

The conventional workaround — synthetic data (artificially generated examples designed to mimic real training data) — has its own failures. Most existing approaches ask a large language model to "generate examples" and hope for variety. The result is mode collapse (a phenomenon where the AI keeps producing nearly identical outputs instead of genuinely varied ones), coverage gaps, and no reliable way to control difficulty. Standard LLM-generated synthetic data also lacks fine-grained quality guarantees and tends to reflect whatever distribution the base model already knows best.
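Mode collapse can be detected cheaply before any training run. A minimal sketch, using exact-match duplication as a floor estimate (real pipelines use embedding similarity; the sample strings below are invented):

```python
from collections import Counter

def duplication_rate(samples: list[str]) -> float:
    """Crude mode-collapse signal: the share of generations that are
    exact duplicates of an earlier one. Embedding-based near-duplicate
    detection will always report at least this much redundancy."""
    counts = Counter(samples)
    dupes = sum(c - 1 for c in counts.values())
    return dupes / len(samples)

# Hypothetical generations from a naive "generate examples" prompt:
gens = ["What is XSS?", "What is XSS?", "Define CSRF.", "What is XSS?"]
print(duplication_rate(gens))  # 0.5 -- half the batch is redundant
```

A rate near zero does not prove diversity (paraphrases slip through), but a high rate is a reliable early warning that the generator is collapsing.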

[Figure: Google Simula synthetic data generation framework overview — four-step controllable AI training pipeline]

Simula treats the problem as a mechanism design challenge (a game theory concept — instead of just prompting an AI and hoping for the best, you engineer the rules of the system so that the right data naturally emerges). The framework controls three independent axes simultaneously:

  • Quality — semantic and syntactic correctness, verified by a dual-critic AI system
  • Diversity — global domain coverage plus local variation to prevent repetitive clustering
  • Complexity — fine-grained control over how difficult each example is, from foundational to expert-level

The key finding: most existing synthetic data methods optimize for only one or two of these axes. Simula controls all three independently — and the research confirmed that combining global and local diversification is critical. Either approach alone produced significantly worse results across every tested domain.
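The global-plus-local combination can be pictured as a two-stage sampler: first pick a taxonomy node (global coverage), then an instruction strategy within it (local variation). Everything below is a hypothetical miniature; Simula derives its taxonomies and meta-prompts with an LLM rather than hand-writing them:

```python
import random

# Toy taxonomy and strategy pool -- invented for illustration only.
TAXONOMY = {
    "web": ["sql_injection", "xss"],
    "network": ["port_scanning", "mitm"],
}
STRATEGIES = ["multiple_choice", "open_ended", "scenario_based"]

def sample_spec(rng: random.Random) -> dict:
    """Pick a taxonomy leaf (global diversity), then an instruction
    strategy (local diversity) to parameterize one generation call."""
    domain = rng.choice(sorted(TAXONOMY))
    topic = rng.choice(TAXONOMY[domain])
    strategy = rng.choice(STRATEGIES)
    return {"domain": domain, "topic": topic, "strategy": strategy}

rng = random.Random(0)
specs = [sample_spec(rng) for _ in range(4)]
for s in specs:
    print(s)
```

Dropping either stage reproduces the failure the research observed: sampling only strategies clusters output in the model's favorite topics, and sampling only topics yields one phrasing per topic.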

Simula's Four-Step Generation Engine

The framework breaks generation into four distinct, controllable steps — each targeting a specific property of the final dataset:

  1. Hierarchical taxonomies for global diversity — builds a structured knowledge map of the target domain to ensure broad coverage. The research found that real-world reference datasets almost always covered less of their own domain than Simula-generated versions, even when standard embedding-based cosine distance metrics (scores measuring how "different" text samples look numerically) suggested otherwise.
  2. Meta-prompts for local diversity — generates varied instruction strategies within each taxonomy node, preventing clusters of near-identical examples from forming.
  3. Complexification — systematically increases example difficulty using Calibrated Complexity Scoring (an Elo-rating system — the same ranking method used in competitive chess — applied to training examples through pairwise comparisons to objectively rank difficulty level).
  4. Dual-critic quality verification — two AI critics independently check each example by asking both "is this answer correct?" and "is this answer incorrect?" — a dual-check designed to reduce sycophancy bias (the tendency of AI critics to agree with whatever they're reviewing rather than genuinely verifying accuracy).
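Step 3's Calibrated Complexity Scoring can be sketched with the standard Elo update: each pairwise "which example is harder?" judgment shifts both examples' ratings. The update formula below is textbook Elo; the judge verdicts are invented (in Simula, an LLM supplies them):

```python
def elo_update(r_a: float, r_b: float, a_harder: bool, k: float = 32.0):
    """One Elo update from a pairwise difficulty judgment.
    'a_harder' means the judge ranked example A above example B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_harder else 0.0
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# All examples start at the same rating; comparisons sort them out.
ratings = {"ex1": 1000.0, "ex2": 1000.0, "ex3": 1000.0}
verdicts = [("ex1", "ex2", True), ("ex1", "ex3", True), ("ex2", "ex3", False)]
for a, b, a_harder in verdicts:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_harder)
print(ratings)  # ex1 rated hardest, ex2 easiest
```

The payoff is an objective difficulty ordering: "High Complexity" and "Low Complexity" training sets can then be defined as rating bands rather than by eyeballing prompts.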

The test configuration used Gemini 2.5 Flash as the teacher model (the large, capable model that generates and labels training examples) and Gemma 3 4B as the student model (the smaller, specialized model being trained on those examples), with 10 rounds of LoRA fine-tuning (a technique that trains only a small fraction of a model's weights — dramatically reducing compute requirements while still adapting behavior) per configuration.
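The appeal of LoRA here is the parameter arithmetic: freeze the full weight matrix and train only a low-rank adapter. A back-of-the-envelope sketch (the layer dimensions are illustrative, not Gemma 3 4B's actual shapes):

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of parameters trained when a (d_out x d_in) weight is
    frozen and only a rank-r adapter is learned: A is (r x d_in),
    B is (d_out x r), so the adapter adds r * (d_in + d_out) params."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return adapter / full

# e.g. a 4096x4096 projection with rank-16 adapters:
frac = lora_trainable_fraction(4096, 4096, 16)
print(f"{frac:.2%}")  # 0.78%
```

Training well under 1% of each adapted layer's weights is what makes 10 rounds of fine-tuning per configuration affordable on a 4B-parameter student.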

The Numbers — and the Critical Warning Every Team Needs to Read

Simula was tested across five specialized domains; three benchmarked results stand out:

  • GSM8k (grade-school math reasoning): High Complexity training data produced a 10% accuracy gain over Low Complexity at 64,000 data items. The student model peaked at 75% accuracy while the teacher reached 88% — significant headroom remains for continued scaling.
  • CTI-RCM (cybersecurity vulnerability mapping): Bridged 83% of the accuracy gap between baseline performance (40%) and teacher model ceiling (70%). Performance saturated at 128,000 data points — adding more examples beyond that threshold produced no measurable improvement.
  • LEXam (Swiss, EU, and international law reasoning): Critic rejection rate hit 61% — 61 out of every 100 generated legal examples were discarded as incorrect. By contrast, CTI-MCQ (cybersecurity multiple choice) had only a 2% rejection rate. The gap directly mirrors teacher model accuracy: just 57% on legal reasoning versus far higher on structured security tasks.

[Figure: Google Simula benchmark accuracy results — AI training performance across math, cybersecurity, and legal reasoning domains]

The most important finding buried in these results: complexity only helps when the teacher model is strong enough to generate reliable labels for harder examples. In the LEXam case, increasing difficulty actively hurt performance — the teacher model was producing harder examples with confident but wrong labels, and the student learned from those mistakes. Any team using Simula must audit teacher model accuracy on their specific domain before ramping up complexity. Without that check, you risk training on systematically harder versions of wrong answers — a failure mode that's expensive to diagnose after the fact.
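That audit can be a one-screen script: run the teacher on a small human-verified holdout and gate complexity ramp-up on its accuracy. A minimal sketch (the answer strings and the 80% threshold are invented; pick a threshold appropriate to your domain):

```python
def audit_teacher(teacher_answers: list[str],
                  gold_answers: list[str],
                  threshold: float = 0.8) -> tuple[float, bool]:
    """Compare teacher labels against a human-verified holdout before
    scaling up complexity. Below the threshold, harder synthetic
    examples are likely to carry confidently wrong labels."""
    if len(teacher_answers) != len(gold_answers):
        raise ValueError("holdout and teacher outputs must align")
    correct = sum(t == g for t, g in zip(teacher_answers, gold_answers))
    accuracy = correct / len(gold_answers)
    return accuracy, accuracy >= threshold

# Hypothetical multiple-choice holdout, LEXam-style:
acc, ok = audit_teacher(list("ABCADBCA"), list("ABCDDBCB"))
print(acc, ok)  # 0.75 False -- do not ramp complexity yet
```

At 57% teacher accuracy, as measured on LEXam, this gate would have flagged the legal domain before any complexified data was generated.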

The cost reality: Simula uses up to 5x more inference calls per data point than baseline approaches. The researchers argue this is offset by needing fewer total samples to hit target performance, making the full training lifecycle more cost-effective — but teams running API-based models at scale should factor this overhead into compute budgets before starting a large generation run. All results were reported with 95% confidence intervals across every benchmark.

Qwen 3.6's Deployment Guide: Filling the Production Gap

Alongside the Simula research, a practical implementation guide dropped for Qwen 3.6-35B-A3B — a multimodal Mixture-of-Experts (MoE) model (an architecture where different specialized sub-networks activate selectively for different types of input, so a 35-billion-parameter model runs with the efficiency of a much smaller one) supporting vision input, tool calling, structured JSON output, and session persistence.

Most Qwen tutorials cover basic chat. This guide goes further — production deployment with:

  • Adaptive GPU loading: bf16 precision (full bfloat16 quality) at ≥75GB VRAM, int8 quantization (a compression method that cuts memory roughly in half with minor accuracy trade-offs) at ≥40GB, int4 quantization at <40GB
  • Thinking-budget control: set a token cap on how long the model reasons before answering — balancing inference speed versus reasoning depth per request
  • MoE routing inspection: see which expert sub-networks activate for which inputs, useful for debugging domain coverage gaps or unexpected outputs
  • Flash Attention 2 (a hardware-optimized attention computation that significantly speeds up inference on long input sequences) and BitsAndBytes quantization support
  • RAG integration (Retrieval-Augmented Generation — connecting the model to external knowledge sources at inference time) with persistent session memory across conversation turns
pip install --upgrade pip
pip install --upgrade "transformers>=4.48.0" "accelerate>=1.2.0" \
  "bitsandbytes>=0.44.0" pillow requests sentencepiece \
  "qwen-vl-utils[decord]" sentence-transformers jsonschema

# Model loads adaptively based on available GPU:
# GPU ≥75GB VRAM  →  bf16  (A100 80GB / H100)
# GPU ≥40GB VRAM  →  int8  (A100 40GB / L4)
# GPU <40GB VRAM  →  int4  (RTX 3090 / 4090 / smaller)
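
The threshold logic above is simple to make explicit. A minimal sketch of the selection rule (thresholds are the guide's; the function name and the return strings mapping to load options are assumptions, not the guide's API):

```python
def pick_precision(vram_gb: float) -> str:
    """Map available GPU memory to a load precision, mirroring the
    guide's adaptive-loading rule. In practice the result would feed
    e.g. torch_dtype=bfloat16 for "bf16", or a BitsAndBytes 8-bit/4-bit
    quantization config for "int8"/"int4"."""
    if vram_gb >= 75:
        return "bf16"   # A100 80GB / H100: full bfloat16 quality
    if vram_gb >= 40:
        return "int8"   # A100 40GB / L4: ~half memory, minor trade-offs
    return "int4"       # RTX 3090 / 4090 / smaller

for gb in (80, 48, 24):
    print(gb, "GB ->", pick_precision(gb))
```

Making the rule an explicit function also gives you one place to tighten thresholds if your serving stack reserves VRAM for KV cache or concurrent requests.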

An A100 or L4 GPU is recommended for practical inference at this model size. Teams evaluating LLM deployment options can review our LLM deployment guides for context on matching model size to production hardware constraints before committing to a specific configuration.

Synthetic Data Generation Is Becoming AI Automation Infrastructure

Simula represents a shift in how AI development works in regulated and specialized industries. Instead of treating training data as something you collect from the world, it becomes something you engineer — with controllable parameters, verifiable coverage, and documented quality metrics. The framework's two new evaluation tools — Taxonomic Coverage (a structured measure of how much of a target domain a dataset actually spans, as opposed to how diverse it merely sounds under cosine distance metrics) and Calibrated Complexity Scoring — give practitioners measurement tools that previously didn't exist for validating synthetic dataset quality before committing to a full training run.
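The idea behind Taxonomic Coverage is structural rather than geometric: count which taxonomy leaves the dataset actually touches, instead of measuring how spread-out the embeddings look. The published metric's exact definition may differ; this sketch captures the idea with invented leaves and topics:

```python
def taxonomic_coverage(taxonomy_leaves: set[str],
                       dataset_topics: list[str]) -> float:
    """Share of taxonomy leaf nodes covered by at least one example.
    A dataset can score high on pairwise cosine distance while leaving
    whole branches empty -- this metric catches that."""
    covered = taxonomy_leaves & set(dataset_topics)
    return len(covered) / len(taxonomy_leaves)

leaves = {"sql_injection", "xss", "port_scanning", "mitm"}
topics = ["xss", "xss", "sql_injection"]  # network branch never touched
print(taxonomic_coverage(leaves, topics))  # 0.5
```

This is the lens under which the paper's reference datasets looked narrower than Simula's output, even when cosine-distance diversity suggested otherwise.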

If your team is building specialized AI in cybersecurity, legal tech, healthcare, or multilingual STEM — and hitting the training data wall — the Simula methodology is now the most rigorous publicly documented approach to generating that data yourself. The full research is available at Google Research's blog. Study the teacher-accuracy-vs-complexity trade-off before your next training run. Getting that wrong doesn't just slow you down — it trains your model to be confidently wrong on harder problems, which is exactly what specialized AI can't afford.

