2026-04-05 · video-editing · VFX · Netflix · open-source · physics-AI · video-generation

Netflix just solved video editing's gravity problem

Netflix and INSAIT's VOID removes objects from video while preserving physics — props fall instead of float. Beats Runway, ProPainter, and 4 rivals.


Video editing has a dirty secret no one discusses in polite company: removing an object from footage has always been brutally hard. Professionals call it video inpainting (filling in the visual "hole" left after deleting something from a scene), and studios have been quietly struggling with it for three decades.

The fundamental problem isn't pixels — it's physics. When you delete a person holding a coffee cup, every existing tool leaves the cup floating in mid-air, suspended at chest height, defying gravity. Remove a hand pushing a crate and the crate freezes in space. The AI fills in the background beautifully but completely ignores the laws of nature. That disconnect is what makes the result look fake.

Netflix and the INSAIT research institute just published a model that treats this as a physics problem, not a painting problem. It's called VOID — Video Object and Interaction Deletion — and early benchmarks show it outperforming every major tool in the category, including Runway, ProPainter, and four other competitors.

Why Every Tool Before This Got Physics Wrong

Standard video inpainting tools use binary masks — essentially a yes/no map that marks "object is here" or "object is not here" — to identify what to remove. They then fill the gap with generated pixels that blend smoothly into the background. In a static screenshot, the results can look convincing.

The breakdown happens the moment interactions are involved. If Person A is holding Object B, removing Person A requires the AI to understand that Object B's position was caused by Person A. Remove the cause, and the effect must change — the object should fall, roll, or react to its new physical reality. No existing tool thinks this way. They fill pixels without modeling what those pixels represent in the world.
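To see what a binary mask does and does not encode, here is a minimal one-dimensional sketch of my own (not code from the paper): the mask marks which pixels to replace, and a naive fill just interpolates across the hole.

```python
import numpy as np

# Toy 1-D "frame": a smooth background gradient with a bright object
# occupying positions 4-7 (purely illustrative, not VOID's pipeline).
frame = np.linspace(0.0, 1.0, 12)
frame[4:8] = 9.0                     # object pixels
mask = np.zeros(12, dtype=bool)
mask[4:8] = True                     # binary mask: True = "remove this"

# Naive fill: interpolate across the hole from its edges -- the essence
# of what a binary-mask inpainter optimizes for. Note that the mask
# carries no information about what the object was touching or supporting.
filled = frame.copy()
hole, known = np.where(mask)[0], np.where(~mask)[0]
filled[hole] = np.interp(hole, known, frame[known])
```

The fill blends perfectly into the background gradient, yet nothing in the inputs could ever tell the model that a supported object should now fall.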

Netflix and INSAIT designed VOID specifically to reason about causality — not just visual gaps, but the physical consequences of removal.

[Image: VOID video object removal comparison — physics-aware deletion versus standard inpainting floating artifacts]

Inside VOID: The Quadmask That Changes Everything

VOID is built on top of CogVideoX-Fun, Alibaba PAI's extension of the CogVideoX video generation architecture — a 3D Transformer-based model (meaning it processes spatial and temporal information simultaneously, understanding how frames relate across time, not just how pixels relate within a single frame). The base model carries 5 billion parameters and processes sequences of up to 197 frames at 384×672 resolution.

The key innovation is what the researchers call the quadmask: a 4-value mask system that replaces the traditional binary (on/off) approach.

Think of a binary mask as a light switch — object here or not here, nothing in between. VOID's quadmask is a four-position selector. The four values encode different interaction states across a scene:

  • The primary object to be removed
  • The interaction zone (where the object is in contact with or affecting nearby elements)
  • The background that must be preserved unchanged
  • The transition regions where physical consequences must be inferred

This richer signal allows VOID to ask and answer a question no previous model asked: what should happen here after the object is gone? When a glass is removed from a hand, the model understands that the transition zone beneath it must show the glass falling — not static, not floating.
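As a concrete sketch, the four regions could be encoded like this (the labels follow the article's description; the exact value scheme is defined in the paper, so treat this as a hypothetical encoding):

```python
import numpy as np

# Hypothetical quadmask encoding; the paper defines the actual scheme.
BACKGROUND, OBJECT, INTERACTION, TRANSITION = 0, 1, 2, 3

H, W = 6, 8
quadmask = np.full((H, W), BACKGROUND, dtype=np.uint8)
quadmask[1:3, 2:5] = OBJECT        # e.g. the hand being removed
quadmask[3:4, 2:5] = INTERACTION   # contact zone: hand gripping the glass
quadmask[4:6, 2:5] = TRANSITION    # physics to re-infer: the glass must fall

# A binary mask collapses all three non-background roles into one blob,
# discarding exactly the causal structure VOID relies on.
binary = quadmask != BACKGROUND
```

The last line shows the information loss directly: once collapsed to on/off, the interaction and transition zones are indistinguishable from the object itself.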

How Temporal Consistency Is Maintained

One of the most persistent visual artifacts in video inpainting is temporal flickering — where the filled-in area pulses or shifts between frames because each frame is processed independently. CogVideoX processes temporal relationships natively, treating the video as a 3D volume (width × height × time) rather than a sequence of separate images. VOID inherits this architecture, which is why its results maintain coherent motion across the full 197-frame sequence rather than stuttering.

[Image: VOID quadmask diagram illustrating 4-value interaction regions enabling physics-aware video object deletion]

Six Competitors Tested — VOID Wins on Physics

The paper includes direct head-to-head comparisons against six other video removal methods:

  1. ProPainter — one of the most widely-used open-source video inpainting tools
  2. DiffuEraser — a diffusion-based video eraser (diffusion models generate images by learning to reverse a "noise" process — they start from random noise and gradually refine toward a coherent image)
  3. Runway — the commercial AI video editing platform used by major film studios
  4. MiniMax-Remover — MiniMax's dedicated object removal model
  5. ROSE — a recent research-level video inpainting model
  6. Gen-Omnimatte — a generative omnimatte approach (omnimatte separates video into object layers with their associated visual effects, like shadows and reflections)

VOID outperformed all six on scenes involving significant object interactions — collisions, prop handling, and scenarios where removed elements were in active physical contact with the surrounding scene. The other tools produced physically inconsistent results: floating objects, props frozen mid-motion, and backgrounds with visible "painted-on" artifacts.

Beating Runway on physics consistency is not a minor academic footnote. Runway is a commercial product used in professional studio pipelines with a market valuation in the billions. VOID doing this as an open research release means the gap between academic research and commercial capability has, once again, closed faster than the industry expected.

The Real-World Cost This Disrupts

Physics-consistent object removal has historically required skilled compositors (specialists who combine visual elements from multiple sources into a seamless shot) combined with manual rotoscoping (frame-by-frame hand-tracing of object boundaries). For a single complex removal shot — an actor dropping a prop, a character releasing a held object mid-scene — this work could consume hours of professional time.

At professional post-production rates averaging $150–$300 per hour for compositing work, even a 3-second interaction shot could run $1,000–$3,000 before color grading and finishing. For a full scene, costs reach into five figures. Independent filmmakers typically don't attempt these shots at all. They rewrite the scene, simplify the action, or accept an imperfect result.

VOID lowers this barrier substantially. The model runs on consumer hardware using BF16 precision with FP8 quantization (techniques that reduce GPU memory usage by representing numbers with fewer bits — allowing a 5-billion-parameter model to run on hardware that would normally be insufficient). The primary requirement is downloading the CogVideoX-Fun-V1.5-5b-InP checkpoint from Hugging Face before running VOID inference.
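The arithmetic behind the quantization claim is simple: bytes per parameter times parameter count. A quick back-of-envelope sketch (weights only; activations and caches push real peak memory higher):

```python
# Rough VRAM needed for the weights alone of a 5-billion-parameter model
# at different numeric precisions.
PARAMS = 5e9

footprint_gib = {
    "FP32": PARAMS * 4 / 2**30,  # ~18.6 GiB -- out of reach for most consumer GPUs
    "BF16": PARAMS * 2 / 2**30,  # ~9.3 GiB
    "FP8":  PARAMS * 1 / 2**30,  # ~4.7 GiB -- fits comfortably on consumer cards
}
```

Halving the bytes per parameter halves the weight footprint, which is why the BF16 + FP8 combination makes a 5B model practical on desktop hardware.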

Current Limitations to Know Before You Download

VOID is genuinely impressive, but several practical constraints matter for real workflows:

  • Resolution ceiling: The model runs at 384×672 pixels — well below 1080p or 4K. Professional use requires AI upscaling the output as a second step.
  • 8-second sequence limit: 197 frames at 24fps (standard cinematic frame rate) equals approximately 8.2 seconds. Longer shots must be split, processed in segments, and re-stitched — adding complexity.
  • Two-step install: Requires downloading the separate CogVideoX base model checkpoint first. Not a one-click experience compared to cloud tools like Runway.
  • GPU memory requirements: BF16 + FP8 still requires a capable GPU. This is a desktop workstation workflow, not a laptop one.
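The sequence-limit arithmetic above, plus a sketch of the split-and-restitch bookkeeping it forces for longer shots (the overlap heuristic here is my own assumption, not from the paper):

```python
FPS = 24            # standard cinematic frame rate
MAX_FRAMES = 197    # VOID's per-pass sequence limit

max_seconds = MAX_FRAMES / FPS   # ~8.2 seconds per pass

def num_segments(shot_frames: int, overlap: int = 8) -> int:
    """Passes needed to cover a longer shot when splitting and
    re-stitching with a small inter-segment overlap (overlap size
    is an illustrative choice)."""
    step = MAX_FRAMES - overlap
    return max(1, -(-(shot_frames - overlap) // step))  # ceiling division
```

A 197-frame shot fits in one pass; a roughly 17-second shot at 24fps (400 frames) would need three overlapping passes, each of which must then be blended back together.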

For creators with modern desktop hardware and patience for setup, VOID is accessible today. For casual users expecting a simple drag-and-drop web interface, it remains a technical workflow — for now.

How to Access VOID

The full paper is available at arxiv.org/abs/2604.02296. The PDF includes the complete architecture, training details, and qualitative comparisons. Code and model weights are expected on Hugging Face under the CogVideoX-Fun repository pattern. The initial setup flow looks like this:

# Step 1: Download the base model checkpoint from Hugging Face
#   Repository: Alibaba PAI / CogVideoX-Fun-V1.5-5b-InP
#
# Step 2: Clone the VOID repository (link in paper)
#
# Step 3: Prepare source video + quadmask
#
# Step 4: Run VOID inference on your masked sequence

Specific commands will be in the official repository README once the code releases publicly. The paper itself is the definitive reference for the architecture until then.
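In the meantime, step 1 can be sketched with the `huggingface_hub` client. The repo id below is an assumption based on the checkpoint name the article gives; verify it against the official README before relying on it:

```python
# Assumed Hugging Face repo id, inferred from the checkpoint name
# "CogVideoX-Fun-V1.5-5b-InP" under the Alibaba PAI organization.
REPO_ID = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"

def fetch_base_checkpoint(local_dir: str = "models/CogVideoX-Fun-V1.5-5b-InP") -> str:
    """Download the base model checkpoint (step 1 of the setup flow).
    Requires `pip install huggingface_hub`; returns the local path."""
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `fetch_base_checkpoint()` pulls the full checkpoint (several gigabytes) into the given directory; the VOID repository and quadmask preparation steps would follow once the code is public.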

Netflix's pattern of open research publishing — it has contributed to audio processing, subtitle generation, and recommendation systems over the past three years — reflects both talent recruitment and a broader ecosystem bet. The more content creation tools improve, the larger the pool of quality productions available for licensing and commissioning. VOID fits that logic precisely: it solves a real production bottleneck and publishes the solution for everyone.

