Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
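To make the pipeline's moving parts concrete, the sketch below lays out one pass of the generation loop in Python. It is a minimal illustration, not our implementation: the callables (write_instruction, edit_keyframe, propagate, enhance, score) and the keep threshold are hypothetical stand-ins for the agent, the image editor, the distilled in-context video generator, the temporal enhancer, and the quality filter described above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    source_video: str      # path to the real source clip
    instruction: str       # editing instruction written by the agent
    edited_video: str      # path to the synthesized edited clip
    quality_score: float   # agent-assigned quality score

def build_samples(
    source_videos: List[str],
    write_instruction: Callable[[str], str],   # agent: clip -> diverse instruction
    edit_keyframe: Callable[[str, str], str],  # image editor: (frame, instruction) -> edited frame
    propagate: Callable[[str, str], str],      # distilled in-context video generator
    enhance: Callable[[str], str],             # temporal enhancer
    score: Callable[[str, str, str], float],   # agent filter: (source, instruction, result) -> score
    keep_threshold: float = 0.5,               # illustrative cutoff for filtering
) -> List[Sample]:
    """One pass of a Ditto-style generation loop (hypothetical sketch)."""
    kept = []
    for src in source_videos:
        instr = write_instruction(src)      # craft a diverse editing instruction
        key = edit_keyframe(src, instr)     # creative edit on a keyframe
        rough = propagate(src, key)         # propagate the edit through the clip
        clean = enhance(rough)              # recover fidelity lost to distillation
        s = score(src, instr, clean)        # rigorous filtering at scale
        if s >= keep_threshold:
            kept.append(Sample(src, instr, clean, s))
    return kept
```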
A high-quality synthetic dataset for instruction-based video editing, featuring diverse scenarios and comprehensive editing instructions across global and local transformations.
Global editing applies a consistent modification to every frame, enabling dramatic style transfers, color grading, and atmospheric adjustments that maintain visual coherence across the complete temporal sequence.
Local editing targets specific regions or objects within the video, applying precise modifications while preserving the surrounding content. It enables selective enhancement, object replacement, and regional adjustments that keep the overall composition intact, as sketched below.
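A minimal numpy illustration of this distinction (the tint transform and rectangular region masks are toy stand-ins, not the dataset's actual edits): a global edit touches every pixel of every frame, while a local edit is blended back through per-frame masks.

```python
import numpy as np

def edit_globally(frames: np.ndarray, transform) -> np.ndarray:
    """Apply one consistent transform to every frame of a (T, H, W, C) video."""
    return np.stack([transform(f) for f in frames])

def edit_locally(frames: np.ndarray, masks: np.ndarray, transform) -> np.ndarray:
    """Apply the transform only inside per-frame masks, preserving the rest."""
    edited = np.stack([transform(f) for f in frames])
    m = masks[..., None].astype(frames.dtype)     # (T, H, W, 1) region masks
    return edited * m + frames * (1.0 - m)

# toy example: a warm tint applied globally vs. restricted to one region
video = np.random.rand(8, 64, 64, 3)              # 8 frames
tint = lambda f: np.clip(f * np.array([1.1, 1.0, 0.8]), 0.0, 1.0)
region = np.zeros((8, 64, 64))
region[:, 16:48, 16:48] = 1.0                     # edit only the center patch

global_edit = edit_globally(video, tint)          # affects every pixel
local_edit = edit_locally(video, region, tint)    # background left untouched
```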
An instruction-based video editing model trained on the Ditto-1M dataset, demonstrating superior performance across diverse editing scenarios and outperforming existing methods.
Showcasing the capabilities of our Editto model across various editing scenarios, from global style transfers to precise local modifications.
We showcase the synthetic-to-real (sim2real) capability enabled by our data by training the model to map the stylized videos in our dataset back to their original, real-world source videos.
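A minimal sketch of how such sim2real training pairs could be derived from the dataset, assuming each record carries (source, instruction, edited) paths; the restoration instruction template below is hypothetical, not the one we use.

```python
def to_sim2real_pairs(samples):
    """Reverse each (source, instruction, edited) record so the model learns
    to map a stylized clip back to its real-world source footage."""
    pairs = []
    for source_video, instruction, edited_video in samples:
        pairs.append({
            "input_video": edited_video,    # stylized clip from the dataset
            # illustrative restoration prompt, not the actual template
            "instruction": f"Undo the edit '{instruction}' and restore the original look.",
            "target_video": source_video,   # real-world source clip
        })
    return pairs

# usage
pairs = to_sim2real_pairs([("walk.mp4", "make it a watercolor painting", "walk_edit.mp4")])
```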
Here we demonstrate the effectiveness of the denoising enhancer: the raw edited video is shown on the left and the enhanced one on the right (please zoom in to see the details). The denoising enhancer mitigates, at low cost, the generation quality degradation introduced by quantized and distilled models.
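As a hedged sketch of how such an enhancer pass could be applied (the windowed traversal and the stand-in denoiser are assumptions, not our trained module), each raw frame is refined together with its temporal neighbors so the correction stays consistent over time.

```python
import numpy as np

def enhance_video(frames: np.ndarray, enhance_window, radius: int = 1) -> np.ndarray:
    """Refine each frame of a (T, H, W, C) video using a small temporal window,
    so the enhancement stays coherent across neighboring frames."""
    T = frames.shape[0]
    out = []
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        window = frames[lo:hi]                       # frame plus its neighbors
        out.append(enhance_window(window, t - lo))   # refined center frame
    return np.stack(out)

# usage with a trivial stand-in "enhancer" (temporal mean as a toy denoiser)
rough = np.random.rand(8, 32, 32, 3)                 # raw distilled-model output
denoise = lambda window, center: window.mean(axis=0) # placeholder for the learned model
clean = enhance_video(rough, denoise)
```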