SiD-DiT

Score Distillation of Flow Matching Models

arXiv 2025

Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang,
Wenze Hu, Yinfei Yang

Paper · Code · Hugging Face · Demos · Tutorial
SiD · SiD-LSG · SiDA

Acknowledgements

We are deeply grateful to Yinhong Liu for designing and building this website, developing the accelerate-based codebase, and preparing the Diffusers-style pipelines that power all of these demos.

About SiD-DiT

Diffusion models achieve high-quality image generation but are limited by slow iterative sampling, while distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation, based on Bayes’ rule and conditional expectations, that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score Identity Distillation (SiD) to pretrained text-to-image flow-matching models and show that score distillation applies broadly and stably to flow-matching generators, providing a principled foundation for unifying acceleration techniques across diffusion- and flow-based models. We also provide Diffusers-style pipelines; click here for the tutorial.
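
For concreteness, here is a minimal sketch of the standard identities behind that unification, assuming the linear-Gaussian interpolation x_t = α_t·x_0 + σ_t·ε; the paper's exact notation and derivation may differ in detail.

```latex
% Linear-Gaussian interpolation shared by Gaussian diffusion and flow matching
x_t = \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

% The flow-matching target is a conditional expectation of the time derivative
v_t(x) = \mathbb{E}[\dot{x}_t \mid x_t = x]
       = \dot{\alpha}_t \, \mathbb{E}[x_0 \mid x_t = x] + \dot{\sigma}_t \, \mathbb{E}[\varepsilon \mid x_t = x]

% Bayes' rule (Tweedie's formula) ties the same conditional expectations to the score
\nabla_x \log p_t(x) = \frac{\alpha_t \, \mathbb{E}[x_0 \mid x_t = x] - x}{\sigma_t^2}

% so the velocity field is an affine function of the score
v_t(x) = \frac{\dot{\alpha}_t}{\alpha_t} \, x
       - \sigma_t \left( \dot{\sigma}_t - \frac{\dot{\alpha}_t}{\alpha_t} \, \sigma_t \right) \nabla_x \log p_t(x)
```

Because a velocity prediction and a score (or x_0 / ε) prediction carry the same information under this interpolation, score-distillation losses defined for diffusion teachers can be applied to flow-matching teachers after a simple reparameterization.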

Core Highlights

Unified data-free distillation framework for flow-matching T2I models

Leveraging this unified perspective, SiD-DiT provides a practical, out-of-the-box distillation framework for flow-matching text-to-image models without requiring any real image data. Our implementation is compatible with most open-source DiT-style architectures (including Efficient DiT and MMDiT variants) and supports diverse noise schedules such as Rectified Flow and TrigFlow. Using a single codebase and largely shared hyperparameters, SiD-DiT successfully distills a broad family of pretrained flow-matching T2I models, including SANA and SANA-Sprint, SD3 and SD3.5, and FLUX.1-Dev, across scales from 0.6B to 12B parameters. The framework is designed to be easily extensible to future architectures and flow-matching variants, turning SiD into a unified, data-free distillation toolkit for modern text-to-image models.
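
Supporting multiple schedules mainly amounts to a small conversion layer between a model's velocity prediction and the clean-sample (x_0) prediction that distillation works with. The functions below are a minimal sketch under one common set of conventions (x_t = (1 - t)·x_0 + t·noise for Rectified Flow, x_t = cos(t)·x_0 + sin(t)·noise for TrigFlow with unit noise scale); they are illustrative and not code from the SiD-DiT repository.

```python
# Sketch: recovering an x0 prediction from a velocity prediction under two
# flow-matching schedules. Sign/time conventions and noise scaling (e.g.,
# TrigFlow's sigma_d) vary across codebases; the forms below assume the
# interpolations stated in the lead-in text.
import torch

def x0_from_velocity_rectified_flow(x_t: torch.Tensor, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # x_t = (1 - t) * x0 + t * noise, so v = d x_t / d t = noise - x0 and x0 = x_t - t * v.
    return x_t - t * v

def x0_from_velocity_trigflow(x_t: torch.Tensor, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # x_t = cos(t) * x0 + sin(t) * noise, so v = -sin(t) * x0 + cos(t) * noise
    # and x0 = cos(t) * x_t - sin(t) * v.
    return torch.cos(t) * x_t - torch.sin(t) * v
```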

Data-Free Distillation for High-Quality Image Generation

Our SiD-DiT operates in a fully data-free manner, relying solely on the pretrained model’s internal predictions, with no real images and no teacher finetuning. It consistently produces high-fidelity and diverse generations. Detailed comparisons with the teacher models and DiT distillation baselines are given in Table 1 and Table 2, which report quantitative metrics including FID, CLIP score, GenEval, and human preference scores (e.g., PickScore).
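
To make "data-free" concrete, the toy sketch below caricatures the alternating updates used by SiD-style score distillation: a frozen teacher, a "fake" denoiser fitted to the generator's own samples, and a one-step generator updated with a score-identity-style surrogate. All networks, weightings, and constants here are illustrative stand-ins, not the actual SiD-DiT objectives, which are given in the paper.

```python
# Toy, illustrative sketch of data-free SiD-style distillation (not the exact SiD-DiT losses).
import torch
import torch.nn as nn

D = 16  # toy "image" dimension

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(D + 1, 64), nn.SiLU(), nn.Linear(64, D))

teacher = make_net()    # frozen stand-in for the pretrained flow-matching teacher
fake = make_net()       # "fake" denoiser, trained only on generator samples
generator = make_net()  # one-step generator: noise -> image
for p in teacher.parameters():
    p.requires_grad_(False)

opt_fake = torch.optim.Adam(fake.parameters(), lr=1e-4)
opt_gen = torch.optim.Adam(generator.parameters(), lr=1e-4)

def predict_x0(net: nn.Module, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Predict the clean sample from a noisy input; t enters as an extra feature."""
    return net(torch.cat([x_t, t], dim=1))

for step in range(1000):
    z = torch.randn(32, D)
    t = torch.rand(32, 1)
    noise = torch.randn(32, D)

    # 1) The generator maps pure noise to fake images; no real data is used anywhere.
    x_g = predict_x0(generator, z, torch.ones(32, 1))
    # Rectified-flow-style interpolation between fake images and fresh noise.
    x_t = (1 - t) * x_g + t * noise

    # 2) Fit the fake denoiser to the current generator distribution.
    loss_fake = (predict_x0(fake, x_t.detach(), t) - x_g.detach()).pow(2).mean()
    opt_fake.zero_grad()
    loss_fake.backward()
    opt_fake.step()

    # 3) Update the generator with a score-identity-style surrogate that compares the
    #    frozen teacher's prediction with the fake denoiser's prediction.
    with torch.no_grad():
        f_teacher = predict_x0(teacher, x_t, t)
        f_fake = predict_x0(fake, x_t, t)
    loss_gen = ((f_teacher - f_fake) * (f_fake - x_g)).mean()
    opt_gen.zero_grad()
    loss_gen.backward()
    opt_gen.step()
```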

Rapid convergence

We find that data-free score distillation is both stable and effective, enabling efficient few-step generation across modern large-scale flow-matching models. SiD-DiT inherits the rapid convergence behavior of Score Identity Distillation (SiD): the FID of the distilled generator decreases roughly exponentially with training iterations. Although this effect was originally demonstrated on small- and medium-sized diffusion models, our large-scale flow-matching variants (e.g., DiT-based text-to-image backbones) retain the same rapid convergence.

FID and CLIP trajectories for SiD-DiT
Figure 1. Evolution of FID and CLIP score for SiD-DiT across architectures and model sizes.

Adversarial Score Identity Distillation

We further integrate Adversarial Score Identity Distillation (SiDA), implemented as the SiD²α variant, which continues training the distilled generator with a joint distillation–GAN objective on an auxiliary dataset, further improving image quality and diversity. In our experiments, we use the Midjourney-v6-LLaVA dataset as the auxiliary data, and users are encouraged to explore other high-quality datasets within the same framework.
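
One way to picture the adversarial stage is as a standard GAN term layered on top of the distillation loss, with the auxiliary dataset supplying the "real" side of the discriminator. The snippet below is a generic, illustrative sketch: the toy discriminator, the hinge/non-saturating losses, and the blending weight lambda_adv are assumptions for exposition, not the SiDA/SiD²α formulation from the paper.

```python
# Illustrative GAN augmentation of a distillation objective (not the exact SiDA losses).
import torch
import torch.nn as nn

D = 16  # toy "image" dimension, matching the data-free sketch above
disc = nn.Sequential(nn.Linear(D, 64), nn.SiLU(), nn.Linear(64, 1))
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
lambda_adv = 0.1  # assumed weight blending the adversarial term into the distillation loss

def discriminator_step(x_real: torch.Tensor, x_fake: torch.Tensor) -> None:
    # Hinge loss: auxiliary "real" images vs. detached generator samples.
    loss = (torch.relu(1 - disc(x_real)) + torch.relu(1 + disc(x_fake.detach()))).mean()
    opt_disc.zero_grad()
    loss.backward()
    opt_disc.step()

def generator_adversarial_term(x_fake: torch.Tensor) -> torch.Tensor:
    # Non-saturating generator term, added on top of the distillation loss before backprop.
    return -lambda_adv * disc(x_fake).mean()
```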

Quantitative Evaluation of Performance

Extensive experiments demonstrate that SiD-DiT achieves state-of-the-art efficiency and remains competitive across diverse text-to-image flow-matching models. The framework is evaluated under both data-free and data-aided settings on representative DiT-based architectures, including SANA, SD3, SD3.5, and FLUX, covering models from 0.6B to 12B parameters. Across all benchmarks, SiD-DiT reduces generation from 20–40 iterative steps to only 4 steps, while matching or surpassing the teacher models’ performance on standard quantitative metrics such as FID, CLIP score, GenEval, PickScore, and ImageReward. In data-free scenarios, the distilled generators already match the visual fidelity of the original pretrained models. When augmented with lightweight adversarial refinement, they achieve further FID reductions and improved text–image alignment, surpassing strong baselines such as SANA-Sprint and SD3.5-Turbo.
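
For readers who want to try a distilled 4-step generator, here is a minimal Diffusers-style sampling sketch. The repository id is a placeholder rather than a released checkpoint, and the exact pipeline classes and arguments for the released models are documented in the tutorial linked above; this simply uses the standard SanaPipeline interface from the diffusers library.

```python
# Minimal 4-step sampling sketch with a Diffusers Sana pipeline.
# NOTE: "your-org/sid-dit-sana-4step" is a hypothetical repo id, not a released checkpoint.
import torch
from diffusers import SanaPipeline

pipe = SanaPipeline.from_pretrained(
    "your-org/sid-dit-sana-4step",   # placeholder for a distilled SANA checkpoint
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a corgi astronaut floating above Earth, digital art",
    num_inference_steps=4,  # distilled generator: 4 steps instead of 20-40
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
image.save("sample.png")
```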

SANA / SANA-Sprint results for SiD-DiT
Table 1. Main quantitative comparison of SiD-DiT and baselines across text-to-image backbones.
SD3 / Flux results for SiD-DiT
Table 2. Detailed results on SANA and SANA-Sprint (Rectified Flow and TrigFlow) with SiD-DiT.

BibTeX

@misc{zhou2025scoredistillationflowmatching,
      title={Score Distillation of Flow Matching Models}, 
      author={Mingyuan Zhou and Yi Gu and Huangjie Zheng and Liangchen Song and Guande He and Yizhe Zhang and Wenze Hu and Yinfei Yang},
      year={2025},
      eprint={2509.25127},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25127}, 
    }