I am a Postdoctoral Associate working with SUNY Distinguished Professor Siwei Lyu (IEEE/IAPR Fellow) at the University at Buffalo, State University of New York (SUNY). I received my Ph.D. from Zhejiang University in June 2024. My Ph.D. thesis, titled "Knowledge Distillation on Deep Neural Networks," won the Outstanding Doctoral Dissertation award. My Google Scholar citation count reached 2,025 in 2025.
I am working on diffusion-based generative models (theoretical understanding and accelerated sampling) and knowledge distillation. I have reviewed over 100 papers for top-tier conferences and journals, including service in senior reviewer roles. I lived in Hangzhou (Paradise on Earth) and Wenzhou (Cradle of Mathematicians) for more than 25 years.
Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics: each simulated sampling trajectory lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical "boomerang" shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing ODE-based numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in the regime of only 5–10 function evaluations.
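For readers less familiar with the setup, the deterministic sampling process referred to above is the probability flow ODE associated with the forward diffusion SDE; the standard formulation is reproduced below for context (generic notation, not specific to this paper).

% Forward diffusion SDE:
%   d x_t = f(x_t, t) dt + g(t) dw_t
% Probability flow ODE sharing the same marginal distributions p_t(x_t):
\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t}
  = f(\mathbf{x}_t, t) - \frac{1}{2}\, g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)

Simulating this ODE from the prior back to the data distribution produces the deterministic sampling trajectories whose geometric regularity the paper studies.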
@article{chen2025geometric,title={Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models},author={Chen, Defang and Zhou, Zhenyu and Wang, Can and Lyu, Siwei},journal={J. Stat. Mech.},year={2025},}
26-AAAI
Diffusion
DICE: Distilling Classifier-Free Guidance into Text Embeddings
Text-to-image diffusion models are capable of generating high-quality images, but these images often fail to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, using CFG introduces significant computational overhead and deviates from the established theoretical foundations of diffusion models. In this paper, we present DIstilling CFG by enhancing text Embeddings (DICE), a novel approach that removes the reliance on CFG in the generative process while maintaining the benefits it provides. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational and theoretical drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Extensive experiments on multiple Stable Diffusion v1.5 variants, SDXL, and PixArt-α demonstrate the effectiveness of our method. Furthermore, DICE supports negative prompts for image editing to further improve image quality.
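A minimal sketch of the distillation idea described above, assuming a generic epsilon-prediction text-to-image model; the function and variable names (cfg_direction, refine_text_embedding, guidance_scale) are illustrative placeholders, not the released DICE implementation.

import torch

def cfg_direction(model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Standard classifier-free guidance: two forward passes per sampling step."""
    eps_cond = model(x_t, t, cond_emb)
    eps_uncond = model(x_t, t, uncond_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def refine_text_embedding(model, cond_emb, uncond_emb, noisy_batches,
                          guidance_scale=7.5, lr=1e-3):
    """Optimize a refined embedding so that a single conditional pass reproduces
    the CFG-guided prediction (one plausible reading of 'refining text embeddings
    to replicate CFG-based directions')."""
    refined_emb = cond_emb.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([refined_emb], lr=lr)
    for x_t, t in noisy_batches:
        with torch.no_grad():
            target = cfg_direction(model, x_t, t, cond_emb, uncond_emb, guidance_scale)
        pred = model(x_t, t, refined_emb)   # single forward pass, no CFG needed
        loss = torch.nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return refined_emb.detach()

At inference time only the refined embedding is used, so each sampling step needs a single forward pass instead of the two passes required by CFG.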
@inproceedings{zhou2026dice,title={DICE: Distilling Classifier-Free Guidance into Text Embeddings},author={Zhou, Zhenyu and Chen, Defang and Wang, Can and Chen, Chun and Lyu, Siwei},booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},year={2026}}
Diffusion-based generative models have demonstrated their powerful performance across various tasks, but this comes at the cost of slow sampling speed. To achieve both efficient and high-quality synthesis, various distillation-based accelerated sampling methods have been developed recently. However, they generally require time-consuming fine-tuning with elaborate designs to achieve satisfactory performance at a specific number of function evaluations (NFE), making them difficult to employ in practice. To address this issue, we propose Simple and Fast Distillation (SFD) of diffusion models, which simplifies the paradigm used in existing methods and shortens their fine-tuning time by up to 1000×. We begin with a vanilla distillation-based sampling method and boost its performance to the state of the art by identifying and addressing several small yet vital factors affecting the synthesis efficiency and quality. Our method can also achieve sampling with variable NFEs using a single distilled model. Extensive experiments demonstrate that SFD strikes a good balance between sample quality and fine-tuning cost in the few-step image generation task. For example, SFD achieves 4.53 FID (NFE=2) on CIFAR-10 with only 0.64 hours of fine-tuning on a single NVIDIA A100 GPU.
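A hedged sketch of the vanilla distillation-based sampling setup that SFD builds on; teacher_solve and student are placeholder callables (a multi-step teacher ODE solve and a one-evaluation student update), not the paper's actual code.

import torch
import torch.nn.functional as F

def distill_step(student, teacher_solve, x_T, timesteps):
    """One fine-tuning iteration of few-step trajectory distillation (illustrative).
    teacher_solve(x, t_cur, t_next): accurate multi-step ODE solve over [t_cur, t_next]
    student(x, t_cur, t_next):       a single student evaluation covering the same span
    """
    loss = torch.zeros((), device=x_T.device)
    x = x_T
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        with torch.no_grad():
            x_target = teacher_solve(x, t_cur, t_next)   # slow but accurate reference
        x_pred = student(x, t_cur, t_next)               # one NFE for the same interval
        loss = loss + F.mse_loss(x_pred, x_target)
        x = x_target                                     # follow the teacher trajectory
    return loss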
@inproceedings{zhou2024simple,title={Simple and fast distillation of diffusion models},author={Zhou, Zhenyu and Chen, Defang and Wang, Can and Chen, Chun and Lyu, Siwei},booktitle={{Advances in Neural Information Processing Systems}},pages={40831--40860},year={2024},}
25-TMLR
Survey
Conditional Image Synthesis with Diffusion Models: A Survey
Zheyuan Zhan, Defang Chen†, Jian-Ping Mei, Zhenghe Zhao, Jiawei Chen, and 3 more authors
Conditional image synthesis based on user-specified requirements is a key component in creating complex visual content. In recent years, diffusion-based generative modeling has become a highly effective approach to conditional image synthesis, leading to exponential growth in the literature. However, the complexity of diffusion-based modeling, the wide range of image synthesis tasks, and the diversity of conditioning mechanisms present significant challenges for researchers to keep up with rapid developments and to understand the core concepts of this topic. In this survey, we categorize existing works based on how conditions are integrated into the two fundamental components of diffusion-based modeling, i.e., the denoising network and the sampling process. We specifically highlight the underlying principles, advantages, and potential challenges of various conditioning approaches during the training, re-purposing, and specialization stages to construct a desired denoising network. We also summarize six mainstream conditioning mechanisms in the sampling process. All discussions are centered around popular applications. Finally, we pinpoint several critical yet still unsolved problems and suggest some possible solutions for future research.
@article{zhan2025conditional,title={Conditional Image Synthesis with Diffusion Models: A Survey},author={Zhan, Zheyuan and Chen, Defang and Mei, Jian-Ping and Zhao, Zhenghe and Chen, Jiawei and Chen, Chun and Lyu, Siwei and Wang, Can},journal={Transactions on Machine Learning Research},year={2025},}
Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions, creating a pronounced divergence between the standard distillation loss and the cross-entropy loss, which can undermine the consistency of the student model's learning objectives. Previous attempts to use labels to empirically correct teacher predictions may undermine the class correlations. In contrast, our RLD employs labeling information to dynamically refine teacher logits. In this way, our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations, thus enhancing the value and efficiency of distilled knowledge. Experimental results on CIFAR-100 and ImageNet demonstrate its superiority over existing methods.
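For context, the sketch below shows the standard temperature-scaled logit distillation objective into which a label-aware refinement of the teacher logits would be inserted; refine_teacher_logits is deliberately left as an identity placeholder, since the actual RLD refinement is the paper's contribution and is not reproduced here.

import torch.nn.functional as F

def refine_teacher_logits(teacher_logits, labels):
    """Placeholder for the paper's label-aware refinement of teacher logits;
    the identity is used here because the actual RLD operation is not reproduced."""
    return teacher_logits

def logit_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard temperature-scaled logit distillation with a refinement hook."""
    refined = refine_teacher_logits(teacher_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(refined / T, dim=1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce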
@inproceedings{sun2025knowledge,title={Knowledge distillation with refined logits},author={Sun, Wujie and Chen, Defang and Lyu, Siwei and Chen, Genlang and Chen, Chun and Wang, Can},booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},pages={1110--1119},year={2025}}
24-ICML
Diffusion
On the Trajectory Regularity of ODE-based Diffusion Sampling
Diffusion-based generative models use stochastic differential equations (SDEs) and their equivalent ordinary differential equations (ODEs) to establish a smooth connection between a complex data distribution and a tractable prior distribution. In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models. We characterize an implicit denoising trajectory and discuss its vital role in forming the coupled sampling trajectory with a strong shape regularity, regardless of the generated content. We also describe a dynamic programming-based scheme to make the time schedule in sampling better fit the underlying trajectory structure. This simple strategy requires minimal modification to any given ODE-based numerical solver and incurs negligible computational cost, while delivering superior performance in image generation, especially with only 5–10 function evaluations.
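A generic sketch of the dynamic-programming schedule selection, assuming a precomputed matrix cost[i, j] that scores jumping from dense-grid time i to time j; the actual cost used in the paper is derived from the trajectory structure and is not reproduced here.

import numpy as np

def dp_time_schedule(cost, num_steps):
    """Pick `num_steps` jumps over a dense time grid minimizing an additive cost.
    cost[i, j]: penalty of stepping directly from grid index i to grid index j (i < j).
    Returns the optimal sequence of grid indices from 0 to n-1 using exactly `num_steps` jumps.
    """
    n = cost.shape[0]
    assert 1 <= num_steps < n
    INF = float("inf")
    dp = np.full((num_steps + 1, n), INF)
    parent = np.full((num_steps + 1, n), -1, dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, n):
            for i in range(j):
                if dp[k - 1, i] + cost[i, j] < dp[k, j]:
                    dp[k, j] = dp[k - 1, i] + cost[i, j]
                    parent[k, j] = i
    # Backtrack from the last grid index reached with exactly `num_steps` jumps.
    schedule, j = [n - 1], n - 1
    for k in range(num_steps, 0, -1):
        j = parent[k, j]
        schedule.append(j)
    return schedule[::-1]

Usage: given n candidate times and a budget of K sampling steps, dp_time_schedule(cost, K) returns the K+1 grid indices defining the optimized time schedule, which can then be passed to any existing ODE-based solver.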
@inproceedings{chen2024trajectory,title={On the Trajectory Regularity of ODE-based Diffusion Sampling},author={Chen, Defang and Zhou, Zhenyu and Wang, Can and Shen, Chunhua and Lyu, Siwei},booktitle={International Conference on Machine Learning},pages={7905--7934},year={2024},}
24-CVPR
Diffusion
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs), with the aim of obtaining an accurate solution with as few function evaluations (NFE) as possible. Recently, various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However, these numerical methods inherently result in certain approximation errors, which significantly degrade sample quality with extremely small NFE (e.g., around 5). In contrast, based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space, we propose the Approximate MEan-Direction Solver (AMED-Solver), which eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides, our method can easily be used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis at resolutions ranging from 32 to 512 demonstrate the effectiveness of our method. With only 5 NFE, we achieve 6.61 FID on CIFAR-10, 10.74 FID on ImageNet 64×64, and 13.20 FID on LSUN Bedroom.
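The "mean direction" idea can be summarized by the exact integral form of one sampling step and its single-evaluation approximation; the notation is illustrative, with d_theta standing for the learned direction.

% Exact update of the sampling ODE over one step, and its single-direction approximation:
\mathbf{x}_{t_{n+1}}
  = \mathbf{x}_{t_n} + \int_{t_n}^{t_{n+1}} \frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t}\, \mathrm{d}t
  \;\approx\; \mathbf{x}_{t_n} + (t_{n+1} - t_n)\, d_\theta\!\left(\mathbf{x}_{t_n}, t_n, t_{n+1}\right)

The observation that each trajectory nearly lies in a two-dimensional subspace is what motivates replacing the whole integral with a single learned direction rather than a high-order polynomial extrapolation.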
@inproceedings{zhou2024fast,title={Fast ODE-based Sampling for Diffusion Models in Around 5 Steps},author={Zhou, Zhenyu and Chen, Defang and Wang, Can and Chen, Chun},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},pages={7777--7786},year={2024},}
22-CVPR
Distillation
Knowledge Distillation with the Reused Teacher Classifier
Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and 1 more author
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years, generally with elaborately designed knowledge representations, which in turn increase the difficulty of model development and interpretation. In contrast, we empirically show that a simple knowledge distillation technique is enough to significantly narrow down the teacher-student performance gap. We directly reuse the discriminative classifier from the pre-trained teacher model for student inference and train a student encoder through feature alignment with a single ℓ2 loss. In this way, the student model is able to achieve exactly the same performance as the teacher model provided that their extracted features are perfectly aligned. An additional projector is developed to help the student encoder match the teacher classifier, which renders our technique applicable to various teacher and student architectures. Extensive experiments demonstrate that our technique achieves state-of-the-art results at a modest cost in compression ratio due to the added projector.
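A minimal sketch of the reused-classifier setup described above; the projector architecture and dimensions are illustrative assumptions, not the exact design from the paper.

import torch.nn as nn
import torch.nn.functional as F

class SimKDStudent(nn.Module):
    """Student encoder plus projector; the frozen teacher classifier is reused for inference."""

    def __init__(self, student_encoder, student_dim, teacher_dim, teacher_classifier):
        super().__init__()
        self.encoder = student_encoder
        self.projector = nn.Sequential(           # maps student features into the teacher feature space
            nn.Linear(student_dim, teacher_dim),
            nn.ReLU(inplace=True),
            nn.Linear(teacher_dim, teacher_dim),
        )
        self.classifier = teacher_classifier      # reused teacher classifier, kept frozen
        for p in self.classifier.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        feat = self.projector(self.encoder(x))
        return feat, self.classifier(feat)

def feature_alignment_loss(student_feat, teacher_feat):
    """The single L2 (mean squared error) objective used to train the student encoder."""
    return F.mse_loss(student_feat, teacher_feat)

If the projected student features exactly matched the teacher's features, the reused classifier would reproduce the teacher's predictions, which is the intuition behind training with feature alignment alone.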
@inproceedings{chen2022simkd,title={Knowledge Distillation with the Reused Teacher Classifier},author={Chen, Defang and Mei, Jian-Ping and Zhang, Hailin and Wang, Can and Feng, Yan and Chen, Chun},booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},pages={11933--11942},year={2022},}
21-AAAI
Distillation
Cross-Layer Distillation with Semantic Calibration
Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, and 2 more authors
In Proceedings of the AAAI Conference on Artificial Intelligence, 2021
Journal version published in IEEE Trans. Knowl. Data Eng. (TKDE). Highly-Cited Paper Indexed by 2024/2025 Google Scholar Metrics
Knowledge distillation is a technique to enhance the generalization ability of a student model by exploiting outputs from a teacher model. Recently, feature-map based variants explore knowledge transfer between manually assigned teacher-student pairs in intermediate layers for further improvement. However, layer semantics may vary in different neural networks and semantic mismatch in manual layer associations will lead to performance degeneration due to negative regularization. To address this issue, we propose Semantic Calibration for cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer with an attention mechanism. With a learned attention distribution, each student layer distills knowledge contained in multiple teacher layers rather than a specific intermediate layer for appropriate cross-layer supervision. We further provide theoretical analysis of the association weights and conduct extensive experiments to demonstrate the effectiveness of our approach.
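A simplified sketch of attention-based cross-layer matching in the spirit of SemCKD, assuming pooled student and teacher features of a common dimension; in the actual method, learned query/key projections handle mismatched feature shapes, so the similarity and projection below are illustrative placeholders.

import torch
import torch.nn.functional as F

def cross_layer_attention_loss(student_feats, teacher_feats, proj_heads, temperature=1.0):
    """Attention-weighted cross-layer feature matching (simplified).
    student_feats: list of pooled student features, each of shape (B, D)
    teacher_feats: list of pooled teacher features, each of shape (B, D)
    proj_heads[s]: module aligning student layer s with the teacher feature space
    """
    total = 0.0
    for s, f_s in enumerate(student_feats):
        # Similarity of this student layer to every candidate teacher layer.
        sims = torch.stack([(f_s * f_t).sum(dim=1) for f_t in teacher_feats], dim=1)  # (B, T)
        attn = F.softmax(sims / temperature, dim=1)                                   # (B, T)
        proj = proj_heads[s](f_s)                                                     # (B, D)
        for t, f_t in enumerate(teacher_feats):
            per_sample = ((proj - f_t) ** 2).mean(dim=1)                              # (B,)
            total = total + (attn[:, t] * per_sample).mean()
    return total

Each student layer is thus supervised by a soft mixture of teacher layers rather than a single manually assigned one, which is the point of the attention-based association.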
@inproceedings{chen2021cross,author={Chen, Defang and Mei, Jian{-}Ping and Zhang, Yuan and Wang, Can and Wang, Zhe and Feng, Yan and Chen, Chun},title={Cross-Layer Distillation with Semantic Calibration},booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},pages={7028--7036},year={2021},}
20-AAAI
Distillation
Online Knowledge Distillation with Diverse Peers
Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen
In Proceedings of the AAAI Conference on Artificial Intelligence, 2020
Highly-Cited Paper Indexed by 2024/2025 Google Scholar Metrics
Distillation is an effective knowledge-transfer technique that uses predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained high-capacity teacher, however, is not always available. Recently proposed online variants use the aggregated intermediate predictions of multiple student models as targets to train each student model. Although group-derived targets give a good recipe for teacher-free distillation, group members are homogenized quickly with simple aggregation functions, leading to early saturated solutions. In this work, we propose Online Knowledge Distillation with Diverse peers (OKDDip), which performs two-level distillation during training with multiple auxiliary peers and one group leader. In the first-level distillation, each auxiliary peer holds an individual set of aggregation weights generated with an attention-based mechanism to derive its own targets from predictions of other auxiliary peers. Learning from distinct target distributions helps to boost peer diversity for the effectiveness of group-based distillation. The second-level distillation is performed to transfer the knowledge in the ensemble of auxiliary peers further to the group leader, i.e., the model used for inference. Experimental results show that the proposed framework consistently gives better performance than state-of-the-art approaches without sacrificing training or inference efficiency, demonstrating the effectiveness of the two-level distillation framework.
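A hedged sketch of the first-level, attention-based aggregation described above; tensor shapes and the temperature are illustrative assumptions, and the second-level distillation to the group leader (a standard KD loss against the ensemble of auxiliary peers) is omitted.

import torch
import torch.nn.functional as F

def peer_targets(peer_logits, queries, keys, T=3.0):
    """First-level targets (illustrative): each auxiliary peer aggregates the other
    peers' softened predictions with its own attention weights.
    peer_logits: (P, B, C) logits of P auxiliary peers
    queries, keys: (P, B, D) per-peer embeddings used to compute attention
    """
    P = peer_logits.shape[0]
    probs = F.softmax(peer_logits / T, dim=-1)                      # (P, B, C)
    scores = torch.einsum("pbd,qbd->bpq", queries, keys)            # (B, P, P)
    self_mask = torch.eye(P, dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(self_mask, float("-inf"))           # exclude a peer's own prediction
    attn = F.softmax(scores, dim=-1)                                # each peer's weights over the others
    targets = torch.einsum("bpq,qbc->pbc", attn, probs)             # (P, B, C)
    return targets.detach()

def first_level_loss(peer_logits, targets, T=3.0):
    """KL divergence between each peer's softened prediction and its aggregated target."""
    log_p = F.log_softmax(peer_logits / T, dim=-1)
    return F.kl_div(log_p, targets, reduction="batchmean") * T * T

Because each peer learns from its own attention-weighted mixture of the others, the peers receive distinct targets, which is what keeps them from homogenizing during training.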
@inproceedings{chen2020online,title={Online Knowledge Distillation with Diverse Peers},author={Chen, Defang and Mei, Jian-Ping and Wang, Can and Feng, Yan and Chen, Chun},booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},pages={3430--3437},year={2020},}