Drop-Upcycling
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

📄 [Paper] | 🤗 [Hugging Face] | 📁 [Dataset] | 💻 [Code] | 📊 [Log]
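Method sketch

As an informal illustration of the idea in the title, the snippet below builds a single MoE expert from a pretrained dense FFN by copying its weights and re-initializing a randomly chosen fraction of the intermediate dimension. This is a minimal sketch, not code from this repository: it assumes a SwiGLU-style FFN (gate/up/down projections), and the names (`drop_upcycle_expert`, `drop_ratio`) and the statistics-matched re-initialization are simplifying assumptions; please refer to the paper for the exact procedure.

```python
import torch


def drop_upcycle_expert(w_gate, w_up, w_down, drop_ratio=0.5, generator=None):
    """Build one MoE expert from a dense FFN by partial re-initialization.

    w_gate, w_up: (d_ff, d_model) projections into the FFN intermediate space.
    w_down:       (d_model, d_ff) projection back to the model dimension.

    A random `drop_ratio` fraction of the intermediate dimension is dropped
    (the same indices in all three matrices) and re-initialized; the remaining
    weights are copied unchanged from the dense model.
    """
    d_ff = w_up.shape[0]
    n_drop = int(drop_ratio * d_ff)
    drop_idx = torch.randperm(d_ff, generator=generator)[:n_drop]

    expert = {"gate": w_gate.clone(), "up": w_up.clone(), "down": w_down.clone()}

    # Re-initialize the dropped rows of the gate/up projections. Sampling from
    # a normal distribution matched to each matrix's overall mean/std is a
    # simplifying assumption made for this sketch.
    for name in ("gate", "up"):
        w = expert[name]
        noise = torch.randn(n_drop, w.shape[1], generator=generator)
        w[drop_idx, :] = w.mean() + w.std() * noise

    # Re-initialize the matching columns of the down projection.
    w = expert["down"]
    noise = torch.randn(w.shape[0], n_drop, generator=generator)
    w[:, drop_idx] = w.mean() + w.std() * noise

    return expert
```

In this sketch, each expert would draw its own dropped index set, so the experts diverge from initialization while still inheriting most of the dense model's weights.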

Pretraining

The experiments were conducted using the following frameworks:

- Dense Model Training
- MoE Model Training

Evaluation

We conducted comprehensive evaluations using the evaluation framework from swallow-llm/swallow-evaluation (commit: 04948a0).

Setup and Usage

For detailed instructions on setting up the evaluation environment and running the evaluation scripts, please refer to the evaluation framework documentation.

Citation

@inproceedings{
    nakamura2025dropupcycling,
    title={Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization},
    author={Taishi Nakamura and Takuya Akiba and Kazuki Fujii and Yusuke Oda and Rio Yokota and Jun Suzuki},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=gx1wHnf5Vp}
}
