Small Models Struggle to Learn from Strong Reasoners

University of Washington · Carnegie Mellon University · Western Washington University

Abstract

Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity.

To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples, or responses from both larger and smaller teacher models. Our experiments demonstrate that Mix Distillation significantly improves small-model reasoning performance compared to training on either type of data alone. These findings highlight the limitations of direct strong-model distillation and underscore the importance of adapting reasoning complexity for effective transfer of reasoning capabilities.

Small Model Learnability Gap


We reveal that small student models (≤3B parameters) do not consistently benefit from long CoT reasoning or distillation from large teacher models. Instead, they perform better when fine-tuned on shorter CoT reasoning or distilled from smaller teachers, both of which better align with their intrinsic learning capacity. We term this phenomenon the Small Model Learnability Gap.

Main Takeaways

Takeaway 1: Long CoT Gap

Small student models tend to benefit more from short CoT, whereas large student models benefit more from long CoT.


Long CoT Gap (Δ_Long = P_Long − P_Short) of student models of different sizes for (a) the Qwen family and (b) the Llama family. Among the teacher models, QwQ-32B-Preview generates the long CoT responses, while Qwen2.5-32B-Instruct generates the short CoT responses. Δ_Long < 0 indicates that long CoT is worse for the student; Δ_Long > 0 indicates it is better.
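The gap is simply a signed difference of post-fine-tuning accuracies. Below is a minimal Python sketch of how Δ_Long could be computed from evaluation results; the student names and accuracy numbers are hypothetical placeholders, not values from the paper, and Δ_Large in Takeaway 2 is obtained analogously by swapping in large-teacher and small-teacher accuracies.

    # Minimal sketch (not the paper's code): computing the Long CoT Gap,
    # Delta_Long = P_Long - P_Short, where P_Long and P_Short are the benchmark
    # accuracies of the same student after fine-tuning on long- vs. short-CoT data.

    def learnability_gap(p_treatment: float, p_baseline: float) -> float:
        """Positive => the 'treatment' data (e.g., long CoT) helps this student;
        negative => the baseline data (e.g., short CoT) is better."""
        return p_treatment - p_baseline

    # Hypothetical accuracies (fraction of problems solved); placeholders only.
    students = {
        "small-student-0.5B": {"p_long": 0.18, "p_short": 0.24},
        "large-student-7B": {"p_long": 0.62, "p_short": 0.55},
    }

    for name, acc in students.items():
        gap = learnability_gap(acc["p_long"], acc["p_short"])
        print(f"{name}: Delta_Long = {gap:+.2f}")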

Takeaway 2: Large Teacher CoT Gap

Small student models tend to learn better from small teachers, while large student models benefit more from large teachers.


Large Teacher CoT Gap (Δ_Large = P_Large − P_Small) of student models for (a) the Qwen family and (b) the Llama family. Qwen2.5-72B-Instruct is used as the large teacher, while Qwen2.5-3B-Instruct serves as the small teacher. Δ_Large < 0 indicates that large-teacher CoT is worse for the student; Δ_Large > 0 indicates it is better.

Takeaway 3: Effect of Domain Knowledge

Limited domain knowledge of small models may hinder their learning from strong reasoning teachers.


Math expert models typically exhibit a smaller learnability gap than general-purpose models, suggesting they can more easily learn from long CoT data or large-teacher CoT.

Takeaway 4: Base vs Instruct

Small base models experience a more significant learnability gap than Instruct models.


Base models generally exhibit a more pronounced learnability gap than Instruct models, indicating that it is more challenging for small base models to effectively learn from long CoT data or large teacher CoT.

Mix Distillation

Takeaway 5: Mix Distillation Bridges Gap

By mixing long CoT data with short CoT data (or, analogously, large-teacher CoT with small-teacher CoT), a small student model achieves better performance than training on either type of data alone.


Mix Distillation outperforms the baselines across most metrics. Using Llama3.2-3B-Instruct and Qwen2.5-3B-Instruct as student models and 7.5k training samples from the MATH dataset, we distill responses generated by different teacher models as baselines. Our proposed Mix-Long combines long CoT and normal (short) CoT data in a 1:4 ratio, while Mix-Large combines strong-teacher and weak-teacher responses in the same proportion. Both methods surpass the baselines on most evaluation metrics, with the highest score in bold and the second highest underlined.
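As a rough illustration of the Mix-Long data recipe, here is a minimal Python sketch that combines long-CoT and short-CoT distillation data in the 1:4 ratio described above; the function name and the prompt/response record format are assumptions for illustration, not the paper's released implementation. Mix-Large follows the same recipe with large-teacher and small-teacher responses swapped in.

    # Minimal sketch (assumptions, not the paper's implementation): building a
    # Mix-Long style SFT set with long-CoT and short-CoT examples in a 1:4 ratio.
    import random

    def mix_distillation(long_cot_data, short_cot_data, ratio=(1, 4), seed=0):
        """Return a shuffled training set with ratio[0] long-CoT examples for
        every ratio[1] short-CoT examples (relative weights, not absolute counts)."""
        rng = random.Random(seed)
        n_long, n_short = ratio
        # Largest number of ratio "units" the two data pools can support.
        units = min(len(long_cot_data) // n_long, len(short_cot_data) // n_short)
        mixed = (rng.sample(long_cot_data, units * n_long)
                 + rng.sample(short_cot_data, units * n_short))
        rng.shuffle(mixed)
        return mixed

    # Hypothetical usage: each item is a {"prompt": ..., "response": ...} record
    # distilled from the long-CoT teacher (e.g., QwQ-32B-Preview) or the
    # short-CoT teacher (e.g., Qwen2.5-32B-Instruct).
    # train_set = mix_distillation(long_cot_data, short_cot_data, ratio=(1, 4))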

BibTeX

If you find our work useful, please consider citing our paper:


@article{li2025small,
  title={Small Models Struggle to Learn from Strong Reasoners},
  author={Li, Yuetai and Yue, Xiang and Xu, Zhangchen and Jiang, Fengqing and Niu, Luyao and Lin, Bill Yuchen and Ramasubramanian, Bhaskar and Poovendran, Radha},
  journal={arXiv preprint arXiv:2502.12143},
  year={2025}
}