Authors
Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
- Pages: 22
- Published in: United States of America
Table of Contents
- Introduction
- Related Work: Reasoning & Language Models
- GSM-Symbolic
- GSM-Symbolic: Template Generation
- Experimental Setup
- Experiments & Results
- How Reliable Are the Current GSM8K Results?
- How Fragile is Mathematical Reasoning in Large Language Models?
- How Does Question Difficulty Affect Performance Distribution?
- Can LLMs Really Understand Mathematical Concepts?
- Conclusion
- Appendix
- Detailed Experimental Setup
- Full Results
- Additional Results on GSM-Symbolic Performance Distributions
- Ablation: Does Fine-Tuning on Easier Tasks Help with More Difficult Tasks?
- Results on o1-preview and o1-mini