DocMath-Eval

Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents

1Yale University, 2New York University, 3Penn State University
4Carnegie Mellon University, 5Allen Institute for AI
ACL 2024 Oral

Introduction

Large Language Models (LLMs) have shown impressive capabilities in solving math word problems, but their ability to perform numerical reasoning in specialized domains with complex documents remains understudied. To address this gap, we present DocMath-Eval, a benchmark designed to evaluate LLMs' numerical reasoning skills in interpreting finance-specific documents containing both text and tables. DocMath-Eval consists of four evaluation sets with varying levels of difficulty in numerical reasoning and document understanding:

  • DMSimpShort: Simple reasoning over short documents with one table.
  • DMSimpLong: Simple reasoning over long documents with multiple tables.
  • DMCompShort: Complex reasoning over short documents with one table.
  • DMCompLong: A newly created set testing complex reasoning over extremely long documents with multiple tables.
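As a rough illustration of how one might inspect the benchmark data, the sketch below assumes the evaluation sets are distributed through the Hugging Face datasets hub. The repository name, subset identifier, and field names are assumptions for illustration only, not the official schema; please refer to the dataset release for the actual loading instructions.

# Minimal sketch, assuming distribution via the Hugging Face `datasets` hub.
# The repository name, subset name, and field names below are hypothetical.
from datasets import load_dataset

# Hypothetical repository and subset identifiers.
dataset = load_dataset("yale-nlp/DocMath-Eval", "simpshort", split="test")

example = dataset[0]
# Each example is assumed to pair a finance document (text plus tables)
# with a numerical question and a gold answer.
print(example.keys())
print(example["question"])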

Leaderboard of DocMath-Eval Testmini Set

Leaderboard of DocMath-Eval Test Set

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.
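As a hedged illustration only (the authoritative submission format is described at the link above), a result file might collect one predicted answer per example and serialize the list as JSON; the field names below are hypothetical.

# Hypothetical sketch of assembling a result JSON file for submission;
# the required schema is specified in the submission instructions above.
import json

predictions = [
    {"question_id": "example-0001", "prediction": 42.5},   # hypothetical fields
    {"question_id": "example-0002", "prediction": -3.1},
]

with open("docmath_eval_results.json", "w") as f:
    json.dump(predictions, f, indent=2)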

BibTeX

@misc{zhao2024docmatheval,
      title={DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents}, 
      author={Yilun Zhao and Yitao Long and Hongjun Liu and Ryo Kamoi and Linyong Nan and Lyuhao Chen and Yixin Liu and Xiangru Tang and Rui Zhang and Arman Cohan},
      year={2024},
      eprint={2311.09805},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2311.09805}, 
}