FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published May 5 • 34
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20 • 104
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents Paper • 2406.13923 • Published Jun 20, 2024 • 24