LaSeR: Reinforcement Learning with Last-Token Self-Rewarding Paper • 2510.14943 • Published Oct 16 • 37
Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance Paper • 2502.12459 • Published Feb 18 • 2
Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design Paper • 2506.04734 • Published Jun 5 • 20