MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28 • 171
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain Paper • 2509.26507 • Published Sep 30 • 532
Less is More: Recursive Reasoning with Tiny Networks Paper • 2510.04871 • Published Oct 6 • 484
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? Paper • 2510.02209 • Published Oct 2 • 52
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge Paper • 2506.21506 • Published Jun 26 • 51
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA Paper • 2505.21115 • Published May 27 • 139
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning Paper • 2504.17192 • Published Apr 24 • 120
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy Paper • 2503.24388 • Published Mar 31 • 30
Article 🦸🏻#14: What Is MCP, and Why Is Everyone – Suddenly! – Talking About It? Mar 17 • 342
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Paper • 2502.15007 • Published Feb 20 • 174
Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems Paper • 2502.11098 • Published Feb 16 • 13
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? Paper • 2502.12115 • Published Feb 17 • 46