SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys? Paper • 2510.03120 • Published Oct 3 • 6
T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables Paper • 2508.19813 • Published Aug 27 • 25
Introducing AI Sheets: a tool to work with datasets using open AI models! Article • Aug 8 • 102
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers Paper • 2508.20453 • Published Aug 28 • 63
Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models Paper • 2507.14241 • Published Jul 17 • 17
Mitigating Object Hallucinations via Sentence-Level Early Intervention Paper • 2507.12455 • Published Jul 16 • 7
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks Paper • 2507.01001 • Published Jul 1 • 47
Hammer: Robust Function-Calling for On-Device Language Models via Function Masking Paper • 2410.04587 • Published Oct 6, 2024 • 2
Taming the Titans: A Survey of Efficient LLM Inference Serving Paper • 2504.19720 • Published Apr 28 • 12
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents Paper • 2404.10774 • Published Apr 16, 2024 • 5
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving Paper • 2502.07640 • Published Feb 11 • 10
Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving Paper • 2506.17104 • Published Jun 20 • 1
LLMs Will Always Hallucinate, and We Need to Live With This Paper • 2409.05746 • Published Sep 9, 2024 • 6
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16 • 35
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective Paper • 2505.15045 • Published May 21 • 54