Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs Paper • 2510.18279 • Published 19 days ago • 4
Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation Paper • 2508.13144 • Published Aug 18
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge Paper • 2404.06664 • Published Apr 10, 2024 • 1
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs Paper • 2410.02677 • Published Oct 3, 2024 • 1
Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning Paper • 2502.14860 • Published Feb 20
MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning Paper • 2406.00922 • Published Jun 3, 2024
PrefPalette: Personalized Preference Modeling with Latent Attributes Paper • 2507.13541 • Published Jul 17 • 8
Medical Hallucinations in Foundation Models and Their Impact on Healthcare Paper • 2503.05777 • Published Feb 26
Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning Paper • 2305.19759 • Published May 31, 2023
Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It Paper • 2510.00177 • Published Sep 30 • 3
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning Paper • 2508.19202 • Published Aug 26 • 7
MolmoAct: Action Reasoning Models that can Reason in Space Paper • 2508.07917 • Published Aug 11 • 44
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge Paper • 1803.05457 • Published Mar 14, 2018 • 2
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering Paper • 1809.02789 • Published Sep 8, 2018
From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project Paper • 1909.01958 • Published Sep 4, 2019
Probing Natural Language Inference Models through Semantic Fragments Paper • 1909.07521 • Published Sep 16, 2019