Measuring what Matters: Construct Validity in Large Language Model Benchmarks Paper • 2511.04703 • Published 19 days ago • 7
Clinical knowledge in LLMs does not translate to human interactions Paper • 2504.18919 • Published Apr 26 • 26
LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation Paper • 2503.02972 • Published Mar 4 • 25
Evaluating the role of 'Constitutions' for learning from AI feedback Paper • 2411.10168 • Published Nov 15, 2024 • 5