Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Paper • 2511.04703 • Published 18 days ago