	---
	title: SWE-Bench Verified Discriminative Subsets Leaderboard
	emoji: 🏆
	colorFrom: blue
	colorTo: red
	sdk: gradio
	sdk_version: 5.34.2
	app_file: app.py
	pinned: false
	license: mit
	tags:
	- leaderboard
	- software-engineering
	- swe-bench
	- evaluation
	- benchmark
	---

	# 🏆 SWE-Bench Verified Discriminative Subsets Leaderboard

	This interactive leaderboard shows how SWE-agents perform on SWE-Bench Verified and on four discriminative subsets designed to separate state-of-the-art systems more sharply than the full benchmark does.

	## 🎯 Why Discriminative Subsets?

	As SWE-agents improve and top systems exceed 70% on the full SWE-Bench Verified benchmark, the aggregate score loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.

	## 📊 The Four Discriminative Subsets

	1. 🔥 Frontier Subset (95 instances): Problems solved by ≤5 agents, giving maximum evaluative sensitivity
	   - Combines unsolved, ultra-rare, and very-rare problems
	   - Top agent: 11.6% vs 73.2% on the full benchmark (roughly 6x better discrimination)

	2. ⚡ Challenging Subset (155 instances): Problems solved by ≤20 agents, giving strong evaluative power
	   - Balances discrimination with statistical significance
	   - Includes the frontier problems plus rare and uncommon problems

	3. 💪 Hard Subset (45 instances): All Hard difficulty problems, regardless of solve rate
	   - Traditional difficulty-based evaluation
	   - Focuses on problems originally classified as most difficult

	4. 📁 MultiFile Subset (40 instances): Multi-file problems solved by ≤10 agents
	   - Targets real-world complexity requiring coordinated edits across files
	   - Even leading agents achieve only a 10% success rate

	## 🔬 Methodology

	The subsets were built by analyzing how many of the 83 evaluated SWE-agents solved each instance:
	- Problems solved by fewer agents provide better discrimination
	- Analysis covers submissions from October 2023 to May 2025
	- "Solved" means the agent's fix passed the verification test suite
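The selection rule described above can be sketched in a few lines. This is a minimal illustration on toy data, not the author's actual pipeline: the `results` mapping, the instance IDs, and the helper names `solve_counts` and `subset_by_max_solvers` are all hypothetical.

```python
from collections import Counter


def solve_counts(results):
    """Count how many agents solved each instance.

    `results` is a hypothetical mapping: agent name -> set of solved instance IDs.
    """
    counts = Counter()
    for solved in results.values():
        counts.update(solved)
    return counts


def subset_by_max_solvers(all_instances, counts, max_solvers):
    """Keep instances solved by at most `max_solvers` agents (unsolved ones count as 0)."""
    return sorted(i for i in all_instances if counts.get(i, 0) <= max_solvers)


# Toy data (made up for illustration only).
all_instances = ["repo__a-1", "repo__a-2", "repo__b-3", "repo__c-4"]
results = {
    "agent_x": {"repo__a-1", "repo__a-2"},
    "agent_y": {"repo__a-1"},
    "agent_z": {"repo__a-1", "repo__b-3"},
}

counts = solve_counts(results)
frontier_like = subset_by_max_solvers(all_instances, counts, max_solvers=1)
print(frontier_like)  # ['repo__a-2', 'repo__b-3', 'repo__c-4']
```

With the real submissions, the same idea would use thresholds such as ≤5 or ≤20 solvers out of the 83 agents.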

	## 📈 Key Insights

	- Enhanced Resolution: The Frontier subset provides roughly 6x better discrimination between top systems
	- Multi-file Complexity: The MultiFile subset represents genuine software engineering challenges
	- Statistical Significance: The Challenging subset offers robust evaluation with strong discrimination
	- Real Progress: Performance on these subsets indicates genuine capability advances
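One plausible reading of the "6x" figure is the ratio between the top agent's score on the full benchmark and its score on the Frontier subset. The README does not spell out the calculation, so the snippet below is only a sanity check under that assumption, using the numbers quoted above.

```python
# Figures quoted in this README.
full_rate = 73.2       # top agent, full SWE-Bench Verified (%)
frontier_rate = 11.6   # same agent, Frontier subset (%)
frontier_size = 95     # instances in the Frontier subset

# Assumption: "6x" refers to the ratio of the two resolve rates.
ratio = full_rate / frontier_rate

# Instance count implied by the quoted percentage (derived here, not stated in the source).
implied_resolved = round(frontier_rate / 100 * frontier_size)

print(f"{ratio:.1f}x")     # 6.3x
print(implied_resolved)    # 11
```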

	## 🔗 Resources

	- Blog Post: [From 73% to 11%: Revealing True SWE-Agent Capabilities](https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html)
	- Dataset: [SWE-bench_Verified-discriminative](https://huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative)
	- Original SWE-Bench: [SWE-bench.com](https://www.swebench.com/)

	## 🚀 Usage

	```python
	from datasets import load_dataset

	# Load individual discriminative subsets by split name
	frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
	challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
	```
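Once a subset is loaded, an agent's score on it is the share of subset instances it resolved. The helper below is a hedged sketch: `resolve_rate` is a hypothetical name, and it assumes each row carries an `instance_id` field as in the original SWE-bench data. It runs on toy rows so that no download is needed; a loaded split such as `frontier` could be passed in place of `toy_subset`.

```python
def resolve_rate(instances, resolved_ids):
    """Percentage of `instances` whose `instance_id` is in `resolved_ids`.

    `instances` is any iterable of dict-like rows (for example, a loaded split).
    """
    ids = [row["instance_id"] for row in instances]
    if not ids:
        return 0.0
    hits = sum(1 for i in ids if i in resolved_ids)
    return 100.0 * hits / len(ids)


# Toy rows and a toy set of resolved IDs (made up for illustration only).
toy_subset = [
    {"instance_id": "repo__a-1"},
    {"instance_id": "repo__a-2"},
    {"instance_id": "repo__b-3"},
    {"instance_id": "repo__c-4"},
]
toy_resolved = {"repo__a-2", "repo__x-9"}  # IDs outside the subset are ignored

rate = resolve_rate(toy_subset, toy_resolved)
print(rate)  # 25.0
```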

	Created by [Jatin Ganhotra](https://jatinganhotra.dev) | Last Updated: June 19, 2025