Update README.md
**Dobby-Unhinged-Llama-3.3-70B** retains the base performance of Llama-3.3-70B-Instruct across the evaluated tasks.

We use lm-eval-harness to compare performance across the models:

| Benchmark                                        | Llama3.3-70B-Instruct | Dobby-Unhinged-Llama-3.3-70B |
|--------------------------------------------------|-----------------------|------------------------------|
| IFEVAL (inst_level_strict/loose avg)             | 0.9340                | 0.8543                       |
| MMLU-pro                                         | 0.5474                | 0.5499                       |
| GPQA (average among diamond, extended and main)  | 0.3838                | 0.3939                       |
| MuSR                                             | 0.4881                | 0.5053                       |
| BBH (average across all tasks)                   | 0.7018                | 0.7021                       |
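A run like the one above can be reproduced with the lm-eval-harness CLI. The sketch below is illustrative: the Hugging Face model path and the exact task names are assumptions (task names vary between harness versions, so check your installed task registry with `lm_eval --tasks list`).

```shell
# Illustrative lm-eval-harness invocation; model path and task names are
# assumptions and may need adjusting for your harness version.
lm_eval --model hf \
  --model_args pretrained=SentientAGI/Dobby-Unhinged-Llama-3.3-70B,dtype=bfloat16 \
  --tasks ifeval,mmlu_pro,gpqa,musr,bbh \
  --batch_size auto \
  --output_path results/
```

Running the same command with `pretrained=meta-llama/Llama-3.3-70B-Instruct` produces the baseline column for comparison.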
### Freedom Bench