Add HF leaderboard eval comparison
README.md
CHANGED
@@ -91,7 +91,63 @@ print(response)
 <br>
 
 ## Benchmarks
-We report in the following table our internal pipeline benchmarks.
+We report the official HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) normalized evaluation scores in the following table.
+<table border="1" style="width: 100%; text-align: center; border-collapse: collapse;">
+    <colgroup>
+        <col style="width: 10%;">
+        <col style="width: 7%;">
+        <col style="width: 7%;">
+        <col style="background-color: rgba(80, 15, 213, 0.5); width: 7%;">
+    </colgroup>
+    <thead>
+        <tr>
+            <th>Benchmark</th>
+            <th>Llama-3.1-8B-Instruct</th>
+            <th>Qwen2.5-7B-Instruct</th>
+            <th>Falcon3-7B-Instruct</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td>IFEval</td>
+            <td><b>78.56</b></td>
+            <td>75.85</td>
+            <td>76.12</td>
+        </tr>
+        <tr>
+            <td>BBH (3-shot)</td>
+            <td>29.89</td>
+            <td>34.89</td>
+            <td><b>37.92</b></td>
+        </tr>
+        <tr>
+            <td>MATH Lvl-5 (4-shot)</td>
+            <td>19.34</td>
+            <td>0.00</td>
+            <td><b>31.87</b></td>
+        </tr>
+        <tr>
+            <td>GPQA (0-shot)</td>
+            <td>2.35</td>
+            <td>5.48</td>
+            <td><b>8.05</b></td>
+        </tr>
+        <tr>
+            <td>MUSR (0-shot)</td>
+            <td>8.41</td>
+            <td>8.45</td>
+            <td><b>21.17</b></td>
+        </tr>
+        <tr>
+            <td>MMLU-PRO (5-shot)</td>
+            <td>30.68</td>
+            <td><b>36.52</b></td>
+            <td>34.30</td>
+        </tr>
+    </tbody>
+</table>
+
+Also, we report our internal pipeline benchmarks in the following table.
 - We use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
 - We report **raw scores** obtained by applying the chat template and `fewshot_as_multiturn`.
 - We use the same batch size across all models.
@@ -231,6 +287,9 @@ We report in the following table our internal pipeline benchmarks.
 </tbody>
 </table>
 
+## Useful links
+- View our [release blogpost](https://huggingface.co/blog/falcon3).
+- Feel free to join [our Discord server](https://discord.gg/fwXpMyGc) if you have any questions or want to interact with our researchers and developers.
 
 ## Technical Report
 Coming soon...
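For reference, the internal-pipeline settings described in the bullets above (chat template, `fewshot_as_multiturn`, fixed batch size) can be reproduced with a minimal lm-evaluation-harness sketch like the one below. It assumes a recent harness release (v0.4.3 or later, where `simple_evaluate` exposes `apply_chat_template` and `fewshot_as_multiturn`); the task names and batch size are illustrative placeholders, not the exact configuration behind these tables.

```python
# Hedged sketch: raw-score evaluation with lm-evaluation-harness.
# Assumes lm-eval >= 0.4.3; task names and batch size are illustrative,
# not the exact configuration used for the tables above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=tiiuae/Falcon3-7B-Instruct,dtype=bfloat16",
    tasks=["leaderboard_ifeval", "leaderboard_bbh"],  # illustrative subset
    batch_size=8,               # keep identical across all compared models
    apply_chat_template=True,   # run prompts through the model's chat template
    fewshot_as_multiturn=True,  # present few-shot examples as prior chat turns
)

# Print the raw per-task metrics.
for task, metrics in results["results"].items():
    print(task, metrics)
```

On the command line, the same two settings correspond to the harness flags `--apply_chat_template` and `--fewshot_as_multiturn`.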