Update README.md
README.md
@@ -89,45 +89,29 @@ print(response)
 
 ### 3.1 Arena-Hard-Auto
 
-All results below, except those for `Xwen-
+All results below, except those for `Xwen-7B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
 
 #### 3.1.1 No Style Control
 
-|
-|
-| **Xwen-
-| Qwen2.5-
-|
-| Llama-3.1-
-| Llama-3
-|
-| O1-Preview-2024-09-12 π | **92.0** (Top-1 among π) | (-1.2, 1.0) |
-| O1-Mini-2024-09-12 π | 90.4 | (-1.1, 1.3) |
-| GPT-4-Turbo-2024-04-09 π | 82.6 | (-1.8, 1.5) |
-| GPT-4-0125-Preview π | 78.0 | (-2.1, 2.4) |
-| GPT-4o-2024-08-06 π | 77.9 | (-2.0, 2.1) |
-| Yi-Lightning π | 81.5 | (-1.6, 1.6) |
-| Yi-Largeπ | 63.7 | (-2.6, 2.4) |
-| GLM-4-0520 π | 63.8 | (-2.9, 2.8) |
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** π | **59.4** | (-2.4, 2.1) |
+| Qwen2.5-7B-Instruct π | 50.4 | (-2.9, 2.5) |
+| Gemma-2-27B-IT π | 57.5 | (-2.1, 2.4) |
+| Llama-3.1-8B-Instruct π | 21.3 | (-1.9, 2.2) |
+| Llama-3-8B-Instruct π | 20.6 | (-2.0, 1.9) |
+| Starling-LM-7B-beta π | 23.0 | (-1.8, 1.8) |
 
 #### 3.1.2 Style Control
 
-|
-|
-| **Xwen-
-| Qwen2.5-
-|
-| Llama-3.1-
-| Llama-3
-|
-| O1-Preview-2024-09-12 π | 81.7 | (-2.2, 2.1) |
-| O1-Mini-2024-09-12 π | 79.3 | (-2.8, 2.3) |
-| GPT-4-Turbo-2024-04-09 π | 74.3 | (-2.4, 2.4) |
-| GPT-4-0125-Preview π | 73.6 | (-2.0, 2.0) |
-| GPT-4o-2024-08-06 π | 71.1 | (-2.5, 2.0) |
-| Yi-Lightning π | 66.9 | (-3.3, 2.7) |
-| Yi-Large-Preview π | 65.1 | (-2.5, 2.5) |
-| GLM-4-0520 π | 61.4 | (-2.6, 2.4) |
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** π | **50.3** | (-3.8, 2.8) |
+| Qwen2.5-7B-Instruct π | 46.9 | (-3.1, 2.7) |
+| Gemma-2-27B-IT π | 47.5 | (-2.5, 2.7) |
+| Llama-3.1-8B-Instruct π | 18.3 | (-1.6, 1.6) |
+| Llama-3-8B-Instruct π | 19.8 | (-1.6, 1.9) |
+| Starling-LM-7B-beta π | 26.1 | (-2.6, 2.0) |
 
 
 
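
A note on reading the `95% CIs` column in the added tables: the pairs look like offsets relative to each score rather than absolute bounds. That reading is an assumption; the diff itself does not state it. Under that assumption, a minimal Python sketch converts a row to an absolute interval and checks whether two intervals overlap. The helper names `interval` and `overlaps` are hypothetical and are not part of the README or of Arena-Hard-Auto.

```python
# Scores and CI pairs copied from the added "No Style Control" rows.
# Assumption: each CI pair is (lower offset, upper offset) relative to the score.
ROWS = {
    "Xwen-7B-Chat": (59.4, (-2.4, 2.1)),
    "Qwen2.5-7B-Instruct": (50.4, (-2.9, 2.5)),
    "Gemma-2-27B-IT": (57.5, (-2.1, 2.4)),
}


def interval(score, offsets):
    """Return absolute (low, high) bounds, rounded to one decimal place."""
    low_offset, high_offset = offsets
    return (round(score + low_offset, 1), round(score + high_offset, 1))


def overlaps(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


xwen = interval(*ROWS["Xwen-7B-Chat"])
qwen = interval(*ROWS["Qwen2.5-7B-Instruct"])
gemma = interval(*ROWS["Gemma-2-27B-IT"])

print("Xwen-7B-Chat:", xwen)
print("Overlaps Qwen2.5-7B-Instruct:", overlaps(xwen, qwen))
print("Overlaps Gemma-2-27B-IT:", overlaps(xwen, gemma))
```

If the offset reading holds, the Xwen-7B-Chat interval sits clear of the Qwen2.5-7B-Instruct interval but overlaps the Gemma-2-27B-IT one, so the gap to Gemma-2-27B-IT in this table is best treated as within the reported uncertainty.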