Commit f7f6ff7 by JC-Chen (verified) · Parent: 1e93960

Update README.md

Files changed (1): README.md (+15, −4)
README.md CHANGED

@@ -10,8 +10,6 @@ tags:
 - competition
 license: apache-2.0
 pipeline_tag: text-generation
-base_model:
-- Qwen/Qwen3-235B-A22B-Thinking-2507
 ---
 
 <div align="center">
@@ -33,7 +31,7 @@ base_model:
 
 ## Model Description
 
-**P1-235B-A22B** is the flagship model of the P1 series, a state-of-the-art open-source large language model specialized in physics reasoning. Developed through reinforcement learning on physics competition data, this model demonstrates exceptional performance on complex physics problems and marks a historic achievement as the first open-source model to win gold at the International Physics Olympiad (IPhO 2025). **P1-235B-A22B** is obtained from *Qwen3-235B-A22B-Thinking-2507* through multi-stage reinforcement learning fine-tuning on a curated dataset of physics competition problems.
+**P1-235B-A22B** is the flagship model of the P1 series, a state-of-the-art open-source large language model specialized in physics reasoning. Built on *Qwen3-235B-A22B-Thinking-2507* and tuned through multi-stage reinforcement learning on curated physics competition data, P1-235B-A22B marks a historic achievement as the first open-source model to win gold at the International Physics Olympiad (IPhO 2025).
 
 ### Key Highlights
 
@@ -70,6 +68,19 @@ base_model:
 
 </div>
 
+### Generalization to STEM Tasks
+
+P1-235B-A22B demonstrates excellent general capabilities across various benchmarks. As shown below, P1-235B-A22B achieves better performance than its base model Qwen3-235B-A22B-Thinking-2507 on multiple tasks, further validating the strong generalization of P1 series models.
+
+<div align="center">
+
+| Model | AIME24 | AIME25 | HMMT | GPQA | HLE | LiveCodeBench | LiveBench |
+|:-----:|:------:|:------:|:----:|:----:|:---:|:-------------:|:---------:|
+| Qwen3-235B-A22B-Thinking-2507 (Base) | 94.6 | 94.2 | 81.7 | 79.4 | 17.5 | 76.2 | 80.3 |
+| **P1-235B-A22B** | **95.0** | **95.0** | **80.8** | **81.4** | **19.1** | **75.8** | **79.8** |
+
+</div>
+
 
 ## Usage
 
@@ -121,4 +132,4 @@ print(solution)
 }
 ```
 
-</div>
+</div>
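The claim in the added section is "better performance on multiple tasks", not all tasks. A quick sanity-check script (scores copied from the benchmark table in this commit; not part of the committed README) makes the per-benchmark deltas explicit:

```python
# Scores from the "Generalization to STEM Tasks" table added in this commit.
base = {"AIME24": 94.6, "AIME25": 94.2, "HMMT": 81.7, "GPQA": 79.4,
        "HLE": 17.5, "LiveCodeBench": 76.2, "LiveBench": 80.3}
p1 = {"AIME24": 95.0, "AIME25": 95.0, "HMMT": 80.8, "GPQA": 81.4,
      "HLE": 19.1, "LiveCodeBench": 75.8, "LiveBench": 79.8}

# Signed delta (P1 minus base) per benchmark, rounded to one decimal.
deltas = {task: round(p1[task] - base[task], 1) for task in base}

# Benchmarks where P1 strictly improves over the base model.
improved = [task for task, d in deltas.items() if d > 0]

print(deltas)
print(improved)  # AIME24, AIME25, GPQA, HLE
```

This shows gains on AIME24 (+0.4), AIME25 (+0.8), GPQA (+2.0), and HLE (+1.6), with slight regressions on HMMT, LiveCodeBench, and LiveBench — consistent with the hedged "multiple tasks" wording.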