Model Card
- Source: https://arxiv.org/abs/2509.02046
- Optimizer: sophia
- Model size: 130m
- Data size: 21B
Best configuration
| Hyperparameter | Value |
|---|---|
| beta1 | 0.95 |
| beta2 | 0.95 |
| epsilon | 1e-07 |
| gamma | 0.0125 |
| learning_rate | 0.002 |
| max_grad_norm | 1 |
| min_lr_ratio | 0 |
| train_batch_size | 128 |
| warmup | 4000 |
| weight_decay | 0.2 |
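
As a rough illustration of how `learning_rate`, `warmup`, and `min_lr_ratio` from the table could interact, the sketch below collects the values in a Python dictionary and derives a learning-rate schedule from them. It rests on several assumptions that this card does not state: that `warmup` is counted in optimizer steps, that warmup is linear, and that the decay after warmup is cosine-shaped. The names `BEST_CONFIG` and `lr_at`, and the `total_steps` argument, are hypothetical and are not part of any library.

```python
import math

# Values copied from the table above.
BEST_CONFIG = {
    "beta1": 0.95,
    "beta2": 0.95,
    "epsilon": 1e-07,
    "gamma": 0.0125,
    "learning_rate": 0.002,
    "max_grad_norm": 1,
    "min_lr_ratio": 0,
    "train_batch_size": 128,
    "warmup": 4000,
    "weight_decay": 0.2,
}


def lr_at(step, total_steps, cfg=BEST_CONFIG):
    """Learning rate at `step` (hypothetical helper).

    Assumes linear warmup over cfg["warmup"] steps, then cosine decay
    down to learning_rate * min_lr_ratio at `total_steps`. The decay
    shape and the unit of `warmup` are assumptions, not stated in the card.
    """
    peak = cfg["learning_rate"]
    warmup = cfg["warmup"]
    floor = peak * cfg["min_lr_ratio"]
    if step < warmup:
        # Linear ramp from 0 up to the peak learning rate.
        return peak * step / warmup
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

With `min_lr_ratio` set to 0, this schedule would decay all the way to zero by the final step. The value of `total_steps` depends on the sequence length, which the card does not give, so it is left as a caller-supplied argument.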