# RoBERTa Pretrained on Smaller Datasets
We pretrain RoBERTa on smaller datasets (1M, 10M, 100M, and 1B tokens). For each pretraining data size, we release the 3 models with the lowest validation perplexities out of 25 runs (or 10 runs in the case of 1B tokens). The pretraining data reproduces that of BERT: we combine English Wikipedia and a reproduction of BookCorpus built from Smashwords texts, in an approximate 3:1 ratio.
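As a usage sketch, the released checkpoints can be loaded with the Hugging Face `transformers` library in the usual way; the model identifier below is a hypothetical placeholder, not necessarily the name of a released checkpoint.

```python
# Minimal loading sketch. "org/roberta-med-small-1M" is a hypothetical
# identifier used for illustration only; substitute an actual released name.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "org/roberta-med-small-1M"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
outputs = model(**inputs)  # outputs.logits: (1, seq_len, vocab_size)
```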
## Hyperparameters and Validation Perplexity
The hyperparameters and validation perplexities corresponding to each released model are listed below. The architectural hyperparameters for the two model sizes are as follows:
| Model Size | L | AH | HS | FFN | P |
|------------|---|----|----|-----|---|
| BASE | 12 | 12 | 768 | 3072 | 125M |
| MED-SMALL | 6 | 8 | 512 | 2048 | 45M |
(L = number of layers; AH = number of attention heads; HS = hidden size; FFN = feedforward network dimension; P = number of parameters.)
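As a rough sanity check on the parameter counts in the table, both architectures can be instantiated with `transformers.RobertaConfig`. The 50,265-token vocabulary is RoBERTa's standard BPE vocabulary and is an assumption here; other settings fall back to the library defaults.

```python
# Sketch: build both architectures from the table and count their parameters.
# Assumes RoBERTa's standard 50,265-token BPE vocabulary.
from transformers import RobertaConfig, RobertaForMaskedLM

sizes = {
    "BASE":      dict(num_hidden_layers=12, num_attention_heads=12,
                      hidden_size=768, intermediate_size=3072),
    "MED-SMALL": dict(num_hidden_layers=6, num_attention_heads=8,
                      hidden_size=512, intermediate_size=2048),
}

for name, arch in sizes.items():
    config = RobertaConfig(vocab_size=50265, **arch)
    model = RobertaForMaskedLM(config)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")  # roughly 125M and 45M
```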
For the other hyperparameters, we select the following values (see the sketch after this list):
- Peak learning rate: 5e-4
- Warmup steps: 6% of max steps
- Dropout: 0.1
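Below is a minimal sketch of how these optimization settings could be wired up with PyTorch and `transformers`. The total step count is not given in this excerpt, so `max_steps` is a placeholder, and the linear decay after warmup is an assumption following standard RoBERTa practice.

```python
# Sketch: peak LR 5e-4, linear warmup over 6% of max steps, dropout 0.1.
# `max_steps` is a placeholder; the per-run step counts are not listed here.
import torch
from transformers import (RobertaConfig, RobertaForMaskedLM,
                          get_linear_schedule_with_warmup)

max_steps = 100_000  # placeholder, set per pretraining-data size

config = RobertaConfig(
    num_hidden_layers=6, num_attention_heads=8,      # MED-SMALL architecture
    hidden_size=512, intermediate_size=2048,
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
)
model = RobertaForMaskedLM(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.06 * max_steps),  # 6% of max steps
    num_training_steps=max_steps,
)
```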