Commit 3fef353
Parent: 1f16131
fix checkpoint count and shorten intro section

README.md CHANGED
@@ -15,8 +15,8 @@ interpretability research. It contains two sets of eight models of sizes
 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
 models: one trained on the Pile, and one trained on the Pile after the dataset
 has been globally deduplicated. All 8 model sizes are trained on the exact
-same data, in the exact same order.
-
+same data, in the exact same order. We also provide 154 intermediate
+checkpoints per model, hosted on Hugging Face as branches.
 
 The Pythia model suite was deliberately designed to promote scientific
 research on large language models, especially interpretability research.

@@ -24,20 +24,25 @@ Despite not centering downstream performance as a design goal, we find the
 models <a href="#evaluations">match or exceed</a> the performance of
 similar and same-sized models, such as those in the OPT and GPT-Neo suites.
 
+<details>
+<summary style="font-weight: 600">Past early release and naming convention.</summary>
+
 Previously, we released an early version of the Pythia suite to the public.
 However, we decided to retrain the model suite to address a few hyperparameter
 discrepancies. This model card <a href="#changelog">lists the changes</a>;
 see appendix B in the Pythia paper for further discussion. We found no
 difference in benchmark performance between the two Pythia versions.
 The old models are
-[still available](https://huggingface.co/models?other=pythia_v0)
-
+[still available](https://huggingface.co/models?other=pythia_v0), but we
+suggest the retrained suite if you are just starting to use Pythia.<br>
 **This is the current release.**
 
 Please note that all models in the *Pythia* suite were renamed in January
 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
 comparing the old and new names</a> is provided in this model card, together
 with exact parameter counts.
+</details>
+<br>
 
 # Pythia-12B
 

@@ -80,11 +85,12 @@ non-embedding parameters.</figcaption>
 
 The primary intended use of Pythia is research on the behavior, functionality,
 and limitations of large language models. This suite is intended to provide
-a controlled setting for performing scientific experiments.
-
-143 evenly
-hosted on Hugging Face as branches. Note
-exactly to the model checkpoint on the `main`
+a controlled setting for performing scientific experiments. We also provide
+154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
+`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
+`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
+that branch `143000` corresponds exactly to the model checkpoint on the `main`
+branch of each model.
 
 You may also further fine-tune and adapt Pythia-12B for deployment,
 as long as your use is in accordance with the Apache 2.0 license. Pythia

@@ -108,7 +114,7 @@ language models are commonly deployed, such as writing genre prose,
 or commercial chatbots. This means Pythia-12B will **not**
 respond to a given prompt the way a product like ChatGPT does. This is because,
 unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
-Learning from Human Feedback (RLHF) to better “
+Learning from Human Feedback (RLHF) to better “follow” human instructions.
 
 ### Limitations and biases
 

@@ -181,7 +187,9 @@ The Pile was **not** deduplicated before being used to train Pythia-12B.
 
 All models were trained on the exact same data, in the exact same order. Each
 model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
-model are saved every 2,097,152,000 tokens, spaced evenly throughout training
+model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
+from `step1000` to `step143000` (which is the same as `main`). In addition, we
+also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
 This corresponds to training for just under 1 epoch on the Pile for
 non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
 

@@ -198,7 +206,7 @@ Pythia uses the same tokenizer as [GPT-NeoX-
 All 16 *Pythia* models were evaluated using the [LM Evaluation
 Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
 the results by model and step at `results/json/*` in the [GitHub
-repository](https://github.com/EleutherAI/pythia/tree/main/results/json/
+repository](https://github.com/EleutherAI/pythia/tree/main/results/json/).<br>
 Expand the sections below to see plots of evaluation results for all
 Pythia and Pythia-deduped models compared with OPT and BLOOM.
 
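The corrected numbers in this commit are internally consistent, which a short Python sketch can verify (the function and variable names below are mine for illustration, not part of the Pythia repository):

```python
# Sanity-check the checkpoint scheme and token counts stated in the updated
# model card text. All names here are hypothetical, chosen for this sketch.

def pythia_checkpoint_branches():
    """Enumerate the branch names described in the card: `step0`, ten
    log-spaced checkpoints `step{1,2,4,...,512}`, and 143 evenly spaced
    checkpoints from `step1000` to `step143000` (every 1000 steps)."""
    steps = [0] + [2 ** i for i in range(10)] + list(range(1000, 144000, 1000))
    return [f"step{s}" for s in steps]

branches = pythia_checkpoint_branches()
print(len(branches))  # 154, matching the corrected checkpoint count

# The evenly spaced checkpoints are 2,097,152,000 tokens apart; 143 such
# intervals reproduce the total token count given in the card.
tokens_per_interval = 2_097_152_000
print(tokens_per_interval * 143)  # 299892736000
```

Since the card says each checkpoint is hosted as a branch on Hugging Face, these branch names are what you would pass as the `revision` argument when loading a specific checkpoint from the Hub.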