sky-2002 committed on Feb 26

Commit 1d71deb · verified · 1 Parent(s): 7f9d9e4

Upload folder using huggingface_hub

Files changed (40)

README.md +10 -1
checkpoint-100/README.md +202 -0
checkpoint-100/adapter_config.json +37 -0
checkpoint-100/adapter_model.safetensors +3 -0
checkpoint-100/merges.txt +0 -0
checkpoint-100/optimizer.pt +3 -0
checkpoint-100/rng_state.pth +3 -0
checkpoint-100/scheduler.pt +3 -0
checkpoint-100/special_tokens_map.json +28 -0
checkpoint-100/tokenizer.json +0 -0
checkpoint-100/tokenizer_config.json +156 -0
checkpoint-100/trainer_state.json +1233 -0
checkpoint-100/training_args.bin +3 -0
checkpoint-100/vocab.json +0 -0
checkpoint-200/README.md +202 -0
checkpoint-200/adapter_config.json +37 -0
checkpoint-200/adapter_model.safetensors +3 -0
checkpoint-200/merges.txt +0 -0
checkpoint-200/optimizer.pt +3 -0
checkpoint-200/rng_state.pth +3 -0
checkpoint-200/scheduler.pt +3 -0
checkpoint-200/special_tokens_map.json +28 -0
checkpoint-200/tokenizer.json +0 -0
checkpoint-200/tokenizer_config.json +156 -0
checkpoint-200/trainer_state.json +2433 -0
checkpoint-200/training_args.bin +3 -0
checkpoint-200/vocab.json +0 -0
checkpoint-294/README.md +202 -0
checkpoint-294/adapter_config.json +37 -0
checkpoint-294/adapter_model.safetensors +3 -0
checkpoint-294/merges.txt +0 -0
checkpoint-294/optimizer.pt +3 -0
checkpoint-294/rng_state.pth +3 -0
checkpoint-294/scheduler.pt +3 -0
checkpoint-294/special_tokens_map.json +28 -0
checkpoint-294/tokenizer.json +0 -0
checkpoint-294/tokenizer_config.json +156 -0
checkpoint-294/trainer_state.json +0 -0
checkpoint-294/training_args.bin +3 -0
checkpoint-294/vocab.json +0 -0

README.md CHANGED

@@ -12,7 +12,16 @@ licence: license
 # Model Card for SmolLM2-360M-GRPO-v1
 This model is a fine-tuned version of [HuggingFaceTB/SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M).
-It has been trained using [TRL](https://github.com/huggingface/trl).
+It has been trained using [TRL](https://github.com/huggingface/trl) and using the `lamini/taylor_swift` dataset.
+## Evals
+Referring this [blog post](https://datawizz.ai/blog/grpo-fine-tuning-qwen-0-5b-vs-openai-o1-preview), used a similar evaluation method:
+| Model | Average ROUGE-L |
+|-------|-----------------|
+| Qwen-0.5B | 0.3313 |
+| SmolLM2-360M-GRPO-v0 | 0.1644 |
+| SmolLM2-360M-GRPO-v1 | 0.1672 |
 ## Quick start
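
The Evals table in the README hunk reports an average ROUGE-L per model, but the diff does not include the scoring code. The following is a minimal sketch of how such an average could be computed, assuming lowercased whitespace tokenization and the balanced F-measure over the longest common subsequence; the function names are hypothetical and the linked blog post's exact method may differ.

```python
from typing import Iterable, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference and a candidate string.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_rouge_l(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean ROUGE-L F1 over (reference, candidate) pairs."""
    scores = [rouge_l_f1(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

A score near 0.17, as reported for the two SmolLM2 runs, would mean that on average only a small share of each reference's word order is recovered by the generated answer.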

checkpoint-100/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: HuggingFaceTB/SmolLM2-360M
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.14.0

checkpoint-100/adapter_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
+  "bias": "none",
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "k_proj",
+    "up_proj",
+    "v_proj",
+    "gate_proj",
+    "q_proj",
+    "down_proj",
+    "o_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "use_dora": false,
+  "use_rslora": false
+}
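
As a rough sanity check on this configuration, the sketch below derives the LoRA scaling factor and the number of trainable adapter parameters. `r`, `lora_alpha`, and the target modules come from the JSON above; the layer shapes (hidden size 960, MLP size 2560, key/value width 320, 32 layers) are assumptions about SmolLM2-360M that are not stated in this diff, and the helper name is hypothetical.

```python
# Values copied from adapter_config.json in this commit.
R = 32
LORA_ALPHA = 64

# ASSUMPTION: SmolLM2-360M shapes; they do not appear anywhere in this diff.
HIDDEN = 960
INTERMEDIATE = 2560
KV_DIM = 320  # assumed 5 key/value heads x head dimension 64
NUM_LAYERS = 32

# (in_features, out_features) for each module listed in "target_modules".
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, r, num_layers):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


# With "use_rslora": false, the low-rank update is scaled by lora_alpha / r.
scaling = LORA_ALPHA / R
params = lora_param_count(SHAPES, R, NUM_LAYERS)
fp32_bytes = params * 4

print(f"scaling={scaling}, params={params:,}, fp32 bytes={fp32_bytes:,}")
```

Under these assumed shapes the estimate is 69,468,160 bytes, slightly below the 69,527,352 bytes recorded for `adapter_model.safetensors` in this commit, which would be consistent with fp32 adapter weights plus a file header.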

checkpoint-100/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ab6ba06359a4e33a6bb9d0489ac289f115e229adca4ecd0ce56d3be06610b366
+size 69527352

checkpoint-100/merges.txt ADDED

The diff for this file is too large to render. See raw diff

checkpoint-100/optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cab295c214b888cda14e25fe86824c274e9a7c8e3dd675a86b7d824667b53c5d
+size 139313234

checkpoint-100/rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:281fc5b575006c17429df6952a63f0dd26d4c1f124d999e9e3d9113e6e8c1045
+size 14244

checkpoint-100/scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:691263e55c7bb34b6e090bd89cf079f362100d7bd758be475bd3e7091c1a4ed6
+size 1064

checkpoint-100/special_tokens_map.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": "<|im_start|>",
+  "eos_token": "<|im_end|>",
+  "pad_token": "<|im_end|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

checkpoint-100/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

checkpoint-100/tokenizer_config.json ADDED

	@@ -0,0 +1,156 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "bos_token": "<|im_start|>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "extra_special_tokens": {},
+  "model_max_length": 8192,
+  "pad_token": "<|im_end|>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
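
To make the `chat_template` above easier to read, here is a plain-Python sketch of what that Jinja template renders. The function name `render_chat` is hypothetical; in practice the tokenizer applies the template itself, and this only mirrors the string logic shown in the config.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Mirror of the Jinja chat_template in tokenizer_config.json."""
    parts = []
    for message in messages:
        # Each turn: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "Name one Taylor Swift album."}],
    add_generation_prompt=True,
)
print(prompt)
```

Because `eos_token` and `pad_token` are both `<|im_end|>` in this config, generation would normally stop at the end of the assistant turn.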

checkpoint-100/trainer_state.json ADDED

	@@ -0,0 +1,1233 @@

+{
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 1.0204081632653061,
+  "eval_steps": 500,
+  "global_step": 100,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "completion_length": 160.0,
+      "epoch": 0.01020408163265306,
+      "grad_norm": 0.1883362978696823,
+      "kl": 0.0,
+      "learning_rate": 1.0000000000000002e-06,
+      "loss": 0.0,
+      "reward": 0.35154566168785095,
+      "reward_std": 0.18836843967437744,
+      "rewards/<lambda>": 0.35154566168785095,
+      "step": 1
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.02040816326530612,
+      "grad_norm": 0.08015607297420502,
+      "kl": 0.0,
+      "learning_rate": 2.0000000000000003e-06,
+      "loss": 0.0,
+      "reward": 0.24049028754234314,
+      "reward_std": 0.23294341564178467,
+      "rewards/<lambda>": 0.24049028754234314,
+      "step": 2
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.030612244897959183,
+      "grad_norm": 0.09276824444532394,
+      "kl": 0.0009410024504177272,
+      "learning_rate": 3e-06,
+      "loss": 0.0,
+      "reward": 0.32676321268081665,
+      "reward_std": 0.25492924451828003,
+      "rewards/<lambda>": 0.32676321268081665,
+      "step": 3
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.04081632653061224,
+      "grad_norm": 0.10654664784669876,
+      "kl": 0.001077650347724557,
+      "learning_rate": 4.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.2009219378232956,
+      "reward_std": 0.21452507376670837,
+      "rewards/<lambda>": 0.2009219378232956,
+      "step": 4
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.05102040816326531,
+      "grad_norm": 0.11217533051967621,
+      "kl": 0.001070915488526225,
+      "learning_rate": 5e-06,
+      "loss": 0.0,
+      "reward": 0.3001704216003418,
+      "reward_std": 0.24798467755317688,
+      "rewards/<lambda>": 0.3001704216003418,
+      "step": 5
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.061224489795918366,
+      "grad_norm": 0.09305333346128464,
+      "kl": 0.0010154233314096928,
+      "learning_rate": 6e-06,
+      "loss": 0.0,
+      "reward": 0.2345525473356247,
+      "reward_std": 0.22145482897758484,
+      "rewards/<lambda>": 0.2345525473356247,
+      "step": 6
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.07142857142857142,
+      "grad_norm": 0.13233615458011627,
+      "kl": 0.0011612953385338187,
+      "learning_rate": 7.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.3054887056350708,
+      "reward_std": 0.21511563658714294,
+      "rewards/<lambda>": 0.3054887056350708,
+      "step": 7
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.08163265306122448,
+      "grad_norm": 0.1244489848613739,
+      "kl": 0.000984564539976418,
+      "learning_rate": 8.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.27341318130493164,
+      "reward_std": 0.23464292287826538,
+      "rewards/<lambda>": 0.27341318130493164,
+      "step": 8
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.09183673469387756,
+      "grad_norm": 0.09130726009607315,
+      "kl": 0.0008879535598680377,
+      "learning_rate": 9e-06,
+      "loss": 0.0,
+      "reward": 0.3593016564846039,
+      "reward_std": 0.2495943158864975,
+      "rewards/<lambda>": 0.3593016564846039,
+      "step": 9
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.10204081632653061,
+      "grad_norm": 0.15256966650485992,
+      "kl": 0.0011160458670929074,
+      "learning_rate": 1e-05,
+      "loss": 0.0,
+      "reward": 0.3049771785736084,
+      "reward_std": 0.26634567975997925,
+      "rewards/<lambda>": 0.3049771785736084,
+      "step": 10
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.11224489795918367,
+      "grad_norm": 0.09210026264190674,
+      "kl": 0.0010061098728328943,
+      "learning_rate": 1.1000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.2637690603733063,
+      "reward_std": 0.21869879961013794,
+      "rewards/<lambda>": 0.2637690603733063,
+      "step": 11
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.12244897959183673,
+      "grad_norm": 0.09713605791330338,
+      "kl": 0.0010679092956706882,
+      "learning_rate": 1.2e-05,
+      "loss": 0.0,
+      "reward": 0.36334261298179626,
+      "reward_std": 0.24027395248413086,
+      "rewards/<lambda>": 0.36334261298179626,
+      "step": 12
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.1326530612244898,
+      "grad_norm": 0.09611018002033234,
+      "kl": 0.000932489987462759,
+      "learning_rate": 1.3000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.2559860944747925,
+      "reward_std": 0.25332069396972656,
+      "rewards/<lambda>": 0.2559860944747925,
+      "step": 13
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.14285714285714285,
+      "grad_norm": 0.11913401633501053,
+      "kl": 0.0010431472910568118,
+      "learning_rate": 1.4000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.2593250274658203,
+      "reward_std": 0.25670933723449707,
+      "rewards/<lambda>": 0.2593250274658203,
+      "step": 14
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.15306122448979592,
+      "grad_norm": 0.09773930162191391,
+      "kl": 0.0009058397263288498,
+      "learning_rate": 1.5e-05,
+      "loss": 0.0,
+      "reward": 0.3413263261318207,
+      "reward_std": 0.23378777503967285,
+      "rewards/<lambda>": 0.3413263261318207,
+      "step": 15
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.16326530612244897,
+      "grad_norm": 0.10632280260324478,
+      "kl": 0.000995107227936387,
+      "learning_rate": 1.6000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.3327138423919678,
+      "reward_std": 0.2463734745979309,
+      "rewards/<lambda>": 0.3327138423919678,
+      "step": 16
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.17346938775510204,
+      "grad_norm": 0.0903797596693039,
+      "kl": 0.0008249408565461636,
+      "learning_rate": 1.7000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.25679031014442444,
+      "reward_std": 0.2034485787153244,
+      "rewards/<lambda>": 0.25679031014442444,
+      "step": 17
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.1836734693877551,
+      "grad_norm": 0.09401573985815048,
+      "kl": 0.0011675208806991577,
+      "learning_rate": 1.8e-05,
+      "loss": 0.0,
+      "reward": 0.24813005328178406,
+      "reward_std": 0.21580623090267181,
+      "rewards/<lambda>": 0.24813005328178406,
+      "step": 18
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.19387755102040816,
+      "grad_norm": 0.12456843256950378,
+      "kl": 0.0010836416622623801,
+      "learning_rate": 1.9e-05,
+      "loss": 0.0,
+      "reward": 0.3487837314605713,
+      "reward_std": 0.220088928937912,
+      "rewards/<lambda>": 0.3487837314605713,
+      "step": 19
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.20408163265306123,
+      "grad_norm": 0.0994577631354332,
+      "kl": 0.0012541261967271566,
+      "learning_rate": 2e-05,
+      "loss": 0.0001,
+      "reward": 0.2757805585861206,
+      "reward_std": 0.23097757995128632,
+      "rewards/<lambda>": 0.2757805585861206,
+      "step": 20
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.21428571428571427,
+      "grad_norm": 0.1026805192232132,
+      "kl": 0.0013662949204444885,
+      "learning_rate": 2.1e-05,
+      "loss": 0.0001,
+      "reward": 0.3016376793384552,
+      "reward_std": 0.26067519187927246,
+      "rewards/<lambda>": 0.3016376793384552,
+      "step": 21
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.22448979591836735,
+      "grad_norm": 0.11243847012519836,
+      "kl": 0.0012734383344650269,
+      "learning_rate": 2.2000000000000003e-05,
+      "loss": 0.0001,
+      "reward": 0.34152576327323914,
+      "reward_std": 0.24704596400260925,
+      "rewards/<lambda>": 0.34152576327323914,
+      "step": 22
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.23469387755102042,
+      "grad_norm": 0.11959439516067505,
+      "kl": 0.001076711225323379,
+      "learning_rate": 2.3000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.25360214710235596,
+      "reward_std": 0.24721531569957733,
+      "rewards/<lambda>": 0.25360214710235596,
+      "step": 23
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.24489795918367346,
+      "grad_norm": 0.12651875615119934,
+      "kl": 0.001166851376183331,
+      "learning_rate": 2.4e-05,
+      "loss": 0.0,
+      "reward": 0.23422789573669434,
+      "reward_std": 0.22290775179862976,
+      "rewards/<lambda>": 0.23422789573669434,
+      "step": 24
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.25510204081632654,
+      "grad_norm": 0.09411771595478058,
+      "kl": 0.0009251383016817272,
+      "learning_rate": 2.5e-05,
+      "loss": 0.0,
+      "reward": 0.3483571410179138,
+      "reward_std": 0.22439202666282654,
+      "rewards/<lambda>": 0.3483571410179138,
+      "step": 25
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2653061224489796,
+      "grad_norm": 0.10857303440570831,
+      "kl": 0.0013178512454032898,
+      "learning_rate": 2.6000000000000002e-05,
+      "loss": 0.0001,
+      "reward": 0.27066630125045776,
+      "reward_std": 0.25778770446777344,
+      "rewards/<lambda>": 0.27066630125045776,
+      "step": 26
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2755102040816326,
+      "grad_norm": 0.10724504292011261,
+      "kl": 0.0011154217645525932,
+      "learning_rate": 2.7000000000000002e-05,
+      "loss": 0.0,
+      "reward": 0.29464977979660034,
+      "reward_std": 0.22159242630004883,
+      "rewards/<lambda>": 0.29464977979660034,
+      "step": 27
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2857142857142857,
+      "grad_norm": 0.14677317440509796,
+      "kl": 0.0010433748830109835,
+      "learning_rate": 2.8000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.3109304904937744,
+      "reward_std": 0.26592540740966797,
+      "rewards/<lambda>": 0.3109304904937744,
+      "step": 28
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.29591836734693877,
+      "grad_norm": 0.12714359164237976,
+      "kl": 0.001189422095194459,
+      "learning_rate": 2.9e-05,
+      "loss": 0.0,
+      "reward": 0.3446485698223114,
+      "reward_std": 0.22104832530021667,
+      "rewards/<lambda>": 0.3446485698223114,
+      "step": 29
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.30612244897959184,
+      "grad_norm": 0.093221515417099,
+      "kl": 0.0011938215466216207,
+      "learning_rate": 3e-05,
+      "loss": 0.0,
+      "reward": 0.3021507263183594,
+      "reward_std": 0.26753896474838257,
+      "rewards/<lambda>": 0.3021507263183594,
+      "step": 30
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3163265306122449,
+      "grad_norm": 0.10894666612148285,
+      "kl": 0.0014481329126283526,
+      "learning_rate": 3.1e-05,
+      "loss": 0.0001,
+      "reward": 0.22669407725334167,
+      "reward_std": 0.21502387523651123,
+      "rewards/<lambda>": 0.22669407725334167,
+      "step": 31
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.32653061224489793,
+      "grad_norm": 0.11627420037984848,
+      "kl": 0.0011792851146310568,
+      "learning_rate": 3.2000000000000005e-05,
+      "loss": 0.0,
+      "reward": 0.2387671172618866,
+      "reward_std": 0.20545344054698944,
+      "rewards/<lambda>": 0.2387671172618866,
+      "step": 32
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.336734693877551,
+      "grad_norm": 0.1079491674900055,
+      "kl": 0.0013044481165707111,
+      "learning_rate": 3.3e-05,
+      "loss": 0.0001,
+      "reward": 0.2432921826839447,
+      "reward_std": 0.2530010938644409,
+      "rewards/<lambda>": 0.2432921826839447,
+      "step": 33
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3469387755102041,
+      "grad_norm": 0.09688923507928848,
+      "kl": 0.0012661240762099624,
+      "learning_rate": 3.4000000000000007e-05,
+      "loss": 0.0001,
+      "reward": 0.32528483867645264,
+      "reward_std": 0.2360536754131317,
+      "rewards/<lambda>": 0.32528483867645264,
+      "step": 34
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.35714285714285715,
+      "grad_norm": 0.07901012152433395,
+      "kl": 0.001169042894616723,
+      "learning_rate": 3.5e-05,
+      "loss": 0.0,
+      "reward": 0.2810063362121582,
+      "reward_std": 0.22572952508926392,
+      "rewards/<lambda>": 0.2810063362121582,
+      "step": 35
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3673469387755102,
+      "grad_norm": 0.12785927951335907,
+      "kl": 0.0014555864036083221,
+      "learning_rate": 3.6e-05,
+      "loss": 0.0001,
+      "reward": 0.2540737986564636,
+      "reward_std": 0.24591577053070068,
+      "rewards/<lambda>": 0.2540737986564636,
+      "step": 36
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.37755102040816324,
+      "grad_norm": 0.08787187188863754,
+      "kl": 0.001047058729454875,
+      "learning_rate": 3.7e-05,
+      "loss": 0.0,
+      "reward": 0.24186044931411743,
+      "reward_std": 0.2187865972518921,
+      "rewards/<lambda>": 0.24186044931411743,
+      "step": 37
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3877551020408163,
+      "grad_norm": 0.102202869951725,
+      "kl": 0.0013880159240216017,
+      "learning_rate": 3.8e-05,
+      "loss": 0.0001,
+      "reward": 0.2575429379940033,
+      "reward_std": 0.2186940610408783,
+      "rewards/<lambda>": 0.2575429379940033,
+      "step": 38
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3979591836734694,
+      "grad_norm": 0.08456045389175415,
+      "kl": 0.0011803264496847987,
+      "learning_rate": 3.9000000000000006e-05,
+      "loss": 0.0,
+      "reward": 0.31993257999420166,
+      "reward_std": 0.22598499059677124,
+      "rewards/<lambda>": 0.31993257999420166,
+      "step": 39
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.40816326530612246,
+      "grad_norm": 0.10486605018377304,
+      "kl": 0.0018042891751974821,
+      "learning_rate": 4e-05,
+      "loss": 0.0001,
+      "reward": 0.29636475443840027,
+      "reward_std": 0.24350598454475403,
+      "rewards/<lambda>": 0.29636475443840027,
+      "step": 40
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.41836734693877553,
+      "grad_norm": 0.0880642980337143,
+      "kl": 0.001442858949303627,
+      "learning_rate": 4.1e-05,
+      "loss": 0.0001,
+      "reward": 0.41248711943626404,
+      "reward_std": 0.23531243205070496,
+      "rewards/<lambda>": 0.41248711943626404,
+      "step": 41
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.42857142857142855,
+      "grad_norm": 0.11210600286722183,
+      "kl": 0.0027525494806468487,
+      "learning_rate": 4.2e-05,
+      "loss": 0.0001,
+      "reward": 0.33057302236557007,
+      "reward_std": 0.2597821354866028,
+      "rewards/<lambda>": 0.33057302236557007,
+      "step": 42
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4387755102040816,
+      "grad_norm": 0.1131657138466835,
+      "kl": 0.0013009419199079275,
+      "learning_rate": 4.3e-05,
+      "loss": 0.0001,
+      "reward": 0.3796292245388031,
+      "reward_std": 0.23504015803337097,
+      "rewards/<lambda>": 0.3796292245388031,
+      "step": 43
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4489795918367347,
+      "grad_norm": 0.09555080533027649,
+      "kl": 0.002401509787887335,
+      "learning_rate": 4.4000000000000006e-05,
+      "loss": 0.0001,
+      "reward": 0.28863847255706787,
+      "reward_std": 0.2549838721752167,
+      "rewards/<lambda>": 0.28863847255706787,
+      "step": 44
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.45918367346938777,
+      "grad_norm": 0.12921357154846191,
+      "kl": 0.003915494307875633,
+      "learning_rate": 4.5e-05,
+      "loss": 0.0002,
+      "reward": 0.25585973262786865,
+      "reward_std": 0.2098313271999359,
+      "rewards/<lambda>": 0.25585973262786865,
+      "step": 45
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.46938775510204084,
+      "grad_norm": 0.10449738055467606,
+      "kl": 0.0029986370354890823,
+      "learning_rate": 4.600000000000001e-05,
+      "loss": 0.0001,
+      "reward": 0.24023020267486572,
+      "reward_std": 0.21866671741008759,
+      "rewards/<lambda>": 0.24023020267486572,
+      "step": 46
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.47959183673469385,
+      "grad_norm": 0.11932526528835297,
+      "kl": 0.0029086011927574873,
+      "learning_rate": 4.7e-05,
+      "loss": 0.0001,
+      "reward": 0.35698994994163513,
+      "reward_std": 0.244903564453125,
+      "rewards/<lambda>": 0.35698994994163513,
+      "step": 47
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4897959183673469,
+      "grad_norm": 0.11577532440423965,
+      "kl": 0.0025195125490427017,
+      "learning_rate": 4.8e-05,
+      "loss": 0.0001,
+      "reward": 0.3163990378379822,
+      "reward_std": 0.25772640109062195,
+      "rewards/<lambda>": 0.3163990378379822,
+      "step": 48
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5,
+      "grad_norm": 0.12613017857074738,
+      "kl": 0.002560165012255311,
+      "learning_rate": 4.9e-05,
+      "loss": 0.0001,
+      "reward": 0.28095588088035583,
+      "reward_std": 0.2137780636548996,
+      "rewards/<lambda>": 0.28095588088035583,
+      "step": 49
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5102040816326531,
+      "grad_norm": 0.11029908806085587,
+      "kl": 0.003053056076169014,
+      "learning_rate": 5e-05,
+      "loss": 0.0001,
+      "reward": 0.2593611478805542,
+      "reward_std": 0.17565123736858368,
+      "rewards/<lambda>": 0.2593611478805542,
+      "step": 50
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5204081632653061,
+      "grad_norm": 0.10181179642677307,
+      "kl": 0.0032034586183726788,
+      "learning_rate": 5.1000000000000006e-05,
+      "loss": 0.0001,
+      "reward": 0.3496597707271576,
+      "reward_std": 0.23656174540519714,
+      "rewards/<lambda>": 0.3496597707271576,
+      "step": 51
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5306122448979592,
+      "grad_norm": 0.11612030863761902,
+      "kl": 0.0029043052345514297,
+      "learning_rate": 5.2000000000000004e-05,
+      "loss": 0.0001,
+      "reward": 0.31435415148735046,
+      "reward_std": 0.26738691329956055,
+      "rewards/<lambda>": 0.31435415148735046,
+      "step": 52
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5408163265306123,
+      "grad_norm": 0.11753087490797043,
+      "kl": 0.004537411965429783,
+      "learning_rate": 5.300000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.27415716648101807,
+      "reward_std": 0.2496672123670578,
+      "rewards/<lambda>": 0.27415716648101807,
+      "step": 53
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5510204081632653,
+      "grad_norm": 0.09143570065498352,
+      "kl": 0.0020023572724312544,
+      "learning_rate": 5.4000000000000005e-05,
+      "loss": 0.0001,
+      "reward": 0.27505844831466675,
+      "reward_std": 0.2481660097837448,
+      "rewards/<lambda>": 0.27505844831466675,
+      "step": 54
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5612244897959183,
+      "grad_norm": 0.11042241752147675,
+      "kl": 0.002455132082104683,
+      "learning_rate": 5.500000000000001e-05,
+      "loss": 0.0001,
+      "reward": 0.2277480959892273,
+      "reward_std": 0.2573385238647461,
+      "rewards/<lambda>": 0.2277480959892273,
+      "step": 55
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5714285714285714,
+      "grad_norm": 0.1062241867184639,
+      "kl": 0.003958327695727348,
+      "learning_rate": 5.6000000000000006e-05,
+      "loss": 0.0002,
+      "reward": 0.3246029019355774,
+      "reward_std": 0.25107985734939575,
+      "rewards/<lambda>": 0.3246029019355774,
+      "step": 56
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5816326530612245,
+      "grad_norm": 0.09364596009254456,
+      "kl": 0.0028939968906342983,
+      "learning_rate": 5.6999999999999996e-05,
+      "loss": 0.0001,
+      "reward": 0.33692467212677,
+      "reward_std": 0.2620271146297455,
+      "rewards/<lambda>": 0.33692467212677,
+      "step": 57
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5918367346938775,
+      "grad_norm": 0.10371523350477219,
+      "kl": 0.0031635533086955547,
+      "learning_rate": 5.8e-05,
+      "loss": 0.0001,
+      "reward": 0.26338717341423035,
+      "reward_std": 0.25605374574661255,
+      "rewards/<lambda>": 0.26338717341423035,
+      "step": 58
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6020408163265306,
+      "grad_norm": 0.119356170296669,
+      "kl": 0.004045659676194191,
+      "learning_rate": 5.9e-05,
+      "loss": 0.0002,
+      "reward": 0.26655131578445435,
+      "reward_std": 0.23418131470680237,
+      "rewards/<lambda>": 0.26655131578445435,
+      "step": 59
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6122448979591837,
+      "grad_norm": 0.14208924770355225,
+      "kl": 0.005053609609603882,
+      "learning_rate": 6e-05,
+      "loss": 0.0002,
+      "reward": 0.3059399425983429,
+      "reward_std": 0.2509281635284424,
+      "rewards/<lambda>": 0.3059399425983429,
+      "step": 60
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6224489795918368,
+      "grad_norm": 0.19251689314842224,
+      "kl": 0.005135521292686462,
+      "learning_rate": 6.1e-05,
+      "loss": 0.0002,
+      "reward": 0.3289321959018707,
+      "reward_std": 0.25894391536712646,
+      "rewards/<lambda>": 0.3289321959018707,
+      "step": 61
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6326530612244898,
+      "grad_norm": 0.10605209320783615,
+      "kl": 0.004940195474773645,
+      "learning_rate": 6.2e-05,
+      "loss": 0.0002,
+      "reward": 0.3096313178539276,
+      "reward_std": 0.25553008913993835,
+      "rewards/<lambda>": 0.3096313178539276,
+      "step": 62
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6428571428571429,
+      "grad_norm": 0.1347755789756775,
+      "kl": 0.007013984024524689,
+      "learning_rate": 6.3e-05,
+      "loss": 0.0003,
+      "reward": 0.3279077112674713,
+      "reward_std": 0.24912631511688232,
+      "rewards/<lambda>": 0.3279077112674713,
+      "step": 63
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6530612244897959,
+      "grad_norm": 0.11074844002723694,
+      "kl": 0.005811232142150402,
+      "learning_rate": 6.400000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.26557987928390503,
+      "reward_std": 0.2595665454864502,
+      "rewards/<lambda>": 0.26557987928390503,
+      "step": 64
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6632653061224489,
+      "grad_norm": 0.08949097245931625,
+      "kl": 0.0062391310930252075,
+      "learning_rate": 6.500000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.30340054631233215,
+      "reward_std": 0.23612141609191895,
+      "rewards/<lambda>": 0.30340054631233215,
+      "step": 65
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.673469387755102,
+      "grad_norm": 0.09197568893432617,
+      "kl": 0.006357334554195404,
+      "learning_rate": 6.6e-05,
+      "loss": 0.0003,
+      "reward": 0.305142879486084,
+      "reward_std": 0.25195544958114624,
+      "rewards/<lambda>": 0.305142879486084,
+      "step": 66
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6836734693877551,
+      "grad_norm": 0.11795815080404282,
+      "kl": 0.006672342773526907,
+      "learning_rate": 6.7e-05,
+      "loss": 0.0003,
+      "reward": 0.2330521047115326,
+      "reward_std": 0.24398502707481384,
+      "rewards/<lambda>": 0.2330521047115326,
+      "step": 67
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6938775510204082,
+      "grad_norm": 0.1380680352449417,
+      "kl": 0.008953109383583069,
+      "learning_rate": 6.800000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.3820633292198181,
+      "reward_std": 0.24111545085906982,
+      "rewards/<lambda>": 0.3820633292198181,
+      "step": 68
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7040816326530612,
+      "grad_norm": 0.10920781642198563,
+      "kl": 0.006580730434507132,
+      "learning_rate": 6.9e-05,
+      "loss": 0.0003,
+      "reward": 0.30754148960113525,
+      "reward_std": 0.22947274148464203,
+      "rewards/<lambda>": 0.30754148960113525,
+      "step": 69
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7142857142857143,
+      "grad_norm": 0.11049173772335052,
+      "kl": 0.006469545420259237,
+      "learning_rate": 7e-05,
+      "loss": 0.0003,
+      "reward": 0.2494296282529831,
+      "reward_std": 0.2338804006576538,
+      "rewards/<lambda>": 0.2494296282529831,
+      "step": 70
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7244897959183674,
+      "grad_norm": 0.12242847681045532,
+      "kl": 0.0065796151757240295,
+      "learning_rate": 7.1e-05,
+      "loss": 0.0003,
+      "reward": 0.35586172342300415,
+      "reward_std": 0.22659151256084442,
+      "rewards/<lambda>": 0.35586172342300415,
+      "step": 71
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7346938775510204,
+      "grad_norm": 0.10463358461856842,
+      "kl": 0.008036931976675987,
+      "learning_rate": 7.2e-05,
+      "loss": 0.0003,
+      "reward": 0.3306816518306732,
+      "reward_std": 0.21499347686767578,
+      "rewards/<lambda>": 0.3306816518306732,
+      "step": 72
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7448979591836735,
+      "grad_norm": 0.10549326241016388,
+      "kl": 0.005967782810330391,
+      "learning_rate": 7.3e-05,
+      "loss": 0.0002,
+      "reward": 0.26001977920532227,
+      "reward_std": 0.20824775099754333,
+      "rewards/<lambda>": 0.26001977920532227,
+      "step": 73
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7551020408163265,
+      "grad_norm": 0.10788781940937042,
+      "kl": 0.004906760528683662,
+      "learning_rate": 7.4e-05,
+      "loss": 0.0002,
+      "reward": 0.2874922752380371,
+      "reward_std": 0.24835163354873657,
+      "rewards/<lambda>": 0.2874922752380371,
+      "step": 74
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7653061224489796,
+      "grad_norm": 0.09784362465143204,
+      "kl": 0.005803011357784271,
+      "learning_rate": 7.500000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.2684209942817688,
+      "reward_std": 0.2644413113594055,
+      "rewards/<lambda>": 0.2684209942817688,
+      "step": 75
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7755102040816326,
+      "grad_norm": 0.16001495718955994,
+      "kl": 0.006752543151378632,
+      "learning_rate": 7.6e-05,
+      "loss": 0.0003,
+      "reward": 0.3604978919029236,
+      "reward_std": 0.21626180410385132,
+      "rewards/<lambda>": 0.3604978919029236,
+      "step": 76
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7857142857142857,
+      "grad_norm": 0.13324351608753204,
+      "kl": 0.008819150738418102,
+      "learning_rate": 7.7e-05,
+      "loss": 0.0004,
+      "reward": 0.3122796416282654,
+      "reward_std": 0.23514878749847412,
+      "rewards/<lambda>": 0.3122796416282654,
+      "step": 77
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7959183673469388,
+      "grad_norm": 0.11151094734668732,
+      "kl": 0.0060831112787127495,
+      "learning_rate": 7.800000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.3716966509819031,
+      "reward_std": 0.25390562415122986,
+      "rewards/<lambda>": 0.3716966509819031,
+      "step": 78
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8061224489795918,
+      "grad_norm": 0.1258833110332489,
+      "kl": 0.00652284873649478,
+      "learning_rate": 7.900000000000001e-05,
+      "loss": 0.0003,
+      "reward": 0.31094950437545776,
+      "reward_std": 0.24983155727386475,
+      "rewards/<lambda>": 0.31094950437545776,
+      "step": 79
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8163265306122449,
+      "grad_norm": 0.09824781864881516,
+      "kl": 0.007383039221167564,
+      "learning_rate": 8e-05,
+      "loss": 0.0003,
+      "reward": 0.3597716689109802,
+      "reward_std": 0.24361717700958252,
+      "rewards/<lambda>": 0.3597716689109802,
+      "step": 80
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.826530612244898,
+      "grad_norm": 0.08224570006132126,
+      "kl": 0.005881953053176403,
+      "learning_rate": 8.1e-05,
+      "loss": 0.0002,
+      "reward": 0.27791696786880493,
+      "reward_std": 0.2182818502187729,
+      "rewards/<lambda>": 0.27791696786880493,
+      "step": 81
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8367346938775511,
+      "grad_norm": 0.12581421434879303,
+      "kl": 0.006681859493255615,
+      "learning_rate": 8.2e-05,
+      "loss": 0.0003,
+      "reward": 0.3261301517486572,
+      "reward_std": 0.2615125775337219,
+      "rewards/<lambda>": 0.3261301517486572,
+      "step": 82
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8469387755102041,
+      "grad_norm": 0.10322140902280807,
+      "kl": 0.006312561221420765,
+      "learning_rate": 8.3e-05,
+      "loss": 0.0003,
+      "reward": 0.3625139594078064,
+      "reward_std": 0.24003905057907104,
+      "rewards/<lambda>": 0.3625139594078064,
+      "step": 83
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8571428571428571,
+      "grad_norm": 0.0966828241944313,
+      "kl": 0.0061577907763421535,
+      "learning_rate": 8.4e-05,
+      "loss": 0.0002,
+      "reward": 0.27837446331977844,
+      "reward_std": 0.24895024299621582,
+      "rewards/<lambda>": 0.27837446331977844,
+      "step": 84
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8673469387755102,
+      "grad_norm": 0.09297024458646774,
+      "kl": 0.005875007715076208,
+      "learning_rate": 8.5e-05,
+      "loss": 0.0002,
+      "reward": 0.3240439295768738,
+      "reward_std": 0.2554982602596283,
+      "rewards/<lambda>": 0.3240439295768738,
+      "step": 85
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8775510204081632,
+      "grad_norm": 0.12204565852880478,
+      "kl": 0.006396137177944183,
+      "learning_rate": 8.6e-05,
+      "loss": 0.0003,
+      "reward": 0.24172773957252502,
+      "reward_std": 0.2518821060657501,
+      "rewards/<lambda>": 0.24172773957252502,
+      "step": 86
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8877551020408163,
+      "grad_norm": 0.08884488046169281,
+      "kl": 0.00668664975091815,
+      "learning_rate": 8.7e-05,
+      "loss": 0.0003,
+      "reward": 0.28681838512420654,
+      "reward_std": 0.25753265619277954,
+      "rewards/<lambda>": 0.28681838512420654,
+      "step": 87
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8979591836734694,
+      "grad_norm": 0.08016186207532883,
+      "kl": 0.006137486081570387,
+      "learning_rate": 8.800000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.2647368907928467,
+      "reward_std": 0.24885067343711853,
+      "rewards/<lambda>": 0.2647368907928467,
+      "step": 88
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9081632653061225,
+      "grad_norm": 0.10147328674793243,
+      "kl": 0.00917094200849533,
+      "learning_rate": 8.900000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.29730311036109924,
+      "reward_std": 0.2591564953327179,
+      "rewards/<lambda>": 0.29730311036109924,
+      "step": 89
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9183673469387755,
+      "grad_norm": 0.12398728728294373,
+      "kl": 0.007823077961802483,
+      "learning_rate": 9e-05,
+      "loss": 0.0003,
+      "reward": 0.2589433193206787,
+      "reward_std": 0.24704992771148682,
+      "rewards/<lambda>": 0.2589433193206787,
+      "step": 90
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9285714285714286,
+      "grad_norm": 0.1265306919813156,
+      "kl": 0.009746653027832508,
+      "learning_rate": 9.1e-05,
+      "loss": 0.0004,
+      "reward": 0.2624666094779968,
+      "reward_std": 0.26040005683898926,
+      "rewards/<lambda>": 0.2624666094779968,
+      "step": 91
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9387755102040817,
+      "grad_norm": 0.09663106501102448,
+      "kl": 0.009732738137245178,
+      "learning_rate": 9.200000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.31370311975479126,
+      "reward_std": 0.2633877992630005,
+      "rewards/<lambda>": 0.31370311975479126,
+      "step": 92
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9489795918367347,
+      "grad_norm": 0.12955360114574432,
+      "kl": 0.011157519184052944,
+      "learning_rate": 9.300000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.3296325206756592,
+      "reward_std": 0.26135072112083435,
+      "rewards/<lambda>": 0.3296325206756592,
+      "step": 93
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9591836734693877,
+      "grad_norm": 0.07906248420476913,
+      "kl": 0.00876379944384098,
+      "learning_rate": 9.4e-05,
+      "loss": 0.0004,
+      "reward": 0.38581177592277527,
+      "reward_std": 0.21951782703399658,
+      "rewards/<lambda>": 0.38581177592277527,
+      "step": 94
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9693877551020408,
+      "grad_norm": 0.10264045745134354,
+      "kl": 0.013823926448822021,
+      "learning_rate": 9.5e-05,
+      "loss": 0.0006,
+      "reward": 0.37457725405693054,
+      "reward_std": 0.22653540968894958,
+      "rewards/<lambda>": 0.37457725405693054,
+      "step": 95
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9795918367346939,
+      "grad_norm": 0.09052637219429016,
+      "kl": 0.011999586597084999,
+      "learning_rate": 9.6e-05,
+      "loss": 0.0005,
+      "reward": 0.3207855224609375,
+      "reward_std": 0.23623248934745789,
+      "rewards/<lambda>": 0.3207855224609375,
+      "step": 96
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9897959183673469,
+      "grad_norm": 0.11846129596233368,
+      "kl": 0.011448493227362633,
+      "learning_rate": 9.7e-05,
+      "loss": 0.0005,
+      "reward": 0.32892781496047974,
+      "reward_std": 0.23297683894634247,
+      "rewards/<lambda>": 0.32892781496047974,
+      "step": 97
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0,
+      "grad_norm": 0.09638947248458862,
+      "kl": 0.010949314571917057,
+      "learning_rate": 9.8e-05,
+      "loss": 0.0004,
+      "reward": 0.4758386015892029,
+      "reward_std": 0.2482748031616211,
+      "rewards/<lambda>": 0.4758386015892029,
+      "step": 98
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.010204081632653,
+      "grad_norm": 0.09481366723775864,
+      "kl": 0.01085271593183279,
+      "learning_rate": 9.900000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.3523382544517517,
+      "reward_std": 0.2240390032529831,
+      "rewards/<lambda>": 0.3523382544517517,
+      "step": 99
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0204081632653061,
+      "grad_norm": 0.09950532019138336,
+      "kl": 0.012218523770570755,
+      "learning_rate": 0.0001,
+      "loss": 0.0005,
+      "reward": 0.28119099140167236,
+      "reward_std": 0.22250661253929138,
+      "rewards/<lambda>": 0.28119099140167236,
+      "step": 100
+    }
+  ],
+  "logging_steps": 1,
+  "max_steps": 294,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 3,
+  "save_steps": 100,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 0.0,
+  "train_batch_size": 8,
+  "trial_name": null,
+  "trial_params": null
+}
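
The `log_history` entries above are easier to read in aggregate than line by line. The following is a minimal sketch, not part of the commit, that averages `reward` and `kl` over the most recent logged steps. The `summarize` helper and the `sample_state` dict are hypothetical names; the three entries are copied from steps 98 to 100 above and trimmed to a few fields. In practice one would load the full `checkpoint-100/trainer_state.json` with `json.load` instead.

```python
import json
from statistics import mean

# Three entries copied from the log_history above (steps 98-100),
# trimmed to the fields used here. Illustrative sample only.
sample_state = json.loads("""
{
  "log_history": [
    {"step": 98,  "kl": 0.010949314571917057, "reward": 0.4758386015892029},
    {"step": 99,  "kl": 0.01085271593183279,  "reward": 0.3523382544517517},
    {"step": 100, "kl": 0.012218523770570755, "reward": 0.28119099140167236}
  ]
}
""")

def summarize(state, last_n=50):
    """Average reward and KL over the last `last_n` logged training steps."""
    # Keep only entries that carry a reward (evaluation or summary rows may not).
    entries = [e for e in state["log_history"] if "reward" in e][-last_n:]
    return {
        "steps": len(entries),
        "mean_reward": mean(e["reward"] for e in entries),
        "mean_kl": mean(e["kl"] for e in entries),
    }

summary = summarize(sample_state)
print(summary)
```

Averaging over a window is a design choice here: the per-step `reward` values in the log are noisy (`reward_std` is around 0.25 throughout), so single steps say little about the trend.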

checkpoint-100/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:10bbf0a23b8f4b09a58ff960658e6bc23228310be351cac2a5c1ed047b491f12
+size 5560

checkpoint-100/vocab.json ADDED

The diff for this file is too large to render.

checkpoint-200/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: HuggingFaceTB/SmolLM2-360M
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.14.0

checkpoint-200/adapter_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
+  "bias": "none",
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "k_proj",
+    "up_proj",
+    "v_proj",
+    "gate_proj",
+    "q_proj",
+    "down_proj",
+    "o_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "use_dora": false,
+  "use_rslora": false
+}
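
As a worked example of what `r`, `lora_alpha`, and `use_rslora` imply together, the sketch below computes the scaling factor applied to the low-rank update. It assumes the usual LoRA convention of `lora_alpha / r`, or `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled. The `lora_scaling` helper is a hypothetical name, not a PEFT function.

```python
import math

# Values copied from the adapter_config.json above.
adapter_config = {"r": 32, "lora_alpha": 64, "use_rslora": False}

def lora_scaling(cfg):
    """Scaling applied to the low-rank update (assumed standard LoRA convention)."""
    if cfg["use_rslora"]:
        # Rank-stabilized variant divides by sqrt(r) instead of r.
        return cfg["lora_alpha"] / math.sqrt(cfg["r"])
    return cfg["lora_alpha"] / cfg["r"]

scaling = lora_scaling(adapter_config)
print(scaling)
```

With the values in this config the update is scaled by 64 / 32, so the adapter's contribution is doubled relative to an unscaled low-rank product.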

checkpoint-200/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c86fe316990db45f2615636367af79e298f133191741cb9217f15ece2ba7fb2c
+size 69527352

checkpoint-200/merges.txt ADDED

The diff for this file is too large to render.

checkpoint-200/optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:66636134445450dd72eeed7d9b5fb716409f792af7fcb3de584a20f44e3523de
+size 139313234

checkpoint-200/rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4f6342846135f94c1e83447565a85d30dd89f714f3061c6b252aba6549e85adc
+size 14244

checkpoint-200/scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0d298159cdcad213cd5f37f7528bef2babcacaf8836f0e4d1afc1d6de3993582
+size 1064

checkpoint-200/special_tokens_map.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": "<|im_start|>",
+  "eos_token": "<|im_end|>",
+  "pad_token": "<|im_end|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

checkpoint-200/tokenizer.json ADDED

The diff for this file is too large to render.

checkpoint-200/tokenizer_config.json ADDED

	@@ -0,0 +1,156 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "bos_token": "<|im_start|>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "extra_special_tokens": {},
+  "model_max_length": 8192,
+  "pad_token": "<|im_end|>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
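
The `chat_template` field above is a Jinja template, which can be hard to read inline. The sketch below is a plain-Python re-implementation of the same logic, written only to show what string the template produces. It is not how the tokenizer renders it, and `render_chat` is a hypothetical helper name.

```python
def render_chat(messages, add_generation_prompt=False):
    """Mirror the chat_template above: wrap each message in <|im_start|>/<|im_end|>."""
    parts = []
    for message in messages:
        # '<|im_start|>' + role + '\n' + content + '<|im_end|>' + '\n'
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chat(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(repr(prompt))
```

Since `pad_token` and `eos_token` are both `<|im_end|>` in this config and `padding_side` is `left`, padding tokens would appear before the first `<|im_start|>` in a batched prompt rather than after the generation prompt.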

checkpoint-200/trainer_state.json ADDED

	@@ -0,0 +1,2433 @@

+{
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 2.0408163265306123,
+  "eval_steps": 500,
+  "global_step": 200,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "completion_length": 160.0,
+      "epoch": 0.01020408163265306,
+      "grad_norm": 0.1883362978696823,
+      "kl": 0.0,
+      "learning_rate": 1.0000000000000002e-06,
+      "loss": 0.0,
+      "reward": 0.35154566168785095,
+      "reward_std": 0.18836843967437744,
+      "rewards/<lambda>": 0.35154566168785095,
+      "step": 1
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.02040816326530612,
+      "grad_norm": 0.08015607297420502,
+      "kl": 0.0,
+      "learning_rate": 2.0000000000000003e-06,
+      "loss": 0.0,
+      "reward": 0.24049028754234314,
+      "reward_std": 0.23294341564178467,
+      "rewards/<lambda>": 0.24049028754234314,
+      "step": 2
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.030612244897959183,
+      "grad_norm": 0.09276824444532394,
+      "kl": 0.0009410024504177272,
+      "learning_rate": 3e-06,
+      "loss": 0.0,
+      "reward": 0.32676321268081665,
+      "reward_std": 0.25492924451828003,
+      "rewards/<lambda>": 0.32676321268081665,
+      "step": 3
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.04081632653061224,
+      "grad_norm": 0.10654664784669876,
+      "kl": 0.001077650347724557,
+      "learning_rate": 4.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.2009219378232956,
+      "reward_std": 0.21452507376670837,
+      "rewards/<lambda>": 0.2009219378232956,
+      "step": 4
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.05102040816326531,
+      "grad_norm": 0.11217533051967621,
+      "kl": 0.001070915488526225,
+      "learning_rate": 5e-06,
+      "loss": 0.0,
+      "reward": 0.3001704216003418,
+      "reward_std": 0.24798467755317688,
+      "rewards/<lambda>": 0.3001704216003418,
+      "step": 5
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.061224489795918366,
+      "grad_norm": 0.09305333346128464,
+      "kl": 0.0010154233314096928,
+      "learning_rate": 6e-06,
+      "loss": 0.0,
+      "reward": 0.2345525473356247,
+      "reward_std": 0.22145482897758484,
+      "rewards/<lambda>": 0.2345525473356247,
+      "step": 6
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.07142857142857142,
+      "grad_norm": 0.13233615458011627,
+      "kl": 0.0011612953385338187,
+      "learning_rate": 7.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.3054887056350708,
+      "reward_std": 0.21511563658714294,
+      "rewards/<lambda>": 0.3054887056350708,
+      "step": 7
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.08163265306122448,
+      "grad_norm": 0.1244489848613739,
+      "kl": 0.000984564539976418,
+      "learning_rate": 8.000000000000001e-06,
+      "loss": 0.0,
+      "reward": 0.27341318130493164,
+      "reward_std": 0.23464292287826538,
+      "rewards/<lambda>": 0.27341318130493164,
+      "step": 8
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.09183673469387756,
+      "grad_norm": 0.09130726009607315,
+      "kl": 0.0008879535598680377,
+      "learning_rate": 9e-06,
+      "loss": 0.0,
+      "reward": 0.3593016564846039,
+      "reward_std": 0.2495943158864975,
+      "rewards/<lambda>": 0.3593016564846039,
+      "step": 9
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.10204081632653061,
+      "grad_norm": 0.15256966650485992,
+      "kl": 0.0011160458670929074,
+      "learning_rate": 1e-05,
+      "loss": 0.0,
+      "reward": 0.3049771785736084,
+      "reward_std": 0.26634567975997925,
+      "rewards/<lambda>": 0.3049771785736084,
+      "step": 10
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.11224489795918367,
+      "grad_norm": 0.09210026264190674,
+      "kl": 0.0010061098728328943,
+      "learning_rate": 1.1000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.2637690603733063,
+      "reward_std": 0.21869879961013794,
+      "rewards/<lambda>": 0.2637690603733063,
+      "step": 11
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.12244897959183673,
+      "grad_norm": 0.09713605791330338,
+      "kl": 0.0010679092956706882,
+      "learning_rate": 1.2e-05,
+      "loss": 0.0,
+      "reward": 0.36334261298179626,
+      "reward_std": 0.24027395248413086,
+      "rewards/<lambda>": 0.36334261298179626,
+      "step": 12
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.1326530612244898,
+      "grad_norm": 0.09611018002033234,
+      "kl": 0.000932489987462759,
+      "learning_rate": 1.3000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.2559860944747925,
+      "reward_std": 0.25332069396972656,
+      "rewards/<lambda>": 0.2559860944747925,
+      "step": 13
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.14285714285714285,
+      "grad_norm": 0.11913401633501053,
+      "kl": 0.0010431472910568118,
+      "learning_rate": 1.4000000000000001e-05,
+      "loss": 0.0,
+      "reward": 0.2593250274658203,
+      "reward_std": 0.25670933723449707,
+      "rewards/<lambda>": 0.2593250274658203,
+      "step": 14
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.15306122448979592,
+      "grad_norm": 0.09773930162191391,
+      "kl": 0.0009058397263288498,
+      "learning_rate": 1.5e-05,
+      "loss": 0.0,
+      "reward": 0.3413263261318207,
+      "reward_std": 0.23378777503967285,
+      "rewards/<lambda>": 0.3413263261318207,
+      "step": 15
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.16326530612244897,
+      "grad_norm": 0.10632280260324478,
+      "kl": 0.000995107227936387,
+      "learning_rate": 1.6000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.3327138423919678,
+      "reward_std": 0.2463734745979309,
+      "rewards/<lambda>": 0.3327138423919678,
+      "step": 16
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.17346938775510204,
+      "grad_norm": 0.0903797596693039,
+      "kl": 0.0008249408565461636,
+      "learning_rate": 1.7000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.25679031014442444,
+      "reward_std": 0.2034485787153244,
+      "rewards/<lambda>": 0.25679031014442444,
+      "step": 17
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.1836734693877551,
+      "grad_norm": 0.09401573985815048,
+      "kl": 0.0011675208806991577,
+      "learning_rate": 1.8e-05,
+      "loss": 0.0,
+      "reward": 0.24813005328178406,
+      "reward_std": 0.21580623090267181,
+      "rewards/<lambda>": 0.24813005328178406,
+      "step": 18
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.19387755102040816,
+      "grad_norm": 0.12456843256950378,
+      "kl": 0.0010836416622623801,
+      "learning_rate": 1.9e-05,
+      "loss": 0.0,
+      "reward": 0.3487837314605713,
+      "reward_std": 0.220088928937912,
+      "rewards/<lambda>": 0.3487837314605713,
+      "step": 19
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.20408163265306123,
+      "grad_norm": 0.0994577631354332,
+      "kl": 0.0012541261967271566,
+      "learning_rate": 2e-05,
+      "loss": 0.0001,
+      "reward": 0.2757805585861206,
+      "reward_std": 0.23097757995128632,
+      "rewards/<lambda>": 0.2757805585861206,
+      "step": 20
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.21428571428571427,
+      "grad_norm": 0.1026805192232132,
+      "kl": 0.0013662949204444885,
+      "learning_rate": 2.1e-05,
+      "loss": 0.0001,
+      "reward": 0.3016376793384552,
+      "reward_std": 0.26067519187927246,
+      "rewards/<lambda>": 0.3016376793384552,
+      "step": 21
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.22448979591836735,
+      "grad_norm": 0.11243847012519836,
+      "kl": 0.0012734383344650269,
+      "learning_rate": 2.2000000000000003e-05,
+      "loss": 0.0001,
+      "reward": 0.34152576327323914,
+      "reward_std": 0.24704596400260925,
+      "rewards/<lambda>": 0.34152576327323914,
+      "step": 22
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.23469387755102042,
+      "grad_norm": 0.11959439516067505,
+      "kl": 0.001076711225323379,
+      "learning_rate": 2.3000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.25360214710235596,
+      "reward_std": 0.24721531569957733,
+      "rewards/<lambda>": 0.25360214710235596,
+      "step": 23
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.24489795918367346,
+      "grad_norm": 0.12651875615119934,
+      "kl": 0.001166851376183331,
+      "learning_rate": 2.4e-05,
+      "loss": 0.0,
+      "reward": 0.23422789573669434,
+      "reward_std": 0.22290775179862976,
+      "rewards/<lambda>": 0.23422789573669434,
+      "step": 24
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.25510204081632654,
+      "grad_norm": 0.09411771595478058,
+      "kl": 0.0009251383016817272,
+      "learning_rate": 2.5e-05,
+      "loss": 0.0,
+      "reward": 0.3483571410179138,
+      "reward_std": 0.22439202666282654,
+      "rewards/<lambda>": 0.3483571410179138,
+      "step": 25
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2653061224489796,
+      "grad_norm": 0.10857303440570831,
+      "kl": 0.0013178512454032898,
+      "learning_rate": 2.6000000000000002e-05,
+      "loss": 0.0001,
+      "reward": 0.27066630125045776,
+      "reward_std": 0.25778770446777344,
+      "rewards/<lambda>": 0.27066630125045776,
+      "step": 26
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2755102040816326,
+      "grad_norm": 0.10724504292011261,
+      "kl": 0.0011154217645525932,
+      "learning_rate": 2.7000000000000002e-05,
+      "loss": 0.0,
+      "reward": 0.29464977979660034,
+      "reward_std": 0.22159242630004883,
+      "rewards/<lambda>": 0.29464977979660034,
+      "step": 27
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.2857142857142857,
+      "grad_norm": 0.14677317440509796,
+      "kl": 0.0010433748830109835,
+      "learning_rate": 2.8000000000000003e-05,
+      "loss": 0.0,
+      "reward": 0.3109304904937744,
+      "reward_std": 0.26592540740966797,
+      "rewards/<lambda>": 0.3109304904937744,
+      "step": 28
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.29591836734693877,
+      "grad_norm": 0.12714359164237976,
+      "kl": 0.001189422095194459,
+      "learning_rate": 2.9e-05,
+      "loss": 0.0,
+      "reward": 0.3446485698223114,
+      "reward_std": 0.22104832530021667,
+      "rewards/<lambda>": 0.3446485698223114,
+      "step": 29
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.30612244897959184,
+      "grad_norm": 0.093221515417099,
+      "kl": 0.0011938215466216207,
+      "learning_rate": 3e-05,
+      "loss": 0.0,
+      "reward": 0.3021507263183594,
+      "reward_std": 0.26753896474838257,
+      "rewards/<lambda>": 0.3021507263183594,
+      "step": 30
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3163265306122449,
+      "grad_norm": 0.10894666612148285,
+      "kl": 0.0014481329126283526,
+      "learning_rate": 3.1e-05,
+      "loss": 0.0001,
+      "reward": 0.22669407725334167,
+      "reward_std": 0.21502387523651123,
+      "rewards/<lambda>": 0.22669407725334167,
+      "step": 31
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.32653061224489793,
+      "grad_norm": 0.11627420037984848,
+      "kl": 0.0011792851146310568,
+      "learning_rate": 3.2000000000000005e-05,
+      "loss": 0.0,
+      "reward": 0.2387671172618866,
+      "reward_std": 0.20545344054698944,
+      "rewards/<lambda>": 0.2387671172618866,
+      "step": 32
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.336734693877551,
+      "grad_norm": 0.1079491674900055,
+      "kl": 0.0013044481165707111,
+      "learning_rate": 3.3e-05,
+      "loss": 0.0001,
+      "reward": 0.2432921826839447,
+      "reward_std": 0.2530010938644409,
+      "rewards/<lambda>": 0.2432921826839447,
+      "step": 33
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3469387755102041,
+      "grad_norm": 0.09688923507928848,
+      "kl": 0.0012661240762099624,
+      "learning_rate": 3.4000000000000007e-05,
+      "loss": 0.0001,
+      "reward": 0.32528483867645264,
+      "reward_std": 0.2360536754131317,
+      "rewards/<lambda>": 0.32528483867645264,
+      "step": 34
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.35714285714285715,
+      "grad_norm": 0.07901012152433395,
+      "kl": 0.001169042894616723,
+      "learning_rate": 3.5e-05,
+      "loss": 0.0,
+      "reward": 0.2810063362121582,
+      "reward_std": 0.22572952508926392,
+      "rewards/<lambda>": 0.2810063362121582,
+      "step": 35
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3673469387755102,
+      "grad_norm": 0.12785927951335907,
+      "kl": 0.0014555864036083221,
+      "learning_rate": 3.6e-05,
+      "loss": 0.0001,
+      "reward": 0.2540737986564636,
+      "reward_std": 0.24591577053070068,
+      "rewards/<lambda>": 0.2540737986564636,
+      "step": 36
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.37755102040816324,
+      "grad_norm": 0.08787187188863754,
+      "kl": 0.001047058729454875,
+      "learning_rate": 3.7e-05,
+      "loss": 0.0,
+      "reward": 0.24186044931411743,
+      "reward_std": 0.2187865972518921,
+      "rewards/<lambda>": 0.24186044931411743,
+      "step": 37
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3877551020408163,
+      "grad_norm": 0.102202869951725,
+      "kl": 0.0013880159240216017,
+      "learning_rate": 3.8e-05,
+      "loss": 0.0001,
+      "reward": 0.2575429379940033,
+      "reward_std": 0.2186940610408783,
+      "rewards/<lambda>": 0.2575429379940033,
+      "step": 38
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.3979591836734694,
+      "grad_norm": 0.08456045389175415,
+      "kl": 0.0011803264496847987,
+      "learning_rate": 3.9000000000000006e-05,
+      "loss": 0.0,
+      "reward": 0.31993257999420166,
+      "reward_std": 0.22598499059677124,
+      "rewards/<lambda>": 0.31993257999420166,
+      "step": 39
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.40816326530612246,
+      "grad_norm": 0.10486605018377304,
+      "kl": 0.0018042891751974821,
+      "learning_rate": 4e-05,
+      "loss": 0.0001,
+      "reward": 0.29636475443840027,
+      "reward_std": 0.24350598454475403,
+      "rewards/<lambda>": 0.29636475443840027,
+      "step": 40
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.41836734693877553,
+      "grad_norm": 0.0880642980337143,
+      "kl": 0.001442858949303627,
+      "learning_rate": 4.1e-05,
+      "loss": 0.0001,
+      "reward": 0.41248711943626404,
+      "reward_std": 0.23531243205070496,
+      "rewards/<lambda>": 0.41248711943626404,
+      "step": 41
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.42857142857142855,
+      "grad_norm": 0.11210600286722183,
+      "kl": 0.0027525494806468487,
+      "learning_rate": 4.2e-05,
+      "loss": 0.0001,
+      "reward": 0.33057302236557007,
+      "reward_std": 0.2597821354866028,
+      "rewards/<lambda>": 0.33057302236557007,
+      "step": 42
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4387755102040816,
+      "grad_norm": 0.1131657138466835,
+      "kl": 0.0013009419199079275,
+      "learning_rate": 4.3e-05,
+      "loss": 0.0001,
+      "reward": 0.3796292245388031,
+      "reward_std": 0.23504015803337097,
+      "rewards/<lambda>": 0.3796292245388031,
+      "step": 43
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4489795918367347,
+      "grad_norm": 0.09555080533027649,
+      "kl": 0.002401509787887335,
+      "learning_rate": 4.4000000000000006e-05,
+      "loss": 0.0001,
+      "reward": 0.28863847255706787,
+      "reward_std": 0.2549838721752167,
+      "rewards/<lambda>": 0.28863847255706787,
+      "step": 44
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.45918367346938777,
+      "grad_norm": 0.12921357154846191,
+      "kl": 0.003915494307875633,
+      "learning_rate": 4.5e-05,
+      "loss": 0.0002,
+      "reward": 0.25585973262786865,
+      "reward_std": 0.2098313271999359,
+      "rewards/<lambda>": 0.25585973262786865,
+      "step": 45
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.46938775510204084,
+      "grad_norm": 0.10449738055467606,
+      "kl": 0.0029986370354890823,
+      "learning_rate": 4.600000000000001e-05,
+      "loss": 0.0001,
+      "reward": 0.24023020267486572,
+      "reward_std": 0.21866671741008759,
+      "rewards/<lambda>": 0.24023020267486572,
+      "step": 46
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.47959183673469385,
+      "grad_norm": 0.11932526528835297,
+      "kl": 0.0029086011927574873,
+      "learning_rate": 4.7e-05,
+      "loss": 0.0001,
+      "reward": 0.35698994994163513,
+      "reward_std": 0.244903564453125,
+      "rewards/<lambda>": 0.35698994994163513,
+      "step": 47
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.4897959183673469,
+      "grad_norm": 0.11577532440423965,
+      "kl": 0.0025195125490427017,
+      "learning_rate": 4.8e-05,
+      "loss": 0.0001,
+      "reward": 0.3163990378379822,
+      "reward_std": 0.25772640109062195,
+      "rewards/<lambda>": 0.3163990378379822,
+      "step": 48
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5,
+      "grad_norm": 0.12613017857074738,
+      "kl": 0.002560165012255311,
+      "learning_rate": 4.9e-05,
+      "loss": 0.0001,
+      "reward": 0.28095588088035583,
+      "reward_std": 0.2137780636548996,
+      "rewards/<lambda>": 0.28095588088035583,
+      "step": 49
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5102040816326531,
+      "grad_norm": 0.11029908806085587,
+      "kl": 0.003053056076169014,
+      "learning_rate": 5e-05,
+      "loss": 0.0001,
+      "reward": 0.2593611478805542,
+      "reward_std": 0.17565123736858368,
+      "rewards/<lambda>": 0.2593611478805542,
+      "step": 50
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5204081632653061,
+      "grad_norm": 0.10181179642677307,
+      "kl": 0.0032034586183726788,
+      "learning_rate": 5.1000000000000006e-05,
+      "loss": 0.0001,
+      "reward": 0.3496597707271576,
+      "reward_std": 0.23656174540519714,
+      "rewards/<lambda>": 0.3496597707271576,
+      "step": 51
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5306122448979592,
+      "grad_norm": 0.11612030863761902,
+      "kl": 0.0029043052345514297,
+      "learning_rate": 5.2000000000000004e-05,
+      "loss": 0.0001,
+      "reward": 0.31435415148735046,
+      "reward_std": 0.26738691329956055,
+      "rewards/<lambda>": 0.31435415148735046,
+      "step": 52
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5408163265306123,
+      "grad_norm": 0.11753087490797043,
+      "kl": 0.004537411965429783,
+      "learning_rate": 5.300000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.27415716648101807,
+      "reward_std": 0.2496672123670578,
+      "rewards/<lambda>": 0.27415716648101807,
+      "step": 53
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5510204081632653,
+      "grad_norm": 0.09143570065498352,
+      "kl": 0.0020023572724312544,
+      "learning_rate": 5.4000000000000005e-05,
+      "loss": 0.0001,
+      "reward": 0.27505844831466675,
+      "reward_std": 0.2481660097837448,
+      "rewards/<lambda>": 0.27505844831466675,
+      "step": 54
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5612244897959183,
+      "grad_norm": 0.11042241752147675,
+      "kl": 0.002455132082104683,
+      "learning_rate": 5.500000000000001e-05,
+      "loss": 0.0001,
+      "reward": 0.2277480959892273,
+      "reward_std": 0.2573385238647461,
+      "rewards/<lambda>": 0.2277480959892273,
+      "step": 55
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5714285714285714,
+      "grad_norm": 0.1062241867184639,
+      "kl": 0.003958327695727348,
+      "learning_rate": 5.6000000000000006e-05,
+      "loss": 0.0002,
+      "reward": 0.3246029019355774,
+      "reward_std": 0.25107985734939575,
+      "rewards/<lambda>": 0.3246029019355774,
+      "step": 56
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5816326530612245,
+      "grad_norm": 0.09364596009254456,
+      "kl": 0.0028939968906342983,
+      "learning_rate": 5.6999999999999996e-05,
+      "loss": 0.0001,
+      "reward": 0.33692467212677,
+      "reward_std": 0.2620271146297455,
+      "rewards/<lambda>": 0.33692467212677,
+      "step": 57
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.5918367346938775,
+      "grad_norm": 0.10371523350477219,
+      "kl": 0.0031635533086955547,
+      "learning_rate": 5.8e-05,
+      "loss": 0.0001,
+      "reward": 0.26338717341423035,
+      "reward_std": 0.25605374574661255,
+      "rewards/<lambda>": 0.26338717341423035,
+      "step": 58
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6020408163265306,
+      "grad_norm": 0.119356170296669,
+      "kl": 0.004045659676194191,
+      "learning_rate": 5.9e-05,
+      "loss": 0.0002,
+      "reward": 0.26655131578445435,
+      "reward_std": 0.23418131470680237,
+      "rewards/<lambda>": 0.26655131578445435,
+      "step": 59
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6122448979591837,
+      "grad_norm": 0.14208924770355225,
+      "kl": 0.005053609609603882,
+      "learning_rate": 6e-05,
+      "loss": 0.0002,
+      "reward": 0.3059399425983429,
+      "reward_std": 0.2509281635284424,
+      "rewards/<lambda>": 0.3059399425983429,
+      "step": 60
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6224489795918368,
+      "grad_norm": 0.19251689314842224,
+      "kl": 0.005135521292686462,
+      "learning_rate": 6.1e-05,
+      "loss": 0.0002,
+      "reward": 0.3289321959018707,
+      "reward_std": 0.25894391536712646,
+      "rewards/<lambda>": 0.3289321959018707,
+      "step": 61
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6326530612244898,
+      "grad_norm": 0.10605209320783615,
+      "kl": 0.004940195474773645,
+      "learning_rate": 6.2e-05,
+      "loss": 0.0002,
+      "reward": 0.3096313178539276,
+      "reward_std": 0.25553008913993835,
+      "rewards/<lambda>": 0.3096313178539276,
+      "step": 62
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6428571428571429,
+      "grad_norm": 0.1347755789756775,
+      "kl": 0.007013984024524689,
+      "learning_rate": 6.3e-05,
+      "loss": 0.0003,
+      "reward": 0.3279077112674713,
+      "reward_std": 0.24912631511688232,
+      "rewards/<lambda>": 0.3279077112674713,
+      "step": 63
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6530612244897959,
+      "grad_norm": 0.11074844002723694,
+      "kl": 0.005811232142150402,
+      "learning_rate": 6.400000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.26557987928390503,
+      "reward_std": 0.2595665454864502,
+      "rewards/<lambda>": 0.26557987928390503,
+      "step": 64
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6632653061224489,
+      "grad_norm": 0.08949097245931625,
+      "kl": 0.0062391310930252075,
+      "learning_rate": 6.500000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.30340054631233215,
+      "reward_std": 0.23612141609191895,
+      "rewards/<lambda>": 0.30340054631233215,
+      "step": 65
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.673469387755102,
+      "grad_norm": 0.09197568893432617,
+      "kl": 0.006357334554195404,
+      "learning_rate": 6.6e-05,
+      "loss": 0.0003,
+      "reward": 0.305142879486084,
+      "reward_std": 0.25195544958114624,
+      "rewards/<lambda>": 0.305142879486084,
+      "step": 66
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6836734693877551,
+      "grad_norm": 0.11795815080404282,
+      "kl": 0.006672342773526907,
+      "learning_rate": 6.7e-05,
+      "loss": 0.0003,
+      "reward": 0.2330521047115326,
+      "reward_std": 0.24398502707481384,
+      "rewards/<lambda>": 0.2330521047115326,
+      "step": 67
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.6938775510204082,
+      "grad_norm": 0.1380680352449417,
+      "kl": 0.008953109383583069,
+      "learning_rate": 6.800000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.3820633292198181,
+      "reward_std": 0.24111545085906982,
+      "rewards/<lambda>": 0.3820633292198181,
+      "step": 68
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7040816326530612,
+      "grad_norm": 0.10920781642198563,
+      "kl": 0.006580730434507132,
+      "learning_rate": 6.9e-05,
+      "loss": 0.0003,
+      "reward": 0.30754148960113525,
+      "reward_std": 0.22947274148464203,
+      "rewards/<lambda>": 0.30754148960113525,
+      "step": 69
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7142857142857143,
+      "grad_norm": 0.11049173772335052,
+      "kl": 0.006469545420259237,
+      "learning_rate": 7e-05,
+      "loss": 0.0003,
+      "reward": 0.2494296282529831,
+      "reward_std": 0.2338804006576538,
+      "rewards/<lambda>": 0.2494296282529831,
+      "step": 70
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7244897959183674,
+      "grad_norm": 0.12242847681045532,
+      "kl": 0.0065796151757240295,
+      "learning_rate": 7.1e-05,
+      "loss": 0.0003,
+      "reward": 0.35586172342300415,
+      "reward_std": 0.22659151256084442,
+      "rewards/<lambda>": 0.35586172342300415,
+      "step": 71
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7346938775510204,
+      "grad_norm": 0.10463358461856842,
+      "kl": 0.008036931976675987,
+      "learning_rate": 7.2e-05,
+      "loss": 0.0003,
+      "reward": 0.3306816518306732,
+      "reward_std": 0.21499347686767578,
+      "rewards/<lambda>": 0.3306816518306732,
+      "step": 72
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7448979591836735,
+      "grad_norm": 0.10549326241016388,
+      "kl": 0.005967782810330391,
+      "learning_rate": 7.3e-05,
+      "loss": 0.0002,
+      "reward": 0.26001977920532227,
+      "reward_std": 0.20824775099754333,
+      "rewards/<lambda>": 0.26001977920532227,
+      "step": 73
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7551020408163265,
+      "grad_norm": 0.10788781940937042,
+      "kl": 0.004906760528683662,
+      "learning_rate": 7.4e-05,
+      "loss": 0.0002,
+      "reward": 0.2874922752380371,
+      "reward_std": 0.24835163354873657,
+      "rewards/<lambda>": 0.2874922752380371,
+      "step": 74
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7653061224489796,
+      "grad_norm": 0.09784362465143204,
+      "kl": 0.005803011357784271,
+      "learning_rate": 7.500000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.2684209942817688,
+      "reward_std": 0.2644413113594055,
+      "rewards/<lambda>": 0.2684209942817688,
+      "step": 75
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7755102040816326,
+      "grad_norm": 0.16001495718955994,
+      "kl": 0.006752543151378632,
+      "learning_rate": 7.6e-05,
+      "loss": 0.0003,
+      "reward": 0.3604978919029236,
+      "reward_std": 0.21626180410385132,
+      "rewards/<lambda>": 0.3604978919029236,
+      "step": 76
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7857142857142857,
+      "grad_norm": 0.13324351608753204,
+      "kl": 0.008819150738418102,
+      "learning_rate": 7.7e-05,
+      "loss": 0.0004,
+      "reward": 0.3122796416282654,
+      "reward_std": 0.23514878749847412,
+      "rewards/<lambda>": 0.3122796416282654,
+      "step": 77
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.7959183673469388,
+      "grad_norm": 0.11151094734668732,
+      "kl": 0.0060831112787127495,
+      "learning_rate": 7.800000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.3716966509819031,
+      "reward_std": 0.25390562415122986,
+      "rewards/<lambda>": 0.3716966509819031,
+      "step": 78
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8061224489795918,
+      "grad_norm": 0.1258833110332489,
+      "kl": 0.00652284873649478,
+      "learning_rate": 7.900000000000001e-05,
+      "loss": 0.0003,
+      "reward": 0.31094950437545776,
+      "reward_std": 0.24983155727386475,
+      "rewards/<lambda>": 0.31094950437545776,
+      "step": 79
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8163265306122449,
+      "grad_norm": 0.09824781864881516,
+      "kl": 0.007383039221167564,
+      "learning_rate": 8e-05,
+      "loss": 0.0003,
+      "reward": 0.3597716689109802,
+      "reward_std": 0.24361717700958252,
+      "rewards/<lambda>": 0.3597716689109802,
+      "step": 80
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.826530612244898,
+      "grad_norm": 0.08224570006132126,
+      "kl": 0.005881953053176403,
+      "learning_rate": 8.1e-05,
+      "loss": 0.0002,
+      "reward": 0.27791696786880493,
+      "reward_std": 0.2182818502187729,
+      "rewards/<lambda>": 0.27791696786880493,
+      "step": 81
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8367346938775511,
+      "grad_norm": 0.12581421434879303,
+      "kl": 0.006681859493255615,
+      "learning_rate": 8.2e-05,
+      "loss": 0.0003,
+      "reward": 0.3261301517486572,
+      "reward_std": 0.2615125775337219,
+      "rewards/<lambda>": 0.3261301517486572,
+      "step": 82
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8469387755102041,
+      "grad_norm": 0.10322140902280807,
+      "kl": 0.006312561221420765,
+      "learning_rate": 8.3e-05,
+      "loss": 0.0003,
+      "reward": 0.3625139594078064,
+      "reward_std": 0.24003905057907104,
+      "rewards/<lambda>": 0.3625139594078064,
+      "step": 83
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8571428571428571,
+      "grad_norm": 0.0966828241944313,
+      "kl": 0.0061577907763421535,
+      "learning_rate": 8.4e-05,
+      "loss": 0.0002,
+      "reward": 0.27837446331977844,
+      "reward_std": 0.24895024299621582,
+      "rewards/<lambda>": 0.27837446331977844,
+      "step": 84
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8673469387755102,
+      "grad_norm": 0.09297024458646774,
+      "kl": 0.005875007715076208,
+      "learning_rate": 8.5e-05,
+      "loss": 0.0002,
+      "reward": 0.3240439295768738,
+      "reward_std": 0.2554982602596283,
+      "rewards/<lambda>": 0.3240439295768738,
+      "step": 85
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8775510204081632,
+      "grad_norm": 0.12204565852880478,
+      "kl": 0.006396137177944183,
+      "learning_rate": 8.6e-05,
+      "loss": 0.0003,
+      "reward": 0.24172773957252502,
+      "reward_std": 0.2518821060657501,
+      "rewards/<lambda>": 0.24172773957252502,
+      "step": 86
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8877551020408163,
+      "grad_norm": 0.08884488046169281,
+      "kl": 0.00668664975091815,
+      "learning_rate": 8.7e-05,
+      "loss": 0.0003,
+      "reward": 0.28681838512420654,
+      "reward_std": 0.25753265619277954,
+      "rewards/<lambda>": 0.28681838512420654,
+      "step": 87
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.8979591836734694,
+      "grad_norm": 0.08016186207532883,
+      "kl": 0.006137486081570387,
+      "learning_rate": 8.800000000000001e-05,
+      "loss": 0.0002,
+      "reward": 0.2647368907928467,
+      "reward_std": 0.24885067343711853,
+      "rewards/<lambda>": 0.2647368907928467,
+      "step": 88
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9081632653061225,
+      "grad_norm": 0.10147328674793243,
+      "kl": 0.00917094200849533,
+      "learning_rate": 8.900000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.29730311036109924,
+      "reward_std": 0.2591564953327179,
+      "rewards/<lambda>": 0.29730311036109924,
+      "step": 89
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9183673469387755,
+      "grad_norm": 0.12398728728294373,
+      "kl": 0.007823077961802483,
+      "learning_rate": 9e-05,
+      "loss": 0.0003,
+      "reward": 0.2589433193206787,
+      "reward_std": 0.24704992771148682,
+      "rewards/<lambda>": 0.2589433193206787,
+      "step": 90
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9285714285714286,
+      "grad_norm": 0.1265306919813156,
+      "kl": 0.009746653027832508,
+      "learning_rate": 9.1e-05,
+      "loss": 0.0004,
+      "reward": 0.2624666094779968,
+      "reward_std": 0.26040005683898926,
+      "rewards/<lambda>": 0.2624666094779968,
+      "step": 91
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9387755102040817,
+      "grad_norm": 0.09663106501102448,
+      "kl": 0.009732738137245178,
+      "learning_rate": 9.200000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.31370311975479126,
+      "reward_std": 0.2633877992630005,
+      "rewards/<lambda>": 0.31370311975479126,
+      "step": 92
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9489795918367347,
+      "grad_norm": 0.12955360114574432,
+      "kl": 0.011157519184052944,
+      "learning_rate": 9.300000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.3296325206756592,
+      "reward_std": 0.26135072112083435,
+      "rewards/<lambda>": 0.3296325206756592,
+      "step": 93
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9591836734693877,
+      "grad_norm": 0.07906248420476913,
+      "kl": 0.00876379944384098,
+      "learning_rate": 9.4e-05,
+      "loss": 0.0004,
+      "reward": 0.38581177592277527,
+      "reward_std": 0.21951782703399658,
+      "rewards/<lambda>": 0.38581177592277527,
+      "step": 94
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9693877551020408,
+      "grad_norm": 0.10264045745134354,
+      "kl": 0.013823926448822021,
+      "learning_rate": 9.5e-05,
+      "loss": 0.0006,
+      "reward": 0.37457725405693054,
+      "reward_std": 0.22653540968894958,
+      "rewards/<lambda>": 0.37457725405693054,
+      "step": 95
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9795918367346939,
+      "grad_norm": 0.09052637219429016,
+      "kl": 0.011999586597084999,
+      "learning_rate": 9.6e-05,
+      "loss": 0.0005,
+      "reward": 0.3207855224609375,
+      "reward_std": 0.23623248934745789,
+      "rewards/<lambda>": 0.3207855224609375,
+      "step": 96
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 0.9897959183673469,
+      "grad_norm": 0.11846129596233368,
+      "kl": 0.011448493227362633,
+      "learning_rate": 9.7e-05,
+      "loss": 0.0005,
+      "reward": 0.32892781496047974,
+      "reward_std": 0.23297683894634247,
+      "rewards/<lambda>": 0.32892781496047974,
+      "step": 97
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0,
+      "grad_norm": 0.09638947248458862,
+      "kl": 0.010949314571917057,
+      "learning_rate": 9.8e-05,
+      "loss": 0.0004,
+      "reward": 0.4758386015892029,
+      "reward_std": 0.2482748031616211,
+      "rewards/<lambda>": 0.4758386015892029,
+      "step": 98
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.010204081632653,
+      "grad_norm": 0.09481366723775864,
+      "kl": 0.01085271593183279,
+      "learning_rate": 9.900000000000001e-05,
+      "loss": 0.0004,
+      "reward": 0.3523382544517517,
+      "reward_std": 0.2240390032529831,
+      "rewards/<lambda>": 0.3523382544517517,
+      "step": 99
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0204081632653061,
+      "grad_norm": 0.09950532019138336,
+      "kl": 0.012218523770570755,
+      "learning_rate": 0.0001,
+      "loss": 0.0005,
+      "reward": 0.28119099140167236,
+      "reward_std": 0.22250661253929138,
+      "rewards/<lambda>": 0.28119099140167236,
+      "step": 100
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.030612244897959,
+      "grad_norm": 0.07458172738552094,
+      "kl": 0.009257722645998001,
+      "learning_rate": 9.999344418328162e-05,
+      "loss": 0.0004,
+      "reward": 0.33414292335510254,
+      "reward_std": 0.21670334041118622,
+      "rewards/<lambda>": 0.33414292335510254,
+      "step": 101
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0408163265306123,
+      "grad_norm": 0.08385725319385529,
+      "kl": 0.011008420959115028,
+      "learning_rate": 9.997377845227576e-05,
+      "loss": 0.0004,
+      "reward": 0.3389889895915985,
+      "reward_std": 0.24514511227607727,
+      "rewards/<lambda>": 0.3389889895915985,
+      "step": 102
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0510204081632653,
+      "grad_norm": 0.10061275213956833,
+      "kl": 0.014137129299342632,
+      "learning_rate": 9.994100796397954e-05,
+      "loss": 0.0006,
+      "reward": 0.4110638499259949,
+      "reward_std": 0.22362101078033447,
+      "rewards/<lambda>": 0.4110638499259949,
+      "step": 103
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0612244897959184,
+      "grad_norm": 0.10247852653265,
+      "kl": 0.015538258478045464,
+      "learning_rate": 9.989514131188559e-05,
+      "loss": 0.0006,
+      "reward": 0.3177061676979065,
+      "reward_std": 0.26253318786621094,
+      "rewards/<lambda>": 0.3177061676979065,
+      "step": 104
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0714285714285714,
+      "grad_norm": 0.11179685592651367,
+      "kl": 0.018445193767547607,
+      "learning_rate": 9.983619052372848e-05,
+      "loss": 0.0007,
+      "reward": 0.2772364020347595,
+      "reward_std": 0.25032472610473633,
+      "rewards/<lambda>": 0.2772364020347595,
+      "step": 105
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0816326530612246,
+      "grad_norm": 0.09749067574739456,
+      "kl": 0.01747281290590763,
+      "learning_rate": 9.97641710583307e-05,
+      "loss": 0.0007,
+      "reward": 0.3064037561416626,
+      "reward_std": 0.255950927734375,
+      "rewards/<lambda>": 0.3064037561416626,
+      "step": 106
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.0918367346938775,
+      "grad_norm": 0.08926989883184433,
+      "kl": 0.018639955669641495,
+      "learning_rate": 9.967910180154889e-05,
+      "loss": 0.0007,
+      "reward": 0.33175772428512573,
+      "reward_std": 0.2530345320701599,
+      "rewards/<lambda>": 0.33175772428512573,
+      "step": 107
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.1020408163265305,
+      "grad_norm": 0.10069949179887772,
+      "kl": 0.017288221046328545,
+      "learning_rate": 9.958100506132127e-05,
+      "loss": 0.0007,
+      "reward": 0.2976396083831787,
+      "reward_std": 0.2146736979484558,
+      "rewards/<lambda>": 0.2976396083831787,
+      "step": 108
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.1122448979591837,
+      "grad_norm": 0.08820276707410812,
+      "kl": 0.013430261984467506,
+      "learning_rate": 9.946990656181781e-05,
+      "loss": 0.0005,
+      "reward": 0.27721720933914185,
+      "reward_std": 0.2337382286787033,
+      "rewards/<lambda>": 0.27721720933914185,
+      "step": 109
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.1224489795918366,
+      "grad_norm": 0.09090136736631393,
+      "kl": 0.016656771302223206,
+      "learning_rate": 9.934583543669453e-05,
+      "loss": 0.0007,
+      "reward": 0.3464582562446594,
+      "reward_std": 0.26044878363609314,
+      "rewards/<lambda>": 0.3464582562446594,
+      "step": 110
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.1326530612244898,
+      "grad_norm": 0.09186483174562454,
+      "kl": 0.0160512812435627,
+      "learning_rate": 9.920882422145372e-05,
+      "loss": 0.0006,
+      "reward": 0.3516361713409424,
+      "reward_std": 0.2280379831790924,
+      "rewards/<lambda>": 0.3516361713409424,
+      "step": 111
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.1428571428571428,
+      "grad_norm": 0.0900593101978302,
+      "kl": 0.018652722239494324,
+      "learning_rate": 9.905890884491195e-05,
+      "loss": 0.0007,
+      "reward": 0.26406681537628174,
+      "reward_std": 0.23392222821712494,
+      "rewards/<lambda>": 0.26406681537628174,
+      "step": 112
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.153061224489796,
+      "grad_norm": 0.10555818676948547,
+      "kl": 0.01904328167438507,
+      "learning_rate": 9.889612861977853e-05,
+      "loss": 0.0008,
+      "reward": 0.2887289226055145,
+      "reward_std": 0.21101483702659607,
+      "rewards/<lambda>": 0.2887289226055145,
+      "step": 113
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.163265306122449,
+      "grad_norm": 0.10259734839200974,
+      "kl": 0.021371304988861084,
+      "learning_rate": 9.872052623234632e-05,
+      "loss": 0.0009,
+      "reward": 0.28545108437538147,
+      "reward_std": 0.2336777299642563,
+      "rewards/<lambda>": 0.28545108437538147,
+      "step": 114
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.1734693877551021,
+      "grad_norm": 0.10765431821346283,
+      "kl": 0.022964194416999817,
+      "learning_rate": 9.853214773129796e-05,
+      "loss": 0.0009,
+      "reward": 0.2757786214351654,
+      "reward_std": 0.2596602439880371,
+      "rewards/<lambda>": 0.2757786214351654,
+      "step": 115
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.183673469387755,
+      "grad_norm": 0.10045947879552841,
+      "kl": 0.01650853641331196,
+      "learning_rate": 9.833104251563056e-05,
+      "loss": 0.0007,
+      "reward": 0.37531301379203796,
+      "reward_std": 0.19301483035087585,
+      "rewards/<lambda>": 0.37531301379203796,
+      "step": 116
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.193877551020408,
+      "grad_norm": 0.08847310394048691,
+      "kl": 0.021333348006010056,
+      "learning_rate": 9.811726332170153e-05,
+      "loss": 0.0009,
+      "reward": 0.3334275186061859,
+      "reward_std": 0.22464393079280853,
+      "rewards/<lambda>": 0.3334275186061859,
+      "step": 117
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2040816326530612,
+      "grad_norm": 0.09106040000915527,
+      "kl": 0.019402675330638885,
+      "learning_rate": 9.789086620939936e-05,
+      "loss": 0.0008,
+      "reward": 0.3201903998851776,
+      "reward_std": 0.24303846061229706,
+      "rewards/<lambda>": 0.3201903998851776,
+      "step": 118
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2142857142857142,
+      "grad_norm": 0.09148433804512024,
+      "kl": 0.020076055079698563,
+      "learning_rate": 9.765191054744305e-05,
+      "loss": 0.0008,
+      "reward": 0.26528307795524597,
+      "reward_std": 0.20828311145305634,
+      "rewards/<lambda>": 0.26528307795524597,
+      "step": 119
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2244897959183674,
+      "grad_norm": 0.09016289561986923,
+      "kl": 0.02068479359149933,
+      "learning_rate": 9.740045899781352e-05,
+      "loss": 0.0008,
+      "reward": 0.3603302836418152,
+      "reward_std": 0.2604081332683563,
+      "rewards/<lambda>": 0.3603302836418152,
+      "step": 120
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2346938775510203,
+      "grad_norm": 0.08944398164749146,
+      "kl": 0.02130693942308426,
+      "learning_rate": 9.713657749932172e-05,
+      "loss": 0.0009,
+      "reward": 0.345991313457489,
+      "reward_std": 0.2438054382801056,
+      "rewards/<lambda>": 0.345991313457489,
+      "step": 121
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2448979591836735,
+      "grad_norm": 0.08998019993305206,
+      "kl": 0.018094293773174286,
+      "learning_rate": 9.686033525031719e-05,
+      "loss": 0.0007,
+      "reward": 0.23993568122386932,
+      "reward_std": 0.2522079348564148,
+      "rewards/<lambda>": 0.23993568122386932,
+      "step": 122
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2551020408163265,
+      "grad_norm": 0.09132429212331772,
+      "kl": 0.021364744752645493,
+      "learning_rate": 9.657180469054213e-05,
+      "loss": 0.0009,
+      "reward": 0.3082301616668701,
+      "reward_std": 0.24264320731163025,
+      "rewards/<lambda>": 0.3082301616668701,
+      "step": 123
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2653061224489797,
+      "grad_norm": 0.07840445637702942,
+      "kl": 0.01507033221423626,
+      "learning_rate": 9.627106148213522e-05,
+      "loss": 0.0006,
+      "reward": 0.2848488390445709,
+      "reward_std": 0.21467337012290955,
+      "rewards/<lambda>": 0.2848488390445709,
+      "step": 124
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2755102040816326,
+      "grad_norm": 0.08806001394987106,
+      "kl": 0.018644016236066818,
+      "learning_rate": 9.595818448979061e-05,
+      "loss": 0.0007,
+      "reward": 0.2717777192592621,
+      "reward_std": 0.2168688029050827,
+      "rewards/<lambda>": 0.2717777192592621,
+      "step": 125
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2857142857142856,
+      "grad_norm": 0.09057774394750595,
+      "kl": 0.019513610750436783,
+      "learning_rate": 9.563325576007701e-05,
+      "loss": 0.0008,
+      "reward": 0.3114008903503418,
+      "reward_std": 0.2647177577018738,
+      "rewards/<lambda>": 0.3114008903503418,
+      "step": 126
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.2959183673469388,
+      "grad_norm": 0.09328091889619827,
+      "kl": 0.02476987987756729,
+      "learning_rate": 9.529636049992234e-05,
+      "loss": 0.001,
+      "reward": 0.35005778074264526,
+      "reward_std": 0.2296982854604721,
+      "rewards/<lambda>": 0.35005778074264526,
+      "step": 127
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.306122448979592,
+      "grad_norm": 0.09036798775196075,
+      "kl": 0.0257786326110363,
+      "learning_rate": 9.494758705426978e-05,
+      "loss": 0.001,
+      "reward": 0.3622300624847412,
+      "reward_std": 0.23824195563793182,
+      "rewards/<lambda>": 0.3622300624847412,
+      "step": 128
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.316326530612245,
+      "grad_norm": 0.11859830468893051,
+      "kl": 0.021266868337988853,
+      "learning_rate": 9.458702688291073e-05,
+      "loss": 0.0009,
+      "reward": 0.2144925594329834,
+      "reward_std": 0.2241775244474411,
+      "rewards/<lambda>": 0.2144925594329834,
+      "step": 129
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.3265306122448979,
+      "grad_norm": 0.08116462826728821,
+      "kl": 0.0258474238216877,
+      "learning_rate": 9.421477453650118e-05,
+      "loss": 0.001,
+      "reward": 0.29674050211906433,
+      "reward_std": 0.22861354053020477,
+      "rewards/<lambda>": 0.29674050211906433,
+      "step": 130
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.336734693877551,
+      "grad_norm": 0.08280783146619797,
+      "kl": 0.024379916489124298,
+      "learning_rate": 9.38309276317674e-05,
+      "loss": 0.001,
+      "reward": 0.3025866746902466,
+      "reward_std": 0.2633967101573944,
+      "rewards/<lambda>": 0.3025866746902466,
+      "step": 131
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.346938775510204,
+      "grad_norm": 0.09579820185899734,
+      "kl": 0.02445649728178978,
+      "learning_rate": 9.343558682590756e-05,
+      "loss": 0.001,
+      "reward": 0.28383010625839233,
+      "reward_std": 0.24883991479873657,
+      "rewards/<lambda>": 0.28383010625839233,
+      "step": 132
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.3571428571428572,
+      "grad_norm": 0.08781164884567261,
+      "kl": 0.023096945136785507,
+      "learning_rate": 9.302885579019627e-05,
+      "loss": 0.0009,
+      "reward": 0.27924996614456177,
+      "reward_std": 0.2151711881160736,
+      "rewards/<lambda>": 0.27924996614456177,
+      "step": 133
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.3673469387755102,
+      "grad_norm": 0.09480538964271545,
+      "kl": 0.02117522805929184,
+      "learning_rate": 9.261084118279847e-05,
+      "loss": 0.0008,
+      "reward": 0.37825846672058105,
+      "reward_std": 0.2239784449338913,
+      "rewards/<lambda>": 0.37825846672058105,
+      "step": 134
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.3775510204081631,
+      "grad_norm": 0.09008127450942993,
+      "kl": 0.023548265919089317,
+      "learning_rate": 9.218165262080023e-05,
+      "loss": 0.0009,
+      "reward": 0.26018446683883667,
+      "reward_std": 0.23662327229976654,
+      "rewards/<lambda>": 0.26018446683883667,
+      "step": 135
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.3877551020408163,
+      "grad_norm": 0.10781973600387573,
+      "kl": 0.024034924805164337,
+      "learning_rate": 9.174140265146356e-05,
+      "loss": 0.001,
+      "reward": 0.3326035141944885,
+      "reward_std": 0.2397647500038147,
+      "rewards/<lambda>": 0.3326035141944885,
+      "step": 136
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.3979591836734695,
+      "grad_norm": 0.09277699142694473,
+      "kl": 0.026356618851423264,
+      "learning_rate": 9.129020672271283e-05,
+      "loss": 0.0011,
+      "reward": 0.33098840713500977,
+      "reward_std": 0.2678415775299072,
+      "rewards/<lambda>": 0.33098840713500977,
+      "step": 137
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4081632653061225,
+      "grad_norm": 0.09566947817802429,
+      "kl": 0.025415873154997826,
+      "learning_rate": 9.082818315286055e-05,
+      "loss": 0.001,
+      "reward": 0.30005940794944763,
+      "reward_std": 0.2622036635875702,
+      "rewards/<lambda>": 0.30005940794944763,
+      "step": 138
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4183673469387754,
+      "grad_norm": 0.09971340000629425,
+      "kl": 0.023971717804670334,
+      "learning_rate": 9.035545309958046e-05,
+      "loss": 0.001,
+      "reward": 0.29274553060531616,
+      "reward_std": 0.2605469822883606,
+      "rewards/<lambda>": 0.29274553060531616,
+      "step": 139
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4285714285714286,
+      "grad_norm": 0.17278257012367249,
+      "kl": 0.036384448409080505,
+      "learning_rate": 8.987214052813604e-05,
+      "loss": 0.0015,
+      "reward": 0.3640914857387543,
+      "reward_std": 0.22631150484085083,
+      "rewards/<lambda>": 0.3640914857387543,
+      "step": 140
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4387755102040816,
+      "grad_norm": 0.0879712924361229,
+      "kl": 0.022140463814139366,
+      "learning_rate": 8.937837217887273e-05,
+      "loss": 0.0009,
+      "reward": 0.3031749427318573,
+      "reward_std": 0.2355748862028122,
+      "rewards/<lambda>": 0.3031749427318573,
+      "step": 141
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4489795918367347,
+      "grad_norm": 0.09844816476106644,
+      "kl": 0.02770843915641308,
+      "learning_rate": 8.887427753398248e-05,
+      "loss": 0.0011,
+      "reward": 0.333422988653183,
+      "reward_std": 0.2207636535167694,
+      "rewards/<lambda>": 0.333422988653183,
+      "step": 142
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4591836734693877,
+      "grad_norm": 0.08592235296964645,
+      "kl": 0.02235778421163559,
+      "learning_rate": 8.835998878354931e-05,
+      "loss": 0.0009,
+      "reward": 0.3474034070968628,
+      "reward_std": 0.22502724826335907,
+      "rewards/<lambda>": 0.3474034070968628,
+      "step": 143
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.469387755102041,
+      "grad_norm": 0.10064583271741867,
+      "kl": 0.01792587712407112,
+      "learning_rate": 8.783564079088477e-05,
+      "loss": 0.0007,
+      "reward": 0.25580060482025146,
+      "reward_std": 0.2477901428937912,
+      "rewards/<lambda>": 0.25580060482025146,
+      "step": 144
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.4795918367346939,
+      "grad_norm": 0.10648388415575027,
+      "kl": 0.020750422030687332,
+      "learning_rate": 8.73013710571623e-05,
+      "loss": 0.0008,
+      "reward": 0.3896491527557373,
+      "reward_std": 0.24791425466537476,
+      "rewards/<lambda>": 0.3896491527557373,
+      "step": 145
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.489795918367347,
+      "grad_norm": 0.09451978653669357,
+      "kl": 0.024195339530706406,
+      "learning_rate": 8.675731968536002e-05,
+      "loss": 0.001,
+      "reward": 0.29740995168685913,
+      "reward_std": 0.26704221963882446,
+      "rewards/<lambda>": 0.29740995168685913,
+      "step": 146
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5,
+      "grad_norm": 0.08318852633237839,
+      "kl": 0.02090907096862793,
+      "learning_rate": 8.620362934352109e-05,
+      "loss": 0.0008,
+      "reward": 0.2733314037322998,
+      "reward_std": 0.2534288763999939,
+      "rewards/<lambda>": 0.2733314037322998,
+      "step": 147
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.510204081632653,
+      "grad_norm": 0.09782999753952026,
+      "kl": 0.019019391387701035,
+      "learning_rate": 8.564044522734147e-05,
+      "loss": 0.0008,
+      "reward": 0.32081273198127747,
+      "reward_std": 0.2417653501033783,
+      "rewards/<lambda>": 0.32081273198127747,
+      "step": 148
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5204081632653061,
+      "grad_norm": 0.08283797651529312,
+      "kl": 0.024606352671980858,
+      "learning_rate": 8.506791502209496e-05,
+      "loss": 0.001,
+      "reward": 0.2620502710342407,
+      "reward_std": 0.26248690485954285,
+      "rewards/<lambda>": 0.2620502710342407,
+      "step": 149
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5306122448979593,
+      "grad_norm": 0.09963952004909515,
+      "kl": 0.023783504962921143,
+      "learning_rate": 8.448618886390522e-05,
+      "loss": 0.001,
+      "reward": 0.30407556891441345,
+      "reward_std": 0.2542496621608734,
+      "rewards/<lambda>": 0.30407556891441345,
+      "step": 150
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5408163265306123,
+      "grad_norm": 0.08900399506092072,
+      "kl": 0.018189266324043274,
+      "learning_rate": 8.389541930037516e-05,
+      "loss": 0.0007,
+      "reward": 0.34692129492759705,
+      "reward_std": 0.25238391757011414,
+      "rewards/<lambda>": 0.34692129492759705,
+      "step": 151
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5510204081632653,
+      "grad_norm": 0.08934988081455231,
+      "kl": 0.02292565628886223,
+      "learning_rate": 8.329576125058406e-05,
+      "loss": 0.0009,
+      "reward": 0.36156541109085083,
+      "reward_std": 0.23901474475860596,
+      "rewards/<lambda>": 0.36156541109085083,
+      "step": 152
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5612244897959182,
+      "grad_norm": 0.08225736767053604,
+      "kl": 0.018968891352415085,
+      "learning_rate": 8.268737196446264e-05,
+      "loss": 0.0008,
+      "reward": 0.29287439584732056,
+      "reward_std": 0.2568100690841675,
+      "rewards/<lambda>": 0.29287439584732056,
+      "step": 153
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5714285714285714,
+      "grad_norm": 0.0784936249256134,
+      "kl": 0.017963310703635216,
+      "learning_rate": 8.2070410981557e-05,
+      "loss": 0.0007,
+      "reward": 0.3246784210205078,
+      "reward_std": 0.25057297945022583,
+      "rewards/<lambda>": 0.3246784210205078,
+      "step": 154
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5816326530612246,
+      "grad_norm": 0.07634963095188141,
+      "kl": 0.01993548683822155,
+      "learning_rate": 8.144504008919222e-05,
+      "loss": 0.0008,
+      "reward": 0.32869774103164673,
+      "reward_std": 0.22997353971004486,
+      "rewards/<lambda>": 0.32869774103164673,
+      "step": 155
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.5918367346938775,
+      "grad_norm": 0.2396547645330429,
+      "kl": 0.06700322777032852,
+      "learning_rate": 8.081142328004637e-05,
+      "loss": 0.0027,
+      "reward": 0.35755395889282227,
+      "reward_std": 0.21904043853282928,
+      "rewards/<lambda>": 0.35755395889282227,
+      "step": 156
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6020408163265305,
+      "grad_norm": 0.08202355355024338,
+      "kl": 0.01816694065928459,
+      "learning_rate": 8.016972670914624e-05,
+      "loss": 0.0007,
+      "reward": 0.34367379546165466,
+      "reward_std": 0.24324068427085876,
+      "rewards/<lambda>": 0.34367379546165466,
+      "step": 157
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6122448979591837,
+      "grad_norm": 0.10310099273920059,
+      "kl": 0.023697612807154655,
+      "learning_rate": 7.952011865029614e-05,
+      "loss": 0.0009,
+      "reward": 0.2922106683254242,
+      "reward_std": 0.26478201150894165,
+      "rewards/<lambda>": 0.2922106683254242,
+      "step": 158
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6224489795918369,
+      "grad_norm": 0.07689829915761948,
+      "kl": 0.01890609599649906,
+      "learning_rate": 7.886276945195099e-05,
+      "loss": 0.0008,
+      "reward": 0.3211023807525635,
+      "reward_std": 0.24797722697257996,
+      "rewards/<lambda>": 0.3211023807525635,
+      "step": 159
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6326530612244898,
+      "grad_norm": 0.07625985890626907,
+      "kl": 0.014439202845096588,
+      "learning_rate": 7.819785149254532e-05,
+      "loss": 0.0006,
+      "reward": 0.36535224318504333,
+      "reward_std": 0.26052916049957275,
+      "rewards/<lambda>": 0.36535224318504333,
+      "step": 160
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6428571428571428,
+      "grad_norm": 0.09949609637260437,
+      "kl": 0.015952972695231438,
+      "learning_rate": 7.752553913529018e-05,
+      "loss": 0.0006,
+      "reward": 0.34558984637260437,
+      "reward_std": 0.24385800957679749,
+      "rewards/<lambda>": 0.34558984637260437,
+      "step": 161
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6530612244897958,
+      "grad_norm": 0.07270115613937378,
+      "kl": 0.017309173941612244,
+      "learning_rate": 7.68460086824492e-05,
+      "loss": 0.0007,
+      "reward": 0.3196105659008026,
+      "reward_std": 0.24196776747703552,
+      "rewards/<lambda>": 0.3196105659008026,
+      "step": 162
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.663265306122449,
+      "grad_norm": 0.08732082694768906,
+      "kl": 0.02441173419356346,
+      "learning_rate": 7.61594383291065e-05,
+      "loss": 0.001,
+      "reward": 0.317618191242218,
+      "reward_std": 0.2187143862247467,
+      "rewards/<lambda>": 0.317618191242218,
+      "step": 163
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.6734693877551021,
+      "grad_norm": 0.07517392933368683,
+      "kl": 0.015940185636281967,
+      "learning_rate": 7.546600811643816e-05,
+      "loss": 0.0006,
+      "reward": 0.2270391881465912,
+      "reward_std": 0.24194824695587158,
+      "rewards/<lambda>": 0.2270391881465912,
+      "step": 164
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.683673469387755,
+      "grad_norm": 0.07552263885736465,
+      "kl": 0.01631421223282814,
+      "learning_rate": 7.476589988449939e-05,
+      "loss": 0.0007,
+      "reward": 0.3129787743091583,
+      "reward_std": 0.22886860370635986,
+      "rewards/<lambda>": 0.3129787743091583,
+      "step": 165
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.693877551020408,
+      "grad_norm": 0.08246885985136032,
+      "kl": 0.01746886596083641,
+      "learning_rate": 7.405929722454026e-05,
+      "loss": 0.0007,
+      "reward": 0.2953646779060364,
+      "reward_std": 0.2490847110748291,
+      "rewards/<lambda>": 0.2953646779060364,
+      "step": 166
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7040816326530612,
+      "grad_norm": 0.07670585811138153,
+      "kl": 0.021257830783724785,
+      "learning_rate": 7.334638543086203e-05,
+      "loss": 0.0009,
+      "reward": 0.39490482211112976,
+      "reward_std": 0.2520644962787628,
+      "rewards/<lambda>": 0.39490482211112976,
+      "step": 167
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7142857142857144,
+      "grad_norm": 0.07335949689149857,
+      "kl": 0.012844810262322426,
+      "learning_rate": 7.262735145222696e-05,
+      "loss": 0.0005,
+      "reward": 0.3359541893005371,
+      "reward_std": 0.22056405246257782,
+      "rewards/<lambda>": 0.3359541893005371,
+      "step": 168
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7244897959183674,
+      "grad_norm": 0.10231731832027435,
+      "kl": 0.020412620157003403,
+      "learning_rate": 7.190238384283412e-05,
+      "loss": 0.0008,
+      "reward": 0.2858106791973114,
+      "reward_std": 0.22994212806224823,
+      "rewards/<lambda>": 0.2858106791973114,
+      "step": 169
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7346938775510203,
+      "grad_norm": 0.0776882916688919,
+      "kl": 0.015348056331276894,
+      "learning_rate": 7.117167271287453e-05,
+      "loss": 0.0006,
+      "reward": 0.32561856508255005,
+      "reward_std": 0.24203304946422577,
+      "rewards/<lambda>": 0.32561856508255005,
+      "step": 170
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7448979591836735,
+      "grad_norm": 0.09082949161529541,
+      "kl": 0.021862512454390526,
+      "learning_rate": 7.043540967867782e-05,
+      "loss": 0.0009,
+      "reward": 0.3552192449569702,
+      "reward_std": 0.2229807823896408,
+      "rewards/<lambda>": 0.3552192449569702,
+      "step": 171
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7551020408163265,
+      "grad_norm": 0.09254392236471176,
+      "kl": 0.023197144269943237,
+      "learning_rate": 6.969378781246436e-05,
+      "loss": 0.0009,
+      "reward": 0.3580297827720642,
+      "reward_std": 0.25305554270744324,
+      "rewards/<lambda>": 0.3580297827720642,
+      "step": 172
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7653061224489797,
+      "grad_norm": 0.09547614306211472,
+      "kl": 0.016374874860048294,
+      "learning_rate": 6.894700159171534e-05,
+      "loss": 0.0007,
+      "reward": 0.3224062919616699,
+      "reward_std": 0.22778098285198212,
+      "rewards/<lambda>": 0.3224062919616699,
+      "step": 173
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7755102040816326,
+      "grad_norm": 0.07411126792430878,
+      "kl": 0.025782398879528046,
+      "learning_rate": 6.819524684817438e-05,
+      "loss": 0.001,
+      "reward": 0.3161015212535858,
+      "reward_std": 0.24362775683403015,
+      "rewards/<lambda>": 0.3161015212535858,
+      "step": 174
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7857142857142856,
+      "grad_norm": 0.06989577412605286,
+      "kl": 0.014967722818255424,
+      "learning_rate": 6.743872071649411e-05,
+      "loss": 0.0006,
+      "reward": 0.3597205877304077,
+      "reward_std": 0.22002741694450378,
+      "rewards/<lambda>": 0.3597205877304077,
+      "step": 175
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.7959183673469388,
+      "grad_norm": 0.07689698040485382,
+      "kl": 0.014289500191807747,
+      "learning_rate": 6.667762158254104e-05,
+      "loss": 0.0006,
+      "reward": 0.349145770072937,
+      "reward_std": 0.25986701250076294,
+      "rewards/<lambda>": 0.349145770072937,
+      "step": 176
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.806122448979592,
+      "grad_norm": 0.08618627488613129,
+      "kl": 0.01953529752790928,
+      "learning_rate": 6.59121490313722e-05,
+      "loss": 0.0008,
+      "reward": 0.2717389762401581,
+      "reward_std": 0.24193832278251648,
+      "rewards/<lambda>": 0.2717389762401581,
+      "step": 177
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.816326530612245,
+      "grad_norm": 0.08920174837112427,
+      "kl": 0.02378169447183609,
+      "learning_rate": 6.514250379489753e-05,
+      "loss": 0.001,
+      "reward": 0.3774096667766571,
+      "reward_std": 0.20812949538230896,
+      "rewards/<lambda>": 0.3774096667766571,
+      "step": 178
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8265306122448979,
+      "grad_norm": 0.07194539904594421,
+      "kl": 0.013617664575576782,
+      "learning_rate": 6.436888769924142e-05,
+      "loss": 0.0005,
+      "reward": 0.36836332082748413,
+      "reward_std": 0.24121849238872528,
+      "rewards/<lambda>": 0.36836332082748413,
+      "step": 179
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.836734693877551,
+      "grad_norm": 0.08749783784151077,
+      "kl": 0.019728073850274086,
+      "learning_rate": 6.359150361181715e-05,
+      "loss": 0.0008,
+      "reward": 0.3138796091079712,
+      "reward_std": 0.21334460377693176,
+      "rewards/<lambda>": 0.3138796091079712,
+      "step": 180
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8469387755102042,
+      "grad_norm": 0.08991340547800064,
+      "kl": 0.019904986023902893,
+      "learning_rate": 6.281055538812861e-05,
+      "loss": 0.0008,
+      "reward": 0.28394806385040283,
+      "reward_std": 0.26574069261550903,
+      "rewards/<lambda>": 0.28394806385040283,
+      "step": 181
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8571428571428572,
+      "grad_norm": 0.07849021255970001,
+      "kl": 0.022857721894979477,
+      "learning_rate": 6.202624781831268e-05,
+      "loss": 0.0009,
+      "reward": 0.31418418884277344,
+      "reward_std": 0.2225717306137085,
+      "rewards/<lambda>": 0.31418418884277344,
+      "step": 182
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8673469387755102,
+      "grad_norm": 0.12505197525024414,
+      "kl": 0.019057810306549072,
+      "learning_rate": 6.123878657343648e-05,
+      "loss": 0.0008,
+      "reward": 0.29661425948143005,
+      "reward_std": 0.22678154706954956,
+      "rewards/<lambda>": 0.29661425948143005,
+      "step": 183
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8775510204081631,
+      "grad_norm": 0.08824272453784943,
+      "kl": 0.024682914838194847,
+      "learning_rate": 6.044837815156377e-05,
+      "loss": 0.001,
+      "reward": 0.30208826065063477,
+      "reward_std": 0.25953543186187744,
+      "rewards/<lambda>": 0.30208826065063477,
+      "step": 184
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8877551020408163,
+      "grad_norm": 0.07919792085886002,
+      "kl": 0.017491590231657028,
+      "learning_rate": 5.9655229823604406e-05,
+      "loss": 0.0007,
+      "reward": 0.2938078045845032,
+      "reward_std": 0.24010448157787323,
+      "rewards/<lambda>": 0.2938078045845032,
+      "step": 185
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.8979591836734695,
+      "grad_norm": 0.08519456535577774,
+      "kl": 0.014628434553742409,
+      "learning_rate": 5.885954957896115e-05,
+      "loss": 0.0006,
+      "reward": 0.3203762471675873,
+      "reward_std": 0.21388846635818481,
+      "rewards/<lambda>": 0.3203762471675873,
+      "step": 186
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9081632653061225,
+      "grad_norm": 0.09762715548276901,
+      "kl": 0.021606557071208954,
+      "learning_rate": 5.8061546070987994e-05,
+      "loss": 0.0009,
+      "reward": 0.2728058099746704,
+      "reward_std": 0.25408631563186646,
+      "rewards/<lambda>": 0.2728058099746704,
+      "step": 187
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9183673469387754,
+      "grad_norm": 0.1670609414577484,
+      "kl": 0.03298768401145935,
+      "learning_rate": 5.726142856227452e-05,
+      "loss": 0.0013,
+      "reward": 0.3018071949481964,
+      "reward_std": 0.24476169049739838,
+      "rewards/<lambda>": 0.3018071949481964,
+      "step": 188
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9285714285714286,
+      "grad_norm": 0.08198920637369156,
+      "kl": 0.016482073813676834,
+      "learning_rate": 5.645940686977033e-05,
+      "loss": 0.0007,
+      "reward": 0.33921799063682556,
+      "reward_std": 0.2550126910209656,
+      "rewards/<lambda>": 0.33921799063682556,
+      "step": 189
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9387755102040818,
+      "grad_norm": 0.07509396970272064,
+      "kl": 0.014598744921386242,
+      "learning_rate": 5.565569130976422e-05,
+      "loss": 0.0006,
+      "reward": 0.36032330989837646,
+      "reward_std": 0.2456628531217575,
+      "rewards/<lambda>": 0.36032330989837646,
+      "step": 190
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9489795918367347,
+      "grad_norm": 0.07215838134288788,
+      "kl": 0.012914094142615795,
+      "learning_rate": 5.4850492642732406e-05,
+      "loss": 0.0005,
+      "reward": 0.3160780072212219,
+      "reward_std": 0.260766863822937,
+      "rewards/<lambda>": 0.3160780072212219,
+      "step": 191
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9591836734693877,
+      "grad_norm": 0.08896195888519287,
+      "kl": 0.019211173057556152,
+      "learning_rate": 5.4044022018070214e-05,
+      "loss": 0.0008,
+      "reward": 0.3490823209285736,
+      "reward_std": 0.26013457775115967,
+      "rewards/<lambda>": 0.3490823209285736,
+      "step": 192
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9693877551020407,
+      "grad_norm": 0.08373931050300598,
+      "kl": 0.01714364066720009,
+      "learning_rate": 5.3236490918721794e-05,
+      "loss": 0.0007,
+      "reward": 0.3829745948314667,
+      "reward_std": 0.20967933535575867,
+      "rewards/<lambda>": 0.3829745948314667,
+      "step": 193
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.9795918367346939,
+      "grad_norm": 0.07414960116147995,
+      "kl": 0.010791709646582603,
+      "learning_rate": 5.242811110572242e-05,
+      "loss": 0.0004,
+      "reward": 0.3071313798427582,
+      "reward_std": 0.23706994950771332,
+      "rewards/<lambda>": 0.3071313798427582,
+      "step": 194
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 1.989795918367347,
+      "grad_norm": 0.08582977205514908,
+      "kl": 0.022279752418398857,
+      "learning_rate": 5.1619094562667804e-05,
+      "loss": 0.0009,
+      "reward": 0.23579248785972595,
+      "reward_std": 0.19861236214637756,
+      "rewards/<lambda>": 0.23579248785972595,
+      "step": 195
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 2.0,
+      "grad_norm": 0.09055842459201813,
+      "kl": 0.013352158479392529,
+      "learning_rate": 5.080965344012508e-05,
+      "loss": 0.0005,
+      "reward": 0.34037166833877563,
+      "reward_std": 0.26631197333335876,
+      "rewards/<lambda>": 0.34037166833877563,
+      "step": 196
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 2.010204081632653,
+      "grad_norm": 0.08190540969371796,
+      "kl": 0.01571495272219181,
+      "learning_rate": 5e-05,
+      "loss": 0.0006,
+      "reward": 0.3017084002494812,
+      "reward_std": 0.24429354071617126,
+      "rewards/<lambda>": 0.3017084002494812,
+      "step": 197
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 2.020408163265306,
+      "grad_norm": 0.07793857902288437,
+      "kl": 0.012462570331990719,
+      "learning_rate": 4.919034655987493e-05,
+      "loss": 0.0005,
+      "reward": 0.2924309968948364,
+      "reward_std": 0.2147884964942932,
+      "rewards/<lambda>": 0.2924309968948364,
+      "step": 198
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 2.0306122448979593,
+      "grad_norm": 0.08943740278482437,
+      "kl": 0.016143202781677246,
+      "learning_rate": 4.838090543733222e-05,
+      "loss": 0.0006,
+      "reward": 0.3363001346588135,
+      "reward_std": 0.252535343170166,
+      "rewards/<lambda>": 0.3363001346588135,
+      "step": 199
+    },
+    {
+      "completion_length": 160.0,
+      "epoch": 2.0408163265306123,
+      "grad_norm": 0.07238653302192688,
+      "kl": 0.01266445405781269,
+      "learning_rate": 4.7571888894277604e-05,
+      "loss": 0.0005,
+      "reward": 0.34232497215270996,
+      "reward_std": 0.24186542630195618,
+      "rewards/<lambda>": 0.34232497215270996,
+      "step": 200
+    }
+  ],
+  "logging_steps": 1,
+  "max_steps": 294,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 3,
+  "save_steps": 100,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 0.0,
+  "train_batch_size": 8,
+  "trial_name": null,
+  "trial_params": null
+}
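Reviewer note: the `learning_rate` values logged in this hunk look consistent with linear warmup followed by cosine decay. The sketch below checks that reading against a few entries copied from the log. The peak rate (`1e-4`) and warmup length (`100` steps) are assumptions inferred from the numbers, not values stated anywhere in this diff; only `max_steps` comes from the file. The steps-per-epoch figure is likewise inferred from step 196 being logged at epoch 2.0.

```python
import math

# Assumed hyperparameters: NOT stated in this diff, inferred from the logged values.
PEAK_LR = 1e-4        # assumption
WARMUP_STEPS = 100    # assumption
MAX_STEPS = 294       # from "max_steps" in trainer_state.json
STEPS_PER_EPOCH = 98  # inferred: step 196 is logged at epoch 2.0


def cosine_lr_with_warmup(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# A few (step, learning_rate) pairs copied from the log above.
logged = {
    169: 7.190238384283412e-05,
    197: 5e-05,
    198: 4.919034655987493e-05,
    200: 4.7571888894277604e-05,
}

for step, lr in logged.items():
    predicted = cosine_lr_with_warmup(step)
    # Print the step, its epoch, the predicted rate, and the absolute gap.
    print(step, step / STEPS_PER_EPOCH, predicted, abs(predicted - lr))
```

If the gaps printed in the last column are negligible, the assumed schedule is a reasonable description of this run; it does not prove those were the exact arguments passed to the trainer.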

checkpoint-200/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:10bbf0a23b8f4b09a58ff960658e6bc23228310be351cac2a5c1ed047b491f12
+size 5560

checkpoint-200/vocab.json ADDED

The diff for this file is too large to render. See raw diff

checkpoint-294/README.md ADDED

	@@ -0,0 +1,202 @@

+---
+base_model: HuggingFaceTB/SmolLM2-360M
+library_name: peft
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.14.0

checkpoint-294/adapter_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
+  "bias": "none",
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "k_proj",
+    "up_proj",
+    "v_proj",
+    "gate_proj",
+    "q_proj",
+    "down_proj",
+    "o_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "use_dora": false,
+  "use_rslora": false
+}
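Reviewer note: a rough sanity check on what this adapter config implies. The sketch below computes the LoRA scaling factor (`lora_alpha / r`, which applies when `use_rslora` is false) and estimates the number of adapter parameters. The base-model layer shapes (`HIDDEN`, `INTERMEDIATE`, `KV_DIM`, `NUM_LAYERS`) are assumptions about SmolLM2-360M that do not appear in this diff, so treat the totals as an estimate under those assumptions.

```python
# Values copied from adapter_config.json above.
R = 32
LORA_ALPHA = 64
TARGET_MODULES = [
    "k_proj", "up_proj", "v_proj", "gate_proj", "q_proj", "down_proj", "o_proj",
]

# Assumed base-model shapes (NOT in this diff; check the base model's config.json).
HIDDEN = 960         # assumption: hidden size
INTERMEDIATE = 2560  # assumption: MLP intermediate size
KV_DIM = 320         # assumption: 5 key/value heads x head dim 64
NUM_LAYERS = 32      # assumption: number of decoder layers

# Module name -> (in_features, out_features) under the assumptions above.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count() -> int:
    """Each adapted linear layer adds A (r x in) and B (out x r) matrices."""
    per_layer = sum(R * (SHAPES[m][0] + SHAPES[m][1]) for m in TARGET_MODULES)
    return per_layer * NUM_LAYERS


scaling = LORA_ALPHA / R
total = lora_param_count()
print("scaling:", scaling)
print("adapter parameters:", total)
print("bytes at 4 bytes per parameter:", total * 4)
```

Under these assumptions, and if the weights are stored in 32-bit floats, the byte estimate lands close to the 69527352 bytes listed for `checkpoint-294/adapter_model.safetensors` in this commit; the small remainder would plausibly be file metadata.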

checkpoint-294/adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:85dd77015f38ad066fda8497d4850b47b3c730e1a8ec45df9740492fd158a815
+size 69527352

checkpoint-294/merges.txt ADDED

The diff for this file is too large to render. See raw diff

checkpoint-294/optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:51d7bef3d5e4c16bd7ab755a48ac2131817ba4dbe17f788c7a57a56554739be9
+size 139313234

checkpoint-294/rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6bcc753b015f9940d7113f7cecd74678d8376fa954e3d0c84aaeafb2a3cfb7ef
+size 14244

checkpoint-294/scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3caed79c575bc5cd30b5a339de564918baa7e790dba7be008b9ec74fd9f97580
+size 1064

checkpoint-294/special_tokens_map.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "additional_special_tokens": [
+    {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": "<|im_start|>",
+  "eos_token": "<|im_end|>",
+  "pad_token": "<|im_end|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

checkpoint-294/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

checkpoint-294/tokenizer_config.json ADDED

	@@ -0,0 +1,156 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<repo_name>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<reponame>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "<file_sep>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "<filename>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "7": {
+      "content": "<gh_stars>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "8": {
+      "content": "<issue_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "9": {
+      "content": "<issue_comment>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "10": {
+      "content": "<issue_closed>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "11": {
+      "content": "<jupyter_start>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "12": {
+      "content": "<jupyter_text>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "13": {
+      "content": "<jupyter_code>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "14": {
+      "content": "<jupyter_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "15": {
+      "content": "<jupyter_script>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "16": {
+      "content": "<empty_output>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>"
+  ],
+  "bos_token": "<|im_start|>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "extra_special_tokens": {},
+  "model_max_length": 8192,
+  "pad_token": "<|im_end|>",
+  "padding_side": "left",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
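Reviewer note: the `chat_template` above is a ChatML-style Jinja template. The sketch below reimplements its logic in plain Python to show the string it produces; `render_chat` is a hypothetical helper written for illustration and is not part of the tokenizer or any library.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Mirror the Jinja chat_template from tokenizer_config.json in plain Python."""
    parts = []
    for message in messages:
        # Each turn: <|im_start|>ROLE\nCONTENT<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "What is 2 + 2?"}],
    add_generation_prompt=True,
)
print(repr(prompt))
```

In practice the tokenizer applies the template itself; the hand-written version is only meant to make the token layout visible, including the fact that `<|im_end|>` doubles as both `eos_token` and `pad_token` in this config.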

checkpoint-294/trainer_state.json ADDED

The diff for this file is too large to render. See raw diff

checkpoint-294/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:10bbf0a23b8f4b09a58ff960658e6bc23228310be351cac2a5c1ed047b491f12
+size 5560

checkpoint-294/vocab.json ADDED

The diff for this file is too large to render. See raw diff