pipeline_tag: text2text-generation
---

# Quick Links

- **GitHub Repository**: https://github.com/ahans30/goldfish-loss
- **arXiv**: https://arxiv.org/abs/2406.10209

# Goldfish Loss

We introduce goldfish loss, a new language modeling loss function that mitigates memorization of training data. Specifically, goldfish loss pseudorandomly excludes $1/k$ of the tokens seen in the forward pass from the loss computation (i.e., no loss is computed for these tokens), where $k$ is a hyperparameter. We show that models trained this way find it increasingly difficult to regurgitate training data verbatim, even after 100 epochs. Please read our paper, linked below, for more details.
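As a minimal sketch (in plain Python, not the authors' code), a hash-based drop rule in the spirit of the "Hash (width = 13)" strategy listed below can be written as follows. The function name, the use of SHA-256, and the window encoding are illustrative assumptions; only the idea of hashing a width-13 window of recent token ids to deterministically drop ~$1/k$ of tokens comes from the text above.

```python
import hashlib

def goldfish_mask(token_ids, k=4, width=13):
    """Return a 0/1 loss mask over token_ids.

    mask[i] == 0 means token i is dropped from the loss (the goldfish drop);
    mask[i] == 1 means it is kept. The decision hashes the `width` most
    recent token ids, so the same local context always drops the same token.
    This is an illustrative sketch, not the paper's implementation.
    """
    mask = []
    for i in range(len(token_ids)):
        window = token_ids[max(0, i - width + 1): i + 1]
        digest = hashlib.sha256(str(window).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        mask.append(0 if h % k == 0 else 1)
    return mask

mask = goldfish_mask(list(range(1000)), k=4)
print(sum(mask) / len(mask))  # roughly 1 - 1/k, i.e. about 0.75 here
```

Because the mask depends only on the local token window, repeated occurrences of the same passage drop the same tokens, which is what prevents the model from ever receiving supervision on the full sequence.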

# Overview

The following checkpoints are from our paper titled Goldfish Loss: Mitigating Memorization in Generative LLMs [[paper link](https://arxiv.org/abs/2406.10209)].

| Checkpoint Name | k-GL | Token Drop Strategy | Pretrain Tokens | Primary Dataset | Canaries Dataset for Memorization |
| --------------- | ---- | ------------------- | --------------- | --------------- | --------------------------------- |
| [tomg-group-umd/3-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/3-goldfish-loss-llama-1B) | 3 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
| [tomg-group-umd/4-goldfish-loss-llama-1B](https://huggingface.co/tomg-group-umd/4-goldfish-loss-llama-1B) | 4 | Hash (width = 13) | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
| [tomg-group-umd/control-llama-1B](https://huggingface.co/tomg-group-umd/control-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | None |
| [tomg-group-umd/standard-loss-llama-1B](https://huggingface.co/tomg-group-umd/standard-loss-llama-1B) | \- | No Tokens Dropped | 20B | Redpajama | [Wikipedia](https://huggingface.co/datasets/tomg-group-umd/wikipedia-en-2k-samples) |
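Assuming the checkpoints are published in standard Hugging Face format (an assumption here, not stated above), they should load with the usual `transformers` auto classes. The prompt and generation settings below are purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any checkpoint name from the table above works here.
name = "tomg-group-umd/3-goldfish-loss-llama-1B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Greedy decoding of a short continuation.
inputs = tokenizer("The goldfish loss mitigates", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```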

### Description

- `standard-loss-llama-1B` and `control-llama-1B` are trained with the standard causal language modeling loss, using exactly the same training specifications as the goldfish models.
- The control model differs only in that it was not trained on the canaries dataset for memorization; it was simply pre-trained on 20B Redpajama tokens.
- The canaries dataset, which contains 2,000 Wikipedia documents, is repeated 50 times throughout pre-training, totaling ~204M tokens (including padding).
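Note that the goldfish drop affects only the loss: dropped tokens still pass through the forward pass, but are excluded from the cross-entropy average. A minimal sketch of that masked average, assuming per-token negative log-likelihoods have already been computed (the function name and inputs are illustrative):

```python
def goldfish_loss(token_nlls, loss_mask):
    """Mean negative log-likelihood over kept tokens only.

    loss_mask[i] == 0 marks a token dropped by the goldfish rule; it is
    excluded from the average. A sketch, not the paper's training code.
    """
    kept = [nll for nll, keep in zip(token_nlls, loss_mask) if keep]
    return sum(kept) / len(kept)

# Four tokens, the second dropped from the loss:
print(goldfish_loss([2.0, 5.0, 1.0, 3.0], [1, 0, 1, 1]))  # 2.0
```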