---
language: en
tags:
- text-generation
- knowledge-distillation
- llama
- causal-lm
- openwebtext
- wikitext
- transfer-learning
model_name: DistilLLaMA
license: apache-2.0
datasets:
- openwebtext
- wikitext
parameter_count: 80M
metrics:
- cosine-similarity
- exact-match
- rouge
library_name: transformers
base_model: meta-llama/LLaMA-2-7B
---

### Overview

This model is a distilled version of LLaMA 2 with approximately 80 million parameters.
It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets.
Knowledge distillation was employed to transfer knowledge from a larger "teacher" model, Meta's LLaMA 2 7B, so that this smaller "student" model learns to mimic the teacher's behavior.
This is the latest version of DistilLlama, trained for five days on two NVIDIA A100 80GB GPUs.
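
For context on how distillation works, below is a minimal sketch of a standard knowledge-distillation objective: a soft-target KL divergence against the teacher's logits blended with the usual next-token cross-entropy. The temperature and weighting here are illustrative assumptions, not hyperparameters from this card; the actual recipe lives in the training repo linked below.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL with hard-label cross-entropy.

    Sketch only: temperature and alpha are assumed values,
    not taken from this model card.
    """
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary next-token cross-entropy on the data labels.
    vocab = student_logits.size(-1)
    hard = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    return alpha * soft + (1 - alpha) * hard
```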

### Update

30 of 300 training checkpoints were evaluated, and the checkpoint with the best semantic and factual accuracy has been uploaded to this repository.

### Model Architecture

The architecture is based on LLaMA 2, with the following parameters:

| Parameter                 | Value |
|---------------------------|-------|
| Hidden Dimension          | 512   |
| Intermediate Dimension    | 1536  |
| Max Positional Embeddings | 128   |
| Attention Heads           | 8     |
| Transformer Layers        | 16    |
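
As a rough illustration, an equivalent model can be instantiated with the `transformers` library as below. The field names are the standard `LlamaConfig` arguments; the vocabulary size is an assumption (left at the LLaMA 2 default of 32,000), since the card does not state it.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of a config matching the table above (vocab_size is assumed).
config = LlamaConfig(
    hidden_size=512,             # Hidden Dimension
    intermediate_size=1536,      # Intermediate Dimension
    max_position_embeddings=128, # Max Positional Embeddings
    num_attention_heads=8,       # Attention Heads
    num_hidden_layers=16,        # Transformer Layers
)
model = LlamaForCausalLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```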

### Evaluation Metrics

1. **Cosine Similarity using Word Embeddings**
   - **Description**: Measures semantic similarity by mapping words/phrases to vectors. (A code sketch of all three metrics follows this list.)
   - **Equation**: Cosine Similarity = (A · B) / (‖A‖ ‖B‖)
   - **Example**: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)

2. **Exact Match (EM)**
   - **Description**: Checks whether critical keywords are present in the response.
   - **Example**:
     - Expected: "Paris"
     - Response: "The capital of France is Paris." (EM = 1)

3. **ROUGE Score**
   - **Description**: Measures the overlap of the longest common subsequence (LCS) between the reference text R and the candidate response C (i.e., ROUGE-L).
   - **Equations**:
     - Precision = LCS(R, C) / length(C)
     - Recall = LCS(R, C) / length(R)
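
The sketch below shows one way these three metrics can be computed. It assumes whitespace tokenization, embedding vectors supplied by the caller, and keyword-containment EM; the evaluation repo linked below may implement them differently.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match(expected: str, response: str) -> int:
    """1 if the expected keyword appears in the response, else 0."""
    return int(expected.lower() in response.lower())

def lcs_length(r: list, c: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            if r[i - 1] == c[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str) -> tuple:
    """(precision, recall) of ROUGE-L over whitespace tokens."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    return lcs / len(c), lcs / len(r)
```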

### Model Evaluation Summary

| Model Name      | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|-----------------|--------------|--------------------|---------|------------------------|------------------|
| LLaMA-2-7B-HF   | 18215.61     | 1.84e-01           | 0.715   | 0.7257                 | 0.0821           |
| baby-llama-58m  | 57.20        | 2.73e-06           | 0.025   | 0.6556                 | 0.0097           |
| DistilLlama     | 77.12        | 7.79e-04           | 0.02    | 0.6623                 | 0.0115           |
| DistilLlamaV1   | 78.46        | 8.49e-04           | 0.065   | 0.6776                 | 0.0135           |

*Note: CodeCarbon was used to track carbon emissions. The evaluation ran on an Intel(R) Xeon(R) Gold 6448H (32 cores) with 80 GB of memory allocated.*
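
For reproducibility, here is a minimal sketch of the kind of CodeCarbon tracking used; `run_evaluation` is a hypothetical stand-in for the actual evaluation loop, and the exact setup in the evaluation repo may differ.

```python
from codecarbon import EmissionsTracker

# Wrap the evaluation loop in an emissions tracker.
tracker = EmissionsTracker()
tracker.start()
run_evaluation()  # hypothetical: replace with the actual evaluation loop
emissions_kg = tracker.stop()  # emissions in kgCO2e
print(f"Emissions: {emissions_kg:.2e} kgCO2e")
```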

### GitHub Repositories

- **Training Repo**: [DistilLlama Training Repository](https://github.com/HenryHuang2/DistilLlama)
- **Evaluation Repo**: [Knowledge Distillation Evaluation Repository](https://github.com/svarnim1805/Knowledge-Distillation)

### Reference

    @misc{timiryasov2023babyllamaknowledgedistillation,
          title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
          author={Inar Timiryasov and Jean-Loup Tastet},
          year={2023},
          eprint={2308.02019},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2308.02019},
    }

*Note: The repository will be updated as training progresses. Last updated 2024-11-06.*