---
language: en
tags:
- text-generation
- knowledge-distillation
- llama
- causal-lm
- openwebtext
- wikitext
- transfer-learning
model_name: DistilLLaMA
license: apache-2.0
datasets:
- openwebtext
- wikitext
parameter_count: 80M
metrics:
- cosine-similarity
- exact-match
- rouge
library_name: transformers
base_model: meta-llama/LLaMA-2-7B
---
### Overview
This model is a distilled version of LLaMA 2, containing approximately 80 million parameters.
It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets.
Knowledge distillation was used to transfer knowledge from a larger "teacher" model, Meta's LLaMA 2 7B, so that this smaller model learns to mimic the teacher's behavior.
This is the latest version of DistilLlama; it was trained for 5 days on two NVIDIA A100 80GB GPUs.
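The snippet below is a minimal usage sketch with the `transformers` library. The repository id `HenryHHHH/DistilLlamaV1` is assumed from this model card and may need adjusting.

```python
# Minimal usage sketch (the repository id is assumed from this card;
# adjust it if the model is hosted elsewhere).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HenryHHHH/DistilLlamaV1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```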
### Update
Of the 300 saved checkpoints, 30 were examined, and the one with the best semantic and factual accuracy is now the version published in this repository.
### Model Architecture
The architecture is based on LLaMA 2, with the following parameters:
| Parameter | Value |
|-------------------------|-------|
| Hidden Dimension | 512 |
| Intermediate Dimension | 1536 |
| Max Positional Embeddings | 128 |
| Attention Heads | 8 |
| Transformer Layers | 16 |
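
These hyperparameters map directly onto a Hugging Face `LlamaConfig`, as in the sketch below. The vocabulary size is not stated in this card, so the LLaMA 2 default of 32,000 is assumed.

```python
# Sketch of the student architecture from the table above.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,             # assumed (LLaMA 2 default); not listed in the table
    hidden_size=512,              # Hidden Dimension
    intermediate_size=1536,       # Intermediate Dimension
    max_position_embeddings=128,  # Max Positional Embeddings
    num_attention_heads=8,        # Attention Heads
    num_hidden_layers=16,         # Transformer Layers
)
model = LlamaForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```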
### Evaluation Metrics
1. **Cosine Similarity using Word Embeddings**
- **Description**: Measures semantic similarity by mapping words/phrases to vectors.
- **Equation**: Cosine Similarity = ( A • B ) / ( ||A|| ||B|| )
- **Example**: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)
2. **Exact Match (EM)**
- **Description**: Checks if critical keywords are present.
- **Example**:
- Expected: "Paris"
- Response: "The capital of France is Paris." (EM = 1)
3. **ROUGE Score (ROUGE-L)**
- **Description**: Measures the overlap of the longest common subsequence (LCS) between the reference text R and the candidate response C. (A minimal implementation of all three metrics is sketched after this list.)
- **Equation**:
  - Precision = LCS(R, C) / length of C
  - Recall = LCS(R, C) / length of R
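
The snippet below is a simplified, self-contained sketch of the three metrics as described above; the actual implementation lives in the evaluation repository linked further down. A bag-of-words vector stands in for the learned word embeddings used in the real cosine-similarity evaluation.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine Similarity = (A . B) / (||A|| ||B||) over token-count vectors.
    The real evaluation uses learned word embeddings instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def exact_match(expected: str, response: str) -> int:
    """EM = 1 if the expected keyword appears in the response."""
    return int(expected.lower() in response.lower())

def lcs_length(r: list[str], c: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            if r[i - 1] == c[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str) -> dict:
    """ROUGE-L: Precision = LCS(R, C) / |C|, Recall = LCS(R, C) / |R|."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    return {"precision": lcs / len(c), "recall": lcs / len(r)}

print(exact_match("Paris", "The capital of France is Paris."))  # -> 1
# Bag-of-words scores this pair 0.0; learned embeddings would score it highly.
print(cosine_similarity("The dog chased the cat.", "A canine pursued a feline."))
print(rouge_l("the dog chased the cat", "the dog chased a cat"))
```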
### Model Evaluation Summary
| Model Name | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|-----------------|--------------|--------------------|---------|------------------------|------------------|
| LLaMA-2-7B-HF | 18215.61 | 1.84e-01 | 0.715 | 0.7257 | 0.0821 |
| baby-llama-58m | 57.20 | 2.73e-06 | 0.025 | 0.6556 | 0.0097 |
| DistilLlama | 77.12 | 7.79e-04 | 0.02 | 0.6623 | 0.0115 |
| DistilLlamaV1 | 78.46 | 8.49e-04 | 0.065 | 0.6776 | 0.0135 |
*Note: CodeCarbon was used to track carbon emissions. The evaluation was allocated 80 GB of memory and 32 cores on an Intel(R) Xeon(R) Gold 6448H.*
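
For reference, the emissions column can be measured with CodeCarbon's `EmissionsTracker`. The sketch below wraps a placeholder evaluation call; it is an assumption about the setup, not the exact harness used here.

```python
# Minimal emissions-tracking sketch with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="DistilLlamaV1-eval")
tracker.start()
try:
    run_evaluation()  # placeholder for the actual evaluation loop
finally:
    emissions_kg = tracker.stop()  # total emissions in kgCO2e
print(f"Emissions: {emissions_kg:.2e} kgCO2e")
```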
### GitHub Repositories
- **Training Repo**: [DistilLlama Training Repository](https://github.com/HenryHuang2/DistilLlama)
- **Evaluation Repo**: [Knowledge Distillation Evaluation Repository](https://github.com/svarnim1805/Knowledge-Distillation)
### Reference
```bibtex
@misc{timiryasov2023babyllamaknowledgedistillation,
  title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
  author={Inar Timiryasov and Jean-Loup Tastet},
  year={2023},
  eprint={2308.02019},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.02019},
}
```
*Note: This repository will be updated as training progresses. Last updated 2024-11-06.*