---
language: en
tags:
- text-generation
- knowledge-distillation
- llama
- causal-lm
- openwebtext
- wikitext
- transfer-learning
model_name: DistilLLaMA
license: apache-2.0
datasets:
- openwebtext
- wikitext
parameter_count: 80M
metrics:
- cosine-similarity
- exact-match
- rouge
library_name: transformers
base_model: meta-llama/LLaMA-2-7B
---

### Overview

This model is a distilled version of LLaMA 2 with approximately 80 million parameters.
It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets.
Knowledge distillation was employed to transfer knowledge from a larger "teacher" model, Meta's LLaMA 2 7B, so that this smaller "student" model learns to mimic the teacher's behavior.
This is the latest version of DistilLlama, trained for five days on two NVIDIA A100 80GB GPUs.
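
For context on how distillation works, below is a minimal sketch of a standard knowledge-distillation objective: a soft-target KL divergence against the teacher's logits blended with the usual next-token cross-entropy. The temperature and weighting here are illustrative assumptions, not hyperparameters from this card; the actual recipe lives in the training repo linked below.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL with hard-label cross-entropy.

    Sketch only: temperature and alpha are assumed values,
    not taken from this model card.
    """
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary next-token cross-entropy on the data labels.
    vocab = student_logits.size(-1)
    hard = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    return alpha * soft + (1 - alpha) * hard
```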

### Update

30 of 300 training checkpoints were evaluated, and the checkpoint with the best semantic and factual accuracy has been uploaded to this repository.

### Model Architecture

The architecture is based on LLaMA 2, with the following parameters:

| Parameter                 | Value |
|---------------------------|-------|
| Hidden Dimension          | 512   |
| Intermediate Dimension    | 1536  |
| Max Positional Embeddings | 128   |
| Attention Heads           | 8     |
| Transformer Layers        | 16    |
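
As a rough illustration, an equivalent model can be instantiated with the `transformers` library as below. The field names are the standard `LlamaConfig` arguments; the vocabulary size is an assumption (left at the LLaMA 2 default of 32,000), since the card does not state it.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of a config matching the table above (vocab_size is assumed).
config = LlamaConfig(
    hidden_size=512,             # Hidden Dimension
    intermediate_size=1536,      # Intermediate Dimension
    max_position_embeddings=128, # Max Positional Embeddings
    num_attention_heads=8,       # Attention Heads
    num_hidden_layers=16,        # Transformer Layers
)
model = LlamaForCausalLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```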

### Evaluation Metrics

1. **Cosine Similarity using Word Embeddings**
   - **Description**: Measures semantic similarity by mapping words/phrases to vectors. (A code sketch of all three metrics follows this list.)
   - **Equation**: Cosine Similarity = (A · B) / (‖A‖ ‖B‖)
   - **Example**: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)

2. **Exact Match (EM)**
   - **Description**: Checks whether critical keywords are present in the response.
   - **Example**:
     - Expected: "Paris"
     - Response: "The capital of France is Paris." (EM = 1)

3. **ROUGE Score**
   - **Description**: Measures the overlap of the longest common subsequence (LCS) between the reference text R and the candidate response C (i.e., ROUGE-L).
   - **Equations**:
     - Precision = LCS(R, C) / length(C)
     - Recall = LCS(R, C) / length(R)
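
The sketch below shows one way these three metrics can be computed. It assumes whitespace tokenization, embedding vectors supplied by the caller, and keyword-containment EM; the evaluation repo linked below may implement them differently.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match(expected: str, response: str) -> int:
    """1 if the expected keyword appears in the response, else 0."""
    return int(expected.lower() in response.lower())

def lcs_length(r: list, c: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            if r[i - 1] == c[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str) -> tuple:
    """(precision, recall) of ROUGE-L over whitespace tokens."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    return lcs / len(c), lcs / len(r)
```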

### Model Evaluation Summary

| Model Name      | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|-----------------|--------------|--------------------|---------|------------------------|------------------|
| LLaMA-2-7B-HF   | 18215.61     | 1.84e-01           | 0.715   | 0.7257                 | 0.0821           |
| baby-llama-58m  | 57.20        | 2.73e-06           | 0.025   | 0.6556                 | 0.0097           |
| DistilLlama     | 77.12        | 7.79e-04           | 0.02    | 0.6623                 | 0.0115           |
| DistilLlamaV1   | 78.46        | 8.49e-04           | 0.065   | 0.6776                 | 0.0135           |

*Note: CodeCarbon was used to track carbon emissions. The evaluation ran on an Intel(R) Xeon(R) Gold 6448H (32 cores) with 80 GB of memory allocated.*
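
For reproducibility, here is a minimal sketch of the kind of CodeCarbon tracking used; `run_evaluation` is a hypothetical stand-in for the actual evaluation loop, and the exact setup in the evaluation repo may differ.

```python
from codecarbon import EmissionsTracker

# Wrap the evaluation loop in an emissions tracker.
tracker = EmissionsTracker()
tracker.start()
run_evaluation()  # hypothetical: replace with the actual evaluation loop
emissions_kg = tracker.stop()  # emissions in kgCO2e
print(f"Emissions: {emissions_kg:.2e} kgCO2e")
```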

### GitHub Repositories

- **Training Repo**: [DistilLlama Training Repository](https://github.com/HenryHuang2/DistilLlama)
- **Evaluation Repo**: [Knowledge Distillation Evaluation Repository](https://github.com/svarnim1805/Knowledge-Distillation)

### Reference

    @misc{timiryasov2023babyllamaknowledgedistillation,
          title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
          author={Inar Timiryasov and Jean-Loup Tastet},
          year={2023},
          eprint={2308.02019},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2308.02019},
    }

*Note: The repository will be updated as training progresses. Last updated 2024-11-06.*