---
license: llama2
base_model: meta-llama/Llama-2-70b-chat-hf
---
# Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8
## Introduction
This model was created by applying Quark with calibration samples from the Pile dataset.
## Quantization Strategy
- Quantized Layers: all linear layers excluding "lm_head"
- Weight: Auto Mixed Precision (AMP) quantization by Quark; each weight tensor is assigned one of the following candidate schemes (see the pseudo-quantization sketch after this list):
  - FP8 symmetric per-tensor
  - OCP Microscaling (MX) FP4
- Activation: AMP quantization by Quark; each activation input uses the same scheme as its corresponding weight, i.e., one of:
  - FP8 symmetric per-tensor
  - OCP Microscaling (MX) FP4
- KV Cache: FP8 symmetric per-tensor
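To make the two candidate schemes concrete, below is a minimal PyTorch pseudo-quantization sketch written for this card. It is an illustration of the numeric formats only, not Quark's actual implementation; the block size of 32 and the E2M1 element / power-of-two shared scale follow the OCP MX specification.

```python
import torch

# Positive representable magnitudes of FP4 E2M1.
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp8_per_tensor_pseudo_quant(w: torch.Tensor) -> torch.Tensor:
    """Simulate FP8 (E4M3) symmetric per-tensor quantization."""
    scale = (w.abs().max() / 448.0).clamp_min(1e-12)  # 448 = max finite E4M3
    w_q = (w / scale).to(torch.float8_e4m3fn)         # round-to-nearest cast
    return w_q.to(w.dtype) * scale                    # dequantize

def mxfp4_pseudo_quant(w: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Simulate OCP MXFP4: FP4 (E2M1) elements sharing one power-of-two
    scale per block of 32 values. Assumes w.numel() % block == 0
    (pad in practice)."""
    grid = FP4_E2M1_GRID.to(w.device)
    blocks = w.reshape(-1, block).float()
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    # Shared scale: the power of two mapping the block max near 6.0,
    # the largest E2M1 magnitude (6 = 1.5 * 2^2, hence the "- 2").
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)
    scaled = blocks / scale
    # Round each element to the nearest representable E2M1 magnitude
    # (values above 6.0 saturate to 6.0).
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * scaled.sign()
    return (q * scale).reshape(w.shape).to(w.dtype)

# Example: compare reconstruction error of the two schemes on a random weight.
w = torch.randn(1024, 1024)
for name, fn in [("FP8", fp8_per_tensor_pseudo_quant), ("MXFP4", mxfp4_pseudo_quant)]:
    err = (fn(w) - w).pow(2).mean()
    print(f"{name} MSE: {err:.3e}")
```

AMP selects, per weight tensor (and its matching activation input), one of these two candidates.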
## Quick Start
- Download and install Quark
- [TODO] Example script(s) for running auto mixed precision (AMP) quantization will be provided later.
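Until those scripts are available, the generic Quark post-training-quantization flow is sketched below. This is a rough sketch only: the empty `QuantizationConfig` is a placeholder, and the exact AMP candidate-scheme configuration used for this model is not shown here.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from quark.torch import ModelQuantizer
from quark.torch.quantization.config.config import Config, QuantizationConfig

MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: tokenized text samples (this model used the Pile).
texts = ["Replace with Pile calibration samples."]
samples = [
    tokenizer(t, return_tensors="pt", truncation=True, max_length=512).input_ids.squeeze(0)
    for t in texts
]
calib_dataloader = DataLoader(samples, batch_size=1)

# Placeholder config: the real recipe uses AMP over FP8 per-tensor / MXFP4
# candidates for weights and activations, plus an FP8 KV cache.
quant_config = Config(global_quant_config=QuantizationConfig())

quantizer = ModelQuantizer(quant_config)
quant_model = quantizer.quantize_model(model, calib_dataloader)
```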
## Deployment
Quark-quantized Auto Mixed Precision (AMP) models can now be deployed easily in the vLLM backend (vLLM-compatible).
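For example, offline inference with vLLM's Python API might look like the following. This is a sketch: it assumes a vLLM build with Quark quantization support, and the repo id is taken from this card's model name.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-2-70b-chat-hf-WMXFP4FP8-AMXFP4FP8-AMP-KVFP8",  # assumed repo id
    quantization="quark",    # assumes Quark support in your vLLM build
    kv_cache_dtype="fp8",    # matches the FP8 KV-cache scheme above
    tensor_parallel_size=8,  # adjust to your GPU count
)
outputs = llm.generate(
    ["What is quantization?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```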
## Evaluation
The evaluation results below were obtained in pseudo-quantization mode, which may differ slightly from actual quantized-inference accuracy; they are provided for reference only. The recovery rate is the quantized score divided by the FP16 baseline score.
### Evaluation scores
| Quant scheme | arc challenge acc (↑) | recovery rate | gsm8k strict-match (↑) | recovery rate | mmlu acc (↑) | recovery rate | winogrande acc (↑) | recovery rate |
|---|---|---|---|---|---|---|---|---|
| FP16 | 0.5290 | 100.0% | 0.5049 | 100.0% | 0.6110 | 100.0% | 0.7490 | 100.0% |
| FP8 | 0.5265 | 99.5% | 0.5262 | 104.2% | 0.6107 | 100.0% | 0.7451 | 99.5% |
| AMP | 0.5273 | 99.7% | 0.5125 | 101.5% | 0.6007 | 98.3% | 0.7324 | 97.8% |
| MXFP4 | 0.5094 | 96.3% | 0.4572 | 90.6% | 0.5869 | 96.1% | 0.7316 | 97.7% |
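The benchmark names above correspond to tasks in EleutherAI's lm-evaluation-harness. Assuming that harness, scores of this kind are typically collected as sketched below; this is an illustration only, and the pseudo-quantization hooks used for the numbers above are not shown.

```python
import lm_eval

# Evaluate a Hugging Face checkpoint on the four tasks from the table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-70b-chat-hf,dtype=float16",
    tasks=["arc_challenge", "gsm8k", "mmlu", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```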
## License
Modifications Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Built with Meta Llama.
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.