Qwen2-7B-Instruct Quantized with AutoFP8 (with FP8 KV Cache)

Qwen/Qwen2-7B-Instruct, statically quantized with calibration on larryvrh/belle_resampled_78K_CN. The FP8 KV cache can be enabled.

Intended mainly for general Chinese language and reasoning tasks, and prepared for use with vLLM.

Usage

Add kv_cache_dtype="fp8" to the vLLM engine arguments to enable the FP8 KV cache.
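A minimal sketch of passing this argument through vLLM's offline Python API. `MODEL_PATH` is a hypothetical placeholder (this card does not state the repository id), and the helper names `engine_kwargs` and `generate` are illustrative, not part of vLLM.

```python
# Hypothetical placeholder: replace with the local path or hub id of this model.
MODEL_PATH = "path/to/this-fp8-model"


def engine_kwargs(model_path: str) -> dict:
    """Engine arguments for vLLM with the FP8 KV cache enabled."""
    return {"model": model_path, "kv_cache_dtype": "fp8"}


def generate(prompts: list[str], max_tokens: int = 128) -> list[str]:
    """Run offline generation and return one completion per prompt."""
    # Imported lazily so the helper above is usable without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(MODEL_PATH))
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Calling `generate(["你好，请介绍一下你自己。"])` would load the model and return the completions; for chat-style use, the prompts would normally be formatted with the model's chat template first.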

Evaluated with lm-evaluation-harness against vLLM serve:

Benchmark	Qwen2-7B-Instruct	Qwen2-7B-Instruct-FP8-CN	Recovery	This model	Recovery
ceval-valid	81.87	81.65	99.73%	81.35	99.36%
cmmlu	81.78	81.26	99.36%	81.19	99.28%
agieval_logiqa_zh (5 shots)	47.63	48.54	101.91%	46.54	97.71%
Average	70.43	70.48	100.07%	69.69	98.95%
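The recovery figures are consistent with the quantized score divided by the Qwen2-7B-Instruct baseline score, expressed as a percentage. The sketch below recomputes the column for this model from the scores in the table; the function and variable names are illustrative, and computing the average row from the rounded averages is an assumption that happens to reproduce the published values.

```python
def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return round(quantized / baseline * 100, 2)


# (baseline, this model) scores copied from the table above.
scores = {
    "ceval-valid": (81.87, 81.35),
    "cmmlu": (81.78, 81.19),
    "agieval_logiqa_zh": (47.63, 46.54),
}

per_task = {name: recovery(q, b) for name, (b, q) in scores.items()}

# Average row: mean of each column, rounded to two decimals first
# (assumption: the table derives recovery from the rounded averages).
avg_baseline = round(sum(b for b, _ in scores.values()) / len(scores), 2)
avg_quantized = round(sum(q for _, q in scores.values()) / len(scores), 2)
avg_recovery = recovery(avg_quantized, avg_baseline)

for name, value in per_task.items():
    print(f"{name}: {value}%")
print(f"average: {avg_quantized} vs {avg_baseline}, {avg_recovery}%")
```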

Safetensors: model size 8B params; tensor types BF16 and F8_E4M3.