Evaluation

In this file, we provide the code for the evaluation of LLaDA-8B-Base, LLaDA-8B-Instruct and LLaDA 1.5.

Benchmarks

For LLaDA-8B-Base, we evaluate with both conditional likelihood estimation and conditional generation, following the evaluation process widely adopted for LLMs. Please refer to Appendix B.6 of our paper for details.

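| Evaluation Method of LLaDA-8B-Base | MMLU | BBH | ARC-C | Hellaswag | TruthfulQA | WinoGrande | PIQA | GSM8K | Math | GPQA | HumanEval | HumanEval-FIM | MBPP | CMMLU | C-Eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evaluation Type | ppl | gen | ppl | ppl | ppl | ppl | ppl | gen | gen | ppl | gen | gen | gen | ppl | ppl |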

where ppl refers to conditional likelihood estimation and gen refers to conditional generation.

Both LLaDA-8B-Instruct and LLaDA 1.5 are evaluated using only conditional generation.
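As a rough illustration of what a ppl-style task involves, the sketch below scores every candidate answer and picks the one with the highest score, which is how likelihood-based multiple-choice evaluation is commonly structured. The function name, the score_fn argument, and the toy log-likelihood values are all hypothetical; in a real run the score would be a conditional likelihood estimate produced by the model, which is not shown here.

```python
from typing import Callable, Sequence


def pick_by_likelihood(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the candidate with the highest score."""
    scores = [score_fn(prompt, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


# Hypothetical log-likelihoods standing in for model estimates
toy_log_likelihoods = {"Paris": -1.2, "London": -3.5, "Rome": -4.0}
candidates = list(toy_log_likelihoods)

best = pick_by_likelihood(
    "The capital of France is",
    candidates,
    lambda prompt, candidate: toy_log_likelihoods[candidate],
)
print(candidates[best])
```

A gen-style task, by contrast, lets the model produce free-form text and then checks the extracted answer against the reference.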

Open source testing tools

For LLaDA-8B-Base, LLaDA-8B-Instruct, and LLaDA-1.5, we initially conducted evaluations using our internal benchmark suite. Recently, we reproduced our results using two open-source evaluation frameworks, lm-eval and OpenCompass.

| Model | ppl tasks | gen tasks |
| --- | --- | --- |
| LLaDA-8B-Base | lm-eval | lm-eval / OpenCompass |
| LLaDA-8B-Instruct & LLaDA-1.5 | None | OpenCompass |

Usage

lm-eval

Please refer to eval_llada_lm_eval.sh for the required dependencies and execution commands.

For the ppl tasks of LLaDA-8B-Base, the evaluation results are as follows:

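|  | ARC-C | Hellaswag | TruthfulQA | WinoGrande | GPQA | PIQA | MMLU | CMMLU | C-Eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o CFG | 45.9 | 70.5 | 46.1 | 74.8 | 25.2 | 73.6 | 65.9 | 69.9 | 70.5 |
| w/ CFG | 47.9 | 72.5 | 46.4 | 74.8 | 26.1 | 74.4 | – | – | – |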

In Table 1 of the LLaDA paper, we report only the results w/o CFG to ensure a fair comparison with autoregressive models.
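Assuming CFG here refers to classifier-free guidance, the sketch below shows one common way such guidance combines conditional and unconditional logits. The function name, the cfg_scale parameter, and the exact formula are assumptions made for illustration and are not taken from the evaluation scripts.

```python
from typing import List


def apply_cfg(
    cond_logits: List[float],
    uncond_logits: List[float],
    cfg_scale: float,
) -> List[float]:
    """Push logits away from the unconditional prediction.

    With cfg_scale = 0 the conditional logits are returned unchanged,
    which corresponds to the "w/o CFG" setting in this sketch.
    """
    return [
        u + (cfg_scale + 1.0) * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


print(apply_cfg([1.0, 2.0], [0.5, 0.5], cfg_scale=0.0))
print(apply_cfg([1.0, 2.0], [0.5, 0.5], cfg_scale=1.0))
```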

For the gen tasks of LLaDA-8B-Base, the evaluation results are as follows:

| Settings | BBH | GSM8K | Math | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- |
| gen_length = 1024, steps = 1024, block_length = 1024 | 49.7 | 70.3 | 31.4 | 35.4 | 40.0 |
| gen_length = 512, steps = 512, block_length = 512 | 50.4 | 70.8 | 30.9 | 32.9 | 39.2 |
| gen_length = 256, steps = 256, block_length = 256 | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |

In Table 1 of the LLaDA paper, we report the results with gen_length = 1024, steps = 1024, block_length = 1024 for simplicity. However, as shown above, performance is broadly consistent across all three settings.
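To make the relationship between the three sampling parameters concrete, here is a minimal sketch of how they might be turned into a schedule. It assumes that the response is generated block by block and that the step budget is split evenly across blocks; the function name and the returned keys are hypothetical.

```python
def sampling_schedule(gen_length: int, steps: int, block_length: int) -> dict:
    """Split a generation budget into blocks and per-block denoising steps."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    return {
        "num_blocks": num_blocks,
        "steps_per_block": steps_per_block,
        # Average number of tokens unmasked at each step within a block
        "tokens_per_step": block_length / steps_per_block,
    }


for setting in [(1024, 1024, 1024), (512, 512, 512), (256, 256, 256)]:
    print(setting, sampling_schedule(*setting))
```

Under these assumptions, each of the three settings in the table uses a single block and unmasks one token per step on average, so they differ only in the length of the response.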

OpenCompass

Please refer to eval_llada_opencompass.sh for the required dependencies and execution commands.

In addition to lm-eval, we can also employ OpenCompass to evaluate LLaDA-8B-Base. For the gen_length = 256, steps = 256, block_length = 256 setting, the results are as follows:

| Settings | BBH | GSM8K | Math | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- |
| lm-eval | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |
| OpenCompass | 47.3 | 71.9 | 30.7 | 34.1 | 38.8 |
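The gap between the two frameworks can be read off the table directly; the short sketch below computes it per benchmark. The numbers are copied from the table above, while the variable names are only illustrative.

```python
benchmarks = ["BBH", "GSM8K", "Math", "HumanEval", "MBPP"]
lm_eval = [45.0, 70.0, 30.3, 32.9, 40.2]
opencompass = [47.3, 71.9, 30.7, 34.1, 38.8]

# Positive values mean OpenCompass scored higher than lm-eval
gaps = {
    name: round(oc - lm, 1)
    for name, lm, oc in zip(benchmarks, lm_eval, opencompass)
}
largest = max(gaps, key=lambda name: abs(gaps[name]))

print(gaps)
print("largest absolute gap:", largest, gaps[largest])
```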

For LLaDA-8B-Instruct, the evaluation results are as follows. It is worth noting that in Tables 1 and 2 of the LLaDA paper, we report the results with pure diffusion sampling without any autoregressive elements, as this setting yields the best overall performance.

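|  | MMLU | MMLU-pro | Hellaswag | ARC-C | GSM8K | Math | GPQA | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gen_length | 3 | 256 | 3 | 512 | 512 | 512 | 64 | 512 | 256 |
| block_length | 3 | 256 | 3 | 512 | 512 | 512 | 64 | 512 | 256 |
| logits_eos_inf | False | False | False | False | False | False | False | True | False |
| confidence_eos_eot_inf | False | False | False | False | True | True | True | False | True |
| Internal toolkit | 65.5 | 37.0 | 74.6 | 88.5 | 69.4 | 31.9 | 33.3 | 49.4 | 41.0 |
| OpenCompass | 65.4 | 36.6 | 75.3 | 89.2 | 68.8 | 29.6 | 32.3 | 47.0 | 39.6 |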

Please refer to Appendix B.4 of the LLaDA paper for an explanation of the sampling settings.
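The sketch below gives one plausible reading of the two flags, inferred from their names only: logits_eos_inf prevents the EOS token from being predicted at all, while confidence_eos_eot_inf still allows EOS and EoT predictions but gives them the lowest possible confidence so that other positions are unmasked first. The token ids, the function names, and the selection rule are all assumptions; the paper's appendix and the evaluation code remain the authoritative description.

```python
import math
from typing import List, Tuple

EOS_ID = 0  # hypothetical token id for |EOS|
EOT_ID = 1  # hypothetical token id for the end-of-turn token


def softmax(row: List[float]) -> List[float]:
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def choose_next(
    logits: List[List[float]],
    logits_eos_inf: bool = False,
    confidence_eos_eot_inf: bool = False,
) -> Tuple[int, int]:
    """Return (position, token) of the most confident masked position."""
    best_pos, best_tok, best_conf = -1, -1, -math.inf
    for pos, row in enumerate(logits):
        row = list(row)
        if logits_eos_inf:
            row[EOS_ID] = -math.inf  # EOS can never be the argmax
        probs = softmax(row)
        tok = max(range(len(row)), key=lambda i: row[i])
        conf = probs[tok]
        if confidence_eos_eot_inf and tok in (EOS_ID, EOT_ID):
            conf = -math.inf  # EOS/EoT predictions are unmasked last
        if best_pos < 0 or conf > best_conf:
            best_pos, best_tok, best_conf = pos, tok, conf
    return best_pos, best_tok


# Two masked positions over a toy vocabulary [EOS, EoT, word]
toy_logits = [[5.0, 0.0, 1.0], [0.0, 0.0, 2.0]]
print(choose_next(toy_logits))
print(choose_next(toy_logits, logits_eos_inf=True))
print(choose_next(toy_logits, confidence_eos_eot_inf=True))
```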

Furthermore, we apply block diffusion sampling (i.e., semi-autoregressive remasking) to mitigate the tendency of LLaDA-8B-Instruct to generate excessive |EOS| tokens, which is caused by the extensive |EOS| padding in the SFT data. This strategy improves performance on the GSM8K and Math benchmarks but decreases accuracy on other benchmarks.

|  | GSM8K | Math |
| --- | --- | --- |
| gen_length | 256 | 512 |
| block_length | 8 | 64 |
| logits_eos_inf | False | False |
| confidence_eos_eot_inf | False | False |
| Internal toolkit | 78.6 | 42.2 |
| OpenCompass | 78.9 | 42.7 |
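For the block diffusion settings above, the response is divided into consecutive blocks. The following sketch, which assumes blocks are contiguous, equally sized, and filled from left to right, lists the token ranges that result; the function name is hypothetical.

```python
from typing import List, Tuple


def block_spans(gen_length: int, block_length: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges, one per block."""
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    return [
        (start, start + block_length)
        for start in range(0, gen_length, block_length)
    ]


gsm8k_blocks = block_spans(gen_length=256, block_length=8)
math_blocks = block_spans(gen_length=512, block_length=64)
print(len(gsm8k_blocks), gsm8k_blocks[:2])
print(len(math_blocks), math_blocks[:2])
```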

The evaluation results of LLaDA 1.5 are as follows:

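|  | GSM8K | Math | GPQA | HumanEval | MBPP | IFEval |
| --- | --- | --- | --- | --- | --- | --- |
| gen_length | 256 | 1024 | 256 | 512 | 512 | 256 |
| block_length | 16 | 128 | 16 | 32 | 32 | 16 |
| logits_eos_inf | False | False | False | False | False | False |
| confidence_eos_eot_inf | True | True | False | True | True | True |
| Internal toolkit | 83.8 | 42.6 | 36.9 | 52.4 | 42.8 | 66.2 |
| OpenCompass | 83.6 | 42.3 | 34.8 | 51.2 | 42.6 | 65.2 |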

Note that Arena-Hard, AlignBench, and MT-Bench require access to the OpenAI API for evaluation, and are therefore not included.

Batch generation is also supported in OpenCompass. To enable this feature, update the batch_size and batch_size_ parameters in OpenCompass/examples/xxx.py. Please note that in October 2025, we updated the modeling_llada.py file on Hugging Face to support attention mask inputs. Make sure you are using the latest version of that file to ensure compatibility.

If you want to use a custom model path, edit the model file under opencompass/opencompass/configs/models/dllm/xxx.py and modify the path argument. For example:

models = [
    dict(
        type=LLaDAModel,
        abbr='llada-8b-instruct',
        path='/your/custom/path/to/GSAI-ML/LLaDA-8B-Instruct',  # Change this path
        max_out_len=1024,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]

Reversal curse

We downloaded a text file containing a large collection of classical Chinese poetic lines from Baidu Wenku. Using regular expressions, we extracted pairs of consecutive poetic lines (i.e., couplets) and stored them in a file named data/poem_data.json.
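A minimal sketch of this kind of regex-based extraction is shown below. The pattern, the function name, and the JSON field names (first, second) are hypothetical, since the actual layout of data/poem_data.json is not described here; the sketch assumes couplets are written as two lines separated by a full-width comma and ended by full-width punctuation.

```python
import json
import re
from typing import Dict, List

# Two runs of non-punctuation characters: "first，second。"
COUPLET_PATTERN = re.compile(r"([^，。！？；\s]+)，([^，。！？；\s]+)[。！？；]")


def extract_couplets(text: str) -> List[Dict[str, str]]:
    """Return every (first, second) pair of consecutive poetic lines."""
    return [
        {"first": first, "second": second}
        for first, second in COUPLET_PATTERN.findall(text)
    ]


sample = "床前明月光，疑是地上霜。举头望明月，低头思故乡。"
couplets = extract_couplets(sample)

# ensure_ascii=False keeps the Chinese characters readable in the output
print(json.dumps(couplets, ensure_ascii=False, indent=2))
```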

The evaluation commands are as follows:

# generate the subsequent line
python eval_reverse.py --type ftb --eos_inf

# generate the preceding line
python eval_reverse.py --type btf --eos_inf
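Going by the comments on the two commands, ftb asks the model for the line that follows a given line and btf asks for the line that precedes it. The sketch below shows how a couplet could be turned into an evaluation example for each direction and scored by exact match. The function names, dictionary keys, and scoring rule are assumptions for illustration and do not describe the internals of eval_reverse.py.

```python
from typing import Dict, List


def make_example(first: str, second: str, direction: str) -> Dict[str, str]:
    """Build a (given, target) pair for one direction of the reversal test."""
    if direction == "ftb":  # given the first line, predict the second
        return {"given": first, "target": second}
    if direction == "btf":  # given the second line, predict the first
        return {"given": second, "target": first}
    raise ValueError(f"unknown direction: {direction}")


def exact_match_accuracy(predictions: List[str], targets: List[str]) -> float:
    """Fraction of predictions identical to their target after stripping."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


forward = make_example("床前明月光", "疑是地上霜", "ftb")
backward = make_example("床前明月光", "疑是地上霜", "btf")
print(forward)
print(backward)
```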

Acknowledgments

Thanks to lm-eval and OpenCompass for their great work!