Evaluation

In this file, we provide the code for the evaluation of LLaDA-8B-Base, LLaDA-8B-Instruct and LLaDA 1.5.

Benchmarks

For LLaDA-8B-Base, we employ conditional likelihood estimation and conditional generation for evaluation following the widely adopted evaluation process in LLMs. Please refer to Appendix B.6 of our paper for details.

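Evaluation method of LLaDA-8B-Base:

| | MMLU | BBH | ARC-C | Hellaswag | TruthfulQA | WinoGrande | PIQA | GSM8K | Math | GPQA | HumanEval | HumanEval-FIM | MBPP | CMMLU | C-Eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evaluation Type | ppl | gen | ppl | ppl | ppl | ppl | ppl | gen | gen | ppl | gen | gen | gen | ppl | ppl |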

Here, ppl refers to conditional likelihood estimation and gen refers to conditional generation.

Both LLaDA-8B-Instruct and LLaDA 1.5 are evaluated using only conditional generation.
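As a rough illustration of what a ppl-type task looks like, the sketch below scores each candidate answer by its conditional log-likelihood given the prompt and picks the highest one. The `log_likelihood` callable is a hypothetical stand-in for a model-specific estimator, and `toy_log_likelihood` is a dummy scorer used only to make the example runnable; neither is part of this repository.

```python
from typing import Callable, Sequence


def pick_by_likelihood(
    prompt: str,
    candidates: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> int:
    """Return the index of the candidate with the highest conditional log-likelihood."""
    scores = [log_likelihood(prompt, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_log_likelihood(prompt: str, answer: str) -> float:
    """Dummy scorer (hypothetical): prefers answers that are five characters long."""
    return -abs(len(answer) - 5)


choice = pick_by_likelihood(
    "Q: What is the capital of France?",
    ["Rome", "Paris", "Madrid!"],
    toy_log_likelihood,
)
print(choice)
```

A gen-type task instead lets the model generate a free-form answer, which is then parsed and compared against the reference.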

Open source testing tools

For LLaDA-8B-Base, LLaDA-8B-Instruct, and LLaDA 1.5, we initially conducted evaluations using our internal benchmark suite. We have since reproduced our results using two open-source evaluation frameworks, lm-eval and OpenCompass.

| Model | ppl tasks | gen tasks |
| --- | --- | --- |
| LLaDA-8B-Base | lm-eval | lm-eval / OpenCompass |
| LLaDA-8B-Instruct & LLaDA 1.5 | None | OpenCompass |

Usage

lm-eval

Please refer to eval_llada_lm_eval.sh for the required dependencies and execution commands.

For the ppl tasks of LLaDA-8B-Base, the evaluation results are as follows:

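| | ARC-C | Hellaswag | TruthfulQA | WinoGrande | GPQA | PIQA | MMLU | CMMLU | C-Eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o CFG | 45.9 | 70.5 | 46.1 | 74.8 | 25.2 | 73.6 | 65.9 | 69.9 | 70.5 |
| w/ CFG | 47.9 | 72.5 | 46.4 | 74.8 | 26.1 | 74.4 | - | - | - |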

In Table 1 of the LLaDA paper, we report only the results w/o CFG to ensure a fair comparison with autoregressive models.
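This file does not spell out how CFG (classifier-free guidance) is applied. For orientation only, the sketch below shows the standard formulation on logits, where the guided logits extrapolate from an unconditional prediction towards the conditional one. The function name `apply_cfg` and the parameter `cfg_scale` are assumptions made for illustration, as is the choice of plain Python lists in place of tensors.

```python
from typing import List


def apply_cfg(
    cond_logits: List[float],
    uncond_logits: List[float],
    cfg_scale: float,
) -> List[float]:
    """Standard classifier-free guidance on logits.

    guided = uncond + (cfg_scale + 1) * (cond - uncond)
    cfg_scale = 0 returns the conditional logits unchanged.
    """
    return [
        u + (cfg_scale + 1.0) * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


# With cfg_scale = 0 the guidance has no effect (assumed illustrative values).
print(apply_cfg([1.0, 2.0], [0.5, 2.5], cfg_scale=0.0))
# A positive scale pushes the logits further away from the unconditional ones.
print(apply_cfg([1.0, 2.0], [0.5, 2.5], cfg_scale=1.0))
```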

For the gen tasks of LLaDA-8B-Base, the evaluation results are as follows:

| Settings | BBH | GSM8K | Math | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- |
| gen_length = 1024, steps = 1024, block_length = 1024 | 49.7 | 70.3 | 31.4 | 35.4 | 40.0 |
| gen_length = 512, steps = 512, block_length = 512 | 50.4 | 70.8 | 30.9 | 32.9 | 39.2 |
| gen_length = 256, steps = 256, block_length = 256 | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |

In Table 1 of the LLaDA paper, we report the results with gen_length = 1024, steps = 1024, block_length = 1024 for simplicity. As shown above, performance is broadly consistent across the three settings; the largest gap is on BBH (45.0 at length 256 versus 50.4 at length 512).
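To make the relationship between the three sampling settings concrete, here is a minimal sketch under two assumptions that this file does not state explicitly: that gen_length is split into equal blocks of block_length tokens, and that steps are divided evenly across those blocks. The helper `block_schedule` is hypothetical and not part of the evaluation scripts.

```python
from typing import Tuple


def block_schedule(gen_length: int, steps: int, block_length: int) -> Tuple[int, int, float]:
    """Return (num_blocks, steps_per_block, tokens_per_step) for a sampling setting.

    Assumes equal-sized blocks and an even split of steps across blocks.
    """
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    tokens_per_step = block_length / steps_per_block
    return num_blocks, steps_per_block, tokens_per_step


# Setting reported in the paper: the whole response forms a single block.
print(block_schedule(gen_length=1024, steps=1024, block_length=1024))
# A hypothetical semi-autoregressive setting with eight blocks.
print(block_schedule(gen_length=256, steps=256, block_length=32))
```

Under these assumptions, all three settings in the table above use a single block and unmask one token per step on average.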

OpenCompass

Please refer to eval_llada_opencompass.sh for the required dependencies and execution commands.

In addition to lm-eval, we can also employ OpenCompass to evaluate LLaDA-8B-Base. For the gen_length = 256, steps = 256, block_length = 256 setting, the results are as follows:

| Settings | BBH | GSM8K | Math | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- |
| lm-eval | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |
| OpenCompass | 47.3 | 71.9 | 30.7 | 34.1 | 38.8 |
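For a quick read of how far the two frameworks diverge, the snippet below computes the per-benchmark difference (OpenCompass minus lm-eval) from the numbers in the table above. It is only arithmetic on the reported scores; the variable names are illustrative.

```python
benchmarks = ["BBH", "GSM8K", "Math", "HumanEval", "MBPP"]
lm_eval = [45.0, 70.0, 30.3, 32.9, 40.2]
opencompass = [47.3, 71.9, 30.7, 34.1, 38.8]

# Difference in points, rounded to one decimal to match the table precision.
diffs = {
    name: round(oc - lm, 1)
    for name, lm, oc in zip(benchmarks, lm_eval, opencompass)
}

# Benchmark with the largest absolute gap between the two frameworks.
largest_gap = max(diffs, key=lambda name: abs(diffs[name]))

for name, delta in diffs.items():
    print(f"{name}: {delta:+.1f}")
print("largest gap:", largest_gap)
```

The gaps stay within about 2.3 points, with BBH showing the largest difference and MBPP the only benchmark where lm-eval scores higher.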

For LLaDA-8B-Instruct, the evaluation results are as follows. Note that in Table 1 and Table 2 of the LLaDA paper, we report the results with pure diffusion sampling without any autoregressive elements, as this setting yields the best overall performance.

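| | MMLU | MMLU-pro | Hellaswag | ARC-C | GSM8K | Math | GPQA | HumanEval | MBPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gen_length | 3 | 256 | 3 | 512 | 512 | 512 | 64 | 512 | 256 |
| block_length | 3 | 256 | 3 | 512 | 512 | 512 | 64 | 512 | 256 |
| logits_eos_inf | False | False | False | False | False | False | False | True | False |
| confidence_eos_eot_inf | False | False | False | False | True | True | True | False | True |
| Internal toolkit | 65.5 | 37.0 | 74.6 | 88.5 | 69.4 | 31.9 | 33.3 | 49.4 | 41.0 |
| OpenCompass | 65.4 | 36.6 | 75.3 | 89.2 | 68.8 | 29.6 | 32.3 | 47.0 | 39.6 |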

Please refer to Appendix B.4 of the LLaDA paper for an explanation of the sampling settings.

Furthermore, we apply block diffusion sampling (i.e., semi-autoregressive remasking) to mitigate the tendency of LLaDA-8B-Instruct to generate excessive |EOS| tokens, which is caused by the extensive |EOS| padding in the SFT data. This strategy improves performance on the GSM8K and Math benchmarks but reduces accuracy on the other benchmarks.

| | GSM8K | Math |
| --- | --- | --- |
| gen_length | 256 | 512 |
| block_length | 8 | 64 |
| logits_eos_inf | False | False |
| confidence_eos_eot_inf | False | False |
| Internal toolkit | 78.6 | 42.2 |
| OpenCompass | 78.9 | 42.7 |

The evaluation results of LLaDA 1.5 are as follows:

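| | GSM8K | Math | GPQA | HumanEval | MBPP | IFEval |
| --- | --- | --- | --- | --- | --- | --- |
| gen_length | 256 | 1024 | 256 | 512 | 512 | 256 |
| block_length | 16 | 128 | 16 | 32 | 32 | 16 |
| logits_eos_inf | False | False | False | False | False | False |
| confidence_eos_eot_inf | True | True | False | True | True | True |
| Internal toolkit | 83.8 | 42.6 | 36.9 | 52.4 | 42.8 | 66.2 |
| OpenCompass | 83.6 | 42.3 | 34.8 | 51.2 | 42.6 | 65.2 |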

Note that Arena-Hard, AlignBench, and MT-Bench require access to the OpenAI API for evaluation, and are therefore not included.

Batch generation is also supported in OpenCompass. To enable this feature, update the batch_size and batch_size_ parameters in OpenCompass/examples/xxx.py. Please note that in October 2025, we updated the modeling_llada.py file on Hugging Face to support attention mask inputs. Make sure you are using the latest version of that file to ensure compatibility.

If you want to use a custom model path, edit the model file under opencompass/opencompass/configs/models/dllm/xxx.py and modify the path argument. For example:

models = [
    dict(
        type=LLaDAModel,
        abbr='llada-8b-instruct',
        path='/your/custom/path/to/GSAI-ML/LLaDA-8B-Instruct',  # Change this path
        max_out_len=1024,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]

Reversal curse

We downloaded a text file containing a large collection of classical Chinese poetic lines from Baidu Wenku. Using regular expressions, we extracted pairs of consecutive poetic lines (i.e., couplets) and stored them in a file named data/poem_data.json.
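The extraction step can be sketched roughly as follows. This is a minimal illustration, not the script we used: it assumes each couplet in the raw text has the form "first line，second line。", and the JSON schema with `first` and `second` keys is hypothetical and may differ from the actual layout of data/poem_data.json.

```python
import json
import re
from typing import Dict, List

# Two runs of Chinese characters separated by a full-width comma and
# terminated by a full-width period (assumed couplet format).
COUPLET_RE = re.compile(r"([\u4e00-\u9fff]+)，([\u4e00-\u9fff]+)。")


def extract_couplets(text: str) -> List[Dict[str, str]]:
    """Return consecutive line pairs found in the raw text."""
    return [
        {"first": first, "second": second}
        for first, second in COUPLET_RE.findall(text)
    ]


sample = "床前明月光，疑是地上霜。举头望明月，低头思故乡。"
couplets = extract_couplets(sample)

# ensure_ascii=False keeps the Chinese characters readable in the output.
print(json.dumps(couplets, ensure_ascii=False, indent=2))
```

Given the first line of a pair, the model is asked to produce the second, and vice versa for the reverse direction.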

We provide the evaluation command as follows:

# generate the subsequent line
python eval_reverse.py  --type ftb --eos_inf

# generate the preceding line
python eval_reverse.py  --type btf --eos_inf

Acknowledgments

Thanks to lm-eval and OpenCompass for their great work!