# Evaluation

In this file, we provide the code for the evaluation of [LLaDA-8B-Base](https://huggingface.co/GSAI-ML/LLaDA-8B-Base), [LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct), and [LLaDA 1.5](https://arxiv.org/abs/2505.19223).

## Benchmarks

For **LLaDA-8B-Base**, we employ conditional likelihood estimation and conditional generation for evaluation, following the widely adopted evaluation process for LLMs. Please refer to Appendix B.6 of our [paper](https://arxiv.org/pdf/2502.09992) for details.

| Evaluation Method of LLaDA-8B-Base | MMLU | BBH | ARC-C | Hellaswag | TruthfulQA | WinoGrande | PIQA | GSM8K | Math | GPQA | HumanEval | HumanEval-FIM | MBPP | CMMLU | C-Eval |
|:-----------------------------------|:----:|:---:|:-----:|:---------:|:----------:|:----------:|:----:|:-----:|:----:|:----:|:---------:|:-------------:|:----:|:-----:|:------:|
| **Evaluation Type**                | ppl  | gen | ppl   | ppl       | ppl        | ppl        | ppl  | gen   | gen  | ppl  | gen       | gen           | gen  | ppl   | ppl    |

where ppl refers to conditional likelihood estimation and gen refers to conditional generation.

Both **LLaDA-8B-Instruct** and **LLaDA 1.5** are evaluated using only conditional generation.

## Open-source evaluation tools

For LLaDA-8B-Base, LLaDA-8B-Instruct, and LLaDA-1.5, we initially conducted evaluations using our internal benchmark suite. Recently, we reproduced our results using two open-source evaluation frameworks, [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) and [OpenCompass](https://github.com/open-compass/opencompass).

| Model | ppl tasks | gen tasks |
|:------|:----------|:----------|
| **LLaDA-8B-Base** | lm-eval | lm-eval / OpenCompass |
| **LLaDA-8B-Instruct & LLaDA-1.5** | None | OpenCompass |

## Usage

### lm-eval

Please refer to `eval_llada_lm_eval.sh` for the required dependencies and execution commands.

For **the ppl tasks of LLaDA-8B-Base**, the evaluation results are as follows:

|             | ARC-C | Hellaswag | TruthfulQA | WinoGrande | GPQA | PIQA | MMLU | CMMLU | C-Eval |
|-------------|:-----:|:---------:|:----------:|:----------:|:----:|:----:|:----:|:-----:|:------:|
| **w/o CFG** | 45.9  | 70.5      | 46.1       | **74.8**   | 25.2 | 73.6 | 65.9 | 69.9  | 70.5   |
| **w/ CFG**  | **47.9** | **72.5** | **46.4**  | **74.8**   | **26.1** | **74.4** | – | – | – |

In Tab. 1 of the [LLaDA paper](https://arxiv.org/pdf/2502.09992), we only report results w/o CFG to ensure a fair comparison with autoregressive models.
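For context, the w/ CFG rows use classifier-free guidance as described in Appendix B.6 of the paper: the model is run twice, once conditioned on the prompt and once with the prompt replaced by mask tokens, and the two sets of logits are combined. Below is a minimal sketch of the standard logit combination; the function name is illustrative and `w` denotes the guidance scale, so treat this as a sketch rather than the repository's internal API:

```python
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, w: float) -> torch.Tensor:
    """Combine conditional and unconditional logits with guidance scale w.

    In logit space this corresponds to
        p_w(r0 | p0, rt) ∝ p(r0 | p0, rt)^(1 + w) / p(r0 | m, rt)^w,
    where m is the fully masked prompt; w = 0 recovers the plain
    conditional model (the w/o CFG rows above).
    """
    return (1 + w) * cond_logits - w * uncond_logits
```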
For **the gen tasks of LLaDA-8B-Base**, the evaluation results are as follows:

| Settings | BBH | GSM8K | Math | HumanEval | MBPP |
|:---------|:---:|:-----:|:----:|:---------:|:----:|
| **gen\_length = 1024, steps = 1024, block\_length = 1024** | 49.7 | 70.3 | 31.4 | 35.4 | 40.0 |
| **gen\_length = 512, steps = 512, block\_length = 512**    | 50.4 | 70.8 | 30.9 | 32.9 | 39.2 |
| **gen\_length = 256, steps = 256, block\_length = 256**    | 45.0 | 70.0 | 30.3 | 32.9 | 40.2 |

In Tab. 1 of the [LLaDA paper](https://arxiv.org/pdf/2502.09992), we report the results with `gen_length = 1024, steps = 1024, block_length = 1024` for simplicity. However, as shown above, the performance across all three settings is consistent.

### OpenCompass

Please refer to `eval_llada_opencompass.sh` for the required dependencies and execution commands.

In addition to lm-eval, we can also employ OpenCompass to evaluate **LLaDA-8B-Base**. For the `gen_length = 256, steps = 256, block_length = 256` setting, the results are as follows:

| Settings        | BBH  | GSM8K | Math | HumanEval | MBPP |
|:----------------|:----:|:-----:|:----:|:---------:|:----:|
| **lm-eval**     | 45.0 | 70.0  | 30.3 | 32.9      | 40.2 |
| **OpenCompass** | 47.3 | 71.9  | 30.7 | 34.1      | 38.8 |

For **LLaDA-8B-Instruct**, the evaluation results are as follows. It is worth noting that in Tab. 1 and Tab. 2 of the [LLaDA paper](https://arxiv.org/pdf/2502.09992), we report the results with **pure diffusion sampling without any autoregressive elements**, as this setting yields the best overall performance.

|                               | MMLU | MMLU-pro | Hellaswag | ARC-C | GSM8K | Math | GPQA | HumanEval | MBPP |
|:------------------------------|:----:|:--------:|:---------:|:-----:|:-----:|:----:|:----:|:---------:|:----:|
| **gen\_length**               | 3    | 256      | 3         | 512   | 512   | 512  | 64   | 512       | 256  |
| **block\_length**             | 3    | 256      | 3         | 512   | 512   | 512  | 64   | 512       | 256  |
| **logits\_eos\_inf**          | False | False   | False     | False | False | False | False | True     | False |
| **confidence\_eos\_eot\_inf** | False | False   | False     | False | True  | True | True | False     | True |
| **Internal toolkit**          | 65.5 | 37.0     | 74.6      | 88.5  | 69.4  | 31.9 | 33.3 | 49.4      | 41.0 |
| **OpenCompass**               | 65.4 | 36.6     | 75.3      | 89.2  | 68.8  | 29.6 | 32.3 | 47.0      | 39.6 |

Please refer to Appendix B.4 of the [LLaDA paper](https://arxiv.org/pdf/2502.09992) for an explanation of these sampling settings.

Furthermore, we apply block diffusion sampling (i.e., semi-autoregressive remasking) to mitigate the tendency of **LLaDA-8B-Instruct** to generate excessive |EOS| tokens, which is caused by the extensive |EOS| padding in the SFT data. This strategy improves performance on the GSM8K and Math benchmarks, while leading to a decrease in accuracy on other benchmarks.

|                               | GSM8K | Math  |
|:------------------------------|:-----:|:-----:|
| **gen\_length**               | 256   | 512   |
| **block\_length**             | 8     | 64    |
| **logits\_eos\_inf**          | False | False |
| **confidence\_eos\_eot\_inf** | False | False |
| **Internal toolkit**          | 78.6  | 42.2  |
| **OpenCompass**               | 78.9  | 42.7  |
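To make the roles of `gen_length`, `steps`, and `block_length` concrete, here is a minimal sketch of block diffusion sampling with low-confidence remasking, in the spirit of the sampler described above. The function and variable names are illustrative, `model` and `mask_id` are assumed inputs, batch size 1 is assumed, and the |EOS|-related flags are omitted; setting `block_length = gen_length` recovers pure diffusion sampling.

```python
import torch

@torch.no_grad()
def block_diffusion_sample(model, prompt, gen_length, steps, block_length, mask_id):
    """Generate `gen_length` tokens block by block (semi-autoregressively).

    The total `steps` are split evenly across blocks; within a block, the
    most confident masked position is unmasked at each step. With
    steps == gen_length (as in the tables above), every block is fully
    unmasked before moving on to the next one.
    """
    x = torch.cat(
        [prompt, torch.full((1, gen_length), mask_id, dtype=torch.long, device=prompt.device)],
        dim=1,
    )
    num_blocks = gen_length // block_length
    steps_per_block = steps // num_blocks

    for b in range(num_blocks):
        start = prompt.shape[1] + b * block_length
        end = start + block_length
        for _ in range(steps_per_block):
            masked = x == mask_id
            masked[:, :start] = False  # Only positions in the current block compete.
            masked[:, end:] = False
            if not masked.any():
                break
            # Assumes an HF-style forward pass returning `.logits` of shape (1, L, V).
            probs = torch.softmax(model(x).logits, dim=-1)
            confidence, pred = probs.max(dim=-1)
            confidence = confidence.masked_fill(~masked, -float("inf"))
            idx = confidence.argmax(dim=-1)  # Most confident masked position.
            x[0, idx] = pred[0, idx]
    return x[:, prompt.shape[1]:]
```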
The evaluation results of **LLaDA 1.5** are as follows:

|                               | GSM8K | Math  | GPQA  | HumanEval | MBPP  | IFEval |
|:------------------------------|:-----:|:-----:|:-----:|:---------:|:-----:|:------:|
| **gen\_length**               | 256   | 1024  | 256   | 512       | 512   | 256    |
| **block\_length**             | 16    | 128   | 16    | 32        | 32    | 16     |
| **logits\_eos\_inf**          | False | False | False | False     | False | False  |
| **confidence\_eos\_eot\_inf** | True  | True  | False | True      | True  | True   |
| **Internal toolkit**          | 83.8  | 42.6  | 36.9  | 52.4      | 42.8  | 66.2   |
| **OpenCompass**               | 83.6  | 42.3  | 34.8  | 51.2      | 42.6  | 65.2   |

Note that Arena-Hard, AlignBench, and MT-Bench require access to the OpenAI API for evaluation, and are therefore not included.

**Batch generation** is also supported in OpenCompass. To enable this feature, update the `batch_size` and `batch_size_` parameters in `OpenCompass/examples/xxx.py`.

Please note that in October 2025, we updated the `modeling_llada.py` file on Hugging Face to support attention mask inputs. Make sure you are using the latest version of the file to ensure compatibility.

If you want to use a **custom model path**, edit the model file under `opencompass/opencompass/configs/models/dllm/xxx.py` and modify the path argument. For example:

```python
models = [
    dict(
        type=LLaDAModel,
        abbr='llada-8b-instruct',
        path='/your/custom/path/to/GSAI-ML/LLaDA-8B-Instruct',  # Change this path
        max_out_len=1024,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]
```

## Reversal curse

We downloaded a [text file](https://wenku.baidu.com/view/f13866185fbfc77da369b1b3?wkts=1760409102730) containing a large collection of classical Chinese poetic lines from Baidu Wenku. Using regular expressions, we extracted pairs of consecutive poetic lines (i.e., couplets) and stored them in a file named `data/poem_data.json` (a sketch of this extraction step appears at the end of this file). We provide the evaluation commands as follows:

```
# generate the subsequent line
python eval_reverse.py --type ftb --eos_inf

# generate the preceding line
python eval_reverse.py --type btf --eos_inf
```

## Acknowledgments

Thanks to [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) and [OpenCompass](https://github.com/open-compass/opencompass) for their great work!
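As referenced in the Reversal curse section, the couplet extraction can be done with a short regular-expression pass over the raw text. The sketch below is illustrative only: the pattern, the raw-text file name `poems_raw.txt`, and the JSON keys are assumptions, not the exact preprocessing we used.

```python
import json
import re

# Classical couplets typically look like "XXXXX，YYYYY。" with lines of
# five or seven characters; the exact pattern here is an assumption.
COUPLET = re.compile(r"([^，。？！\s]{5,7})，([^，。？！\s]{5,7})[。？！]")

def extract_couplets(raw_text: str) -> list[dict]:
    """Return consecutive poetic line pairs (couplets) found in raw_text."""
    return [
        {"first_line": first, "second_line": second}
        for first, second in COUPLET.findall(raw_text)
    ]

if __name__ == "__main__":
    with open("poems_raw.txt", encoding="utf-8") as f:  # assumed input file
        pairs = extract_couplets(f.read())
    with open("data/poem_data.json", "w", encoding="utf-8") as f:
        json.dump(pairs, f, ensure_ascii=False, indent=2)
```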