# CSLLM: Crystal Structure Large Language Model

## Model Description
CSLLM (Crystal Structure Large Language Model) is a framework of fine-tuned large language models for crystal structure synthesizability prediction. It consists of three specialized LLMs that predict the synthesizability of arbitrary 3D crystal structures, identify suitable synthesis methods, and recommend appropriate precursors.
## Model Details
- Repository: https://github.com/szl666/CSLLM
- Model type: Fine-tuned Large Language Model for Crystal Structure Synthesizability Prediction
- Language(s): English
- Base models: LLaMA-7B, LLaMA3-8B
### Model Variants

The CSLLM family includes several specialized variants:

- `method_llm_llama3`: Specialized for crystal synthesis method prediction
- `precursor_llm_llama3`: Focused on precursor identification and selection for crystal synthesis
- `synthesis_llm_llama`: Synthesizability prediction for crystals using LLaMA-7B
- `synthesis_llm_llama3`: Synthesizability prediction for crystals using LLaMA3-8B
## Usage

### Installation and Setup

First, clone and install the LMFlow library:
```bash
git clone https://github.com/OptimalScale/LMFlow
cd LMFlow
pip install -e .
```
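As an optional sanity check (not part of the official setup instructions), you can confirm that the editable install succeeded and that a CUDA device is visible:

```bash
# Optional sanity check (not from the official docs): verify the install and GPU visibility
python -c "import lmflow; print('LMFlow imported OK')"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```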
### Batch Evaluation Script

Create a `run_evaluation.sh` script in the LMFlow main directory:
```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 \
deepspeed examples/evaluate.py \
  --answer_type math \
  --model_name_or_path {model_name_or_path} \
  --lora_model_path {lora_model_path} \
  --dataset_path {dataset_path} \
  --prompt_structure "input: {input}" \
  --deepspeed examples/ds_config.json \
  --metric accuracy
```
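As a concrete illustration, here is how the placeholders might be filled in to evaluate the LLaMA3-8B synthesizability model. The three local paths are hypothetical; point them at wherever you keep the base model weights, the LoRA adapter, and the test data:

```bash
# Hypothetical example invocation; only the three paths differ from the template above
CUDA_VISIBLE_DEVICES=0 \
deepspeed examples/evaluate.py \
  --answer_type math \
  --model_name_or_path ./models/llama3-8b-hf \
  --lora_model_path ./models/synthesis_llm_llama3 \
  --dataset_path ./data/synthesis_test \
  --prompt_structure "input: {input}" \
  --deepspeed examples/ds_config.json \
  --metric accuracy
```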
### Configuration Parameters

#### Base Model Options (`model_name_or_path`)

- `llama-7b-hf`: LLaMA-7B base model
- `llama3-8b-hf`: LLaMA3-8B base model
#### Fine-tuned Model Variants (`lora_model_path`)

- `method_llm_llama3`: Crystal synthesis method prediction
- `precursor_llm_llama3`: Precursor recommendation
- `synthesis_llm_llama`: Synthesizability prediction (LLaMA-7B based)
- `synthesis_llm_llama3`: Synthesizability prediction (LLaMA3-8B based)

#### Dataset Path (`dataset_path`)
Example test data is provided in the repository at: https://github.com/szl666/CSLLM/tree/main/data
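LMFlow expects `dataset_path` to point at a directory containing a JSON file in its dataset format. Below is a minimal sketch of such a file, assuming LMFlow's `text2text` layout for input/output pairs; the prompt and label strings are illustrative placeholders, not taken from the actual CSLLM data:

```bash
# Create a minimal test dataset in LMFlow's JSON layout (illustrative content only)
mkdir -p data/my_test
cat > data/my_test/test.json <<'EOF'
{
  "type": "text2text",
  "instances": [
    {
      "input": "Description of a candidate 3D crystal structure ...",
      "output": "1"
    }
  ]
}
EOF
```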
## Training Data
CSLLM models were trained on curated datasets including:
- Crystal structure databases (ICSD, COD, Materials Project)
- Synthesis protocols and experimental procedures from materials science literature
## Performance
The models demonstrate strong performance across various crystallographic tasks:
- Synthesizability prediction: 98.6% accuracy on unseen crystal structures
- Synthesis method prediction: over 90% accuracy in classifying synthesis methods
- Precursor selection: over 90% accuracy in identifying suitable solid-state precursors for common binary and ternary compounds
## Technical Requirements
- Hardware: NVIDIA GPU with at least 40GB VRAM recommended
- Software: PyTorch, DeepSpeed, LMFlow framework
- Memory: Sufficient RAM for loading large language models
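A quick, generic way to check whether a machine meets the VRAM recommendation (standard `nvidia-smi` usage, nothing CSLLM-specific):

```bash
# List each GPU's name and total memory; look for >= 40 GB
nvidia-smi --query-gpu=name,memory.total --format=csv
```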
## Citation
If you use CSLLM in your research, please cite the article:
```bibtex
@article{song2025accurate,
  title={Accurate prediction of synthesizability and precursors of 3D crystal structures via large language models},
  author={Song, Z and Lu, S and Ju, M and others},
  journal={Nature Communications},
  volume={16},
  number={1},
  pages={6530},
  year={2025}
}
```