LedgerBERT

Model Description

Model Summary

LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.

LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.

  • Developed by: Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
  • Model type: BERT-base encoder (bidirectional transformer)
  • Language: English
  • License: CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
  • Base model: SciBERT (allenai/scibert_scivocab_cased)
  • Training corpus: DLT-Corpus (2.98 billion tokens)

Model Architecture

  • Architecture: BERT-base
  • Parameters: 110 million
  • Hidden size: 768
  • Number of layers: 12
  • Attention heads: 12
  • Vocabulary size: 30,522 (SciBERT vocabulary)
  • Max sequence length: 512 tokens
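
These values follow the standard BERT-base configuration and can be verified from the published config. A minimal sketch using the transformers library:

from transformers import AutoConfig

# Inspect the released configuration to confirm the values listed above.
config = AutoConfig.from_pretrained("ExponentialScience/LedgerBERT")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.max_position_embeddings)  # 512
print(config.vocab_size)               # tokenizer vocabulary size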

Intended Uses

Primary Use Cases

LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:

  • Named Entity Recognition (NER): Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle trees, hashing)
  • Text Classification: Categorizing DLT-related documents, patents, or social media posts
  • Sentiment Analysis: Analyzing sentiment in cryptocurrency news and social media
  • Information Extraction: Extracting technical concepts and relationships from DLT literature
  • Document Retrieval: Building search systems for DLT content
  • Question Answering (QA): Creating QA systems for blockchain and cryptocurrency topics
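
Because LedgerBERT is an MLM-pre-trained encoder, a quick way to probe its grasp of DLT terminology is the fill-mask pipeline. The sketch below is illustrative and assumes the released checkpoint retains the masked-language-modeling head from pre-training; if it does not, the head is randomly initialized and the predictions will not be meaningful.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ExponentialScience/LedgerBERT")

# Use the tokenizer's own mask token so the example works regardless of its string form.
text = f"Ethereum uses a Proof of {fill_mask.tokenizer.mask_token} consensus mechanism."

for prediction in fill_mask(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))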

Out-of-Scope Uses

  • Real-time trading systems: LedgerBERT should not be used as the sole basis for automated trading decisions
  • Investment advice: Not suitable for providing financial or investment recommendations without proper disclaimers
  • General-purpose NLP: While LedgerBERT maintains general language understanding, it is optimized for DLT-specific tasks
  • Legal or regulatory compliance: Should not be used for legal interpretation without expert review

Training Details

Training Data

LedgerBERT was continually pre-trained on the DLT-Corpus, which comprises 22.12 million documents (2.98 billion tokens) drawn from scientific literature, patents, and social media.

For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402

Training Procedure

Continual Pre-training:

Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.

Training hyperparameters:

  • Epochs: 3
  • Learning rate: 5×10⁻⁵ with linear decay schedule
  • MLM probability: 0.15 (standard BERT masking)
  • Warmup ratio: 0.10
  • Batch size: 12 per device
  • Sequence length: 512 tokens
  • Weight decay: 0.01
  • Optimizer: Stable AdamW
  • Precision: bfloat16
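
As a rough illustration of how these hyperparameters map onto the Hugging Face Trainer, the sketch below sets up MLM continual pre-training from SciBERT. It is not the authors' training script: dlt_dataset is a hypothetical pre-tokenized corpus of 512-token chunks, and the sketch falls back to the Trainer's default AdamW rather than Stable AdamW.

from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from SciBERT and continue masked-language-model pre-training on DLT text.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# 15% of tokens are masked, as in standard BERT pre-training.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./ledgerbert-mlm",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.10,
    per_device_train_batch_size=12,
    weight_decay=0.01,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dlt_dataset,  # hypothetical tokenized dataset of 512-token chunks
    data_collator=data_collator,
)
trainer.train()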

Limitations and Biases

Known Limitations

  • Language coverage: English only; does not support other languages
  • Temporal coverage: Training data extends to mid-2023 for social media; may not capture very recent terminology
  • Domain specificity: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
  • Context length: Limited to 512 tokens; longer documents require truncation or chunking (see the sketch below)
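
A minimal chunking sketch for documents longer than 512 tokens, using the tokenizer's overlapping-window support (the 128-token stride and the averaging of [CLS] vectors are illustrative choices, not a recommendation from the authors):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

long_text = "..."  # placeholder for a DLT document longer than 512 tokens

# Split the document into overlapping 512-token windows.
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    stride=128,
    padding=True,
    return_overflowing_tokens=True,
)
inputs.pop("overflow_to_sample_mapping", None)  # bookkeeping field the model does not accept

with torch.no_grad():
    outputs = model(**inputs)

# One [CLS] vector per window; average them into a single document embedding.
doc_embedding = outputs.last_hidden_state[:, 0, :].mean(dim=0)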

Potential Biases

The model may reflect biases present in the training data:

  • Geographic bias: English-language sources may over-represent certain regions
  • Platform bias: Social media data only from Twitter/X; other platforms not represented
  • Temporal bias: More recent DLT developments are more heavily represented
  • Market bias: Training during periods of market volatility may influence sentiment understanding
  • Source bias: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are more discussed than others

Ethical Considerations

  • Market manipulation risk: Could potentially be misused for analyzing or generating content for market manipulation
  • Investment decisions: Should not be used as sole basis for financial decisions without proper risk disclaimers
  • Misinformation: May reproduce or fail to identify false claims present in training data
  • Privacy: While usernames were removed from social media data, care should be taken not to re-identify individuals

How to Use

Basic Usage

from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

# Example text
text = "Ethereum uses Proof of Stake consensus mechanism for transaction validation."

# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
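
For sentence- or document-level vectors (e.g. for retrieval), a common follow-up is mean pooling of the token embeddings under the attention mask. A minimal sketch continuing from the code above:

import torch

# Mean-pool the token embeddings, ignoring padding, to get one vector per input.
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)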

Fine-tuning for NER

from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

# Load for token classification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "ExponentialScience/LedgerBERT",
    num_labels=num_labels  # Set based on your NER task
)

# Fine-tune on your dataset
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    warmup_steps=500
)

# train_dataset and eval_dataset are placeholders for your tokenized NER datasets
# (token-level labels aligned to word pieces)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()

Fine-tuning for Sentiment Analysis

A fine-tuned version for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")

text = "Bitcoin reaches new all-time high amid institutional adoption"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
# Map the predicted class id to the label name stored in the model's config
predicted_label = model.config.id2label[predictions.item()]

Citation

If you use LedgerBERT in your research, please cite:

@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}

Related Resources

  • DLT-Corpus collection: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
  • LedgerBERT-Market-Sentiment (fine-tuned for market sentiment analysis): https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

Model Card Contact

For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus
