LedgerBERT-Market-Sentiment

Model Description

Model Summary

LedgerBERT-Market-Sentiment is a fine-tuned version of LedgerBERT (https://huggingface.co/ExponentialScience/LedgerBERT) specialized for sentiment analysis of cryptocurrency and DLT-related content. The model classifies text into three market direction sentiment categories: bullish (positive market outlook), bearish (negative market outlook), and neutral (balanced or unclear market direction).

This model is particularly effective for analyzing cryptocurrency news headlines, social media posts, and other DLT-related content where understanding market sentiment is important.

  • Model type: BERT-base encoder for sequence classification
  • Language: English
  • License: Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0)
  • Base model: LedgerBERT (ExponentialScience/LedgerBERT)
  • Fine-tuning dataset: DLT-Sentiment-News (23,301 examples)
  • Task: Multi-class sentiment classification (3 classes)

Model Architecture

  • Architecture: BERT-base for sequence classification
  • Parameters: 110 million
  • Hidden size: 768
  • Number of layers: 12
  • Attention heads: 12
  • Vocabulary size: 30,522 (SciBERT vocabulary)
  • Max sequence length: 512 tokens
  • Output: 3-class logits (bullish, bearish, neutral)
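
The 110 million figure can be roughly reproduced from the dimensions listed above. The sketch below is a back-of-the-envelope count, not an inspection of the released weights. It assumes the standard BERT-base layout, including a feed-forward size of 3072, two token-type embeddings, and a pooler layer, none of which are stated in this card.

```python
# Dimensions taken from the architecture list above.
VOCAB_SIZE = 30_522
HIDDEN = 768
LAYERS = 12
MAX_POSITIONS = 512
NUM_LABELS = 3

# Assumptions: standard BERT-base values, not stated in this model card.
INTERMEDIATE = 3072
TYPE_VOCAB = 2


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position, and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm

# Self-attention: query, key, value, and output projections plus LayerNorm.
# The number of attention heads does not change the parameter count.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm

# Feed-forward block: expand, project back, LayerNorm.
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm

encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + encoder + pooler + classifier
print(f"Estimated parameters: {total:,} (~{total / 1e6:.0f}M)")
```

Under these assumptions the 3-class classification head adds only a few thousand parameters, so almost all of the total sits in the embeddings and the encoder.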

Intended Uses

Primary Use Cases

This model is designed for sentiment analysis tasks in the cryptocurrency and DLT domain:

  • Market sentiment analysis: Analyzing sentiment in cryptocurrency news articles, headlines, and market commentary
  • Social media monitoring: Understanding market direction sentiment in tweets, Reddit posts, and forum discussions
  • News aggregation: Automatically categorizing cryptocurrency news by market sentiment
  • Research applications: Studying sentiment trends and their relationship to market dynamics
  • Content filtering: Organizing DLT content based on market outlook

Example Applications

# Analyzing news headlines
"Bitcoin surges to new all-time high" → Bullish
"Ethereum faces regulatory scrutiny" → Bearish
"Stablecoin market remains stable" → Neutral

# Social media sentiment
"To the moon! 🚀" → Bullish
"Another crypto winter incoming" → Bearish
"Waiting for clear market direction" → Neutral

Out-of-Scope Uses

  • Investment decisions: This model should NOT be used as the sole basis for making investment or trading decisions
  • Financial advice: Not suitable for providing personalized financial or investment recommendations
  • Real-time trading: Should not be used for automated high-frequency trading systems
  • Market manipulation: Must not be used to coordinate or facilitate market manipulation
  • General sentiment analysis: Optimized for market direction sentiment; may not perform well on general emotional sentiment

Training Details

Training Data

The model was fine-tuned on the DLT-Sentiment-News dataset, which contains:

  • Size: 23,301 examples
  • Tokens: 1.85 million tokens (average 79.51 tokens per example)
  • Temporal coverage: January 2021 to May 2025
  • Source: CryptoPanic platform cryptocurrency news headlines and descriptions
  • Labels: Crowdsourced votes from active cryptocurrency community members
  • Classification method: Percentile-based labeling (25th and 75th percentiles as boundaries)
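
Percentile-based labeling with 25th and 75th percentile boundaries could look roughly like the sketch below. It is illustrative only: the card does not say which vote-derived quantity is thresholded, so `scores` (imagined here as a net vote score per item), the function name `percentile_labels`, and the choice to treat values exactly on a boundary as neutral are all assumptions.

```python
from statistics import quantiles


def percentile_labels(scores):
    """Label each score by its position relative to the 25th/75th percentiles."""
    # quantiles(..., n=4) returns the three quartile cut points.
    p25, _median, p75 = quantiles(scores, n=4, method="inclusive")
    labels = []
    for score in scores:
        if score < p25:
            labels.append("bearish")   # bottom quarter of scores
        elif score > p75:
            labels.append("bullish")   # top quarter of scores
        else:
            labels.append("neutral")   # middle half, including boundary values
    return labels


# Hypothetical net vote scores, one per news item (not real dataset values).
scores = [-12, -5, -1, 0, 0, 1, 2, 4, 9, 15]

for score, label in zip(scores, percentile_labels(scores)):
    print(f"{score:>4} -> {label}")
```

Because the boundaries are computed from the data itself, the same raw score can receive a different label if the score distribution shifts.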

Labels by sentiment dimension:

  • Market Direction: bullish, bearish, neutral

Because the annotations are crowdsourced from cryptocurrency users, the labels carry domain expertise and are more relevant to this domain than annotations from general-purpose crowdworkers.

Note: News articles are absent from the DLT-Corpus used to pre-train LedgerBERT, so fine-tuning on this dataset is an out-of-domain generalization test of the pre-trained model's language understanding.

For more details on the dataset used for fine-tuning, see: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News

Training Procedure

Fine-tuning hyperparameters:

  • Epochs: 3
  • Learning rate: 2×10⁻⁵
  • Warmup steps: 500
  • Batch size: 8 per device (training and evaluation)
  • Train/test split: 90% training, 10% testing
  • Optimizer: AdamW with fused operations
  • Precision: bfloat16
  • Max sequence length: 512 tokens (tokenizer default)
  • Truncation: Enabled
  • Padding: Enabled
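
As a rough sanity check on these settings, the sketch below estimates how many optimizer steps the run implies and what share of them the 500 warmup steps cover. It assumes a single device, no gradient accumulation, and a 90% split rounded down. The card states none of these, so the actual step counts may differ.

```python
import math

# Values taken from the hyperparameter list and the dataset size above.
DATASET_SIZE = 23_301
EPOCHS = 3
WARMUP_STEPS = 500
BATCH_SIZE_PER_DEVICE = 8

# Assumptions: not stated in this model card.
NUM_DEVICES = 1
GRADIENT_ACCUMULATION_STEPS = 1

# 90% of the examples go to training (integer arithmetic, rounded down).
train_examples = DATASET_SIZE * 9 // 10

effective_batch = BATCH_SIZE_PER_DEVICE * NUM_DEVICES * GRADIENT_ACCUMULATION_STEPS
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_fraction = WARMUP_STEPS / total_steps

print(f"Training examples: {train_examples}")
print(f"Steps per epoch:   {steps_per_epoch}")
print(f"Total steps:       {total_steps}")
print(f"Warmup share:      {warmup_fraction:.1%}")
```

With more devices or gradient accumulation the total step count shrinks, and the fixed 500 warmup steps would then cover a larger share of training.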

Limitations and Biases

Known Limitations

  • Temporal lag: Not suitable for real-time sentiment analysis; trained on historical data (2021-2025)
  • Context dependency: Headlines and descriptions lack full article context, which may affect sentiment interpretation
  • Language coverage: English only; does not support other languages
  • Sarcasm and irony: May struggle with nuanced language common in cryptocurrency discourse (e.g., "HFSP" - Have Fun Staying Poor)
  • Evolving terminology: Cryptocurrency memes and terminology evolve rapidly; may not capture newest slang
  • Market volatility: Sentiment can change rapidly after news publication; static predictions may become outdated quickly

Potential Biases

The model may reflect biases present in the training data:

  • Platform bias: Data from CryptoPanic users only; may not represent broader market sentiment
  • User bias: Active crypto community members may have different perspectives than general investors
  • Temporal bias: Training data spans 2021-2025, reflecting specific market conditions (bull markets, bear markets, crypto winters)
  • Source bias: Certain news sources or cryptocurrencies may be over-represented in the training data
  • Geographic bias: English-language news sources are over-represented
  • Market condition bias: Dataset reflects specific market cycles that may not generalize to all conditions

Data Collection Biases

  • Vote manipulation: Despite quality filters, coordinated voting on the source platform cannot be completely ruled out
  • Minimum vote threshold: Filtering by median votes may exclude less popular but valid sentiment signals
  • Percentile-based labeling: Classification boundaries (25th/75th percentiles) are somewhat arbitrary

How to Use

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ExponentialScience/LedgerBERT-Market-Sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example texts
texts = [
    "Bitcoin reaches new all-time high amid institutional adoption",
    "SEC announces crackdown on cryptocurrency exchanges",
    "Ethereum network upgrade proceeding as planned"
]

# Read the label order from the model config instead of hard-coding it.
# If the config only has generic names (e.g. LABEL_0), map them manually.
id2label = model.config.id2label

# Classify sentiment
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

    # Inference only: no gradients needed
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = probabilities.argmax(dim=-1).item()

    sentiment = id2label[predicted_class]
    confidence = probabilities[0][predicted_class].item()

    print(f"Text: {text}")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.3f})\n")

Batch Processing

from transformers import pipeline

# Create sentiment analysis pipeline
classifier = pipeline(
    "text-classification",
    model="ExponentialScience/LedgerBERT-Market-Sentiment",
    tokenizer="ExponentialScience/LedgerBERT-Market-Sentiment"
)

# Process multiple texts
texts = [
    "DeFi protocol launches new staking mechanism",
    "Major cryptocurrency exchange faces liquidity crisis",
    "Blockchain adoption continues in enterprise sector"
]

results = classifier(texts, truncation=True, max_length=512)

for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (score: {result['score']:.3f})\n")
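
For the news aggregation and research use cases, batch results can be summarized into per-label shares. The sketch below works on a list of dictionaries shaped like the `text-classification` pipeline output (`label` and `score` keys). The sample values are made up, the helper name `summarize` and its `min_score` cutoff are illustrative, and the label strings assume the model config exposes `bullish`, `bearish`, and `neutral` as label names.

```python
from collections import Counter


def summarize(results, min_score=0.0):
    """Return the share of each label among results scoring at least min_score."""
    counts = Counter(r["label"] for r in results if r["score"] >= min_score)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical classifier outputs (not real model predictions).
results = [
    {"label": "bullish", "score": 0.91},
    {"label": "bearish", "score": 0.84},
    {"label": "bullish", "score": 0.55},
    {"label": "neutral", "score": 0.62},
]

for label, share in summarize(results).items():
    print(f"{label}: {share:.0%}")

# Ignore low-confidence predictions before aggregating.
confident_only = summarize(results, min_score=0.6)
```

Dropping low-confidence predictions changes the picture here: the second bullish item falls below the cutoff, so the three labels end up with equal shares.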

Integration with News Feeds

import feedparser
from transformers import pipeline

# Initialize classifier
classifier = pipeline(
    "text-classification",
    model="ExponentialScience/LedgerBERT-Market-Sentiment"
)

# Example: Analyze cryptocurrency news feed
feed_url = "https://example-crypto-news.com/rss"
feed = feedparser.parse(feed_url)

for entry in feed.entries[:5]:  # Process first 5 entries
    title = entry.title
    result = classifier(title, truncation=True, max_length=512)[0]
    
    print(f"Headline: {title}")
    print(f"Market Sentiment: {result['label']} ({result['score']:.2%})")
    print(f"Link: {entry.link}\n")

Citation

If you use LedgerBERT-Market-Sentiment in your research, please cite:

@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}

Related Resources

Additional Fine-tuned Models

LedgerBERT can also be fine-tuned for other sentiment dimensions available in the DLT-Sentiment-News dataset (https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News):

  • Content Characteristics (liked, disliked, neutral)
  • Engagement Quality (important, lol, neutral)

Model Card Contact

For questions or feedback about LedgerBERT-Market-Sentiment, please open an issue on the GitHub repository: https://github.com/dlt-science/DLT-Corpus


⚠️ Important Disclaimer: This model is provided for research and educational purposes only. It should not be used as financial advice or as the sole basis for investment decisions. Cryptocurrency markets are highly volatile and unpredictable. Always conduct your own research and consult with qualified financial advisors before making investment decisions.
