license: gpl-3.0
language:
  - en
base_model:
  - facebook/esm2_t30_150M_UR50D
tags:
  - peptides
  - neurotoxicity
  - protein-classification
  - therapeutic-peptides
  - bioinformatics
  - esm2
  - transformer

🧠 NTxPred2: A large language model for predicting neurotoxic peptides and neurotoxins

NTxPred2 is a fine-tuned transformer model built on top of the ESM2-t30_150M_UR50D protein language model. It is trained for binary classification of peptide sequences: predicting whether a peptide is neurotoxic or non-toxic.

🎯 Use Case: Accelerating the identification and design of safe peptide therapeutics by filtering out neurotoxic candidates early in the drug development pipeline.


🖼️ NTxPred2 Workflow

[Figure: NTxPred2 workflow]


🧬 Model Highlights

  • Base Model: Facebook’s ESM2-t30 (150M parameters)
  • Fine-Tuning Task: Neurotoxicity prediction (binary classification)
  • Input: Short peptide sequences (7–50 amino acids)
  • Output: Binary label → 1 (neurotoxic), 0 (non-toxic)
  • Architecture: ESM2 encoder + linear classification head
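Since the model expects short peptides, it can help to check inputs before scoring them. The following is a minimal sketch, assuming the 7–50 range above is an inclusive length limit and that only the 20 standard one-letter residues are accepted; `check_peptide` is a hypothetical helper and is not part of NTxPred2.

```python
# Standard 20 amino acids in one-letter code (assumed alphabet for this sketch)
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LEN, MAX_LEN = 7, 50  # length range stated in the highlights above


def check_peptide(sequence: str) -> list[str]:
    """Return a list of problems; an empty list means the peptide looks usable."""
    seq = sequence.strip().upper()
    problems = []
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        problems.append(f"length {len(seq)} outside {MIN_LEN}-{MAX_LEN}")
    unknown = sorted(set(seq) - STANDARD_AA)
    if unknown:
        problems.append("non-standard residues: " + "".join(unknown))
    return problems


print(check_peptide("ACDEFGHIKLMNPQRSTVWY"))  # 20 standard residues, within range
print(check_peptide("ACDX"))                  # too short and contains X
```

Returning a list of problems rather than raising makes it easy to report every issue for a sequence in one pass.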

🗂️ Files Included

  • config.json – Configuration settings for the model architecture, hyperparameters, and training details.

  • model.safetensors – The trained model weights, saved in the SafeTensors format, which is safer and faster to load than traditional .bin files.

  • special_tokens_map.json – Mappings for special tokens, like [CLS], [SEP], or any custom tokens used by the tokenizer.

  • tokenizer_config.json – Tokenizer-related settings (like vocabulary size and tokenization method).

  • vocab.txt – Lists all tokens and their corresponding IDs; it is essential for tokenizing sequences.
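The loading code further down reads two keys from config.json, `embedding_dim` and `num_classes`. Below is a small sketch of checking for those keys before building the classifier. The example values are assumptions for illustration (640 is the hidden size of esm2_t30_150M_UR50D, and a single output logit is assumed); the actual file may contain different values and additional keys. `read_classifier_config` is a hypothetical helper.

```python
import json

# Keys that the loading code in this README reads from config.json
REQUIRED_KEYS = ("embedding_dim", "num_classes")


def read_classifier_config(text: str) -> dict:
    """Parse config JSON text and return the two values the classifier needs."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return {key: int(config[key]) for key in REQUIRED_KEYS}


# Illustrative content only; the real file may hold other values and extra keys
example = '{"embedding_dim": 640, "num_classes": 1}'
print(read_classifier_config(example))
```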


🚀 How to Use

🔧 Install Dependencies

pip install torch fair-esm biopython huggingface_hub


### Loading the Model from Hugging Face

```python
import torch
import torch.nn as nn
import esm
import json
from huggingface_hub import hf_hub_download

# Define the classifier model (ESM encoder + linear head)
class ProteinClassifier(nn.Module):
    def __init__(self, esm_model, embedding_dim, num_classes):
        super(ProteinClassifier, self).__init__()
        self.esm_model = esm_model
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):
        layer_index = len(self.esm_model.layers)  # Get number of layers
        results = self.esm_model(tokens, repr_layers=[layer_index])
        embeddings = results["representations"][layer_index].mean(1)
        return self.fc(embeddings)

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load config from the NTxPred2 repo
config_path = hf_hub_download(repo_id="anandr88/NTxPred2", filename="config.json")
with open(config_path, 'r') as f:
    config = json.load(f)

# Load the pretrained ESM2 encoder and its alphabet
model_name = "esm2_t30_150M_UR50D"
esm_model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
batch_converter = alphabet.get_batch_converter()

# Build the classifier; the linear head starts with random weights
# (this snippet does not read model.safetensors)
classifier = ProteinClassifier(
    esm_model, 
    embedding_dim=config['embedding_dim'], 
    num_classes=config['num_classes']
)
classifier.to(device)
classifier.eval()

print("✅ Model loaded successfully!")
print(f"Using device: {device}")
print(f"Model architecture: {classifier}")
```

🧪 Example Usage (Optional)

```python
# Example usage for binary classification
sequence = ("TEST_SEQUENCE", "ACDEFGHIKLMNPQRSTVWY")  # (name, peptide sequence)

# Convert to model input format
_, _, batch_tokens = batch_converter([sequence])
batch_tokens = batch_tokens.to(device)

# Predict
with torch.no_grad():
    logits = classifier(batch_tokens)
    # Sigmoid for binary classification; .item() assumes a single output logit (num_classes == 1)
    probability = torch.sigmoid(logits).item()

# Interpret results
threshold = 0.5  # Standard threshold (adjust if needed)
prediction = "Neurotoxic" if probability >= threshold else "Non-toxic"

print("\n" + "="*50)
print(f"🔬 Input Sequence: {sequence[1]}")
print(f"📊 Neurotoxicity Probability: {probability:.4f}")
print(f"🏷️ Prediction: {prediction} (threshold={threshold})")
```
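The loading snippet builds the linear head with freshly initialized weights and never reads model.safetensors, so its predictions would not yet reflect the fine-tuned model. Here is a minimal sketch of how the stored weights might be applied, assuming the tensor names in model.safetensors match the parameter names of `ProteinClassifier` (for example `fc.weight`); the README does not confirm that layout. `apply_state_dict` and `load_finetuned_weights` are hypothetical helpers, and `safetensors` is an extra dependency that is not in the install line above.

```python
import torch
import torch.nn as nn


def apply_state_dict(model: nn.Module, state_dict: dict):
    """Copy matching tensors into `model` and report names that did not line up."""
    # strict=False tolerates missing or unexpected names; shape mismatches still raise
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


def load_finetuned_weights(model: nn.Module, weights_path: str):
    """Read a .safetensors file and apply it to `model` (hypothetical helper)."""
    from safetensors.torch import load_file  # requires `pip install safetensors`
    return apply_state_dict(model, load_file(weights_path))


# Tiny stand-in module so the sketch runs without downloading anything
head = nn.Linear(4, 1)
missing, unexpected = apply_state_dict(head, {"weight": torch.ones(1, 4)})
print("missing:", missing, "unexpected:", unexpected)
```

With the real model, the path could come from `hf_hub_download(repo_id="anandr88/NTxPred2", filename="model.safetensors")` and be passed to `load_finetuned_weights(classifier, weights_path)`. If names starting with `fc.` appear in the missing list, the classification head is still randomly initialized and the key names need to be remapped.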

📊 Applications

  • Neurotoxic peptide filtering in therapeutic design
  • Toxicity scanning of synthetic peptides
  • Dataset annotation for bioactivity studies
  • Educational use in bioinformatics and deep learning for proteins
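For the filtering and scanning use cases, a batch of candidates can be split by predicted score. The sketch below assumes a scoring callable that returns a probability-like value per sequence and reuses the 0.5 threshold from the usage example. `parse_fasta`, `screen`, and `toy_score` are hypothetical names, and `toy_score` is a placeholder that has nothing to do with toxicity; in practice the callable would wrap the classifier.

```python
from typing import Callable


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip() or "unnamed", []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def screen(records, score_fn: Callable[[str], float], threshold: float = 0.5):
    """Split records into (kept, flagged) lists of (name, sequence, score)."""
    kept, flagged = [], []
    for name, seq in records:
        score = score_fn(seq)
        (flagged if score >= threshold else kept).append((name, seq, score))
    return kept, flagged


def toy_score(seq: str) -> float:
    """Placeholder scorer (fraction of cysteines); NOT a toxicity model."""
    return seq.count("C") / max(len(seq), 1)


fasta = ">pep1\nACDEFGHIKL\n>pep2\nCCCCGGGCCC\n"
kept, flagged = screen(parse_fasta(fasta), toy_score)
print("kept:", [name for name, _, _ in kept])
print("flagged:", [name for name, _, _ in flagged])
```

The (name, sequence) pairs have the same shape as the tuple passed to `batch_converter` in the usage example, so the same records can feed the real model.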

🌐 Related Links


🧠 Citation

📖 Rathore et al.
A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins.
(Coming soon)


👨‍🔬 Start using NTxPred2 today to enhance your peptide screening pipeline with the power of transformer-based intelligence!