license: gpl-3.0
language:
  - en
base_model:
  - facebook/esm2_t30_150M_UR50D
tags:
  - peptides
  - neurotoxicity
  - protein-classification
  - therapeutic-peptides
  - bioinformatics
  - esm2
  - transformer

🧠 NTxPred2: A large language model for predicting neurotoxic peptides and neurotoxins

NTxPred2 is a fine-tuned transformer model built on the ESM2 protein language model (esm2_t30_150M_UR50D). It is trained for binary classification of peptide sequences: given a peptide, it predicts whether the peptide is neurotoxic or non-toxic.

🎯 Use Case: Accelerating the identification and design of safe peptide therapeutics by filtering out neurotoxic candidates early in the drug development pipeline.

🖼️ NTxPred2 Workflow

🧬 Model Highlights

Base Model: Facebook’s ESM2-t30 (150M parameters)
Fine-Tuning Task: Neurotoxicity prediction (binary classification)
Input: Short peptide sequences (7–50 amino acids)
Output: Binary label → 1 (neurotoxic), 0 (non-toxic)
Architecture: ESM2 encoder + linear classification head
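
The stated input range (7–50 amino acids) can be checked before a sequence reaches the model. The following is a minimal sketch and not part of NTxPred2: the function name `is_valid_peptide` and the restriction to the 20 standard amino-acid letters are assumptions made for illustration; only the length bounds come from the list above.

```python
# Hypothetical pre-filter; only the 7-50 length range comes from the model card.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard residues only
MIN_LEN, MAX_LEN = 7, 50


def is_valid_peptide(sequence: str) -> bool:
    """Return True if the sequence is 7-50 residues long and uses only standard residues."""
    seq = sequence.strip().upper()
    if not (MIN_LEN <= len(seq) <= MAX_LEN):
        return False
    return all(residue in STANDARD_AMINO_ACIDS for residue in seq)


print(is_valid_peptide("ACDEFGHIKLMNPQRSTVWY"))  # 20 standard residues, within range
print(is_valid_peptide("ACDE"))                  # shorter than 7 residues
```

Sequences that fail the check could be skipped or reported before tokenization, rather than scored outside the length range listed above.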

🗂️ Files Included

config.json – Configuration settings for the model architecture, hyperparameters, and training details.
model.safetensors – The trained model weights in SafeTensors format, which is safer and faster to load than traditional pickle-based .bin files.
special_tokens_map.json – Mappings for the tokenizer's special tokens (ESM-style tokenizers use tokens such as <cls>, <eos>, and <pad>) and any custom tokens.
tokenizer_config.json – Tokenizer settings, such as the tokenization method.
vocab.txt – The list of tokens that defines the token-to-ID mapping; it is required to tokenize sequences.

🚀 How to Use

🔧 Install Dependencies

pip install torch fair-esm biopython huggingface_hub


### Loading the Model from Hugging Face

```python
import json

import esm
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


# Classifier: ESM2 encoder followed by a linear classification head
class ProteinClassifier(nn.Module):
    def __init__(self, esm_model, embedding_dim, num_classes):
        super().__init__()
        self.esm_model = esm_model
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):
        # Use the representations from the final encoder layer
        layer_index = len(self.esm_model.layers)
        results = self.esm_model(tokens, repr_layers=[layer_index])
        # Mean-pool over the token dimension to get one vector per sequence
        embeddings = results["representations"][layer_index].mean(1)
        return self.fc(embeddings)


# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download and read the config from the model repo
config_path = hf_hub_download(repo_id="anandr88/NTxPred2", filename="config.json")
with open(config_path, "r") as f:
    config = json.load(f)

# Load the pretrained ESM2 encoder and its alphabet
model_name = "esm2_t30_150M_UR50D"
esm_model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
batch_converter = alphabet.get_batch_converter()

# Initialize a new classifier.
# NOTE: this snippet does not load the fine-tuned weights from model.safetensors,
# so the encoder holds the base ESM2 weights and the linear head is randomly initialized.
classifier = ProteinClassifier(
    esm_model,
    embedding_dim=config["embedding_dim"],
    num_classes=config["num_classes"],
)
classifier.to(device)
classifier.eval()

print("✅ Model loaded successfully!")
print(f"Using device: {device}")
print(f"Model architecture: {classifier}")
```

🧪 Example Usage (Optional)

```python
# Continues from the loading snippet above (uses classifier, batch_converter, device)

# Example input as a (name, sequence) tuple
sequence = ("TEST_SEQUENCE", "ACDEFGHIKLMNPQRSTVWY")  # Replace with your peptide sequence

# Convert to model input format
_, _, batch_tokens = batch_converter([sequence])
batch_tokens = batch_tokens.to(device)

# Predict
with torch.no_grad():
    logits = classifier(batch_tokens)
    # .item() requires a single value, i.e. one sequence and num_classes == 1
    probability = torch.sigmoid(logits).item()  # Sigmoid for binary classification

# Interpret results
threshold = 0.5  # Standard threshold (adjust if needed)
prediction = "Neurotoxic" if probability >= threshold else "Non-toxic"

print("\n" + "=" * 50)
print(f"🔬 Input Sequence: {sequence[1]}")
print(f"📊 Neurotoxicity Probability: {probability:.4f}")
print(f"🏷️ Prediction: {prediction} (threshold={threshold})")
```
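
The last step of the example, turning a raw logit into a probability and a label, can be illustrated without loading the model. This is a small sketch in plain Python; `label_from_logit` is a hypothetical helper, and it assumes the classifier emits a single logit per sequence, as the `.item()` call in the example requires.

```python
import math
from typing import Tuple


def label_from_logit(logit: float, threshold: float = 0.5) -> Tuple[float, str]:
    """Apply a sigmoid to one logit and compare the probability to the threshold."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = "Neurotoxic" if probability >= threshold else "Non-toxic"
    return probability, label


# Illustrative logits only; these are not model outputs.
for logit in (-2.0, 0.0, 2.0):
    probability, label = label_from_logit(logit)
    print(f"logit={logit:+.1f} -> p={probability:.4f} -> {label}")
```

Raising the threshold makes the filter less likely to flag a peptide as neurotoxic, and lowering it does the opposite; 0.5 is the default used in the example.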

📊 Applications

Neurotoxic peptide filtering in therapeutic design
Toxicity scanning of synthetic peptides
Dataset annotation for bioactivity studies
Educational use in bioinformatics and deep learning for proteins
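
For scanning or annotating many peptides, the sequences first need to be turned into the `(name, sequence)` tuples that the batch converter in the usage example takes. Below is a rough sketch, assuming the peptides are stored as FASTA text (the model card does not specify an input file format). `parse_fasta` is a hypothetical helper written in plain Python so that the block runs without extra dependencies.

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (name, sequence) tuples."""
    records: List[Tuple[str, str]] = []
    name: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; join them into one string
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>pep1 example peptide
ACDEFGHIK
LMNPQ
>pep2
GGGGSSSS
"""
batch = parse_fasta(example)
print(batch)
```

The resulting list has the same shape as the single-item list passed to `batch_converter` in the usage example, so it could be supplied in its place, with the prediction step adapted to handle more than one sequence.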

🌐 Related Links

🔬 Project Web Server: NTxPred2 Web Tool
🧾 Documentation & Source: GitHub – raghavagps/NTxPred2

🧠 Citation

📖 Rathore et al.
A Large Language Model for Predicting Neurotoxic Peptides and Neurotoxins.
(Coming soon)

👨‍🔬 Start using NTxPred2 today to enhance your peptide screening pipeline with the power of transformer-based intelligence!