bert-cyberbullying-bahasa-classifier
A fine-tuned BERT multilingual classifier for detecting cyberbullying in Bahasa Indonesia. This model performs binary classification:
- 0 → non-bullying
- 1 → bullying
Model Details
| Property | Value |
|---|---|
| Model Type | BERT (base multilingual) |
| Task | Cyberbullying Detection (Text Classification) |
| Language | Bahasa Indonesia |
| Labels | 0 → non-bullying, 1 → bullying |
| Framework | Hugging Face Transformers |
| Files | model.safetensors, config.json, tokenizer files |
Dataset
The model was trained on a combined dataset consisting of:
- Indonesian cyberbullying dataset
- Additional toxic / abusive comment datasets
- Social media-style and chat-style text
Preprocessing steps:
- text normalization
- emoji removal
- punctuation cleanup
- lowercasing
- label encoding (0 / 1)
The dataset was balanced across the two classes to reduce class bias.
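The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the `preprocess` function, the emoji ranges, and the whitespace-collapsing stand-in for "text normalization" are all assumptions.

```python
import re
import string

# Assumption: a rough emoji matcher covering the main emoji code-point ranges.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often trails an emoji
    "]+"
)

# Map every ASCII punctuation character to a space so words do not merge.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> str:
    """Lowercase, drop emoji, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = EMOJI_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Collapsing whitespace is a simple stand-in for fuller text normalization.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Bodoh BANGET sih!!! 😂"))
```

If the model was indeed trained on text cleaned this way, applying similar cleaning before inference should keep inputs closer to the training distribution.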
Training Information
- Base model: bert-base-multilingual-cased
- Epochs: 3–5
- Batch size: 16
- Optimizer: AdamW
- Learning rate: 2e-5
- Loss: Cross Entropy
- Train/Validation split: 80 / 20
Training ran on a GPU with 6 GB of VRAM, so the setup was kept small enough for low-memory hardware.
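An 80 / 20 train/validation split like the one listed above might be produced as in the sketch below. The card does not say whether the split was stratified by label, so the stratification, the `stratified_split` helper, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split (text, label) pairs so each label keeps roughly the same ratio."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val += [(text, label) for text in items[:n_val]]
        train += [(text, label) for text in items[n_val:]]

    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


# Toy data: ten placeholder texts with alternating labels.
texts = [f"contoh {i}" for i in range(10)]
labels = [0, 1] * 5
train, val = stratified_split(texts, labels)
print(len(train), len(val))
```

Splitting per label keeps the validation set as balanced as the training set, which matters when accuracy and macro F1 are reported side by side.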
How to Use
Python Example
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier from the Hub
model_name = "zeltera/bert-cyberbullying-bahasa-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "anjing lu jelek banget"
# truncation=True keeps long inputs within the model's maximum length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

label = torch.argmax(logits, dim=1).item()
print("Prediction:", label)  # 0 = non-bullying, 1 = bullying
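For moderation use, a probability with an adjustable threshold can be more useful than a bare argmax. The sketch below shows one way to turn the two logits into a decision using only the standard library. The `decide` helper, the default threshold of 0.5, and the sample logits are hypothetical and not part of the published model.

```python
import math

LABELS = {0: "non-bullying", 1: "bullying"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, threshold=0.5):
    """Return (label name, bullying probability) for one pair of logits."""
    p_bullying = softmax(logits)[1]
    label_id = 1 if p_bullying >= threshold else 0
    return LABELS[label_id], p_bullying


# Hypothetical logits, e.g. obtained via `logits[0].tolist()` in the example above.
name, prob = decide([-1.2, 2.3])
print(name, round(prob, 3))
```

Raising the threshold trades recall for precision, which may suit applications where false accusations are costlier than missed cases.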
Example Predictions
| Text | Output |
|---|---|
| "mampus lu biarin aja" | 1 (bullying) |
| "kamu lagi dimana?" | 0 (non-bullying) |
| "bodoh banget sih" | 1 (bullying) |
| "nice job bro" | 0 (non-bullying) |
Evaluation
| Metric | Score |
|---|---|
| Accuracy | ~0.90 |
| F1 (macro) | ~0.88 |
| Precision | ~0.89 |
| Recall | ~0.87 |
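Metrics of this kind can be computed from validation predictions as in the sketch below. The card only states that F1 is macro-averaged; treating precision and recall as macro-averaged too is an assumption, and the labels in the example are toy values, not the model's actual outputs.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Toy example: six validation items with two mistakes.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

Macro averaging weights both classes equally, so it stays informative even if the evaluation set is less balanced than the training data.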
Repository Contents
config.json
model.safetensors
tokenizer.json
tokenizer_config.json
special_tokens_map.json
vocab.txt
README.md
Intended Use
- AI chatbots (moderation / filtering)
- Social media comment analysis
- Cyberbullying detection systems
- Student safety applications
- Research on toxicity detection
Limitations
- Limited sarcasm detection
- May misclassify unseen slang
- Works best on Indonesian text
- Not suitable for legal or high-risk decisions
License
MIT License
Author
Model trained and published by @zeltera. Built using Hugging Face Transformers and PyTorch. Contact: Instagram @gnwnadiwjy.
Model tree for zeltera/bert-cyberbullying-bahasa-classifier
Base model
indobenchmark/indobert-base-p1
