---
license: apache-2.0
base_model: uitnlp/visobert
tags:
- vietnamese
- spam-detection
- text-classification
- e-commerce
datasets:
- ViSpamReviews
metrics:
- accuracy
- macro-f1
- macro-precision
- macro-recall
model-index:
- name: visobert-spam-binary
  results:
  - task:
      type: text-classification
      name: Spam Review Detection
    dataset:
      name: ViSpamReviews
      type: ViSpamReviews
    metrics:
    - type: accuracy
      value: 0.9144
    - type: macro-f1
      value: 0.8916
---

# visobert-spam-binary: Spam Review Detection for Vietnamese Text

This model is a fine-tuned version of [uitnlp/visobert](https://huggingface.co/uitnlp/visobert) on the **ViSpamReviews** dataset for spam review detection in Vietnamese e-commerce reviews.

## Model Details

* **Base Model**: `uitnlp/visobert`
* **Description**: ViSoBERT - Vietnamese Social BERT
* **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
* **Fine-tuning Framework**: Hugging Face Transformers
* **Task**: Spam Review Detection (binary)
* **Number of Classes**: 2

### Hyperparameters

* Max sequence length: `256`
* Learning rate: `5e-5`
* Batch size: `32`
* Epochs: `100`
* Early stopping patience: `5`

An illustrative fine-tuning sketch using these settings is included at the end of this card.

## Dataset

The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce review samples, split as follows:

* **Train set**: 14,299 samples (72%)
* **Validation set**: 1,590 samples (8%)
* **Test set**: 3,971 samples (20%)

### Labels

* **Non-spam** (0): Genuine product reviews
* **Spam** (1): Fake or promotional reviews

## Results

The model was evaluated on the test set with the following metrics (an evaluation sketch for recomputing them is included at the end of this card):

* **Accuracy**: `0.9144`
* **Macro-F1**: `0.8916`

## Usage

You can use this model for spam review detection in Vietnamese text. Below is an example; a batch-inference variant is sketched at the end of this card.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "visolex/visobert-spam-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example review text ("This product is very good, the shop delivered quickly!")
text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    probabilities = torch.softmax(outputs.logits, dim=-1)

# Map to label
label_map = {0: "Non-spam", 1: "Spam"}
predicted_label = label_map[predicted_class]
confidence = probabilities[0][predicted_class].item()

print(f"Text: {text}")
print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
```

## Citation

If you use this model, please cite:

```bibtex
@misc{visobert-spam-binary_spam_detection,
  title        = {visobert-spam-binary: Spam Review Detection for Vietnamese Text},
  author       = {{ViSoLex Team}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/visolex/visobert-spam-binary}}
}
```

## License

This model is released under the Apache-2.0 license.

## Acknowledgments

* Base model: [uitnlp/visobert](https://huggingface.co/uitnlp/visobert)
* Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
* ViSoLex Toolkit for Vietnamese NLP
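
## Fine-Tuning Sketch

The hyperparameters listed under Model Details map directly onto a standard Hugging Face `Trainer` setup. The sketch below is illustrative only: the CSV file names, the `text`/`label` column names, the output directory, and the early-stopping metric (`eval_loss`) are assumptions, not details taken from the released training code.

```python
# Illustrative fine-tuning sketch -- file names, column names, and the
# early-stopping metric are assumptions, not the released training script.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

base_model = "uitnlp/visobert"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

# Assumed local CSV splits of ViSpamReviews with "text" and "label" columns.
dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "validation": "dev.csv"},
)

def tokenize(batch):
    # Max sequence length 256, as listed in the hyperparameters.
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="visobert-spam-binary",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=100,
    eval_strategy="epoch",             # spelled evaluation_strategy on older releases
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss", # assumed monitoring metric
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)

trainer.train()
```

Note that some argument names are version-dependent in Transformers (e.g. `eval_strategy` vs. `evaluation_strategy`, `tokenizer` vs. `processing_class`); adjust to your installed version.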
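
## Evaluation Sketch

The reported accuracy and macro-averaged scores can be recomputed with scikit-learn once predictions are collected for the test split. The `test_texts` and `test_labels` lists below are placeholders, not the actual ViSpamReviews test data.

```python
# Evaluation sketch: recompute accuracy and macro-averaged metrics from predictions.
# `test_texts` / `test_labels` are placeholders for the ViSpamReviews test split.
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "visolex/visobert-spam-binary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

test_texts = ["Sản phẩm này rất tốt, shop giao hàng nhanh!"]  # replace with the real test texts
test_labels = [0]                                             # replace with the gold labels

predictions = []
with torch.no_grad():
    for i in range(0, len(test_texts), 32):
        batch = tokenizer(
            test_texts[i : i + 32],
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=256,
        )
        logits = model(**batch).logits
        predictions.extend(logits.argmax(dim=-1).tolist())

print("Accuracy :", accuracy_score(test_labels, predictions))
print("Macro-F1 :", f1_score(test_labels, predictions, average="macro"))
print("Macro-P  :", precision_score(test_labels, predictions, average="macro", zero_division=0))
print("Macro-R  :", recall_score(test_labels, predictions, average="macro", zero_division=0))
```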
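
## Batch Inference Sketch

For scoring many reviews at once, the `text-classification` pipeline wraps the same tokenizer and model. This is a sketch: the second review string is invented for illustration, and the returned label names depend on the model config's `id2label` mapping (they may appear as `LABEL_0`/`LABEL_1`, corresponding to Non-spam/Spam as documented above).

```python
# Batch scoring sketch using the text-classification pipeline.
from transformers import pipeline

classifier = pipeline("text-classification", model="visolex/visobert-spam-binary")

reviews = [
    "Sản phẩm này rất tốt, shop giao hàng nhanh!",  # "Great product, fast delivery!"
    "Mua ngay kẻo lỡ, giảm giá 50% hôm nay!",       # invented promotional-style example
]

# Truncation settings mirror the single-example usage above; labels may come back
# as LABEL_0 / LABEL_1 depending on the config's id2label mapping.
results = classifier(reviews, batch_size=32, truncation=True, max_length=256)
for review, result in zip(reviews, results):
    print(f"{review} -> {result['label']} ({result['score']:.2%})")
```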