
AssameseRoBERTa

Model Description

AssameseRoBERTa is a RoBERTa-based language model trained from scratch on Assamese monolingual text. The model is designed to provide robust language understanding capabilities for the Assamese language, which is spoken by over 15 million people primarily in the Indian state of Assam.

This model was developed by MWire Labs, an AI research organization focused on building language technologies for Northeast Indian languages.

Model Details

  • Model Type: RoBERTa (Robustly Optimized BERT Pretraining Approach)
  • Language: Assamese (as)
  • Training Data: 1.6M Assamese sentences from diverse sources
  • Parameters: ~110M
  • Training Epochs: 10
  • Training Duration: ~12 hours on A40 GPU
  • Vocabulary Size: 50,265 tokens
  • Max Sequence Length: 128 tokens
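
For reference, these details correspond to a standard RoBERTa-base configuration. The sketch below is illustrative only: the vocabulary size comes from this card, while the remaining values are the usual RoBERTa-base defaults.

from transformers import RobertaConfig

# Illustrative configuration: vocab_size is from this card; the other values
# are standard RoBERTa-base defaults. The 128-token training length is enforced
# by the tokenizer/data pipeline rather than by the config itself.
config = RobertaConfig(
    vocab_size=50_265,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
print(config)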

Performance

Perplexity Scores (Final Evaluation)

Model                  | Training Domain PPL | Unseen Text PPL
AssameseRoBERTa (Ours) | 1.7819              | 2.5332
Assamese-BERT          | 48.8211             | 12.5911
MuRIL                  | 85.7272             | 8.7032
mBERT                  | 26.7085             | 18.1564
IndicBERT              | 3194.1843           | 595.4611
AxomiyaBERTa           | 83615627.1696       | 30861455.2924

📄 Unseen evaluation set (10 Assamese sentences):
https://huggingface.co/MWirelabs/assamese-roberta/blob/main/assamese_unseen_eval_10.txt

As measured by perplexity, AssameseRoBERTa substantially outperforms existing multilingual and Assamese-specific models on both in-domain and unseen Assamese text.
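
The exact scoring protocol is not described in this card. One common way to approximate such numbers for a masked language model is pseudo-perplexity, where each token is masked in turn and the negative log-likelihoods are averaged. The sketch below applies that to the unseen evaluation file; treat it as an approximation, not a reproduction of the table above.

import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn and average the negative log-likelihoods."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc.input_ids[0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip <s> and </s>
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits
            log_probs = logits[0, i].log_softmax(-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

with open("assamese_unseen_eval_10.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

scores = [pseudo_perplexity(s) for s in sentences]
print(f"Mean pseudo-perplexity: {sum(scores) / len(scores):.4f}")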

Intended Use

Direct Use

  • Masked language modeling
  • Feature extraction
  • Downstream Assamese NLP tasks such as:
    • Text classification
    • NER
    • Sentiment analysis
    • Question answering
    • Token classification
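
As a starting point for the downstream tasks listed above, the checkpoint can be loaded with a task-specific head and fine-tuned. The text-classification sketch below uses a toy two-example dataset and a placeholder label count purely for illustration:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/assamese-roberta", num_labels=2)  # label count is task-specific

# Toy labelled data purely for illustration; replace with a real dataset.
raw = Dataset.from_dict({
    "text": ["অসমীয়া ভাষা অতি সুন্দৰ।", "অসম হৈছে এখন সুন্দৰ ৰাজ্য।"],
    "label": [0, 1],
})
dataset = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128))

args = TrainingArguments(output_dir="assamese-roberta-classifier",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=dataset,
                  tokenizer=tokenizer)
trainer.train()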

Out-of-Scope Use

  • Generating factual information without verification
  • High-risk decision making
  • Real-time critical systems

Training Data

The model was trained on the MWirelabs/assamese-monolingual-corpus dataset (~1.6M sentences), sourced from:

  • News
  • Web crawl
  • Literature
  • Government text
  • Social media
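
The corpus can be loaded directly from the Hub with the datasets library; the dataset name is the one above, but splits and column names should be checked against the dataset card:

from datasets import load_dataset

# Dataset name is taken from this card; inspect the returned DatasetDict
# for the exact splits and columns before use.
corpus = load_dataset("MWirelabs/assamese-monolingual-corpus")
print(corpus)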

Training Procedure

Preprocessing

  • Assamese script normalization
  • Byte-Level BPE tokenization
  • Custom Assamese vocabulary
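
The card does not spell out the normalization steps. A minimal sketch, assuming normalization amounts to Unicode NFC plus whitespace cleanup, would look like this:

import re
import unicodedata

def normalize_assamese(text: str) -> str:
    """Assumed preprocessing: Unicode NFC normalization + whitespace cleanup.
    The exact normalization used for training is not specified in this card."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_assamese("অসমীয়া   ভাষা"))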

Tokenizer

  • Type: Byte-Level BPE
  • Vocab Size: 50,265
  • Special Tokens: <s>, </s>, <pad>, <unk>, <mask>
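
A tokenizer with this vocabulary size and these special tokens can be reproduced approximately with the tokenizers library; the corpus path and minimum frequency below are placeholders, not values reported in this card:

from tokenizers import ByteLevelBPETokenizer

# "corpus.txt" is a placeholder: one Assamese sentence per line.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=50_265,
    min_frequency=2,  # placeholder threshold
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("assamese-roberta-tokenizer")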

Training Hyperparameters

  • Architecture: RoBERTa-base
  • Optimizer: AdamW
  • Scheduler: Warmup + Linear decay
  • Precision: BF16
  • Device: NVIDIA A40 (48GB)
  • Epochs: 10
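
The card reports the optimizer, scheduler, precision, and epoch count but not the learning rate or batch size. A hedged sketch of equivalent TrainingArguments, with unreported values filled in by common placeholders:

from transformers import TrainingArguments

# Values marked "placeholder" are not reported in this card.
args = TrainingArguments(
    output_dir="assamese-roberta",
    num_train_epochs=10,               # from this card
    bf16=True,                         # BF16 precision, from this card
    optim="adamw_torch",               # AdamW
    lr_scheduler_type="linear",        # linear decay after warmup
    warmup_ratio=0.06,                 # placeholder warmup fraction
    learning_rate=5e-5,                # placeholder
    per_device_train_batch_size=64,    # placeholder
    logging_steps=500,
)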

Usage

Masked LM Example

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")

# RoBERTa uses <mask> (not [MASK]) as its mask token, so insert it via the tokenizer.
text = f"অসম হৈছে {tokenizer.mask_token} এখন সুন্দৰ ৰাজ্য।"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the masked position and take the highest-scoring token at that position.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, masked_index].argmax(-1)
predicted_token = tokenizer.decode(predicted_token_id)

print("Predicted:", predicted_token)

Feature Extraction

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModel.from_pretrained("MWirelabs/assamese-roberta")

text = "অসমীয়া ভাষা অতি সুন্দৰ।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

print(f"Embeddings shape: {embeddings.shape}")

Limitations

  • The model is trained exclusively on Assamese text and does not perform well on other languages
  • Performance may vary on specialized domains not well-represented in the training data
  • The model inherits biases present in the training data
  • Code-mixed text (Assamese-English) may not be handled optimally

Ethical Considerations

  • This model may reflect biases present in the training corpus
  • Users should evaluate the model's outputs in their specific context before deployment
  • The model should not be used for generating harmful or misleading content
  • Consider fairness implications when deploying in real-world applications

Citation

If you use this model in your research, please cite:

@misc{assamese-roberta-2025,
  author = {MWire Labs},
  title = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
}

Contact

For questions or feedback, please contact:

License

This model is released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0).

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

See the full license at: https://creativecommons.org/licenses/by/4.0/
