# AssameseRoBERTa

## Model Description
AssameseRoBERTa is a RoBERTa-based language model trained from scratch on Assamese monolingual text. The model is designed to provide robust language understanding capabilities for the Assamese language, which is spoken by over 15 million people primarily in the Indian state of Assam.
This model was developed by MWire Labs, an AI research organization focused on building language technologies for Northeast Indian languages.
## Model Details
- Model Type: RoBERTa (Robustly Optimized BERT Pretraining Approach)
- Language: Assamese (as)
- Training Data: 1.6M Assamese sentences from diverse sources
- Parameters: ~110M
- Training Epochs: 10
- Training Duration: ~12 hours on A40 GPU
- Vocabulary Size: 50,265 tokens
- Max Sequence Length: 128 tokens
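These details can be cross-checked against the published configuration. The sketch below assumes the standard RoBERTa config fields exposed by `transformers`; the expected values in the comments come from this card.

```python
from transformers import AutoConfig

# Load the published config from the Hub and inspect the reported details
config = AutoConfig.from_pretrained("MWirelabs/assamese-roberta")

print("Vocab size:", config.vocab_size)                  # expected: 50,265
print("Max positions:", config.max_position_embeddings)  # covers the 128-token training length
print("Hidden size:", config.hidden_size)
print("Layers:", config.num_hidden_layers)
```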
## Performance

### Perplexity Scores (Final Evaluation)
| Model | Training Domain PPL | Unseen Text PPL |
|---|---|---|
| AssameseRoBERTa (Ours) | 1.7819 | 2.5332 |
| Assamese-BERT | 48.8211 | 12.5911 |
| MuRIL | 85.7272 | 8.7032 |
| mBERT | 26.7085 | 18.1564 |
| IndicBERT | 3194.1843 | 595.4611 |
| AxomiyaBERTa | 83615627.1696 | 30861455.2924 |
📄 Unseen evaluation set (10 Assamese sentences): https://huggingface.co/MWirelabs/assamese-roberta/blob/main/assamese_unseen_eval_10.txt
The model substantially outperforms existing multilingual and Assamese-specific models on both training-domain and unseen Assamese text.
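For reference, one common way to score a masked language model on held-out sentences is pseudo-perplexity: each token is masked in turn and its negative log-likelihood is averaged. The sketch below follows that recipe; it is not necessarily the exact procedure used to produce the table above.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn and average its negative log-likelihood."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc.input_ids[0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        true_id = masked[i].item()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = logits[0, i].log_softmax(dim=-1)
        nlls.append(-log_probs[true_id].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("অসমীয়া ভাষা অতি সুন্দৰ।"))
```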
## Intended Use

### Direct Use
- Masked language modeling
- Feature extraction
- Downstream Assamese NLP tasks (see the fine-tuning sketch after this list), such as:
  - Text classification
  - Named entity recognition (NER)
  - Sentiment analysis
  - Question answering
  - Token classification
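As an illustration of downstream use, the sketch below loads the model with a sequence-classification head. The label count is a placeholder and the head is randomly initialized, so it must be fine-tuned on your own labeled Assamese data before the scores mean anything.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
# num_labels is a placeholder; the classification head is randomly initialized
# and needs fine-tuning on labeled Assamese data.
model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/assamese-roberta", num_labels=2
)

inputs = tokenizer("অসমীয়া ভাষা অতি সুন্দৰ।", return_tensors="pt", truncation=True, max_length=128)
logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels) — untrained head, so scores are not yet meaningful
```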
### Out-of-Scope Use
- Generating factual information without verification
- High-risk decision making
- Real-time critical systems
## Training Data

The model was trained on the `MWirelabs/assamese-monolingual-corpus` dataset (~1.6M sentences), sourced from:
- News
- Web crawl
- Literature
- Government text
- Social media
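The corpus can be pulled from the Hub with the `datasets` library. The dataset ID is taken from this card; the split and column names are not specified here, so inspect the loaded object before relying on them.

```python
from datasets import load_dataset

# Dataset ID is from this card; split/column names are assumptions — inspect before use.
ds = load_dataset("MWirelabs/assamese-monolingual-corpus")
print(ds)                    # shows the available splits and columns
first_split = next(iter(ds.values()))
print(first_split[0])        # look at one example
```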
## Training Procedure

### Preprocessing
- Assamese script normalization
- Byte-Level BPE tokenization
- Custom Assamese vocabulary
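The exact normalization steps are not spelled out here; a minimal sketch, assuming Unicode NFC normalization plus whitespace cleanup, might look like:

```python
import re
import unicodedata

def normalize_assamese(text: str) -> str:
    # NFC folds decomposed Assamese-script code points into canonical forms;
    # the actual preprocessing used for this model may differ.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_assamese("অসমীয়া   ভাষা"))
```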
### Tokenizer
- Type: Byte-Level BPE
- Vocab Size: 50,265
- Special Tokens: `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`
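A quick way to inspect the tokenizer and its special tokens from the Hub:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
print(tokenizer.vocab_size)          # 50,265 per this card
print(tokenizer.special_tokens_map)  # <s>, </s>, <pad>, <unk>, <mask>
print(tokenizer.tokenize("অসমীয়া ভাষা অতি সুন্দৰ।"))  # byte-level BPE pieces
```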
### Training Hyperparameters
- Architecture: RoBERTa-base
- Optimizer: AdamW
- Scheduler: Warmup + Linear decay
- Precision: BF16
- Device: NVIDIA A40 (48GB)
- Epochs: 10
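The card reports the optimizer, scheduler, precision, and epoch count but not the learning rate, batch size, or warmup length. The sketch below shows how such a run might be configured with `TrainingArguments`; the unreported values are placeholders, not values from this card.

```python
from transformers import TrainingArguments

# Learning rate, batch size, and warmup ratio are placeholders.
args = TrainingArguments(
    output_dir="assamese-roberta-pretrain",
    num_train_epochs=10,             # as reported
    bf16=True,                       # BF16 precision, as reported
    optim="adamw_torch",             # AdamW optimizer
    lr_scheduler_type="linear",      # linear decay after warmup
    warmup_ratio=0.06,               # placeholder
    learning_rate=1e-4,              # placeholder
    per_device_train_batch_size=64,  # placeholder
)
```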
## Usage

### Masked LM Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")

# RoBERTa uses <mask> (not [MASK]) as its mask token
text = "অসম হৈছে <mask> এখন সুন্দৰ ৰাজ্য।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the masked position and take the highest-scoring vocabulary entry
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, masked_index].argmax(-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted:", predicted_token)
```
### Feature Extraction
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModel.from_pretrained("MWirelabs/assamese-roberta")

text = "অসমীয়া ভাষা অতি সুন্দৰ।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token: (batch, seq_len, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Embeddings shape: {embeddings.shape}")
```
## Limitations
- The model is trained exclusively on Assamese text and does not perform well on other languages
- Performance may vary on specialized domains not well-represented in the training data
- The model inherits biases present in the training data
- Code-mixed text (Assamese-English) may not be handled optimally
## Ethical Considerations
- This model may reflect biases present in the training corpus
- Users should evaluate the model's outputs in their specific context before deployment
- The model should not be used for generating harmful or misleading content
- Consider fairness implications when deploying in real-world applications
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{assamese-roberta-2025,
  author       = {MWire Labs},
  title        = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
}
```
## Contact
For questions or feedback, please contact:
- Website: https://mwirelabs.com
- Email: [email protected]
## License
This model is released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0).
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
See the full license at: https://creativecommons.org/licenses/by/4.0/