nielsr (HF Staff) committed
Commit 8685fbf · verified · 1 Parent(s): cd05548

Improve model card for mmBERT training checkpoints with metadata and usage


This PR significantly enhances the model card for the mmBERT raw training checkpoints by:

* Updating the `language` tag to `mul` to accurately reflect its multilingual nature (over 1800 languages).
* Adding `pipeline_tag: feature-extraction` to improve discoverability on the Hub and enable the automated widget. While this repository contains raw checkpoints, the associated models are commonly used for feature extraction.
* Specifying `library_name: transformers` as the linked GitHub repository provides extensive usage examples with the `transformers` library.
* Adding descriptive `tags` like `bert`, `multilingual`, and `encoder-only` for better searchability.
* Updating the main title to reflect the paper's title and linking to the Hugging Face paper page.
* Incorporating the paper's abstract.
* Including comprehensive usage examples from the GitHub README's "Quick Start" and "Getting Started" sections, showcasing how to use the associated `mmBERT` models (e.g., `mmbert-small`, `mmbert-base`) for various tasks like feature extraction, masked language modeling, classification, and retrieval.
* Clarifying the purpose of this repository as containing raw training checkpoints and directing users to the model collection for runnable models.
* Integrating other valuable sections from the GitHub README such as "Model Family", "Training Details", "Evaluation", "FAQ", and "Limitations" to provide a complete overview of the project.

These changes make the model card more informative, discoverable, and user-friendly on the Hugging Face Hub.

Files changed (1)
  1. README.md +323 -14
README.md CHANGED
@@ -1,32 +1,341 @@
  ---
- license: mit
  language:
- - en
  ---

- # Ettin Checkpoints

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
- [![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
- [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-12%20Models-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)

- This repository contains the raw training checkpoints for the mmBERT models. Each model contains three subfolders for `decay`, `ext`, and `pretrain`.

- These files work with Composer and contain all state needed to resume pre-training. Please see the [ModernBERT repository](https://github.com/AnswerDotAI/ModernBERT) for usage details.

- ## 🔗 Related Resources

- - **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- - **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) (2.3T tokens)
- - **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) (600B tokens)
- - **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- - **Paper**: [Arxiv link](https://arxiv.org/abs/2509.06888)
- - **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)

  ## Citation

  ```bibtex
  @misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},

  ---
  language:
+ - mul
+ license: mit
+ pipeline_tag: feature-extraction
+ library_name: transformers
+ tags:
+ - bert
+ - multilingual
+ - encoder-only
  ---

+ # mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Paper](https://img.shields.io/badge/Paper-%F0%9F%A4%97Hugging_Face-red)](https://huggingface.co/papers/2509.06888)
+ [![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-12%20Models-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)

+ > 🌍 **TL;DR**: State-of-the-art multilingual encoder models trained on 3T tokens across 1833 languages with novel annealed language learning. Outperforms XLM-R and can even beat OpenAI's o3 and Google's Gemini 2.5 Pro.
+
+ This repository contains the raw training checkpoints for the mmBERT models, presented in the paper [mmBERT: A Modern Multilingual Encoder with Annealed Language Learning](https://huggingface.co/papers/2509.06888). Each model contains three subfolders for `decay`, `ext`, and `pretrain`.
+
+ These raw checkpoints are intended for resuming pre-training using Composer and contain all state needed for this purpose. For direct inference with the runnable `mmbert-small` and `mmbert-base` models, please refer to the [mmBERT Model Collection](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4).
+
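+ As an illustration, a minimal sketch of fetching only one training phase's files with `huggingface_hub` (the repository id `jhu-clsp/mmbert-checkpoints` and the per-phase folder contents are assumptions based on this card; adjust them to the repository you are actually browsing):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download only the decay-phase Composer checkpoints (assumed repo id and folder layout).
+ local_dir = snapshot_download(
+     repo_id="jhu-clsp/mmbert-checkpoints",  # replace with this repository's actual id
+     allow_patterns=["decay/*"],             # or "pretrain/*", "ext/*"
+ )
+ print(local_dir)
+ ```
+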
+ ## Abstract
+ Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research on encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.
+
+ ## 🚀 Quick Start
+
+ ### Installation
+ ```bash
+ pip install "torch>=1.9.0"
+ pip install "transformers>=4.48.0"
+ ```
+
+ ### 30-Second Examples
+ The following examples demonstrate how to use the pre-trained `mmbert-small` and `mmbert-base` models (available in the [mmBERT Model Collection](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)) for various tasks.
+
+ **Small Model for Fast Inference:**
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-small")
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-small")
+
+ # Example: Get multilingual embeddings
+ inputs = tokenizer("Hello world! 你好世界! Bonjour le monde!", return_tensors="pt")
+ outputs = model(**inputs)
+ embeddings = outputs.last_hidden_state.mean(dim=1)
+ ```
+
+ **Base Model for Masked Language Modeling:**
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
+ model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmbert-base")
+
+ # Example: Multilingual masked language modeling
+ text = "The capital of [MASK] is Paris."
+ inputs = tokenizer(text, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get predictions for [MASK] tokens
+ mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
+ predictions = outputs.logits[mask_indices]
+ top_tokens = torch.topk(predictions, 5, dim=-1)
+ predicted_words = [tokenizer.decode(token) for token in top_tokens.indices[0]]
+ print(f"Predictions: {predicted_words}")
+ ```
+
+ ## 🌍 Model Family
+
+ ### Main Models
+
+ | Size | Model | Parameters | Languages | Context | Best For | Download |
+ |:-----|:------|:-----------|:----------|:--------|:---------|:---------|
+ | Small | [mmbert-small](https://huggingface.co/jhu-clsp/mmbert-small) | 140M | 1833 | 8192 | Fast inference, edge deployment | [![Download](https://img.shields.io/badge/%F0%9F%A4%97-Download-blue)](https://huggingface.co/jhu-clsp/mmbert-small) |
+ | Base | [mmbert-base](https://huggingface.co/jhu-clsp/mmbert-base) | 307M | 1833 | 8192 | Best performance, production use | [![Download](https://img.shields.io/badge/%F0%9F%A4%97-Download-blue)](https://huggingface.co/jhu-clsp/mmbert-base) |
+
+ ### Key Features
+
+ - **1833 Languages**: Covers more languages than any previous multilingual encoder
+ - **Extended Context**: Up to 8192 tokens (vs 512 for XLM-R)
+ - **Efficiency**: 2-4x faster inference than previous multilingual models
+ - **Modern Architecture**: Based on ModernBERT with RoPE, GLU activations, and Flash Attention 2
+ - **Open Training**: Complete training data, recipes, and checkpoints available
+
+ ## 🔬 Getting Started
+
+ ### Training Data
+
+ The complete multilingual training dataset spans 3T tokens (a minimal loading sketch follows the list):
+
+ - **Pre-training Data**: 2.0T tokens across 60 languages
+ - **Mid-training Data**: 600B tokens across 110 languages
+ - **Decay Phase Data**: 100B tokens across 1833 languages
+ - **Data Sources**: FineWeb2, DCLM, Dolmino, Wikipedia, ArXiv, and curated multilingual corpora
+
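+ For example, one of the released data phases can be streamed with 🤗 Datasets; the dataset id below matches the mid-training data link referenced elsewhere in this card, while the `train` split name is an assumption:
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the mid-training mixture instead of downloading all 600B tokens at once.
+ ds = load_dataset("jhu-clsp/mmbert-midtraining", split="train", streaming=True)
+ for example in ds.take(3):
+     print(example)  # inspect the raw fields before building a tokenization pipeline
+ ```
+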
+ ### Usage Examples
+
+ <details>
+ <summary><strong>Classification Task</strong></summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch.nn as nn
+
+ # Load model for classification
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
+ encoder = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
+
+ # Add classification head
+ class MultilingualClassifier(nn.Module):
+     def __init__(self, encoder, num_classes):
+         super().__init__()
+         self.encoder = encoder
+         self.classifier = nn.Linear(encoder.config.hidden_size, num_classes)
+         self.dropout = nn.Dropout(0.1)
+
+     def forward(self, input_ids, attention_mask=None):
+         outputs = self.encoder(input_ids, attention_mask=attention_mask)
+         pooled_output = outputs.last_hidden_state[:, 0]  # Use [CLS] token
+         pooled_output = self.dropout(pooled_output)
+         return self.classifier(pooled_output)
+
+ # Initialize classifier
+ model = MultilingualClassifier(encoder, num_classes=3)
+
+ # Example multilingual inputs
+ texts = [
+     "This is a positive review.",
+     "Ceci est un avis négatif.",
+     "这是一个中性评价。"
+ ]
+ inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
+ predictions = model(**inputs)
+ ```
+
+ </details>
+
+ <details>
+ <summary><strong>Multilingual Retrieval</strong></summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ import numpy as np
+
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
+
+ def get_embeddings(texts):
+     inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
+     with torch.no_grad():
+         outputs = model(**inputs)
+     # Mean pooling
+     embeddings = outputs.last_hidden_state.mean(dim=1)
+     return embeddings.numpy()
+
+ # Multilingual document retrieval
+ documents = [
+     "Artificial intelligence is transforming healthcare.",
+     "L'intelligence artificielle transforme les soins de santé.",
+     "人工智能正在改变医疗保健。",
+     "Climate change requires immediate action.",
+     "El cambio climático requiere acción inmediata."
+ ]
+
+ query = "AI in medicine"
+
+ # Get embeddings
+ doc_embeddings = get_embeddings(documents)
+ query_embedding = get_embeddings([query])
+
+ # Compute similarities
+ similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
+ ranked_docs = np.argsort(similarities)[::-1]
+
+ print("Most similar documents:")
+ for i, doc_idx in enumerate(ranked_docs[:3]):
+     print(f"{i+1}. {documents[doc_idx]} (score: {similarities[doc_idx]:.3f})")
+ ```
+
+ </details>
+
+ <details>
+ <summary><strong>Long Context Processing</strong></summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
+
+ # Process long multilingual document (up to 8192 tokens)
+ long_text = """
+ This is a very long multilingual document...
+ Ceci est un très long document multilingue...
+ 这是一个非常长的多语言文档...
+ """ * 100  # Simulate long text
+
+ # Tokenize with extended context
+ inputs = tokenizer(
+     long_text,
+     return_tensors="pt",
+     max_length=8192,
+     truncation=True
+ )
+
+ # Process efficiently with Flash Attention
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ print(f"Processed {inputs['input_ids'].shape[1]} tokens")
+ print(f"Output shape: {outputs.last_hidden_state.shape}")
+ ```
+
+ </details>
+
+ ## 📋 Training
+
+ Using 8xH100s, training took approximately 10 days for mmBERT-small and 40 days for mmBERT-base.
+
+ ### Training Recipe: Cascading Annealed Language Learning
+
+ mmBERT introduces novel training techniques (a short sampling/masking schedule sketch follows the list):
+
+ 1. **Inverse Masking Schedule**: Start with 30% masking, gradually reduce to 5%
+ 2. **Language Progression**: 60 → 110 → 1833 languages across training phases
+ 3. **Temperature Annealing**: 0.7 → 0.5 → 0.3 for increasingly uniform language sampling
+ 4. **High-Quality Data**: Progressive upgrade from web crawl to filtered premium sources
+
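+ A minimal sketch of what such schedules can look like in code; the exponent form `p_i ∝ n_i^tau` and the linear mask-rate interpolation are illustrative assumptions, while the exact schedules are defined in the training code on GitHub:
+
+ ```python
+ import numpy as np
+
+ def language_sampling_probs(token_counts, tau):
+     """Temperature-based sampling: lower tau -> closer to uniform over languages."""
+     counts = np.asarray(token_counts, dtype=np.float64)
+     weights = counts ** tau
+     return weights / weights.sum()
+
+ def mask_rate(step, total_steps, start=0.30, end=0.05):
+     """Inverse masking schedule: anneal the mask ratio from 30% down to 5%."""
+     frac = min(step / total_steps, 1.0)
+     return start + frac * (end - start)
+
+ counts = [5e11, 2e10, 5e8]  # toy high/mid/low-resource token counts
+ for tau in (0.7, 0.5, 0.3):  # temperatures used across the three phases
+     print(tau, language_sampling_probs(counts, tau).round(3))
+ print(mask_rate(step=900, total_steps=1000))
+ ```
+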
+ ## Training Details
+
+ ### Architecture
+
+ | Component | Small | Base |
+ |:----------|:------|:-----|
+ | Layers | 22 | 22 |
+ | Hidden Size | 384 | 768 |
+ | Intermediate Size | 1152 | 1152 |
+ | Attention Heads | 6 | 12 |
+ | Parameters (Total) | 140M | 307M |
+ | Parameters (Non-Embed) | 42M | 110M |
+ | Max Sequence Length | 8192 | 8192 |
+ | Vocabulary Size | 256,000 | 256,000 |
+
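+ These values can be spot-checked from the released configs; a small sketch (attribute names follow the usual Hugging Face config conventions and may differ slightly for this architecture):
+
+ ```python
+ from transformers import AutoConfig, AutoModel
+
+ cfg = AutoConfig.from_pretrained("jhu-clsp/mmbert-base")
+ print(cfg.hidden_size, cfg.num_hidden_layers, cfg.vocab_size)
+
+ # Count parameters to compare against the table above.
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
+ total = sum(p.numel() for p in model.parameters())
+ print(f"{total / 1e6:.0f}M parameters")
+ ```
+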
+ ### Training Configuration
+
+ **Data Mixture:**
+ - Pre-training (2.0T tokens): Web crawl, code, scientific papers, reference materials
+ - Mid-training (600B tokens): Higher quality filtered data with context extension
+ - Decay phase (100B tokens): Premium sources including textbooks and curated content
+
+ **Architecture Features:**
+ - ModernBERT-based transformer with RoPE positional embeddings
+ - GLU activations and prenorm layer normalization
+ - Flash Attention 2 for efficient long-context processing
+ - Gemma 2 tokenizer for multilingual coverage
+
+ **Training Phases:**
+ 1. **Base Pre-training**: 60 languages, 30% masking, learning rate warmup
+ 2. **Context Extension**: 110 languages, 15% masking, extended context to 8K
+ 3. **Decay Phase**: 1833 languages, 5% masking, high-quality data focus
+
+ ## Evaluation
+ Evaluation code for retrieval tasks is the same as [Ettin](https://github.com/JHU-CLSP/ettin-encoder-vs-decoder/tree/main/retrieval_eval).
+
+ Evaluation code for efficiency is taken from the [ModernBERT](https://github.com/AnswerDotAI/ModernBERT/tree/main/efficiency) repo.
+
+ Evaluation code for NLU tasks is based on the [mGTE codebase](https://github.com/izhx/nlu-evals) and our fork will be uploaded soon. Please raise an issue or message us if this would be helpful for you.
+
+ ## ❓ FAQ
+
+ **Q: How does mmBERT compare to XLM-R?**
+ **A:** mmBERT significantly outperforms XLM-R across all benchmarks:
+ - +2.4 points average on XTREME
+ - +3.0 points on GLUE
+ - 16x more languages (1833 vs 100)
+ - 16x longer context (8K vs 512 tokens)
+ - 2-4x faster inference
+
+ **Q: Which languages does mmBERT support?**
+ **A:** mmBERT supports 1833 languages and scripts from FineWeb2, including:
+ - All major world languages (English, Chinese, Spanish, etc.)
+ - European languages (including low-resource ones like Faroese)
+ - African languages (Swahili, Amharic, etc.)
+ - Asian languages (Hindi, Bengali, Thai, etc.)
+ - Many low-resource and indigenous languages
+
+ **Q: How does the annealed language learning work?**
+ **A:** We progressively add languages in three phases:
+ 1. Start with 60 high-resource languages (pre-training)
+ 2. Add 50 mid-resource languages (mid-training)
+ 3. Add 1723 low-resource languages (decay phase)
+
+ This allows efficient learning without overfitting on low-resource data.
+
+ **Q: Can I fine-tune mmBERT for my specific task?**
+ **A:** Yes! mmBERT works as a drop-in replacement for XLM-R:
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load for fine-tuning
+ model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
+
+ # Add task-specific head and fine-tune normally
+ ```
+
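+ For a concrete starting point, a sketch using the standard `transformers` sequence-classification head (the label count and the toy batch are placeholders, not values from the paper):
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "jhu-clsp/mmbert-base",
+     num_labels=3,  # placeholder: set to your task's label count
+ )
+
+ # Tokenize a toy batch and run one forward pass with labels to get a loss.
+ batch = tokenizer(["great movie", "film terrible"], return_tensors="pt", padding=True)
+ batch["labels"] = torch.tensor([1, 0])
+ loss = model(**batch).loss
+ loss.backward()  # from here, plug into your usual optimizer or Trainer loop
+ ```
+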
+ **Q: What about efficiency and memory requirements?**
+ **A:** mmBERT is significantly more efficient:
+ - 2-4x faster inference than XLM-R
+ - Flash Attention 2 reduces memory usage for long sequences
+ - Support for variable-length batching
+ - Optimized for both CPU and GPU deployment
+
+ **Q: How do I access the training data and checkpoints?**
+ **A:** All data and checkpoints are publicly available:
+ - Training data: [jhu-clsp/mmbert-pretraining-data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretraining-data)
+ - Checkpoints: [jhu-clsp/mmbert-checkpoints](https://huggingface.co/jhu-clsp/mmbert-checkpoints)
+ - GitHub code: [GitHub repository](https://github.com/jhu-clsp/mmBERT)
+ - Data processing code: [Same as Ettin models](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)
+
+ ## Limitations
+
+ - Structured prediction tasks (NER, POS) show slightly lower scores due to tokenizer prefix space handling
+ - Very low-resource languages still have limited training data
+ - High-quality educational content filtering could benefit from more languages

  ## Citation

+ If you use mmBERT models in your research, please cite our work:
+
  ```bibtex
  @misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},