---
license: apache-2.0
language:
- en
pipeline_tag: fill-mask
library_name: transformers
tags:
- ecommerce
- e-commerce
- retail
- marketplace
- shopping
- amazon
- ebay
- alibaba
- google
- rakuten
- bestbuy
- walmart
- flipkart
- wayfair
- shein
- target
- etsy
- shopify
- taobao
- asos
- carrefour
- costco
- overstock
- pretraining
- encoder
- language-modeling
- foundation-model
datasets:
- thebajajra/Ecom-niverse
---

# RexBERT-large

[![License: Apache2.0](https://img.shields.io/badge/License-Apache2.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-red)](https://huggingface.co/collections/thebajajra/rexbert-68cc4b1b8a272f6beae5ebb8)
[![Data](https://img.shields.io/badge/🤗%20Training%20Data-Ecomniverse-yellow)](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
[![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/bajajra/RexBERT)

> **TL;DR**: An encoder-only transformer (ModernBERT-style) for **e-commerce** applications, trained in three phases (**Pre-training**, **Context Extension**, and **Decay**) to power product search, attribute extraction, classification, and embedding use cases. The model was trained on 2.3T+ tokens, along with 350B+ e-commerce-specific tokens.

---

## Table of Contents
- [Quick Start](#quick-start)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Model Description](#model-description)
- [Training Recipe](#training-recipe)
- [Data Overview](#data-overview)
- [Evaluation](#evaluation)
- [Usage Examples](#usage-examples)
  - [Masked language modeling](#1-masked-language-modeling)
  - [Embeddings / feature extraction](#2-embeddings--feature-extraction)
  - [Text classification fine-tune](#3-text-classification-fine-tune)
- [Model Architecture & Compatibility](#model-architecture--compatibility)
- [Responsible & Safe Use](#responsible--safe-use)
- [License](#license)
- [Maintainers & Contact](#maintainers--contact)
- [Citation](#citation)

---

## Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

MODEL_ID = "thebajajra/RexBERT-large"

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# 1) Fill-Mask (if MLM head is present)
mlm = pipeline("fill-mask", model=MODEL_ID, tokenizer=tok)
print(mlm("These running shoes are great for [MASK] training."))

# 2) Feature extraction (mean-pooled embeddings)
enc = AutoModel.from_pretrained(MODEL_ID)
inputs = tok(["wireless mouse", "ergonomic mouse pad"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs)
# Mean-pool the last hidden state over non-padding tokens for sentence embeddings
mask = inputs.attention_mask.unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```


---

## Intended Uses & Limitations

**Use cases**
- Product & query **retrieval/semantic search** (titles, descriptions, attributes)
- **Attribute extraction** / slot filling (brand, color, size, material)
- **Classification** (category assignment, unsafe/regulated item filtering, review sentiment)
- **Reranking** and **query understanding** (spelling/ASR normalization, acronym expansion)

**Out of scope**
- Long-form **generation** (use a decoder/seq-to-seq LM instead)
- High-stakes decisions without human review (pricing, compliance, safety flags)

**Target users**
- Search/recs engineers, e-commerce data teams, ML researchers working on domain-specific encoders

---

## Model Description

RexBERT-large is an **encoder-only**, 400M-parameter transformer trained with a masked-language-modeling objective and optimized for **e-commerce-related text**. The three-phase training curriculum improves general language understanding, extends context handling, and then **specializes** on a very large corpus of commerce data to capture domain-specific terminology and entity distributions.

---

## Training Recipe

RexBERT-large was trained in **three phases**:

1) **Pre-training**  
   General-purpose MLM pre-training on diverse English text for robust linguistic representations.

2) **Context Extension**  
   Continued training with **increased max sequence length** to better handle long product pages, concatenated attribute blocks, multi-turn queries, and facet strings. This preserves prior capabilities while expanding context handling.

3) **Decay on 350B+ e-commerce tokens**  
   Final specialization stage on **350B+ domain-specific tokens** (product catalogs, queries, reviews, taxonomy/attributes). Learning rate and sampling weights are annealed (decayed) to consolidate domain knowledge and stabilize performance on commerce tasks.
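
The card does not state the actual optimizer or schedule (the training details are still marked TODO), so the following is only a minimal sketch of how a learning rate could be held constant through the first two phases and then annealed in the decay phase. The function name `lr_at_step`, the step counts, the peak rate, and the linear decay shape are all hypothetical choices for illustration, not the values used to train RexBERT.

```python
def lr_at_step(step, peak_lr=1e-3, warmup_steps=1_000,
               stable_steps=8_000, decay_steps=1_000, final_lr=0.0):
    """Hypothetical warmup -> stable -> decay learning-rate schedule.

    All defaults are illustrative assumptions, not RexBERT's real settings.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Constant rate (pre-training and context extension in this sketch).
        return peak_lr
    # Linear anneal from peak_lr to final_lr over the decay phase,
    # clamped so the rate stays at final_lr afterwards.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


for s in (0, 500, 5_000, 9_500, 10_000):
    print(s, lr_at_step(s))
```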

**Training details (fill in):**
- Optimizer / LR schedule: TODO
- Effective batch size / steps per phase: TODO
- Context lengths per phase (e.g., 512 → 1k/2k): TODO
- Tokenizer/vocab: TODO
- Hardware & wall-clock: TODO
- Checkpoint tags: TODO (e.g., `pretrain`, `ext`, `decay`)

---

## Data Overview

- **Dataset:** [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
- **Domain mix:** see the tables below.

We identified 9 domains that overlap with e-commerce: each contains a significant number of relevant tokens but required filtering. The table below lists these domains and their filtered sizes.

| Domain | Size (GB) |
|---|---|
| Hobby | 114 |
| News | 66 |
| Health | 66 |
| Entertainment | 64 |
| Travel | 52 |
| Food | 22 |
| Automotive | 19 |
| Sports | 12 |
| Music and Dance | 7 |

Additionally, 6 more domains overlapped almost completely with e-commerce and were taken directly from FineFineWeb.

| Domain | Size (GB) |
|---|---|
| Fashion | 37 |
| Beauty | 37 |
| Celebrity | 28 |
| Movie | 26 |
| Photo | 15 |
| Painting | 2 |

By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is about buying or selling; many are informational articles, news, or unrelated discussions. A more fine-grained filtering within each domain is therefore required to extract only the e-commerce-specific lines. We accomplish this by training a lightweight classifier per domain to distinguish e-commerce content from non-e-commerce content.
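
The per-domain filtering step can be pictured with a small sketch. The card does not describe the classifiers themselves, so the keyword scorer below is only a hypothetical stand-in for a trained model; `CUE_WORDS`, `keyword_score`, `filter_ecommerce_lines`, and the threshold are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical cue words; a real pipeline would use a trained classifier.
CUE_WORDS = {"buy", "price", "shipping", "cart", "discount", "sale", "order"}


def keyword_score(line: str) -> float:
    """Stand-in scorer: fraction of words that are shopping cue words."""
    words = [w.strip(".,!?$").lower() for w in line.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in CUE_WORDS for w in words) / len(words)


def filter_ecommerce_lines(lines: List[str],
                           classifiers: Dict[str, Callable[[str], float]],
                           domain: str,
                           threshold: float = 0.1) -> List[str]:
    """Keep only lines that the domain's scorer rates at or above threshold."""
    score = classifiers[domain]
    return [ln for ln in lines if score(ln) >= threshold]


# One scorer per domain; here every domain would reuse the same stand-in.
classifiers = {"Sports": keyword_score}
lines = [
    "Buy trail running shoes with free shipping.",
    "The team won the championship last night.",
]
kept = filter_ecommerce_lines(lines, classifiers, "Sports")
print(kept)
```

Keeping one scorer per domain lets each threshold and vocabulary be tuned separately, which is the point of filtering within a domain instead of across the whole corpus.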



---
## Evaluation

### Token Classification

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/DuUWO7SyzxJsN53dOSV60.png)

> With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.

### Semantic Similarity

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/CPrf6J1ioUGzr6vohJ4xU.png)

> RexBERT models outperform all the models in their parameter/size category.

---

## Usage Examples

### 1) Masked language modeling
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-large")
t = AutoTokenizer.from_pretrained("thebajajra/RexBERT-large")
fill = pipeline("fill-mask", model=m, tokenizer=t)

fill("Best [MASK] headphones under $100.")
```

### 2) Embeddings / feature extraction
```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-large")
enc = AutoModel.from_pretrained("thebajajra/RexBERT-large")

texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = enc(**batch)
# Mean-pool last hidden state
attn = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * attn).sum(1) / attn.sum(1)
# Normalize for cosine similarity (recommended for retrieval)
emb = torch.nn.functional.normalize(emb, p=2, dim=1)
```
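
For retrieval, embeddings like these are typically compared with cosine similarity. The sketch below ranks candidate vectors against a query vector using only the standard library, so it runs without downloading the model; it assumes the embeddings have been converted to plain lists (for example with `emb.tolist()`), and the helper names `cosine` and `rank_by_similarity` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       doc_vecs: Sequence[Sequence[float]],
                       top_k: Optional[int] = None) -> List[Tuple[int, float]]:
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy 2-d vectors standing in for real embeddings.
ranked = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(ranked)
```

When the embeddings are already L2-normalized, as in the example above, the cosine score reduces to a plain dot product, so a matrix multiplication over the whole catalog is the usual faster alternative.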

### 3) Text classification fine-tune
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

NUM_LABELS = 2  # placeholder: set to the number of classes in your dataset

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-large")
model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexBERT-large", num_labels=NUM_LABELS)

# Prepare your tokenized Dataset objects: train_ds, val_ds (text→label)
args = TrainingArguments(
    output_dir="rexbert-large-cls",  # placeholder path for checkpoints
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=3e-5,
    num_train_epochs=3,
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    fp16=True,  # requires a CUDA GPU
    report_to="none",
    load_best_model_at_end=True,
)

# `processing_class` replaces the older `tokenizer=` argument in recent transformers releases
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, processing_class=tok)
trainer.train()

---

## Model Architecture & Compatibility

- **Architecture:** Encoder-only, ModernBERT-style **large** model.  
- **Libraries:** Works with **🤗 Transformers**; supports **fill-mask** and **feature-extraction** pipelines.  
- **Context length:** Increased during the **Context Extension** phase. Check that `max_position_embeddings` in `config.json` covers your desired max length.  
- **Files:** `config.json`, tokenizer files, and (optionally) heads for MLM or classification.  
- **Export:** Standard PyTorch weights; you can export ONNX / TorchScript for production if needed.
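
The context-length check mentioned above can be done against a local copy of `config.json` without loading the model. This is a minimal sketch; the helper name `fits_context` and the file path in the usage comment are hypothetical, and the card does not state the actual limit, so no value is assumed here.

```python
import json
from pathlib import Path


def fits_context(config_path: str, desired_max_length: int) -> bool:
    """Return True if desired_max_length is within max_position_embeddings."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    limit = config.get("max_position_embeddings")
    if limit is None:
        raise KeyError("max_position_embeddings not found in config")
    return desired_max_length <= limit


# Usage, assuming config.json was downloaded next to this script:
# print(fits_context("config.json", 4096))
```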

---

## Responsible & Safe Use

- **Biases:** Commerce data can encode brand, price, and region biases; audit downstream classifiers/retrievers for disparate error rates across categories/regions.
- **Sensitive content:** Add filters for adult/regulated items; document moderation thresholds if you release classifiers.
- **Privacy:** Do not expose PII; ensure training data complies with terms and applicable laws.
- **Misuse:** This model is **not** a substitute for legal/compliance review for listings.
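
As a starting point for the bias audit suggested above, error rates can be broken down by group (for example, product category or region). The sketch below is one simple way to do that with the standard library; the function name `error_rate_by_group` and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def error_rate_by_group(y_true: Sequence, y_pred: Sequence,
                        groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of misclassified examples within each group."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("inputs must have the same length")
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        errors[group] += int(truth != pred)
    return {g: errors[g] / totals[g] for g in totals}


# Toy labels: group "A" has 2 examples, group "B" has 3.
rates = error_rate_by_group(
    y_true=[1, 0, 1, 1, 0],
    y_pred=[1, 1, 1, 0, 0],
    groups=["A", "A", "B", "B", "B"],
)
print(rates)
```

Large gaps between groups are a signal to inspect the underlying data before deploying a downstream classifier or retriever.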

---

## License

- **License:** `apache-2.0`.

---

## Maintainers & Contact

- **Author/maintainer:** [Rahul Bajaj](https://huggingface.co/thebajajra) 

---