Update README.md
**News**

**[2023-10-19]** Released stella-base-en-v2. The model is easy to use and **does not need any prefix text**.\
**[2023-10-12]** Released stella-base-zh-v2 and stella-large-zh-v2. The two models perform better and **do not need any prefix text**.
stella is a general-purpose text encoder, which mainly includes the following models:

| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
| stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |

The complete training methodology and process are documented in [blog post 1](https://zhuanlan.zhihu.com/p/655322183) and [blog post 2](https://zhuanlan.zhihu.com/p/662209559); reading and discussion are welcome.

**Training data:**
| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
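The last table column can be made concrete: the v1 models (marked "Yes") expect retrieval queries to carry an instruction prefix, while the v2 models encode raw text directly. A minimal sketch of that dispatch logic (the instruction string below is a placeholder for illustration, not the actual prefix the stella v1 models were trained with; consult the v1 model cards for the real one):

```python
def build_query(text: str, needs_instruction: bool) -> str:
    """Prepend an instruction for v1-style models; pass text through for v2 models."""
    # Placeholder instruction -- NOT the real stella v1 prefix.
    instruction = "Retrieve passages relevant to the query: "
    return instruction + text if needs_instruction else text

# v1-style model (needs instruction for retrieval)
print(build_query("how are embeddings pooled?", needs_instruction=True))
# v2-style model (no prefix text)
print(build_query("how are embeddings pooled?", needs_instruction=False))
```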
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| stella-base-zh | 0.2 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |

#### MTEB leaderboard (English)

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
|:-----------------:|:---------------:|:---------:|:---------------:|:------------:|:-------------------:|:---------------:|:-----------------------:|:-------------:|:--------------:|:--------:|:------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | 62.61 | 75.28 | 44.9 | 86.45 | 58.77 | 50.1 | 83.02 | 32.52 |
#### Reproduce our results

**C-MTEB:**

```python
import torch
```
**MTEB:**

You can use the official script [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) to reproduce our results.

#### Evaluation for long text

In practice, we observed that the evaluation texts in C-MTEB are essentially all shorter than 512,
Usage with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
```
#### stella models for English

**Using Sentence-Transformers:**

```python
from sentence_transformers import SentenceTransformer

sentences = ["one car come", "one car go"]
model = SentenceTransformer('infgrad/stella-base-en-v2')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
# embeddings are L2-normalized, so the dot product gives cosine similarity
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
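Because `normalize_embeddings=True` returns unit-length vectors, the matrix product above is exactly a cosine-similarity matrix. A self-contained NumPy sketch (random vectors standing in for real model embeddings) illustrates why:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 768))                          # stand-ins for model embeddings
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize, like normalize_embeddings=True

# dot products of unit vectors are cosine similarities
similarity = emb @ emb.T
print(similarity)
```

Each vector has cosine similarity 1.0 with itself, so the diagonal of `similarity` is all ones, and every entry lies in [-1, 1].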
**Using HuggingFace Transformers:**

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

model = AutoModel.from_pretrained('infgrad/stella-base-en-v2')
tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-en-v2')
sentences = ["one car come", "one car go"]
batch_data = tokenizer(
    batch_text_or_text_pairs=sentences,
    padding="longest",
    return_tensors="pt",
    max_length=512,
    truncation=True,
)
attention_mask = batch_data["attention_mask"]
# inference only: without no_grad the outputs track gradients and
# sklearn's normalize cannot convert them to NumPy arrays
with torch.no_grad():
    model_output = model(**batch_data)
# mean pooling: zero out padding positions, then average over real tokens
last_hidden = model_output.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
vectors = normalize(vectors, norm="l2", axis=1)
print(vectors.shape)  # 2,768
```
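The masked mean pooling step above can be checked in isolation. A minimal NumPy sketch (toy numbers, not real hidden states) shows that padding positions are excluded from the average:

```python
import numpy as np

# toy "hidden states": batch of 1, seq_len 3, hidden size 2; last position is padding
last_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
attention_mask = np.array([[1, 1, 0]])

masked = last_hidden * attention_mask[..., None]                    # zero out padding positions
vectors = masked.sum(axis=1) / attention_mask.sum(axis=1)[..., None]
print(vectors)  # [[2. 3.]] -- mean of the two real tokens; the 99s never contribute
```

Dividing by `attention_mask.sum(axis=1)` rather than the sequence length is what makes this a mean over real tokens instead of a padded mean.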
## Training Detail
9. https://github.com/THUDM/LongBench