yjoonjang committed · Commit eec4cbd · verified · 1 Parent(s): 2985c04

Update README.md

Files changed (1)
  1. README.md +8 -218
README.md CHANGED
@@ -13,9 +13,9 @@ language:
13
  - ko
14
  ---
15
 
16
- # ColBERT-ko-v1.0
17
 
18
- **ColBERT-ko-v1.0** is a Korean ColBERT model fine-tuned with [PyLate](https://github.com/lightonai/pylate). It is trained exclusively on Korean datasets. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
19
 
20
  ## Model Details
21
 
@@ -50,19 +50,20 @@ ColBERT(
50
  | [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR) | Korean long document retrieval dataset covering various domains | 13,813.44 |
51
  <!-- | [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy) | Korean document retrieval dataset based on Wikipedia | 166.90 | -->
52
  <!-- | [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl) | Korean document retrieval dataset based on Wikipedia | 166.63 | -->
 
53
 
54
 
55
  ### Average Results
56
 
57
  | Model | Parameters | Average Recall@10 | Average Precision@10 | Average NDCG@10 | Average F1@10 |
58
  |-----------------------------------------------|------------|----------------|-------------------|--------------|------------|
59
- | **ColBERT-ko-v1.0** | **0.1B** | **0.7999** | **0.0930** | **0.7172** | **0.1655**|
60
  | [jina-colbert-v2](https://huggingface.co/jinaai/jina-colbert-v2) | 0.5B | 0.7518 | 0.0888 | 0.6671 | 0.1577 |
61
 
62
  ## Usage
63
  ### PyLate for reranking
64
 
65
- If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the rank function and pass it the queries and documents to rerank:
66
 
67
  ```python
68
  from pylate import rank, models
@@ -83,7 +84,7 @@ documents_ids = [
83
  ]
84
 
85
  model = models.ColBERT(
86
- model_name_or_path="pylate_model_id",
87
  )
88
 
89
  queries_embeddings = model.encode(
@@ -107,223 +108,12 @@ reranked_documents = rank.rerank(
107
  First install the [muvera-py](https://github.com/sionic-ai/muvera-py) (Python implementation of MUVERA):
108
 
109
  ```bash
110
- git clone https://github.com/sionic-ai/muvera-py.git
111
  cd muvera-py
112
  ```
113
 
114
  Then run the main file:
115
 
116
- ```python
117
- # main_pylate.py
118
- import time
119
- from dataclasses import replace
120
- import numpy as np
121
- import torch
122
- from datasets import load_dataset
123
- import logging
124
- from pylate.models import ColBERT as PylateColBERT
125
-
126
- from fde_generator import (
127
- FixedDimensionalEncodingConfig,
128
- generate_query_fde,
129
- generate_document_fde_batch,
130
- )
131
-
132
- DATASET_REPO_ID = "yjoonjang/markers_bm"
133
- COLBERT_MODEL_NAME = "yjoonjang/ColBERT-ko-v1.0" # Supports pylate models
134
- TOP_K = 10
135
- DEVICE = "cuda" if torch.cuda.is_available() else "mps"
136
-
137
- logging.basicConfig(
138
- level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
139
- )
140
- logging.info(f"Using device: {DEVICE}")
141
-
142
-
143
- # --- Helper Functions ---
144
- def load_autorag_dataset(repo_id: str) -> (dict, dict, dict):
145
- logging.info(f"Loading dataset from Hugging Face Hub: '{repo_id}'...")
146
- corpus_ds = load_dataset(repo_id, "corpus", split="corpus")
147
- queries_ds = load_dataset(repo_id, "queries", split="queries")
148
- qrels_ds = load_dataset(repo_id, "default", split="test")
149
-
150
- corpus = {
151
- row["_id"]: {"title": row.get("title", ""), "text": row.get("text", "")}
152
- for row in corpus_ds
153
- }
154
- queries = {row["_id"]: row["text"] for row in queries_ds}
155
- qrels = {str(row["query-id"]): {str(row["corpus-id"]): 1} for row in qrels_ds}
156
-
157
- logging.info(f"Dataset loaded: {len(corpus)} documents, {len(queries)} queries.")
158
- return corpus, queries, qrels
159
-
160
-
161
- def evaluate_recall(results: dict, qrels: dict, k: int) -> float:
162
- hits, total_queries = 0, 0
163
- for query_id, ranked_docs in results.items():
164
- relevant_docs = set(qrels.get(str(query_id), {}).keys())
165
- if not relevant_docs:
166
- continue
167
- total_queries += 1
168
- top_k_docs = set(list(ranked_docs.keys())[:k])
169
- if not relevant_docs.isdisjoint(top_k_docs):
170
- hits += 1
171
- return hits / total_queries if total_queries > 0 else 0.0
172
-
173
-
174
- def to_numpy(tensor_or_array) -> np.ndarray:
175
- """Safely convert a PyTorch Tensor or a NumPy array to a float32 NumPy array."""
176
- if isinstance(tensor_or_array, torch.Tensor):
177
- return tensor_or_array.cpu().detach().numpy().astype(np.float32)
178
- elif isinstance(tensor_or_array, np.ndarray):
179
- return tensor_or_array.astype(np.float32)
180
- else:
181
- raise TypeError(f"Unsupported type for conversion: {type(tensor_or_array)}")
182
-
183
-
184
- class ColbertNativeRetriever:
185
- """Uses pylate's native ColBERT ranking (non-FDE)."""
186
-
187
- def __init__(self, model_name=COLBERT_MODEL_NAME):
188
- self.model = PylateColBERT(model_name_or_path=model_name, device=DEVICE)
189
- if hasattr(self.model[0].tokenizer, "model_max_length"): # For modernbert support
190
- self.model[0].tokenizer.model_input_names = ["input_ids", "attention_mask"]
191
- self.doc_embeddings_map = {}
192
- self.doc_ids = []
193
-
194
- def index(self, corpus: dict):
195
- self.doc_ids = list(corpus.keys())
196
- documents_for_ranker = [{"id": doc_id, **corpus[doc_id]} for doc_id in self.doc_ids]
197
- doc_texts = [f"{doc.get('title', '')} {doc.get('text', '')}".strip() for doc in documents_for_ranker]
198
-
199
- logging.info(
200
- f"[{self.__class__.__name__}] Generating ColBERT embeddings for all documents..."
201
- )
202
- doc_embeddings_list = self.model.encode(
203
- sentences=doc_texts,
204
- is_query=False,
205
- convert_to_tensor=True,
206
- normalize_embeddings=True,
207
- )
208
- self.doc_embeddings_map = dict(zip(self.doc_ids, doc_embeddings_list))
209
-
210
- def search(self, query: str) -> dict:
211
- query_embedding = self.model.encode(
212
- sentences=query,
213
- is_query=True,
214
- convert_to_tensor=True,
215
- normalize_embeddings=True,
216
- )
217
-
218
- scores = {}
219
- with torch.no_grad():
220
- for doc_id, doc_embedding in self.doc_embeddings_map.items():
221
- late_interaction = torch.einsum("sh,th->st", query_embedding.to(DEVICE), doc_embedding.to(DEVICE))
222
- score = late_interaction.max(dim=1).values.sum()
223
- scores[doc_id] = score.item()
224
-
225
- return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
226
-
227
-
228
- class ColbertFdeRetriever:
229
- """Uses a real ColBERT model to generate embeddings, then FDE for search."""
230
-
231
- def __init__(self, model_name=COLBERT_MODEL_NAME):
232
- self.model = PylateColBERT(model_name_or_path=model_name, device=DEVICE)
233
- if hasattr(self.model[0].tokenizer, "model_max_length"): # For modernbert support
234
- self.model[0].tokenizer.model_input_names = ["input_ids", "attention_mask"]
235
- self.doc_config = FixedDimensionalEncodingConfig(
236
- dimension=128,
237
- num_repetitions=64,
238
- num_simhash_projections=8,
239
- seed=42,
240
- fill_empty_partitions=True, # Config for documents
241
- )
242
- self.fde_index, self.doc_ids = None, []
243
-
244
- def index(self, corpus: dict):
245
- self.doc_ids = list(corpus.keys())
246
- documents_for_ranker = [{"id": doc_id, **corpus[doc_id]} for doc_id in self.doc_ids]
247
- doc_texts = [f"{doc.get('title', '')} {doc.get('text', '')}".strip() for doc in documents_for_ranker]
248
-
249
- logging.info(f"[{self.__class__.__name__}] Generating native multi-vector embeddings...")
250
- doc_embeddings_list = self.model.encode(
251
- sentences=doc_texts,
252
- is_query=False,
253
- convert_to_numpy=True,
254
- normalize_embeddings=True,
255
- )
256
-
257
- logging.info(f"[{self.__class__.__name__}] Generating FDEs from ColBERT embeddings in BATCH mode...")
258
- self.fde_index = generate_document_fde_batch(doc_embeddings_list, self.doc_config)
259
-
260
- def search(self, query: str) -> dict:
261
- query_embeddings = self.model.encode(
262
- sentences=query,
263
- is_query=True,
264
- convert_to_numpy=True,
265
- normalize_embeddings=True,
266
- )
267
-
268
- query_config = replace(self.doc_config, fill_empty_partitions=False)
269
- query_fde = generate_query_fde(query_embeddings, query_config)
270
- scores = self.fde_index @ query_fde
271
- return dict(sorted(zip(self.doc_ids, scores), key=lambda item: item[1], reverse=True))
272
-
273
-
274
- if __name__ == "__main__":
275
- corpus, queries, qrels = load_autorag_dataset(DATASET_REPO_ID)
276
-
277
- logging.info("Initializing retrieval models...")
278
- retrievers = {
279
- "1. ColBERT (Native)": ColbertNativeRetriever(),
280
- "2. ColBERT + FDE": ColbertFdeRetriever(),
281
- }
282
-
283
- timings, final_results = {}, {}
284
-
285
- logging.info("--- PHASE 1: INDEXING ---")
286
- for name, retriever in retrievers.items():
287
- start_time = time.perf_counter()
288
- retriever.index(corpus)
289
- timings[name] = {"indexing_time": time.perf_counter() - start_time}
290
- logging.info(f"'{name}' indexing finished in {timings[name]['indexing_time']:.2f} seconds.")
291
-
292
- logging.info("--- PHASE 2: SEARCH & EVALUATION ---")
293
- for name, retriever in retrievers.items():
294
- logging.info(f"Running search for '{name}' on {len(queries)} queries...")
295
- query_times = []
296
- results = {}
297
- for query_id, query_text in queries.items():
298
- start_time = time.perf_counter()
299
- results[str(query_id)] = retriever.search(query_text)
300
- query_times.append(time.perf_counter() - start_time)
301
-
302
- timings[name]["avg_query_time"] = np.mean(query_times)
303
- final_results[name] = results
304
- logging.info(f"'{name}' search finished. Avg query time: {timings[name]['avg_query_time'] * 1000:.2f} ms.")
305
-
306
- print("\n" + "=" * 85)
307
- print(f"{'FINAL REPORT':^85}")
308
- print(f"(Dataset: {DATASET_REPO_ID})")
309
- print("=" * 85)
310
- print(
311
- f"{'Retriever':<25} | {'Indexing Time (s)':<20} | {'Avg Query Time (ms)':<22} | {'Recall@{k}'.format(k=TOP_K):<10}"
312
- )
313
- print("-" * 85)
314
-
315
- for name in retrievers.keys():
316
- recall = evaluate_recall(final_results[name], qrels, k=TOP_K)
317
- idx_time = timings[name]["indexing_time"]
318
- query_time_ms = timings[name]["avg_query_time"] * 1000
319
-
320
- print(
321
- f"{name:<25} | {idx_time:<20.2f} | {query_time_ms:<22.2f} | {recall:<10.4f}"
322
- )
323
-
324
- print("=" * 85)
325
- ```
326
-
327
  ```bash
328
  uv run main_pylate.py
329
  ```
@@ -351,7 +141,7 @@ from pylate import indexes, models, retrieve
351
 
352
  # Step 1: Load the ColBERT model
353
  model = models.ColBERT(
354
- model_name_or_path="pylate_model_id",
355
  )
356
 
357
  # Step 2: Initialize the PLAID index
 
13
  - ko
14
  ---
15
 
16
+ # colbert-ko-v1.0
17
 
18
+ **colbert-ko-v1.0** is a Korean colbert model fine-tuned with [PyLate](https://github.com/lightonai/pylate). It is trained exclusively on Korean datasets. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
19
 
20
  ## Model Details
21
 
 
50
  | [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR) | Korean long document retrieval dataset covering various domains | 13,813.44 |
51
  <!-- | [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy) | Korean document retrieval dataset based on Wikipedia | 166.90 | -->
52
  <!-- | [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl) | Korean document retrieval dataset based on Wikipedia | 166.63 | -->
53
+ We omit MIRACLRetrieval and MrTidyRetrieval from the evaluation due to hardware constraints.
54
 
55
 
56
  ### Average Results
57
 
58
  | Model | Parameters | Average Recall@10 | Average Precision@10 | Average NDCG@10 | Average F1@10 |
59
  |-----------------------------------------------|------------|----------------|-------------------|--------------|------------|
60
+ | **colbert-ko-v1.0** | **0.1B** | **0.7999** | **0.0930** | **0.7172** | **0.1655**|
61
  | [jina-colbert-v2](https://huggingface.co/jinaai/jina-colbert-v2) | 0.5B | 0.7518 | 0.0888 | 0.6671 | 0.1577 |
62
 
63
  ## Usage
64
  ### PyLate for reranking
65
 
66
+ If you only want to use the colbert model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the rank function and pass it the queries and documents to rerank:
67
 
68
  ```python
69
  from pylate import rank, models
 
84
  ]
85
 
86
  model = models.ColBERT(
87
+ model_name_or_path="yjoonjang/colbert-ko-v1.0",
88
  )
89
 
90
  queries_embeddings = model.encode(
 
108
  First install the [muvera-py](https://github.com/sionic-ai/muvera-py) (Python implementation of MUVERA):
109
 
110
  ```bash
111
+ git clone --branch feat/pylate https://github.com/yjoonjang/muvera-py.git
112
  cd muvera-py
113
  ```
114
 
115
  Then run the main file:
116
 
117
  ```bash
118
  uv run main_pylate.py
119
  ```
 
141
 
142
  # Step 1: Load the ColBERT model
143
  model = models.ColBERT(
144
+ model_name_or_path="yjoonjang/colbert-ko-v1.0",
145
  )
146
 
147
  # Step 2: Initialize the PLAID index
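
Reviewer note: the README describes scoring with the MaxSim operator, and the `main_pylate.py` listing removed in this commit computed it with `torch.einsum` followed by a per-query-token max and a sum. As a rough, dependency-free sketch of that same scoring rule, the snippet below uses toy 2-dimensional token vectors instead of the model's 128-dimensional ones. The names `normalize`, `maxsim_score`, `doc_a`, and `doc_b` are illustrative assumptions, not part of PyLate or of this repository.

```python
import math


def normalize(vec):
    """L2-normalize a vector; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take the best
    dot product against any document token, then sum over query tokens."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


# Toy multi-vector embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one token, partially matching both

scores = {
    "doc_a": maxsim_score(query, doc_a),
    "doc_b": maxsim_score(query, doc_b),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because each query token is matched independently, a document that covers every query token well (`doc_a`) outranks one that only partially matches all of them (`doc_b`). In practice `rank.rerank` or a PLAID index would perform this computation on the model's embeddings.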