---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---

# patembed-small

This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.

**Note:** This model uses task-specific instruction prompts during inference for optimal performance (see the prompt sketch under Usage below).

## Model Details

- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0,4,8,12,16,20}
- **Parameters**: 117M
- **Number of Layers**: 6
- **Hidden Size**: 1024
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0
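
As a quick sanity check on the numbers above, the embedding dimension and input limit can be read from the loaded model; a minimal sketch using standard sentence-transformers attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-small')

# The projection head maps the 1024-dim hidden states to 384-dim embeddings
print(model.get_sentence_embedding_dimension())  # expected: 384

# Inputs longer than this many tokens are truncated
print(model.max_seq_length)  # expected: 512
```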

## Model Description

patembed-small is the variant intended for resource-limited deployment: it keeps the 1024 hidden size of patembed-large and projects to 384-dimensional embeddings.

This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute similarity
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
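
As noted at the top of this card, the model was trained with task-specific instruction prompts. The exact prompt strings are not reproduced here, so the sketch below only illustrates the generic sentence-transformers prompt mechanism with a hypothetical placeholder; check the model's `config_sentence_transformers.json` and the paper for the actual task prompts and any registered prompt names.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-small')

# The prompt string below is a HYPOTHETICAL placeholder, not this model's
# actual instruction; consult the model repository for the real task prompts.
query_embeddings = model.encode(
    ["Method for reducing power consumption in mobile devices"],
    prompt="Represent this patent text for retrieval: ",  # hypothetical
)

# If the model's configuration registers named prompts, one can be selected
# by name instead (the name here is also hypothetical):
# query_embeddings = model.encode(texts, prompt_name="retrieval")
```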

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-small')
model = AutoModel.from_pretrained('datalyes/patembed-small')

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padded positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)

embeddings = mean_pooling(model_output, encoded['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
# Note: any sentence-transformers projection module (1024 -> 384) is not part
# of AutoModel, so these embeddings may be 1024-dim; prefer the
# SentenceTransformer loading above for the published 384-dim embeddings.
```

### Patent Retrieval Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Get ranked results
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```
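
For more than a handful of candidates, the same flow can be written with `util.semantic_search`, which batches the cosine scoring and returns the top-k hits per query; a sketch assuming the corpus embeddings fit in memory:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

corpus = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# convert_to_tensor=True keeps the embeddings as torch tensors for scoring
corpus_embs = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("Method for reducing power consumption in mobile devices",
                         convert_to_tensor=True)

# Returns one list of {'corpus_id', 'score'} dicts per query, sorted by score
hits = util.semantic_search(query_emb, corpus_embs, top_k=2)[0]
for hit in hits:
    print(f"Score: {hit['score']:.4f} - {corpus[hit['corpus_id']][:100]}...")
```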

## Intended Use

This model is designed for patent-specific tasks including:

- Patent search and retrieval
- Prior art search
- Patent classification and clustering (a clustering sketch follows this list)
- Technology landscape analysis
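
For the clustering use case, the embeddings plug directly into standard tooling; below is a minimal sketch with scikit-learn's KMeans (the texts and cluster count are illustrative, not from the paper):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-small')

patents = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
    "Method for wireless data transmission in mobile networks...",
    "A power management system for portable electronic devices...",
]

embeddings = model.encode(patents)

# n_clusters=2 is illustrative; choose it based on your own corpus
kmeans = KMeans(n_clusters=2, n_init="auto", random_state=0).fit(embeddings)
for text, label in zip(patents, kmeans.labels_):
    print(label, text[:60])
```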

For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
      title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
      author={Iliass Ayaou and Denis Cavallucci},
      year={2025},
      eprint={2510.22264},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22264}
}
```

**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

## License

This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.

**Key Terms:**

- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

## Contact

- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)