---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---
# patembed-small
This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.
**Note:** This model uses task-specific instruction prompts during inference for optimal performance.
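Sentence Transformers can attach such prompts at encode time through the `prompt` (or `prompt_name`) argument of `encode`. A minimal sketch; the prompt string below is a placeholder, since the model's actual task-specific prompts ship with its configuration:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-small')

# Placeholder instruction prompt; the real task-specific prompts
# are defined in the model's configuration.
embeddings = model.encode(
    ["A method for manufacturing semiconductor devices..."],
    prompt="Represent this patent text for retrieval: ",
)
```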
## Model Details
- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0,4,8,12,16,20}
- **Parameters**: 117M
- **Number of Layers**: 6
- **Hidden Size**: 1024
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0
## Model Description
A variant for resource-limited deployment: it keeps the 1024 hidden size and projects the output down to 384-dimensional embeddings.
This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.
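For intuition, the pipeline described above maps onto standard sentence-transformers modules. The following is a hedged reconstruction, not the official build script: the mean pooling mode is inferred from the Transformers example below, and the projection's activation function is an assumption.
```python
from sentence_transformers import SentenceTransformer, models

# Sketch of the implied module stack: a 6-layer encoder with hidden
# size 1024, mean pooling, then a dense projection 1024 -> 384.
encoder = models.Transformer('datalyes/patembed-small', max_seq_length=512)
pooling = models.Pooling(encoder.get_word_embedding_dimension(), pooling_mode='mean')
projection = models.Dense(in_features=1024, out_features=384)  # activation choice is assumed
model = SentenceTransformer(modules=[encoder, pooling, projection])
```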
## Usage
### Using Sentence Transformers
```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
### Using Transformers
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-small')
model = AutoModel.from_pretrained('datalyes/patembed-small')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, masking out padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded)

# Pool and L2-normalize the token embeddings
embeddings = mean_pooling(model_output, encoded['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
### Patent Retrieval Example
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode query and candidates
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute cosine similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Rank candidates by similarity
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```
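For larger candidate pools, the manual ranking above can be handed off to the library's built-in search helper, `util.semantic_search`. A brief sketch using the same illustrative corpus:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

corpus = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]
corpus_embs = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(
    "Method for reducing power consumption in mobile devices",
    convert_to_tensor=True,
)

# Returns, per query, a ranked list of {'corpus_id', 'score'} dicts
hits = util.semantic_search(query_emb, corpus_embs, top_k=2)[0]
for hit in hits:
    print(f"Score: {hit['score']:.4f} - {corpus[hit['corpus_id']][:100]}...")
```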
## Intended Use
This model is designed for patent-specific tasks including:
- Patent search and retrieval
- Prior art search
- Patent classification and clustering (see the clustering sketch after this list)
- Technology landscape analysis
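A minimal clustering sketch, assuming scikit-learn is available; the abstracts and cluster count are illustrative:
```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-small')

# Illustrative abstracts; replace with your patent corpus
abstracts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
]
embeddings = model.encode(abstracts)

# Group the patents into two technology clusters
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)
for text, label in zip(abstracts, labels):
    print(label, text[:60])
```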
For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.
## Citation
If you use this model, please cite our paper:
```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
  title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
  author={Iliass Ayaou and Denis Cavallucci},
  year={2025},
  eprint={2510.22264},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22264}
}
```
**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)
## License
This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.
**Key Terms:**
- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute your adaptations under the same license
For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/
## Contact
- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)