---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---

# patembed-small

This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.

**Note:** This model uses task-specific instruction prompts during inference for optimal performance.

## Model Details

- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0, 4, 8, 12, 16, 20}
- **Parameters**: 117M
- **Number of Layers**: 6
- **Hidden Size**: 1024
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0

## Model Description

patembed-small is the variant intended for resource-limited deployment: it keeps the 1024 hidden size of patembed-large and projects down to 384-dimensional embeddings.

This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-small')
model = AutoModel.from_pretrained('datalyes/patembed-small')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, masking out padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)

embeddings = mean_pooling(model_output, encoded['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```

### Patent Retrieval Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Rank candidates by similarity to the query
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```
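### Using Task Prompts

As noted above, this model uses task-specific instruction prompts during inference. The exact prompt strings are not listed on this card, so the sketch below only illustrates the mechanism via the generic `prompt` / `prompt_name` arguments of `sentence_transformers` (available since v2.4); the instruction string shown is a placeholder, not the prompt this model was trained with. Check the model repository's `config_sentence_transformers.json` for the actual prompts.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-small')

texts = ["A method for manufacturing semiconductor devices..."]

# If the model ships named prompts in its configuration
# (config_sentence_transformers.json), select one by name:
# embeddings = model.encode(texts, prompt_name="retrieval_query")

# Otherwise a raw instruction can be prepended via `prompt`.
# NOTE: this string is a PLACEHOLDER for illustration only.
embeddings = model.encode(texts, prompt="Represent this patent text for retrieval: ")
print(embeddings.shape)  # (1, 384)
```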
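### Patent Clustering Example

Embeddings from this model can also feed downstream clustering, e.g. for technology landscape analysis. The sketch below is purely illustrative and uses scikit-learn's `KMeans`, which is not part of this model card; a real analysis would run on a much larger corpus, and the cluster count here is arbitrary.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-small')

# Toy corpus; real landscape analyses would use many more abstracts
patents = [
    "A power management system for portable electronic devices...",
    "Method for reducing power consumption in mobile devices...",
    "Chemical composition for battery manufacturing...",
    "Electrolyte additive for lithium-ion cells...",
]

# L2-normalized embeddings make Euclidean k-means behave like cosine clustering
embeddings = model.encode(patents, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for label, text in zip(kmeans.labels_, patents):
    print(label, text[:60])
```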
## Intended Use

This model is designed for patent-specific tasks including:

- Patent search and retrieval
- Prior art search
- Patent classification and clustering
- Technology landscape analysis

For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
      title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
      author={Iliass Ayaou and Denis Cavallucci},
      year={2025},
      eprint={2510.22264},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22264},
}
```

**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

## License

This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.

**Key Terms:**

- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute your contributions under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

## Contact

- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [PatentTEB/PatentTEB](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)