---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---

# patembed-small

This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.

**Note:** This model uses task-specific instruction prompts during inference for optimal performance.

## Model Details

- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0, 4, 8, 12, 16, 20}
- **Parameters**: 117M
- **Number of Layers**: 6
- **Hidden Size**: 1024
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0

## Model Description

patembed-small is the variant intended for resource-limited deployment: it keeps the 1024 hidden size of patembed-large and projects down to 384-dimensional embeddings.

This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-small')
model = AutoModel.from_pretrained('datalyes/patembed-small')

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, masking out padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)

embeddings = mean_pooling(model_output, encoded['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```

### Patent Retrieval Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Rank candidates by similarity to the query
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```
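### Using Task Prompts

As noted above, this model uses task-specific instruction prompts during inference. The exact prompt strings are not listed on this card, so the sketch below only illustrates the mechanism via the generic `prompt` / `prompt_name` arguments of `sentence_transformers` (available since v2.4); the instruction string shown is a placeholder, not the prompt this model was trained with. Check the model repository's `config_sentence_transformers.json` for the actual prompts.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-small')

texts = ["A method for manufacturing semiconductor devices..."]

# If the model ships named prompts in its configuration
# (config_sentence_transformers.json), select one by name:
# embeddings = model.encode(texts, prompt_name="retrieval_query")

# Otherwise a raw instruction can be prepended via `prompt`.
# NOTE: this string is a PLACEHOLDER for illustration only.
embeddings = model.encode(texts, prompt="Represent this patent text for retrieval: ")
print(embeddings.shape)  # (1, 384)
```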
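### Patent Clustering Example

Embeddings from this model can also feed downstream clustering, e.g. for technology landscape analysis. The sketch below is purely illustrative and uses scikit-learn's `KMeans`, which is not part of this model card; a real analysis would run on a much larger corpus, and the cluster count here is arbitrary.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-small')

# Toy corpus; real landscape analyses would use many more abstracts
patents = [
    "A power management system for portable electronic devices...",
    "Method for reducing power consumption in mobile devices...",
    "Chemical composition for battery manufacturing...",
    "Electrolyte additive for lithium-ion cells...",
]

# L2-normalized embeddings make Euclidean k-means behave like cosine clustering
embeddings = model.encode(patents, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for label, text in zip(kmeans.labels_, patents):
    print(label, text[:60])
```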
## Intended Use

This model is designed for patent-specific tasks including:

- Patent search and retrieval
- Prior art search
- Patent classification and clustering
- Technology landscape analysis

For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
      title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
      author={Iliass Ayaou and Denis Cavallucci},
      year={2025},
      eprint={2510.22264},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22264},
}
```

**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

## License

This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.

**Key Terms:**

- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute your contributions under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

## Contact

- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [PatentTEB/PatentTEB](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)