---
license: apache-2.0
language:
- en
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- text-embeddings
library_name: sentence-transformers
base_model: sentence-transformers/all-MiniLM-L6-v2
---
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://imgur.com/sk6NekE.png" alt="Helion-V1 Logo" width="100%"/> |
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
# Helion-V1-Embeddings |
|
|
|
|
|
Helion-V1-Embeddings is a lightweight text embedding model for semantic similarity, search, and retrieval tasks. It maps English text to 384-dimensional dense vectors and is optimized for the Helion ecosystem.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Developed by:** DeepXR |
|
|
- **Model type:** Sentence Transformer / Text Embedding Model |
|
|
- **Base model:** sentence-transformers/all-MiniLM-L6-v2 |
|
|
- **Language:** English |
|
|
- **License:** Apache 2.0 |
|
|
- **Embedding Dimension:** 384 |
|
|
- **Max Sequence Length:** 256 tokens |
|
|
|
|
|
## Model Parameters |
|
|
|
|
|
| Parameter | Value | Description |
|-----------|-------|-------------|
| **Architecture** | BERT-based | 6-layer transformer encoder |
| **Hidden Size** | 384 | Dimension of hidden layers |
| **Attention Heads** | 12 | Number of attention heads |
| **Intermediate Size** | 1536 | Feed-forward layer size |
| **Vocab Size** | 30,522 | WordPiece vocabulary |
| **Max Position Embeddings** | 512 | Positional capacity (inputs are truncated at 256 tokens) |
| **Pooling Strategy** | Mean Pooling | Average of token embeddings |
| **Output Dimension** | 384 | Final embedding size |
| **Total Parameters** | ~22.7M | Trainable parameters |
| **Model Size** | ~80MB | Disk footprint |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
Helion-V1-Embeddings is designed for: |
|
|
- Semantic search and information retrieval |
|
|
- Document similarity comparison |
|
|
- Clustering and categorization |
|
|
- Question-answering systems (retrieval component) |
|
|
- Recommendation systems |
|
|
- Duplicate detection |
|
|
|
|
|
### Primary Users |
|
|
- Developers building search systems |
|
|
- Data scientists working on NLP tasks |
|
|
- Applications requiring text similarity |
|
|
- RAG (Retrieval-Augmented Generation) pipelines |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Fast Inference**: Optimized for quick embedding generation |
|
|
- **Compact Size**: Small model footprint (~80MB) |
|
|
- **Good Performance**: Balanced accuracy and speed |
|
|
- **Easy Integration**: Compatible with sentence-transformers library |
|
|
- **Batch Processing**: Efficient for large datasets |
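Batch processing needs no extra plumbing: `encode` batches internally. A minimal sketch (the corpus and `batch_size` value are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Encode a large corpus; encode() handles batching internally.
corpus = ["example sentence"] * 10_000
embeddings = model.encode(
    corpus,
    batch_size=64,            # illustrative value; tune for your hardware
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(embeddings.shape)  # (10000, 384)
```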
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Encode sentences
sentences = [
    "How do I reset my password?",
    "What is the process for password recovery?",
    "I forgot my login credentials"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```
|
|
|
|
|
### Similarity Search |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Encode query and documents
query = "How to train a machine learning model?"
documents = [
    "Machine learning training requires data preprocessing",
    "The best way to cook pasta is boiling water",
    "Neural networks need proper hyperparameter tuning"
]

query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Calculate cosine similarity between the query and each document
similarities = util.cos_sim(query_embedding, doc_embeddings)
print(similarities)
```
|
|
|
|
|
### Integration with FAISS |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Create embeddings
documents = ["doc1", "doc2", "doc3"]
embeddings = model.encode(documents)

# Create a FAISS index over the embedding dimension
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.astype('float32'))

# Search for the 3 nearest documents (k is positional in the FAISS API)
query_embedding = model.encode(["search query"])
distances, indices = index.search(query_embedding.astype('float32'), 3)
```
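The `IndexFlatL2` index above ranks by Euclidean distance. If you encode with `normalize_embeddings=True`, Euclidean ranking matches cosine ranking, and you can equivalently use `faiss.IndexFlatIP`, where the inner product of unit vectors is exactly their cosine similarity.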
|
|
|
|
|
## Performance |
|
|
|
|
|
### Benchmark Results |
|
|
|
|
|
| Task | Score | Notes |
|------|-------|-------|
| STS Benchmark | ~0.78 | Semantic Textual Similarity |
| Retrieval (BEIR) | ~0.42 | Average across datasets |
| Speed (CPU) | ~2,000 sentences/sec | Batch size 32 |
| Speed (GPU) | ~15,000 sentences/sec | Batch size 128 |
|
|
|
|
|
*Note: These are approximate values. Actual performance may vary.* |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on: |
|
|
- Question-answer pairs |
|
|
- Semantic similarity datasets |
|
|
- Document-query pairs |
|
|
- Paraphrase detection examples |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Base Model:** sentence-transformers/all-MiniLM-L6-v2 |
|
|
- **Training Method:** Contrastive learning with cosine similarity |
|
|
- **Loss Function:** MultipleNegativesRankingLoss |
|
|
- **Batch Size:** 64 |
|
|
- **Epochs:** 3 |
|
|
- **Pooling:** Mean pooling |
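The training pairs themselves are not published; the following is only a sketch of what contrastive fine-tuning with `MultipleNegativesRankingLoss` looks like in sentence-transformers, using placeholder (query, passage) pairs and an illustrative `warmup_steps` value:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Placeholder (query, relevant passage) pairs; other in-batch
# passages serve as negatives for the ranking loss.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Follow the password recovery link on the login page."]),
    InputExample(texts=["What is mean pooling?",
                        "Mean pooling averages token embeddings into one vector."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=3, warmup_steps=100)
```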
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture |
|
|
- **Type:** Transformer-based encoder |
|
|
- **Layers:** 6 |
|
|
- **Hidden Size:** 384 |
|
|
- **Attention Heads:** 12 |
|
|
- **Parameters:** ~22.7M |
|
|
- **Pooling Strategy:** Mean pooling |
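For reference, the mean pooling step can be reproduced with plain `transformers`; this mirrors the standard sentence-transformers pooling (loading via `AutoModel` assumes the repository ships the usual transformer weights, as sentence-transformers repos typically do):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('DeepXR/Helion-V1-embeddings')
model = AutoModel.from_pretrained('DeepXR/Helion-V1-embeddings')

def mean_pooling(token_embeddings, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

encoded = tokenizer(["Hello world"], padding=True, truncation=True,
                    max_length=256, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded)
embedding = mean_pooling(output.last_hidden_state, encoded['attention_mask'])
print(embedding.shape)  # torch.Size([1, 384])
```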
|
|
|
|
|
### Input Format |
|
|
- **Max Length:** 256 tokens |
|
|
- **Tokenizer:** WordPiece |
|
|
- **Normalization:** Applied automatically |
|
|
|
|
|
### Output Format |
|
|
- **Embedding Dimension:** 384 |
|
|
- **Dtype:** float32 |
|
|
- **Normalization:** L2 normalized (optional) |
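To get L2-normalized embeddings, pass `normalize_embeddings=True` to `encode`; the dot product of normalized vectors is then their cosine similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

emb = model.encode(["first sentence", "second sentence"],
                   normalize_embeddings=True)
print(np.linalg.norm(emb, axis=1))  # ~[1. 1.]
print(float(emb[0] @ emb[1]))       # dot product == cosine similarity here
```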
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Sequence Length:** Limited to 256 tokens; longer texts are truncated (see the chunking sketch after this list)
|
|
- **Language:** Primarily optimized for English |
|
|
- **Domain:** General-purpose, may need fine-tuning for specialized domains |
|
|
- **Context:** Does not maintain conversation context across multiple inputs |
|
|
- **Model Size:** Smaller than state-of-the-art models, trading some accuracy for speed |
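A common workaround for the 256-token limit is to chunk long documents and aggregate the chunk embeddings. A minimal sketch with a naive word-based splitter (the chunk size and averaging strategy are illustrative choices, not part of this model):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

def embed_long_text(text: str, words_per_chunk: int = 150) -> np.ndarray:
    # Naive word-based chunking; ~150 words stays safely under 256 tokens.
    words = text.split()
    chunks = [' '.join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)] or ['']
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    # Aggregate by averaging chunk embeddings (one simple strategy of several).
    doc_embedding = chunk_embeddings.mean(axis=0)
    return doc_embedding / np.linalg.norm(doc_embedding)
```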
|
|
|
|
|
## Use Cases |
|
|
|
|
|
### ✅ Good For: |
|
|
- Semantic search in document collections |
|
|
- Finding similar questions/answers |
|
|
- Content recommendation |
|
|
- Duplicate detection (see the sketch after this list)
|
|
- Clustering similar documents |
|
|
- Quick similarity comparisons |
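For duplicate detection, `util.paraphrase_mining` scores all sentence pairs efficiently; the cutoff used here is an illustrative threshold, not a recommended value:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

texts = [
    "How do I reset my password?",
    "What is the process for password recovery?",
    "The best way to cook pasta is boiling water",
]
# Returns a list of [score, i, j] triples sorted by decreasing score.
pairs = util.paraphrase_mining(model, texts)
for score, i, j in pairs:
    if score > 0.8:  # illustrative threshold for flagging near-duplicates
        print(f"{score:.2f}: {texts[i]!r} <-> {texts[j]!r}")
```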
|
|
|
|
|
### ❌ Not Suitable For: |
|
|
- Long document encoding (>256 tokens) |
|
|
- Real-time generation tasks |
|
|
- Multilingual applications (without fine-tuning) |
|
|
- Highly specialized domains without adaptation |
|
|
- Tasks requiring deep reasoning |
|
|
|
|
|
## Comparison with Other Models |
|
|
|
|
|
| Model | Dim | Speed | Accuracy | Size |
|-------|-----|-------|----------|------|
| Helion-V1-Embeddings | 384 | Fast | Good | ~80MB |
| all-MiniLM-L6-v2 | 384 | Fast | Good | ~80MB |
| all-mpnet-base-v2 | 768 | Medium | Better | ~420MB |
| text-embedding-ada-002 | 1536 | API-bound | Best | Hosted (API) |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- **Bias:** May reflect biases present in training data |
|
|
- **Privacy:** Do not embed sensitive personal information |
|
|
- **Fairness:** Performance may vary across different text types |
|
|
- **Use Responsibly:** Consider implications of similarity matching |
|
|
|
|
|
## Integration Examples |
|
|
|
|
|
### LangChain Integration |
|
|
|
|
|
```python
from langchain.embeddings import HuggingFaceEmbeddings
# Note: on recent LangChain releases this class lives in the
# langchain_huggingface (or langchain_community) package instead.

embeddings = HuggingFaceEmbeddings(
    model_name="DeepXR/Helion-V1-embeddings"
)

text = "This is a sample document"
embedding = embeddings.embed_query(text)
```
|
|
|
|
|
### LlamaIndex Integration |
|
|
|
|
|
```python
from llama_index.embeddings import HuggingFaceEmbedding
# Note: on llama-index >= 0.10 the import path is
# llama_index.embeddings.huggingface instead.

embed_model = HuggingFaceEmbedding(
    model_name="DeepXR/Helion-V1-embeddings"
)

embeddings = embed_model.get_text_embedding("Hello world")
```
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{helion-v1-embeddings,
  author = {DeepXR},
  title = {Helion-V1-Embeddings: Lightweight Text Embedding Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DeepXR/Helion-V1-embeddings}
}
```
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
DeepXR Team |
|
|
|
|
|
## Contact |
|
|
|
|
|
- Repository: https://huggingface.co/DeepXR/Helion-V1-embeddings |
|
|
- Issues: https://huggingface.co/DeepXR/Helion-V1-embeddings/discussions |