---
license: apache-2.0
language:
- en
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- text-embeddings
library_name: sentence-transformers
base_model: sentence-transformers/all-MiniLM-L6-v2
---
Helion-V1 Logo
---

# Helion-V1-Embeddings

Helion-V1-Embeddings is a lightweight text embedding model designed for semantic similarity, search, and retrieval tasks. It converts text into dense vector representations optimized for the Helion ecosystem.

## Model Description

- **Developed by:** DeepXR
- **Model type:** Sentence Transformer / Text Embedding Model
- **Base model:** sentence-transformers/all-MiniLM-L6-v2
- **Language:** English
- **License:** Apache 2.0
- **Embedding Dimension:** 384
- **Max Sequence Length:** 256 tokens

## Intended Use

Helion-V1-Embeddings is designed for:

- Semantic search and information retrieval
- Document similarity comparison
- Clustering and categorization
- Question-answering systems (retrieval component)
- Recommendation systems
- Duplicate detection

### Primary Users

- Developers building search systems
- Data scientists working on NLP tasks
- Applications requiring text similarity
- RAG (Retrieval-Augmented Generation) pipelines

## Key Features

- **Fast Inference:** Optimized for quick embedding generation
- **Compact Size:** Small model footprint (~80MB)
- **Good Performance:** Balanced accuracy and speed
- **Easy Integration:** Compatible with the sentence-transformers library
- **Batch Processing:** Efficient for large datasets

## Usage

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Encode sentences
sentences = [
    "How do I reset my password?",
    "What is the process for password recovery?",
    "I forgot my login credentials"
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```

### Similarity Search

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Encode query and documents
query = "How to train a machine learning model?"
documents = [
    "Machine learning training requires data preprocessing",
    "The best way to cook pasta is boiling water",
    "Neural networks need proper hyperparameter tuning"
]

query_embedding = model.encode(query)
doc_embeddings = model.encode(documents)

# Calculate similarity
similarities = util.cos_sim(query_embedding, doc_embeddings)
print(similarities)
```

### Integration with FAISS

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Create embeddings
documents = ["doc1", "doc2", "doc3"]
embeddings = model.encode(documents)

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.astype('float32'))

# Search
query_embedding = model.encode(["search query"])
distances, indices = index.search(query_embedding.astype('float32'), k=3)
```

## Performance

### Benchmark Results

| Task | Score | Notes |
|------|-------|-------|
| STS Benchmark | ~0.78 | Semantic Textual Similarity |
| Retrieval (BEIR) | ~0.42 | Average across datasets |
| Speed (CPU) | ~2000 sentences/sec | Batch size 32 |
| Speed (GPU) | ~15000 sentences/sec | Batch size 128 |

*Note: These are approximate values; actual performance may vary.*
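The throughput figures above assume batched encoding. As a minimal sketch (the batch size and corpus below are illustrative, not tuned settings), large document sets can be embedded with the standard `encode` arguments from sentence-transformers, optionally L2-normalizing the output:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('DeepXR/Helion-V1-embeddings')

# Illustrative corpus; replace with your own documents
corpus = [f"document {i}" for i in range(10_000)]

# Batched encoding; batch_size=128 is an example value, not a tuned setting.
# normalize_embeddings=True returns unit-length vectors, so a dot product
# downstream is equivalent to cosine similarity.
embeddings = model.encode(
    corpus,
    batch_size=128,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)

print(embeddings.shape)  # (10000, 384)
```

With normalized embeddings, an inner-product index (e.g. `faiss.IndexFlatIP`) can be used in place of the L2 index shown earlier to rank by cosine similarity directly.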
## Training Details

### Training Data

The model was fine-tuned on:

- Question-answer pairs
- Semantic similarity datasets
- Document-query pairs
- Paraphrase detection examples

### Training Procedure

- **Base Model:** sentence-transformers/all-MiniLM-L6-v2
- **Training Method:** Contrastive learning with cosine similarity
- **Loss Function:** MultipleNegativesRankingLoss
- **Batch Size:** 64
- **Epochs:** 3
- **Pooling:** Mean pooling

## Technical Specifications

### Model Architecture

- **Type:** Transformer-based encoder
- **Layers:** 6
- **Hidden Size:** 384
- **Attention Heads:** 12
- **Parameters:** ~22.7M
- **Pooling Strategy:** Mean pooling

### Input Format

- **Max Length:** 256 tokens
- **Tokenizer:** WordPiece
- **Normalization:** Applied automatically

### Output Format

- **Embedding Dimension:** 384
- **Dtype:** float32
- **Normalization:** L2 normalized (optional)

## Limitations

- **Sequence Length:** Limited to 256 tokens (longer texts are truncated)
- **Language:** Primarily optimized for English
- **Domain:** General-purpose; may need fine-tuning for specialized domains
- **Context:** Does not maintain conversation context across multiple inputs
- **Model Size:** Smaller than state-of-the-art models, trading some accuracy for speed

## Use Cases

### ✅ Good For:

- Semantic search in document collections
- Finding similar questions/answers
- Content recommendation
- Duplicate detection
- Clustering similar documents
- Quick similarity comparisons

### ❌ Not Suitable For:

- Long document encoding (>256 tokens)
- Real-time generation tasks
- Multilingual applications (without fine-tuning)
- Highly specialized domains without adaptation
- Tasks requiring deep reasoning

## Comparison with Other Models

| Model | Dim | Speed | Accuracy | Size |
|-------|-----|-------|----------|------|
| Helion-V1-Embeddings | 384 | Fast | Good | 80MB |
| all-MiniLM-L6-v2 | 384 | Fast | Good | 80MB |
| all-mpnet-base-v2 | 768 | Medium | Better | 420MB |
| text-embedding-ada-002 | 1536 | API | Best | API |

## Ethical Considerations

- **Bias:** May reflect biases present in training data
- **Privacy:** Do not embed sensitive personal information
- **Fairness:** Performance may vary across different text types
- **Use Responsibly:** Consider the implications of similarity matching

## Integration Examples

### LangChain Integration

```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="DeepXR/Helion-V1-embeddings"
)

text = "This is a sample document"
embedding = embeddings.embed_query(text)
```

### LlamaIndex Integration

```python
from llama_index.embeddings import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="DeepXR/Helion-V1-embeddings"
)

embeddings = embed_model.get_text_embedding("Hello world")
```

## Citation

```bibtex
@misc{helion-v1-embeddings,
  author = {DeepXR},
  title = {Helion-V1-Embeddings: Lightweight Text Embedding Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DeepXR/Helion-V1-embeddings}
}
```

## Model Card Authors

DeepXR Team

## Contact

- Repository: https://huggingface.co/DeepXR/Helion-V1-embeddings
- Issues: https://huggingface.co/DeepXR/Helion-V1-embeddings/discussions