---
title: Api Embedding
emoji: 🧩
colorFrom: green
colorTo: purple
sdk: docker
pinned: false
---
# Unified Embedding API

🧩 A self-hosted embedding service for dense, sparse, and reranking models with an OpenAI-compatible API.

## Overview
The Unified Embedding API is a modular, self-hosted solution designed to simplify the development and management of embedding models for Retrieval-Augmented Generation (RAG) and semantic search applications. Built on FastAPI and Sentence Transformers, this API provides a unified interface for dense embeddings, sparse embeddings (SPLADE), and document reranking through CrossEncoder models.
**Key Differentiation:** Unlike traditional embedding services that require separate infrastructure for each model type, this API consolidates all embedding operations into a single, configurable endpoint with OpenAI-compatible responses.
## Project Motivation
During the development of RAG and agentic systems for production environments and portfolio projects, several operational challenges emerged:
- Development Environment Overhead: Each experiment required setting up isolated environments with PyTorch, Transformers, and associated dependencies (often 5-10GB per environment)
- Model Experimentation Costs: Testing different models for optimal precision, MRR, and recall metrics necessitated downloading multiple model versions, consuming significant disk space and compute resources
- Hardware Limitations: Running models locally on CPU-only machines frequently resulted in thermal throttling and system instability
**Solution Approach:** After evaluating Hugging Face's Text Embeddings Inference (TEI), the need for a more flexible, configuration-driven solution became apparent. This project addresses these challenges by:
- Providing a single API endpoint that can serve multiple model types
- Enabling model switching through configuration files without code changes
- Leveraging Hugging Face Spaces for free, serverless hosting
- Maintaining compatibility with OpenAI's client libraries for seamless integration
## Technical Motivation

### Architecture Decisions

#### 1. Framework Selection: SentenceTransformers + FastAPI

SentenceTransformers was chosen as the core embedding library for several technical reasons:
- Unified Model Interface: Provides consistent APIs across diverse model architectures (BERT, RoBERTa, SPLADE, CrossEncoders)
- Model Ecosystem: Direct compatibility with 5,000+ pre-trained models on Hugging Face Hub
FastAPI serves as the web framework due to:
- Async-First Architecture: Non-blocking I/O operations critical for handling concurrent embedding requests
- Automatic API Documentation: OpenAPI/Swagger generation reduces documentation overhead
- Type Safety: Pydantic integration ensures request validation at the schema level
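To make the type-safety point concrete, here is a minimal sketch of an OpenAI-style embedding request model, assuming Pydantic v2 and Python 3.10+. This is illustrative only; the project's actual schemas live in `src/models/schemas/requests.py`:

```python
# Hypothetical sketch of a schema-validated request model; the real
# models are defined in src/models/schemas/requests.py.
from pydantic import BaseModel, Field


class EmbeddingRequest(BaseModel):
    input: str | list[str] = Field(..., description="Text(s) to embed")
    model: str = Field(..., description="Configured model identifier")
    encoding_format: str = "float"
```

With a model like this, FastAPI rejects malformed payloads with a 422 response before they ever reach an embedding model.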
#### 2. Hosting Strategy: Hugging Face Spaces
Deploying on Hugging Face Spaces provides several operational advantages:
- Zero infrastructure cost for CPU-based workloads (2 vCPU, 16 GB RAM)
- Eliminates need for dedicated VPS or cloud compute instances
- No egress fees for model weight downloads from HF Hub
- Built-in CI/CD through git-based deployments
- Easy transition to paid GPU instances for larger models
- Native support for Docker-based deployments
## Features

### Core Capabilities
- Multi-Model Support: Serve dense embeddings (transformers), sparse embeddings (SPLADE), and reranking models (CrossEncoders) from a single API
- OpenAI Compatibility: Drop-in replacement for OpenAI's embedding API with client library support
- Configuration-Driven: Switch models through YAML configuration without code modifications
- Batch Processing: Automatic optimization for single and batch requests
- Type Safety: Full Pydantic validation with OpenAPI schema generation
- Async Operations: Non-blocking request handling with FastAPI's async/await
## Architecture

### System Components

```
┌─────────────────────────────────────────────────────────┐
│                     FastAPI Server                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  Embeddings  │  │  Reranking   │  │    Models    │   │
│  │   Endpoint   │  │   Endpoint   │  │   Endpoint   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                     Model Manager                       │
│  • Configuration Loading                                │
│  • Model Lifecycle Management                           │
│  • Thread-Safe Model Access                             │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                Embedding Implementations                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    Dense     │  │    Sparse    │  │  Reranking   │   │
│  │(Transformer) │  │   (SPLADE)   │  │(CrossEncoder)│   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```
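The Model Manager's three responsibilities can be sketched in a few lines. The names below are hypothetical and only illustrate the pattern; the real implementation is `src/core/manager.py`:

```python
# Hypothetical sketch of the Model Manager pattern: load models.yaml once,
# instantiate models lazily, and guard the cache with a lock.
import threading

import yaml
from sentence_transformers import SentenceTransformer


class ModelManager:
    def __init__(self, config_path: str):
        with open(config_path) as f:
            self._config = yaml.safe_load(f)["models"]  # configuration loading
        self._models: dict[str, SentenceTransformer] = {}
        self._lock = threading.Lock()

    def get(self, model_id: str) -> SentenceTransformer:
        # Thread-safe, lazy model access: the first request loads the
        # model; later requests reuse the cached instance.
        with self._lock:
            if model_id not in self._models:
                self._models[model_id] = SentenceTransformer(
                    self._config[model_id]["name"], device="cpu"
                )
            return self._models[model_id]
```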
### Project Structure

```
unified-embedding-api/
├── src/
│   ├── api/                      # API layer
│   │   ├── dependencies.py       # Dependency injection
│   │   └── routes/
│   │       ├── embeddings.py     # Dense/sparse endpoints
│   │       ├── model_list.py     # Model management
│   │       ├── health.py         # Health checks
│   │       └── rerank.py         # Reranking endpoint
│   ├── core/                     # Business logic
│   │   ├── base.py               # Abstract base classes
│   │   ├── config.py             # Configuration models
│   │   ├── exceptions.py         # Custom exceptions
│   │   └── manager.py            # Model lifecycle management
│   ├── models/                   # Domain models
│   │   ├── embeddings/
│   │   │   ├── dense.py          # Dense embedding implementation
│   │   │   ├── sparse.py         # Sparse embedding implementation
│   │   │   └── rank.py           # Reranking implementation
│   │   └── schemas/
│   │       ├── common.py         # Shared schemas
│   │       ├── requests.py       # Request models
│   │       └── responses.py      # Response models
│   ├── config/
│   │   ├── settings.py           # Application settings
│   │   └── models.yaml           # Model configuration
│   └── utils/
│       ├── logger.py             # Logging configuration
│       └── validators.py         # Validation (kwargs, tokens, etc.)
├── app.py                        # Application entry point
├── requirements.txt              # Development dependencies
└── Dockerfile                    # Container definition
```
## Quick Start

### Deployment on Hugging Face Spaces

**Prerequisites:**
- Hugging Face account
- Git installed locally
**Steps:**

1. **Duplicate Space**
   - Navigate to [fahmiaziz/api-embedding](https://huggingface.co/spaces/fahmiaziz/api-embedding)
   - Click the three-dot menu → "Duplicate this Space"

2. **Configure Environment**
   - In Space settings, add `HF_TOKEN` as a repository secret (for private model access)
   - Ensure Space visibility is set to "Public"

3. **Clone Repository**

   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/api-embedding
   cd api-embedding
   ```

4. **Configure Models**

   Edit `src/config/models.yaml`:

   ```yaml
   models:
     custom-model:
       name: "organization/model-name"
       type: "embeddings"  # Options: embeddings, sparse-embeddings, rerank
   ```

5. **Deploy Changes**

   ```bash
   git add src/config/models.yaml
   git commit -m "Configure custom models"
   git push
   ```

6. **Access API**
   - Click ⋯ → "Embed this Space" → copy the Direct URL
   - Base URL: `https://YOUR_USERNAME-api-embedding.hf.space`
   - Documentation: `https://YOUR_USERNAME-api-embedding.hf.space/docs`
### Local Development (NOT RECOMMENDED)

**System Requirements:**
- Python 3.10+
- 8GB RAM minimum
- 10GB+ disk space
**Setup:**

```bash
# Clone repository
git clone https://github.com/fahmiaziz98/unified-embedding-api.git
cd unified-embedding-api

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Start server
python app.py
```
The server will be available at `http://localhost:7860`.
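A quick smoke test once the server is up, using the `/health` endpoint listed in the API reference below:

```python
import requests

# Verify the local server responds before sending embedding requests
resp = requests.get("http://localhost:7860/health")
print(resp.status_code, resp.text)
```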
### Docker Deployment

```bash
# Build image
docker build -t unified-embedding-api .

# Run container
docker run -p 7860:7860 unified-embedding-api
```
## Usage

### Native API (requests)
```python
import requests

BASE_URL = "https://fahmiaziz-api-embedding.hf.space/api/v1"

# Generate embeddings
response = requests.post(
    f"{BASE_URL}/embeddings",
    json={
        "input": "Natural language processing",
        "model": "qwen3-0.6b"
    }
)

data = response.json()
embedding = data["data"][0]["embedding"]
print(f"Embedding dimensions: {len(embedding)}")
```
### OpenAI Client Integration
The API implements OpenAI's embedding API specification, enabling direct integration with OpenAI's Python client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required"  # Placeholder required by the client
)

# Single text embedding
response = client.embeddings.create(
    input="Text to embed",
    model="qwen3-0.6b"
)

embedding_vector = response.data[0].embedding
```
**Async Operations:**
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required"
)

async def generate_embeddings(texts: list[str]):
    response = await client.embeddings.create(
        input=texts,
        model="qwen3-0.6b"
    )
    return [item.embedding for item in response.data]

# Usage in an async context
embeddings = await generate_embeddings(["text1", "text2"])
```
### Document Reranking
```python
import requests

BASE_URL = "https://fahmiaziz-api-embedding.hf.space/api/v1"

response = requests.post(
    f"{BASE_URL}/rerank",
    json={
        "query": "machine learning frameworks",
        "documents": [
            "TensorFlow is a comprehensive ML platform",
            "React is a JavaScript UI library",
            "PyTorch provides flexible neural networks"
        ],
        "model": "bge-v2-m3",
        "top_k": 2
    }
)

results = response.json()["results"]
for result in results:
    print(f"Score: {result['score']:.3f} - {result['text']}")
```
## API Reference

### Endpoints
| Endpoint | Method | Description | OpenAI Compatible |
|---|---|---|---|
| `/api/v1/embeddings` | POST | Generate embeddings | Yes |
| `/api/v1/embed_sparse` | POST | Generate sparse embeddings | No |
| `/api/v1/rerank` | POST | Rerank documents | No |
| `/api/v1/models` | GET | List available models | Partial |
| `/health` | GET | Health check | No |
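For example, the models endpoint can be used to discover which model IDs are configured before making requests. The exact response shape is not documented here, so this sketch simply prints it:

```python
import requests

# List the configured model IDs served by the API
resp = requests.get("https://fahmiaziz-api-embedding.hf.space/api/v1/models")
print(resp.json())
```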
### Request Format

**Embeddings (OpenAI-compatible):**

```json
{
  "input": "text" | ["text1", "text2"],
  "model": "model-identifier",
  "encoding_format": "float"
}
```

**Sparse Embeddings:**

```json
{
  "input": "text" | ["text1", "text2"],
  "model": "splade-model-id"
}
```
**Reranking:**

```json
{
  "query": "search query",
  "documents": ["doc1", "doc2"],
  "model": "reranker-id",
  "top_k": 10
}
```
### Response Format

**Standard Embedding Response:**

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.123, -0.456, ...],
      "index": 0
    }
  ],
  "model": "qwen3-0.6b",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
```
## Configuration

### Model Configuration

The default configuration is optimized for the free CPU tier (2 vCPU / 16 GB RAM). See the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for model recommendations and memory-usage reference.

Edit `src/config/models.yaml` to add or modify models:
```yaml
models:
  # Dense embedding model
  custom-dense:
    name: "sentence-transformers/all-MiniLM-L6-v2"
    type: "embeddings"

  # Sparse embedding model
  custom-sparse:
    name: "prithivida/Splade_PP_en_v1"
    type: "sparse-embeddings"

  # Reranking model
  custom-reranker:
    name: "BAAI/bge-reranker-base"
    type: "rerank"
```
**Model Type Reference:**

| Type | Description | Use Case |
|---|---|---|
| `embeddings` | Dense vector embeddings | Semantic search, similarity |
| `sparse-embeddings` | Sparse vectors (SPLADE) | Keyword + semantic hybrid |
| `rerank` | CrossEncoder scoring | Precision reranking |
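To make the table concrete, here is a sketch of the common two-stage pattern: dense embeddings (type `embeddings`) drive first-stage retrieval in a vector store, and the CrossEncoder (type `rerank`) re-scores the retrieved candidates. The first stage is stubbed out with a fixed candidate list, and the model ID refers to the `custom-reranker` entry configured above:

```python
import requests

BASE_URL = "https://YOUR_USERNAME-api-embedding.hf.space/api/v1"

query = "machine learning frameworks"

# Stage 1 (stubbed): in a real pipeline these candidates come from a
# vector-store search over dense embeddings (type: embeddings).
candidates = [
    "TensorFlow is a comprehensive ML platform",
    "React is a JavaScript UI library",
    "PyTorch provides flexible neural networks",
]

# Stage 2: precision reranking with a CrossEncoder (type: rerank).
results = requests.post(
    f"{BASE_URL}/rerank",
    json={
        "query": query,
        "documents": candidates,
        "model": "custom-reranker",  # ID from models.yaml above
        "top_k": 2,
    },
).json()["results"]

for item in results:
    print(item["score"], item["text"])
```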
⚠️ If you plan to use larger models such as Qwen3-Embedding-8B, please upgrade your Space.
### Application Settings

Configure through the `src/config/settings.py` file:

```bash
# Application
APP_NAME="Unified Embedding API"
VERSION="3.0.0"

# Server
HOST=0.0.0.0
PORT=7860  # don't change this port
WORKERS=1

# Models
MODEL_CONFIG_PATH=src/config/models.yaml
PRELOAD_MODELS=true
DEVICE=cpu

# Logging
LOG_LEVEL=INFO
```
## Performance Optimization

### Recommended Practices

**Batch Processing**
- Always send multiple texts in a single request when possible (see the batching sketch after this list)
- Batch sizes of 16-32 provide the best throughput/latency balance

**Normalization**
- Enable `normalize_embeddings` for cosine similarity operations
- Reduces downstream computation in vector databases

**Model Selection**
- Dense models: best for semantic similarity
- Sparse models: better for keyword matching + semantics
- Reranking: use as a second stage after initial retrieval
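As an illustration of the batching advice, a sketch that chunks a corpus into batches of 32 before calling the API (the helper name is hypothetical):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required",
)

def embed_corpus(texts: list[str], batch_size: int = 32) -> list[list[float]]:
    """Embed a corpus in chunks instead of one request per text."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model="qwen3-0.6b")
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```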
## Migration from OpenAI
Replace OpenAI embedding calls with minimal code changes:
**Before (OpenAI):**

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-small"
)
```
**After (Self-hosted):**

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-space.hf.space/api/v1",
    api_key="not-required"
)

response = client.embeddings.create(
    input="Hello world",
    model="qwen3-0.6b"  # Your configured model
)
```
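One convenient migration pattern (an assumption, not something the project prescribes) is to select the backend through environment variables, so the same code runs against either OpenAI or the self-hosted API:

```python
import os

from openai import OpenAI

# Hypothetical variable names; set EMBEDDINGS_BASE_URL to your Space URL
# to switch backends without touching the call sites.
client = OpenAI(
    base_url=os.environ.get("EMBEDDINGS_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "not-required"),
)

response = client.embeddings.create(
    input="Hello world",
    model=os.environ.get("EMBEDDINGS_MODEL", "text-embedding-3-small"),
)
```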
**Compatibility Matrix:**

| Feature | Supported | Notes |
|---|---|---|
| `input` (string) | ✅ | Converted to a list internally |
| `input` (list) | ✅ | Batch processing |
| `model` parameter | ✅ | Use configured model IDs |
| `encoding_format` | Partial | Always returns float |
| `dimensions` | ❌ | Returns the model's native dimensions |
| `user` parameter | ❌ | Ignored |
⚠️ Note: This is a development API. For production deployment, use a production-grade serving stack such as Hugging Face TEI on AWS, GCP, or any cloud provider of your choice.
## Contributing

Contributions are welcome. Please follow these guidelines:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License. See LICENSE for details.
## References
- Sentence Transformers Documentation: https://www.sbert.net/
- FastAPI Documentation: https://fastapi.tiangolo.com/
- OpenAI API Specification: https://platform.openai.com/docs/api-reference/embeddings
- MTEB Benchmark: https://huggingface.co/spaces/mteb/leaderboard
- Hugging Face Spaces: https://huggingface.co/docs/hub/spaces
## Support

- Issues: [GitHub Issues](https://github.com/fahmiaziz98/unified-embedding-api/issues)
- Discussions: [GitHub Discussions](https://github.com/fahmiaziz98/unified-embedding-api/discussions)
- Live Demo: [Hugging Face Space](https://huggingface.co/spaces/fahmiaziz/api-embedding)
**Maintained by:** Fahmi Aziz
**Project Status:** Active Development

✨ *"Unify your embeddings. Simplify your AI stack."*