---
title: Api Embedding
emoji: 🧩
colorFrom: green
colorTo: purple
sdk: docker
pinned: false
---
# Unified Embedding API

🧩 A self-hosted embedding service for dense, sparse, and reranking models with an OpenAI-compatible API.

## Overview
The Unified Embedding API is a modular, self-hosted solution designed to simplify the development and management of embedding models for Retrieval-Augmented Generation (RAG) and semantic search applications. Built on FastAPI and Sentence Transformers, this API provides a unified interface for dense embeddings, sparse embeddings (SPLADE), and document reranking through CrossEncoder models.
**Key Differentiation:** Unlike traditional embedding services that require separate infrastructure for each model type, this API consolidates all embedding operations into a single, configurable endpoint with OpenAI-compatible responses.
## Project Motivation
During the development of RAG and agentic systems for production environments and portfolio projects, several operational challenges emerged:
- Development Environment Overhead: Each experiment required setting up isolated environments with PyTorch, Transformers, and associated dependencies (often 5-10GB per environment)
- Model Experimentation Costs: Testing different models for optimal precision, MRR, and recall metrics necessitated downloading multiple model versions, consuming significant disk space and compute resources
- Hardware Limitations: Running models locally on CPU-only machines frequently resulted in thermal throttling and system instability
**Solution Approach:** After evaluating Hugging Face's Text Embeddings Inference (TEI), the need for a more flexible, configuration-driven solution became apparent. This project addresses these challenges by:
- Providing a single API endpoint that can serve multiple model types
- Enabling model switching through configuration files without code changes
- Leveraging Hugging Face Spaces for free, serverless hosting
- Maintaining compatibility with OpenAI's client libraries for seamless integration
## Technical Motivation

### Architecture Decisions

#### 1. Framework Selection: SentenceTransformers + FastAPI

SentenceTransformers was chosen as the core embedding library for several technical reasons:
- Unified Model Interface: Provides consistent APIs across diverse model architectures (BERT, RoBERTa, SPLADE, CrossEncoders)
- Model Ecosystem: Direct compatibility with 5,000+ pre-trained models on Hugging Face Hub
FastAPI serves as the web framework due to:
- Async-First Architecture: Non-blocking I/O operations critical for handling concurrent embedding requests
- Automatic API Documentation: OpenAPI/Swagger generation reduces documentation overhead
- Type Safety: Pydantic integration ensures request validation at the schema level
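To make the type-safety point concrete, here is a minimal sketch of an OpenAI-style embedding request model, assuming Pydantic v2 and Python 3.10+. This is illustrative only; the project's actual schemas live in `src/models/schemas/requests.py`:

```python
# Hypothetical sketch of a schema-validated request model; the real
# models are defined in src/models/schemas/requests.py.
from pydantic import BaseModel, Field


class EmbeddingRequest(BaseModel):
    input: str | list[str] = Field(..., description="Text(s) to embed")
    model: str = Field(..., description="Configured model identifier")
    encoding_format: str = "float"
```

With a model like this, FastAPI rejects malformed payloads with a 422 response before they ever reach an embedding model.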
#### 2. Hosting Strategy: Hugging Face Spaces
Deploying on Hugging Face Spaces provides several operational advantages:
- Zero infrastructure cost for CPU-based workloads (2 vCPU, 16 GB RAM)
- Eliminates need for dedicated VPS or cloud compute instances
- No egress fees for model weight downloads from HF Hub
- Built-in CI/CD through git-based deployments
- Easy transition to paid GPU instances for larger models
- Native support for Docker-based deployments
## Features

### Core Capabilities
- Multi-Model Support: Serve dense embeddings (transformers), sparse embeddings (SPLADE), and reranking models (CrossEncoders) from a single API
- OpenAI Compatibility: Drop-in replacement for OpenAI's embedding API with client library support
- Configuration-Driven: Switch models through YAML configuration without code modifications
- Batch Processing: Automatic optimization for single and batch requests
- Type Safety: Full Pydantic validation with OpenAPI schema generation
- Async Operations: Non-blocking request handling with FastAPI's async/await
## Architecture

### System Components

```
┌─────────────────────────────────────────────────────────┐
│                     FastAPI Server                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │  Embeddings  │  │  Reranking   │  │    Models    │   │
│  │   Endpoint   │  │   Endpoint   │  │   Endpoint   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                     Model Manager                       │
│  • Configuration Loading                                │
│  • Model Lifecycle Management                           │
│  • Thread-Safe Model Access                             │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                Embedding Implementations                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    Dense     │  │    Sparse    │  │  Reranking   │   │
│  │(Transformer) │  │   (SPLADE)   │  │(CrossEncoder)│   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```
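The Model Manager's three responsibilities can be sketched in a few lines. The names below are hypothetical and only illustrate the pattern; the real implementation is `src/core/manager.py`:

```python
# Hypothetical sketch of the Model Manager pattern: load models.yaml once,
# instantiate models lazily, and guard the cache with a lock.
import threading

import yaml
from sentence_transformers import SentenceTransformer


class ModelManager:
    def __init__(self, config_path: str):
        with open(config_path) as f:
            self._config = yaml.safe_load(f)["models"]  # configuration loading
        self._models: dict[str, SentenceTransformer] = {}
        self._lock = threading.Lock()

    def get(self, model_id: str) -> SentenceTransformer:
        # Thread-safe, lazy model access: the first request loads the
        # model; later requests reuse the cached instance.
        with self._lock:
            if model_id not in self._models:
                self._models[model_id] = SentenceTransformer(
                    self._config[model_id]["name"], device="cpu"
                )
            return self._models[model_id]
```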
### Project Structure

```
unified-embedding-api/
├── src/
│   ├── api/                      # API layer
│   │   ├── dependencies.py       # Dependency injection
│   │   └── routes/
│   │       ├── embeddings.py     # Dense/sparse endpoints
│   │       ├── model_list.py     # Model management
│   │       ├── health.py         # Health checks
│   │       └── rerank.py         # Reranking endpoint
│   ├── core/                     # Business logic
│   │   ├── base.py               # Abstract base classes
│   │   ├── config.py             # Configuration models
│   │   ├── exceptions.py         # Custom exceptions
│   │   └── manager.py            # Model lifecycle management
│   ├── models/                   # Domain models
│   │   ├── embeddings/
│   │   │   ├── dense.py          # Dense embedding implementation
│   │   │   ├── sparse.py         # Sparse embedding implementation
│   │   │   └── rank.py           # Reranking implementation
│   │   └── schemas/
│   │       ├── common.py         # Shared schemas
│   │       ├── requests.py       # Request models
│   │       └── responses.py      # Response models
│   ├── config/
│   │   ├── settings.py           # Application settings
│   │   └── models.yaml           # Model configuration
│   └── utils/
│       ├── logger.py             # Logging configuration
│       └── validators.py         # Validation (kwargs, tokens, etc.)
├── app.py                        # Application entry point
├── requirements.txt              # Development dependencies
└── Dockerfile                    # Container definition
```
## Quick Start

### Deployment on Hugging Face Spaces

**Prerequisites:**
- Hugging Face account
- Git installed locally
**Steps:**

1. **Duplicate Space**
   - Navigate to [fahmiaziz/api-embedding](https://huggingface.co/spaces/fahmiaziz/api-embedding)
   - Click the three-dot menu → "Duplicate this Space"

2. **Configure Environment**
   - In Space settings, add `HF_TOKEN` as a repository secret (for private model access)
   - Ensure Space visibility is set to "Public"

3. **Clone Repository**

   ```bash
   git clone https://huggingface.co/spaces/YOUR_USERNAME/api-embedding
   cd api-embedding
   ```

4. **Configure Models**

   Edit `src/config/models.yaml`:

   ```yaml
   models:
     custom-model:
       name: "organization/model-name"
       type: "embeddings"  # Options: embeddings, sparse-embeddings, rerank
   ```

5. **Deploy Changes**

   ```bash
   git add src/config/models.yaml
   git commit -m "Configure custom models"
   git push
   ```

6. **Access API**
   - Click ⋯ → "Embed this Space" → copy the Direct URL
   - Base URL: `https://YOUR_USERNAME-api-embedding.hf.space`
   - Documentation: `https://YOUR_USERNAME-api-embedding.hf.space/docs`
### Local Development (NOT RECOMMENDED)

**System Requirements:**
- Python 3.10+
- 8GB RAM minimum
- 10GB+ disk space
**Setup:**

```bash
# Clone repository
git clone https://github.com/fahmiaziz98/unified-embedding-api.git
cd unified-embedding-api

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Start server
python app.py
```
The server will be available at `http://localhost:7860`.
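A quick smoke test once the server is up, using the `/health` endpoint listed in the API reference below:

```python
import requests

# Verify the local server responds before sending embedding requests
resp = requests.get("http://localhost:7860/health")
print(resp.status_code, resp.text)
```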
### Docker Deployment

```bash
# Build image
docker build -t unified-embedding-api .

# Run container
docker run -p 7860:7860 unified-embedding-api
```
## Usage

### Native API (requests)
```python
import requests

BASE_URL = "https://fahmiaziz-api-embedding.hf.space/api/v1"

# Generate embeddings
response = requests.post(
    f"{BASE_URL}/embeddings",
    json={
        "input": "Natural language processing",
        "model": "qwen3-0.6b"
    }
)

data = response.json()
embedding = data["data"][0]["embedding"]
print(f"Embedding dimensions: {len(embedding)}")
```
### OpenAI Client Integration
The API implements OpenAI's embedding API specification, enabling direct integration with OpenAI's Python client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required"  # Placeholder required by the client
)

# Single text embedding
response = client.embeddings.create(
    input="Text to embed",
    model="qwen3-0.6b"
)

embedding_vector = response.data[0].embedding
```
**Async Operations:**
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required"
)

async def generate_embeddings(texts: list[str]):
    response = await client.embeddings.create(
        input=texts,
        model="qwen3-0.6b"
    )
    return [item.embedding for item in response.data]

# Usage in an async context
embeddings = await generate_embeddings(["text1", "text2"])
```
### Document Reranking
```python
import requests

BASE_URL = "https://fahmiaziz-api-embedding.hf.space/api/v1"

response = requests.post(
    f"{BASE_URL}/rerank",
    json={
        "query": "machine learning frameworks",
        "documents": [
            "TensorFlow is a comprehensive ML platform",
            "React is a JavaScript UI library",
            "PyTorch provides flexible neural networks"
        ],
        "model": "bge-v2-m3",
        "top_k": 2
    }
)

results = response.json()["results"]
for result in results:
    print(f"Score: {result['score']:.3f} - {result['text']}")
```
## API Reference

### Endpoints
| Endpoint | Method | Description | OpenAI Compatible |
|---|---|---|---|
| `/api/v1/embeddings` | POST | Generate embeddings | Yes |
| `/api/v1/embed_sparse` | POST | Generate sparse embeddings | No |
| `/api/v1/rerank` | POST | Rerank documents | No |
| `/api/v1/models` | GET | List available models | Partial |
| `/health` | GET | Health check | No |
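For example, the models endpoint can be used to discover which model IDs are configured before making requests. The exact response shape is not documented here, so this sketch simply prints it:

```python
import requests

# List the configured model IDs served by the API
resp = requests.get("https://fahmiaziz-api-embedding.hf.space/api/v1/models")
print(resp.json())
```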
### Request Format

**Embeddings (OpenAI-compatible):**

```json
{
  "input": "text" | ["text1", "text2"],
  "model": "model-identifier",
  "encoding_format": "float"
}
```

**Sparse Embeddings:**

```json
{
  "input": "text" | ["text1", "text2"],
  "model": "splade-model-id"
}
```
**Reranking:**

```json
{
  "query": "search query",
  "documents": ["doc1", "doc2"],
  "model": "reranker-id",
  "top_k": 10
}
```
### Response Format

**Standard Embedding Response:**

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.123, -0.456, ...],
      "index": 0
    }
  ],
  "model": "qwen3-0.6b",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
```
## Configuration

### Model Configuration

The default configuration is optimized for the free CPU tier (2 vCPU / 16 GB RAM). See the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for model recommendations and memory-usage reference.

Edit `src/config/models.yaml` to add or modify models:
```yaml
models:
  # Dense embedding model
  custom-dense:
    name: "sentence-transformers/all-MiniLM-L6-v2"
    type: "embeddings"

  # Sparse embedding model
  custom-sparse:
    name: "prithivida/Splade_PP_en_v1"
    type: "sparse-embeddings"

  # Reranking model
  custom-reranker:
    name: "BAAI/bge-reranker-base"
    type: "rerank"
```
**Model Type Reference:**

| Type | Description | Use Case |
|---|---|---|
| `embeddings` | Dense vector embeddings | Semantic search, similarity |
| `sparse-embeddings` | Sparse vectors (SPLADE) | Keyword + semantic hybrid |
| `rerank` | CrossEncoder scoring | Precision reranking |
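To make the table concrete, here is a sketch of the common two-stage pattern: dense embeddings (type `embeddings`) drive first-stage retrieval in a vector store, and the CrossEncoder (type `rerank`) re-scores the retrieved candidates. The first stage is stubbed out with a fixed candidate list, and the model ID refers to the `custom-reranker` entry configured above:

```python
import requests

BASE_URL = "https://YOUR_USERNAME-api-embedding.hf.space/api/v1"

query = "machine learning frameworks"

# Stage 1 (stubbed): in a real pipeline these candidates come from a
# vector-store search over dense embeddings (type: embeddings).
candidates = [
    "TensorFlow is a comprehensive ML platform",
    "React is a JavaScript UI library",
    "PyTorch provides flexible neural networks",
]

# Stage 2: precision reranking with a CrossEncoder (type: rerank).
results = requests.post(
    f"{BASE_URL}/rerank",
    json={
        "query": query,
        "documents": candidates,
        "model": "custom-reranker",  # ID from models.yaml above
        "top_k": 2,
    },
).json()["results"]

for item in results:
    print(item["score"], item["text"])
```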
⚠️ If you plan to use larger models such as Qwen3-Embedding-8B, please upgrade your Space.
### Application Settings

Configure through the `src/config/settings.py` file:

```bash
# Application
APP_NAME="Unified Embedding API"
VERSION="3.0.0"

# Server
HOST=0.0.0.0
PORT=7860  # don't change this port
WORKERS=1

# Models
MODEL_CONFIG_PATH=src/config/models.yaml
PRELOAD_MODELS=true
DEVICE=cpu

# Logging
LOG_LEVEL=INFO
```
## Performance Optimization

### Recommended Practices

**Batch Processing**
- Always send multiple texts in a single request when possible (see the batching sketch after this list)
- Batch sizes of 16-32 provide the best throughput/latency balance

**Normalization**
- Enable `normalize_embeddings` for cosine similarity operations
- Reduces downstream computation in vector databases

**Model Selection**
- Dense models: best for semantic similarity
- Sparse models: better for keyword matching + semantics
- Reranking: use as a second stage after initial retrieval
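As an illustration of the batching advice, a sketch that chunks a corpus into batches of 32 before calling the API (the helper name is hypothetical):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://fahmiaziz-api-embedding.hf.space/api/v1",
    api_key="not-required",
)

def embed_corpus(texts: list[str], batch_size: int = 32) -> list[list[float]]:
    """Embed a corpus in chunks instead of one request per text."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model="qwen3-0.6b")
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```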
## Migration from OpenAI
Replace OpenAI embedding calls with minimal code changes:
**Before (OpenAI):**

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-small"
)
```
**After (Self-hosted):**

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-space.hf.space/api/v1",
    api_key="not-required"
)

response = client.embeddings.create(
    input="Hello world",
    model="qwen3-0.6b"  # Your configured model
)
```
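One convenient migration pattern (an assumption, not something the project prescribes) is to select the backend through environment variables, so the same code runs against either OpenAI or the self-hosted API:

```python
import os

from openai import OpenAI

# Hypothetical variable names; set EMBEDDINGS_BASE_URL to your Space URL
# to switch backends without touching the call sites.
client = OpenAI(
    base_url=os.environ.get("EMBEDDINGS_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "not-required"),
)

response = client.embeddings.create(
    input="Hello world",
    model=os.environ.get("EMBEDDINGS_MODEL", "text-embedding-3-small"),
)
```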
**Compatibility Matrix:**

| Feature | Supported | Notes |
|---|---|---|
| `input` (string) | ✅ | Converted to a list internally |
| `input` (list) | ✅ | Batch processing |
| `model` parameter | ✅ | Use configured model IDs |
| `encoding_format` | Partial | Always returns float |
| `dimensions` | ❌ | Returns the model's native dimensions |
| `user` parameter | ❌ | Ignored |
⚠️ Note: This is a development API. For production deployment, use a production-grade serving stack such as Hugging Face TEI on AWS, GCP, or any cloud provider of your choice.
## Contributing

Contributions are welcome. Please follow these guidelines:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License. See LICENSE for details.
## References
- Sentence Transformers Documentation: https://www.sbert.net/
- FastAPI Documentation: https://fastapi.tiangolo.com/
- OpenAI API Specification: https://platform.openai.com/docs/api-reference/embeddings
- MTEB Benchmark: https://huggingface.co/spaces/mteb/leaderboard
- Hugging Face Spaces: https://huggingface.co/docs/hub/spaces
## Support

- Issues: [GitHub Issues](https://github.com/fahmiaziz98/unified-embedding-api/issues)
- Discussions: [GitHub Discussions](https://github.com/fahmiaziz98/unified-embedding-api/discussions)
- Live Demo: [Hugging Face Space](https://huggingface.co/spaces/fahmiaziz/api-embedding)
**Maintained by:** Fahmi Aziz
**Project Status:** Active Development

✨ *"Unify your embeddings. Simplify your AI stack."*