---
title: Tokeniser Py
emoji: 🌍
colorFrom: green
colorTo: gray
sdk: streamlit
sdk_version: 1.43.2
app_file: app.py
pinned: false
license: mit
short_description: Demonstrating the custom tokeniser library (tokeniser-py)
---

# tokeniser-py 🔣 - Interactive Tokenization Visualizer

**Important Links: [PyPI Main Library (tokeniser-py)](https://pypi.org/project/tokeniser-py/) | [PyPI Lite Library (tokeniser-py-lite)](https://pypi.org/project/tokeniser-py-lite/) | [Main Library GitHub (tokeniser-py)](https://github.com/Tasmay-Tibrewal/tokeniser-py) | [Lite Library GitHub (tokeniser-py-lite)](https://github.com/Tasmay-Tibrewal/tokeniser-py-lite) | [Complete repo (unchunked) - HF](https://huggingface.co/datasets/Tasmay-Tib/Tokeniser) | [Complete repo (chunked) - GitHub](https://github.com/Tasmay-Tibrewal/Tokeniser) | [Important Files - GitHub](https://github.com/Tasmay-Tibrewal/Tokeniser-imp)**

This Hugging Face Space demonstrates **tokeniser-py**, a custom tokenizer built from scratch for language model preprocessing. Unlike traditional tokenizers such as BPE (Byte Pair Encoding), tokeniser-py uses an independently developed algorithm trained on over 1 billion tokens from the SlimPajama dataset.

## 🚀 Features of this Demo

- **Interactive Tokenization**: Enter any text and see how it is broken down into tokens
- **Visual Token Representation**: Each token is displayed in a different color, with token IDs shown on hover
- **Multiple Model Options**: Choose between different tokenizer configurations (1b/0.5b models, ordered/unordered tokens)
- **Real-time Statistics**: See token count, character count, and characters-per-token ratio
- **Educational Content**: Learn about tokenization efficiency and how tokeniser-py compares to other tokenizers

## 📊 About tokeniser-py

tokeniser-py offers:

- A vocabulary of **131,072 tokens**
- Two vocabulary versions:
  - `0.5B`: Trained on validation-only data
  - `1B`: Trained on validation + test data (default)
- Tokens can be ordered or unordered (by frequency)
- Efficient token segmentation for out-of-vocabulary words using dynamic programming
- One-hot encoding support (NumPy or PyTorch)
- Customizable tokenization parameters

## 🔍 Tokenization Efficiency

When comparing tokeniser-py to standard tokenizers like GPT-2/GPT-4:

- Typical tokenizers: ~3.9 characters per token (~80 words per 100 tokens)
- tokeniser-py: ~2.52 characters per token overall, ~3.97 for alphanumeric tokens (~90 words per 100 tokens)
- tokeniser-py separates special characters from alphanumeric tokens, prioritizing semantic representation

## 💻 Usage in Python

```python
from tokeniser import Tokeniser

# Initialize tokenizer (defaults to 1b unordered model)
t = Tokeniser()
# For other models:
# t = Tokeniser(ln="0.5b", token_ordered=True)

# Tokenize text
tokens, count = t.tokenise("Your input text here.")

# Get token IDs
token_ids = t.token_ids(tokens)

# Convert to one-hot encoding (NumPy)
one_hot = t.one_hot_tokens(token_ids)
# For PyTorch:
# one_hot_torch = t.one_hot_tokens(token_ids, op='torch')
```
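The characters-per-token figures quoted above can be reproduced for your own text with the same API. Below is a minimal sketch, assuming the `tokenise` return signature shown in the usage example; the `chars_per_token` helper is illustrative and not part of the library.

```python
from tokeniser import Tokeniser

def chars_per_token(text: str) -> float:
    """Illustrative helper (not part of tokeniser-py): average characters per token for a text."""
    t = Tokeniser()  # defaults to the 1b unordered model
    tokens, count = t.tokenise(text)  # tokens list and token count, as in the usage example
    return len(text) / count if count else 0.0

# Longer, prose-like inputs should land near the ~2.52 chars/token figure reported above.
print(chars_per_token("Tokenization splits text into sub-word units for language models."))
```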
## 🔗 Resources

- [GitHub Repository](https://github.com/Tasmay-Tibrewal/tokeniser-py)
- [PyPI Package](https://pypi.org/project/tokeniser-py/)
- [Hugging Face Dataset](https://huggingface.co/datasets/Tasmay-Tib/Tokeniser)
- [GitHub Dataset (chunked)](https://github.com/Tasmay-Tibrewal/Tokeniser)
- [GitHub Implementation Files](https://github.com/Tasmay-Tibrewal/Tokeniser-imp)

## 🧠 Design Philosophy

tokeniser-py prioritizes semantic representation over token count minimization. By separating special characters from alphanumeric tokens, it leaves more of the vocabulary available for alphanumeric tokens, which improves semantic representation at the cost of slightly higher token counts.

## 🔧 Installation

```bash
pip install tokeniser-py
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference