---
language:
- ka
tags:
- kenlm
- ngram
- georgian
- language-model
- asr
license: gpl-3.0
model-index:
- name: Georgian KenLM Language Model
  results: []
---

# 🦉 Georgian KenLM Language Model (3-gram)

This is a **KenLM 3-gram language model** trained on Georgian (ქართული) text data, intended for use in **automatic speech recognition (ASR)** and other **language modeling** tasks.

---

## 🧾 Model Details

- **Language**: Georgian (`ka`)
- **Model Type**: KenLM n-gram
- **n-gram order**: 3
- **Format**: `.arpa`
- **Tooling**: [KenLM](https://github.com/kpu/kenlm)

---

## 📂 Files

- `ge_model9.arpa`: ARPA plaintext format

---

## 📚 Training Data

The model was trained on a curated collection of Georgian text from various domains:

- News articles
- Subtitles
- Books and web content

The data was cleaned, tokenized on whitespace, and normalized to standard Georgian orthography.

---

## 💬 Intended Use

This model is intended for:

- **Beam search decoding** in ASR systems (e.g., Whisper, DeepSpeech, Vosk)
- **Scoring and reranking** ASR hypotheses
- **Basic text modeling** or **spelling correction** in Georgian

### 🧪 Example Usage

```python
import kenlm


def transliterate_georgian(text):
    # Map each Georgian letter to a single Latin character.
    # Characters not in the table (spaces, digits, punctuation) pass through unchanged.
    georgian_to_latin = {
        'ა': 'a', 'ბ': 'b', 'გ': 'g', 'დ': 'd', 'ე': 'e', 'ვ': 'v',
        'ზ': 'z', 'თ': 'T', 'ი': 'i', 'კ': 'k', 'ლ': 'l', 'მ': 'm',
        'ნ': 'n', 'ო': 'o', 'პ': 'p', 'ჟ': 'J', 'რ': 'r', 'ს': 's',
        'ტ': 't', 'უ': 'u', 'ფ': 'f', 'ქ': 'q', 'ღ': 'R', 'ყ': 'y',
        'შ': 'S', 'ჩ': 'C', 'ც': 'c', 'ძ': 'Z', 'წ': 'w', 'ჭ': 'W',
        'ხ': 'x', 'ჯ': 'j', 'ჰ': 'h',
    }
    return ''.join(georgian_to_latin.get(char, char) for char in text)


model = kenlm.Model("ge_model9.arpa")

sentence = "ეს არის ტესტი"  # "This is a test"

# score() returns the log10 probability of the whole sentence,
# including the begin- and end-of-sentence tokens.
print(model.score(transliterate_georgian(sentence), bos=True, eos=True))
```

---

### Citation

```none
@misc{georgian-kenlm,
  title={Georgian KenLM Language Model},
  author={Giorgi G},
  year={2025},
  howpublished={\url{https://huggingface.co/psyfreak/GEO-KenLM}}
}
```
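
### Example: Reranking ASR Hypotheses

The "scoring and reranking" use listed under Intended Use can be sketched roughly as follows. This is a minimal illustration, not part of the released model: the helper names `rerank` and `load_kenlm_scorer`, the `lm_weight` parameter, and the simple additive score combination are all assumptions made for this example. Only `kenlm.Model(...)` and `model.score(..., bos=True, eos=True)` come from the usage example in this card. If your pipeline transliterates text before scoring, as that example does, apply the same preprocessing inside the scorer.

```python
from typing import Callable, List, Tuple


def rerank(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Sort (text, asr_score) pairs by asr_score + lm_weight * lm_score(text).

    Assumes higher scores are better for both the ASR system and the LM
    (e.g. log-probabilities). lm_weight is a hypothetical tuning knob.
    """
    rescored = [
        (text, asr_score + lm_weight * lm_score(text))
        for text, asr_score in hypotheses
    ]
    # Best combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def load_kenlm_scorer(path: str) -> Callable[[str], float]:
    """Wrap a KenLM model as a text -> log10 probability function."""
    import kenlm  # imported lazily so rerank() is usable without kenlm installed

    model = kenlm.Model(path)
    return lambda text: model.score(text, bos=True, eos=True)
```

To use it with this model, build a scorer with `load_kenlm_scorer("ge_model9.arpa")` and pass it to `rerank` together with the n-best list from your decoder. Note that KenLM scores are base-10 log-probabilities, so `lm_weight` usually needs tuning on held-out data when the ASR scores use a different scale.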