---
language:
- ka
tags:
- kenlm
- ngram
- georgian
- language-model
- asr
license: gpl-3.0
model-index:
- name: Georgian KenLM Language Model
  results: []
---

# 🦉 Georgian KenLM Language Model (3-gram)

This is a **KenLM 3-gram language model** trained on Georgian (ქართული) text data, intended for use in **automatic speech recognition (ASR)** and other **language modeling** tasks.

---

## 🧾 Model Details

- **Language**: Georgian (`ka`)
- **Model Type**: KenLM n-gram
- **n-gram size**: 3-gram
- **Format**: `.arpa`
- **Tooling**: [KenLM](https://github.com/kpu/kenlm)

---

## 📂 Files

- `ge_model9.arpa` – ARPA plaintext format

---

## 📚 Training Data

The model was trained on a curated collection of Georgian text from various domains:

- News articles
- Subtitles
- Books and web content

The data was cleaned, normalized to standard Georgian orthography, and tokenized on whitespace.
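
The exact cleaning rules are not documented in this card. As a rough illustration only, the sketch below keeps the 33 modern Georgian letters (U+10D0 to U+10F0), replaces everything else with a space, and splits on whitespace. The names `normalize_line` and `tokenize` are hypothetical, and the real pipeline may have handled digits, punctuation, or foreign words differently.

```python
import re
from typing import List

# Assumption: anything outside the modern Georgian alphabet (ა..ჰ) and
# whitespace is treated as noise. The actual training pipeline may differ.
_NON_GEORGIAN = re.compile(r"[^\u10D0-\u10F0\s]+")


def normalize_line(line: str) -> str:
    """Replace non-Georgian characters with spaces and collapse whitespace."""
    cleaned = _NON_GEORGIAN.sub(" ", line)
    return " ".join(cleaned.split())


def tokenize(line: str) -> List[str]:
    """Whitespace tokenization of a normalized line."""
    return normalize_line(line).split()


if __name__ == "__main__":
    print(tokenize("ეს  არის, ტესტი!"))
```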

---

## 💬 Intended Use

This model is intended for:

- **Beam search decoding** in ASR systems (e.g., Whisper, DeepSpeech, Vosk)
- **Scoring and reranking** ASR hypotheses
- **Basic text modeling** or **spelling correction** in Georgian
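
Reranking can be sketched as a weighted sum of each hypothesis's ASR score and its language model score. The function below is a minimal illustration, not part of this repository: `rerank`, `lm_score`, and `lm_weight` are hypothetical names, and the weight of `0.5` is an arbitrary starting point that would need tuning on held-out data. The scorer is passed in as a callable so the logic stays independent of how the model is loaded.

```python
from typing import Callable, List, Tuple


def rerank(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Sort (text, asr_score) pairs by asr_score + lm_weight * lm_score(text).

    Both scores are assumed to be log-domain values where higher is better.
    Returns (text, combined_score) pairs, best first.
    """
    rescored = [
        (text, asr_score + lm_weight * lm_score(text))
        for text, asr_score in hypotheses
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# With this model, the scorer could wrap the calls shown in the
# Example Usage section, for instance:
#   lm_score = lambda s: model.score(transliterate_georgian(s), bos=True, eos=True)
```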

### 🧪 Example Usage

```python
import kenlm

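# One-to-one mapping from Georgian letters to Latin characters.
# Uppercase letters keep distinct sounds apart (e.g. 'თ' -> 'T', 'ტ' -> 't').
GEORGIAN_TO_LATIN = {
    'ა': 'a', 'ბ': 'b', 'გ': 'g', 'დ': 'd', 'ე': 'e', 'ვ': 'v', 'ზ': 'z', 'თ': 'T', 'ი': 'i',
    'კ': 'k', 'ლ': 'l', 'მ': 'm', 'ნ': 'n', 'ო': 'o', 'პ': 'p', 'ჟ': 'J', 'რ': 'r', 'ს': 's',
    'ტ': 't', 'უ': 'u', 'ფ': 'f', 'ქ': 'q', 'ღ': 'R', 'ყ': 'y', 'შ': 'S', 'ჩ': 'C', 'ც': 'c',
    'ძ': 'Z', 'წ': 'w', 'ჭ': 'W', 'ხ': 'x', 'ჯ': 'j', 'ჰ': 'h',
}


def transliterate_georgian(text):
    """Replace each Georgian letter with its Latin counterpart; keep other characters as-is."""
    return ''.join(GEORGIAN_TO_LATIN.get(char, char) for char in text)


# Load the ARPA file and score a sentence after transliterating it.
model = kenlm.Model("ge_model9.arpa")
sentence = "ეს არის ტესტი"
# score() returns the log10 probability of the sentence, including <s> and </s>.
print(model.score(transliterate_georgian(sentence), bos=True, eos=True))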
```
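
The value printed above is a log10 probability, so longer sentences naturally score lower. To compare sentences of different lengths, the score can be converted to a per-token perplexity. The helper below is a minimal sketch with a hypothetical name, `perplexity_from_log10`; it counts one extra token for the end-of-sentence marker. The `kenlm` Python module also appears to expose a `perplexity` method on the model, which should give a comparable value.

```python
def perplexity_from_log10(log10_prob: float, num_words: int) -> float:
    """Convert a sentence-level log10 probability to per-token perplexity.

    num_words is the number of whitespace-separated words in the sentence;
    one extra token is added for the end-of-sentence marker </s>.
    """
    return 10.0 ** (-log10_prob / (num_words + 1))


# Usage with the model above (assumes `model` and `transliterate_georgian` exist):
#   text = transliterate_georgian("ეს არის ტესტი")
#   ppl = perplexity_from_log10(model.score(text, bos=True, eos=True), len(text.split()))
```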
---

## Citation

```bibtex
@misc{georgian-kenlm,
  title={Georgian KenLM Language Model},
  author={Giorgi G},
  year={2025},
  howpublished={\url{https://huggingface.co/psyfreak/GEO-KenLM}}
}
```