---
license: apache-2.0
language:
- bg
- cs
- nl
- en
- fi
- fr
- de
- el
- it
- pl
- pt
- es
- sv
- code
tags:
- multilingual
- base-model
- transformer
- decoder-only
- LLM
- smol
- MiniLingua
---

# MiniLingua-1b

**MiniLingua-1b** is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-vocabulary SentencePiece tokenizer supporting the following languages:

Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, Swedish, and programming code.
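
The tokenizer can be inspected on its own. Below is a minimal sketch using 🤗 Transformers; the repository id `MiniLingua/MiniLingua-1b` is a placeholder assumption, not a confirmed path.

```python
# Hedged sketch: load and inspect the multilingual tokenizer.
# The repo id is a placeholder assumption, not a confirmed path.
from transformers import AutoTokenizer

repo_id = "MiniLingua/MiniLingua-1b"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)

print(len(tokenizer))  # expected to be on the order of 128k entries

# One shared SentencePiece vocabulary covers all 13 languages plus code.
for text in ["Bonjour tout le monde", "Hyvää huomenta", "def add(a, b):"]:
    print(text, "->", tokenizer.tokenize(text))
```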

### Training Details

MiniLingua-1b was trained on a 1 trillion token corpus that includes:
- [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
- Curated high-quality multilingual and code data from public sources

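Both public corpora are hosted on the Hugging Face Hub and can be sampled with the `datasets` library. The sketch below is illustrative only; the config name, `data_dir`, and column names are assumptions about those datasets' published layouts, not a reproduction of the actual training mixture.

```python
# Hedged sketch: stream small samples from the two public corpora.
# Config name, data_dir, and column names are assumptions about the
# datasets' layouts; the real training mixture is not reproduced here.
from datasets import load_dataset

# FineWeb-2 is organized into per-language configs (e.g. French text).
fineweb_fr = load_dataset(
    "HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True
)

# The Stack is organized into per-language subdirectories.
stack_py = load_dataset(
    "bigcode/the-stack", data_dir="data/python", split="train", streaming=True
)

for example in fineweb_fr.take(2):
    print(example["text"][:80])   # assumed "text" column
for example in stack_py.take(1):
    print(example["content"][:80])  # assumed "content" column
```
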
The model was trained for 1.5 epochs over 12 days on the [LUMI supercomputer](https://lumi-supercomputer.eu/), using:
- 256 AMD MI250X GPUs
- bf16 precision
- Megatron-LM library
- Data parallelism

### Intended Use

This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language understanding tasks in low- and high-resource European languages.
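
As a base (non-instruction-tuned) model, it is best prompted as a plain text completer. A minimal generation sketch, again assuming the hypothetical repository id from above:

```python
# Minimal generation sketch for a base model (plain text completion).
# The repo id is a placeholder assumption; sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MiniLingua/MiniLingua-1b"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# Base models continue text rather than follow instructions.
inputs = tokenizer("Die Hauptstadt von Finnland ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```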

### License

Apache 2.0: free for research and commercial use, subject to the license terms.

---