aaaksenova committed on
Commit a5f3f3c · verified · 1 Parent(s): 14eec9d

Update README.md

Files changed (1): README.md (+55 -3)
@@ -1,3 +1,55 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - bg
+ - cs
+ - nl
+ - en
+ - fi
+ - fr
+ - de
+ - el
+ - it
+ - pl
+ - pt
+ - es
+ - sv
+ - code
+ tags:
+ - multilingual
+ - base-model
+ - transformer
+ - decoder-only
+ - LLM
+ - smol
+ - MiniLingua
+ ---
+
+ # MiniLingua-1b
+
+ **MiniLingua-1b** is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-token SentencePiece tokenizer supporting the following languages:
+
+ Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, Swedish, and programming code.
+
+ ### Training Details
+
+ MiniLingua-1b was trained on a 1-trillion-token corpus that includes:
+ - [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
+ - [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
+ - Curated high-quality multilingual and code data from public sources
+
+ The model was trained for 1.5 epochs over 12 days on the [LUMI supercomputer](https://lumi-supercomputer.eu/), using:
+ - 256 AMD MI250X GPUs
+ - bf16 precision
+ - the Megatron-LM library
+ - data parallelism
+
+ ### Intended Use
+
+ This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language understanding tasks in low- and high-resource European languages.
+
+ ### License
+
+ Apache 2.0: free for research and commercial use, subject to the license terms.
+
+ ---