---
library_name: transformers
license: mit
datasets:
  - data-is-better-together/fineweb-c
  - HuggingFaceFW/fineweb-edu-llama3-annotations
  - tartuNLP/fineweb-c-combined-resample
base_model:
  - jhu-clsp/mmBERT-small
language:
  - ekk
  - eng
  - lvs
  - kor
  - kin
  - ita
  - hin
  - gom
  - glg
  - fra
  - fas
  - eus
  - deu
  - dan
  - cat
  - bho
  - bak
  - arz
  - ary
  - arb
  - yor
  - vie
  - ukr
  - tur
  - tir
  - tel
  - tam
  - swe
  - spa
  - slk
  - rus
  - por
  - pbt
  - nld
  - nds
  - vls
  - srp
  - sco
  - lat
  - ind
  - gmh
  - bul
  - bre
  - bar
  - apc
  - aeb
  - ron
  - mar
  - kas
  - crh
  - zsm
  - yue
  - uzn
  - udm
  - tha
  - tat
  - som
  - sin
  - pol
  - pcm
  - npi
  - nan
  - mal
  - lug
  - lit
  - lez
  - asm
  - fil
  - cmn
  - ces
  - ben
  - ast
  - ars
  - jpn
  - guj
  - gsw
  - fin
  - tok
  - quz
  - pfl
  - pdc
  - nob
  - lij
  - hun
  - hsb
  - goh
  - fao
pipeline_tag: text-classification
---

# Multilingual Educational Content Classifier

Trained on full documents of up to 8192 tokens. The model was fine-tuned on the train split of tartuNLP/fineweb-c-combined-resample, which is itself a mix and resample of HuggingFaceFW/fineweb-edu-llama3-annotations and data-is-better-together/fineweb-c.
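
A minimal inference sketch with the 🤗 `transformers` pipeline. `MODEL_ID` is a placeholder (it is not stated in this card); replace it with this repository's Hub id. The truncation length mirrors the 8192-token document limit mentioned above.

```python
from transformers import pipeline

MODEL_ID = "your-org/your-model"  # placeholder: replace with this repository's Hub id

classifier = pipeline("text-classification", model=MODEL_ID)

text = "Photosynthesis is the process by which plants turn light into chemical energy."
# Truncate to the 8192-token document length used during training.
print(classifier(text, truncation=True, max_length=8192))
```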

## Labels

```python
{0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal', 3: 'Basic', 4: 'Good', 5: 'Excellent'}
```
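
If the checkpoint's `config.json` does not carry these names in `id2label`, the mapping can be applied by hand. A minimal sketch, again with a placeholder `MODEL_ID`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "your-org/your-model"  # placeholder: replace with this repository's Hub id

LABELS = {0: '❗ Problematic Content ❗', 1: 'None', 2: 'Minimal',
          3: 'Basic', 4: 'Good', 5: 'Excellent'}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

inputs = tokenizer(
    "An introduction to the water cycle for primary-school pupils.",
    truncation=True, max_length=8192, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits

pred = int(logits.argmax(dim=-1))
print(pred, LABELS[pred])
```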

## Classification Report

Evaluated on the development set of tartuNLP/fineweb-c-combined-resample, which is organized so that each language appears at least once.

```
              precision    recall  f1-score   support

           0       0.89      0.78      0.83       602
           1       0.65      0.88      0.75       916
           2       0.41      0.29      0.34       345
           3       0.40      0.30      0.34       179
           4       0.53      0.15      0.23       127
           5       0.55      0.39      0.45        44

    accuracy                           0.66      2213
   macro avg       0.57      0.46      0.49      2213
weighted avg       0.65      0.66      0.64      2213
```

## Confusion Matrix

```
[[471 114  10   6   0   1]
 [ 33 806  59  13   5   0]
 [ 10 204 101  28   2   0]
 [  7  72  37  53   8   2]
 [  7  35  27  28  19  11]
 [  2   7  10   6   2  17]]
```