Youngjoon Jang
yjoonjang
12 followers · 19 following
https://yjoonjang.github.io/
AI & ML interests
Information Retrieval (IR), Retrieval-Augmented Generation (RAG)
Recent Activity
New activity 10 days ago in yjoonjang/colbert-ko-v1.0: Produce ColBERT-KO Evaluation Results
Reacted to Norod78's post 12 days ago:
Multilingual Tokenization Showdown: Analyzing 12 LLM Tokenizers Across 204 Languages.

First, I've created a dataset with Wikipedia's "Cat" article text in 272 languages: https://huggingface.co/datasets/Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "Characters per token" ratio and the "Word per token" ratio. The higher these ratios are, the more information each token represents on average for that language (perhaps allowing the LLM to learn more per parameter if trained on a dataset in that language).

You can see a slideshow summary of the results here: https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly. I've made the code available on GitHub, so you can re-create the raw results jsonl with this repo: https://github.com/Norod/wikicat-tokenizer-eval

Post on X: https://x.com/Norod78/status/1984366900550266999
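The two ratios described in the post can be illustrated with a minimal sketch. This is not the code from the linked repository: the function names, the whitespace-based word count, the choice to count whitespace as characters, and the toy tokenizer (which stands in for a real LLM tokenizer) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def token_ratios(text: str, tokenize: Callable[[str], List[str]]) -> Dict[str, float]:
    """Return characters-per-token and words-per-token for one text.

    Assumptions: characters include whitespace, and words are
    whitespace-separated chunks. The post does not state either choice.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    words = text.split()
    return {
        "chars_per_token": len(text) / len(tokens),
        "words_per_token": len(words) / len(tokens),
    }


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into pieces of up to 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = "The cat is a small domesticated carnivorous mammal"
ratios = token_ratios(sample, toy_tokenize)
print(ratios)
```

To compare real tokenizers, `toy_tokenize` would be replaced by a callable that wraps each tokenizer under test, and the same text would be scored once per tokenizer and language.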
Liked a dataset 25 days ago: whybe-choi/trec-dl-2020
yjoonjang's datasets (32)
yjoonjang/squad_kor_v1 • Viewer • Updated Jun 7 • 12.5k • 26
yjoonjang/toolret • Viewer • Updated Apr 7 • 66.5k • 13 • 1
yjoonjang/squad_v2_ragsoluted_qwen • Viewer • Updated Mar 29 • 13.1k • 12
yjoonjang/nanoscidocs_ragsoluted_qwen • Viewer • Updated Mar 29 • 2.5k • 22
yjoonjang/boolq_ragsoluted_qwen • Viewer • Updated Mar 29 • 9.48k • 7
yjoonjang/belebele_ragsoluted_qwen • Viewer • Updated Mar 29 • 2.29k • 20
yjoonjang/kure-eng-kor-dev • Viewer • Updated Mar 11 • 2.9k • 13
yjoonjang/kure-kor-eng-dev • Viewer • Updated Mar 11 • 2.9k • 10
yjoonjang/nanodbpedia_ragsoluted • Viewer • Updated Feb 12 • 7.25k • 9
yjoonjang/nanoscidocs_ragsoluted • Viewer • Updated Feb 12 • 2.5k • 13
yjoonjang/nanonq_ragsoluted • Viewer • Updated Feb 12 • 5.14k • 14
yjoonjang/xquad_ragsoluted • Viewer • Updated Feb 11 • 2.62k • 14
yjoonjang/xquad • Viewer • Updated Feb 11 • 2.62k • 17
yjoonjang/nanomsmarco_ragsoluted • Viewer • Updated Feb 11 • 5.14k • 24
yjoonjang/nanofiqa_ragsoluted • Viewer • Updated Feb 11 • 4.77k • 47
yjoonjang/mlqa_ragsoluted • Viewer • Updated Feb 11 • 33.1k • 9
yjoonjang/boolq_ragsoluted • Viewer • Updated Feb 10 • 9.48k • 23
yjoonjang/nq_simplified • Viewer • Updated Feb 10 • 47k • 21
yjoonjang/belebele_ragsoluted • Viewer • Updated Feb 10 • 2.29k • 40
yjoonjang/belebele • Viewer • Updated Feb 10 • 2.29k • 9
yjoonjang/ms_marco_ragsoluted • Viewer • Updated Feb 10 • 29.5k • 4
yjoonjang/squad_v2_ragsoluted • Viewer • Updated Feb 10 • 13.1k • 8 • 1
yjoonjang/squad_v2_ragsoluted_v2 • Viewer • Updated Feb 10 • 13.1k • 42
yjoonjang/mlqa • Viewer • Updated Feb 10 • 33.1k • 17
yjoonjang/sciq • Viewer • Updated Feb 10 • 14k • 15
yjoonjang/boolq • Viewer • Updated Feb 10 • 9.48k • 10
yjoonjang/ifqa • Viewer • Updated Feb 10 • 2.1k • 4
yjoonjang/ms_marco • Viewer • Updated Feb 9 • 29.5k • 17
yjoonjang/musique • Viewer • Updated Feb 9 • 15.2k • 17
yjoonjang/squad_v2 • Viewer • Updated Feb 8 • 13.1k • 16