---
library_name: transformers
license: apache-2.0
datasets:
- pints-ai/Expository-Prose-V1
language:
- en
---

# BPE tokenizer with byte fallback: 32k vocab

A 32,000-token BPE tokenizer with byte fallback, intended for encoder models trained with an MLM objective:

- Trained on `pints-ai/Expository-Prose-V1`; primarily intended for English text and code.
- The tokenizer is cased: "HELLO WORLD" is tokenized differently from "hello world".
- `model_max_length` ships as 1e9 so the tokenizer never truncates silently. **Set `tokenizer.model_max_length` to your model's maximum position embeddings** when training.

## Details

MLM tokenizer configuration (EN/code):

Model:
- Type: BPE with `byte_fallback`
- Vocab: 32,000 tokens; special tokens `[UNK]`=0, `[CLS]`=1, `[SEP]`=2, `[PAD]`=3, `[MASK]`=4

Pre-tokenization:
- ByteLevel (`add_prefix_space=true`, `trim_offsets=true`, `use_regex=true`)

Normalization:
- Remove null bytes (U+0000) and replacement characters (U+FFFD)
- Remove control characters (except `\t`, `\n`, `\r`)
- NFC normalization

Post-processing:
- Single: `[CLS] text [SEP]`
- Pair: `[CLS] text_a [SEP] text_b [SEP]` (type_ids: 0, 1)

Decoder:
- ByteLevel (`add_prefix_space=true`, `trim_offsets=true`, `use_regex=true`)

Key features:
- `byte_fallback=true` (unknown characters decompose to bytes instead of `[UNK]`)
- No dropout, no `continuing_subword_prefix`/`end_of_word_suffix`
- BERT-style sequence formatting with GPT-2-style byte-level encoding

Output of `print(tokenizer)` after loading with `AutoTokenizer`:

```py
PreTrainedTokenizerFast(
    name_or_path="repo-name",
    vocab_size=32000,
    model_max_length=1000000000.0,
    is_fast=True,
    padding_side='right',
    truncation_side='right',
    special_tokens={
        'bos_token': '[CLS]',
        'eos_token': '[SEP]',
        'unk_token': '[UNK]',
        'sep_token': '[SEP]',
        'pad_token': '[PAD]',
        'cls_token': '[CLS]',
        'mask_token': '[MASK]',
    },
    clean_up_tokenization_spaces=False,
    added_tokens_decoder={
        0: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    },
)
```
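## Usage

A minimal sketch of loading and using the tokenizer. `"repo-name"` is a placeholder for this repository's id, and the snippet is illustrative rather than part of the released files:

```py
from transformers import AutoTokenizer

# "repo-name" is a placeholder; replace it with this repository's id.
tokenizer = AutoTokenizer.from_pretrained("repo-name")

# Ships with model_max_length=1e9; pin it to your model's max position embeddings.
tokenizer.model_max_length = 512

# Single sequence -> [CLS] text [SEP]; byte fallback covers characters outside the vocab.
enc = tokenizer("Byte fallback handles rare characters such as ßøé漢.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))

# Sentence pair -> [CLS] text_a [SEP] text_b [SEP], with token_type_ids 0 / 1.
pair = tokenizer("first segment", "second segment")
print(pair["input_ids"])
print(pair["token_type_ids"])

# Cased: upper- and lower-case strings tokenize differently.
print(tokenizer.tokenize("HELLO WORLD") == tokenizer.tokenize("hello world"))  # False
```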