
Building a tokenizer, block by block

Building a new tokenizer

  • To build your tokenizer you need to design all its components:

    • Normalization
    • Pre-tokenization
    • Model
    • Post-Processing
    • Decoding
  • In a fast tokenizer, all the components are gathered in the backend_tokenizer attribute, which is an instance of Tokenizer from the HuggingFace tokenizers library.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("...")  # any checkpoint with a fast tokenizer, e.g. "bert-base-uncased"
type(tokenizer.backend_tokenizer)
  • The output is tokenizers.Tokenizer, i.e. the Tokenizer class from the HuggingFace tokenizers library.

  • The main steps to create your own tokenizer:

    1. Gather a corpus
    2. Create a backend_tokenizer with HuggingFace tokenizers
    3. Load the backend_tokenizer into a HuggingFace transformers tokenizer
  • Let's try to rebuild a BERT tokenizer together!

1. Gather a corpus

from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
def get_training_corpus():
    # Yield the training texts in batches of 1,000 so the whole corpus never has to sit in memory
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

2. Build a BERT tokenizer with HuggingFace tokenizers

from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers, processors, decoders
  • Initialize a model
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
  • Define a normalizer
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace(Regex(r"[\p{Other}&&[^\n\t\r]]"), ""),
        normalizers.Replace(Regex(r"[\s]"), " "),
        normalizers.Lowercase(),
        normalizers.NFD(),
        normalizers.StripAccents(),
    ]
)
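  • As a quick sanity check (the example string below is just an illustration), you can run the normalizer on a raw string with normalize_str:
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# expected to print something like: hello how are u?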
  • Define pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
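  • You can preview the pre-tokenization in the same way (the sentence is an arbitrary example); pre_tokenize_str returns each word with its character offsets:
print(tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer."))
# e.g. [('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('test', (6, 10)), ...]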
  • Define a trainer
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
  • Train with an iterator
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
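  • Once training is done, you can already encode a test sentence (an arbitrary example); note that [CLS] and [SEP] are not added yet, since post-processing is only defined in the next step:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)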
  • Define a template processing class
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
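  • To verify the template, encode a pair of arbitrary sentences and inspect the tokens and token type IDs:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)    # should start with [CLS] and have a [SEP] after each segment
print(encoding.type_ids)  # 0s for the first sentence, 1s for the second (including its [SEP])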
  • Define a decoder
tokenizer.decoder = decoders.WordPiece(prefix="##")
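  • With the WordPiece decoder in place, decoding the IDs from the previous encoding should reconstruct the text, merging the ## continuation pieces back into words:
print(tokenizer.decode(encoding.ids))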
  • Load your tokenizer into a HuggingFace transformers tokenizer
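  • A minimal sketch of this last step, assuming the special tokens defined above: wrap the trained Tokenizer object with PreTrainedTokenizerFast (a model-specific class such as BertTokenizerFast also works) through its tokenizer_object argument:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # Special tokens are not inferred automatically, so declare them explicitly
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)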