
Unigram tokenization

  • The overall training strategy is to start with a very large vocabulary and then iteratively remove tokens from it until the target vocabulary size is reached.
  • A Unigram model is a type of statistical language model that assumes the occurrence of each word is independent of the words that precede it.
  • Let's look at a toy example to understand how to train a Unigram LM tokenizer and how to use it to tokenize new text.
  • 1st iteration:
    • E-step: Estimate the probabilities of the tokens in the current vocabulary (see the first code sketch after this list).
  • Additional explanations:
    • How do we tokenize a text with a Unigram LM? (see the segmentation sketch after this list)
    • How do we calculate the loss on the training corpus? (see the corpus-loss sketch after this list)
  • 1st iteration:
    • M-step: Remove the token that least impacts the loss on the corpus (see the pruning sketch after this list).
  • 2nd iteration:
    • E-step: Estimate the probabilities
  • 2nd iteration:
    • M-step: Remove the token that least impacts the loss on the corpus.
  • In practice, to find the optimal tokenization of a word under a Unigram model, we use the Viterbi algorithm instead of enumerating, scoring, and comparing every possible segmentation (see the final sketch below).
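
To make these steps concrete, here are a few minimal Python sketches. First, the E-step. The toy corpus `word_freqs` below is hypothetical (the post's actual toy example is not reproduced here), and estimating each token's probability as its relative frequency is a simplification of the full EM estimate; the initial "very large" vocabulary is simply every substring that appears in the corpus.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency (a stand-in for real training data).
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def initial_vocab(word_freqs):
    """Start large: count every substring of every word, weighted by word frequency."""
    subword_freqs = Counter()
    for word, freq in word_freqs.items():
        for start in range(len(word)):
            for end in range(start + 1, len(word) + 1):
                subword_freqs[word[start:end]] += freq
    return subword_freqs

def estimate_probs(subword_freqs):
    """E-step (simplified): a token's probability is its relative frequency."""
    total = sum(subword_freqs.values())
    return {tok: freq / total for tok, freq in subword_freqs.items()}

subword_freqs = initial_vocab(word_freqs)
probs = estimate_probs(subword_freqs)
```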
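Next, tokenizing a word and computing the loss on the corpus, reusing `word_freqs` and `estimate_probs` from the sketch above. At toy scale we can afford to enumerate every segmentation; the loss is the frequency-weighted negative log-likelihood of each word's best tokenization.

```python
from math import log

def segmentations(word, vocab):
    """All ways to split a word into in-vocabulary tokens (fine at toy scale)."""
    if word == "":
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([prefix] + rest)
    return results

def best_tokenization(word, probs):
    """Pick the segmentation with the highest product of token probabilities,
    i.e. the lowest negative log-likelihood."""
    best, best_score = None, float("inf")
    for seg in segmentations(word, probs):
        score = -sum(log(probs[tok]) for tok in seg)
        if score < best_score:
            best, best_score = seg, score
    return best, best_score

def corpus_loss(word_freqs, probs):
    """Frequency-weighted negative log-likelihood of each word's best split."""
    return sum(freq * best_tokenization(word, probs)[1]
               for word, freq in word_freqs.items())
```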
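The M-step then removes the token whose deletion increases this loss the least. The sketch below tries each multi-character token in turn, re-estimating the probabilities without it; keeping single characters guarantees that every word stays tokenizable. In practice, implementations typically remove a fixed percentage of the least useful tokens per iteration rather than one at a time.

```python
def least_impactful_token(word_freqs, subword_freqs):
    """M-step: the multi-character token whose removal raises the loss least."""
    base = corpus_loss(word_freqs, estimate_probs(subword_freqs))
    best_tok, best_delta = None, float("inf")
    for tok in list(subword_freqs):
        if len(tok) == 1:   # keep single characters so coverage stays total
            continue
        trimmed = {t: f for t, f in subword_freqs.items() if t != tok}
        delta = corpus_loss(word_freqs, estimate_probs(trimmed)) - base
        if delta < best_delta:
            best_tok, best_delta = tok, delta
    return best_tok

# One pruning step: drop the least useful token, then re-estimate (next E-step).
worst = least_impactful_token(word_freqs, subword_freqs)
if worst is not None:
    del subword_freqs[worst]
```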
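Finally, the Viterbi search mentioned in the last point. Instead of enumerating all segmentations, dynamic programming over end positions finds the best one with at most O(n²) vocabulary lookups. This is a sketch assuming the `probs` dictionary from the earlier snippets.

```python
def viterbi_tokenize(word, probs):
    """best_score[i] is the lowest negative log-probability of any
    segmentation of word[:i]; back[i] points to where its last token starts."""
    n = len(word)
    best_score = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            tok = word[start:end]
            if tok in probs:
                score = best_score[start] - log(probs[tok])
                if score < best_score[end]:
                    best_score[end], back[end] = score, start
    if best_score[n] == float("inf"):
        return None          # no segmentation exists with this vocabulary
    tokens, end = [], n
    while end > 0:           # walk the back-pointers to recover the tokens
        tokens.append(word[back[end]:end])
        end = back[end]
    return tokens[::-1]

print(viterbi_tokenize("hugs", estimate_probs(subword_freqs)))
```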