Skip to content

Byte-Pair Encoding tokenization

  • Byte-Pair algorithm was initially proposed as a text compression algorithm
  • But it is also very well suited as a tokenizer for language models.
  • The idea of BPE is to divide words into the sequence of 2 words units.
  • BPE training is done on a standardized and pre-tokenized corpus.
  • BPE training starts with an initial vocabulary and increses it to the desired size.
  • To tokenize a text, it is sufficient to divide it into elementary units and then apply the merging rules successively.