Byte-Pair Encoding tokenization
- Byte-Pair algorithm was initially proposed as a text compression algorithm
- But it is also very well suited as a tokenizer for language models.
- The idea of BPE is to divide words into the sequence of 2 words units.
- BPE training is done on a standardized and pre-tokenized corpus.
- BPE training starts with an initial vocabulary and increses it to the desired size.
- To tokenize a text, it is sufficient to divide it into elementary units and then apply the merging rules successively.
Back to top