Tokenizers:
- In NLP, most of the data we handle is raw text.
- Tokenizers are used to transform raw text into numbers.
- A tokenizer's objective is to find a meaningful representation of the text.
- 3 distinct tokenization algorithms:
- Word-based
- Character-based
- Subword-based
Word-based
- The text is split on spaces
- Other rules, such as punctuation, may be added
- In this algorithm, each word has a specific ID/number attributed to it (see the sketch after the example below)
- Limits:
- very similar words, e.g. "dog" vs "dogs", get entirely different representations
- the vocabulary can end up very large because of the many different words
- large vocabularies result in heavy models
- loss of meaning across very similar words
- large quantity of out-of-vocabulary tokens
- We can limit the number of words we add to the vocabulary by ignoring words that are not necessary
- Out-of-vocabulary words result in a loss of information, since the model will have the exact same representation for all words it doesn't know: [UNK] (unknown).
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
# ['Jim', 'Henson', 'was', 'a', 'puppeteer']
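A minimal sketch of the ID-assignment idea, using a toy vocabulary built only from this one sentence (illustrative only; real vocabularies are built from large corpora):
tokenized_text = "Jim Henson was a puppeteer".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(tokenized_text)))}  # toy word -> ID mapping
input_ids = [vocab[word] for word in tokenized_text]
print(vocab)
# {'Henson': 0, 'Jim': 1, 'a': 2, 'puppeteer': 3, 'was': 4}
print(input_ids)
# [1, 0, 4, 2, 3]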
Character-based
- Splitting a raw text into characters
- Character-based vocabularies are slimmer
- There are much fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.
- Limits:
- intuitively, it’s less meaningful: each character doesn’t mean a lot on its own, compared to a word
- sequences are translated into very large numbers of tokens for the model to process; this impacts the size of the context the model carries around and reduces the size of the text we can use as input, which is often limited (see the sketch after this list)
- very long sequences
- less meaningful individual tokens
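A minimal sketch mirroring the word-based example above, using Python's built-in list() to split into characters and showing how much longer the sequence becomes:
tokenized_text = list("Jim Henson was a puppeteer")
print(tokenized_text)
# ['J', 'i', 'm', ' ', 'H', 'e', 'n', 's', 'o', 'n', ' ', 'w', 'a', 's', ' ', 'a', ' ', 'p', 'u', 'p', 'p', 'e', 't', 'e', 'e', 'r']
print(len(tokenized_text))
# 26 tokens for a 5-word sentence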
Subword tokenization
- Splitting a raw text into subwords
- Subword-based tokenization lies between character-based and word-based algorithms, finding a middle ground between the two
- Frequently used words should not be split into smaller subwords
- Rare words should be decomposed into meaningful subwords
- Subwords help identify similar syntactic or semantic situations in text
- Subword tokenization algorithms can identify which tokens are the start of a word and which tokens complete a started word (e.g. the "##" prefix in WordPiece)
-
Most models obtaining state-of-the-art results in English today use some kind of subword-tokenization algorithm.
- WordPiece: BERT, DistilBERT
- Unigram: XLNet, ALBERT
- Byte-Pair Encoding: GPT-2, RoBERTa
-
These approaches help reduce vocabulary sizes by sharing information across different words, with prefixes and suffixes understood as such.
- They keep meaning across very similar words by recognizing similar tokens making them up.
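A quick illustration of these splits, comparing WordPiece and Byte-Pair Encoding on a frequent word and a rarer one (the bert-base-uncased and gpt2 checkpoints are used here as examples; the exact splits depend on each tokenizer's learned vocabulary):
from transformers import AutoTokenizer
for checkpoint in ["bert-base-uncased", "gpt2"]:  # WordPiece vs Byte-Pair Encoding
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tokenizer.tokenize("the"), tokenizer.tokenize("tokenization"))
# The frequent word "the" stays whole in both, while the rarer "tokenization" is split,
# e.g. into ['token', '##ization'] (WordPiece) or ['token', 'ization'] (BPE).
# WordPiece marks word continuations with "##"; GPT-2's BPE marks tokens that follow a space with "Ġ".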
Tokenizer Pipeline
- A tokenizer takes texts as inputs and outputs numbers the associated model can make sense of
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "Let's try to tokenize!"
inputs = tokenizer(sequence)
print(inputs["input_ids"])
#[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
-
Tokenization pipeline: from input text to a list of numbers:
- Raw text: Let's try to tokenize!
- Tokens: [let, ', s, try, to, token, ##ize, !]
- Special tokens: [[CLS], let, ', s, try, to, token, ##ize, !, [SEP]]
- Input IDs: [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
-
The first step of the pipeline is to split the text into tokens.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "Let's try to tokenize!"
tokens = tokenizer.tokenize(sequence)
print(tokens)
#[let, ', s, try, to, token, ##ize, !]
- The ALBERT tokenizer adds a long underscore (▁) in front of all tokens that had a space before them, a convention shared by all SentencePiece-based tokenizers:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
sequence = "Let's try to tokenize!"
tokens = tokenizer.tokenize(sequence)
print(tokens)
#[▁let, ', s, ▁try, ▁to, ▁to, ken, ize, !]
- The second step of the tokenization pipeline is to map those tokens to their respective input IDs, as defined by the vocabulary of the tokenizer.
- This is why we need to download a vocabulary file when we instantiate a tokenizer with the from_pretrained() method: we have to make sure we use the same mapping as when the model was pretrained. To do this, we use the convert_tokens_to_ids() method:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "Let's try to tokenize!"
tokens = tokenizer.tokenize(sequence)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)
#[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]
- Lastly, the tokenizer adds the special tokens the model expects via the prepare_for_model() method:
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])
#[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]
Decoding
- The decode method allows us to check how the final output of the tokenizer translates back into text.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "Let's try to tokenize!"
inputs = tokenizer(sequence)
decoded_string = tokenizer.decode(inputs["input_ids"])
print(decoded_string)
# "[CLS] let's try to tokenize! [SEP]"
- The RoBERTa tokenizer uses different special tokens, such as <s> and </s>:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
sequence = "Let's try to tokenize!"
inputs = tokenizer(sequence)
decoded_string = tokenizer.decode(inputs["input_ids"])
print(decoded_string)
# "<s> let's try to tokenize! </s>"
- Hence, a tokenizer takes texts as inputs and outputs numbers the associated model can make sense of.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "Let's try to tokenize!"
inputs = tokenizer(sequence)
print(inputs)
# {'input_ids': [101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102],
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}