Normalization and pre-tokenization
What is normalization?
- The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents.
- Here is the result of the normalization of several tokenizers on the same sentence.
from transformers import FNetTokenizerFast, RetriBertTokenizerFast
tokenizer = FNetTokenizerFast.from_pretrained("google/fnet-base")
tokenizer = RetriBertTokenizerFast.from_pretrained("yjernite/retribert-base-uncased")  # and so on
- Fast tokenizers provide easy access to the normalization operation.
from transformers import AutoTokenizer
text = "This is a text with àccënts and CAPITAL LETTERS"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))
- And it's really handy that the normalization operation is automatically included when you tokenize a text.
# With saved normalizer
from transformers import AutoTokenizer
text = "This is a text with àccënts and CAPITAL LETTERS"
tokenizer = AutoTokenizer.from_pretrained("albert-large-v2")
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
# Without saved normalizer
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/albert-tokenizer-without-normalizer")
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
- Some normalizations may not be visible to the eye but change many things for the computer.
- There are several Unicode normalization standards: NFC, NFD, NFKC, and NFKD.
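The difference between these standards can be shown with the standard library alone; a minimal sketch using `unicodedata`, illustrating that NFC stores "é" as one precomposed code point while NFD stores it as a base letter plus a combining accent:

```python
import unicodedata

# Same glyph, two different underlying representations
nfc = unicodedata.normalize("NFC", "é")  # one precomposed code point
nfd = unicodedata.normalize("NFD", "é")  # "e" + combining acute accent

print(len(nfc))    # 1
print(len(nfd))    # 2
print(nfc == nfd)  # False: visually identical, different for the computer
```

Converting both strings to the same normal form makes them compare equal again, which is exactly why tokenizers apply such a normalization up front.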
- But beware: not all normalizations are suitable for every corpus.
from transformers import AutoTokenizer
text = "un père indigné"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))
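To see why this normalization can hurt a French corpus, here is a minimal stdlib-only sketch of a lowercase-and-strip-accents step (an approximation of what an uncased normalizer does, not the actual tokenizer implementation):

```python
import unicodedata

def strip_accents_lower(text):
    # Decompose characters (NFD), drop the combining marks (category "Mn"),
    # then lowercase -- roughly what an uncased English normalizer does.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()

print(strip_accents_lower("un père indigné"))  # "un pere indigne"
```

The accents are lost, so distinct French words can collapse to the same form ("indigné" vs "indigne"), which is why an accent-stripping normalizer is a poor fit for a French corpus.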
What is pre-tokenization?
- The pre-tokenization step applies rules to perform an initial split of the text.
- Let's look at the result of the pre-tokenization of several tokenizers on the sentence "3.2.1: let's get started!":
- 'gpt2': 3 . 2 . 1 : Ġlet 's Ġget Ġstarted !
- 'albert-base-v1': _3.2.1: _let's _get _started!
- 'bert-base-uncased': 3 . 2 . 1 : let ' s get started !
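The 'bert-base-uncased' split above can be approximated with a one-line regex; this is a simplified sketch, not the actual implementation: word characters group together, and every other non-space character becomes its own token.

```python
import re

def bert_style_pretokenize(text):
    # \w+ keeps runs of word characters together; [^\w\s] emits each
    # punctuation character as its own token; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

print(bert_style_pretokenize("3.2.1: let's get started!"))
# ['3', '.', '2', '.', '1', ':', 'let', "'", 's', 'get', 'started', '!']
```

Note how "3.2.1" is broken apart and "let's" is split around the apostrophe, matching the BERT row above.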
- Pre-tokenization can both modify the text, such as replacing a space with a special underscore, and split it into tokens.
- Fast tokenizers provide easy access to the pre-tokenization operation.
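For instance, assuming the 🤗 Tokenizers library is installed, its pre-tokenizers can be used standalone; `pre_tokenize_str` returns each piece together with its character offsets in the original text:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
# Each element is a (piece, (start, end)) pair of the piece
# and its character offsets in the input string
print(pre_tokenizer.pre_tokenize_str("3.2.1: let's get started!"))
```

The offsets are what allow a fast tokenizer to map every final token back to its span in the raw text.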