Normalization and pre-tokenization

What is normalization?

  • The normalization step involves some general cleanup, such as removing needless whitespace, lowercasing, and/or removing accents.
  • Here is how several tokenizers normalize the same sentence; each checkpoint ships with its own normalizer.
from transformers import FNetTokenizerFast, RetriBertTokenizerFast
FNetTokenizerFast.from_pretrained("google/fnet-base")
RetriBertTokenizerFast.from_pretrained("yjernite/retribert-base-uncased")  # and so on
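  • A minimal sketch to print each normalizer's output on a sample sentence (the sentence is an assumption; normalize_str applies only the normalization step):
from transformers import FNetTokenizerFast, RetriBertTokenizerFast

text = "This is a text with àccënts and CAPITAL LETTERS"
for tokenizer_class, checkpoint in [
    (FNetTokenizerFast, "google/fnet-base"),
    (RetriBertTokenizerFast, "yjernite/retribert-base-uncased"),
]:
    tokenizer = tokenizer_class.from_pretrained(checkpoint)
    # Apply only this tokenizer's normalization step, without tokenizing
    print(checkpoint, tokenizer.backend_tokenizer.normalizer.normalize_str(text))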
  • Fast tokenizers provide easy access to the normalization operation.
from transformers import AutoTokenizer
text = "This is a text with àccënts and CAPITAL LETTERS"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))
# "this is a text with accents and capital letters"
  • And it's really handy that the normalization operation is automatically applied when you tokenize a text.
# With saved normalizer
from transformers import AutoTokenizer
text = "This is a text with àccënts and CAPITAL LETTERS"
tokenizer = AutoTokenizer.from_pretrained("albert-large-v2")
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
# Without saved normalizer
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/albert-tokenizer-without-normalizer")
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(text)))
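# Expected difference between the two printouts (not verified output): the first shows
# lowercased, accent-stripped tokens; the second keeps the accented characters.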
  • Some normalizations are not visible to the eye but change many things for the computer.
  • There are several Unicode normalization standards: NFC, NFD, NFKC, and NFKD.
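  • For example, a quick sketch with Python's built-in unicodedata module shows that two visually identical strings can differ for the computer:
import unicodedata

composed = "é"                                       # a single code point, U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by combining accent U+0301

print(composed == decomposed)                        # False: same look, different code points
print(len(composed), len(decomposed))                # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: NFC recomposes the pair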

  • But beware: not every normalization is suitable for every corpus, as the next example shows.

from transformers import AutoTokenizer
text = "un père indigné"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str(text))
# "un pere indigne": stripping accents turns "indigné" (outraged) into "indigne" (unworthy)

What is pre-tokenization?

  • Pre-tokenization applies rules to perform an initial split of the text.
  • Let's look at how several tokenizers pre-tokenize the same sentence: "3.2.1: let's get started!"
    • 'gpt2': 3 . 2 . 1 : Ġlet 's Ġget Ġstarted !
    • 'albert-base-v1': ▁3.2.1: ▁let's ▁get ▁started!
    • 'bert-base-uncased': 3 . 2 . 1 : let ' s get started !
  • Pre-tokenization can both modify the text, such as replacing each space with a special underscore (▁), and split it into words.

  • Fast tokenizers provide easy access to the pre-tokenization operation.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
text = "3.2.1: let's get started!"
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))
# Roughly: [('▁3.2.1:', (0, 6)), ("▁let's", (7, 12)), ('▁get', (13, 16)), ('▁started!', (17, 25))]
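  • And a minimal sketch to reproduce all three pre-tokenizations listed above (the loop itself is an assumption, not from the original):
from transformers import AutoTokenizer

text = "3.2.1: let's get started!"
for checkpoint in ["gpt2", "albert-base-v1", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Each backend pre-tokenizer returns a list of (word, offsets) pairs
    print(checkpoint, tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))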