
Fast tokenizers' special powers

Why are fast tokenizers called fast?

  • We will see exactly how much faster the so-called fast tokenizers are compared to their slow counterparts.
  • Let's see how fast tokenizers are!
  • The MNLI dataset contains 432,000 pairs of texts.
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mnli")
raw_datasets
  • We will see how long it takes for the fast and slow versions of a BERT tokenizer to process them all.
  • We define two functions to preprocess the dataset.
  • We define the fast and slow tokenizers using the AutoTokenizer API.
from transformers import AutoTokenizer
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_with_fast(examples):
    return fast_tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True
    )
  • The fast tokenizer is the default when available.
  • So we pass along use_fast=False to define the slow one.
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
def tokenize_with_slow(examples):
    return slow_tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True
    )
  • Let's time both versions on the whole dataset.
  • In a notebook, we can time the execution of a cell with the %time magic command.
  • Processing the whole dataset is 4 times faster with the fast tokenizer.
  • That's better, but not very impressive.
  • This is because we pass the texts to the tokenizer one at a time.
  • This is a common mistake with fast tokenizers, which are backed by Rust.
%time tokenized_datasets = raw_datasets.map(tokenize_with_fast)
%time tokenized_datasets = raw_datasets.map(tokenize_with_slow)
  • Properly using a fast tokenizer requires giving it multiple texts at the same time.
  • Using fast tokenizers with batched=True is much, much faster.
%time tokenized_datasets = raw_datasets.map(tokenize_with_fast, batched=True)
%time tokenized_datasets = raw_datasets.map(tokenize_with_slow, batched=True)
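  • As a minimal sketch of why batching matters (reusing the fast_tokenizer and raw_datasets defined above), we can pass lists of texts to the tokenizer in a single call instead of one pair at a time:
# Passing lists of texts lets the Rust backend work on many examples at once;
# this is what happens when Dataset.map calls our function with batched=True.
batch = raw_datasets["train"][:8]  # slicing a Dataset gives a dict of lists
encodings = fast_tokenizer(batch["premise"], batch["hypothesis"], truncation=True)
print(len(encodings["input_ids"]))  # one list of input IDs per premise/hypothesis pair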

Fast tokenizer superpowers

  • When performing tokenization, we lose some information.
  • E.g., here the tokenization is the same for the two sentences below, even though one has several more spaces than the other.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer("Let's talk about tokenizers superpowers.")["input_ids"])
print(tokenizer("Let's talk about tokenizers      superpowers.")["input_ids"])
  • It is also difficult to know which word a token belongs to.
  • It is likewise hard to tell whether two or more tokens belong to the same word or not.
  • Fast tokenizers keep track of the word each token comes from.
encoding = tokenizer("Let's talk about tokenizers superpowers.")
print(encoding.tokens())
print(encoding.word_ids())
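  • As a small sketch reusing the encoding above, those word IDs let us group tokens by the word they come from (special tokens get a word ID of None):
from collections import defaultdict

# Map each word index to the list of tokens it was split into
tokens_per_word = defaultdict(list)
for token, word_id in zip(encoding.tokens(), encoding.word_ids()):
    tokens_per_word[word_id].append(token)
print(dict(tokens_per_word))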
  • They also keep track of the span of characters in the original text that produced each token.
encoding = tokenizer(
    "Let's talk about tokenizers     superpowers.",
    return_offsets_mapping=True
)
print(encoding.tokens())
print(encoding["offset_mapping"])
  • The internal pipeline of the tokenizer looks like this (a sketch of inspecting the first two steps follows this list):
    • Normalization: "Let's talk about tokenizers superpowers."
    • Pre-tokenization: [Let,',s,talk,about,tokenizers,superpowers,.]
    • Applying Model: [Let,',s,talk,about,token,##izer,##s,super,##power,##s,.]
    • Special tokens: [[CLS],Let,',s,talk,about,token,##izer,##s,super,##power,##s,.,[SEP]]
  • Fast tokenizers keep track of the span of original text that created each word and each token.
  • Here are a few applications of these features:
    • Word IDs application: Whole word masking, Token classification
    • Offset mapping application: Token classification, Question Answering
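  • As a minimal sketch of the first two steps (normalization and pre-tokenization), fast tokenizers expose the underlying Rust tokenizer through backend_tokenizer:
sentence = "Let's talk about tokenizers superpowers."
# Normalization step: Unicode normalization and cleanup (the cased BERT normalizer keeps casing)
print(tokenizer.backend_tokenizer.normalizer.normalize_str(sentence))
# Pre-tokenization step: split on whitespace/punctuation, keeping character offsets for each word
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentence))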

Inside the Token classification pipeline (PyTorch)

  • The token classification pipeline gives each token in the sentence a label.
from transformers import pipeline
token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
  • It can also group together tokens corresponding to the same entity.
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
  • The token classification pipeline follows the general steps of the pipeline we saw before.

    • Tokenizer: Raw text -> Input IDs
    • Model: Input IDs -> Logits
    • Postprocessing: Logits -> Predictions
  • We have already seen the first steps of the pipeline: tokenization and the model.

from transformers import AutoTokenizer, AutoModelForTokenClassification
model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"  # default checkpoint of the token-classification pipeline
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape)  # [1,19]
print(outputs.logits.shape)       # [1,19,9]
  • The model outputs logits, which we need to convert to probabilities using softmax.
  • We also get the predicted label for each token by taking the prediction with the maximum value.
import torch
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)
  • The label correspondence then lets us match each prediction to a label.
model.config.id2label
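  • For instance, a quick sketch that converts the whole list of predictions to label strings (reusing the predictions from above):
print([model.config.id2label[pred] for pred in predictions])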
  • The start and end character positions can be found using the offset mappings.
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]
for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {"entity": label, "score": probabilities[idx][pred],
             "word": tokens[idx], "start": start, "end": end}
        )
print(results)
  • The last step is to group together all the tokens that correspond to the same entity.
  • We group consecutive tokens that share the same entity label, unless a B-XXX label starts a new entity.
import numpy as np
label_map = model.config.id2label
results = []
idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = label_map[pred]
    if label != "O":
        # Remove the B- or I-
        label = label[2:]
        start, _ = offsets[idx]
        # Grab all the tokens labeled with I-label
        all_scores = []
        while idx < len(predictions) and label_map[predictions[idx]] == f"I-{label}":
            all_scores.append(probabilities[idx][predictions[idx]])
            _, end = offsets[idx]
            idx += 1
        # The score is the mean of the scores of all the tokens in that grouped entity.
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {"entity_group": label, "score": score,
             "word": word, "start": start, "end": end}
        )
    idx += 1
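  • Printing the grouped results should mirror what the pipeline with aggregation_strategy="simple" returned earlier (the exact scores depend on the checkpoint):
print(results)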