Full Training

Write your training loop in PyTorch

Preprocessing

  • Here is how we can easily preprocess the GLUE MRPC dataset using dynamic padding.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# Keep only the columns the model expects, with the column names it expects
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
data_collator = DataCollatorWithPadding(tokenizer)
  • Once our data is preprocessed, we just have to create our DataLoaders, which are responsible for turning our datasets into batches.
from torch.utils.data import DataLoader
train_dataloader = DataLoader(
  tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
  tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
  • To check that everything works as intended, we grab a batch of data and inspect it.
for batch in train_dataloader:
    break
print({k: v.shape for k, v in batch.items()})
# {'attention_mask': torch.Size([8, 63]), 'input_ids': torch.Size([8, 63]), 'labels': torch.Size([8]), 'token_type_ids': torch.Size([8, 63])}
  • Like a single dataset element, the batch is a dictionary, but this time the values are not single lists of integers: they are tensors of shape batch size by sequence length (plus a 1-D tensor of labels). The sketch below shows that individual elements have different lengths before the collator pads them.
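  • To make the dynamic padding concrete, here is a minimal sketch reusing the objects defined above (the printed lengths are illustrative, not exact values):
# Individual elements are still unpadded, so their lengths differ;
# DataCollatorWithPadding pads each batch to its longest sequence,
# which is why the batch above has a second dimension of 63.
print(len(tokenized_datasets["train"][0]["input_ids"]),
      len(tokenized_datasets["train"][1]["input_ids"]))
# e.g. 50 59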

Model

  • The next step is to create our model and send our training data through it.
  • We will use the from_pretrained() method and set the number of labels to the number of classes in our dataset (here, 2):
from transformers import AutoModelForSequenceClassification
checkpoint = "bert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  • To be sure everything is going well, we pass the batch we grabbed earlier to our model and check that there is no error:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
# tensor(0.7512, grad_fn=<NllLossBackward>) torch.Size([8, 2])

Training

  • Since the labels are provided, the models of the Transformers library always return the loss directly (see the sketch after the code below).
  • We will use loss.backward() to compute the gradients.
  • Then we will use the optimizer to perform the training step.
  • The optimizer is responsible for applying the updates to the model weights.
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent versions
optimizer = AdamW(model.parameters(), lr=5e-5)
loss = outputs.loss
loss.backward()
optimizer.step()
# Don't forget to zero your gradients once your optimizer step is done!
optimizer.zero_grad()
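  • As a sanity check, here is a minimal sketch (assuming the default single-label classification setup; not part of the original notes) showing that the returned loss is just the cross-entropy between the logits and the labels:
import torch.nn.functional as F
# Should match outputs.loss from the earlier forward pass
manual_loss = F.cross_entropy(outputs.logits, batch["labels"])
print(manual_loss)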
  • We will add two more things to make our training loop as good as it can be.
  • The first one is a learning rate scheduler.
  • The learning rate scheduler will update the optimizer's learning rate at each step.
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
  • You can use any PyTorch learning rate scheduler in its place (see the sketch after the device setup below).
  • We can also make training faster by using a GPU instead of the CPU.
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
print(device)
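  • As a sketch of the PyTorch alternative mentioned above, you could build the scheduler directly from torch.optim.lr_scheduler (the cosine schedule and the cosine_scheduler name are illustrative choices, not part of the original notes):
from torch.optim.lr_scheduler import CosineAnnealingLR
# Any scheduler wrapping the same optimizer can replace get_scheduler(...) above
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=num_training_steps)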
  • Putting everything together, here is what the training loop looks like.
# Re-create the optimizer and scheduler together so the scheduler wraps the optimizer used in the loop
optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        # Update the weights, advance the learning rate schedule, then reset the gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Evaluation

  • Evaluation can then be done with a metric from the Datasets library.
  • First we put our model in evaluation mode to deactivate layers like dropout, then loop over all the evaluation batches.
  • The model outputs logits, so we apply an argmax to convert them into predictions.
from datasets import load_metric
metric = load_metric("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    # No gradients are needed for evaluation
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()  # returns a dict with accuracy and F1 for MRPC
  • metric.add_batch() accumulates the predictions and references batch by batch; the final score is only computed when metric.compute() is called.
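  • As an alternative pattern (a sketch, not part of the original notes), you can also collect all predictions first and call compute() a single time:
all_preds, all_labels = [], []
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        logits = model(**batch).logits
    all_preds.append(torch.argmax(logits, dim=-1).cpu())
    all_labels.append(batch["labels"].cpu())
# compute() also accepts the full set of predictions and references directly
metric.compute(predictions=torch.cat(all_preds), references=torch.cat(all_labels))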