How to classify images with the help of a Transformer-based model
Since its introduction in 2017, the Transformer has been widely recognized as a powerful encoder-decoder model capable of solving virtually any language modeling task.
BERT, RoBERTa, and XLM-RoBERTa are a few examples of state-of-the-art models in language processing that use a stack of Transformer encoders as the backbone of their architecture. ChatGPT and the GPT family also use the decoder part of the Transformer to generate text. It's safe to say that almost every state-of-the-art model in natural language processing incorporates the Transformer in its architecture.
Transformer performance is so good that it seems wasteful not to use it for tasks beyond natural language processing, such as computer vision. However, the big question is: can we actually use it for computer vision tasks?
It turns out that the Transformer also has great potential to be applied to computer vision tasks. In 2020, the Google Brain team introduced a Transformer-based model for image classification called the Vision Transformer (ViT). Its performance is very competitive with conventional CNNs on several image classification benchmarks.
Therefore, in this article, we're going to talk about this model. Specifically, we'll discuss how a ViT model works and how we can fine-tune it on our own custom dataset for an image classification task with the help of the HuggingFace library.
So, as a first step, let's get started with the dataset that we're going to use in this article.
We'll use a snacks dataset that you can easily access from the datasets library by HuggingFace. This dataset is listed as having a CC-BY 2.0 license, which means that you're free to share and use it, as long as you cite the dataset source in your work.
Let's take a sneak peek at this dataset:
We only need a few lines of code to load the dataset, as you can see below:
!pip install -q datasets

from datasets import load_dataset
# Load dataset
dataset = load_dataset("Matthijs/snacks")
print(dataset)
# Output
'''
DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 4838
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 952
    })
    validation: Dataset({
        features: ['image', 'label'],
        num_rows: 955
    })
})'''
The dataset is a dictionary object that consists of 4,838 training images, 955 validation images, and 952 test images.
Each image comes with a label belonging to one of 20 snack classes. We can check these 20 different classes with the following code:
print(dataset["train"].options['label'].names)# Output
'''
['apple','banana','cake','candy','carrot','cookie','doughnut','grape',
'hot dog', 'ice cream','juice','muffin','orange','pineapple','popcorn',
'pretzel','salad','strawberry','waffle','watermelon']'''
Now, let's create a mapping between each label and its corresponding index.
# Mapping from label to index and vice versa
labels = dataset["train"].options["label"].names
num_labels = len(dataset["train"].options["label"].names)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

print(label2id)
print(id2label)
# Output
'''
{'apple': 0, 'banana': 1, 'cake': 2, 'candy': 3, 'carrot': 4, 'cookie': 5, 'doughnut': 6, 'grape': 7, 'hot dog': 8, 'ice cream': 9, 'juice': 10, 'muffin': 11, 'orange': 12, 'pineapple': 13, 'popcorn': 14, 'pretzel': 15, 'salad': 16, 'strawberry': 17, 'waffle': 18, 'watermelon': 19}
{0: 'apple', 1: 'banana', 2: 'cake', 3: 'candy', 4: 'carrot', 5: 'cookie', 6: 'doughnut', 7: 'grape', 8: 'hot dog', 9: 'ice cream', 10: 'juice', 11: 'muffin', 12: 'orange', 13: 'pineapple', 14: 'popcorn', 15: 'pretzel', 16: 'salad', 17: 'strawberry', 18: 'waffle', 19: 'watermelon'}
'''
One important thing to know before we move on is that the images vary in size. Therefore, we need to perform some preprocessing steps before feeding the images into the model for fine-tuning.
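As a quick sanity check (a minimal sketch; the exact sizes will differ from sample to sample), we can print the dimensions of a few training images:
# The images are PIL objects, so .size returns (width, height)
for i in range(3):
    img = dataset["train"][i]["image"]
    print(img.size)  # each image has a different resolution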
Now that we know the dataset we're working with, let's take a closer look at the ViT architecture.
Before the introduction of ViT, the fact that a Transformer model relies on the self-attention mechanism posed a big challenge for applying it to computer vision tasks.
The self-attention mechanism is the reason why Transformer-based models can differentiate the semantic meaning of a word used in different contexts. For example, a BERT model can distinguish the meaning of the word 'park' in the sentences 'They park their car in the basement' and 'She walks her dog in a park' thanks to the self-attention mechanism.
However, there is one drawback to self-attention: it is computationally expensive, because every token needs to attend to every other token in the sequence.
Now, if we apply the self-attention mechanism to image data, then every pixel in an image would need to attend to and be compared with every other pixel. The problem is that the computational cost grows quadratically with the number of pixels, which is simply not feasible for images with a reasonably large resolution.
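To get a rough sense of the scale (a back-of-the-envelope comparison, not a precise FLOP count), consider how the number of pairwise pixel interactions grows with resolution:
# Pairwise interactions needed if every pixel attends to every other pixel
for side in (112, 224, 448):
    pixels = side * side
    print(f"{side} x {side}: {pixels ** 2:,} pairwise interactions")

# Doubling the resolution quadruples the pixel count,
# so the attention cost grows roughly 16x each time.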
In order to overcome this problem, ViT introduces the concept of splitting the input image into patches, each with a size of 16 x 16 pixels. Let's say we have an image with a size of 48 x 48 pixels; the patches of our image will then look something like this:
In practice, there are two options for how ViT splits our image into patches:
- Reshape our input image of dimension height x width x channel into a sequence of flattened 2D image patches of dimension no. of patches x (patch_size² × channel). Then, we project the flattened patches through a basic linear layer to get the embedding of each patch (a sketch of this approach follows this list).
- Feed our input image through a convolutional layer with the kernel size and stride equal to the patch size. Then, we flatten the output of that convolutional layer.
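For completeness, here is a minimal sketch of the first (reshape) approach on a 48 x 48 toy image; the variable names are illustrative and not part of the official ViT implementation:
import torch
import torch.nn as nn

img = torch.rand(1, 3, 48, 48)   # (batch, channel, height, width)
patch_size = 16
hidden_size = 768

# Cut the image into non-overlapping 16 x 16 patches and flatten each patch
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size ** 2)
print(patches.size())            # torch.Size([1, 9, 768])

# Project each flattened patch with a plain linear layer to get its embedding
linear_projection = nn.Linear(3 * patch_size ** 2, hidden_size)
patch_embeddings = linear_projection(patches)
print(patch_embeddings.size())   # torch.Size([1, 9, 768])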
After testing the model's performance on several datasets, it turns out that the second approach leads to better performance. Therefore, in this article, we're going to use the second approach.
Let's use a toy example to demonstrate the process of splitting an input image into patches with a convolutional layer.
import torch
import torch.nn as nn

# Create a toy image with dim (batch x channel x height x width)
toy_img = torch.rand(1, 3, 48, 48)
# Define conv layer parameters
num_channels = 3
hidden_size = 768 #or emb_dimension
patch_size = 16
# Conv 2D layer
projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size,
stride=patch_size)
# Forward pass the toy image
out_projection = projection(toy_img)
print(f'Original image size: {toy_img.size()}')
print(f'Size after projection: {out_projection.size()}')
# Output
'''
Original image size: torch.Size([1, 3, 48, 48])
Size after projection: torch.Size([1, 768, 3, 3])
'''
The next thing the model does is flatten the patches and arrange them sequentially, as you can see in the image below:
We can do the flattening process with the following code:
# Flatten the output of the Conv2D projection
patch_embeddings = out_projection.flatten(2).transpose(1, 2)
print(f'Patch embedding size: {patch_embeddings.size()}')
# Output
'''
Patch embedding size: torch.Size([1, 9, 768]) # [batch, no. of patches, emb_dim]
'''
What we have after the flattening process is basically a vector embedding for each patch. This is similar to the token embeddings used in many Transformer-based language models.
Next, similar to BERT, ViT adds a special vector embedding for the [CLS] token at the first position of the patch sequence.
# Define the [CLS] token embedding with the same emb dimension as the patches
batch_size = 1
cls_token = nn.Parameter(torch.randn(1, 1, hidden_size))
cls_tokens = cls_token.expand(batch_size, -1, -1)

# Prepend the [CLS] token at the beginning of the patch embeddings
patch_embeddings = torch.cat((cls_tokens, patch_embeddings), dim=1)
print(f'Patch embedding size: {patch_embeddings.size()}')
# Output
'''
Patch embedding size: torch.Size([1, 10, 768]) # [batch, no. of patches+1, emb_dim]
'''
As you can see, by prepending the [CLS] token embedding at the beginning of our patch embeddings, the length of the sequence increases by one. The final step after this is to add the positional embedding to our sequence of patches. This step is important so that our ViT model can learn the order of the patches.
This position embedding is a learnable parameter that will be updated by the model during the training process.
# Define the position embedding with the same dimension as the patch embedding
position_embeddings = nn.Parameter(torch.randn(batch_size, 10, hidden_size))

# Add the position embedding to the patch embedding
input_embeddings = patch_embeddings + position_embeddings
print(f'Input embedding size: {input_embeddings.size()}')
# Output
'''
Input embedding size: torch.Size([1, 10, 768]) # [batch, no. of patches+1, emb_dim]
'''
Now, the vector embedding of each patch plus the position embedding becomes the input to a stack of Transformer encoders. The number of Transformer encoders depends on the ViT variant that you use. Overall, there are three variants of the ViT model:
- ViT-base: 12 layers, a hidden size of 768, and a total of 86M parameters.
- ViT-large: 24 layers, a hidden size of 1024, and a total of 307M parameters.
- ViT-huge: 32 layers, a hidden size of 1280, and a total of 632M parameters.
In the following code snippet, let's say we want to use ViT-base. This means that we have 12 layers of Transformer encoders:
# Define parameters for ViT-base (example)
num_heads = 12
num_layers = 12

# Define the stack of Transformer encoders
transformer_encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=num_heads,
    dim_feedforward=int(hidden_size * 4),
    dropout=0.1, batch_first=True)  # batch_first so the input stays [batch, seq, emb_dim]
transformer_encoder = nn.TransformerEncoder(
    encoder_layer=transformer_encoder_layer,
    num_layers=num_layers)

# Forward pass
output_embeddings = transformer_encoder(input_embeddings)
print(f'Output embedding size: {output_embeddings.size()}')
# Output
'''
Output embedding size: torch.Size([1, 10, 768])
'''
Finally, the stack of Transformer encoders outputs the final vector representation of each image patch. The dimensionality of each vector corresponds to the hidden size of the ViT model that we use.
And that's basically it.
We could certainly build and train our own ViT model from scratch. However, as with other Transformer-based models, ViT needs to be trained on a huge amount of image data (14M-300M images) in order to generalize well on unseen data.
If we want to use ViT on a custom dataset, the most common approach is to fine-tune a pretrained model. The easiest way to do this is with the HuggingFace library: all we have to do is call the ViTModel.from_pretrained() method and pass the path to our pretrained model as an argument. The ViTModel class from HuggingFace also acts as a wrapper for all of the steps we've discussed above.
!pip install transformers

from transformers import ViTModel
# Load the pretrained model
model_checkpoint = 'google/vit-base-patch16-224-in21k'
model = ViTModel.from_pretrained(model_checkpoint, add_pooling_layer=False)
# Example input image
input_img = torch.rand(batch_size, num_channels, 224, 224)
# Forward pass the input image
output_embedding = model(input_img)
print(output_embedding)
print(f"Ouput embedding dimension: {output_embedding['last_hidden_state'].dimension()}")
# Output
'''
BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.0985, -0.2080, 0.0727, ..., 0.2035, 0.0443, -0.3266],
[ 0.1899, -0.0641, 0.0996, ..., -0.0209, 0.1514, -0.3397],
[ 0.0646, -0.3392, 0.0881, ..., -0.0044, 0.2018, -0.3038],
...,
[-0.0708, -0.2932, -0.1839, ..., 0.1035, 0.0922, -0.3241],
[ 0.0070, -0.3093, -0.0217, ..., 0.0666, 0.1672, -0.4103],
[ 0.1723, -0.1037, 0.0317, ..., -0.0571, 0.0746, -0.2483]]],
grad_fn=<NativeLayerNormBackward0>), pooler_output=None, hidden_states=None, attentions=None)
Output embedding size: torch.Size([1, 197, 768])
'''
The output of the whole ViT model is a vector embedding representing each image patch plus the [CLS] token. It has the dimension [batch_size, image_patches+1, hidden_size].
To perform an image classification task, we follow the same approach as with the BERT model: we extract the output vector embedding of the [CLS] token and pass it through a final linear layer to determine the class of the image.
num_labels = 20

# Define the linear classifier layer
classifier = nn.Linear(hidden_size, num_labels)
# Forward pass on the output embedding of the [CLS] token
output_classification = classifier(output_embedding['last_hidden_state'][:, 0, :])
print(f"Output embedding dimension: {output_classification.dimension()}")
# Output
'''
Output size: torch.Size([1, 20]) # [batch, no. of labels]
'''
In this section, we will fine-tune a ViT-base model that was pretrained on the ImageNet-21k dataset, which consists of roughly 14 million images and 21,843 classes. Each image in the dataset has a size of 224 x 224 pixels.
To begin, we need to define the checkpoint path of the pretrained model and load the necessary libraries.
import numpy as np
import torch
import cv2
import torch.nn as nn
from transformers import ViTModel, ViTConfig
from torchvision import transforms
from torch.optim import Adam
from torch.utils.data import DataLoader
from tqdm import tqdm

# Pretrained model checkpoint
model_checkpoint = 'google/vit-base-patch16-224-in21k'
Image Dataloader
As previously mentioned, the ViT-base model has been pretrained on a dataset of images with a size of 224 x 224 pixels, normalized with a specific mean and standard deviation in each of their color channels.
As a result, before we can feed our own dataset into the ViT model for fine-tuning, we must first preprocess our images. This involves transforming each image into a tensor, resizing it to the appropriate dimensions, and then normalizing it with the same mean and standard deviation values as the dataset on which the model was pretrained.
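If you want to double-check which input size and normalization values the checkpoint expects, you can inspect its image processor from the transformers library (an optional check; for this checkpoint the reported mean and std are 0.5 per channel):
from transformers import ViTImageProcessor

# Load the image processor associated with the pretrained checkpoint
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')

print(processor.size)        # expected input resolution
print(processor.image_mean)  # per-channel normalization mean
print(processor.image_std)   # per-channel normalization std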
class ImageDataset(torch.utils.data.Dataset):

    def __init__(self, input_data):
        self.input_data = input_data
        # Transform the input data: tensor conversion, resizing, normalization
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize((224, 224), antialias=True),
            transforms.Normalize(mean=[0.5, 0.5, 0.5],
                                 std=[0.5, 0.5, 0.5])
        ])

    def __len__(self):
        return len(self.input_data)

    def get_images(self, idx):
        return self.transform(self.input_data[idx]['image'])

    def get_labels(self, idx):
        return self.input_data[idx]['label']

    def __getitem__(self, idx):
        # Get one preprocessed image and its label
        train_images = self.get_images(idx)
        train_labels = self.get_labels(idx)
        return train_images, train_labels
From the image dataloader above, we get batches of preprocessed images together with their corresponding labels. We can use the output of this image dataloader as the input to our model during the fine-tuning process.
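As a quick illustration (a minimal sketch; the batch size of 4 is arbitrary), this is how the dataset class is wrapped in a DataLoader and what one batch looks like:
# Wrap the training split and fetch a single batch
train_dataset = ImageDataset(dataset['train'])
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

images, labels = next(iter(train_dataloader))
print(images.size())   # torch.Size([4, 3, 224, 224])
print(labels)          # tensor with 4 integer labels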
Model Definition
The architecture of our ViT model is straightforward. Since we'll be fine-tuning a pretrained model, we can use the ViTModel.from_pretrained() method and provide the model's checkpoint as an argument.
We also need to add a linear layer at the end, which acts as the final classifier. The output dimension of this layer should be equal to the number of distinct labels in our dataset.
class ViT(nn.Module):

    def __init__(self, config=ViTConfig(), num_labels=20,
                 model_checkpoint='google/vit-base-patch16-224-in21k'):
        super(ViT, self).__init__()
        self.vit = ViTModel.from_pretrained(model_checkpoint, add_pooling_layer=False)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, x):
        x = self.vit(x)['last_hidden_state']
        # Use the embedding of the [CLS] token
        output = self.classifier(x[:, 0, :])
        return output
The ViT model above generates final vector embeddings for each image patch plus the [CLS] token. To classify an image, as you can see above, we extract the final vector embedding of the [CLS] token and pass it through the final linear layer to obtain the class prediction.
Model Fine-Tuning
Now that we have defined the model architecture and prepared the input images for batching, we can start fine-tuning our ViT model. The training script is a standard PyTorch training script, as you can see below:
def model_train(dataset, epochs, learning_rate, bs):

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    # Load the model, loss function, and optimizer
    model = ViT().to(device)
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = Adam(model.parameters(), lr=learning_rate)

    # Load the batched images
    train_dataset = ImageDataset(dataset)
    train_dataloader = DataLoader(train_dataset, num_workers=1, batch_size=bs, shuffle=True)

    # Fine-tuning loop
    for i in range(epochs):
        total_acc_train = 0
        total_loss_train = 0.0

        for train_image, train_label in tqdm(train_dataloader):
            output = model(train_image.to(device))
            loss = criterion(output, train_label.to(device))

            acc = (output.argmax(dim=1) == train_label.to(device)).sum().item()
            total_acc_train += acc
            total_loss_train += loss.item()

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        print(f'Epochs: {i + 1} | Loss: {total_loss_train / len(train_dataset): .3f} | Accuracy: {total_acc_train / len(train_dataset): .3f}')

    return model
# Hyperparameters
EPOCHS = 10
LEARNING_RATE = 1e-4
BATCH_SIZE = 8
# Train the model
trained_model = model_train(dataset['train'], EPOCHS, LEARNING_RATE, BATCH_SIZE)
Since our snacks dataset has 20 distinct classes, we're dealing with a multiclass classification problem, so CrossEntropyLoss() is the appropriate loss function. In the example above, we train our model for 10 epochs with a learning rate of 1e-4 and a batch size of 8. You can play around with these hyperparameters to tune the performance of the model.
After training the model, you'll see the loss and accuracy printed after each epoch, and both should gradually improve as fine-tuning progresses.
Model Prediction
Now that we have fine-tuned our model, we naturally want to use it for predictions on the test data. To do so, let's first create a function that encapsulates all the necessary image preprocessing steps and the model inference step.
def predict(img):

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    # Same preprocessing as during training
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((224, 224), antialias=True),
        transforms.Normalize(mean=[0.5, 0.5, 0.5],
                             std=[0.5, 0.5, 0.5])
    ])
    img = transform(img)

    # Run inference without tracking gradients
    trained_model.eval()
    with torch.no_grad():
        output = trained_model(img.unsqueeze(0).to(device))
    prediction = output.argmax(dim=1).item()

    return id2label[prediction]
As you can see above, the image preprocessing steps during inference are exactly the same as the ones we applied to the training data. We then use the transformed image as input to our trained model, and finally map its prediction to the corresponding label.
If we want to predict a specific image from the test data, we can simply call the function above and get the prediction right away. Let's try it out.
print(predict(dataset['test'][900]['image']))
# Output: waffle
Our model predicted the test image correctly. Let's try another one.
print(predict(dataset['test'][250]['image']))
# Output: cookie
And our model predicted the test data correctly again. By fine-tuning a ViT model, we can get good performance on a custom dataset. You can apply the same process to any custom dataset for an image classification task.
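If you want a more systematic check than spot-predicting single images, a simple loop over the test split gives an overall accuracy (a minimal sketch reusing the ImageDataset class and the trained model from above; evaluating all 952 test images can take a while without a GPU):
# Evaluate overall accuracy on the test split
test_dataset = ImageDataset(dataset['test'])
test_dataloader = DataLoader(test_dataset, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
correct = 0

with torch.no_grad():
    for test_image, test_label in test_dataloader:
        output = trained_model(test_image.to(device))
        correct += (output.argmax(dim=1) == test_label.to(device)).sum().item()

print(f'Test accuracy: {correct / len(test_dataset):.3f}')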
In this article, we have seen how the Transformer can be used not only for language modeling tasks, but also for computer vision tasks, in this case image classification.
To do so, the input image is first decomposed into patches of 16 x 16 pixels. The Vision Transformer model then uses a stack of Transformer encoders to learn a vector representation of each image patch. Finally, we can use the final vector representation of the [CLS] token, prepended at the beginning of the patch sequence, to predict the label of our input image.
I hope this article is useful for you to get started with the Vision Transformer model. As always, you can find the code implementation presented in this article in this notebook.