Application of Vision Transformers in Online Advertisement Identification

Liyanage, C.R.; Madushika, M.K.S.; Nawarathna, R.D.

URI: http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/5709

Citation: Liyanage, C. R., Madushika, M. K. S., & Nawarathna, R. D. (2022). Application of Vision Transformers in Online Advertisement Identification. 19th Academic Sessions, University of Ruhuna, Matara, Sri Lanka. 13.

Date: 2022-03-02

Abstract:

Advertisements (ads) play an important role in many sectors, such as business, education and government, as they can influence cultural and religious aspects of a society by disseminating important messages to people. Image-based advertisements generally differ from other images and are more creative: they contain slogans explaining the message of the ad, symbolic and atypical objects, and unusual placements of objects within the image. Distinguishing advertisements from other images is important in digital media, whether to capture customer attention or to block ads on websites. This study proposes a supervised learning approach to classify images as ads or not-ads. A second objective is to verify the applicability of Vision Transformers (ViT) in the domain of image-based ad analysis. ViT is a recent image classification architecture, used for the same tasks as the Convolutional Neural Network (CNN), in which images are divided into patches and processed using a technique called "multi-headed self-attention". The experiment was conducted using 19,700 images labelled as ad or not-ad. Two ViT models with different patch sizes, both pre-trained on the ImageNet-21K dataset, were used for image classification. Both models were trained with a batch size of 10 for a maximum of 20 epochs. The dataset was split into training and testing sets, and the validation split was set to 0.2. The highest validation accuracy, 82%, was obtained with the model using 32x32 patches. In its testing phase, this model achieved an accuracy of 84%, a precision of 85%, and a recall of 84%. The results of this study were compared with state-of-the-art research using CNNs. The study shows that the ViT architecture can achieve comparable results with the limited computational resources available.
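
As a rough illustration of how patch size affects a ViT model, the sketch below counts the patches (and the resulting input tokens) for a square image. It is not taken from the study: the 224x224 input resolution, the 16x16 patch size for the second model, and the helper name patch_count are assumptions made here for illustration. Only the 32x32 patch size comes from the abstract.

```python
def patch_count(image_size: int, patch_size: int) -> int:
    """Return the number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: the abstract does not state the input resolution; 224 is a
# common choice for ImageNet-21K pre-trained ViT models.
IMAGE_SIZE = 224

# 32 is the patch size named in the abstract; 16 is an assumed value for
# the second model, whose patch size is not given.
for patch_size in (16, 32):
    n = patch_count(IMAGE_SIZE, patch_size)
    # A ViT classifier typically prepends one extra class token.
    print(f"{patch_size}x{patch_size} patches: {n} patches, {n + 1} tokens")
```

Under these assumptions, the 32x32 model attends over about a quarter as many tokens as a 16x16 model would, which reduces the cost of self-attention and fits the abstract's emphasis on limited computational resources.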
