Abstract:
Lip-reading recognition aims to recognize the words spoken by a talking
face using only video, without audio. It can help people with
communication difficulties, such as those caused by removal of the larynx
in a total laryngectomy. Reading lips manually is difficult for humans, so
it is necessary to build lip-reading recognition models with good accuracy
and efficiency that enable practical applications. Lip-reading recognition involves
preprocessing steps such as face recognition, facial landmark detection and
image preprocessing, followed by mouth Region of Interest (ROI) extraction.
Currently, these preprocessing techniques are much improved in both efficiency
and accuracy. Therefore, most recent work focuses on improving
performance by developing optimal architectures. In this paper, we propose
a new lip-reading recognition model that uses Temporal Convolutional Networks
(TCN) for classification and different EfficientNet architectures for
feature extraction. First, we developed our base model for lip-reading
recognition with EfficientNet-B0 and TCN. Second, we evaluated the
performance of the developed lip-reading recognition model when
EfficientNet-B0 is replaced with the scaled variants of the EfficientNet family,
EfficientNet-B1 to B6. All models were trained for 80 epochs with the Adam
optimizer and a batch size of 32. We compared the performance of the
models across the different EfficientNet variants. The results
demonstrate that lip-reading recognition improves when TCN is
combined with the EfficientNet-B1, B2, and B3 architectures, which achieve
accuracies of 83.8%, 81.8%, and 84.32%, respectively.
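The following is a minimal sketch, in PyTorch, of the pipeline described above: per-frame mouth-ROI features are extracted with EfficientNet-B0 and a temporal convolutional head classifies the spoken word. It is an illustrative assumption, not the authors' implementation; the number of classes, TCN depth, and learning rate are placeholders.

    # Minimal sketch (assumed, not the authors' code) of EfficientNet-B0 + TCN lip reading.
    import torch
    import torch.nn as nn
    from torchvision.models import efficientnet_b0

    class TemporalBlock(nn.Module):
        """One dilated causal 1-D conv block, the basic unit of a TCN."""
        def __init__(self, channels, kernel_size=3, dilation=1):
            super().__init__()
            pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=pad, dilation=dilation)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.conv(x)
            out = out[:, :, :x.size(2)]      # trim padding to keep causal length
            return self.relu(out + x)        # residual connection

    class LipReader(nn.Module):
        def __init__(self, num_classes=500, feat_dim=1280):  # hypothetical vocabulary size
            super().__init__()
            backbone = efficientnet_b0(weights=None)
            self.frontend = backbone.features            # per-frame feature extractor
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.tcn = nn.Sequential(*[TemporalBlock(feat_dim, dilation=2 ** i)
                                       for i in range(3)])
            self.fc = nn.Linear(feat_dim, num_classes)

        def forward(self, video):                        # video: (B, T, 3, H, W) mouth ROIs
            b, t = video.shape[:2]
            frames = video.flatten(0, 1)                 # (B*T, 3, H, W)
            feats = self.pool(self.frontend(frames)).flatten(1)   # (B*T, feat_dim)
            feats = feats.view(b, t, -1).transpose(1, 2)           # (B, feat_dim, T)
            out = self.tcn(feats).mean(dim=2)            # temporal average pooling
            return self.fc(out)

    # Training setup mirroring the abstract: Adam optimizer, batch size 32, 80 epochs.
    model = LipReader()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Swapping efficientnet_b0 for efficientnet_b1 through efficientnet_b6 (with the matching feature dimension) reproduces the comparison of backbone variants described in the abstract.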