
Please use this identifier to cite or link to this item: http://10.10.120.238:8080/xmlui/handle/123456789/824
Full metadata record
DC Field | Value | Language
dc.contributor.author | Singh H. | en_US
dc.contributor.author | Suman S. | en_US
dc.contributor.author | Subudhi B.N. | en_US
dc.contributor.author | Jakhetiya V. | en_US
dc.contributor.author | Ghosh A. | en_US
dc.date.accessioned | 2023-11-30T08:51:08Z | -
dc.date.available | 2023-11-30T08:51:08Z | -
dc.date.issued | 2022 | -
dc.identifier.issn | 2691-4581 | -
dc.identifier.other | EID(2-s2.0-85142827234) | -
dc.identifier.uri | https://dx.doi.org/10.1109/TAI.2022.3221912 | -
dc.identifier.uri | http://localhost:8080/xmlui/handle/123456789/824 | -
dc.description.abstract | Several research works have addressed action recognition. Unfortunately, when these algorithms are applied to low-light or dark videos, their performance degrades sharply. To improve action recognition in dark or low-light videos, in this article we develop an efficient deep 3D-CNN-based action recognition model. The proposed algorithm follows two stages. In the first stage, the low-light videos are enhanced using Zero-Reference Deep Curve Estimation (Zero-DCE), followed by a min-max sampling algorithm. In the second stage, we propose an action classification network to recognize the actions in the enhanced videos. In this network, we exploit the R(2+1)D architecture for spatio-temporal feature extraction. The model's overall generalization performance depends on how well it captures long-range temporal structure in videos, which is essential for action recognition, so we use a graph convolutional network (GCN) on top of R(2+1)D as our video feature encoder to capture long-term temporal dependencies of the extracted features. Finally, a Bidirectional Encoder Representations from Transformers (BERT) module classifies the actions from the 3D features extracted from the enhanced video scenes. The effectiveness of the proposed scheme is verified on the ARID V1.0 and ARID V1.5 datasets. The proposed algorithm achieves 96.60% and 99.88% Top-1 and Top-5 accuracy, respectively, on ARID V1.0, and 86.93% and 99.35% Top-1 and Top-5 accuracy, respectively, on ARID V1.5. To corroborate our findings, we compare the results of the proposed scheme with those of fifteen state-of-the-art action recognition techniques. IEEE (See the pipeline sketch after this metadata record.) | en_US
dc.language.iso | en | en_US
dc.publisher | Institute of Electrical and Electronics Engineers Inc. | en_US
dc.source | IEEE Transactions on Artificial Intelligence | en_US
dc.subject | Action recognition | en_US
dc.subject | Artificial intelligence | en_US
dc.subject | Bit error rate | en_US
dc.subject | Convolutional neural networks | en_US
dc.subject | Dark video | en_US
dc.subject | Feature extraction | en_US
dc.subject | Image processing | en_US
dc.subject | Three-dimensional displays | en_US
dc.subject | Transformers | en_US
dc.subject | Videos | en_US
dc.title | Action Recognition in Dark Videos using Spatio-temporal Features and Bidirectional Encoder Representations from Transformers | en_US
dc.type | Journal Article | en_US
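
The abstract above describes a two-stage pipeline: Zero-DCE enhancement with min-max frame sampling, then an R(2+1)D backbone whose clip features are mixed by a GCN and classified by a BERT-style encoder. Below is a minimal PyTorch sketch of the second stage only; it is not the authors' implementation. The frame sampler, the fully connected temporal graph, the layer sizes, and the use of torchvision's r2plus1d_18 backbone with a generic nn.TransformerEncoder standing in for BERT are all assumptions made for illustration.

```python
# Hedged sketch of the classification stage described in the abstract.
# Assumptions (not from the paper): sampler heuristic, graph construction,
# layer sizes, and the stand-in backbone/encoder choices noted below.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18


def min_max_sample(frames: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Illustrative frame sampler (placeholder for the paper's min-max sampling).

    Keeps the num_frames frames with the largest intensity range (max - min).
    frames: (T, C, H, W) tensor of an already Zero-DCE-enhanced video.
    """
    contrast = frames.amax(dim=(1, 2, 3)) - frames.amin(dim=(1, 2, 3))
    idx = contrast.topk(min(num_frames, frames.shape[0])).indices.sort().values
    return frames[idx]


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step over per-clip feature nodes (stand-in for the GCN)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; adj: (N, N) normalized adjacency.
        return torch.relu(self.linear(adj @ x))


class DarkActionClassifier(nn.Module):
    def __init__(self, num_classes: int = 11, feat_dim: int = 512):
        super().__init__()
        backbone = r2plus1d_18(weights=None)   # R(2+1)D spatio-temporal features
        backbone.fc = nn.Identity()            # expose 512-d clip embeddings
        self.backbone = backbone
        self.gcn = SimpleGCNLayer(feat_dim)    # long-range temporal mixing across clips
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)  # BERT-like stage
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, N, C, T, H, W) -- N short clips per (already enhanced) video.
        b, n = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, n, -1)   # (B, N, 512)
        adj = torch.full((n, n), 1.0 / n, device=clips.device)      # fully connected graph
        feats = torch.stack([self.gcn(f, adj) for f in feats])      # (B, N, 512)
        feats = self.encoder(feats)                                  # temporal attention
        return self.head(feats.mean(dim=1))                          # (B, num_classes)


if __name__ == "__main__":
    model = DarkActionClassifier(num_classes=11)  # ARID defines 11 action categories
    dummy = torch.randn(2, 4, 3, 16, 112, 112)    # 2 videos, 4 clips of 16 RGB frames
    print(model(dummy).shape)                     # torch.Size([2, 11])
```
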
Appears in Collections: Journal Article

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.