Video anomaly detection by fusing self-attention and autoencoder
Liang Jiafei1,2,3, Li Ting4, Yang Jiaqi5, Li Yanan6,7, Fang Zhiwen1,2,3, Yang Feng1,2,3 (1. School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China; 2. Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China; 3. Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China; 4. School of Nursing, Southern Medical University, Guangzhou 510515, China; 5. School of Computer Science, Northwestern Polytechnical University, Xi'an 710114, China; 6. School of Computer Science and Engineering, School of Artificial Intelligence, Wuhan Institute of Technology, Wuhan 430205, China; 7. Hubei Province Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430073, China) Abstract
Objective Video anomaly detection mines the patterns of normal event samples and detects events that do not conform to these patterns as anomalies. Autoencoder-based models are widely used in this field, but the feature extraction of self-supervised learning is somewhat blind, which limits the feature representation capability of the network. To strengthen the model's ability to learn normal patterns, a video anomaly detection method based on Transformer and U-Net is proposed. Method First, the encoder downsamples the input consecutive frames to extract low-level features, and the feature map of the last layer is fed into a Transformer to encode global information and learn the correlations among feature pixels. The decoder then upsamples the encoded features and fuses them through skip connections with the encoder's low-level features of the same resolution, combining global spatial information with local detail information to localize anomalies. To meet the demand for anomaly feedback on close-range rehabilitation actions, an indoor close-range dataset of periodic actions is collected, and a dynamic image constraint is further introduced to guide the network to focus on close-range periodic motion regions. Result The method is compared with similar approaches on four outdoor public datasets and one indoor close-range dataset. On the outdoor datasets CUHK (Chinese University of Hong Kong) Avenue, UCSD Ped1 (University of California, San Diego, pedestrian 1), UCSD Ped2, and LV (live videos), the frame-level AUC (area under curve) of the proposed algorithm improves by 1.0%, 0.4%, 1.1%, and 6.8%, respectively. On the indoor dataset, the proposed algorithm outperforms similar algorithms by more than 1.6%. Ablation results verify the effectiveness of the Transformer module and of the dynamic image constraint, respectively. Conclusion Combining the U-Net with the self-attention-based Transformer strengthens the model's ability to learn normal patterns and thus detects abnormal events in videos effectively.
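To make the frame-level AUC protocol mentioned above concrete, the following is a minimal sketch of the scoring pipeline that prediction-based detectors conventionally use. The PSNR-based score and the per-video min-max normalization are common conventions in this line of work, assumed here for illustration; the paper's exact scoring details are not specified in the abstract.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def psnr(pred, gt):
    # peak signal-to-noise ratio between predicted and true frames,
    # both float arrays scaled to [0, 1]
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(1.0 / (mse + 1e-8))

def anomaly_scores(psnrs):
    # min-max normalize PSNR within one video; low PSNR -> high anomaly score
    p = np.asarray(psnrs, dtype=np.float64)
    return 1.0 - (p - p.min()) / (p.max() - p.min() + 1e-8)

# frame_labels: 1 for abnormal frames, 0 for normal ones
# auc = roc_auc_score(frame_labels, anomaly_scores(frame_psnrs))
```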
Video anomaly detection by fusing self-attention and autoencoder
Liang Jiafei1,2,3, Li Ting4, Yang Jiaqi5, Li Yanan6,7, Fang Zhiwen1,2,3, Yang Feng1,2,3 (1. School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China; 2. Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China; 3. Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China; 4. School of Nursing, Southern Medical University, Guangzhou 510515, China; 5. School of Computer Science, Northwestern Polytechnical University, Xi'an 710114, China; 6. School of Computer Science and Engineering, School of Artificial Intelligence, Wuhan Institute of Technology, Wuhan 430205, China; 7. Hubei Province Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430073, China) Abstract
Objective Anomaly detection has become an important topic in the video surveillance domain, and it is also relevant to long-distance rehabilitation motion analysis, where irregular motions must be detected. Because it is difficult to obtain training samples that cover all types of abnormal events, existing video anomaly detection methods usually train a model on datasets that contain only normal samples. In the testing phase, events whose patterns differ from the learned normal patterns are detected as anomalies. To represent normal motion patterns in videos, early works relied on hand-crafted features and focused on low-level trajectory features. However, effective trajectory features are difficult to obtain in complicated scenarios. Spatial-temporal features such as the histogram of oriented flows (HOF) and the histogram of oriented gradients (HOG) are commonly used to represent motion and content in anomaly detection. To model motion and appearance patterns, spatial-temporal features have been combined with Markov random fields (MRF), the mixture of probabilistic PCA (MPPCA), and Gaussian mixture models. Based on the assumption that normal patterns can be represented by linear combinations over dictionaries, sparse coding and dictionary learning have also been used to encode normal patterns. Owing to the insufficient descriptive power of hand-crafted features, however, the robustness of these models remains poor in multiple scenarios. More recently, autoencoder-based deep learning methods have been introduced into video anomaly detection. A 3D convolutional autoencoder was designed to model normal patterns in regular frames. A convolutional long short-term memory (LSTM) autoencoder, which incorporates a convolutional neural network (CNN) into LSTM, was developed to model normal appearance and motion patterns simultaneously. Motivated by the strong performance of sparse-coding-based anomaly detection, an adaptive iterative hard-thresholding algorithm was designed within an LSTM framework to learn sparse representations and dictionaries of normal patterns. In contrast to reconstruction-based models, autoencoder-based prediction networks detect anomalies by computing the error between predicted frames and their ground truth. Additionally, a multipath frame-prediction network based on the convolutional gated recurrent unit (ConvGRU) was demonstrated to process spatial-temporal information of different scales. Because of the blindness of self-supervised learning in anomaly detection, CNN-based methods are limited in mining normal patterns. To improve the capability of feature expression, the vision transformer (ViT) extends the Transformer from natural language processing to the image domain, and CNNs can be integrated with Transformers to learn global context information. Hence, we develop a Transformer and U-Net-based anomaly detection method as well. Method In this study, a Transformer is embedded in a naive U-Net to learn the local and global spatial-temporal information of normal events. First, an encoder is designed to extract spatial-temporal features from consecutive frames. To encode global information and learn the correlations between feature pixels, the final features of the encoder are fed into the Transformer. A decoder then upsamples the Transformer features and merges them, via skip connections, with the encoder's low-level features of the same resolution.
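As a rough illustration of this encoder-Transformer-decoder design, the sketch below embeds a Transformer encoder at the bottleneck of a four-level U-Net frame predictor in PyTorch. The 3 × 3 kernels, 2 × 2 max pooling, four encoder/decoder levels, and skip connections follow the abstract; the channel widths, number of attention heads and layers, and the four-frame stacked-RGB input are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # two 3x3 convolutions per level, as stated in the abstract
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TransUNetPredictor(nn.Module):
    def __init__(self, in_frames=4, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]   # four encoder levels
        self.encs = nn.ModuleList()
        c = in_frames * 3                            # stacked RGB input frames
        for co in chs:
            self.encs.append(conv_block(c, co))
            c = co
        self.pool = nn.MaxPool2d(2)                  # 2x2 max pooling
        layer = nn.TransformerEncoderLayer(d_model=chs[-1], nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        # 3x3 deconvolutions that double the spatial resolution
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(chs[i], chs[i - 1], 3, stride=2,
                               padding=1, output_padding=1)
            for i in range(3, 0, -1))
        self.decs = nn.ModuleList(
            conv_block(chs[i - 1] * 2, chs[i - 1]) for i in range(3, 0, -1))
        self.head = nn.Conv2d(base, 3, 3, padding=1)  # predicted next frame

    def forward(self, x):                            # x: (B, in_frames*3, H, W)
        skips = []
        for i, enc in enumerate(self.encs):
            x = enc(x)
            if i < len(self.encs) - 1:
                skips.append(x)
                x = self.pool(x)
        b, c, h, w = x.shape                         # bottleneck feature map
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C) pixel tokens
        x = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)
        for up, dec, skip in zip(self.ups, self.decs, reversed(skips)):
            x = up(x)                                # upsample
            x = dec(torch.cat([x, skip], dim=1))     # fuse via skip connection
        return self.head(x)
```

For a clip of four RGB frames stacked along the channel axis, the input has shape (B, 12, H, W) with H and W divisible by 8, and the output is the predicted next frame.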
The whole network thus combines global spatial-temporal information with local detail information. The convolution and deconvolution kernels are 3 × 3, the max-pooling kernel is 2 × 2, and the encoder and decoder each have four layers. To keep predicted frames close to their ground truth, we minimize the intensity and gradient distances between them. Because existing anomaly detection datasets feature outdoor, long-distance settings, we collected an indoor motion dataset from published hand-movement datasets to meet the requirements of anomaly detection for close-range rehabilitation movement. For periodic hand movements, in addition to the traditional reconstruction loss, we introduce a dynamic image constraint to further guide the network to focus on the periodic close-range motion area. Result We compare the proposed approach with several anomaly detection methods on four outdoor public datasets and one indoor dataset. The frame-level area under the curve (AUC) improves by 1.0%, 0.4%, and 1.1% on Avenue, Ped1, and Ped2, respectively, and the method detects abnormal events effectively even on the low-resolution Ped1 and Ped2. On the LV dataset, it achieves an AUC of 65.1%. Since the Transformer-based network captures richer feature information through the self-attention mechanism, the proposed network can mine various normal patterns in multiple scenes and improve detection performance effectively. On the collected indoor dataset, the performance on four actions, denoted A1-1, A1-2, A1-3, and A1-4, reaches 60.3%, 63.4%, 67.7%, and 64.4%, respectively. To verify the effectiveness of the Transformer module and the dynamic image constraint, we conduct ablation experiments by removing each of them in the training phase. The results show that the Transformer module improves anomaly detection performance, and the dynamic image constraint improves the four indoor actions by 0.6%, 2.4%, 1.1%, and 0.9%, respectively, which indicates that the dynamic image loss drives the network to attend to the foreground motion area. Conclusion We develop a video anomaly detection method based on Transformer and U-Net and collect an indoor motion dataset for the abnormal analysis of indoor close-range rehabilitation movement. Experimental results show that our method detects abnormal behaviors in indoor and outdoor videos effectively.
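The training objective pairs intensity and gradient distances with the dynamic image constraint. The sketch below is one plausible reading under stated assumptions: the loss weights are illustrative, the dynamic image uses a simple linear rank-pooling approximation, and the constraint is realized by comparing dynamic images of the clip ending in the predicted frame against the clip ending in its ground truth; none of these specifics are fixed by the abstract.

```python
import torch
import torch.nn.functional as F

def gradient_loss(pred, gt):
    # L1 distance between horizontal and vertical image gradients
    dx_p = pred[..., :, 1:] - pred[..., :, :-1]
    dx_g = gt[..., :, 1:] - gt[..., :, :-1]
    dy_p = pred[..., 1:, :] - pred[..., :-1, :]
    dy_g = gt[..., 1:, :] - gt[..., :-1, :]
    return (dx_p - dx_g).abs().mean() + (dy_p - dy_g).abs().mean()

def dynamic_image(frames):
    # frames: (B, T, C, H, W); collapse the clip with linear rank-pooling
    # weights alpha_t = 2t - T - 1 (a common approximation, assumed here)
    T = frames.shape[1]
    alpha = torch.arange(1, T + 1, dtype=frames.dtype, device=frames.device)
    alpha = 2 * alpha - T - 1
    return (alpha.view(1, T, 1, 1, 1) * frames).sum(dim=1)

def total_loss(pred, gt, clip_pred, clip_gt, lam_grad=1.0, lam_dyn=0.1):
    # clip_pred / clip_gt: input frames plus the predicted / true next frame;
    # lam_* are illustrative weights, not values from the paper
    l_int = F.mse_loss(pred, gt)                      # intensity distance
    l_grad = gradient_loss(pred, gt)                  # gradient distance
    l_dyn = F.l1_loss(dynamic_image(clip_pred),       # dynamic image constraint
                      dynamic_image(clip_gt))
    return l_int + lam_grad * l_grad + lam_dyn * l_dyn
```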
Keywords
anomaly detection; convolutional neural network (CNN); Transformer encoder; self-attention mechanism; self-supervised learning