Video abnormal event detection based on long- and short-term time series correlation
Abstract
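Objective Multiple instance learning (MIL) is a powerful tool for weakly supervised video abnormal event detection. Abnormal events tend to be sparse, sudden, and locally continuous. However, current MIL methods do not fully consider the relationships between instances and ignore the temporal correlation between video segments, so they cannot sufficiently separate normal segments from abnormal ones. To address this problem, a two-stage anomaly detection network with long- and short-term time series correlation is proposed. Method The first stage is an anomaly detection network with long- and short-term time series correlation (long-and-short-term correlated MIL abnormal detection framework, LSC-transMIL). It applies the Transformer structure to MIL and adds local and global temporal attention mechanisms, strengthening the time series correlation of consecutive video segments while learning the spatially correlated semantic information between different video segments. The second stage builds an anomaly detection network based on a spatiotemporal attention mechanism: the abnormal scores generated in the first stage serve as fine-grained pseudo-labels, the abnormal event detection network is trained with a pseudo-label training strategy, and the backbone network is fine-tuned to improve the adaptability of the abnormal event detection network. Result In comparisons with similar methods on two large public datasets, the two-stage anomaly detection model reaches an area under the curve (AUC) of 82.88% on UCF-crime and 96.34% on ShanghaiTech, improvements of 1.58% and 0.58%, respectively, over other two-stage methods. Ablation experiments demonstrate the effectiveness of the time-series-focused Transformer module and of the long- and short-term attention. Conclusion This paper applies the Transformer to time-series MIL and adds long- and short-term attention to highlight the difference between local abnormal events and normal events, effectively detecting abnormal events in videos.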
Keywords
Video anomaly detection with long-and-short-term time series correlations
Zhu Xinrui, Qian Xiaoyan, Shi Yuzhou, Tao Xudong, Li Zhiyu (College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
Abstract
Objective Video anomaly detection has been applied in many fields, such as manufacturing, traffic management, and security monitoring. However, detailed annotation of video data is labor-intensive and cumbersome. Consequently, many researchers have turned to weakly supervised learning. Unlike supervised learning, weakly supervised learning requires only video-level labels in the training stage, which greatly reduces the annotation workload; frame-level labels are needed only for the test set. Multiple instance learning (MIL) is recognized as a powerful tool for weakly supervised video abnormal event detection. Abnormal behavior in video is highly correlated with video context information. The traditional MIL method extracts video features with a convolutional 3D network and trains with a ranking loss, into which sparsity and temporal smoothness constraints are introduced to integrate temporal information into the ranking model. Introducing temporal information only through the loss function is not enough. Using a temporal convolutional network to extract video context information further improves the video anomaly detection network. However, introducing temporal information globally in this way cannot sufficiently separate abnormal video clips from normal ones. Attention MIL therefore builds a temporal enhancement network to learn motion features and uses an attention mechanism to incorporate temporal information into the ranking model; the learned attention weights help distinguish abnormal from normal video clips. The spatiotemporal fusion graph network constructs a spatial similarity graph and a temporal continuity graph for the video segments and fuses them into a spatiotemporal fusion graph. 
This approach strengthens the spatiotemporal correlations among video segments and thereby improves the accuracy of abnormal behavior detection. The multiple instance self-training framework uses pseudo-label training, an effective strategy for improving model quality in weakly supervised learning. It builds a two-stage training network in which the pseudo-labels produced by the first-stage MIL guide the training of the second-stage self-guided attention feature extractor, providing a general way to improve model quality. However, these approaches do not fully exploit temporal correlations, because the feature representation of each instance is not fused with neighboring and global features. Abnormal events are often sparse, sudden, and locally continuous, and insufficient temporal correlation between video segments leads to inadequate separation of normal and abnormal segments. To address this issue, this paper proposes a two-stage abnormal detection network with long-and-short-term time series association. Method The first stage is a long-and-short-term time series association abnormal detection network (LSC-transMIL) that applies the Transformer structure to MIL. It consists of two layers, each containing a local temporal sequence correlation attention module and a global instance correlation attention module. The former learns temporal information between each instance and its neighboring instances, while the latter focuses on the association between each instance and global information. Combining local and global attention establishes meaningful correlations among instances and highlights the distinctions between local and global features in the video, which makes abnormal video segments easier to distinguish from normal ones. 
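The combination of local and global attention described above can be illustrated with a minimal sketch. It is not the authors' LSC-transMIL: it assumes plain scaled dot-product self-attention without learned projections, and the names `attend`, `local_global_layer`, the `window` parameter, and the residual sum used to fuse the two branches are all hypothetical choices made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, window=None):
    """Scaled dot-product self-attention over segment feature vectors.

    window=None -> global attention: every segment attends to all segments.
    window=w    -> local attention: segment i attends only to segments j
                   with |i - j| <= w (its temporal neighbours).
    Assumption: queries, keys, and values are the raw features themselves.
    """
    n, d = len(features), len(features[0])
    out = []
    for i in range(n):
        idx = [j for j in range(n) if window is None or abs(i - j) <= window]
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(d)
            for j in idx
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * features[j][k] for w, j in zip(weights, idx)) for k in range(d)]
        )
    return out


def local_global_layer(features, window=1):
    """Fuse local and global attention outputs with a residual sum (assumed)."""
    local = attend(features, window)
    global_ = attend(features)
    return [
        [f + l + g for f, l, g in zip(fv, lv, gv)]
        for fv, lv, gv in zip(features, local, global_)
    ]


# Three video segments with 2-dimensional features (toy values).
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = local_global_layer(segments, window=1)
```

With `window=0` each segment attends only to itself and the local branch returns the input unchanged, which makes the effect of widening the temporal neighbourhood easy to inspect.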
LSC-transMIL generates new instance features, which are then fed into the ranking model to generate video abnormal scores and pseudo-labels. In the second stage, an abnormal detection network based on a spatiotemporal attention mechanism is constructed. The SlowFast backbone network extracts video features, and the slow and fast pathway features are weighted and fused using spatiotemporal attention. The slow branch attends to the spatiotemporal information of the video frames through the spatiotemporal attention module, while the fast branch directs attention to temporal information through the time-dimensional attention module; the features of the two branches are then concatenated to obtain the final video features. The abnormal scores generated in the first stage are used as fine-grained pseudo-labels to train the abnormal event detection network with a pseudo-label training strategy. Furthermore, the backbone network is fine-tuned to enhance the adaptive capability of the abnormal event detection network. Result Extensive experiments were conducted on two large-scale public datasets (UCF-crime and ShanghaiTech) to compare the proposed two-stage abnormal detection model with similar methods. The two-stage model achieved area under the curve (AUC) scores of 82.88% and 96.34% on the UCF-crime and ShanghaiTech datasets, respectively, improvements of 1.58% and 0.58% over other two-stage methods. Ablation experiments were conducted on both datasets: LSC-transMIL, the traditional MIL method, and the attention MIL method were compared under three backbone networks, demonstrating the effectiveness of LSC-transMIL. Qualitative and quantitative analyses of the ablations with global attention alone and with combined global and local attention demonstrate the effectiveness of combining the two. 
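The AUC values reported above are computed against frame-level test labels. The sketch below shows one way such a score can be obtained from segment-level abnormal scores. It is an illustration only: the helper names `expand_to_frames` and `roc_auc`, and the default of 16 frames per segment, are assumptions rather than details taken from the paper, and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def expand_to_frames(segment_scores, frames_per_segment=16):
    """Repeat each segment score for every frame in that segment.

    frames_per_segment=16 is an assumed value, not one stated in the paper.
    """
    frames = []
    for score in segment_scores:
        frames.extend([score] * frames_per_segment)
    return frames


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen abnormal frame (label 1)
    receives a higher score than a randomly chosen normal frame (label 0),
    with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one abnormal and one normal frame")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three segments of two frames each, the middle one abnormal.
frame_scores = expand_to_frames([0.1, 0.8, 0.3], frames_per_segment=2)
frame_labels = [0, 0, 1, 1, 0, 0]
auc = roc_auc(frame_scores, frame_labels)
```

In the toy example every abnormal frame outscores every normal frame, so `auc` is 1.0; any normal frame scoring above an abnormal one lowers the value.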
The role of local and global temporal correlation is visualized using heat maps. Conclusion This paper applies the Transformer to time series-based MIL and introduces long-and-short-term attention to highlight the differences between local abnormal events and normal events. The proposed two-stage abnormal detection network uses the abnormal scores generated in the first stage as pseudo-labels, trains a network built on the SlowFast backbone and spatiotemporal attention modules, and fine-tunes the backbone to enhance the adaptive capability of the abnormal detection network. The proposed approach effectively improves the accuracy of abnormal event detection.
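To make the pseudo-label idea concrete, the sketch below turns first-stage segment scores into per-frame soft labels and evaluates a binary cross-entropy loss against them. The abstract does not specify how the scores are converted or which loss is used, so everything here is an assumption: the min-max normalization, the all-zero labels for normal videos, the 16 frames per segment, and the function names `make_pseudo_labels` and `soft_bce` are hypothetical.

```python
import math


def make_pseudo_labels(segment_scores, video_is_abnormal, frames_per_segment=16):
    """Convert stage-1 segment scores into per-frame soft pseudo-labels.

    Assumptions (not from the paper):
    - a video labelled normal gets label 0.0 for every frame;
    - scores of an abnormal video are min-max normalized to [0, 1];
    - each segment covers `frames_per_segment` frames.
    """
    n_frames = len(segment_scores) * frames_per_segment
    if not video_is_abnormal:
        return [0.0] * n_frames
    lo, hi = min(segment_scores), max(segment_scores)
    span = hi - lo
    labels = []
    for s in segment_scores:
        # If all scores are equal, fall back to the raw score.
        value = (s - lo) / span if span > 0 else s
        labels.extend([value] * frames_per_segment)
    return labels


def soft_bce(predictions, pseudo_labels, eps=1e-7):
    """Mean binary cross-entropy against soft (non-binary) targets."""
    total = 0.0
    for p, y in zip(predictions, pseudo_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(predictions)


# Toy example: one abnormal video with three segments of two frames each.
labels = make_pseudo_labels([0.25, 0.5, 0.75], True, frames_per_segment=2)
loss = soft_bce([0.1, 0.1, 0.5, 0.5, 0.9, 0.9], labels)
```

A second-stage network trained this way receives a graded target per frame instead of a single video-level label, which is what "fine-grained" refers to here.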
Keywords
anomaly detection; Transformer; spatio-temporal attention; multiple instance learning (MIL); weakly supervised