Industrial packing action recognition with dual-view 3D convolutional networks
Abstract
Objective In modern automated and intelligent manufacturing, action recognition technology plays an increasingly important role, but the complexity of real production environments makes it a challenging task. Methods that combine 3D convolutional networks with optical flow currently show good action recognition performance, yet they still handle occlusion of the human body poorly, and optical flow is too computationally expensive for real-time applications. To address the occlusion and optical-flow cost problems in real industrial packing scenes, this paper proposes a packing action recognition method based on a dual-view 3D convolutional network. Method First, stacked residual frames (RF) are used as model input to extract motion features better, replacing the optical flow that is unavailable in real-time scenes; the original RGB frames and the residual frames are fed into two parallel 3D ResNeXt101 networks. Second, a dual-view structure is adopted to address occlusion: 3D ResNeXt101 is extended into a dual-view model in which a view-pooling layer with learnable weights fuses features from views at different angles, and this dual-view 3D ResNeXt101 then performs action recognition. Finally, to further improve the true negative rate (TNR) of the detection results, a denoising autoencoder and a two-class support vector machine (SVM) are added to the model. Result Experiments were conducted on a packing scene in a real production environment, evaluated by accuracy and true negative rate: the method achieves a packing action recognition accuracy of 94.2% and a TNR of 98.9%. It was also evaluated on the public UCF (University of Central Florida) 101 dataset, where it achieves 97.9% accuracy with accuracy as the metric, further verifying its effectiveness. Conclusion The proposed human action recognition method effectively exploits action information from multiple views and combines traditional and deep-learning models, significantly improving both recognition accuracy and the true negative rate.
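A minimal PyTorch sketch of the stacked residual-frame input described above; the (C, T, H, W) clip layout and the clip length are illustrative assumptions, not the paper's exact preprocessing:

```python
import torch

def stacked_residual_frames(clip: torch.Tensor) -> torch.Tensor:
    """Stacked residual frames: differences between consecutive frames,
    a cheap motion cue that stands in for optical flow.

    clip: (C, T, H, W) tensor holding T consecutive RGB frames.
    Returns a (C, T-1, H, W) tensor where slice t is frame[t+1] - frame[t].
    """
    return clip[:, 1:] - clip[:, :-1]

# Example: a 16-frame RGB clip yields 15 stacked residual frames.
clip = torch.rand(3, 16, 112, 112)
print(stacked_residual_frames(clip).shape)  # torch.Size([3, 15, 112, 112])
```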
Keywords
action recognition; dual view; 3D convolutional neural network; denoising autoencoder; support vector machine (SVM)
Dual-view 3D ConvNet-based industrial packing action recognition
Hu Haiyang, Pan Jian, Li Zhongjin (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China)
Abstract
Objective Action recognition is an active topic in computer vision, with applications such as intelligent video surveillance, human-computer interaction, virtual reality, and medical image analysis. It plays an important role in automated and intelligent modern manufacturing, but the complexity of real manufacturing environments keeps it challenging. Recent progress in this direction is largely attributable to deep neural networks, especially 3D convolutional networks, which use 3D convolutions to capture temporal information. With the added temporal dimension, 3D convolutional networks extract the spatio-temporal features of videos better than 2D convolutional networks. Fusing optical flow into a 3D convolutional network currently yields good action recognition performance, but it still cannot resolve occlusion of the human body, and optical flow is too computationally expensive to apply in real-time scenes. When action recognition is applied in production scenes, the product qualification rate must be guaranteed: unqualified products should be screened out as far as possible while keeping both the accuracy and the true negative rate (TNR) of the detection results high. Existing action recognition methods find the true negative rate difficult to optimize. We therefore propose a packing action recognition method based on a dual-view 3D convolutional network. Method First, we use stacked residual frames as model input to extract motion features better, replacing optical flow, which is unavailable in real-time scenes. The original RGB frames and the residual frames are fed into two parallel 3D ResNeXt101 networks, and a concatenation layer joins the features extracted by the last convolutional layer of each network. Next, we adopt a dual-view structure to resolve occlusion of the human body: 3D ResNeXt101 is extended into a dual-view model with a learnable view-pooling layer that fuses the features of the different views, and this dual-view 3D ResNeXt101 then performs action recognition. Finally, a denoising autoencoder and a two-class support vector machine (SVM) are added to the model to further improve the TNR of the detection results. The features produced by dual-view pooling are fed into the trained denoising autoencoder, which optimizes them and reduces their dimensionality, and the two-class SVM model then performs secondary recognition. Result We conducted experiments in a packing scenario and evaluated with two metrics, accuracy and true negative rate. Our packing action recognition model achieves 94.2% accuracy and a 98.9% TNR, improving on existing action recognition methods. The dual-view structure raises accuracy from 91.1% to 95.8%, and the residual-frames module raises it from 88.2% to 95.8%. If the residual-frames module is replaced by an optical-flow module, the accuracy is 96.2%, roughly equivalent to the residual-frames version. Adding the two-class SVM to the model without the denoising autoencoder yields only 91.5% accuracy.
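The following PyTorch sketch illustrates one plausible shape of the architecture described in the Method paragraph: two parallel 3D backbones (an RGB stream and a residual-frame stream) whose last-layer features are concatenated, followed by a view-pooling layer with learnable weights. The ToyBackbone module, the softmax normalization of the view weights, and all dimensions are illustrative assumptions standing in for 3D ResNeXt101; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Toy stand-in for 3D ResNeXt101: one 3D conv + global average pooling."""
    def __init__(self, in_ch: int = 3, feat_dim: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, feat_dim, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (B, C, T, H, W)
        return self.conv(x).mean(dim=(2, 3, 4))    # (B, feat_dim)

class DualViewPooling(nn.Module):
    """Fuse per-view feature vectors with learnable, softmax-normalized weights."""
    def __init__(self, num_views: int = 2):
        super().__init__()
        self.view_weights = nn.Parameter(torch.ones(num_views))

    def forward(self, view_feats):                 # (num_views, B, D)
        w = torch.softmax(self.view_weights, dim=0)
        return (w[:, None, None] * view_feats).sum(dim=0)  # (B, D)

class DualViewTwoStream(nn.Module):
    """Per view: RGB and residual-frame streams, concatenated; the two
    views are then fused by the learnable view pooling and classified."""
    def __init__(self, feat_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.rgb_net = ToyBackbone(3, feat_dim)
        self.rf_net = ToyBackbone(3, feat_dim)
        self.view_pool = DualViewPooling(num_views=2)
        self.fc = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb_views, rf_views):
        # rgb_views / rf_views: lists of per-view clips, each (B, 3, T, H, W)
        per_view = [torch.cat([self.rgb_net(rgb), self.rf_net(rf)], dim=1)
                    for rgb, rf in zip(rgb_views, rf_views)]
        fused = self.view_pool(torch.stack(per_view))  # (B, 2*feat_dim)
        return self.fc(fused)

# Two camera views; residual clips have one fewer frame than RGB clips.
model = DualViewTwoStream()
rgb = [torch.rand(4, 3, 16, 112, 112) for _ in range(2)]
rf = [torch.rand(4, 3, 15, 112, 112) for _ in range(2)]
print(model(rgb, rf).shape)  # torch.Size([4, 2])
```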
Thanks to the denoising autoencoder's optimization and dimensionality reduction of the feature vectors, combining it with the two-class SVM reaches 94.2% accuracy and the highest TNR of 98.9%. After the denoising autoencoder and two-class SVM are added to the model, the TNR rises from 95.7% to 98.9% while accuracy drops by 1.6%. We also evaluated our method on the public dataset UCF (University of Central Florida) 101: our single-view model obtains 97.1% accuracy, the second highest among all compared methods, behind only 3D ResNeXt101's 98.0%. Conclusion We use a dual-view 3D ResNeXt101 model for effective packing action recognition. To obtain richer features from the RGB frames and the residual frames, two parallel 3D ResNeXt101 networks learn spatio-temporal features, and a learnable view-pooling layer performs dual-view feature fusion. In addition, a stacked denoising autoencoder is trained to optimize and reduce the dimensionality of the features extracted by the dual-view 3D ResNeXt101 model, and a two-class SVM model performs secondary detection to improve the true negative rate. Our method accurately recognizes the packing actions of workers and achieves a high true negative rate (TNR) in the recognition results.
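As a sketch of the secondary-recognition stage described above, the following combines a denoising autoencoder for feature optimization and dimensionality reduction with a two-class SVM (scikit-learn's SVC). The 4096-d feature dimension, 128-d code size, noise level, training schedule, and dummy data are illustrative assumptions, not values from the paper.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

class DenoisingAutoencoder(nn.Module):
    """Encode fused features into a compact code; trained to reconstruct
    clean features from noise-corrupted inputs."""
    def __init__(self, in_dim: int, code_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(dae, feats, noise_std=0.1, epochs=50, lr=1e-3):
    """Denoising objective: corrupt the input, reconstruct the clean features."""
    opt = torch.optim.Adam(dae.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        noisy = feats + noise_std * torch.randn_like(feats)
        loss = loss_fn(dae(noisy), feats)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Dummy stand-ins for the fused dual-view features and binary labels.
feats = torch.rand(200, 4096)
labels = np.random.randint(0, 2, size=200)   # 1 = packing action, 0 = other

dae = DenoisingAutoencoder(in_dim=4096, code_dim=128)
train_dae(dae, feats)
codes = dae.encoder(feats).detach().numpy()  # reduced features for the SVM

svm = SVC(kernel="rbf")                      # two-class SVM on the DAE codes
svm.fit(codes, labels)
print(svm.predict(codes[:5]))
```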
Keywords
action recognition; dual view; 3D convolutional neural network; denoising autoencoder; support vector machine (SVM)