Multi-branch convolutional network with channel-wise attention for depression recognition

Sun Haohao1, Shao Zhuhong1, Shang Yuanyuan1, Sun Xiaoni1, Hu Qiang2,3, Kong Youyong4 (1. College of Information Engineering, Capital Normal University, Beijing 100048, China; 2. Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200030, China; 3. School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 4. School of Computer Science and Engineering, Southeast University, Nanjing 210096, China)

Abstract
Objective Depression is a common affective mental disorder that causes numerous emotional and physical problems. In clinical practice, clinicians mainly assess the severity of depression through face-to-face interviews combined with their own experience. This diagnostic approach is highly subjective, time-consuming, and prone to misdiagnosis and missed diagnosis. To assess depression severity objectively and conveniently, this paper studies deep feature extraction from facial images and its application to automatic depression recognition, and constructs a multi-branch convolutional network model incorporating a channel-wise attention mechanism, based on global and local features of face images, for automatic recognition of depression severity. Method First, frames are extracted from the original videos, and a multi-task cascaded convolutional neural network is used to detect facial landmarks. After alignment, the whole face image and the eye and mouth regions are cropped and fed into deep convolutional neural networks combined with channel-wise attention to extract global and local features, respectively. During training, the images are standardized, and the data are augmented by operations such as flipping and cropping. At the feature fusion layer, the features extracted by the three branch networks are concatenated, and the network finally outputs a depression severity score. Result On the AVEC2013 (The Continuous Audio/Visual Emotion and Depression Recognition Challenge) depression database, the mean absolute error is 6.74 and the root mean square error is 8.70, which are 4.14 and 4.91 lower than the baseline, respectively. On the AVEC2014 depression database, the mean absolute error and root mean square error are 6.56 and 8.56, both 2.30 lower than the baseline. Moreover, compared with other depression recognition methods, the proposed method achieves the lowest mean absolute error and root mean square error. Conclusion The proposed method realizes automatic depression recognition in an end-to-end manner, performing and optimizing feature extraction and depression severity recognition within a unified framework. The learned visual features are more discriminative, and the experimental results demonstrate the effectiveness and feasibility of the algorithm.
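The mean absolute error and root mean square error used to report the results above can be computed as follows. This is a minimal sketch with made-up toy scores, not data from the paper:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and predicted depression scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between ground-truth and predicted depression scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy example scores (hypothetical, for illustration only)
y_true = [10, 20, 30, 15]
y_pred = [12, 18, 33, 15]
print(mae(y_true, y_pred))              # 1.75
print(round(rmse(y_true, y_pred), 2))   # 2.06
```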
Keywords
Multi-branch convolutional network with channel-wise attention for depression recognition

Sun Haohao1, Shao Zhuhong1, Shang Yuanyuan1, Sun Xiaoni1, Hu Qiang2,3, Kong Youyong4(1.College of Information Engineering, Capital Normal University, Beijing 100048, China;2.Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai 200030, China;3.School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China;4.School of Computer Science and Engineering, Southeast University, Nanjing 210096, China)

Abstract
Objective Depressive disorder impairs individuals' ability to function in daily life. According to the Diagnostic and Statistical Manual of Mental Disorders (fifth edition), the typical symptoms of depression are low mood, loss of interest, and lack of energy; the accompanying somatic symptoms include inattention, insomnia, slowed reaction, reduced activity, and fatigue. At present, the assessment of depression severity mainly depends on interviews between a qualified psychiatrist and the patient, or with family members or caregivers, combined with the psychiatrist's prior experience. However, depression is a heterogeneous disease with diverse causes and manifestations, so such diagnosis is relatively subjective and lacks standardized measurement. Recent research has revealed that the symptoms of depressed patients are often expressed through a variety of visual signals, including facial expressions and body postures; facial expressions in particular are significant indicators of depression. Machine learning techniques can capture these subtle expression changes and can further be applied to the automatic assessment of depression severity, so that early symptoms can be identified in time and even an independent diagnostic evaluation can be carried out. To evaluate depression severity more objectively, this work investigates deep feature extraction from facial images and its application to automatic depression recognition. Method To obtain effective global information from face images and make full use of emotion-rich local areas such as the eyes and mouth, we develop a multi-branch convolutional neural network integrated with a channel-wise attention mechanism to extract multiple visual features for automatic depression recognition.
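A channel-wise attention module of the kind described above can be sketched as a squeeze-and-excitation style block: global average pooling over the spatial dimensions, a bottleneck MLP, and a sigmoid that yields one weight per channel. The dimensions and weight matrices below are illustrative assumptions; the paper's exact attention module may differ:

```python
import numpy as np

def channel_attention(feature_map, w1, w2):
    """Squeeze-and-excitation style channel attention (illustrative sketch).
    feature_map: (C, H, W); w1: (C//r, C); w2: (C, C//r), r = reduction ratio."""
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid -> per-channel weights in (0, 1)
    s = np.maximum(w1 @ z, 0.0)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Reweight: scale each channel of the feature map by its attention weight
    return feature_map * weights[:, None, None], weights

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))    # 8 channels, 4x4 spatial map (toy sizes)
w1 = rng.standard_normal((2, 8)) * 0.1   # reduction ratio r = 4
w2 = rng.standard_normal((8, 2)) * 0.1
out, weights = channel_attention(fmap, w1, w2)
assert out.shape == (8, 4, 4) and weights.shape == (8,)
```

In a trained network, w1 and w2 are learned, so informative channels receive weights near 1 and uninformative ones are suppressed.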
First, the original video is sampled every 10 frames to reduce redundancy, and a multi-task cascaded convolutional neural network is used to detect faces and locate facial landmarks. According to the coordinates of the detected landmarks, the whole-face, eye, and mouth regions are cropped. Then, 1) to obtain global features, the whole-face images are fed into a deep convolutional neural network combined with the channel-wise attention mechanism, and 2) to obtain local features, the eye and mouth images are fed into two further deep convolutional neural networks. During training, the images are preprocessed and normalized, and the data are augmented by flipping and cropping. At the feature fusion layer, the features extracted by the three branch networks are concatenated. Finally, a fully connected layer outputs the depression score. To demonstrate the feasibility and reliability of the proposed method, a series of experiments are performed on the widely used The Continuous Audio/Visual Emotion and Depression Recognition Challenge 2013 (AVEC2013) and AVEC2014 datasets. Result 1) On the AVEC2013 depression database, the mean absolute error (MAE) is 6.74 and the root mean square error (RMSE) is 8.70, which are 4.14 and 4.91 lower than the baseline, respectively. 2) On the AVEC2014 depression database, the MAE and RMSE are 6.56 and 8.56, both 2.30 lower than the baseline. Moreover, the proposed method achieves the lowest MAE and RMSE on both databases compared with other depression recognition methods. Experimental results show that the channel-wise attention mechanism not only speeds up network convergence but also reduces the MAE and RMSE. Compared with using global features only, integrating eye and mouth features significantly decreases the recognition error.
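The feature fusion and regression step above amounts to concatenating the three branch feature vectors and passing them through a fully connected layer. The per-branch feature dimensions and random weights below are assumptions for illustration; the abstract does not specify them:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-branch feature dimensions (not stated in the abstract)
f_face  = rng.standard_normal(128)   # global branch: whole-face features
f_eyes  = rng.standard_normal(64)    # local branch: eye-region features
f_mouth = rng.standard_normal(64)    # local branch: mouth-region features

# Feature fusion layer: concatenate the three branch outputs into one vector
fused = np.concatenate([f_face, f_eyes, f_mouth])   # shape (256,)

# Fully connected regression head mapping fused features to a scalar score
w = rng.standard_normal(fused.shape[0]) * 0.01      # learned in practice
score = float(fused @ w)
print(fused.shape, score)
```

In the actual model, the regression head is trained jointly with the three branches, so the fused representation is optimized end to end for score prediction.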
This suggests that local features of the eyes and mouth are an effective supplement to global features in depression recognition, with the eyes conveying especially salient information, which conforms to clinicians' experience in judging depression severity. Conclusion We develop a multi-feature automatic depression recognition method using multi-branch deep convolutional neural networks, implemented in an end-to-end manner. Feature extraction and depression recognition are carried out and optimized within a unified framework. To model more discriminative features, a channel-wise attention mechanism is added to adaptively assign weights to feature channels.
Keywords
