Aspect-level multimodal co-attention graph convolutional sentiment analysis model

Wang Shunjie1,2, Cai Guoyong1,2, Lyu Guangrui3, Tang Weibo1,2 (1. College of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; 2. Key Laboratory of Guangxi Trusted Software, Guilin 541004, China; 3. College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China)

Abstract
Objective Aspect-level multimodal sentiment analysis, which aims to predict the sentiment polarity of a specific aspect mentioned in multimodal data, has been receiving growing attention. However, most existing methods give insufficient consideration to the guiding role of aspect words in context modeling and in fine-grained cross-modal alignment, which limits the performance of aspect-level multimodal sentiment analysis. To address these problems, an aspect-level multimodal co-attention graph convolutional sentiment analysis model (AMCGC) is proposed to jointly model the aspect-oriented intra-modal contextual semantic associations and the fine-grained cross-modal alignment, thereby improving sentiment analysis performance. Method To capture aspect-oriented local semantic correlations within each modality, AMCGC uses a self-attention mechanism with orthogonal constraints to generate a semantic graph for each modality. Graph convolution then yields a textual semantic graph representation containing the aspect words and a visual semantic graph representation incorporating the aspect words, and two gated local cross-modal interaction mechanisms operating in opposite directions progressively align the textual and visual semantic graph representations at a fine granularity, reducing the heterogeneity gap between modalities. Finally, an aspect mask is designed to select the aspect node features from each modality's graph representation as the sentiment representation, and a cross-modal loss is introduced to reduce the discrepancy between the heterogeneous aspect features. Result Compared with nine methods on two multimodal datasets, the proposed model improves accuracy by 1.76% over the second-best model on the Twitter-2015 dataset and by 1.19% on the Twitter-2017 dataset. Ablation experiments evaluate the orthogonal constraint, the cross-modal loss, and the cross-collaborative multimodal fusion separately, verifying the rationality of each part of the AMCGC model. Conclusion The proposed AMCGC model better captures the local semantic correlations within modalities and the fine-grained alignment between modalities, improving the accuracy of aspect-level multimodal sentiment analysis.
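As an illustration of the mechanism described above (not the authors' released code), the following PyTorch sketch shows how a self-attention score matrix can serve as a soft semantic graph for one modality, with an orthogonality penalty encouraging different positions to attend to distinct local semantic neighbourhoods; the module and variable names are assumptions.

```python
import torch
import torch.nn as nn


class OrthogonalSelfAttention(nn.Module):
    """Self-attention whose score matrix is used as a soft semantic graph,
    regularized by an orthogonality penalty on its rows (illustrative only)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, num_nodes, hidden_dim) -- e.g., BiLSTM outputs of one modality
        q, k = self.query(h), self.key(h)
        scores = q @ k.transpose(1, 2) / h.size(-1) ** 0.5
        adj = torch.softmax(scores, dim=-1)          # soft adjacency / semantic graph
        # Orthogonality penalty: push rows of the attention matrix toward
        # non-overlapping local neighbourhoods, sharpening local semantic groups.
        eye = torch.eye(adj.size(1), device=adj.device).unsqueeze(0)
        ortho_loss = ((adj @ adj.transpose(1, 2) - eye) ** 2).mean()
        return adj, ortho_loss
```

In such a setup, the returned `ortho_loss` would be added to the training objective with a small weight alongside the classification loss.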
Keywords
Aspect-level multimodal co-attention graph convolutional sentiment analysis model

Wang Shunjie1,2, Cai Guoyong1,2, Lyu Guangrui3, Tang Weibo1,2 (1. College of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; 2. Key Laboratory of Guangxi Trusted Software, Guilin 541004, China; 3. College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China)

Abstract
Objective The main task of aspect-level multimodal sentiment analysis is to determine the sentiment polarity of a given target (i.e., aspect or entity) in a sentence by combining the relevant modal data sources; it is therefore a fine-grained, target-oriented sentiment analysis task. Traditional sentiment analysis focuses mainly on textual content. However, with the growing volume of audio, image, video, and other media data, analyzing sentiment from text alone is no longer sufficient. Multimodal sentiment analysis surpasses traditional text-only sentiment analysis in understanding human behavior and hence offers greater practical significance and application value. Aspect-level multimodal sentiment analysis (AMSA) has attracted increasing attention for revealing the fine-grained emotions of social media users. Unlike coarse-grained multimodal sentiment analysis, AMSA not only considers the potential correlation between modalities but also focuses on directing the aspects toward their respective modalities. However, current AMSA methods do not sufficiently consider the directional effect of aspect words in the context modeling of different modalities or in the fine-grained alignment between modalities. Moreover, the fusion of image and text representations is mostly coarse-grained, which leaves the collaborative associations between modalities insufficiently mined and limits the performance of aspect-level multimodal sentiment analysis. To solve these problems, the aspect-level multimodal co-attention graph convolutional sentiment analysis model (AMCGC) is proposed to simultaneously model the aspect-oriented intra-modal contextual semantic associations and the fine-grained alignment across modalities, thereby improving sentiment analysis performance.

Method AMCGC is an end-to-end aspect-level sentiment analysis method that involves four stages: input embedding, feature extraction, pairwise graph convolution with cross-modality alternating co-attention, and aspect mask setting. First, after the image and text embedding representations are obtained, a contextual sequence of text features containing the aspect words and a contextual sequence of visual local features incorporating the aspect words are constructed. To explicitly model the directional semantics of the aspect words, position encoding relative to the aspect words is added to both context sequences. The context sequences of the two modalities are then fed into bidirectional long short-term memory networks to capture the context dependencies within each modality. To obtain aspect-oriented local semantic correlations within each modality, a self-attention mechanism with orthogonal constraints is designed to generate a semantic graph for each modality. A textual semantic graph representation containing the aspect words and a visual semantic graph representation incorporating the aspect words are then obtained through a graph convolutional network, which accurately captures the local semantic correlations within each modality. The orthogonal constraint models the local sentiment-semantic relationships among data units inside a modality as explicitly as possible and enhances the discriminability of local features within the modality. A gated local cross-modality interaction mechanism is further designed to embed the text semantic graph representation into the visual semantic graph representation.
The graph convolutional network is then applied again to learn the local dependencies of the two modalities' graph representations, and the visual semantic graph representation with the embedded text is in turn embedded back into the text semantic graph representation, achieving fine-grained cross-modality association alignment and thereby reducing the heterogeneity gap between modalities. Finally, an aspect mask is designed to select the aspect node features from each modality's semantic graph representation as the final sentiment representation, and a cross-modal loss is introduced to reduce the differences between the cross-modal aspect features.

Result The performance of the proposed model is compared with that of nine recent methods on two public multimodal datasets. The accuracy (ACC) of the proposed model is improved by 1.76% and 1.19% on the Twitter-2015 and Twitter-2017 datasets, respectively, compared with the second-best models. The experimental results confirm the advantage of using graph convolutional networks to model, from a local perspective, the semantic relation interaction and alignment within modalities, and highlight the superiority of performing multimodal interaction in a cross-collaborative manner. The model is then subjected to an ablation study covering the orthogonal constraints, the cross-modal loss, the cross-collaborative multimodal fusion, and feature redundancy, with experiments conducted on the Twitter-2015 and Twitter-2017 datasets. The results show that every ablated variant performs worse than the full AMCGC model, validating the rationality of each part of the model. The orthogonal constraint has the greatest effect: removing it reduces the ACC of the proposed model by 1.83% and 3.81% on the Twitter-2015 and Twitter-2017 datasets, respectively. In addition, the AMCGC+BERT model, which is based on bidirectional encoder representations from Transformers (BERT) pre-training, outperforms the GloVe-based AMCGC model, with ACC increased by 1.93% and 2.19% on the Twitter-2015 and Twitter-2017 datasets, respectively, suggesting that large-scale pre-trained models have an advantage in obtaining word representations. The hyperparameters of the model, such as the number of image regions and the weight of the orthogonal constraint term, are set through extensive experiments. Visualization experiments further show that the AMCGC model can capture the local semantic correlations within modalities.

Conclusion The proposed AMCGC model can efficiently capture the local semantic correlations within modalities under the orthogonality constraint, effectively achieve fine-grained alignment between modalities, and improve the accuracy of aspect-level multimodal sentiment analysis.
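The following PyTorch sketch illustrates, under simplifying assumptions (equal node counts in both modalities, hypothetical module names), one graph-convolution step over such a semantic graph and a gated interaction that embeds one modality's graph representation into the other; it is a sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One graph-convolution step over a modality's soft adjacency matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_nodes, dim), adj: (batch, num_nodes, num_nodes)
        return torch.relu(self.proj(adj @ h))        # aggregate neighbours, then transform


class GatedCrossModalInteraction(nn.Module):
    """Gate that injects one modality's graph representation into the other.
    For simplicity both inputs are assumed to have the same number of nodes."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([target, source], dim=-1)))
        return target + g * source                   # gated injection of `source`
```

In the cross-collaborative scheme described above, the text graph representation would first be injected into the visual one, the result passed through another graph convolution, and the updated visual representation then injected back into the text graph representation.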
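Likewise, the aspect-mask selection and the cross-modal loss can be sketched as follows; the mean-squared-error form of the cross-modal term and the loss weights are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F


def aspect_pool(h: torch.Tensor, aspect_mask: torch.Tensor) -> torch.Tensor:
    """Average the features of the aspect nodes selected by a 0/1 mask.
    h: (batch, num_nodes, dim); aspect_mask: (batch, num_nodes)."""
    m = aspect_mask.unsqueeze(-1).float()
    return (h * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)


def total_loss(logits, labels, text_aspect, visual_aspect, ortho_loss,
               lambda_cross: float = 0.1, lambda_ortho: float = 0.1):
    """Classification loss plus a cross-modal term pulling the heterogeneous
    aspect features together, plus the orthogonality penalty (weights assumed)."""
    cls_loss = F.cross_entropy(logits, labels)
    cross_loss = F.mse_loss(text_aspect, visual_aspect)
    return cls_loss + lambda_cross * cross_loss + lambda_ortho * ortho_loss
```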
Keywords
