Consensus graph learning-driven self-supervised ensemble clustering
Geng Weifeng1,2, Wang Xiang1,2, Jing Liping1,2, Yu Jian1,2 (1. Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China; 2. )
Abstract
Objective  With the development of massive data acquisition techniques in real-world application scenarios and the ever-increasing cost of data annotation, self-supervised learning has become an important strategy for massive data analysis. However, how to extract useful supervision information from massive data, and how to learn effectively under that supervision, remain research difficulties that constrain progress in this direction. To this end, a self-supervised ensemble clustering framework based on consensus graph learning is proposed. Method  The framework consists of three functional modules. First, a consensus graph is constructed from the multiple base learners of an ensemble. Second, a graph neural network analyzes the consensus graph to capture optimized node representations and the clustering structure of the nodes, and a subset of high-confidence nodes together with the corresponding cluster labels is selected from the clusters to generate supervision information. Third, under this label supervision, the ensemble member base learners are updated jointly with the remaining unlabeled samples. These functional modules are iterated alternately, ultimately improving unsupervised clustering performance. Result  To validate the effectiveness of the framework, a series of experiments were designed on standard datasets (including image and text data). The experimental results show that the proposed method consistently outperforms existing clustering methods. In particular, on MNIST-Test (modified national institute of standards and technology database), the method achieves an accuracy of 97.78%, 3.85% higher than the best existing method. Conclusion  The method exploits graph representation learning to enhance the capture of supervision information in self-supervised learning; the effective acquisition of supervision information in turn strengthens the construction of ensemble members, ultimately improving the mining of the intrinsic structure of massive unlabeled data.
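The abstract does not spell out how the consensus graph is built from the base learners; the sketch below assumes a standard co-association construction, in which each edge weight counts how often two samples are placed in the same cluster by the ensemble members (the paper's actual construction may differ, and the k-means base learners here are illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans

def co_association_graph(X, n_clusters, n_members=10, seed=0):
    """Build a consensus (co-association) graph from several base clusterings.

    A[i, j] is the fraction of ensemble members that assign samples i and j
    to the same cluster; member diversity comes from random k-means
    restarts, one common choice among many.
    """
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    A = np.zeros((n, n))
    for _ in range(n_members):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=rng.randint(1 << 31)).fit_predict(X)
        A += (labels[:, None] == labels[None, :]).astype(float)
    return A / n_members
```

The resulting weighted adjacency matrix can then be handed, together with the node features, to a graph neural network, matching the second module described above.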
Consensus graph learning-based self-supervised ensemble clustering
Geng Weifeng1,2, Wang Xiang1,2, Jing Liping1,2, Yu Jian1,2 (1. Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China; 2. )
Abstract
Objective  Clustering is a core machine learning technique for data partitioning and is applied in domains such as image segmentation and anomaly detection. In addition, to simplify complex tasks and improve their performance, clustering is used in data preprocessing, for example to split data into sub-blocks, to generate pseudo-labels, and to remove abnormal points. Self-supervised learning has become an essential technique for massive data analysis. However, it remains challenging to extract effective supervision information and to analyze the input data under that supervision. Method  A consensus graph learning-based self-supervised ensemble clustering (CGL-SEC) framework is developed. It consists of three main modules: 1) constructing the consensus graph from several ensemble components (i.e., the base clustering methods); 2) extracting supervision information by learning a representation of the consensus graph and its node clustering results, where a subset of high-confidence nodes is selected as labeled samples; and 3) re-training the base clustering methods on the labeled samples together with the remaining unlabeled samples, thereby optimizing the ensemble components and the corresponding consensus graph. The final clustering result is optimized iteratively until the learning process converges. Result  A series of experiments are carried out on benchmarks, including both image and textual datasets. In particular, CGL-SEC outperforms the best baseline by 3.85% in clustering accuracy on the modified national institute of standards and technology database (MNIST-Test). Among related methods, deep embedding clustering (DEC) optimizes the data representation and the cluster assignment at the same time by taking the data itself as supervision information: an auto-encoder is pre-trained with a reconstruction loss, the soft cluster assignment of the embedded features is computed, and the KL (Kullback-Leibler) divergence between the soft cluster assignment and an auxiliary target distribution is minimized. To further improve performance, the deep clustering network (DCN) replaces the soft assignment with hard clustering, and improved deep embedding clustering (IDEC) adds local structure constraints. In contrast to using the data itself as supervision, the pseudo-label strategy is a self-supervised learning method that uses the prediction results of the neural network as labels to simulate supervision information. DeepCluster uses K-means clustering to generate pseudo-labels that guide the training of a convolutional network; however, the generated pseudo-labels have low confidence and are prone to trivial solutions in the initial stage of network training. Deep embedding clustering with data augmentation (DEC-DA) and MixMatch use the predictions on augmented samples as supervision information for the original data, which improves the accuracy of the supervision to a certain extent, but this strategy is difficult to extend to text and other domains. Deep adaptive clustering iteratively trains the network on a selected high-confidence subset of its pseudo-labeled predictions, but the data distribution information carried by the low-confidence samples is ignored. Pseudo-semi-supervised clustering uses voting to select a subset of high-confidence pseudo-labels and trains a semi-supervised neural network on all samples.
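For concreteness, the DEC objective mentioned above can be written out. This is the standard formulation from the original DEC method (a Student's t kernel with one degree of freedom), supplied here for reference rather than taken from this paper:

$$q_{ij} = \frac{(1 + \lVert z_i - \mu_j \rVert^2)^{-1}}{\sum_{j'} (1 + \lVert z_i - \mu_{j'} \rVert^2)^{-1}}, \qquad p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \quad f_j = \sum_i q_{ij}, \qquad L = \mathrm{KL}(P \Vert Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $z_i$ is the embedding of sample $i$ and $\mu_j$ is the $j$-th cluster center; sharpening the soft assignment $q$ into the target distribution $p$ is what lets the data act as its own supervision.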
Although the ensemble strategy can improve the confidence of the pseudo-labels, the voting strategy considers only the category assignments and ignores the feature representation of the samples themselves, which can reduce clustering performance in some cases. Ensemble learning is a representative machine learning paradigm that embodies "group intelligence": it improves overall prediction performance by training multiple base learners and coordinating their prediction results. In pseudo-label-based clustering tasks, it can coordinate multiple base learners to obtain high-confidence pseudo-labels. However, how to acquire effective supervision information is still an open problem: current pseudo-label-based ensemble clustering methods consider only the category information of each sample when capturing labels, and ignore other effective information such as the feature representation of the sample itself and the clustering structure among samples. Conclusion  A graph neural network can exploit the content information of nodes and the structural information between nodes at the same time. A self-supervised ensemble clustering method based on consensus graph representation learning is therefore designed to make full use of both the sample features and the relationships between samples in ensemble learning. To obtain higher-confidence pseudo-labels as supervision information and to improve the performance of self-supervised clustering, global and local information must be mined at the same time. We learn an ensemble representation of the data through a graph neural network, improve the confidence of the pseudo-labels, and train the entire model iteratively in a self-supervised manner. In summary: 1) a general ensemble clustering framework based on consensus graph learning is developed, which exploits multi-level information such as the sample features and the category structure of the clustering ensemble; 2) a self-supervision method is proposed, which uses a graph neural network to mine the global and local information of the consensus graph and obtains high-confidence pseudo-labels as supervision information; and 3) experiments demonstrate the potential of the consensus graph learning-based ensemble clustering method on image and text datasets.
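The abstract does not specify the rule for choosing the high-confidence node subset; a minimal sketch assuming a simple threshold on the peak probability of the GNN's soft cluster assignments (the 0.9 threshold and the function name are illustrative choices, not taken from the paper):

```python
import numpy as np

def select_high_confidence(soft_assign, threshold=0.9):
    """Pick nodes whose soft cluster assignment is confident enough.

    soft_assign: (n_samples, n_clusters) row-stochastic matrix, e.g. the
    output of a GNN clustering head applied to the consensus graph.
    Returns indices of the confident nodes and their pseudo-labels.
    """
    confidence = soft_assign.max(axis=1)            # peak probability per node
    idx = np.flatnonzero(confidence >= threshold)   # confident subset
    return idx, soft_assign[idx].argmax(axis=1)     # node indices + hard labels

# Example: only the first node clears the threshold.
soft = np.array([[0.95, 0.03, 0.02],
                 [0.40, 0.35, 0.25]])
idx, labels = select_high_confidence(soft)          # idx=[0], labels=[0]
```

The selected (index, label) pairs play the role of the labeled set when the base learners are re-trained together with the remaining unlabeled samples.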
Keywords
ensemble clustering; self-supervised clustering; graph representation learning; consensus graph; pseudo-label confidence