摘要:近年来,多标签学习在图像识别和文本分类等多个领域得到了广泛关注,具有越来越重要的潜在应用价值.尽管多标签学习的发展日新月异,但仍然存在两个主要挑战,即如何利用标签间的相关性以及如何处理大规模的多标签数据.针对上述问题,基于MLHN算法,提出一种能有效利用标签相关性且能处理大数据集的基于Spark的多标签超网络集成算法SEI-MLHN.该算法首先引入代价敏感,使其适应不平衡数据集.其次,改良了超网络演化学习过程,并优化了损失函数,降低了算法时间复杂度.最后,进行了选择性集成,使其适应大规模数据集.在11个不同规模的数据集上进行实验,结果表明,该算法具有较好的分类性能,较低的时间复杂度且具备良好的处理大规模数据集的能力.%Multi-label learning has attracted a great deal of attention in recent years and has a wide range of potential real-world applications, including image identification and text categorization. Although great effort has been expended in the development of multi-label learning, two main challenges remain, i. e., how to utilize the correlation between labels and how to tackle large-scale multi-label data. To solve these challenges, based on the multi-label hypernetwork ( MLHN) algorithm, in this paper, we propose a Spark-based multi-label hypernetwork ensemble algorithm ( SEI-MLHN) that effectively utilizes label correlation and can deal with large-scale multi-label datasets. First, the algorithm introduces cost sensitivity to enable it to adapt to unbalanced datasets. Secondly, it improves the hypernetwork evolution learning process, optimizes the loss function, and reduces the inherent time complexity. Lastly, it uses selective ensemble learning to enable it to adapt to large-scale datasets. We conducted experiments on 11 datasets wit different scales. The results show that the proposed algorithm demonstrates excellent categorization performance, low time complexity, and the capability to handle large-scale datasets.