
Acta Armamentarii ›› 2020, Vol. 41 ›› Issue (2): 324-331. doi: 10.3969/j.issn.1000-1093.2020.02.014

• Paper •

Detection of Similar Duplicate Records Based on OCSVM and Multi-objective Ant Colony Optimization

LÜ Guojun1, CAO Jianjun2, ZHENG Qibin1, CHANG Chen1, WENG Nianfeng2, PENG Cong2

  1. Institute of Command and Control Engineering, Army Engineering University, Nanjing 210007, Jiangsu, China
  2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China
  • Received: 2019-04-24 Revised: 2019-04-24 Online: 2020-04-04
  • About the author: LÜ Guojun (born 1995), male, master's student. E-mail: lv_68886@163.com
  • Funding:
    General Program of the National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation (2015M582832)

Abstract: To address the scarcity of similar duplicate record samples in data sources, a classification-based detection method is proposed that combines a one-class support vector machine (OCSVM) with multi-objective ant colony optimization. Depending on whether the two records in a pair are similar, similar duplicate record detection is modeled as a two-class classification problem; the classification is performed by an OCSVM trained only on record pairs that are not similar duplicates. Appropriate attribute similarity functions are selected to compute a similarity feature vector for each record pair, which serves as the input to the OCSVM classifier. A multi-objective feature selection model is established that jointly optimizes precision, recall, and the number of features. Because the training set contains samples of a single class, the heuristic factor is defined as a constraint that minimizes intra-class scatter, and a multi-objective ant colony algorithm is designed to solve the model. The OCSVM is compared with the support vector domain description algorithm and the traditional two-class support vector machine, and the results verify its effectiveness and superiority.

Key words: data cleaning, similar duplicate record detection, multi-objective ant colony algorithm, feature selection, one-class support vector machine, support vector domain description
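The abstract describes turning each record pair into a similarity feature vector, with one similarity function per attribute. The following is a minimal sketch of that idea only, not the authors' implementation: the attribute names (`name`, `city`, `year`), the choice of `difflib.SequenceMatcher` for strings, the relative-difference measure for numbers, and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] for two strings; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """One minus the relative difference, clamped to [0, 1]."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / largest)


# Hypothetical schema: attribute name -> similarity function chosen for it.
ATTRIBUTE_SIMILARITY = {
    "name": string_similarity,
    "city": string_similarity,
    "year": numeric_similarity,
}


def similarity_vector(record_a: dict, record_b: dict) -> list:
    """Build one feature vector per record pair, one component per attribute."""
    return [
        fn(record_a[attr], record_b[attr])
        for attr, fn in ATTRIBUTE_SIMILARITY.items()
    ]


# Made-up records, used only to exercise the functions above.
rec1 = {"name": "Zhang Wei", "city": "Nanjing", "year": 2019}
rec2 = {"name": "Zhang  Wei", "city": "nanjing", "year": 2019}
rec3 = {"name": "Li Na", "city": "Beijing", "year": 1987}

likely_duplicate = similarity_vector(rec1, rec2)
likely_distinct = similarity_vector(rec1, rec3)
print(likely_duplicate)
print(likely_distinct)
```

In the setting the abstract describes, vectors like `likely_distinct` would form the single training class for the one-class classifier, and a feature selection step would decide which components to keep.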

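The feature selection model in the abstract has three objectives: precision and recall are to be maximized, while the number of selected features is to be minimized. One common way to compare candidate feature subsets under several objectives is Pareto dominance; whether the paper uses exactly this comparison is an assumption here. The sketch below uses a hypothetical `Candidate` type and placeholder scores that are not results from the paper.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A feature subset with the scores a classifier obtained using it."""
    features: frozenset
    precision: float
    recall: float


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = (
        a.precision >= b.precision
        and a.recall >= b.recall
        and len(a.features) <= len(b.features)
    )
    strictly_better = (
        a.precision > b.precision
        or a.recall > b.recall
        or len(a.features) < len(b.features)
    )
    return no_worse and strictly_better


def pareto_front(candidates: list) -> list:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Placeholder values for illustration only; feature indices are arbitrary.
candidates = [
    Candidate(frozenset({0, 1, 2}), precision=0.90, recall=0.80),
    Candidate(frozenset({0, 1}), precision=0.90, recall=0.80),
    Candidate(frozenset({0}), precision=0.70, recall=0.95),
    Candidate(frozenset({0, 1, 3}), precision=0.85, recall=0.75),
]

front = pareto_front(candidates)
for c in front:
    print(sorted(c.features), c.precision, c.recall)
```

In an ant colony search, each ant would build one such feature subset, and a non-dominated set like `front` could be used to decide which paths receive pheromone; that coupling is left out of this sketch.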