
Acta Armamentarii ›› 2020, Vol. 41 ›› Issue (2): 324-331. doi: 10.3969/j.issn.1000-1093.2020.02.014

• Paper

Detection of Similar Duplicate Records Based on OCSVM and Multi-objective Ant Colony Optimization

L Guojun1, CAO Jianjun2, ZHENG Qibin1, CHANG Chen1, WENG Nianfeng2, PENG Cong2   

  1. Institute of Command and Control Engineering, Army Engineering University, Nanjing 210007, Jiangsu, China; 2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China
  • Received: 2019-04-24 Revised: 2019-04-24 Online: 2020-04-04

Abstract: A classification method based on the one-class support vector machine (OCSVM) and multi-objective ant colony optimization is proposed to address the scarcity of similar duplicate record samples. According to whether two records are similar, the detection of similar duplicate records is modeled as a two-class classification problem; classification is performed by OCSVM, and the classifier is trained using only dissimilar record sample pairs. Appropriate attribute similarity functions are selected to compute the similarity feature vector of two records, which serves as the input to the OCSVM. A multi-objective feature selection model is established that jointly optimizes recall, precision, and the size of the feature set. In view of the single-class training samples, a multi-objective ant colony algorithm is designed to solve the model, in which the heuristic factor is defined by the constraint of minimizing intra-class divergence. The proposed method is validated by comparing OCSVM with other algorithms, such as the support vector domain description algorithm and the traditional two-class support vector machine.
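
The two building blocks named in the abstract, the similarity feature vector of a record pair and the three-objective comparison of candidate feature subsets, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The attribute names, the choice of difflib's ratio for strings, the relative-gap measure for numbers, and the helper names (similarity_vector, dominates, pareto_front) are all assumptions made for illustration; the abstract does not say which similarity functions were used. The OCSVM training step and the ant colony search are omitted.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib's ratio (an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(a: float, b: float) -> float:
    """1.0 for equal values, shrinking toward 0.0 as the relative gap grows."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / largest)


# Hypothetical schema: one similarity function per attribute.
SIMILARITY_FUNCTIONS = {
    "name": string_similarity,
    "age": numeric_similarity,
}


def similarity_vector(rec1: dict, rec2: dict, functions: dict) -> list:
    """One similarity score per attribute; this vector would be the classifier input."""
    return [fn(rec1[attr], rec2[attr]) for attr, fn in functions.items()]


def dominates(p: tuple, q: tuple) -> bool:
    """Pareto dominance for (recall, precision, n_features) triples.

    Recall and precision are maximized; the feature count is minimized.
    """
    no_worse = p[0] >= q[0] and p[1] >= q[1] and p[2] <= q[2]
    strictly_better = p[0] > q[0] or p[1] > q[1] or p[2] < q[2]
    return no_worse and strictly_better


def pareto_front(candidates: list) -> list:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


if __name__ == "__main__":
    # Made-up records, used only to show the shape of the feature vector.
    r1 = {"name": "John Smith", "age": 30}
    r2 = {"name": "Jon Smith", "age": 31}
    print(similarity_vector(r1, r2, SIMILARITY_FUNCTIONS))

    # Made-up (recall, precision, n_features) scores for three feature subsets.
    scores = [(0.90, 0.90, 3), (0.80, 0.90, 3), (0.95, 0.70, 5)]
    print(pareto_front(scores))
```

In a full pipeline, vectors of this kind computed from dissimilar pairs would be passed to a one-class classifier, and a multi-objective search would propose feature subsets whose scores are compared with a dominance test like the one above.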

Key words: data cleaning, similar duplicate record detection, multi-objective ant colony algorithm, feature selection, one-class support vector machine, support vector domain description
