Acta Armamentarii ›› 2022, Vol. 43 ›› Issue (5): 1107-1116. doi: 10.12382/bgxb.2021.0262

• Paper

Dual Encoding Integrating Key Frame Extraction for Video-text Cross-modal Entity Resolution

ZENG Zhixian1,2, CAO Jianjun1,2, WENG Nianfeng1,2, JIANG Guoquan1,2, FAN Qiang1,2   

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410003, Hunan, China; 2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China
  • Online: 2022-03-17

Abstract: Existing video-text cross-modal entity resolution methods all extract frames uniformly during video processing, which inevitably loses video information and increases model complexity. A dual encoding method integrating key frame extraction (DEIKFE) is proposed for video-text cross-modal entity resolution. With the aim of fully retaining the video information, a key frame extraction algorithm is designed to extract the key frames of a video, which make up the video key frame set. For the video key frame set and the text, a multi-level encoding method is adopted to extract global, local, and time-series features, which are concatenated to form a multi-level encoding representation. The encoding representations are then mapped into a common embedding space, and the model parameters are optimized with a cross-modal triplet ranking loss based on hard negative samples, so that the similarity of matched video-text pairs increases and that of unmatched pairs decreases. Experiments on the MSR-VTT and VATEX datasets show that the overall performance measure R@sum increases by 9.22% and 2.86%, respectively, compared with existing methods, which demonstrates the superiority of the proposed method.
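The abstract names a cross-modal triplet ranking loss based on hard negative samples but does not state its formula. The sketch below illustrates one commonly used hardest-negative formulation, not necessarily the exact loss in the paper. It assumes a square similarity matrix in which `sim[i][j]` is the similarity between video `i` and text `j`, with matched pairs on the diagonal. The function name, the margin value of 0.2, and the averaging over the batch are all assumptions made for illustration.

```python
from typing import List


def hard_negative_triplet_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Hardest-negative triplet ranking loss over a batch (illustrative sketch).

    sim[i][j] is the similarity of video i and text j; the diagonal holds
    the matched pairs. The margin default is an assumed value.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs to form a negative")

    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Hardest negative text for video i: the most similar unmatched text.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative video for text i: the most similar unmatched video.
        hardest_video = max(sim[j][i] for j in range(n) if j != i)
        # Hinge terms are zero once the positive beats the negative by the margin.
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_video - positive)
    return total / n  # averaging over the batch is an assumption
```

Minimizing a loss of this shape pushes each matched similarity above its hardest unmatched similarity by at least the margin, which corresponds to the stated goal of making matched pairs more similar and unmatched pairs less similar.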

Key words: cross-modal entity resolution, key frame extraction, common embedding space, dual encoding, hard negative sample
