
Acta Armamentarii ›› 2022, Vol. 43 ›› Issue (5): 1107-1116. doi: 10.12382/bgxb.2021.0262

• Paper •

Dual Encoding Integrating Key Frame Extraction for Video-text Cross-modal Entity Resolution

ZENG Zhixian1,2, CAO Jianjun1,2, WENG Nianfeng1,2, JIANG Guoquan1,2, FAN Qiang1,2

  (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410003, Hunan, China; 2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China)
  • Online: 2022-03-17
  • Corresponding author: CAO Jianjun (born 1975), male, professor, doctoral supervisor. E-mail: caojj@nudt.edu.cn
  • About the first author: ZENG Zhixian (born 1996), male, master's student. E-mail: zeng_zhixian@yeah.net
  • Supported by:
    National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation Special Funded Project (2015M582832); National Major Science and Technology Project of China (2015ZX01040-201)



Abstract: Existing video-text cross-modal entity resolution methods all sample video frames uniformly, which inevitably discards video information and increases the difficulty of the problem. To address this issue, a dual encoding method integrating key frame extraction (DEIKFE) is proposed for video-text cross-modal entity resolution. To retain the information in a video as fully as possible, a key frame extraction algorithm is designed to extract the key frames, which form a key frame set representing the video. For both the key frame set and the text, a multi-level encoding scheme extracts global, local, and temporal features and concatenates them into a multi-level encoding representation. This representation is mapped into a common embedding space, and the model parameters are optimized with a cross-modal triplet ranking loss based on hard negative samples, so that matched video-text pairs receive higher similarity scores and unmatched pairs receive lower ones. Experiments on the MSR-VTT and VATEX datasets show that the proposed method improves the overall performance R@sum by 9.22% and 2.86%, respectively, compared with existing methods, demonstrating its superiority.
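
To make the training objective concrete, the following is a minimal sketch (not the authors' code) of a cross-modal triplet ranking loss with in-batch hardest negatives, the kind of objective the abstract describes. It assumes L2-normalized embeddings, so cosine similarity reduces to a dot product; the function name, margin value, and batch setup are illustrative assumptions.

import numpy as np

def hard_negative_triplet_loss(video_emb, text_emb, margin=0.2):
    # video_emb, text_emb: (batch, dim) arrays in which row i of each
    # matrix forms a matched video-text pair.
    sim = video_emb @ text_emb.T              # cosine similarity matrix
    pos = np.diag(sim)                        # matched-pair similarities
    # Mask the diagonal so the hardest negative is a true mismatch.
    masked = sim - np.eye(len(sim)) * 1e9
    hard_text = masked.max(axis=1)            # hardest negative text per video
    hard_video = masked.max(axis=0)           # hardest negative video per text
    # Hinge: push each matched similarity above its hardest negative by `margin`.
    loss_v2t = np.maximum(0.0, margin + hard_text - pos)
    loss_t2v = np.maximum(0.0, margin + hard_video - pos)
    return (loss_v2t + loss_t2v).mean()

# Toy usage with random, L2-normalized embeddings.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8)); v /= np.linalg.norm(v, axis=1, keepdims=True)
t = rng.normal(size=(4, 8)); t /= np.linalg.norm(t, axis=1, keepdims=True)
print(hard_negative_triplet_loss(v, t))

On evaluation, R@sum is conventionally the sum of the recall rates R@1, R@5, and R@10 over both retrieval directions, so a single number summarizes overall retrieval quality.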

Key words: cross-modal entity resolution, key frame extraction, common embedding space, dual encoding, hard negative sample

CLC number: