ZENG Zhixian1,2, CAO Jianjun1,2, WENG Nianfeng1,2, JIANG Guoquan1,2, FAN Qiang1,2
(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410003, Hunan, China; 2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, Jiangsu, China)
[1] LIU S, CHEN Z Z, LIU H Y, et al. User-video co-attention network for personalized micro-video recommendation[C]∥Proceedings of the World Wide Web Conference. San Francisco, CA, US: Association for Computing Machinery, 2019: 3020-3026.
[2] PENG Y X, QI J W, HUANG X. Current research status and prospects on multimedia content understanding[J]. Journal of Computer Research and Development, 2019, 56(1): 183-208. (in Chinese)
[3] DU P F, LI X Y, GAO Y L. Survey on multimodal visual language representation learning[J]. Journal of Software, 2021, 32(2): 327-348. (in Chinese)
[4] CHANG X, YANG Y, HAUPTMANN A, et al. Semantic concept discovery for large-scale zero-shot event detection[C]∥Proceedings of the 24th International Joint Conference on Artificial Intelligence. Buenos Aires, Argentina: AAAI Press, 2015: 2234-2240.
[5] HABIBIAN A, MENSINK T, SNOEK C G M. Composite concept discovery for zero-shot video event detection[C]∥Proceedings of the International Conference on Multimedia Retrieval. Glasgow, UK: Association for Computing Machinery, 2014: 17-24.
[6] JIANG B, YANG J C, LV Z H, et al. Internet cross-media retrieval based on deep learning[J]. Journal of Visual Communication and Image Representation, 2017, 48: 356-366.
[7] FROME A, CORRADO G, SHLENS J, et al. DeViSE: a deep visual-semantic embedding model[C]∥Proceedings of the 27th Annual Conference on Neural Information Processing Systems. Lake Tahoe, NV, US: Neural Information Processing Systems Foundation, Inc., 2013: 251-260.
[8] PENG Y X, QI J W, YUAN T X. CM-GANs: cross-modal generative adversarial networks for common representation learning[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(1): 3284750.
[9] RASIWASIA N, COSTA P J, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]∥Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy: Association for Computing Machinery, 2010: 251-260.
[10] DONG J F, LI X R, XU C X, et al. Dual encoding for zero-example video retrieval[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, US: IEEE, 2019: 9346-9355.
[11] SONG Y, SOLEYMANI M. Polysemous visual-semantic embedding for cross-modal retrieval[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, CA, US: IEEE, 2019: 1979-1988.
[12] DONG J F, LI X R, XU C X, et al. Dual encoding for video retrieval by text[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 14(22): 3059295.
[13] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, US: IEEE Computer Society, 2015: 3128-3137.
[14] PENG Y X, QI J W, ZHUO Y K. MAVA: multi-level adaptive visual-textual alignment by cross-media bi-attention mechanism[J]. IEEE Transactions on Image Processing, 2019, 29: 2728-2741.
[15] LI K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]∥Proceedings of the European Conference on Computer Vision. Munich, Germany: Springer, 2018: 201-216.
[16] CHEN S Z, ZHAO Y D, JIN Q, et al. Fine-grained video-text retrieval with hierarchical graph reasoning[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, US: IEEE, 2020: 10638-10647.
[17] LI W J, YAO J G, DONG T Z, et al. Improved interframe difference and a Gaussian model[C]∥Proceedings of the 8th International Congress on Image and Signal Processing. Shenyang, China: IEEE Computer Society, 2015: 969-973.
[18] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]∥Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, US: IEEE, 2009: 248-255.
[19] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, US: IEEE, 2016: 770-778.
[20] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]∥Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 4489-4497.
[21] MARKATOPOULOU F, GALANOPOULOS D, MEZARIS V, et al. Query and keyframe representations for ad-hoc video search[C]∥Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. Bucharest, Romania: ACM, 2017: 407-411.
[22] XU R, XIONG C M, CHEN W, et al. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework[C]∥Proceedings of the AAAI Conference on Artificial Intelligence. Austin, TX, US: AAAI, 2015: 2346-2352.
[23] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[24] CHO K, MERRIENBOER B V, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. Computer Science, 2014, 28(9): 2372-2385.
[25] DONG J, LI X, SNOEK C G M. Predicting visual features from text for image and video caption retrieval[J]. IEEE Transactions on Multimedia, 2018, 20(12): 3377-3388.
[26] ZHANG Y M, LAI Y P, MA J Z, et al. A video-based algorithm for armored vehicle and aircraft detection, tracking and trajectory prediction[J]. Acta Armamentarii, 2021, 42(3): 545-554. (in Chinese)
[27] AVGERINAKIS K, MOUMTZIDOU A, GALANOPOULOS D, et al. ITI-CERTH participation in TRECVID 2018[C]∥Proceedings of the 2018 TREC Video Retrieval Evaluation. Gaithersburg, MD, US: National Institute of Standards and Technology, 2018.
[28] YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78.
[29] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[C]∥Proceedings of the 1st International Conference on Learning Representations. Scottsdale, AZ, US: Workshop Track Proceedings, 2013: 213-225.
[30] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives[C]∥Proceedings of the British Machine Vision Conference. Newcastle, UK: BMVA Press, 2018: 740-755.
[31] XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, US: IEEE, 2016: 5288-5296.
[32] WANG X, WU J W, CHEN J K, et al. VaTeX: a large-scale, high-quality multilingual dataset for video-and-language research[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, Korea: IEEE, 2019: 4581-4591.
[33] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[J]. Computer Science, 2014, 15(1): 1-24.
[34] MITHUN N C, LI J C, METZE F, et al. Learning joint embedding with multimodal cues for cross-modal video-text retrieval[C]∥Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. Yokohama, Japan: Association for Computing Machinery, 2018.