
兵工学报 ›› 2024, Vol. 45 ›› Issue (10): 3631-3641. doi: 10.12382/bgxb.2023.0740


Multi-view Stereo Reconstruction Method Fusing an Attention Mechanism and Multi-layer Dynamic Deformable Convolution

SUN Kai1, ZHANG Cheng1,*, ZHAN Tian1, SU Di2

  1 Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education, School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China
    2 National Institute of Extremely-Weak Magnetic Field Infrastructure, Hangzhou 310051, Zhejiang, China

Multi-view Stereo Vision Reconstruction Network with Fusion Attention Mechanism and Multi-layer Dynamic Deformable Convolution

SUN Kai1, ZHANG Cheng1,*, ZHAN Tian1, SU Di2

  1 Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education, School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China
    2 National Institute of Extremely-Weak Magnetic Field Infrastructure, Hangzhou 310051, Zhejiang, China
  Received: 2023-08-10; Online: 2023-10-19

Abstract:

To address the problems that existing multi-view stereo (MVS) methods extract insufficient feature information from weakly textured regions and non-Lambertian surfaces and yield unsatisfactory reconstruction results, an AMDC-PatchmatchNet method that fuses an attention mechanism with multi-layer dynamic deformable convolution is proposed. A feature extraction network incorporating coordinate attention is constructed, which captures the edge shapes and texture features of the reconstructed object more accurately. An adaptive receptive field module based on dynamic deformable convolution is also fused into the network; it adaptively adjusts the size and shape of the receptive field according to features at different scales, yielding feature representations that combine global context with fine detail. Test results on the DTU dataset show that, compared with mainstream MVS methods, the proposed method improves the overall point-cloud reconstruction metric by 2.8%, and the generalization ability of the model is further verified on an aerial image dataset.
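The coordinate-attention feature extraction is only summarized at the abstract level. As a minimal, hypothetical PyTorch sketch of such a block (the class name, reduction ratio, and pooling layout are assumptions following the common coordinate-attention design, not details taken from the paper), attention factorized along the height and width directions looks roughly like this:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Illustrative coordinate-attention block: attention weights are computed
    separately along the height and width axes so that positional information
    (e.g., object edges) is preserved. All sizes here are assumptions."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Pool along each spatial direction to keep direction-aware position cues.
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w  # reweight features with both attention maps
```

In a multi-scale feature pyramid of the kind used by PatchmatchNet-style networks, a block like this would typically be inserted after each scale's convolution stage; whether the paper places it exactly there cannot be inferred from the abstract alone.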

Key words: multi-view stereo vision, attention mechanism, dynamic deformable convolution, deep learning

Abstract:

Existing multi-view stereo (MVS) methods do not adequately extract feature information from weakly textured regions and non-Lambertian surfaces, and their reconstruction results are unsatisfactory. To address these problems, an AMDC-PatchmatchNet method that fuses an attention mechanism with multi-layer dynamic deformable convolution is proposed. In this method, a feature extraction network integrating coordinate attention is constructed, which captures the edge shapes and texture features of reconstructed objects more accurately. An adaptive receptive field module based on dynamic deformable convolution is also integrated into the feature extraction network, so that the size and shape of the receptive field are adjusted adaptively according to features at different scales, producing feature representations with both global context and fine detail. Test results on the DTU dataset show that the overall point-cloud reconstruction metric of the proposed method is improved by 2.8% compared with those of mainstream MVS methods, and the generalization ability of AMDC-PatchmatchNet is further verified on aerial image datasets.
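The adaptive receptive field module is likewise described only qualitatively. A hedged sketch of how a dynamic deformable convolution can adapt the receptive field, using torchvision's DeformConv2d (the module name AdaptiveReceptiveField, channel counts, and zero offset initialization are illustrative assumptions, not the paper's implementation), might look as follows:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptiveReceptiveField(nn.Module):
    """Illustrative dynamic deformable convolution: a small conv predicts
    per-pixel sampling offsets, so the effective receptive field can shift
    and stretch toward informative structures instead of a fixed 3x3 grid."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Two offsets (dy, dx) are predicted for every kernel tap at every pixel.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        nn.init.zeros_(self.offset_pred.weight)  # start from the regular grid
        nn.init.zeros_(self.offset_pred.bias)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_pred(x)   # (n, 2*k*k, h, w), content-dependent
        return self.deform(x, offset)  # sample features at the shifted taps

# Toy usage on a dummy feature map.
feat = torch.randn(1, 32, 64, 80)
out = AdaptiveReceptiveField(32)(feat)
print(out.shape)  # torch.Size([1, 32, 64, 80])
```

Applying such a layer at several scales of the feature pyramid is presumably what the "multi-layer" in the method's name refers to; the exact number and placement of layers are not specified in the abstract.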

Key words: multi-view stereo vision, attention mechanism, dynamic deformable convolution, deep learning
