Enhanced Multi-scale Target Detection Method Based on YOLOv5

doi:10.12382/bgxb.2022.1147

Abstract

To address the difficulty of matching initial anchor boxes to targets and the weak multi-scale detection capability in complex scenes, an enhanced multi-scale target detection method based on YOLOv5 (EM-YOLOv5) is proposed. The Kmeans++ clustering algorithm is used to obtain multi-scale initial anchor boxes suited to the current detection scene, which makes it easier for the network to capture targets of different scales. Several parallel convolution branches of different scales are then added to the Bottleneck structure, so that multi-scale feature information is fused while the original feature information is retained, enhancing the global perception ability of the model. The proposed EM-YOLOv5s model is tested on the VisDrone2019, COCO2017, and PASCAL VOC2012 datasets. The experimental results show that, compared with the YOLOv5s model, key indicators such as mAP@0.5:0.95 and mAP@0.5 are improved; on PASCAL VOC2012, mAP@0.5:0.95 increases by 5.2% while the detection time increases by only 1.9 ms, indicating that the EM-YOLOv5 model can effectively improve target detection accuracy in general complex scenes.

Key words: YOLOv5 model, target detection, clustering algorithm, multi-scale convolution, feature fusion
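
The anchor-initialization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that boxes are clustered by (width, height) with the distance d = 1 - IoU, a common choice for YOLO-style anchors that the abstract itself does not specify. The function names (`iou_wh`, `kmeanspp_init`, `kmeans_anchors`) and the sample boxes are hypothetical and are not taken from the paper.

```python
import random


def iou_wh(box, anchor):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeanspp_init(boxes, k, rng):
    """K-means++ seeding: each new centre is drawn with probability proportional to d^2."""
    centres = [rng.choice(boxes)]
    while len(centres) < k:
        # Squared distance of every box to its nearest centre (assumed d = 1 - IoU).
        d2 = [min(1.0 - iou_wh(b, c) for c in centres) ** 2 for b in boxes]
        if sum(d2) == 0:
            # All boxes coincide with existing centres; fall back to a uniform draw.
            centres.append(rng.choice(boxes))
        else:
            centres.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centres


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors; returns anchors sorted by area."""
    rng = random.Random(seed)
    centres = kmeanspp_init(boxes, k, rng)
    for _ in range(iters):
        # Assignment step: each box joins the centre with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centres[i]))
            clusters[best].append(b)
        # Update step: mean width and height; an empty cluster keeps its old centre.
        new = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new == centres:
            break
        centres = new
    return sorted(centres, key=lambda c: c[0] * c[1])


# Hypothetical ground-truth box sizes in pixels (small, medium and large targets).
sample_boxes = [(10, 12), (12, 10), (11, 11), (50, 40), (48, 44), (52, 42),
                (200, 180), (190, 200), (210, 190)]
print(kmeans_anchors(sample_boxes, k=3))
```

Compared with plain K-means, only the seeding differs: initial centres are spread out across the data, which reduces the chance that two anchors start in the same size range. A YOLOv5-style detector would typically use nine anchors (three per detection scale) rather than the three used in this toy example.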

CLC number: TP391.41

HUI Kanghua, YANG Wei, LIU Haohan, ZHANG Zhi, ZHENG Jin, BAI Xiao. Enhanced Multi-scale Target Detection Method Based on YOLOv5[J]. Acta Armamentarii, 2023, 44(9): 2600-2610. (in Chinese)



Anchor algorithm	Dataset	Recall/%	mAP@0.5/%	mAP@0.5:0.95/%
Kmeans	VOC2012	78.8	84.1	60.4
Kmeans++	VOC2012	79.8	84.7	60.8
GMM	VOC2012	79.6	84.8	61.0
Agglomerative	VOC2012	76.6	83.8	58.7
Kmeans	COCO	51.7	57.2	37.8
Kmeans++	COCO	52.7	57.4	38.1
GMM	COCO	50.2	54.3	34.6
Agglomerative	COCO	50.8	55.6	35.4
Kmeans	VisDrone	33.5	34.5	18.8
Kmeans++	VisDrone	35.0	35.4	19.4
GMM	VisDrone	32.6	33.8	16.6
Agglomerative	VisDrone	33.2	34.1	17.2


Structure	Dataset	Recall/%	mAP@0.5/%	mAP@0.5:0.95/%	Parameters/M
C3	VOC2012	78.8	84.1	60.4	7.0
C3	VisDrone	33.5	34.5	18.8	7.0
EM-C3	VOC2012	79.9	85.7	65.4	8.4
EM-C3	VisDrone	35.2	35.8	20.4	8.4
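
The EM-C3 rows correspond to a Bottleneck with parallel convolution branches of different scales that keeps the original features. The following is only a rough sketch of that idea in plain Python, under several assumptions that the source does not confirm: the kernel sizes (1, 3, 5), the use of fixed averaging kernels in place of learned ones, fusion by averaging the branches, and an identity shortcut. All function names are hypothetical.

```python
def conv2d_same(x, kernel):
    """Zero-padded 'same' 2D cross-correlation on a list-of-lists feature map."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - r, j + dj - r
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di][dj] * x[ii][jj]
            out[i][j] = acc
    return out


def box_kernel(k):
    """Stand-in for a learned k x k kernel (assumption): a uniform averaging filter."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def multi_scale_bottleneck(x, kernel_sizes=(1, 3, 5)):
    """Run parallel branches with different receptive fields, average them,
    and add the input back so the original features are retained."""
    branches = [conv2d_same(x, box_kernel(k)) for k in kernel_sizes]
    h, w = len(x), len(x[0])
    return [
        [x[i][j] + sum(b[i][j] for b in branches) / len(branches) for j in range(w)]
        for i in range(h)
    ]


# A 5 x 5 feature map with a single activation at the centre.
feature = [[0.0] * 5 for _ in range(5)]
feature[2][2] = 1.0
for row in multi_scale_bottleneck(feature):
    print([round(v, 3) for v in row])
```

Each additional branch carries its own weights in a real network, which is consistent with the parameter count in the table rising from 7.0 M to 8.4 M, although the exact branch layout used in the paper may differ from this sketch.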


Model	Recall/%	mAP@0.5/%	mAP@0.5:0.95/%	Parameters/M
YOLOv5s	78.8	84.1	60.4	7.0
YOLOv5m	82.0	87.6	67.1	20.9
YOLOv5l	85.1	89.1	69.3	46.2
YOLOv5s-EM_C3	79.9	85.7	65.4	8.4
YOLOv5m-EM_C3	84.2	88.3	69.2	24.2
YOLOv5l-EM_C3	85.5	89.3	70.6	52.1
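
The trade-off in the table above can be made explicit with a small calculation. The sketch below copies the table rows and computes, for each model size, the gain in mAP@0.5:0.95 and the extra parameters introduced by EM_C3. The dictionary layout and the `compare` helper are illustrative names, not part of the paper.

```python
# Rows copied from the table: (recall %, mAP@0.5 %, mAP@0.5:0.95 %, parameters in M).
baseline = {
    "s": (78.8, 84.1, 60.4, 7.0),
    "m": (82.0, 87.6, 67.1, 20.9),
    "l": (85.1, 89.1, 69.3, 46.2),
}
em_c3 = {
    "s": (79.9, 85.7, 65.4, 8.4),
    "m": (84.2, 88.3, 69.2, 24.2),
    "l": (85.5, 89.3, 70.6, 52.1),
}


def compare(size):
    """Return (mAP@0.5:0.95 gain in points, extra parameters in M, parameter overhead in %)."""
    b, e = baseline[size], em_c3[size]
    gain = round(e[2] - b[2], 1)
    extra = round(e[3] - b[3], 1)
    overhead = round(100 * extra / b[3], 1)
    return gain, extra, overhead


for size in ("s", "m", "l"):
    gain, extra, overhead = compare(size)
    print(f"YOLOv5{size}: +{gain} points mAP@0.5:0.95 for +{extra} M parameters ({overhead}%)")
```

For the s model this gives a gain of 5.0 points for 1.4 M extra parameters (20.0%); for m and l the gains are 2.1 and 1.3 points for 15.8% and 12.8% more parameters, so the benefit of EM_C3 shrinks as the base model grows.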