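1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
*Email: zhangz@cauc.edu.cn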
Received: 2022-11-30,
Published online: 2023-09-25,
Published in print: 2023-09-20
Kanghua HUI, Wei YANG, Haohan LIU, et al. Enhanced Multi-scale Target Detection Method Based on YOLOv5[J]. Acta Armamentarii, 2023, 44(9): 2600-2610. DOI: 10.12382/bgxb.2022.1147.
In complex scenes, initial anchor boxes often match targets poorly and multi-scale detection ability is weak. To address these problems, an enhanced multi-scale target detection method based on YOLOv5 (EM-YOLOv5) is proposed. First, the Kmeans++ clustering algorithm is used to obtain multi-scale initial anchor boxes suited to the current detection scene, which makes it easier for the network to capture targets of different scales. Second, several parallel convolution branches of different scales are added to the Bottleneck structure, so that multi-scale feature information is fused while the original feature information is retained, which enhances the global perception ability of the model. The proposed EM-YOLOv5s model is tested on the VisDrone2019, COCO2017, and PASCAL VOC2012 datasets. The experimental results show that, compared with the YOLOv5s model, key indicators such as mAP@0.5:0.95 and mAP@0.5 are improved. On PASCAL VOC2012, mAP@0.5:0.95 increases by 5.2% while the detection time increases by only 1.9 ms, indicating that the EM-YOLOv5 model can effectively improve target detection accuracy in general complex scenes.
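
The anchor initialization step described in the abstract can be illustrated with a minimal sketch. The abstract only states that Kmeans++ clustering is used, so the details below are assumptions rather than the authors' implementation: boxes are reduced to (width, height) pairs, the distance is taken as 1 - IoU (a common choice for anchor clustering), and cluster centres are updated with the plain mean. The names `iou_wh` and `kmeanspp_anchors` are hypothetical.

```python
import random


def iou_wh(box, anchor):
    """IoU of two (width, height) boxes aligned at a shared corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeanspp_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors; assumes at least k distinct boxes."""
    rng = random.Random(seed)

    # K-means++ seeding: the first centre is drawn uniformly; each further
    # centre is drawn with probability proportional to the squared distance
    # (here 1 - IoU, an assumption) to the nearest centre chosen so far.
    centres = [rng.choice(boxes)]
    while len(centres) < k:
        d2 = [min(1.0 - iou_wh(b, c) for c in centres) ** 2 for b in boxes]
        if sum(d2) == 0:
            raise ValueError("need at least k distinct boxes")
        centres.append(rng.choices(boxes, weights=d2, k=1)[0])

    # Lloyd iterations: assign each box to the anchor with the highest IoU,
    # then move each anchor to the mean width/height of its members.
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centres[i]))
            clusters[best].append(b)
        updated = []
        for centre, members in zip(centres, clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                updated.append((w, h))
            else:
                updated.append(centre)  # keep an empty cluster's old centre
        if updated == centres:
            break
        centres = updated

    # Return anchors ordered from smallest to largest area.
    return sorted(centres, key=lambda c: c[0] * c[1])
```

Compared with uniformly random initialization, the seeding step tends to spread the initial centres across small, medium, and large boxes, which is the property the abstract relies on for multi-scale anchors.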
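
For readers unfamiliar with the reported metrics, the following sketch shows how mAP@0.5:0.95 relates to mAP@0.5 under the COCO-style convention: mAP is computed at ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged. Whether the paper follows exactly this protocol is an assumption, and the helper names (`box_iou`, `thresholds_passed`, `map_50_95`) are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


def map_50_95(map_per_threshold):
    """Average ten mAP values, ordered to match IOU_THRESHOLDS."""
    if len(map_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one mAP value per IoU threshold")
    return sum(map_per_threshold) / len(map_per_threshold)
```

Because stricter thresholds reject loosely localized boxes, mAP@0.5:0.95 rewards tighter localization than mAP@0.5 alone; the first element of `map_per_threshold` is mAP@0.5 itself.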