Multi-agent Coverage Path Planning Based on Security Reinforcement Learning

doi:10.12382/bgxb.2023.0881

Abstract

Coverage path planning seeks a safe path along which an agent covers the task area effectively while avoiding obstacles and neighboring agents. Because coverage tasks often involve large and complex task areas, it is worth exploring how to keep the agents safe and strengthen their collaboration so as to improve the task efficiency and capability of the cluster. To this end, a discrete mathematical model of coverage path planning is established on raster maps, a safe multi-agent reinforcement learning algorithm based on the value decomposition network (VDN) is proposed, and its soundness is demonstrated theoretically. By decomposing the group value function, the proposed algorithm prevents agents from receiving spurious rewards and strengthens the learning of collaborative coverage strategies, which speeds up convergence. A shield introduced in the training process corrects unsafe behaviors such as going out of bounds and colliding, so that each agent remains safe throughout the task. Simulation and semi-physical experiments show that the algorithm ensures the coverage efficiency of the agents while effectively maintaining their safety.

Key words: multi-agent system, coverage path planning, safe reinforcement learning, value decomposition network

CLC Number:

TP183

LI Song, MA Zhuangzhuang, ZHANG Yunlin, SHAO Jinliang. Multi-agent Coverage Path Planning Based on Security Reinforcement Learning[J]. Acta Armamentarii, 2023, 44(S2): 101-113.

Figures/Tables 23

Fig.1 Schematic diagram of multi-agent coverage path

Fig.2 Schematic diagram of rasterization

Fig.3 Framework of VDN algorithm
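
The value decomposition idea named in the abstract can be illustrated with a minimal sketch. It assumes the standard additive VDN form, in which the group value is the sum of per-agent values; the function names and the list-based representation of Q-values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def joint_q_value(per_agent_q: Sequence[Sequence[float]],
                  actions: Sequence[int]) -> float:
    """Additive decomposition: Q_tot = sum_i Q_i(o_i, a_i)."""
    return sum(q[a] for q, a in zip(per_agent_q, actions))


def greedy_joint_action(per_agent_q: Sequence[Sequence[float]]) -> List[int]:
    """Since Q_tot is a plain sum, maximizing it reduces to each agent
    maximizing its own Q_i independently."""
    return [max(range(len(q)), key=q.__getitem__) for q in per_agent_q]


def vdn_td_target(team_reward: float,
                  next_per_agent_q: Sequence[Sequence[float]],
                  gamma: float = 0.99,
                  done: bool = False) -> float:
    """One-step TD target for the summed value:
    r + gamma * sum_i max_a Q_i(o_i', a)."""
    if done:
        return team_reward
    return team_reward + gamma * sum(max(q) for q in next_per_agent_q)
```

Only the team reward enters the target, so each per-agent network receives its learning signal through the shared sum rather than through an individually assigned reward.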

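Algorithm 1 Safety constraint module
1: Collect the state $s_i^t$ of agent $i$ at time $t$ and its pre-executed action $\bar{a}_i^t$
2: Predict the agent state $\bar{s}_i^{t+1}$ at time $t+1$
3: if $\bar{s}_i^{t+1}$ is out of bounds or in collision do
4:   $a_i^t$ = stop, $R_i^t = -c$, $c \in \mathbb{R}^+$
5: else
6:   $a_i^t = \bar{a}_i^t$
7:   Execute action $a_i^t$, observe reward $R_i^t$ and state $s_i^{t+1}$
8: end if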
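
The check in Algorithm 1 can be sketched on a square grid as follows. This is only an illustration under stated assumptions: the action names, the `MOVES` table, the `blocked` set (obstacle cells plus cells occupied by other agents), and the default penalty value are hypothetical and do not come from the paper.

```python
from typing import Dict, Optional, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical action set: four grid moves plus "stop".
MOVES: Dict[str, Cell] = {
    "up": (-1, 0), "down": (1, 0),
    "left": (0, -1), "right": (0, 1),
    "stop": (0, 0),
}


def shield(state: Cell, proposed: str, size: int, blocked: Set[Cell],
           penalty: float = 1.0) -> Tuple[str, Optional[float]]:
    """Return (executed_action, shield_reward).

    shield_reward is -penalty when the proposed action is overridden,
    and None when the action is safe and the reward should instead be
    observed from the environment.
    """
    dr, dc = MOVES[proposed]
    predicted = (state[0] + dr, state[1] + dc)  # predicted next state
    out_of_bounds = not (0 <= predicted[0] < size and 0 <= predicted[1] < size)
    collision = predicted in blocked
    if out_of_bounds or collision:
        return "stop", -penalty   # replace the unsafe action with stop
    return proposed, None         # safe: keep the proposed action
```

For example, `shield((0, 0), "up", 10, set())` overrides the move because the predicted cell lies outside a 10×10 map.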

Fig.4 Recurrent neural network structure

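Algorithm 2 ε-greedy method
1: Get the current episode number episode and a random number rand
2: $\varepsilon = \frac{\varepsilon_{max} - \varepsilon_{min}}{\varepsilon_{max\_epi}} \times \text{episode} + \varepsilon_{min}$
3: if rand ≥ ε do
4:   $\bar{a}_i^t$ = random(A)
5: else
6:   $\bar{a}_i^t = \arg\max_{a_i^t} \tilde{Q}_i(\Phi_i^t(s_t), \theta_i^t)$
7: end if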
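
A minimal sketch of the schedule and action choice in Algorithm 2 follows. The default values are the greedy coefficients and episode count listed in Table 1; clipping the coefficient at its maximum and representing the network output as a plain list of Q-values are assumptions made for illustration.

```python
import random
from typing import Optional, Sequence


def greedy_coefficient(episode: int, eps_min: float = 0.05,
                       eps_max: float = 0.95,
                       eps_max_episode: int = 2000) -> float:
    """Linear growth from eps_min toward eps_max (line 2 of Algorithm 2).
    The clip at eps_max is an assumption, not stated in the algorithm."""
    eps = (eps_max - eps_min) / eps_max_episode * episode + eps_min
    return min(eps, eps_max)


def select_action(q_values: Sequence[float], eps: float,
                  rng: Optional[random.Random] = None) -> int:
    """Pick a random action when rand >= eps, otherwise the greedy one.
    Here eps is the probability of acting greedily, as in Algorithm 2."""
    source = rng or random
    if source.random() >= eps:
        return source.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)     # exploit
```

Because the coefficient grows with the episode number, early episodes are dominated by random exploration and later episodes by greedy actions.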

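Algorithm 3 Policy network training
1: Randomly initialize the parameters θ₁…θ_n of networks $\tilde{Q}_1 … \tilde{Q}_n$
2: for episode=1,2,…,episode_max do
3:   Reset the environment, $R_i^t ← 0$
4:   for step=1,2,…,step_max do
5:     for i=1,2,…,N do
6:       Obtain the pre-executed action $\bar{a}_i^t$ according to Algorithm 2
7:       Collect $\bar{a}_i^t$ and obtain a valid sample according to Algorithm 1
8:       Update the parameters θ_i of network $\tilde{Q}_i$
9:     end for
10:   end for
11:   if mean($R_i^t$) ≥ goal_th do
12:     break
13:   end if
14: end for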
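
The outer loop and stopping test of Algorithm 3 (lines 2 and 11 to 13) can be sketched as below. The sketch assumes that the mean reward is taken over a sliding window of recent episodes whose length matches the stack capacity of 100 in Table 1; that reading, and the `run_episode` callable standing in for lines 3 to 10, are hypothetical.

```python
from collections import deque
from statistics import mean
from typing import Callable, Deque


def train(run_episode: Callable[[int], float],
          episode_max: int = 2000,
          goal_th: float = 4500.0,
          window: int = 100) -> int:
    """Run episodes until the windowed mean reward reaches goal_th.

    run_episode(episode) is a hypothetical callable that resets the
    environment, steps every agent, updates the networks, and returns
    the episode reward. Returns the number of episodes actually run.
    """
    recent: Deque[float] = deque(maxlen=window)  # keeps only the last `window` rewards
    for episode in range(1, episode_max + 1):
        recent.append(run_episode(episode))
        # Assumed rule: stop only once the window is full.
        if len(recent) == window and mean(recent) >= goal_th:
            return episode
    return episode_max
```

Waiting for a full window avoids stopping on a single lucky episode; whether the paper applies the same rule is not stated in the listing.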

Table 1 Hyperparameters used in experiments

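Hyperparameter	Value
Network learning rate	0.001
Neurons per hidden layer	64
Discount factor	0.99
Network update interval/episodes	20
Minimum greedy coefficient	0.05
Maximum greedy coefficient	0.95
Episodes to reach maximum greedy coefficient	2000
Maximum number of episodes	2000
Reward threshold	4500
Map size	10
Average-score stack capacity	100
Maximum number of steps	30/120
Number of agents	4/1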

Fig.5 Comparison chart of reward curves

Table 2 Comparison of coverage performance of the algorithms

Algorithm	Coverage rate/%	Repetition rate/%
VDN_safe	100.0	28.1
center_safe	93.8	35.4
singal_safe	99.0	24.0
VDN_unsafe	99.0	30.2
center_unsafe	89.5	39.5
singal_unsafe	95.8	29.1
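
As an illustration of how the two metrics in Table 2 might be computed from agent paths on a square grid, consider the sketch below. The definitions are assumptions, since the listing does not state them: coverage rate is taken as the share of obstacle-free cells visited at least once, and repetition rate as the number of repeat visits divided by the number of obstacle-free cells. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Set, Tuple

Cell = Tuple[int, int]


def coverage_metrics(paths: Iterable[Sequence[Cell]], size: int,
                     obstacles: Set[Cell]) -> Tuple[float, float]:
    """Return (coverage rate in %, repetition rate in %) under the
    assumed definitions described above."""
    free_cells = size * size - len(obstacles)
    # Count visits per free cell across all agents' paths.
    visits = Counter(cell for path in paths for cell in path
                     if cell not in obstacles)
    covered = len(visits)                               # cells visited at least once
    repeated = sum(n - 1 for n in visits.values())      # visits beyond the first
    return 100.0 * covered / free_cells, 100.0 * repeated / free_cells
```

For two agents on a 2×2 map with paths `[(0, 0), (0, 1)]` and `[(1, 1), (0, 1)]`, three of four cells are covered and one visit is a repeat.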

Fig.6 Coverage path curves for each algorithm

Fig.7 Coverage efficiency comparison chart

Fig.8 Comparison chart of reward curves with and without safety constraints

Fig.9 Comparison of coverage paths with and without safety constraints in a map with tight spaces

Fig.10 Coverage path curve after changing the number and position of obstacles

Fig.11 Coverage path curve after expanding the area

Fig.12 Coverage path curve after expanding the area and adding obstacles

Table 3 Results of the comparative experiments

Experiment	Coverage rate/%	Repetition rate/%
10×10, single obstacle	100.0	28.1
10×10, multiple obstacles	100.0	32.6
20×20, single obstacle	99.5	22.0
20×20, multiple obstacles	98.9	22.9

Table 4 Coverage performance with different numbers of agents

Number of agents	Coverage rate/%	Repetition rate/%
3	97.9	26.5
4	98.9	22.9
5	98.9	24.2
6	99.2	25.5

Fig.13 Schematic diagram of semi-physical platform frame

Fig.14 Controller structure

Fig.15 Schematic diagram of semi-physical simulation

Fig.16 Comparison chart of planning path and actual path of unmanned ground vehicle
