Multi-agent Coverage Path Planning Based on Safe Reinforcement Learning

doi:10.12382/bgxb.2023.0881

Abstract:

Coverage path planning aims to find a safe trajectory for an agent that effectively covers the task area while avoiding obstacles and neighboring agents. Coverage tasks inevitably involve large, complex task areas, so it is worth exploring how to strengthen cooperation among agents, while keeping them safe, in order to overcome the low task efficiency and limited capability of the swarm. To this end, a discrete mathematical model of coverage path planning is established on a grid map, a safe multi-agent reinforcement learning algorithm based on the value decomposition network (VDN) is proposed, and its soundness is established by theoretical proof. By decomposing the group value function, the algorithm avoids spurious rewards for individual agents, which strengthens the learning of cooperative coverage strategies and speeds up convergence. A shield introduced during training corrects unsafe behaviors such as going out of bounds and colliding, which keeps the agents safe throughout the task. Simulation and semi-physical experiments show that the proposed algorithm maintains the coverage efficiency of the agents while effectively keeping them safe.

Key words: multi-agent system, coverage path planning, safe reinforcement learning, value decomposition network

CLC number:

TP183

LI Song, MA Zhuangzhuang, ZHANG Yunlin, SHAO Jinliang. Multi-agent Coverage Path Planning Based on Security Reinforcement Learning[J]. Acta Armamentarii, 2023, 44(S2): 101-113. (in Chinese)

Figures/Tables 23

Fig.1 Schematic diagram of multi-agent coverage path

Fig.2 Schematic diagram of rasterization

Fig.3 Framework of VDN algorithm

Algorithm 1 Safety constraint module
1: Collect the state $s_i^t$ of agent $i$ at time $t$ and its pre-executed action $\bar{a}_i^t$
2: Predict the agent's state $\bar{s}_i^{t+1}$ at time $t+1$
3: if $\bar{s}_i^{t+1}$ is out of bounds or in collision then
4:     $a_i^t = \text{stop}$, $R_i^t = -c$, $c \in \mathbb{R}^+$
5: else
6:     $a_i^t = \bar{a}_i^t$
7:     Execute action $a_i^t$, observe reward $R_i^t$ and $s_i^{t+1}$
8: end if
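The shield in Algorithm 1 can be sketched in Python for a grid map. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid encoding (1 marks an obstacle), the action set `ACTIONS`, the helper name `shield`, and the default penalty `c = 1.0` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical action set: four grid moves plus "stop".
ACTIONS: Dict[str, Cell] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stop": (0, 0),
}


def shield(state: Cell, proposed: str, grid: List[List[int]],
           occupied: Set[Cell], c: float = 1.0) -> Tuple[str, float, bool]:
    """Return (action to execute, shield penalty, whether the shield intervened).

    grid[row][col] == 1 marks an obstacle (assumed encoding); `occupied` holds
    the cells currently taken by the other agents.
    """
    d_row, d_col = ACTIONS[proposed]
    nxt = (state[0] + d_row, state[1] + d_col)    # predicted state at t+1
    rows, cols = len(grid), len(grid[0])
    out_of_bounds = not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols)
    collision = (not out_of_bounds) and (grid[nxt[0]][nxt[1]] == 1 or nxt in occupied)
    if out_of_bounds or collision:
        return "stop", -c, True                   # steps 3-4: override with stop, reward -c
    return proposed, 0.0, False                   # steps 5-6: pass the action through
```

When the shield does not intervene, the returned penalty is zero and the environment reward would be observed as usual after the action is executed (step 7).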

Fig.4 Recurrent neural network structure

Algorithm 2 ε-greedy method
1: Get the current episode number episode and a random number rand
2: $\varepsilon = \dfrac{\varepsilon_{\max} - \varepsilon_{\min}}{\varepsilon_{\mathrm{max\_epi}}} \times \text{episode} + \varepsilon_{\min}$
3: if rand ≥ ε then
4:     $\bar{a}_i^t = \text{random}(A)$
5: else
6:     $\bar{a}_i^t = \arg\max_{a_i^t} \left( \Phi_i^t(s_t), \theta_i^t \right)$
7: end if
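The schedule in Algorithm 2 can be sketched as follows. As written in the algorithm, a random action is taken when rand ≥ ε, so ε here is the probability of acting greedily and grows with the episode number. The default values are the greedy coefficients and episode count listed in Table 1; the function names, the list of Q-values as input, and the cap at `eps_max` are assumptions made for this illustration.

```python
import random
from typing import Optional, Sequence


def greedy_coefficient(episode: int, eps_min: float = 0.05, eps_max: float = 0.95,
                       eps_max_epi: int = 2000) -> float:
    """Linear schedule from step 2 of Algorithm 2 (defaults taken from Table 1)."""
    eps = (eps_max - eps_min) / eps_max_epi * episode + eps_min
    return min(eps, eps_max)  # cap is an assumption, not stated in the algorithm


def select_action(q_values: Sequence[float], eps: float,
                  rng: Optional[random.Random] = None) -> int:
    """Pick the greedy action with probability eps, a uniformly random one otherwise."""
    if rng is None:
        rng = random.Random()
    if rng.random() >= eps:                       # steps 3-4: explore
        return rng.randrange(len(q_values))
    # steps 5-6: exploit the current value estimates
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With these defaults the coefficient starts at 0.05 and reaches 0.95 at episode 2000, so early episodes are dominated by random exploration.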

Algorithm 3 Policy network training
1: Randomly initialize the parameters $\theta_1, \dots, \theta_N$ of the networks $\tilde{Q}_1, \dots, \tilde{Q}_N$
2: for episode = 1, 2, …, episode_max do
3:     Reset the environment, $R_i^t \leftarrow 0$
4:     for step = 1, 2, …, step_max do
5:         for i = 1, 2, …, N do
6:             Obtain the pre-executed action $\bar{a}_i^t$ according to Algorithm 2
7:             Collect $\bar{a}_i^t$ and obtain a valid sample according to Algorithm 1
8:             Update the parameters $\theta_i$ of network $\tilde{Q}_i$
9:         end for
10:     end for
11:     if mean($R_i^t$) ≥ goal_th then
12:         break
13:     end if
14: end for
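Algorithm 3 does not spell out the update in step 8. In the standard value decomposition network formulation (Sunehag et al.), the joint action value is the sum of the per-agent values, and a single temporal-difference error on that sum drives the update of every agent network. The sketch below shows only this additive decomposition for one transition, assuming a shared team reward; the function names and list-based inputs are hypothetical, and the recurrent network of Fig.4 is omitted.

```python
from typing import Sequence


def vdn_joint_q(per_agent_q: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """Q_tot = sum_i Q_i(o_i, a_i); per_agent_q[i][a] is agent i's value for action a."""
    return sum(q_i[a_i] for q_i, a_i in zip(per_agent_q, actions))


def vdn_td_target(team_reward: float, next_per_agent_q: Sequence[Sequence[float]],
                  gamma: float = 0.99, done: bool = False) -> float:
    """One-step target r + gamma * sum_i max_a Q_i(o_i', a).

    Because Q_tot is a sum, its maximum over joint actions is the sum of the
    per-agent maxima.
    """
    if done:
        return team_reward
    return team_reward + gamma * sum(max(q_i) for q_i in next_per_agent_q)


# Toy transition with two agents and two actions each (made-up numbers).
q = [[1.0, 2.0], [0.5, 0.0]]
q_next = [[0.0, 1.0], [2.0, 1.0]]
q_tot = vdn_joint_q(q, [1, 0])                    # 2.0 + 0.5
target = vdn_td_target(1.0, q_next, gamma=0.5)    # 1.0 + 0.5 * (1.0 + 2.0)
td_error = target - q_tot                         # shared error signal for all agents
```

The default `gamma = 0.99` matches the discount factor in Table 1; the toy call overrides it only to keep the arithmetic easy to follow.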

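Table 1 Hyperparameters used in experiments

Hyperparameter	Value
Network learning rate	0.001
Neurons per hidden layer	64
Discount factor	0.99
Network update interval/episodes	20
Minimum greedy coefficient	0.05
Maximum greedy coefficient	0.95
Episodes to reach maximum greedy coefficient	2000
Maximum number of episodes	2000
Reward threshold	4500
Map size	10
Average-score stack capacity	100
Maximum number of steps	30/120
Number of agents	4/1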

Fig.5 Comparison of reward curves

Table 2 Comparison of algorithm coverage performance

Algorithm	Coverage rate/%	Repetition rate/%
VDN_safe	100.0	28.1
center_safe	93.8	35.4
singal_safe	99.0	24.0
VDN_unsafe	99.0	30.2
center_unsafe	89.5	39.5
singal_unsafe	95.8	29.1
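This section does not state how the coverage rate and repetition rate in Table 2 are computed. The sketch below uses one plausible pair of definitions, which are assumptions made here and may differ from the paper's: coverage is the share of free cells visited at least once, and repetition is the share of free cells visited more than once. The function name and path representation are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]


def coverage_metrics(paths: Iterable[List[Cell]], free_cells: int) -> Tuple[float, float]:
    """Return (coverage rate %, repetition rate %) under the assumed definitions.

    coverage   = distinct visited cells / free cells
    repetition = cells visited more than once / free cells
    """
    visits = Counter(cell for path in paths for cell in path)
    covered = len(visits)
    repeated = sum(1 for count in visits.values() if count > 1)
    return 100.0 * covered / free_cells, 100.0 * repeated / free_cells
```

Each entry of `paths` is the sequence of grid cells visited by one agent, so cells shared between agents count as repeated as well.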

Fig.6 Coverage path curves for each algorithm

Fig.7 Comparison of coverage efficiency

Fig.8 Comparison of reward curves with and without safety constraints

Fig.9 Comparison of coverage paths with and without safety constraints in a map with tight spaces

Fig.10 Coverage path curves after changing the number and positions of obstacles

Fig.11 Coverage path curves after expanding the area

Fig.12 Coverage path curves after expanding the area and adding obstacles

Table 3 Results of the comparative experiments

Experiment	Coverage rate/%	Repetition rate/%
10×10, single obstacle	100.0	28.1
10×10, multiple obstacles	100.0	32.6
20×20, single obstacle	99.5	22.0
20×20, multiple obstacles	98.9	22.9

Table 4 Coverage performance for different numbers of agents

Number of agents	Coverage rate/%	Repetition rate/%
3	97.9	26.5
4	98.9	22.9
5	98.9	24.2
6	99.2	25.5

Fig.13 Schematic diagram of the semi-physical platform framework

Fig.14 Controller structure

Fig.15 Schematic diagram of semi-physical simulation

Fig.16 Comparison of the planned path and the actual path of the unmanned ground vehicle
