Autonomous Decision-making and Intelligent Collaboration of UAV Swarms Based on Reinforcement Learning with Sparse Rewards

doi:10.12382/bgxb.2022.0177

Abstract

Abstract:

UAV swarms will profoundly shape the pattern of warfare. To improve the capability of autonomous decision-making algorithms for UAV swarms, this paper studies autonomous decision-making methods for attack-defense confrontation between heterogeneous UAV swarms. It first gives an overview of the design of the UAV swarm confrontation model and then designs a model of the attack-defense confrontation scenario. To address the sparse reward problem that is widespread when reinforcement learning is applied to autonomous decision-making of UAV swarms, a reward mechanism setting method based on local reward reshaping is proposed. Prioritized experience replay is then added on top of this method, which effectively mitigates the sparse reward problem. Finally, the superiority of the method is verified through program simulation and the design of a demonstration system. The method accelerates the network convergence of reinforcement-learning-based autonomous decision-making algorithms for UAV swarms and is of significance to research on such algorithms.

Key words: multiple agents, UAV intelligence, game confrontation, reinforcement learning, sparse reward
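
To make the prioritized experience replay mentioned in the abstract more concrete, the following is a minimal sketch of proportional prioritization in the style described by Schaul et al. It is not the authors' implementation: the class name, the hyperparameters `alpha`, `beta` and `eps`, and the choice to normalize importance weights by the batch maximum are all assumptions made for illustration.

```python
import random


class PrioritizedReplayBuffer:
    """Illustrative proportional prioritized replay (hypothetical, not the paper's code)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha    # 0 gives uniform sampling, 1 gives fully proportional sampling
        self.eps = eps        # keeps every priority strictly positive
        self.data = []        # stored transitions
        self.priorities = []  # one priority per stored transition
        self.pos = 0          # next slot to overwrite once the buffer is full

    def add(self, transition):
        # New transitions receive the current maximum priority so they are likely to be replayed soon.
        p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4, rng=random):
        # Sampling probability is proportional to priority ** alpha.
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idxs = rng.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        # Assumption: normalize by the largest weight in the batch (a common simplification).
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        batch = [self.data[i] for i in idxs]
        return idxs, batch, weights

    def update(self, idxs, td_errors):
        # After a learning step, a transition's priority becomes its absolute TD error.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a DQN-style training loop, one would typically call `sample` to draw a batch, scale each sample's loss by its returned weight, and then call `update` with the new TD errors. Under sparse rewards, the few transitions that carry a non-zero reward tend to have large TD errors and are therefore replayed more often.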

LI Chao, WANG Ruixing, HUANG Jianzhong, JIANG Feilong, WEI Xuemei, SUN Yanxin. Autonomous Decision-making and Intelligent Collaboration of UAV Swarms Based on Reinforcement Learning with Sparse Rewards[J]. Acta Armamentarii, 2023, 44(6): 1537-1546.

Figures/Tables 12

Fig.1 Schematic diagram of multi-agent reinforcement learning

Fig.2 Composition of the UAV swarm confrontation model

Fig.3 Initial positions of the red and blue UAV swarms in the attack-defense confrontation simulation model

Fig.4 Schematic diagram of the autonomous decision-making principle of the red intelligent units

Fig.5 Schematic diagram of the game confrontation strategy of the blue intelligent units

Fig.6 Reward engineering setting based on local reward reshaping in UAV swarm attack-defense confrontation scenarios

Fig.7 Illustration of reward sparsity under the local reward reshaping method

Fig.8 Framework of the strategy learning method for autonomous decision-making and intelligent collaboration in UAV swarm confrontation, based on local reward reshaping and prioritized experience replay (PER)

Fig.9 Efficiency comparison of attack-defense confrontation algorithms for UAV swarms

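Table 1 Efficiency comparison of attack-defense confrontation algorithms for UAV swarms

Algorithm	Result	Performance improvement
DQN + local reward reshaping	Policy converges after 2000 training generations; win rate about 80%.	-
Double DQN + local reward reshaping	Policy converges after 1500 training generations; win rate about 80%.	25% improvement
Double DQN + local reward reshaping + PER	Policy converges after 700 training generations; win rate about 80%.	65% improvement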
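
The improvement figures in Table 1 appear consistent with the relative reduction in training generations needed for convergence, measured against the DQN + local reward reshaping row. The short check below reproduces them under that assumption; the function name `relative_reduction` is hypothetical and the table itself does not state how the percentages were computed.

```python
def relative_reduction(baseline_generations, generations):
    """Percentage reduction in generations to convergence relative to the baseline."""
    return 100.0 * (baseline_generations - generations) / baseline_generations


baseline = 2000  # DQN + local reward reshaping (Table 1)
results = {
    "Double DQN + local reward reshaping": 1500,
    "Double DQN + local reward reshaping + PER": 700,
}

# Assumption: improvement is measured against the DQN baseline row.
improvements = {name: relative_reduction(baseline, g) for name, g in results.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct:.0f}%")
```

Under this reading, all three configurations reach about the same win rate, so the gain reported in the table is in training speed, not in final policy quality.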

Fig.10 Situation map of a simulated attack-defense confrontation between the red and blue UAV swarms

Fig.11 Demonstration panel for the UAV swarm attack-defense confrontation task scenario
