Dynamic Penetration Decision of Loitering Munition Group Based on Knowledge-assisted Reinforcement Learning

doi:10.12382/bgxb.2023.0827

Abstract:

Penetration control decision-making for a loitering munition group (LMG) is key to improving the autonomy and intelligence of group operations. Online generation of penetration maneuver commands is difficult in dynamic environments that contain interceptors and pop-up air defense fire zones. To address this problem, an LMG penetration control decision (LMGPCD) algorithm based on knowledge-assisted reinforcement learning is proposed. Domain knowledge and rule knowledge are used to improve the design of the state space and reward function, which enhances the generalization ability and training convergence speed of the algorithm. An LMGPCD framework based on the soft actor-critic (SAC) method is constructed to improve exploration efficiency. Expert experience and imitation learning are used to mitigate the narrow solution space caused by multiple munitions and multiple threats, and the resulting lack of efficient training experience in the initial stage. Experimental results show that the proposed algorithm generates effective penetration maneuver commands in real time in dynamic environments and outperforms the comparison methods, which verifies its effectiveness.

Key words: loitering munition group, knowledge-assisted deep reinforcement learning, soft actor-critic algorithm, dynamic environment penetration, control decision

CLC number:

V279

SUN Hao, LI Haiqing, LIANG Yan, MA Chaoxiong, WU Han. Dynamic Penetration Decision of Loitering Munition Group Based on Knowledge-assisted Reinforcement Learning[J]. Acta Armamentarii, 2024, 45(9): 3161-3176.

Figures/Tables 23

Fig.1 Schematic diagram of the LMG penetration mission scenario in a dynamic environment

Fig.2 Relative motion between loitering munition and target

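Table 1 Angle definitions for the loitering munition model

Symbol	Meaning
$\psi_i^M$	Angle between the loitering munition's velocity vector and the X axis
$\gamma_i^M$	Angle between the loitering munition's velocity vector and the OXY plane
$\beta_i^{M,T}$	Angle between the X axis and the projection onto the OXY plane of the line of sight from the loitering munition to the target
$\varepsilon_i^{M,T}$	Angle between the line of sight from the loitering munition to the target and the OXY plane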

Fig.3 Geometry of the air defense fire zone

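Table 2 Correspondence between knowledge and rules

Knowledge category	Rule name
Penetration mission domain knowledge	Penetration maneuver rule
	Mission-boundary success/failure rule
	Fuel-limit success/failure rule
Strike mission success/failure rule knowledge	Area-denial success/failure rule
	Dynamic-interception success/failure rule
	Effective-damage success/failure rule
	Target-heading guidance preference rule
	Target-distance guidance preference rule
Combat maneuvering domain knowledge	Limited-maneuver constraint rule
	Interception-avoidance constraint rule
	Cooperative flight safety constraint rule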

Fig.4 Framework of the KADRL-based LMG penetration algorithm

Fig.5 Structure of the policy network

Fig.6 Structure of the critic networks and target networks

Fig.7 Pseudocode of the LMG penetration decision algorithm

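Table 3 Parameters of the LMG penetration decision algorithm

Name	Value
Optimizer	Adam
Policy network learning rate	0.001
Critic network learning rate	0.001
Replay buffer size	100000
Batch size	128
Reward discount factor	0.99
Temperature coefficient	0.2
Moving-average update coefficient	0.995
Action exploration variance	0.5
Random seed	0
Action bounds	[-1,1]
Policy update start step	30000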
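
The values in Table 3 map naturally onto the standard soft actor-critic update. The sketch below is illustrative only: it assumes the usual SAC formulation (a clipped double-Q soft target and Polyak averaging of the target parameters) and assumes that the moving-average coefficient 0.995 is the weight retained on the target network. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: standard SAC quantities evaluated with Table 3 values.
GAMMA = 0.99    # reward discount factor (Table 3)
ALPHA = 0.2     # temperature coefficient (Table 3)
POLYAK = 0.995  # moving-average update coefficient (Table 3), assumed to weight the target


def soft_td_target(reward, done, q1_next, q2_next, logp_next):
    """Soft Bellman target: r + gamma * (1 - done) * (min(Q1, Q2) - alpha * log pi)."""
    soft_value = min(q1_next, q2_next) - ALPHA * logp_next
    return reward + GAMMA * (1.0 - done) * soft_value


def polyak_update(target_params, online_params, polyak=POLYAK):
    """Move each target parameter a small step toward its online counterpart."""
    return [polyak * t + (1.0 - polyak) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the two functions.
    print(soft_td_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0, logp_next=-1.0))
    print(polyak_update([0.0, 1.0], [1.0, 1.0]))
```

With a coefficient this close to 1, the target networks change by only 0.5% of the gap per update, which is the usual way to keep the bootstrapped target stable.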

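Table 4 Basic parameters of the LMG penetration scenario

Name	Value
Scenario boundary $l_x \times l_y \times l_z$/km	10×10×2
Loitering munition speed $v_i^M$/(m·s⁻¹)	100
Loitering munition available control overload $n_{y\max}^M$/g	5
Interceptor speed $v_i^I$/(m·s⁻¹)	200
Interceptor available control overload $n_{y\max}^I$/g	3
Loitering munition cruise altitude $H$/m	1000
Loitering munition effective kill radius $R^{MT}$/m	20
Interceptor effective kill radius $R^{IM}$/m	20
Interceptor maximum operating time $t_{\max}^I$/s	50
Interceptor proportional navigation coefficient $\xi$	4
Loitering munition maximum operating time $t_{\max}^M$/s	200
Minimum safe distance between loitering munitions $R^{MM}$/m	20
Thickness of the air defense fire zone danger boundary $L^D - R^D$/m	200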
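
The speed and overload limits in Table 4 imply a rough maneuverability comparison between the two vehicle types. The sketch below is a back-of-the-envelope calculation, not the paper's dynamics model: it assumes constant speed, a level turn, the full available overload applied laterally, and a standard gravity of 9.81 m/s². The function name is hypothetical.

```python
# Rough estimate only: minimum turn radius r = v^2 / (n * g) for a constant-speed turn.
G = 9.81  # standard gravity in m/s^2 (assumed value)


def min_turn_radius(speed_mps, max_overload_g):
    """Smallest turn radius reachable when the whole overload budget is lateral."""
    return speed_mps ** 2 / (max_overload_g * G)


# Values taken from Table 4.
munition_radius = min_turn_radius(speed_mps=100.0, max_overload_g=5.0)
interceptor_radius = min_turn_radius(speed_mps=200.0, max_overload_g=3.0)

if __name__ == "__main__":
    print(f"loitering munition: {munition_radius:.1f} m")
    print(f"interceptor: {interceptor_radius:.1f} m")
```

Under these assumptions the interceptor is twice as fast but turns far less tightly than the loitering munition, which is the kind of asymmetry an evasive maneuver can exploit.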

Table 5 Initial states of typical penetration scenario 1

Name	x₀/m	y₀/m	z₀/m	v₀/(m·s⁻¹)	φ₀/(°)	R^D/m
Loitering munition 1	-3000	0	1000	100	90
Loitering munition 2	0	0	1000	100	90
Loitering munition 3	2000	0	1000	100	90
Target	-1000	9000	1000	0
Air defense zone 1	-4000	5500	0			1000
Air defense zone 2	-3500	4000	0			1500
Air defense zone 3	2000	5000	0			2000
Interceptor 1	-2000	8000	1000	200	-90
Interceptor 2	-1200	7500	1000	200	-90
Interceptor 3	3000	8200	1000	200	-90
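
To get a feel for the geometry in Table 5, one can check which loitering munitions would cross an air defense zone if they flew straight at the target. The sketch below is a simplification and not the paper's zone model: it projects everything onto the OXY plane and treats each zone as a circle of radius R^D, ignoring altitude, the danger boundary layer, and the interceptors. All function and variable names are hypothetical.

```python
import math

# Scenario 1 initial positions from Table 5, projected onto the OXY plane (m).
MUNITIONS = {1: (-3000.0, 0.0), 2: (0.0, 0.0), 3: (2000.0, 0.0)}
TARGET = (-1000.0, 9000.0)
# Air defense zones as (center_x, center_y, radius R^D), assumed circular in OXY.
ZONES = {
    1: (-4000.0, 5500.0, 1000.0),
    2: (-3500.0, 4000.0, 1500.0),
    3: (2000.0, 5000.0, 2000.0),
}


def distance_point_to_segment(p, a, b):
    """Shortest distance from point p to the segment from a to b (a != b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def zones_crossed(start, goal, zones):
    """Ids of zones whose circle is intersected by the straight path start -> goal."""
    return [zone_id for zone_id, (cx, cy, radius) in sorted(zones.items())
            if distance_point_to_segment((cx, cy), start, goal) < radius]


blocked = {mid: zones_crossed(pos, TARGET, ZONES) for mid, pos in MUNITIONS.items()}

if __name__ == "__main__":
    for mid, zone_ids in blocked.items():
        print(f"munition {mid}: straight path crosses zones {zone_ids}")
```

Under this simplified geometry, the direct paths of two of the three munitions pass through a zone, so a pure target-pointing heading is not enough in this scenario.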

Fig.8 Reward curve in Scenario 1

Fig.9 Success rate curve of the LMG mission in Scenario 1

Table 6 Comparison of results of 100 Monte Carlo simulations in Scenario 1

Algorithm	Mission success	Hit by interceptor	Entered obstacle zone	Boundary violation	Time limit exceeded	Mutual collision
KASAC	299	1	0	0	0	0
SAC	197	3	99	1	0	0
VAAPF	133	55	88	16	6	2
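
The counts in Table 6 can be converted into rates for easier comparison. Each row sums to 300, which is consistent with recording one outcome per munition over 100 runs with three munitions; that reading is an assumption, not something stated in the table. The sketch below uses hypothetical names and only the numbers from Table 6.

```python
# Outcome labels in the column order of Table 6 (English names are our own).
OUTCOMES = [
    "success", "hit_by_interceptor", "entered_zone",
    "out_of_bounds", "timeout", "collision",
]

# Counts copied from Table 6 (Scenario 1).
SCENARIO_1 = {
    "KASAC": [299, 1, 0, 0, 0, 0],
    "SAC": [197, 3, 99, 1, 0, 0],
    "VAAPF": [133, 55, 88, 16, 6, 2],
}


def outcome_rates(counts):
    """Convert raw outcome counts into fractions of all recorded outcomes."""
    total = sum(counts)
    return {name: count / total for name, count in zip(OUTCOMES, counts)}


rates = {algo: outcome_rates(counts) for algo, counts in SCENARIO_1.items()}

if __name__ == "__main__":
    for algo, algo_rates in rates.items():
        print(f"{algo}: success rate {algo_rates['success']:.2%}")
```

Read this way, most SAC failures fall in the obstacle-zone column, while VAAPF failures are spread across interception and zone entry.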

Fig.10 One-step computation time for penetration strategy generation in Scenario 1

Fig.11 Typical LMG penetration trajectories in Scenario 1

Fig.12 Analysis of a typical LMG penetration situation in Scenario 1

Fig.13 LMG overload curves in Scenario 1

Table 7 Initial states of typical penetration scenario 2

Name	x₀/m	y₀/m	z₀/m	v₀/(m·s⁻¹)	φ₀/(°)	R^D/m
Loitering munition 1	-3000	0	1000	100	90
Loitering munition 2	-500	-500	1000	100	90
Loitering munition 3	1500	0	1000	100	90
Target	-1000	9000	1000	0
Air defense zone 1	3000	40000	0			1800
Air defense zone 2	-3500	4000	0			1600
Air defense zone 3	-1000	6000	0			1200
Interceptor 1	-4000	8000	1000	200	-90
Interceptor 2	-1800	7800	1000	200	-90
Interceptor 3	2000	8200	1000	200	-90

Table 8 Comparison of results of 100 Monte Carlo simulations in Scenario 2

Algorithm	Mission success	Hit by interceptor	Entered obstacle zone	Boundary violation	Time limit exceeded	Mutual collision
KASAC	295	0	0	3	2	0
SAC	200	0	100	0	0	0
VAAPF	90	84	103	22	0	1

Fig.14 Typical LMG penetration trajectories in Scenario 2

Fig.15 LMG overload curves in Scenario 2
